mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-30 08:57:19 +02:00
awesome-web-scraping/python.md
2020-09-14 13:17:38 +03:00

Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network

Network : General

  • urllib - network library (stdlib)
  • requests - network library
  • grab - network library (pycurl based)
  • pycurl - network library (binding to libcurl)
  • urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
  • httplib2 - Small, fast HTTP client library. Features persistent connections, cache, and Google App Engine support.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • mechanize - Stateful programmatic web browsing.
  • socket - low-level networking interface (stdlib)
  • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
  • hyper - HTTP/2 Client for Python
  • PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

Network : Asynchronous

  • treq - requests like API (twisted based)
  • aiohttp - HTTP client/server for asyncio (PEP-3156)

Network : Low Level

  • dpkt - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
  • pyOpenSSL - A Python wrapper around the OpenSSL library
  • tlslite-ng - TLS implementation in pure python
  • scapy - powerful Python-based interactive packet manipulation program and library

Web Scraping

Web Scraping : Frameworks

  • grab - web-scraping framework (pycurl/multicurl based)
  • scrapy - web-scraping framework (twisted based).
  • pyspider - A powerful spider system.
  • cola - A distributed crawling framework.
  • ruia - Async Python 3.6+ web scraping micro-framework based on asyncio
  • ioweb - Web scraping framework based on gevent and lxml
  • autoscraper - A smart, automatic and lightweight web scraper

Web Scraping : Tools

  • portia - Visual scraping for Scrapy.
  • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
  • requests-html - Pythonic HTML Parsing for Humans.
  • ScrapydWeb - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.

Web Scraping : Bypass Protection

  • cloudscraper - A Python module to bypass Cloudflare's anti-bot page.

HTML/XML

HTML/XML : General

  • lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
  • cssselect - working with DOM tree with CSS selectors
  • pyquery - working with DOM tree with jQuery-like selectors
  • BeautifulSoup - slow HTML/XML processing library, written in pure Python
  • html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
  • feedparser - parsing of RSS/ATOM feeds.
  • MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
  • xmltodict - Makes working with XML feel like working with JSON.
  • xhtml2pdf - HTML/CSS to PDF converter.
  • untangle - Converts XML documents to Python objects for easy access.
  • hodor - Configuration driven wrapper around lxml and cssselect.
  • chopper - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
  • selectolax - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
  • parsel - Lets you extract data from XML/HTML documents using XPath or CSS selectors.

HTML/XML : Sanitizing

  • Bleach - cleaning of HTML (requires html5lib)
  • sanitize - Bringing sanity to world of messed-up data.

HTML/XML : Metadata

  • extruct - A library for extracting embedded metadata from HTML markup.

Text Processing

Libraries for parsing and manipulating plain text.

Text Processing : General

  • difflib - (Python standard library) Helpers for computing deltas.
  • Levenshtein - Fast computation of Levenshtein distance and string similarity.
  • fuzzywuzzy - Fuzzy String Matching.
  • esmre - Regular expression accelerator.
  • ftfy - Makes Unicode text less broken and more consistent automagically.
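
As a rough illustration of what "computing deltas" and string similarity look like in practice, here is a minimal sketch using only `difflib` from the standard library. The product names and the `similarity` helper are made up for the example; the cutoff of 0.6 is simply the `get_close_matches` default, not a recommendation.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical scraped product names (illustrative data only).
known_names = ["apple iphone 12", "samsung galaxy s20", "google pixel 5"]

# Find the closest known name for a misspelled scraped value.
matches = difflib.get_close_matches("aple iphone 12", known_names, n=1, cutoff=0.6)

# Compute a line-level delta between two versions of a scraped record.
old_record = ["price: 10", "stock: 5"]
new_record = ["price: 12", "stock: 5"]
delta = list(difflib.unified_diff(old_record, new_record, lineterm=""))
```

Here `matches` holds the best candidate for the misspelled name, and `delta` holds unified-diff lines in which removed lines start with `-` and added lines start with `+`.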

Text Processing : Transliteration

  • unidecode - ASCII transliterations of Unicode text.

Text Processing : Character Encoding

  • uniout - Print readable chars instead of the escaped string.
  • chardet - Python 2/3 compatible character encoding detector.
  • xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
  • pangu.py - Spacing texts for CJK and alphanumerics.
  • cchardet - High-speed universal character encoding detector (binding to uchardet).

Text Processing : Slugify

  • awesome-slugify - A Python slugify library that can preserve unicode.
  • python-slugify - A Python slugify library that translates unicode to ASCII.
  • unicode-slugify - A slugifier that generates unicode slugs.
  • pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)

Text Processing : General Parser

  • PLY - Implementation of lex and yacc parsing tools for Python
  • pyparsing - A general purpose framework for generating parsers.

Text Processing : Human Names

Text Processing : Phone Number

  • phonenumbers - Parsing, formatting, storing and validating international phone numbers.

Text Processing : User-Agent strings

Text Processing : robots.txt

  • reppy - Modern robots.txt Parser for Python
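
For readers who want to see what a robots.txt check involves before picking a library, the sketch below uses `urllib.robotparser` from the standard library. This is a stand-in, not reppy's API. The robots.txt content, the `example-bot` user agent, and the `build_parser` helper are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check URLs against the rules before requesting them.
allowed_public = parser.can_fetch("example-bot", "https://example.com/index.html")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/data")
```

With these rules, the public page is allowed and anything under `/private/` is refused.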

Text Processing : Date and Time

  • dateutil - Useful extensions to the standard Python datetime features

Text Processing : Price and Currency

  • price-parser - a small library for extracting price and currency from raw text strings.

Structured Formats

Libraries for parsing and manipulating specific text formats.

Structured Formats : General

  • tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
  • textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
  • messytables - Tools for parsing messy tabular data
  • rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)

Structured Formats : Office

  • python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
  • xlwt / xlrd - Writing and reading data and formatting information from Excel files.
  • XlsxWriter - A Python module for creating Excel .xlsx files.
  • xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
  • openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
  • Marmir - Takes Python data structures and turns them into spreadsheets.

Structured Formats : PDF

  • PDFMiner - A tool for extracting information from PDF documents.
  • PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
  • ReportLab - Allowing Rapid creation of rich PDF documents.
  • pdftables - Extract tables from PDF files directly

Structured Formats : Markdown

  • Python-Markdown - A Python implementation of John Gruber’s Markdown.
  • Mistune - A fast, full-featured Markdown parser in pure Python.
  • markdown2 - A fast and complete Python implementation of Markdown
  • mistletoe - A fast, extensible and spec-compliant Markdown parser in pure Python

Structured Formats : YAML

  • PyYAML - YAML implementations for Python.

Structured Formats : CSS

Structured Formats : ATOM/RSS

Structured Formats : SQL

  • sqlparse - A non-validating SQL parser.

Structured Formats : HTTP

  • http-parser - HTTP request/response parser for python in C
  • httptools - a Python binding for the Node.js HTTP parser

Structured Formats : Microformats

  • opengraph - A Python module to parse the Open Graph Protocol tags

Structured Formats : Portable Executable

  • pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.

Structured Formats : PSD

Structured Formats : Bookmarks File

Serialization

  • orjson - Fast, correct Python JSON library supporting dataclasses and datetimes
  • ujson - Ultra fast JSON decoder and encoder written in C with Python bindings

Natural Language Processing

Libraries for working with human languages.

  • NLTK - A leading platform for building Python programs to work with human language data.
  • spacy - Enables using State-of-the-Art Deep Learning models for common NLP tasks.
  • fastai - Deep learning library with free video tutorials and an active forum community. Downside: a GPU is needed.
  • gensim - library for topic modeling, document indexing and similarity retrieval with large corpora
  • Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
  • TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
  • jieba - Chinese Words Segmentation Utilities.
  • SnowNLP - A library for processing Chinese text.
  • loso - Another Chinese segmentation library.
  • genius - A Chinese segmenter based on Conditional Random Fields.
  • langid.py - Stand-alone language identification system.
  • Korean - A library for Korean morphology.
  • pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for the Russian language.
  • PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
  • langdetect - Port of Google's language-detection library to Python

Browser Automation

Browser Automation : Browsers

  • selenium - automating real browsers (Chrome, Firefox, Opera, IE)
  • Ghost.py - wrapper of QtWebKit (requires PyQT)
  • Spynner - wrapper of QtWebKit (requires PyQT)
  • Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
  • Requestium - Integration layer between Requests and Selenium for automation of web actions.
  • Splash - Lightweight, scriptable browser as a service with an HTTP API.
  • pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)

Browser Automation : Tools

  • xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

  • threading - standard Python library to run threads. Effective for I/O-bound tasks. Ineffective for CPU-bound tasks because the Python GIL lets only one thread execute Python bytecode at a time.
  • multiprocessing - standard Python library to run processes.
  • concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
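
To make the I/O-bound case concrete, here is a minimal sketch with `concurrent.futures.ThreadPoolExecutor`. The `fake_fetch` function and the `example.com` URLs are placeholders: the function sleeps instead of making a real request, so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> str:
    """Stand-in for a network request: wait briefly, then return a fake body."""
    time.sleep(0.1)  # simulates waiting on the network
    return f"body of {url}"


# Hypothetical URLs, used only as labels here.
urls = [f"https://example.com/page/{i}" for i in range(5)]

with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in input order even though the calls overlap.
    bodies = list(pool.map(fake_fetch, urls))
```

Because the threads spend their time waiting rather than computing, the five calls overlap. For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same interface while running the callables in separate processes.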

Asynchronous

Libraries for asynchronous network programming.

  • asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
  • Twisted - An event-driven networking engine.
  • Tornado - A Web framework and asynchronous networking library.
  • pulsar - Event-driven concurrent framework for Python.
  • diesel - Greenlet-based event I/O Framework for Python.
  • gevent - A coroutine-based Python networking library that uses greenlet.
  • eventlet - Asynchronous framework with WSGI support.
  • Tomorrow - Magic decorator syntax for asynchronous code.
  • grequests - Make asynchronous HTTP Requests easily.
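
The coroutine-and-event-loop model shared by several of these libraries can be sketched with `asyncio` alone. The example assumes Python 3.7 or newer (for `asyncio.run`). `fake_fetch` and `crawl` are hypothetical names, and the sleep stands in for a real HTTP call that a client such as aiohttp would perform.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    """Stand-in for an async HTTP request: yield to the event loop, then return."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    """Run all fetches concurrently on one event loop and collect the results."""
    tasks = [fake_fetch(url, 0.05) for url in urls]
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*tasks)


# Hypothetical URLs, used only as labels here.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = asyncio.run(crawl(urls))
```

All three coroutines wait at the same time on a single thread, which is why this style suits workloads dominated by network latency.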

Job Queue

  • celery - An asynchronous task queue/job queue based on distributed message passing.
  • huey - Little multi-threaded task queue.
  • mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
  • RQ - lightweight task queue manager based on redis
  • simpleq - A simple, infinitely scalable, Amazon SQS based queue.
  • python-gearman - python API for Gearman

Message Queue

  • kombu - Messaging library for Python

Cloud Computing

Email

Libraries for parsing email.

  • flanker - An email address and MIME parsing library.
  • Talon - Mailgun library to extract message quotations and signatures.

URL and Network Address

Libraries for parsing/modifying URLs, network addresses, domain names.

URL and Network Address : URL

  • furl - A small Python library that makes manipulating URLs simple.
  • purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
  • urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
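
The three operations named in the `urllib.parse` entry (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched as follows. The `example.com` URLs are placeholders chosen for the example.

```python
from urllib.parse import urljoin, urlparse, urlunparse

# Hypothetical page URL.
base = "https://example.com/catalog/page1.html?sort=asc#top"

# 1. Break the URL into components.
parts = urlparse(base)
scheme, host, path = parts.scheme, parts.netloc, parts.path

# 2. Combine the components back into a URL string.
rebuilt = urlunparse(parts)

# The result is a named tuple, so one component can be swapped before rebuilding.
without_fragment = parts._replace(fragment="").geturl()

# 3. Convert a relative link found on the page into an absolute URL.
absolute = urljoin(base, "../images/logo.png")
```

Resolving links with `urljoin` before queueing them is a common step in crawlers, since pages frequently use relative `href` values.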

URL and Network Address : Network Address

  • netaddr - A Python library for representing and manipulating network addresses.
  • micawber - A small library for extracting rich content from URLs.

Domain Names

  • tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
  • find_domains - a library to search for domain names in text data

Web Content Extraction

Libraries for extracting web contents.

  • newspaper - News extraction, article extraction and content curation in Python.
  • python-goose - HTML Content/Article Extractor.
  • scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
  • htmldate - Find creation date using common structural patterns or text-based heuristics.
  • lassie - Web Content Retrieval for Humans.
  • html2text - Convert HTML to Markdown-formatted text.
  • libextract - Extract data from websites.
  • python-readability - Fast Python port of arc90's readability tool.
  • sumy - A module for automatic summarization of text documents and HTML pages.
  • Haul - An Extensible Image Crawler.
  • you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
  • youtube-dl - A small command-line program to download videos from YouTube.
  • WikiTeam - Tools for downloading and preserving wikis.
  • linkchecker - check links in web documents or full websites
  • python-sitemap - Mini website crawler to make sitemap from a website.
  • trafilatura - Fast extraction of main text and comments along with structure, conversion to TXT, CSV & XML.
  • advertools - A customizable crawler to analyze SEO and content of pages and websites.

WebSocket

Libraries for working with WebSocket.

  • Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
  • AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
  • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

  • dnspython - a powerful DNS toolkit for python
  • dnsyo - Check your DNS against over 1500 global DNS servers.
  • pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

  • OpenCV - Open Source Computer Vision Library.
  • SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
  • mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

  • scylla - Intelligent proxy pool for Humans
  • ProxyBroker - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
  • shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
  • tproxy - A simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routing logic in Python

Whois

  • python-whois - A python module for retrieving and parsing WHOIS data

Website Specific Scraper

JavaScript Engine Bindings

  • Js2Py - JavaScript to Python Translator & JavaScript interpreter written in 100% pure Python
  • v8eval - Multi-language bindings to JavaScript engine V8

Other Python lists