awesome/awesome-web-scraping

mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-24 08:32:19 +02:00

Gregory Petukhov 64e44f26e9

Use github links for some of packages in the list

2020-03-27 17:01:52 +03:00

Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network
Web Scraping
HTML/XML
Text processing
Structured Formats
Serialization
Natural Language Processing
Browser automation
Multiprocessing
Job Queue
Message Queue
Cloud Computing
Email
URL and Network Address
Web Content Extraction
Asynchronous
WebSocket
DNS Resolving
Computer Vision
Proxy Server
Whois
Website Specific Scraper
JavaScript Engine Bindings
Other Python Lists

Network

Network : General

urllib - network library (stdlib)
requests - network library
grab - network library (pycurl based)
pycurl - network library (binding to libcurl)
urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
httplib2 - Small, fast HTTP client library. Features persistent connections, cache, and Google App Engine support.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MechanicalSoup - A Python library for automating interaction with websites.
mechanize - Stateful programmatic web browsing.
socket - low-level networking interface (stdlib)
Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
hyper - HTTP/2 Client for Python
PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

Network : Asynchronous

treq - requests like API (twisted based)
aiohttp - http client/server for asyncio (PEP-3156)

Network : Low Level

dpkt - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
pyOpenSSL - A Python wrapper around the OpenSSL library
tlslite-ng - TLS implementation in pure Python
scapy - powerful Python-based interactive packet manipulation program and library

Web Scraping

Web Scraping : Frameworks

grab - web-scraping framework (pycurl/multicurl based)
scrapy - web-scraping framework (twisted based).
pyspider - A powerful spider system.
cola - A distributed crawling framework.
ruia - Async Python 3.6+ web scraping micro-framework based on asyncio
ioweb - Web scraping framework based on gevent and lxml

Web Scraping : Tools

portia - Visual scraping for Scrapy.
restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
requests-html - Pythonic HTML Parsing for Humans.
ScrapydWeb - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.

Web Scraping : Bypass Protection

cloudscraper - A Python module to bypass Cloudflare's anti-bot page.

HTML/XML

HTML/XML : General

lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
cssselect - working with DOM tree with CSS selectors
pyquery - working with DOM tree with jQuery-like selectors
BeautifulSoup - slow HTML/XML processing library, written in pure Python
html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
feedparser - parsing of RSS/ATOM feeds.
MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
xmltodict - Makes working with XML feel like working with JSON.
xhtml2pdf - HTML/CSS to PDF converter.
untangle - Converts XML documents to Python objects for easy access.
hodor - Configuration driven wrapper around lxml and cssselect.
chopper - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
selectolax - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
parsel - Lets you extract data from XML/HTML documents using XPath or CSS selectors.

HTML/XML : Sanitizing

Bleach - cleaning of HTML (requires html5lib)
sanitize - Bringing sanity to world of messed-up data.

HTML/XML : Metadata

extruct - A library for extracting embedded metadata from HTML markup.

Text Processing

Libraries for parsing and manipulating plain texts.

Text Processing : General

difflib - (Python standard library) Helpers for computing deltas.
Levenshtein - Fast computation of Levenshtein distance and string similarity.
fuzzywuzzy - Fuzzy String Matching.
esmre - Regular expression accelerator.
ftfy - Makes Unicode text less broken and more consistent automagically.
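
As a small illustration of the standard-library entry in this group, difflib can score string similarity without any third-party package. This is only a sketch: the product titles are made up, and the helper name `similarity` is ours, not part of difflib.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings match exactly (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical scraped titles, lowercased for comparison
titles = [
    "apple iphone 11 64gb",
    "samsung galaxy s10",
    "apple iphone 11 128gb",
]

# get_close_matches returns candidates scoring at least `cutoff`, best match first
matches = difflib.get_close_matches("apple iphone 11", titles, n=2, cutoff=0.6)
```

Levenshtein and fuzzywuzzy cover the same kind of task with faster or more flexible scoring; difflib is enough when a dependency-free approximation is acceptable.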

Text Processing : Transliteration

unidecode - ASCII transliterations of Unicode text.

Text Processing : Character Encoding

uniout - Print readable chars instead of the escaped string.
chardet - Python 2/3 compatible character encoding detector.
xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
pangu.py - Spacing texts for CJK and alphanumerics.
cchardet - High-speed universal character encoding detector (binding to uchardet).

Text Processing : Slugify

awesome-slugify - A Python slugify library that can preserve unicode.
python-slugify - A Python slugify library that translates unicode to ASCII.
unicode-slugify - A slugifier that generates unicode slugs.
pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)

Text Processing : General Parser

PLY - Implementation of lex and yacc parsing tools for Python
pyparsing - A general purpose framework for generating parsers.

Text Processing : Human Names

python-nameparser - Parsing human names into their individual components.

Text Processing : Phone Number

phonenumbers - Parsing, formatting, storing and validating international phone numbers.

Text Processing : User-Agent Strings

HTTP Agent Parser - Python HTTP Agent Parser
uap-python - Python implementation of ua-parser
python-user-agents - Browser user agent parser.
fake-useragent - Python user agent string faker, based on worldwide browser usage statistics
user_agent - Generator of User-Agent data

Text Processing : robots.txt

reppy - Modern robots.txt Parser for Python

Text Processing : Date and Time

dateutil - Useful extensions to the standard Python datetime features

Text Processing : Price and Currency

price-parser - a small library for extracting price and currency from raw text strings.

Structured Formats

Libraries for parsing and manipulating specific text formats.

Structured Formats : General

tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
messytables - Tools for parsing messy tabular data
rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)

Structured Formats : Office

python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
xlwt / xlrd - Writing and reading data and formatting information from Excel files.
XlsxWriter - A Python module for creating Excel .xlsx files.
xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Marmir - Takes Python data structures and turns them into spreadsheets.

Structured Formats : PDF

PDFMiner - A tool for extracting information from PDF documents.
PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
ReportLab - Allowing Rapid creation of rich PDF documents.
pdftables - Extract tables from PDF files directly

Structured Formats : Markdown

Python-Markdown - A Python implementation of John Gruber’s Markdown.
Mistune - A fast, full-featured Markdown parser in pure Python.
markdown2 - A fast and complete Python implementation of Markdown
mistletoe - A fast, extensible and spec-compliant Markdown parser in pure Python

Structured Formats : YAML

PyYAML - YAML implementations for Python.

Structured Formats : CSS

cssutils - A CSS library for Python.

Structured Formats : ATOM/RSS

feedparser - Universal feed parser.

Structured Formats : SQL

sqlparse - A non-validating SQL parser.

Structured Formats : HTTP

http-parser - HTTP request/response parser for Python, written in C
httptools - a Python binding for the Node.js HTTP parser

Structured Formats : Microformats

opengraph - A Python module to parse the Open Graph Protocol tags

Structured Formats : Portable Executable

pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.

Structured Formats : PSD

psd-tools - reading Adobe Photoshop PSD files (as described in specification) to Python data structures.

Structured Formats : Bookmarks File

bookmarks-parser - Parses Firefox/Chrome HTML bookmarks files

Serialization

orjson - Fast, correct Python JSON library supporting dataclasses and datetimes
ujson - Ultra fast JSON decoder and encoder written in C with Python bindings

Natural Language Processing

Libraries for working with human languages.

NLTK - A leading platform for building Python programs to work with human language data.
spacy - Enables using State-of-the-Art Deep Learning models for common NLP tasks.
fastai - Deep learning library with free video tutorials and an active forum community. Downside: a GPU is needed.
gensim - library for topic modeling, document indexing and similarity retrieval with large corpora
Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
jieba - Chinese Words Segmentation Utilities.
SnowNLP - A library for processing Chinese text.
loso - Another Chinese segmentation library.
genius - Chinese word segmentation based on Conditional Random Fields.
langid.py - Stand-alone language identification system.
Korean - A library for Korean morphology.
pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
langdetect - Port of Google's language-detection library to Python

Browser Automation

Browser Automation : Browsers

selenium - automating real browsers (Chrome, Firefox, Opera, IE)
Ghost.py - wrapper of QtWebKit (requires PyQt)
Spynner - wrapper of QtWebKit (requires PyQt)
Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
Requestium - Integration layer between Requests and Selenium for automation of web actions.
Splash - Lightweight, scriptable browser as a service with an HTTP API.
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)

Browser Automation : Tools

xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

threading - standard Python library to run threads. Effective for I/O-bound tasks. Ineffective for CPU-bound tasks because of the Python GIL.
multiprocessing - standard Python library to run processes.
concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
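
To show why threads suit I/O-bound work such as page downloads, here is a minimal sketch using the concurrent.futures interface. The `fetch` function is a hypothetical stand-in that sleeps instead of touching the network, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Hypothetical stand-in for a network call; sleeps instead of doing real I/O."""
    time.sleep(0.1)
    return f"fetched {url}"


urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # Executor.map yields results in the same order as the inputs
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(f"{len(results)} pages in {elapsed:.2f}s")
```

Because the threads spend their time waiting, the five calls overlap rather than run back to back. For CPU-bound work, swapping in `ProcessPoolExecutor` sidesteps the GIL at the cost of process startup and pickling.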

Asynchronous

Libraries for asynchronous network programming.

asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
Twisted - An event-driven networking engine.
Tornado - A Web framework and asynchronous networking library.
pulsar - Event-driven concurrent framework for Python.
diesel - Greenlet-based event I/O Framework for Python.
gevent - A coroutine-based Python networking library that uses greenlet.
eventlet - Asynchronous framework with WSGI support.
Tomorrow - Magic decorator syntax for asynchronous code.
grequests - Make asynchronous HTTP Requests easily.
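
For orientation, the event loop, coroutine and task vocabulary used by asyncio looks roughly like this. It is a sketch under stated assumptions: `fetch` is a hypothetical coroutine that sleeps instead of performing a request, and `asyncio.run` requires Python 3.7 or newer.

```python
import asyncio
from typing import List


async def fetch(url: str, delay: float) -> str:
    """Hypothetical coroutine; awaits a sleep where a real client would await a response."""
    await asyncio.sleep(delay)
    return f"done {url}"


async def main() -> List[str]:
    coros = [fetch(f"https://example.com/{i}", 0.05) for i in range(3)]
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*coros)


results = asyncio.run(main())
```

A real scraper would replace the sleep with an asynchronous HTTP client such as aiohttp, listed under Network : Asynchronous.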

Job Queue

celery - An asynchronous task queue/job queue based on distributed message passing.
huey - Little multi-threaded task queue.
mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
RQ - lightweight task queue manager based on Redis
simpleq - A simple, infinitely scalable, Amazon SQS based queue.
python-gearman - python API for Gearman

Message Queue

kombu - Messaging library for Python

Cloud Computing

picloud - executing Python code in the cloud
dominoup.com - executing R, Python and MATLAB code in the cloud
minigun-requests - Web scraping API to outsource tons of GET & XPath requests to cloud computing
pythonista-chromeless - AWS Lambda function that executes given Python code on Selenium

Email

Libraries for parsing email.

flanker - An email address and MIME parsing library.
Talon - Mailgun library to extract message quotations and signatures.

URL and Network Address

Libraries for parsing/modifying URLs and network addresses.

URL and Network Address : URL

furl - A small Python library that makes manipulating URLs simple.
purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
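
The three operations named in the urllib.parse entry (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched with the standard library alone. The URLs below are made-up examples.

```python
from urllib.parse import parse_qs, urljoin, urlsplit, urlunsplit

url = "https://example.com:8080/catalog/items?page=2&sort=price#top"

# Break the URL into scheme, network location, path, query and fragment
parts = urlsplit(url)

# Decode the query string into a dict of lists
query = parse_qs(parts.query)

# Combine the components back into a URL string
rebuilt = urlunsplit(parts)

# Resolve a relative link against a base URL, as a crawler does for href values
absolute = urljoin("https://example.com/catalog/items", "../about.html")

print(parts.scheme, parts.netloc, parts.path)
print(query)
print(absolute)
```

furl and purl wrap the same operations in object-oriented APIs; tldextract goes further by consulting the Public Suffix List, which urllib.parse does not do.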

URL and Network Address : Network Address

netaddr - A Python library for representing and manipulating network addresses.
micawber - A small library for extracting rich content from URLs.

Web Content Extraction

Libraries for extracting web contents.

newspaper - News extraction, article extraction and content curation in Python.
python-goose - HTML Content/Article Extractor.
scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
htmldate - Find creation date using common structural patterns or text-based heuristics.
lassie - Web Content Retrieval for Humans.
html2text - Convert HTML to Markdown-formatted text.
libextract - Extract data from websites.
python-readability - Fast Python port of arc90's readability tool.
sumy - A module for automatic summarization of text documents and HTML pages.
Haul - An Extensible Image Crawler.
you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
youtube-dl - A small command-line program to download videos from YouTube.
WikiTeam - Tools for downloading and preserving wikis.
linkchecker - check links in web documents or full websites
python-sitemap - Mini website crawler to make sitemap from a website.
trafilatura - Fast extraction of main text and comments along with structure, conversion to TXT, CSV & XML.

WebSocket

Libraries for working with WebSocket.

Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

dnspython - a powerful DNS toolkit for Python
dnsyo - Check your DNS against over 1500 global DNS servers.
pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

OpenCV - Open Source Computer Vision Library.
SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

scylla - Intelligent proxy pool for Humans
ProxyBroker - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
tproxy - a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routing logic in Python

Whois

python-whois - A Python module for retrieving and parsing WHOIS data

Website Specific Scraper

twitter-scraper - Scrape the Twitter Frontend API without authentication
Ultimate-Facebook-Scraper - A bot which scrapes almost everything about a Facebook user's profile
instagram-scraper - Scrapes an Instagram user's photos and videos

JavaScript Engine Bindings

Js2Py - JavaScript to Python Translator & JavaScript interpreter written in 100% pure Python
v8eval - Multi-language bindings to JavaScript engine V8
