awesome/awesome-web-scraping

mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-21 17:17:03 +02:00

Marco Vinciguerra 68b5304658

Update python.md

2024-10-15 13:00:02 +02:00

Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network
Web Scraping
HTML/XML
Text Processing
Structured Formats
Serialization
Natural Language Processing
Browser Automation
Multiprocessing
Job Queue
Message Queue
Cloud Computing
URL and Network Address
Web Automation
Asynchronous
WebSocket
DNS Resolving
Computer Vision
Proxy Server
Whois
JavaScript Engine Bindings
Captcha Solving
Other Python Lists

Network

Network : General

urllib - network library (stdlib)
requests - network library
pycurl - network library (binding to libcurl)
urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
httplib2 - Small, fast HTTP client library. Features persistent connections, cache, and Google App Engine support.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MechanicalSoup - A Python library for automating interaction with websites.
mechanize - Stateful programmatic web browsing.
socket - low-level networking interface (stdlib)
Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
hyper - HTTP/2 Client for Python
PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

Network : Asynchronous

treq - requests like API (twisted based)
aiohttp - http client/server for asyncio (PEP-3156)

Network : Low Level

dpkt - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
pyOpenSSL - A Python wrapper around the OpenSSL library
tlslite-ng - TLS implementation in pure python
scapy - powerful Python-based interactive packet manipulation program and library
impacket - low-level programmatic access to the packets of network protocols

Web Scraping

Web Scraping : Frameworks

scrapy - web-scraping framework (twisted based).
pyspider - A powerful spider system.
autoscraper - A smart, automatic and lightweight web scraper
ruia - Async Python 3.6+ web scraping micro-framework based on asyncio
cola - A distributed crawling framework.
frontera - A scalable frontier for web crawlers
dude - A simple framework for writing web scrapers using decorators.
ScrapeGraphAI - Web scraping framework that uses AI for extracting data

Web Scraping : Tools

portia - Visual scraping for Scrapy.
restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
requests-html - Pythonic HTML Parsing for Humans.
ScrapydWeb - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.
Starbelly - Starbelly is a user-friendly and highly configurable web crawler front end.
Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Web Scraping : Bypass Protection

cloudscraper - A Python module to bypass Cloudflare's anti-bot page.

HTML/XML

HTML/XML : General

lxml - effective HTML/XML processing library. Supports XPATH. Written in C.
cssselect - working with DOM tree with CSS selectors
pyquery - working with DOM tree with jQuery-like selectors
BeautifulSoup - slow HTML/XML processing library, written in pure python
html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
feedparser - parsing of RSS/ATOM feeds.
MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
xmltodict - Makes working with XML feel like you are working with JSON.
xhtml2pdf - HTML/CSS to PDF converter.
untangle - Converts XML documents to Python objects for easy access.
hodor - Configuration driven wrapper around lxml and cssselect.
chopper - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
selectolax - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
parsel - Lets you extract data from XML/HTML documents using XPath or CSS selectors.
html5-parser - Fast C based HTML 5 parsing for python.
gazpacho - A simple, fast, and modern web scraping library.

HTML/XML : Sanitizing

Bleach - cleaning of HTML (requires html5lib)
sanitize - Bringing sanity to world of messed-up data.

HTML/XML : Metadata

extruct - A library for extracting embedded metadata from HTML markup.

Text Processing

Libraries for parsing and manipulating plain texts.

Text Processing : General

difflib - (Python standard library) Helpers for computing deltas.
Levenshtein - Fast computation of Levenshtein distance and string similarity.
fuzzywuzzy - Fuzzy String Matching.
esmre - Regular expression accelerator.
ftfy - Makes Unicode text less broken and more consistent automagically.
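As a rough illustration of the kind of string comparison the libraries above cover, here is a minimal sketch using only the stdlib `difflib`. The helper names `similarity` and `closest` and the sample company names are made up for this example.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(name, candidates, cutoff=0.6):
    """Return up to three candidates that resemble `name`, best match first."""
    return difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)


# Hypothetical data: a scraped label and a list of known names.
scraped = "Acme Corp."
known = ["ACME Corporation", "Acme Corp", "Apex Corp"]

print(similarity(scraped, "Acme Corp"))
print(closest(scraped, known))
```

The comparison is case-sensitive, so normalizing case and whitespace before matching is usually worth considering. Levenshtein and fuzzywuzzy offer other metrics and are likely a better fit for large inputs.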

Text Processing : Transliteration

unidecode - ASCII transliterations of Unicode text.

Text Processing : Character Encoding

uniout - Print readable chars instead of the escaped string.
chardet - Python 2/3 compatible character encoding detector.
xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
pangu.py - Spacing texts for CJK and alphanumerics.
cchardet - High-speed universal character encoding detector (binding to uchardet).

Text Processing : Slugify

awesome-slugify - A Python slugify library that can preserve unicode.
python-slugify - A Python slugify library that translates unicode to ASCII.
unicode-slugify - A slugifier that generates unicode slugs.
pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)

Text Processing : General Parser

PLY - Implementation of lex and yacc parsing tools for Python
pyparsing - A general purpose framework for generating parsers.

Text Processing : Human Names

python-nameparser - Parsing human names into their individual components.

Text Processing : Phone Number

phonenumbers - Parsing, formatting, storing and validating international phone numbers.

Text Processing : User-Agent strings

HTTP Agent Parser - Python HTTP Agent Parser
uap-python - Python implementation of ua-parser
python-user-agents - Browser user agent parser.
fake-useragent - Python user agent string faker, based on world statistic of browsers
user_agent - Generator of User-Agent data

Text Processing : robots.txt

reppy - Modern robots.txt Parser for Python

Text Processing : Date and Time

dateutil - Useful extensions to the standard Python datetime features
dateparser - python parser for human readable dates
ciso8601 - converts ISO 8601 or RFC 3339 date time strings into Python datetime objects

Text Processing : Price and Currency

price-parser - a small library for extracting price and currency from raw text strings.

Structured Formats

Libraries for parsing and manipulating specific text formats.

Structured Formats : General

tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
messytables - Tools for parsing messy tabular data
rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)

Structured Formats : Office

python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
xlwt / xlrd - Writing and reading data and formatting information from Excel files.
XlsxWriter - A Python module for creating Excel .xlsx files.
xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Marmir - Takes Python data structures and turns them into spreadsheets.

Structured Formats : PDF

PDFMiner - A tool for extracting information from PDF documents.
PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
ReportLab - Allowing Rapid creation of rich PDF documents.
pdftables - Extract tables from PDF files directly

Structured Formats : Markdown

Python-Markdown - A Python implementation of John Gruber’s Markdown.
Mistune - A fast, full-featured pure Python Markdown parser.
markdown2 - A fast and complete Python implementation of Markdown
mistletoe - A fast, extensible and spec-compliant Markdown parser in pure Python

Structured Formats : YAML

PyYAML - YAML implementations for Python.

Structured Formats : CSS

cssutils - A CSS library for Python.

Structured Formats : ATOM/RSS

feedparser - Universal feed parser.

Structured Formats : SQL

sqlparse - A non-validating SQL parser.

Structured Formats : HTTP

http-parser - HTTP request/response parser for python in C
httptools - a Python binding for nodejs HTTP parser

Structured Formats : Microformats

opengraph - A Python module to parse the Open Graph Protocol tags

Structured Formats : Portable Executable

pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.

Structured Formats : PSD

psd-tools - reading Adobe Photoshop PSD files (as described in specification) to Python data structures.

Structured Formats : Bookmarks File

bookmarks-parser - Parses Firefox/Chrome HTML bookmarks files

Structured Formats : JavaScript Object

chompjs - Parsing JavaScript objects into Python dictionaries

Structured Formats : Email

flanker - An email address and MIME parsing library.
Talon - Mailgun library to extract message quotations and signatures.

Serialization

orjson - Fast, correct Python JSON library supporting dataclasses and datetimes
ujson - Ultra fast JSON decoder and encoder written in C with Python bindings
msgspec - A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
msgpack - MessagePack serializer implementation for Python
pydantic - Data validation using Python type hints
cloudpickle - Extended pickling support for Python objects

Natural Language Processing

Libraries for working with human languages.

NLTK - A leading platform for building Python programs to work with human language data.
spacy - Enables using State-of-the-Art Deep Learning models for common NLP tasks.
fastai - Deep Learning library with free video tutorials + active forum community, downside of lib: GPU needed
gensim - library for topic modeling, document indexing and similarity retrieval with large corpora
Pattern - A web mining module for Python. It has tools for natural language processing, machine learning, among others.
TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
jieba - Chinese Words Segmentation Utilities.
SnowNLP - A library for processing Chinese text.
loso - Another Chinese segmentation library.
genius - Chinese word segmentation based on Conditional Random Fields.
langid.py - Stand-alone language identification system.
Korean - A library for Korean morphology.
pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
langdetect - Port of Google's language-detection library to Python

Browser Automation

Browser Automation : Drivers

selenium - automating real browsers (Chrome, Firefox, Opera, IE)
Ghost.py - wrapper of QtWebKit (requires PyQT)
Spynner - wrapper of QtWebKit (requires PyQT)
Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
Requestium - Integration layer between Requests and Selenium for automation of web actions.
Splash - Lightweight, scriptable browser as a service with an HTTP API.
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
Playwright - Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API
seleniumbase - Python framework for Web/UI testing + RPA. 🤖 🏰 Fast, easy, and reliable.

Browser Automation : Frameworks

botasaurus - all-in-one web scraping framework
crawlee - A web scraping and browser automation library for Python to build reliable crawlers

Browser Automation : Tools

xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

threading - standard Python library to run threads. Effective for I/O-bound tasks. Gives little speedup for CPU-bound tasks because of the GIL in standard CPython builds.
multiprocessing - standard Python library to run processes.
concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
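A minimal sketch of the I/O-bound case, using a thread pool from `concurrent.futures`. The `fake_fetch` function is a hypothetical stand-in that sleeps instead of making a real request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a blocking network call: sleeps, then returns."""
    time.sleep(0.2)
    return url, len(url)


def fetch_all(urls, max_workers=8):
    """Run fake_fetch over all URLs in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(fake_fetch, urls))


# Placeholder URLs; nothing is actually downloaded.
urls = ["https://example.com/page/%d" % i for i in range(8)]

start = time.perf_counter()
results = fetch_all(urls)
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```

Because the threads spend their time waiting, the eight calls overlap instead of running back to back. For CPU-bound work, `ProcessPoolExecutor` exposes the same interface but runs the callables in separate processes.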

Asynchronous

Libraries for asynchronous networking programming.

asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
Twisted - An event-driven networking engine.
Tornado - A Web framework and asynchronous networking library.
pulsar - Event-driven concurrent framework for Python.
diesel - Greenlet-based event I/O Framework for Python.
gevent - A coroutine-based Python networking library that uses greenlet.
eventlet - Asynchronous framework with WSGI support.
Tomorrow - Magic decorator syntax for asynchronous code.
grequests - Make asynchronous HTTP Requests easily.
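For a sense of what the event-loop style looks like, here is a small sketch with the stdlib `asyncio`. It runs several simulated downloads concurrently with a cap on how many are in flight. `fake_fetch`, `crawl`, and the `limit` parameter are illustrative names, and the sleep stands in for real network I/O.

```python
import asyncio


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for an async HTTP request."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls, limit=3):
    """Fetch all URLs concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fake_fetch(url)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(crawl(["a", "b", "c", "d"]))
print(pages)
```

With a real client such as aiohttp, the body of `fake_fetch` would be replaced by an actual request; the concurrency limit is one simple way to stay polite toward the target site.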

Job Queue

celery - An asynchronous task queue/job queue based on distributed message passing.
huey - Little multi-threaded task queue.
mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
RQ - lightweight task queue manager based on redis
simpleq - A simple, infinitely scalable, Amazon SQS based queue.
python-gearman - python API for Gearman

Message Queue

kombu - Messaging library for Python

Cloud Computing

picloud - executing python-code in cloud
dominoup.com - executing R, Python and MATLAB code in cloud
minigun-requests - Web scraping API to outsource tons of GET & xpath to cloud computing
pythonista-chromeless - AWS Lambda function that executes given Python code on Selenium

URL and Network Address

Libraries for parsing/modifying URLs, network addresses, domain names.

URL and Network Address : URL

furl - A small Python library that makes manipulating URLs simple.
purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
urllib.parse - interface to break URL strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL.
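The three operations named in the `urllib.parse` entry (splitting, recombining, resolving relative URLs) can be sketched as follows. The example URL and the pagination parameter are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit, urlunsplit

# Placeholder URL of a page that was just scraped.
page_url = "https://example.com/catalog/index.html?page=2&sort=price#top"

# 1. Break the URL into components.
parts = urlsplit(page_url)
print(parts.scheme, parts.netloc, parts.path)

# 2. Change one query parameter and combine the components back into a URL.
query = parse_qs(parts.query)  # values are lists, e.g. {"page": ["2"], ...}
query["page"] = ["3"]
next_url = urlunsplit(
    parts._replace(query=urlencode(query, doseq=True), fragment="")
)
print(next_url)

# 3. Resolve a relative link found on the page against the page URL.
absolute = urljoin(page_url, "../item/42.html")
print(absolute)
```

Resolving every extracted `href` with `urljoin` before queuing it is a common way to avoid mishandling relative links. furl and purl wrap similar operations in an object-oriented API.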

URL and Network Address : Network Address

netaddr - A Python library for representing and manipulating network addresses.
micawber - A small library for extracting rich content from URLs.

Domain Names

tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
find_domains - a library to search for domain names in text data

Web Automation

Tools to automate multiple actions on a website.

Web Automation : Content Extraction

newspaper - News extraction, article extraction and content curation in Python.
python-goose - HTML Content/Article Extractor.
scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
htmldate - Find creation date using common structural patterns or text-based heuristics.
lassie - Web Content Retrieval for Humans.
html2text - Convert HTML to Markdown-formatted text.
libextract - Extract data from websites.
python-readability - Fast Python port of arc90's readability tool.
sumy - A module for automatic summarization of text documents and HTML pages.
Haul - An Extensible Image Crawler.
you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
youtube-dl - A small command-line program to download videos from YouTube.
WikiTeam - Tools for downloading and preserving wikis.
linkchecker - check links in web documents or full websites
python-sitemap - Mini website crawler to make sitemap from a website.
trafilatura - Fast extraction of main text and comments along with structure, conversion to TXT, CSV & XML.
advertools - A customizable crawler to analyze SEO and content of pages and websites.
photon - Incredibly fast crawler designed for OSINT
extractnet - Machine Learning based content and metadata extraction in Python 3

Web Automation : Account Creation

ninjemail - Python library for automated email account creation for different providers.

WebSocket

Libraries for working with WebSocket.

Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

dnspython - a powerful DNS toolkit for python
dnsyo - Check your DNS against over 1500 global DNS servers.
pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

OpenCV - Open Source Computer Vision Library.
SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

scylla - Intelligent proxy pool for Humans
ProxyBroker - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
tproxy - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routing logic in Python

Whois

python-whois - A python module for retrieving and parsing WHOIS data

JavaScript Engine Bindings

Js2Py - JavaScript to Python Translator & JavaScript interpreter written in 100% pure Python
v8eval - Multi-language bindings to JavaScript engine V8

Captcha Solving

captcha_solver - Universal python API to captcha solving services
python-anticaptcha - Client library for solving captchas with anti-captcha.com
python3-anticaptcha - Python library for anti-captcha services
unicaps - a unified Python API for CAPTCHA solving services
