mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-30 08:57:19 +02:00

Gregory Petukhov 4973812455 Update python.md

2015-08-16 20:48:37 +05:00

Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network

urllib - network library (stdlib)
requests - network library
grab - network library (pycurl based)
pycurl - network library (binding to libcurl)
urllib3 - network library
httplib2 - network library
treq - requests like API (twisted based)
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MechanicalSoup - A Python library for automating interaction with websites.
mechanize - Stateful programmatic web browsing.
socket - low-level networking interface (stdlib)
grequests - GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

Web-Scraping Frameworks

grab - web-scraping framework (pycurl/multicurl based)
scrapy - web-scraping framework (twisted based). Python 3 is supported since Scrapy 1.1.
portia - Visual scraping for Scrapy.
pyspider - A powerful spider system.
cola - A distributed crawling framework.

HTML/XML Parsing

lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
cssselect - working with DOM tree with CSS selectors
pyquery - working with DOM tree with jQuery-like selectors
BeautifulSoup - slow HTML/XML processing library, written in pure Python
html5lib - builds a DOM tree from HTML/XML documents, parsing them according to the WHATWG spec. That spec is used in all modern browsers.
feedparser - parsing of RSS/ATOM feeds.
Bleach - cleaning of HTML (requires html5lib)
MarkupSafe - Implements a XML/HTML/XHTML Markup safe string for Python.
xmltodict - Makes working with XML feel like working with JSON.
xhtml2pdf - HTML/CSS to PDF converter.
untangle - Converts XML documents to Python objects for easy access.

Text Processing

Libraries for parsing and manipulating plain text.

General
- difflib - (Python standard library) Helpers for computing deltas.
- Levenshtein - Fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy - Fuzzy String Matching.
- esmre - Regular expression accelerator.
- ftfy - Makes Unicode text less broken and more consistent automagically.
Transliteration
- unidecode - ASCII transliterations of Unicode text.
Character encoding
- uniout - Print readable chars instead of the escaped string.
- chardet - Python 2/3 compatible character encoding detector.
- xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
- pangu.py - Spacing texts for CJK and alphanumerics.
Slugify
- awesome-slugify - A Python slugify library that can preserve unicode.
- python-slugify - A Python slugify library that translates unicode to ASCII.
- unicode-slugify - A slugifier that generates unicode slugs.
General Parser
- PLY - Implementation of lex and yacc parsing tools for Python
- pyparsing - A general purpose framework for generating parsers.
Human names
- python-nameparser - Parsing human names into their individual components.
Phone Number
- phonenumbers - Parsing, formatting, storing and validating international phone numbers.
User-agent string
- python-user-agents - Browser user agent parser.
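
As a minimal sketch of the kind of fuzzy matching covered by the General group above, the following uses only the standard-library `difflib`. The sample strings and the candidate list are made up for illustration; third-party libraries such as Levenshtein or fuzzywuzzy expose their own, different APIs.

```python
import difflib

# Hypothetical product titles scraped from two different pages.
a = "Acme Web Scraper 2000"
b = "ACME web-scraper 2000"

# Similarity ratio in [0, 1]; lowercasing first makes the comparison case-insensitive.
ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pick the closest candidate for a misspelled word (default cutoff is 0.6).
candidates = ["requests", "urllib3", "httplib2"]
matches = difflib.get_close_matches("reqests", candidates, n=1)

print(round(ratio, 2), matches)
```

A ratio close to 1 suggests the two strings likely refer to the same item; the threshold to use depends on the data and is a judgment call.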

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

General
- tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
Office
- python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
- xlwt / xlrd - Writing and reading data and formatting information from Excel files.
- XlsxWriter - A Python module for creating Excel .xlsx files.
- xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir - Takes Python data structures and turns them into spreadsheets.
PDF
- PDFMiner - A tool for extracting information from PDF documents.
- PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
- ReportLab - Allows rapid creation of rich PDF documents.
Markdown
- Python-Markdown - A Python implementation of John Gruber’s Markdown.
- Mistune - A fast, full-featured pure Python Markdown parser.
YAML
- PyYAML - YAML implementations for Python.
CSS
- cssutils - A CSS library for Python.
ATOM/RSS
- feedparser - Universal feed parser.
SQL
- sqlparse - A non-validating SQL parser.

Natural Language Processing

Libraries for working with human languages.

NLTK - A leading platform for building Python programs to work with human language data.
Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
jieba - Chinese Words Segmentation Utilities.
SnowNLP - A library for processing Chinese text.
loso - Another Chinese segmentation library.
genius - A Chinese word segmentation library based on Conditional Random Fields.
langid.py - Stand-alone language identification system.
Korean - A library for Korean morphology.

Downloader

Libraries for downloading.

s3cmd - A command line tool for managing Amazon S3 and CloudFront.
s4cmd - Super S3 command line tool, good for higher performance.
youtube-dl - A small command-line program to download videos from YouTube.
you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
WikiTeam - Tools for downloading and preserving wikis.
subliminal - Library and command line tool to search and download subtitles.

Browser automation and emulation

selenium - automating real browsers (Chrome, Firefox, Opera, IE)
Ghost.py - wrapper of QtWebKit (requires PyQt)
Spynner - wrapper of QtWebKit (requires PyQt)
Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)

Multiprocessing

threading - Python standard library module for running threads. Effective for I/O-bound tasks; of little use for CPU-bound tasks because of the Python GIL.
multiprocessing - Python standard library module for running processes.
celery - An asynchronous task queue/job queue based on distributed message passing.
concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
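
The point about threads suiting I/O-bound work can be sketched with `concurrent.futures`. This is an illustration under assumptions, not a benchmark: `fake_download` is a hypothetical stand-in that sleeps instead of making a real request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for a network request: sleeps instead of doing real I/O."""
    time.sleep(0.2)  # the GIL is released while a thread sleeps or waits on I/O
    return url, len(url)


# Placeholder URLs, used only as labels here.
urls = ["https://example.com/page/%d" % i for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in the same order as the inputs.
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

print("fetched %d pages in %.2f s" % (len(results), elapsed))
```

Because the waits overlap, the total time should be close to one delay rather than five. For CPU-bound work, swapping in `ProcessPoolExecutor` is the usual way around the GIL.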

Asynchronous

Libraries for asynchronous network programming.

asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
Twisted - An event-driven networking engine.
Tornado - A Web framework and asynchronous networking library.
pulsar - Event-driven concurrent framework for Python.
diesel - Greenlet-based event I/O Framework for Python.
gevent - A coroutine-based Python networking library that uses greenlet.
eventlet - Asynchronous framework with WSGI support.
Tomorrow - Magic decorator syntax for asynchronous code.
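
A minimal sketch of the event loop, coroutine and task model using the standard-library `asyncio`. It assumes Python 3.7+ for `asyncio.run` (the `async`/`await` syntax itself needs 3.5+); `fake_fetch` is a hypothetical coroutine that sleeps instead of performing a real request.

```python
import asyncio


async def fake_fetch(name, delay):
    """Hypothetical coroutine: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return name + " done"


async def main():
    # gather() runs the coroutines concurrently and returns their results
    # in the order the coroutines were passed, not in completion order.
    return await asyncio.gather(
        fake_fetch("a", 0.2),
        fake_fetch("b", 0.1),
        fake_fetch("c", 0.05),
    )


results = asyncio.run(main())
print(results)
```

All three waits run on a single thread; the event loop switches between coroutines at each `await`.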

Queue

celery - An asynchronous task queue/job queue based on distributed message passing.
huey - Little multi-threaded task queue.
mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
RQ - lightweight task queue manager based on Redis
simpleq - A simple, infinitely scalable, Amazon SQS based queue.

Cloud Computing

picloud - executing Python code in the cloud
dominoup.com - executing R, Python and MATLAB code in the cloud

Email

Libraries for parsing email.

flanker - An email address and MIME parsing library.
Talon - Mailgun library to extract message quotations and signatures.

URL Manipulation

Libraries for parsing URLs.

furl - A small Python library that makes manipulating URLs simple.
purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
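
The three operations listed for `urllib.parse` (splitting a URL, recombining it, and resolving a relative link) can be sketched as follows. The URL is a made-up example, and the Python 3 module name is assumed; in Python 2 the same functions live in `urlparse`.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical URL, used only for illustration.
url = "https://example.com/catalog/items.html?page=2#reviews"

# Break the URL into scheme, netloc, path, query and fragment.
parts = urlsplit(url)

# Combine the components back into a URL, with a new query and no fragment.
next_page = urlunsplit(parts._replace(query="page=3", fragment=""))

# Convert a relative link found on the page into an absolute URL.
absolute = urljoin(url, "../img/logo.png")

print(parts.netloc, parts.path, parts.query)
print(next_page)
print(absolute)
```

`urlsplit` returns a named tuple, which is why `_replace` can swap individual components before the URL is rebuilt.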

Web Content Extracting

Libraries for extracting web contents.

newspaper - News extraction, article extraction and content curation in Python.
html2text - Convert HTML to Markdown-formatted text.
python-goose - HTML Content/Article Extractor.
lassie - Web Content Retrieval for Humans.
micawber - A small library for extracting rich content from URLs.
sumy - A module for automatic summarization of text documents and HTML pages.
Haul - An Extensible Image Crawler.
python-readability - Fast Python port of arc90's readability tool.
opengraph - A Python module to parse the Open Graph Protocol
textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
sanitize - Bringing sanity to world of messed-up data.
scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

WebSocket

Libraries for working with WebSocket.

Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

dnsyo - Check your DNS against over 1500 global DNS servers.

Computer Vision

OpenCV - Open Source Computer Vision Library.
SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.
