mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-28 08:48:58 +02:00
awesome-web-scraping/python.md
2015-08-14 23:25:26 +05:00

Python Web Scraping

This list contains Python libraries and tools related to web scraping and data processing.

Network Request

Stateful HTTP clients

  • grab - network library (pycurl based)
  • requests - network library
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • mechanize - Stateful programmatic web browsing.
  • cola - A distributed crawling framework.
  • pyspider - A powerful spider system.

Web-Scraping Frameworks

  • grab - web-scraping framework (pycurl/multicurl based)
  • scrapy - web-scraping framework (Twisted based). Python 3 is supported since Scrapy 1.1.
  • portia - Visual scraping for Scrapy.

HTML/XML Parsing

  • lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
  • cssselect - working with DOM tree with CSS selectors
  • pyquery - working with DOM tree with jQuery-like selectors
  • BeautifulSoup - slow HTML/XML processing library, written in pure Python
  • html5lib - parsing HTML/XML and building the DOM according to the WHATWG spec. That spec is used in all modern browsers.
  • feedparser - parsing of RSS/ATOM feeds.
  • Bleach - cleaning of HTML (requires html5lib)
  • MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
  • xmltodict - Makes working with XML feel like working with JSON.
  • xhtml2pdf - HTML/CSS to PDF converter.
  • untangle - Converts XML documents to Python objects for easy access.

Text Processing

Libraries for parsing and manipulating plain texts.

  • General
    • difflib - (Python standard library) Helpers for computing deltas.
    • Levenshtein - Fast computation of Levenshtein distance and string similarity.
    • fuzzywuzzy - Fuzzy String Matching.
    • esmre - Regular expression accelerator.
    • shortuuid - A generator library for concise, unambiguous and URL-safe UUIDs.
    • ftfy - Makes Unicode text less broken and more consistent automagically.
    • unidecode - ASCII transliterations of Unicode text.
    • chardet - Python 2/3 compatible character encoding detector.
    • xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
    • pangu.py - Spacing texts for CJK and alphanumerics.
    • pyfiglet - An implementation of figlet written in Python.
    • uniout - Print readable chars instead of the escaped string.
  • Slugify
    • awesome-slugify - A Python slugify library that can preserve unicode.
    • python-slugify - A Python slugify library that translates unicode to ASCII.
    • unicode-slugify - A slugifier that generates unicode slugs with Django as a dependency.
  • Parser
    • PLY - Implementation of lex and yacc parsing tools for Python.
    • phonenumbers - Parsing, formatting, storing and validating international phone numbers.
    • python-user-agents - Browser user agent parser.
    • sqlparse - A non-validating SQL parser.
    • python-nameparser - Parsing human names into their individual components.
    • pyparsing - A general purpose framework for generating parsers.
  • CSS
  • ATOM/RSS
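
Of the text-processing entries above, difflib ships with Python, so it can be tried without installing anything. The following is a minimal sketch of using it for fuzzy string matching, for example to reconcile slightly different titles scraped from two pages. The helper names `similarity` and `closest_title` and the `cutoff` value are illustrative assumptions, not part of any library listed here.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest_title(query, candidates, cutoff=0.6):
    """Return the candidate most similar to query, or None if nothing passes cutoff.

    Hypothetical helper: cutoff=0.6 is difflib's default threshold, kept explicit here.
    """
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    titles = ["Python Web Scraping", "Ruby Web Scraping"]
    print(closest_title("Pyhton Web Scraping", titles))
    print(similarity("scrapy", "scrappy"))
```

For large candidate sets, the Levenshtein and fuzzywuzzy entries above are likely a better fit, since difflib is written in pure Python.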

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
  • Office
    • python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
    • xlwt / xlrd - Writing and reading data and formatting information from Excel files.
    • XlsxWriter - A Python module for creating Excel .xlsx files.
    • xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
    • openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
    • Marmir - Takes Python data structures and turns them into spreadsheets.
  • PDF
    • PDFMiner - A tool for extracting information from PDF documents.
    • PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
    • ReportLab - Allows rapid creation of rich PDF documents.
  • Markdown
    • Python-Markdown - A Python implementation of John Gruber’s Markdown.
    • Mistune - A fast, full-featured pure-Python Markdown parser.
  • YAML
    • PyYAML - YAML implementations for Python.

Natural Language Processing

Libraries for working with human languages.

  • NLTK - A leading platform for building Python programs to work with human language data.
  • Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
  • TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
  • jieba - Chinese word segmentation utilities.
  • SnowNLP - A library for processing Chinese text.
  • loso - Another Chinese segmentation library.
  • genius - A Chinese segmentation library based on conditional random fields.
  • langid.py - Stand-alone language identification system.
  • Korean - A library for Korean morphology.

Downloader

Libraries for downloading.

  • s3cmd - A command line tool for managing Amazon S3 and CloudFront.
  • s4cmd - Super S3 command line tool, good for higher performance.
  • youtube-dl - A small command-line program to download videos from YouTube.
  • you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
  • coursera - Script for downloading Coursera.org videos and naming them.
  • WikiTeam - Tools for downloading and preserving wikis.
  • subliminal - Library and command line tool to search and download subtitles.

Browser automation and emulation

  • selenium - automating real browsers (Chrome, Firefox, Opera, IE)
  • Ghost.py - wrapper of QtWebKit (requires PyQt)
  • Spynner - wrapper of QtWebKit (requires PyQt)

Multiprocessing

  • threading - standard Python library for running threads. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of Python's GIL.
  • multiprocessing - standard Python library for running processes.
  • gevent - A coroutine-based Python networking library that uses greenlet.
  • eventlet - Asynchronous framework with WSGI support.
  • Tomorrow - Magic decorator syntax for asynchronous code.
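
To illustrate why threads suit I/O-bound scraping work despite the GIL, here is a small sketch using only the standard `threading` module. The function `fake_download` is a hypothetical stand-in for a real network request: it sleeps instead of fetching, and the GIL is released while a thread sleeps or waits on a socket, so the waits overlap.

```python
import threading
import time


def fake_download(url, results, lock, delay=0.1):
    """Simulate an I/O-bound download (assumption: sleep stands in for network wait)."""
    time.sleep(delay)  # the GIL is released while this thread is blocked
    with lock:  # guard the shared dict explicitly
        results[url] = len(url)


def download_all(urls, delay=0.1):
    """Run one thread per URL and return a dict of url -> fake payload size."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fake_download, args=(url, results, lock, delay))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every download to finish
    return results


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(5)]
    started = time.perf_counter()
    print(download_all(urls))
    print("elapsed: %.2fs" % (time.perf_counter() - started))
```

With five simulated downloads of 0.1 s each, the total time should stay close to a single delay rather than the sum. A CPU-bound function would not benefit in the same way, which is where multiprocessing comes in.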

Queue

  • celery - An asynchronous task queue/job queue based on distributed message passing.
  • huey - Little multi-threaded task queue.
  • mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
  • RQ - lightweight task queue manager based on Redis.
  • simpleq - A simple, infinitely scalable, Amazon SQS based queue.

Cloud Computing

  • picloud - executing Python code in the cloud
  • dominoup.com - executing R, Python and MATLAB code in the cloud

Email

Libraries for parsing email.

  • flanker - An email address and MIME parsing library.
  • Talon - Mailgun library to extract message quotations and signatures.

URL Manipulation

Libraries for parsing URLs.

  • furl - A small Python library that makes manipulating URLs simple.
  • purl - A simple, immutable URL class with a clean API for interrogation and manipulation.

Web Content Extracting

Libraries for extracting web contents.

  • newspaper - News extraction, article extraction and content curation in Python.
  • html2text - Convert HTML to Markdown-formatted text.
  • python-goose - HTML Content/Article Extractor.
  • lassie - Web Content Retrieval for Humans.
  • micawber - A small library for extracting rich content from URLs.
  • sumy - A module for automatic summarization of text documents and HTML pages.
  • Haul - An Extensible Image Crawler.
  • python-readability - Fast Python port of arc90's readability tool.
  • opengraph - A Python module to parse the Open Graph Protocol.
  • textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
  • sanitize - Bringing sanity to world of messed-up data.

Asynchronous

Libraries for asynchronous networking programming.

  • asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
  • Twisted - An event-driven networking engine.
  • Tornado - A Web framework and asynchronous networking library.
  • pulsar - Event-driven concurrent framework for Python.
  • diesel - Greenlet-based event I/O Framework for Python.
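
As a rough illustration of the event loop, coroutines and tasks mentioned for asyncio, the sketch below runs several simulated requests concurrently on a single thread. It assumes Python 3.7 or newer, because it uses `asyncio.run` and `async`/`await` syntax that Python 3.4 did not have. `fake_fetch` is a hypothetical placeholder that sleeps instead of performing real network I/O.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Simulate a non-blocking request (assumption: sleep stands in for network I/O)."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return "%s: done" % url


async def crawl(urls):
    """Schedule all fetches concurrently; gather returns results in input order."""
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def run_crawl(urls):
    """Synchronous entry point that starts an event loop and runs the crawl."""
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    print(run_crawl(["http://example.com/a", "http://example.com/b"]))
```

A real crawler would replace `fake_fetch` with an asynchronous HTTP client; that choice is left open here because the list above does not name one.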

WebSocket

Libraries for working with WebSocket.

  • Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
  • AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
  • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

Other python lists