mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-28 08:48:58 +02:00
awesome-web-scraping/python.md
Gregory Petukhov b117a73a54 New stuff
2015-08-13 02:49:22 +06:00

Python Web Scraping Libraries

Network Request

Web-Scraping Frameworks

  • grab - web-scraping framework (pycurl/multicurl based)
  • scrapy - web-scraping framework (twisted based). Did not support Python 3 when this list was written; Python 3 support arrived in Scrapy 1.1.

HTML/XML Parsing

  • lxml - efficient HTML/XML processing library. Supports XPath. Built on the C libraries libxml2 and libxslt.
  • cssselect - querying a DOM tree with CSS selectors
  • pyquery - querying a DOM tree with jQuery-like selectors
  • BeautifulSoup - slow HTML/XML processing library, written in pure Python
  • html5lib - parsing HTML/XML into a DOM according to the WHATWG spec. That spec is used in all modern browsers.
  • feedparser - parsing of RSS/ATOM feeds.
  • Bleach - cleaning of HTML (requires html5lib)

Browser automation and emulation

  • selenium - automating real browsers (Chrome, Firefox, Opera, IE)
  • Ghost.py - wrapper of QtWebKit (requires PyQt)
  • Spynner - wrapper of QtWebKit (requires PyQt)

Multiprocessing

  • threading - standard Python library for running threads. Effective for I/O-bound tasks. Gives no speedup for CPU-bound tasks because of the Python GIL.
  • multiprocessing - standard Python library for running processes. Sidesteps the GIL, so it suits CPU-bound tasks.
  • celery - task queue manager
  • RQ - lightweight task queue manager based on redis
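
The claim that threads pay off for I/O-bound work can be illustrated with a minimal sketch. It assumes nothing beyond the standard library: `fetch` and `fetch_all` are hypothetical helper names, the URLs are placeholders, and `time.sleep` stands in for a real network request so the example runs offline.

```python
import threading
import time


def fetch(url, results, delay=0.2):
    """Hypothetical stand-in for a network request."""
    time.sleep(delay)  # simulated I/O wait; the GIL is released while sleeping
    results[url] = "body of " + url  # single dict assignment, safe in CPython


def fetch_all(urls, delay=0.2):
    """Run one thread per URL and wait for all of them to finish."""
    results = {}
    threads = [
        threading.Thread(target=fetch, args=(url, results, delay))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Placeholder URLs; nothing is actually downloaded.
urls = ["http://example.com/page/%d" % i for i in range(4)]

started = time.perf_counter()
pages = fetch_all(urls)
elapsed = time.perf_counter() - started

# The four simulated waits overlap, so the total is close to one delay
# rather than the sum of all four.
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

If the body of `fetch` were a CPU-heavy loop instead of a wait, the threads would take turns holding the GIL and the total time would not shrink; that is the case where multiprocessing is the usual choice.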

Cloud Computing

  • picloud - executing Python code in the cloud
  • dominoup.com - executing R, Python, and MATLAB code in the cloud