mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-28 08:48:58 +02:00
Python Web Scraping Libraries
Network Request
- urllib - standard Python network library
- requests - network library
- grab - network library (pycurl based)
- pycurl - network library (binding to libcurl)
- urllib3 - network library
- httplib2 - network library
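As a minimal sketch of what these network libraries handle, the snippet below uses only the standard `urllib` package to build a GET request with query parameters and a custom User-Agent header. The `build_request` helper, the `example.com` URL, and the `example-scraper/0.1` agent string are hypothetical names chosen for illustration; the request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, user_agent="example-scraper/0.1"):
    """Build (but do not send) a GET request with an encoded query string.

    build_request is a hypothetical helper, not part of urllib.
    """
    query = urlencode(params)  # spaces become "+", values are percent-encoded
    url = f"{base_url}?{query}"
    return Request(url, headers={"User-Agent": user_agent})


# example.com is a placeholder host used for illustration only.
req = build_request("https://example.com/search", {"q": "web scraping", "page": 2})

print(req.full_url)
print(req.get_method())

# To actually send the request, something like this would be used:
#   from urllib.request import urlopen
#   with urlopen(req, timeout=10) as response:
#       body = response.read()
```

Third-party libraries such as requests wrap the same steps (URL encoding, headers, sending) behind a shorter interface.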
Web-Scraping Frameworks
- grab - web-scraping framework (pycurl/multicurl based)
- scrapy - web-scraping framework (twisted based). Supports Python 3 since version 1.1.
HTML/XML Parsing
- lxml - efficient HTML/XML processing library. Supports XPath. Built on the C libraries libxml2 and libxslt.
- cssselect - querying a DOM tree with CSS selectors
- pyquery - querying a DOM tree with jQuery-like selectors
- BeautifulSoup - slow HTML/XML processing library, written in pure Python
- html5lib - parsing HTML/XML into a DOM tree according to the WHATWG spec. That spec is used in all modern browsers.
- feedparser - parsing of RSS/Atom feeds.
- Bleach - cleaning of HTML (requires html5lib)
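To illustrate the kind of XPath-style selection mentioned above, here is a small sketch using the standard library's `xml.etree.ElementTree` as a stand-in, so it runs without extra installs. ElementTree supports only a limited XPath subset and requires well-formed markup; lxml offers full XPath and tolerant HTML parsing. The sample document and its `example.com` links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample document (hypothetical content).
DOC = """
<html>
  <body>
    <ul>
      <li><a class="lib" href="https://example.com/lxml">lxml</a></li>
      <li><a class="lib" href="https://example.com/pyquery">pyquery</a></li>
      <li><a href="https://example.com/other">other</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# Select every <a> element, at any depth, whose class attribute is "lib".
links = [(a.text, a.get("href")) for a in root.findall(".//a[@class='lib']")]

for text, href in links:
    print(text, href)
```

Real-world HTML is rarely well-formed XML, which is why the dedicated parsers in this list are usually the better choice.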
Browser automation and emulation
- selenium - automating real browsers (Chrome, Firefox, Opera, IE)
- Ghost.py - wrapper of QtWebKit (requires PyQt)
- Spynner - wrapper of QtWebKit (requires PyQt)
Multiprocessing
- threading - standard Python library to run threads. Effective for I/O-bound tasks. Ineffective for CPU-bound tasks because of the Python GIL.
- multiprocessing - standard Python library to run processes.
- celery - task queue manager
- RQ - lightweight task queue manager based on redis
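The note about threads and I/O-bound work can be sketched as follows. `fake_fetch` is a hypothetical function that simulates network latency with `time.sleep` instead of making real requests, and the `example.com` URLs are placeholders. Because a sleeping (or network-waiting) thread releases the GIL, the five simulated downloads overlap instead of running one after another.

```python
import threading
import time


def fake_fetch(url, results, lock):
    """Pretend to download a page; the sleep stands in for network latency."""
    time.sleep(0.2)
    with lock:  # guard the shared dict while writing
        results[url] = len(url)


# Placeholder URLs used for illustration only.
urls = [f"https://example.com/page/{i}" for i in range(5)]
results = {}
lock = threading.Lock()

start = time.perf_counter()
threads = [
    threading.Thread(target=fake_fetch, args=(url, results, lock))
    for url in urls
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every simulated download to finish
elapsed = time.perf_counter() - start

# Run serially, the five calls would take about 1 second in total.
print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

For CPU-bound work such as heavy parsing, the same pattern with threads would gain little, which is where multiprocessing or a task queue becomes the more suitable tool.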
Cloud Computing
- picloud - executing Python code in the cloud
- dominoup.com - executing R, Python, and MATLAB code in the cloud