1
0
mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-28 08:48:58 +02:00

Extractor section reorganized + ref added

This commit is contained in:
Adrien Barbaresi 2018-01-22 18:23:36 +01:00
parent 08e550bfd4
commit 81de5d67c8

View File

@ -246,25 +246,29 @@ statistic of browsers
* [tldextract](https://github.com/john-kurkowski/tldextract) - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
* Network Address
* [netaddr](https://github.com/drkjam/netaddr) - A Python library for representing and manipulating network addresses.
* [micawber](https://github.com/coleifer/micawber) - A small library for extracting rich content from URLs.
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
* Text and metadata from HTML pages
* [newspaper](https://github.com/codelucas/newspaper) - News extraction, article extraction and content curation in Python.
* [html2text](https://github.com/Alir3z4/html2text) - Convert HTML to Markdown-formatted text.
* [python-goose](https://github.com/grangier/python-goose) - HTML Content/Article Extractor.
* [lassie](https://github.com/michaelhelmick/lassie) - Web Content Retrieval for Humans.
* [micawber](https://github.com/coleifer/micawber) - A small library for extracting rich content from URLs.
* [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
* [Haul](https://github.com/vinta/Haul) - An Extensible Image Crawler.
* [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
* [scrapely](https://github.com/scrapy/scrapely) - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
* Metadata from HTML pages
* [htmldate](https://github.com/adbar/htmldate) - Find creation date using common structural patterns or text-based heuristics.
* [lassie](https://github.com/michaelhelmick/lassie) - Web Content Retrieval for Humans.
* Text/Data from HTML pages
* [html2text](https://github.com/Alir3z4/html2text) - Convert HTML to Markdown-formatted text.
* [libextract](https://github.com/datalib/libextract) - Extract data from websites.
* [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
* [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
* Images
* [Haul](https://github.com/vinta/Haul) - An Extensible Image Crawler.
* Video
* [youtube-dl](http://rg3.github.io/youtube-dl/) - A small command-line program to download videos from YouTube.
* [you-get](http://www.soimort.org/you-get/) - A YouTube/Youku/Niconico video downloader written in Python 3.
* [youtube-dl](http://rg3.github.io/youtube-dl/) - A small command-line program to download videos from YouTube.
* Wiki
* [WikiTeam](https://github.com/WikiTeam/wikiteam) - Tools for downloading and preserving wikis.