mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-28 08:48:58 +02:00

Move hodor to the HTML/XML parsing - General section

Cyriac Thomas 2017-01-12 21:11:28 +05:30
parent b9688c8f79
commit faacd85957


@@ -67,10 +67,10 @@ This list contains python libraries related to web scraping and data processing
* [xmltodict](https://github.com/martinblech/xmltodict) - Makes working with XML feel like you are working with JSON.
* [xhtml2pdf](https://github.com/chrisglass/xhtml2pdf) - HTML/CSS to PDF converter.
* [untangle](https://github.com/stchris/untangle) - Converts XML documents to Python objects for easy access.
+* [hodor](https://github.com/CompileInc/hodor) - Configuration driven wrapper around lxml and cssselect.
* Sanitizing
* [Bleach](http://bleach.readthedocs.org/en/latest/) - cleaning of HTML (requires html5lib)
* [sanitize](https://github.com/Alir3z4/sanitize) - Bringing sanity to world of messed-up data.
-* [hodor](https://github.com/CompileInc/hodor) - Configuration driven wrapper around lxml and cssselect.
## Text Processing