Refactor markup

2024-11-24 08:32:19 +02:00 · 2019-07-09 03:04:16 +03:00
commit 631f174466
parent 7b4fd573b7
2 changed files with 236 additions and 194 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@
 *.orig

 html
+Pipfile.lock
--- a/python.md
+++ b/python.md
@@ -4,19 +4,19 @@ This list contains python libraries related to web scraping and data processing

 * [Python Web Scraping](#python-web-scraping)
   * [Network](#network)
-   * [Web-scraping Frameworks](#web-scraping-frameworks)
-   * [HTML/XML Parsing](#htmlxml-parsing)
+   * [Web Scraping Frameworks](#web-scraping-frameworks)
+   * [HTML/XML](#html-xml)
   * [Text processing](#text-processing)
-   * [Specific Formats Processing](#specific-formats-processing)
+   * [Structured Formats](#specific-formats-processing)
   * [Natural Language Processing](#natural-language-processing)
-   * [Browser automation and emulation](#browser-automation-and-emulation)
+   * [Browser automation](#browser-automation)
   * [Multiprocessing](#multiprocessing)
   * [Job Queue](#job-queue)
   * [Message Queue](#message-queue)
   * [Cloud Computing](#cloud-computing)
   * [Email](#email)
-   * [URL and Network Address Manipulation](#url-and-network-address-manipulation)
-   * [Web Content Extracting](#web-content-extracting)
+   * [URL and Network Address](#url-and-network-address)
+   * [Web Content Extraction](#web-content-extraction)
   * [Asynchronous](#asynchronous)
   * [WebSocket](#websocket)
   * [DNS Resolving](#dns-resolving)
@@ -28,166 +28,201 @@ This list contains python libraries related to web scraping and data processing
   * [Other Python Lists](#other-python-lists)

 ## Network
-* General
-  * [urllib](https://docs.python.org/3.4/library/urllib.html?highlight=urllib#module-urllib) - network library (stdlib)
-  * [requests](https://github.com/kennethreitz/requests) - network library
-  * [grab](https://github.com/lorien/grab) - network library (pycurl based)
-  * [pycurl](https://github.com/pycurl/pycurl) - network library (binding to [libcurl](http://curl.haxx.se/libcurl/))
-  * [urllib3](https://github.com/shazow/urllib3) - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
-  * [httplib2](https://github.com/jcgregorio/httplib2) - network library
-  * [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
-  * [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
-  * [mechanize](https://github.com/python-mechanize/mechanize) - Stateful programmatic web browsing.
-  * [socket](https://docs.python.org/3/library/socket.html) low-level networking interface (stdlib)
-  * [Unirest for Python](https://github.com/Mashape/unirest-python) - Unirest is a set of lightweight HTTP libraries available in multiple languages
-  * [hyper](https://github.com/Lukasa/hyper) - HTTP/2 Client for Python
-  * [PySocks](https://github.com/Anorov/PySocks) - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
-* Asynchronous
-  * [treq](https://github.com/dreid/treq) - requests like API (twisted based)
-  * [aiohttp](https://github.com/KeepSafe/aiohttp) - http client/server for asyncio (PEP-3156)
-* Low Level
-  * [dpkt](https://github.com/kbandla/dpkt) - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
-  * [pyOpenSSL](https://github.com/pyca/pyopenssl) - A Python wrapper around the OpenSSL library
-  * [tlslite-ng](https://github.com/tomato42/tlslite-ng) - TLS implementation in pure python

-## Web-Scraping Frameworks
-* Full Featured Crawlers
-  * [grab](http://docs.grablib.org/en/latest/#grab-spider-user-manual) - web-scraping framework (pycurl/multicurl based)
-  * [scrapy](http://scrapy.org/) - web-scraping framework (twisted based).
-  * [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
-  * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
-* Other
-  * [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
-  * [restkit](https://github.com/benoitc/restkit) - HTTP resource kit for Python. It allows you to easily access to HTTP resource and build objects around it.
-  * [requests-html](https://github.com/kennethreitz/requests-html) - Pythonic HTML Parsing for Humans.
-  * [demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
-  * [ScrapydWeb](https://github.com/my8100/scrapydweb) - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.
+### Network : General

-## HTML/XML Parsing
+* [urllib](https://docs.python.org/3.4/library/urllib.html?highlight=urllib#module-urllib) - network library (stdlib)
+* [requests](https://github.com/kennethreitz/requests) - network library
+* [grab](https://github.com/lorien/grab) - network library (pycurl based)
+* [pycurl](https://github.com/pycurl/pycurl) - network library (binding to [libcurl](http://curl.haxx.se/libcurl/))
+* [urllib3](https://github.com/shazow/urllib3) - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
+* [httplib2](https://github.com/jcgregorio/httplib2) - network library
+* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
+* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
+* [mechanize](https://github.com/python-mechanize/mechanize) - Stateful programmatic web browsing.
+* [socket](https://docs.python.org/3/library/socket.html) low-level networking interface (stdlib)
+* [Unirest for Python](https://github.com/Mashape/unirest-python) - Unirest is a set of lightweight HTTP libraries available in multiple languages
+* [hyper](https://github.com/Lukasa/hyper) - HTTP/2 Client for Python
+* [PySocks](https://github.com/Anorov/PySocks) - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

-* General
-  * [lxml](http://lxml.de) - effective HTML/XML processing library. Supports XPATH. Written in C.
-  * [cssselect](https://pythonhosted.org/cssselect) - working with DOM tree with CSS selectors
-  * [pyquery](http://pythonhosted.org//pyquery/) - working with DOM tree with jQuery-like selectors
-  * [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) - slow HTML/XMl processing library, written in pure python
-  * [html5lib](http://html5lib.readthedocs.org/en/latest/) - builds DOM of HTML/XML document according to [WHATWG spec](url=http://www.whatwg.org/). That spec is used in all modern browsers.
-  * [feedparser](http://pythonhosted.org/feedparser/) - parsing of RSS/ATOM feeds.
-  * [MarkupSafe](https://github.com/mitsuhiko/markupsafe) - Implements a XML/HTML/XHTML Markup safe string for Python.
-  * [xmltodict](https://github.com/martinblech/xmltodict) - Working with XML feel like you are working with JSON.
-  * [xhtml2pdf](https://github.com/chrisglass/xhtml2pdf) - HTML/CSS to PDF converter.
-  * [untangle](https://github.com/stchris/untangle) - Converts XML documents to Python objects for easy access.
-  * [hodor](https://github.com/CompileInc/hodor) - Configuration driven wrapper around lxml and cssselect.
-  * [chopper](https://github.com/jurismarches/chopper) - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
-  * [selectolax](https://github.com/rushter/selectolax) - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
-* Sanitizing
-  * [Bleach](http://bleach.readthedocs.org/en/latest/) - cleaning of HTML (requires html5lib)
-  * [sanitize](https://github.com/Alir3z4/sanitize) - Bringing sanity to world of messed-up data.
+### Network : Asynchronous
+
+* [treq](https://github.com/dreid/treq) - requests like API (twisted based)
+* [aiohttp](https://github.com/KeepSafe/aiohttp) - http client/server for asyncio (PEP-3156)
+
+### Network : Low Level
+
+* [dpkt](https://github.com/kbandla/dpkt) - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
+* [pyOpenSSL](https://github.com/pyca/pyopenssl) - A Python wrapper around the OpenSSL library
+* [tlslite-ng](https://github.com/tomato42/tlslite-ng) - TLS implementation in pure python
+
+## Web Scraping Frameworks
+
+### Web Scraping Frameworks : Full Featured Crawlers
+
+* [grab](http://docs.grablib.org/en/latest/#grab-spider-user-manual) - web-scraping framework (pycurl/multicurl based)
+* [scrapy](http://scrapy.org/) - web-scraping framework (twisted based).
+* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
+* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
+
+### Web Scraping Frameworks : Other
+
+* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
+* [restkit](https://github.com/benoitc/restkit) - HTTP resource kit for Python. It allows you to easily access to HTTP resource and build objects around it.
+* [requests-html](https://github.com/kennethreitz/requests-html) - Pythonic HTML Parsing for Humans.
+* [demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
+* [ScrapydWeb](https://github.com/my8100/scrapydweb) - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.
+
+## HTML/XML
+
+### HTML/XML : General
+
+* [lxml](http://lxml.de) - effective HTML/XML processing library. Supports XPATH. Written in C.
+* [cssselect](https://pythonhosted.org/cssselect) - working with DOM tree with CSS selectors
+* [pyquery](http://pythonhosted.org//pyquery/) - working with DOM tree with jQuery-like selectors
+* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) - slow HTML/XMl processing library, written in pure python
+* [html5lib](http://html5lib.readthedocs.org/en/latest/) - builds DOM of HTML/XML document according to [WHATWG spec](url=http://www.whatwg.org/). That spec is used in all modern browsers.
+* [feedparser](http://pythonhosted.org/feedparser/) - parsing of RSS/ATOM feeds.
+* [MarkupSafe](https://github.com/mitsuhiko/markupsafe) - Implements a XML/HTML/XHTML Markup safe string for Python.
+* [xmltodict](https://github.com/martinblech/xmltodict) - Working with XML feel like you are working with JSON.
+* [xhtml2pdf](https://github.com/chrisglass/xhtml2pdf) - HTML/CSS to PDF converter.
+* [untangle](https://github.com/stchris/untangle) - Converts XML documents to Python objects for easy access.
+* [hodor](https://github.com/CompileInc/hodor) - Configuration driven wrapper around lxml and cssselect.
+* [chopper](https://github.com/jurismarches/chopper) - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
+* [selectolax](https://github.com/rushter/selectolax) - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
+
+### HTML/XML : Sanitizing
+
+* [Bleach](http://bleach.readthedocs.org/en/latest/) - cleaning of HTML (requires html5lib)
+* [sanitize](https://github.com/Alir3z4/sanitize) - Bringing sanity to world of messed-up data.

 ## Text Processing

-*Libraries for parsing and manipulating plain texts.*
+Libraries for parsing and manipulating plain texts.

-* General
-    * [difflib](https://docs.python.org/3/library/difflib.html) - (Python standard library) Helpers for computing deltas.
-    * [Levenshtein](https://github.com/ztane/python-Levenshtein/) - Fast computation of Levenshtein distance and string similarity.
-    * [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching.
-    * [esmre](https://code.google.com/p/esmre/) - Regular expression accelerator.
-    * [ftfy](https://github.com/LuminosoInsight/python-ftfy) - Makes Unicode text less broken and more consistent automagically.
+### Text Processing : General

-* Transliteration
-  * [unidecode](https://pypi.python.org/pypi/Unidecode) - ASCII transliterations of Unicode text.
+* [difflib](https://docs.python.org/3/library/difflib.html) - (Python standard library) Helpers for computing deltas.
+* [Levenshtein](https://github.com/ztane/python-Levenshtein/) - Fast computation of Levenshtein distance and string similarity.
+* [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching.
+* [esmre](https://code.google.com/p/esmre/) - Regular expression accelerator.
+* [ftfy](https://github.com/LuminosoInsight/python-ftfy) - Makes Unicode text less broken and more consistent automagically.

-* Character encoding
-  * [uniout](https://github.com/moskytw/uniout) - Print readable chars instead of the escaped string.
-  * [chardet](https://github.com/chardet/chardet) - Python 2/3 compatible character encoding detector.
-  * [xpinyin](https://github.com/lxneng/xpinyin) - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
-  * [pangu.py](https://github.com/vinta/pangu.py) - Spacing texts for CJK and alphanumerics.
-  * [cchardet](https://github.com/PyYoshi/cChardet) - cChardet is high speed universal character encoding detector. - binding to uchardet.
+### Text Processing : Transliteration

-* Slugify
-    * [awesome-slugify](https://github.com/dimka665/awesome-slugify) - A Python slugify library that can preserve unicode.
-    * [python-slugify](https://github.com/un33k/python-slugify) - A Python slugify library that translates unicode to ASCII.
-    * [unicode-slugify](https://github.com/mozilla/unicode-slugify) - A slugifier that generates unicode slugs.
-    * [pytils](https://github.com/j2a/pytils) - Simple tools for processing strings in russian (including pytils.translit.slugify)
+* [unidecode](https://pypi.python.org/pypi/Unidecode) - ASCII transliterations of Unicode text.

-* General Parser
-    * [PLY](http://www.dabeaz.com/ply/) - Implementation of lex and yacc parsing tools for Python
-    * [pyparsing](http://pyparsing.wikispaces.com/) - A general purpose framework for generating parsers.
+### Text Processing : Character Encoding

-* Human names
-  * [python-nameparser](https://github.com/derek73/python-nameparser) - Parsing human names into their individual components.
+* [uniout](https://github.com/moskytw/uniout) - Print readable chars instead of the escaped string.
+* [chardet](https://github.com/chardet/chardet) - Python 2/3 compatible character encoding detector.
+* [xpinyin](https://github.com/lxneng/xpinyin) - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
+* [pangu.py](https://github.com/vinta/pangu.py) - Spacing texts for CJK and alphanumerics.
+* [cchardet](https://github.com/PyYoshi/cChardet) - cChardet is high speed universal character encoding detector. - binding to uchardet.

-* Phone Number
-    * [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Parsing, formatting, storing and validating international phone numbers.
+### Text Processing : Slugify

-* User-agent string
-    * [python-user-agents](https://github.com/selwin/python-user-agents) - Browser user agent parser.
-    * [HTTP Agent Parser](https://github.com/shon/httpagentparser) - Python HTTP Agent Parser
-    * [fake-useragent](https://github.com/hellysmile/fake-useragent) - Python user agent string faker, based on world statistic of browsers
-    * [user_agent](https://github.com/lorien/user_agent) - Generator of User-Agent data
+* [awesome-slugify](https://github.com/dimka665/awesome-slugify) - A Python slugify library that can preserve unicode.
+* [python-slugify](https://github.com/un33k/python-slugify) - A Python slugify library that translates unicode to ASCII.
+* [unicode-slugify](https://github.com/mozilla/unicode-slugify) - A slugifier that generates unicode slugs.
+* [pytils](https://github.com/j2a/pytils) - Simple tools for processing strings in russian (including pytils.translit.slugify)

-* robots.txt
-    * [reppy](https://github.com/seomoz/reppy) - Modern robots.txt Parser for Python
+### Text Processing : General Parser
+
+* [PLY](http://www.dabeaz.com/ply/) - Implementation of lex and yacc parsing tools for Python
+* [pyparsing](http://pyparsing.wikispaces.com/) - A general purpose framework for generating parsers.
+
+### Text Processing : Human Names
+
+* [python-nameparser](https://github.com/derek73/python-nameparser) - Parsing human names into their individual components.
+
+### Text Processing : Phone Number
+
+* [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Parsing, formatting, storing and validating international phone numbers.
+
+### Text Processing :: User-Agent strings
+
+* [python-user-agents](https://github.com/selwin/python-user-agents) - Browser user agent parser.
+* [HTTP Agent Parser](https://github.com/shon/httpagentparser) - Python HTTP Agent Parser
+* [fake-useragent](https://github.com/hellysmile/fake-useragent) - Python user agent string faker, based on world statistic of browsers
+* [user_agent](https://github.com/lorien/user_agent) - Generator of User-Agent data
+
+### Text Processing : robots.txt
+
+* [reppy](https://github.com/seomoz/reppy) - Modern robots.txt Parser for Python
    
-* Date and Time
-    * [dateutil](https://github.com/dateutil/dateutil) - Useful extensions to the standard Python datetime features
+### Text Processing :: Date and Time

-## Specific Formats Processing
+* [dateutil](https://github.com/dateutil/dateutil) - Useful extensions to the standard Python datetime features

-*Libraries for parsing and manipulating specific text formats.*
+## Structured Formats

-* General
-    * [tablib](https://github.com/kennethreitz/tablib) - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
-    * [textract](https://github.com/deanmalmgren/textract) - Extract text from any document, Word, PowerPoint, PDFs, etc.
-    * [messytables](https://github.com/okfn/messytables) - Tools for parsing messy tabular data
-    * [rows](https://github.com/turicas/rows) - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
+Libraries for parsing and manipulating specific text formats.

-* Office
-    * [python-docx](https://github.com/python-openxml/python-docx) - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
-    * [xlwt](https://github.com/python-excel/xlwt) / [xlrd](https://github.com/python-excel/xlrd) - Writing and reading data and formatting information from Excel files.
-    * [XlsxWriter](https://xlsxwriter.readthedocs.org/) - A Python module for creating Excel .xlsx files.
-    * [xlwings](http://xlwings.org/) - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
-    * [openpyxl](https://openpyxl.readthedocs.org/en/latest/) - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
-    * [Marmir](https://github.com/brianray/mm) - Takes Python data structures and turns them into spreadsheets.
+### Structured Formats : General

-* PDF
-    * [PDFMiner](https://github.com/euske/pdfminer) - A tool for extracting information from PDF documents.
-    * [PyPDF2](https://github.com/mstamy2/PyPDF2) - A library capable of splitting, merging and transforming PDF pages.
-    * [ReportLab](http://www.reportlab.com/opensource/) - Allowing Rapid creation of rich PDF documents.
-    * [pdftables](https://pypi.python.org/pypi/pdftables) - Extract tables from PDF files directly
+* [tablib](https://github.com/kennethreitz/tablib) - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
+* [textract](https://github.com/deanmalmgren/textract) - Extract text from any document, Word, PowerPoint, PDFs, etc.
+* [messytables](https://github.com/okfn/messytables) - Tools for parsing messy tabular data
+* [rows](https://github.com/turicas/rows) - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)

-* Markdown
-    * [Python-Markdown](https://github.com/waylan/Python-Markdown) - A Python implementation of John Gruber’s Markdown.
-    * [Mistune](https://github.com/lepture/mistune) - Fastest and full featured pure Python parsers of Markdown.
-    * [markdown2](https://pypi.python.org/pypi/markdown2) - A fast and complete Python implementation of Markdown
+### Structured Formats : Office

-* YAML
-    * [PyYAML](https://github.com/yaml/pyyaml) - YAML implementations for Python.
+* [python-docx](https://github.com/python-openxml/python-docx) - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
+* [xlwt](https://github.com/python-excel/xlwt) / [xlrd](https://github.com/python-excel/xlrd) - Writing and reading data and formatting information from Excel files.
+* [XlsxWriter](https://xlsxwriter.readthedocs.org/) - A Python module for creating Excel .xlsx files.
+* [xlwings](http://xlwings.org/) - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
+* [openpyxl](https://openpyxl.readthedocs.org/en/latest/) - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
+* [Marmir](https://github.com/brianray/mm) - Takes Python data structures and turns them into spreadsheets.

-* CSS
-    * [cssutils](https://pypi.python.org/pypi/cssutils/) - A CSS library for Python.
+### Structured Formats : PDF

-* ATOM/RSS
-    * [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
+* [PDFMiner](https://github.com/euske/pdfminer) - A tool for extracting information from PDF documents.
+* [PyPDF2](https://github.com/mstamy2/PyPDF2) - A library capable of splitting, merging and transforming PDF pages.
+* [ReportLab](http://www.reportlab.com/opensource/) - Allowing Rapid creation of rich PDF documents.
+* [pdftables](https://pypi.python.org/pypi/pdftables) - Extract tables from PDF files directly
+
+### Structured Formats : Markdown
+
+* [Python-Markdown](https://github.com/waylan/Python-Markdown) - A Python implementation of John Gruber’s Markdown.
+* [Mistune](https://github.com/lepture/mistune) - Fastest and full featured pure Python parsers of Markdown.
+* [markdown2](https://pypi.python.org/pypi/markdown2) - A fast and complete Python implementation of Markdown
+
+### Structured Formats : YAML
+
+* [PyYAML](https://github.com/yaml/pyyaml) - YAML implementations for Python.
+
+### Structured Formats : CSS
+
+* [cssutils](https://pypi.python.org/pypi/cssutils/) - A CSS library for Python.
+
+### Structured Formats : ATOM/RSS
+
+* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
+
+### Structured Formats : SQL

-* SQL
  * [sqlparse](https://sqlparse.readthedocs.org/) - A non-validating SQL parser.

-* HTTP
+### Structured Formats : HTTP
+
  * [http-parser](https://github.com/benoitc/http-parser) - HTTP request/response parser for python in C

-* Microformats
+### Structured Formats : Microformats
+
  * [opengraph](https://github.com/erikriver/opengraph) - A Python module to parse the Open Graph Protocol tags

-*  Portable Executable
+### Structured Formats :  Portable Executable
+
  *  [pefile](https://github.com/erocarrera/pefile) - A multi-platform module to parse and work with Portable Executable (aka PE) files.

-* PSD
+### Structured Formats : PSD
+
  * [psd-tools](https://github.com/kmike/psd-tools) - reading Adobe Photoshop PSD files (as described in [specification](https://www.adobe.com/devnet-apps/photoshop/fileformatashtml/PhotoshopFileFormats.htm)) to Python data structures.

 ## Natural Language Processing

-*Libraries for working with human languages.*
+Libraries for working with human languages.

 * [NLTK](http://www.nltk.org/) - A leading platform for building Python programs to work with human language data.
 * [Pattern](http://www.clips.ua.ac.be/pattern) - A web mining module for the Python. It has tools for natural language processing, machine learning, among others.
@@ -202,29 +237,31 @@ This list contains python libraries related to web scraping and data processing
 * [PyPLN](https://github.com/NAMD/pypln.backend) - A distributed pipeline for natural language processing, made in Python. he goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
 * [langdetect](https://github.com/Mimino666/langdetect) - Port of Google's language-detection library to Python

-## Browser automation and emulation
-* Browsers
-  * [selenium](http://selenium-python.readthedocs.io/) - automating real browsers (Chrome, Firefox, Opera, IE)
-  * [Ghost.py](http://carrerasrodrigo.github.io/Ghost.py/) - wrapper of QtWebKit (requires PyQT)
-  * [Spynner](https://github.com/makinacorpus/spynner) - wrapper of QtWebKit QtWebKit (requires PyQT)
-  * [Splinter](https://github.com/cobrateam/splinter) - univeral API to browser emulators (selenium webdrivers, django client, zope)
-  * [Requestium](https://github.com/tryolabs/requestium) - Integration layer between Requests and Selenium for automation of web actions.
-  * [Splash](https://github.com/scrapinghub/splash) - Lightweight, scriptable browser as a service with an HTTP API.
-  * [pyppeteer](https://github.com/miyakogi/pyppeteer) - Headless chrome/chromium automation library (unofficial port of puppeteer)
+## Browser Automation

-* Headless tools
-  * [xvfbwrapper](https://github.com/cgoldberg/xvfbwrapper) - Python wrapper for running a display inside X virtual framebuffer (Xvfb)
+### Browser Automation : Browsers
+
+* [selenium](http://selenium-python.readthedocs.io/) - automating real browsers (Chrome, Firefox, Opera, IE)
+* [Ghost.py](http://carrerasrodrigo.github.io/Ghost.py/) - wrapper of QtWebKit (requires PyQT)
+* [Spynner](https://github.com/makinacorpus/spynner) - wrapper of QtWebKit QtWebKit (requires PyQT)
+* [Splinter](https://github.com/cobrateam/splinter) - univeral API to browser emulators (selenium webdrivers, django client, zope)
+* [Requestium](https://github.com/tryolabs/requestium) - Integration layer between Requests and Selenium for automation of web actions.
+* [Splash](https://github.com/scrapinghub/splash) - Lightweight, scriptable browser as a service with an HTTP API.
+* [pyppeteer](https://github.com/miyakogi/pyppeteer) - Headless chrome/chromium automation library (unofficial port of puppeteer)
+
+### Browser Automation : Tools
+
+* [xvfbwrapper](https://github.com/cgoldberg/xvfbwrapper) - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

 ## Multiprocessing
+
 * [threading](http://docs.python.org/3/library/threading.html) - standard python library to run threads. Effective for I/O-bound tasks. Useless for CPU-bound tasks because of python GIL.
 * [multiprocessing](http://docs.python.org/3/library/multiprocessing.html) - standard python library to run processes.
-* [celery](https://github.com/celery/celery) - An asynchronous task queue/job queue based on distributed message passing.
-* [rq](https://python-rq.org/) - Simple job queues for Python
 * [concurrent-futures](https://docs.python.org/3/library/concurrent.futures.html) - The concurrent.futures module provides a high-level interface for asynchronously executing callables.

 ## Asynchronous

-*Libraries for asynchronous networking programming.*
+Libraries for asynchronous networking programming.

 * [asyncio](https://docs.python.org/3/library/asyncio.html) - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
 * [Twisted](https://twistedmatrix.com/trac/) - An event-driven networking engine.
@@ -237,6 +274,7 @@ This list contains python libraries related to web scraping and data processing
 * [grequests](https://github.com/kennethreitz/grequests) - Make asynchronous HTTP Requests easily.

 ## Job Queue
+
 * [celery](http://www.celeryproject.org/) - An asynchronous task queue/job queue based on distributed message passing.
 * [huey](https://github.com/coleifer/huey) - Little multi-threaded task queue.
 * [mrq](https://github.com/pricingassistant/mrq) - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
@@ -245,9 +283,11 @@ This list contains python libraries related to web scraping and data processing
 * [python-gearman](https://github.com/Yelp/python-gearman) - python API for Gearman

 ## Message Queue
+
 * [kombu](https://github.com/celery/kombu) - Messaging library for Python

 ## Cloud Computing
+
 * [picloud](http://docs.picloud.com/) - executing python-code in cloud
 * [dominoup.com](http://www.dominoup.com/) - executing R, Python и matlab code in cloud
 * [minigun-requests](https://github.com/umihico/minigun-requests) - Web scraping API to outsource tons of GET & xpath to cloud computing
@@ -255,84 +295,85 @@ This list contains python libraries related to web scraping and data processing

 ## Email

-*Libraries for parsing email.*
+Libraries for parsing email.

 * [flanker](https://github.com/mailgun/flanker) - A email address and Mime parsing library.
 * [Talon](https://github.com/mailgun/talon) - Mailgun library to extract message quotations and signatures.

-## URL and Network Address Manipulation
+## URL and Network Address

-*Libraries for parsing/modifying URLs and network addresses.*
+Libraries for parsing/modifying URLs and network addresses.

-* URL
-  * [furl](https://github.com/gruns/furl) - A small Python library that makes manipulating URLs simple.
-  * [purl](https://github.com/codeinthehole/purl) - A simple, immutable URL class with a clean API for interrogation and manipulation.
-  * [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
-  * [tldextract](https://github.com/john-kurkowski/tldextract) - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
-* Network Address
-  * [netaddr](https://github.com/drkjam/netaddr) - A Python library for representing and manipulating network addresses.
-  * [micawber](https://github.com/coleifer/micawber) - A small library for extracting rich content from URLs.
+### URL and Network Address : URL

-## Web Content Extracting
+* [furl](https://github.com/gruns/furl) - A small Python library that makes manipulating URLs simple.
+* [purl](https://github.com/codeinthehole/purl) - A simple, immutable URL class with a clean API for interrogation and manipulation.
+* [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
+* [tldextract](https://github.com/john-kurkowski/tldextract) - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

-*Libraries for extracting web contents.*
+### URL and Network Address : Network Address

-* Text and metadata from HTML pages
-  * [newspaper](https://github.com/codelucas/newspaper) - News extraction, article extraction and content curation in Python.
-  * [python-goose](https://github.com/grangier/python-goose) - HTML Content/Article Extractor.
-  * [scrapely](https://github.com/scrapy/scrapely) - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
-* Metadata from HTML pages
-  * [htmldate](https://github.com/adbar/htmldate) - Find creation date using common structural patterns or text-based heuristics.
-  * [lassie](https://github.com/michaelhelmick/lassie) - Web Content Retrieval for Humans.
-* Text/Data from HTML pages
-  * [html2text](https://github.com/Alir3z4/html2text) - Convert HTML to Markdown-formatted text.
-  * [libextract](https://github.com/datalib/libextract) - Extract data from websites.
-  * [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
-  * [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
-* Images
-  * [Haul](https://github.com/vinta/Haul) - An Extensible Image Crawler.
-* Video
-  * [you-get](http://www.soimort.org/you-get/) - A YouTube/Youku/Niconico video downloader written in Python 3.
-  * [youtube-dl](http://rg3.github.io/youtube-dl/) - A small command-line program to download videos from YouTube.
-* Wiki
-  * [WikiTeam](https://github.com/WikiTeam/wikiteam) - Tools for downloading and preserving wikis.
-* Sitemap
-  * [linkchecker](https://github.com/wummel/linkchecker) - check links in web documents or full websites
-  * [python-sitemap](https://github.com/c4software/python-sitemap) - Mini website crawler to make sitemap from a website.
+* [netaddr](https://github.com/drkjam/netaddr) - A Python library for representing and manipulating network addresses.
+* [micawber](https://github.com/coleifer/micawber) - A small library for extracting rich content from URLs.
+
+## Web Content Extraction
+
+Libraries for extracting web content.
+
+* [newspaper](https://github.com/codelucas/newspaper) - News extraction, article extraction and content curation in Python.
+* [python-goose](https://github.com/grangier/python-goose) - HTML Content/Article Extractor.
+* [scrapely](https://github.com/scrapy/scrapely) - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
+* [htmldate](https://github.com/adbar/htmldate) - Find creation date using common structural patterns or text-based heuristics.
+* [lassie](https://github.com/michaelhelmick/lassie) - Web Content Retrieval for Humans.
+* [html2text](https://github.com/Alir3z4/html2text) - Convert HTML to Markdown-formatted text.
+* [libextract](https://github.com/datalib/libextract) - Extract data from websites.
+* [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
+* [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
+* [Haul](https://github.com/vinta/Haul) - An Extensible Image Crawler.
+* [you-get](http://www.soimort.org/you-get/) - A YouTube/Youku/Niconico video downloader written in Python 3.
+* [youtube-dl](http://rg3.github.io/youtube-dl/) - A small command-line program to download videos from YouTube.
+* [WikiTeam](https://github.com/WikiTeam/wikiteam) - Tools for downloading and preserving wikis.
+* [linkchecker](https://github.com/wummel/linkchecker) - Check links in web documents or full websites.
+* [python-sitemap](https://github.com/c4software/python-sitemap) - Mini website crawler to make sitemap from a website.

 ## WebSocket

-*Libraries for working with WebSocket.*
+Libraries for working with WebSocket.

 * [Crossbar](https://github.com/crossbario/crossbar/) - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
 * [AutobahnPython](https://github.com/tavendo/AutobahnPython) - WebSocket & WAMP for Python on Twisted and [asyncio](https://docs.python.org/3/library/asyncio.html).
 * [WebSocket-for-Python](https://github.com/Lawouach/WebSocket-for-Python) - WebSocket client and server library for Python 2 and 3 as well as PyPy.

 ## DNS Resolving
-  * [dnsyo](https://github.com/samarudge/dnsyo) - Check your DNS against over 1500 global DNS servers.
-  * [pycares](https://github.com/saghul/pycares) -  interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously
+
+* [dnsyo](https://github.com/samarudge/dnsyo) - Check your DNS against over 1500 global DNS servers.
+* [pycares](https://github.com/saghul/pycares) - Interface to c-ares, a C library that performs DNS requests and name resolutions asynchronously.

 ## Computer Vision
-  * [OpenCV](https://github.com/Itseez/opencv) - Open Source Computer Vision Library.
-  * [SimpleCV](https://github.com/sightmachine/SimpleCV) - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
-  * [mahotas](https://github.com/luispedro/mahotas) - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.
+
+* [OpenCV](https://github.com/Itseez/opencv) - Open Source Computer Vision Library.
+* [SimpleCV](https://github.com/sightmachine/SimpleCV) - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
+* [mahotas](https://github.com/luispedro/mahotas) - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

 ## Proxy Server
-  * [scylla](https://github.com/imWildCat/scylla) - Intelligent proxy pool for Humans
-  * [ProxyBroker](https://github.com/constverum/Proxybroker) - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
-  * [shadowsocks](https://github.com/shadowsocks/shadowsocks) - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
-  * [tproxy](https://github.com/benoitc/tproxy) - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python
+
+* [scylla](https://github.com/imWildCat/scylla) - Intelligent proxy pool for Humans
+* [ProxyBroker](https://github.com/constverum/Proxybroker) - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
+* [shadowsocks](https://github.com/shadowsocks/shadowsocks) - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
+* [tproxy](https://github.com/benoitc/tproxy) - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python
  
 ## Whois
-  * [python-whois](https://github.com/joepie91/python-whois) - A python module for retrieving and parsing WHOIS data
+
+* [python-whois](https://github.com/joepie91/python-whois) - A python module for retrieving and parsing WHOIS data

 ## Serialization
-  * [ujson](https://github.com/esnme/ultrajson) - Ultra fast JSON decoder and encoder written in C with Python bindings
+
+* [ujson](https://github.com/esnme/ultrajson) - Ultra fast JSON decoder and encoder written in C with Python bindings

 ## Other python lists

- * [awesome-python](https://github.com/vinta/awesome-python)
- * [pycrumbs](https://github.com/kirang89/pycrumbs/blob/master/pycrumbs.md)
- * [python-github-projects](https://github.com/checkcheckzz/python-github-projects)
- * [python_reference](https://github.com/rasbt/python_reference)
- * [pythonidae](https://github.com/svaksha/pythonidae)
+* [awesome-python](https://github.com/vinta/awesome-python)
+* [pycrumbs](https://github.com/kirang89/pycrumbs/blob/master/pycrumbs.md)
+* [python-github-projects](https://github.com/checkcheckzz/python-github-projects)
+* [python_reference](https://github.com/rasbt/python_reference)
+* [pythonidae](https://github.com/svaksha/pythonidae)