# Java Web Scraping

This list contains Java libraries related to web scraping and data processing.

* [Java Web Scraping](#java-web-scraping)
  * [Network](#network)
  * [Web-Scraping Frameworks](#web-scraping-frameworks)
  * [HTML/XML Parsing](#htmlxml-parsing)
  * [Text Processing](#text-processing)
  * [Specific Formats Processing](#specific-formats-processing)
  * [Natural Language Processing](#natural-language-processing)
  * [Browser automation and emulation](#browser-automation-and-emulation)
  * [Multiprocessing](#multiprocessing)
  * [Queue](#queue)
  * [Email](#email)
  * [URL and Network Address Manipulation](#url-and-network-address-manipulation)
  * [Web Content Extracting](#web-content-extracting)
  * [Asynchronous](#asynchronous)
  * [WebSocket](#websocket)
  * [DNS Resolving](#dns-resolving)
  * [Computer Vision](#computer-vision)
  * [Proxy Server](#proxy-server)
  * [Other Java Lists](#other-java-lists)

## Network

* General
  * [Apache HttpClient](https://hc.apache.org/)
  * [okhttp3](http://square.github.io/okhttp/)
* Asynchronous
  * [Apache Async HttpClient](https://hc.apache.org/)
  * [AsyncHttpClient](https://github.com/AsyncHttpClient/async-http-client)

## Web-Scraping Frameworks

* Full Featured Crawlers
  * [ACHE Crawler](https://github.com/ViDA-NYU/ache)
  * [Apache Nutch](http://nutch.apache.org/)
* Other
  * [Crawler4j](https://github.com/yasserg/crawler4j)
  * [StormCrawler](https://github.com/DigitalPebble/storm-crawler)

## HTML/XML Parsing

* [Apache Tika](https://tika.apache.org/)

## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
  * [Apache Tika](https://tika.apache.org/)

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* General
  * [Apache Tika](https://tika.apache.org/)
* Something
  * TODO

## Natural Language Processing

*Libraries for working with human languages.*

* [Apache OpenNLP](https://opennlp.apache.org/)
* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* [Apache Tika](https://tika.apache.org/)

## Browser automation and emulation

* [htmlunit](http://htmlunit.sourceforge.net/)

## Multiprocessing

* TODO

## Asynchronous

*Libraries for asynchronous network programming.*

* TODO

## Queue

* TODO

## Email

*Libraries for parsing email.*

* TODO

## URL and Network Address Manipulation

*Libraries for parsing/modifying URLs and network addresses.*

* URL
  * TODO
* Network Address
  * TODO

## Web Content Extracting

*Libraries for extracting web content.*

* Text and Meta Data from HTML pages
  * [Boilerpipe](https://github.com/kohlschutter/boilerpipe)
  * [Apache Tika](https://tika.apache.org/)

## WebSocket

*Libraries for working with WebSocket.*

* TODO

## DNS Resolving

* [dnsjava](http://www.dnsjava.org/)
* [spotify-dns-java](https://github.com/spotify/dns-java)

## Computer Vision

* TODO

## Proxy Server

* TODO

## Other Java Lists

* TODO