# Java Web Scraping

This list contains Java libraries related to web scraping and data processing.

* [Java Web Scraping](#java-web-scraping)
   * [Network](#network)
   * [Web-Scraping Frameworks](#web-scraping-frameworks)
   * [HTML/XML Parsing](#htmlxml-parsing)
   * [Text Processing](#text-processing)
   * [Specific Formats Processing](#specific-formats-processing)
   * [Natural Language Processing](#natural-language-processing)
   * [Browser automation and emulation](#browser-automation-and-emulation)
   * [Multiprocessing](#multiprocessing)
   * [Queue](#queue)
   * [Email](#email)
   * [URL and Network Address Manipulation](#url-and-network-address-manipulation)
   * [Web Content Extracting](#web-content-extracting)
   * [Asynchronous](#asynchronous)
   * [WebSocket](#websocket)
   * [DNS Resolving](#dns-resolving)
   * [Computer Vision](#computer-vision)
   * [Proxy Server](#proxy-server)
   * [Other Java Lists](#other-java-lists)

## Network
* General
  * [Apache HttpClient](https://hc.apache.org/)
  * [okhttp3](http://square.github.io/okhttp/)
* Asynchronous
  * [Apache Async HttpClient](https://hc.apache.org/)
  * [AsyncHttpClient](https://github.com/AsyncHttpClient/async-http-client)

## Web-Scraping Frameworks
* Full Featured Crawlers
  * [ACHE Crawler](https://github.com/ViDA-NYU/ache)
  * [Apache Nutch](http://nutch.apache.org/)

* Other
  * [Crawler4j](https://github.com/yasserg/crawler4j)
  * [StormCrawler](https://github.com/DigitalPebble/storm-crawler)

## HTML/XML Parsing

  * [Apache Tika](https://tika.apache.org/)

## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
  * [Apache Tika](https://tika.apache.org/)

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* General
  * [Apache Tika](https://tika.apache.org/)

* Something
  * TODO
  
## Natural Language Processing

*Libraries for working with human languages.*

  * [Apache OpenNLP](https://opennlp.apache.org/)
  * [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
  * [Apache Tika](https://tika.apache.org/)

## Browser automation and emulation
  * [htmlunit](http://htmlunit.sourceforge.net/)

## Multiprocessing
  * TODO

## Asynchronous

*Libraries for asynchronous network programming.*

  * TODO

## Queue
  * TODO

## Email

*Libraries for parsing email.*

  * TODO

## URL and Network Address Manipulation

*Libraries for parsing/modifying URLs and network addresses.*

* URL
  * TODO
* Network Address
  * TODO
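
Until entries are added here, the JDK's own `java.net.URI` covers the basics. The snippet below is a minimal sketch, not a recommendation of any listed library: the class name `UrlSketch` and its two helper methods are hypothetical, chosen only for illustration. It shows resolving a relative link against the page it was found on, and reading a normalized host name.

```java
import java.net.URI;
import java.util.Locale;

public class UrlSketch {
    // Hypothetical helper: resolve a (possibly relative) href against the
    // URL of the page that contained it, removing "." and ".." segments.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    // Hypothetical helper: return the host in lower case, which is handy
    // when grouping crawl targets by site. Returns null if the URL has no host.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? null : h.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(resolve("https://example.com/a/b/page.html", "../c/item?id=1"));
        System.out.println(host("https://Example.com:8080/x"));
    }
}
```

`URI.create` throws `IllegalArgumentException` on malformed input, so a real crawler would likely catch that and skip the link.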

## Web Content Extracting

*Libraries for extracting web content.*

* Text and Metadata from HTML Pages
  * [Boilerpipe](https://github.com/kohlschutter/boilerpipe)
  * [Apache Tika](https://tika.apache.org/)


## WebSocket

*Libraries for working with WebSocket.*

  * TODO

## DNS Resolving
  * [dnsjava](http://www.dnsjava.org/)
  * [spotify-dns-java](https://github.com/spotify/dns-java)

## Computer Vision
  * TODO

## Proxy Server
  * TODO

## Other Java Lists

 * TODO