Mirror of https://github.com/lorien/awesome-web-scraping.git, synced 2024-11-24 08:32:19 +02:00

Commit cfed3bd430 (parent d963bb28e9): Added Java language

java.md (new file, 128 lines)

# Java Web Scraping

This list contains Java libraries related to web scraping and data processing.

* [Java Web Scraping](#java-web-scraping)
* [Network](#network)
* [Web-Scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text Processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser Automation and Emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other Java Lists](#other-java-lists)

## Network

* General
    * [Apache HttpClient](https://hc.apache.org/)
    * [okhttp3](http://square.github.io/okhttp/)
* Asynchronous
    * [Apache Async HttpClient](https://hc.apache.org/)
    * [AsyncHttpClient](https://github.com/AsyncHttpClient/async-http-client)

## Web-Scraping Frameworks

* Full Featured Crawlers
    * [ACHE Crawler](https://github.com/ViDA-NYU/ache)
    * [Apache Nutch](http://nutch.apache.org/)
* Other
    * [Crawler4j](https://github.com/yasserg/crawler4j)
    * [StormCrawler](https://github.com/DigitalPebble/storm-crawler)

## HTML/XML Parsing

* [Apache Tika](https://tika.apache.org/)

## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
    * [Apache Tika](https://tika.apache.org/)

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* General
    * [Apache Tika](https://tika.apache.org/)
* Something
    * TODO

## Natural Language Processing

*Libraries for working with human languages.*

* [Apache OpenNLP](https://opennlp.apache.org/)
* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* [Apache Tika](https://tika.apache.org/)

## Browser Automation and Emulation

* [HtmlUnit](http://htmlunit.sourceforge.net/)

## Multiprocessing

* TODO

## Asynchronous

*Libraries for asynchronous network programming.*

* TODO

## Queue

* TODO

## Email

*Libraries for parsing email.*

* TODO

## URL and Network Address Manipulation

*Libraries for parsing/modifying URLs and network addresses.*

* URL
    * TODO
* Network Address
    * TODO
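
The URL entries above are still placeholders. As a rough illustration of what this category covers, here is a minimal sketch that resolves a link found on a page against that page's address and drops the fragment, using only the JDK's `java.net.URI`. The class name `UrlResolveSketch`, the method `resolveLink`, and the `example.com` addresses are hypothetical and are not entries in this list.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not one of the libraries listed here.
final class UrlResolveSketch {

    /**
     * Resolves href (absolute or relative) against the URL of the page it was
     * found on, then strips any #fragment so equivalent pages compare equal.
     * Throws IllegalArgumentException if either string is not a valid URI.
     */
    static String resolveLink(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // resolve() handles relative paths; normalize() removes "." and ".." segments.
        String resolved = base.resolve(href).normalize().toString();
        // In a valid URI string, '#' can only appear as the fragment delimiter.
        int hash = resolved.indexOf('#');
        return hash >= 0 ? resolved.substring(0, hash) : resolved;
    }

    public static void main(String[] args) {
        String page = "https://example.com/a/b/index.html";
        // prints "https://example.com/a/c?page=2"
        System.out.println(resolveLink(page, "../c?page=2#top"));
    }
}
```

A crawler would typically run every extracted `href` through a step like this before deduplicating URLs. Dedicated URL libraries usually handle more edge cases (for example, base URLs with an empty path) than this sketch does.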
## Web Content Extracting

*Libraries for extracting web content.*

* Text and Metadata from HTML Pages
    * [Boilerpipe](https://github.com/kohlschutter/boilerpipe)
    * [Apache Tika](https://tika.apache.org/)

## WebSocket

*Libraries for working with WebSocket.*

* TODO

## DNS Resolving

* [dnsjava](http://www.dnsjava.org/)
* [spotify-dns-java](https://github.com/spotify/dns-java)

## Computer Vision

* TODO

## Proxy Server

* TODO

## Other Java Lists

* TODO