# Golang Web Scraping

This list contains Golang libraries related to web scraping and data processing.

* [Golang Web Scraping](#golang-web-scraping)
  * [Network](#network)
  * [Web-scraping Frameworks](#web-scraping-frameworks)
  * [HTML/XML Parsing](#htmlxml-parsing)
  * [Text processing](#text-processing)
  * [Specific Formats Processing](#specific-formats-processing)
  * [Natural Language Processing](#natural-language-processing)
  * [Browser automation and emulation](#browser-automation-and-emulation)
  * [Multiprocessing](#multiprocessing)
  * [Queue](#queue)
  * [Email](#email)
  * [URL and Network Address Manipulation](#url-and-network-address-manipulation)
  * [Web Content Extracting](#web-content-extracting)
  * [Asynchronous](#asynchronous)
  * [WebSocket](#websocket)
  * [DNS Resolving](#dns-resolving)
  * [Computer Vision](#computer-vision)
  * [Proxy Server](#proxy-server)
  * [Other Golang Lists](#other-golang-lists)

## Network

* General
  * [net](https://golang.org/pkg/net/) - Built-in package for network I/O.
  * [net/http](https://golang.org/pkg/net/http/) - Built-in package providing HTTP client and server implementations.
* Asynchronous
  * [goroutine](https://tour.golang.org/concurrency/1) - Lightweight thread managed by the Go runtime.

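As a rough illustration of how `net/http` and goroutines combine in a scraper, the sketch below fetches several pages concurrently. The helper names `fetch`, `fetchAll`, and `newTestServer` are hypothetical, and the local `httptest` server only stands in for a real site so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// fetch downloads one URL and returns the response body as a string.
func fetch(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchAll starts one goroutine per URL and returns bodies and errors
// in the same order as the input slice.
func fetchAll(client *http.Client, urls []string) ([]string, []error) {
	bodies := make([]string, len(urls))
	errs := make([]error, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			// Each goroutine writes only to its own index, so no mutex is needed.
			bodies[i], errs[i] = fetch(client, u)
		}(i, u)
	}
	wg.Wait()
	return bodies, errs
}

// newTestServer starts a local server that echoes the request path.
// It is a stand-in for a real website (an assumption of this sketch).
func newTestServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "page %s", r.URL.Path)
	}))
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	bodies, errs := fetchAll(client, []string{srv.URL + "/a", srv.URL + "/b"})
	for i := range bodies {
		fmt.Println(bodies[i], errs[i])
	}
}
```

For a real crawl you would likely cap the number of goroutines in flight (for example with a buffered channel used as a semaphore) rather than start one per URL.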
## Web-Scraping Frameworks

* Full Featured Crawlers
  * [Pholcus](https://github.com/henrylee2cn/pholcus) - A distributed, high-concurrency web crawler.
  * [go_spider](https://github.com/hu17889/go_spider) - A flexible, modular and extensible concurrent crawler (spider) framework in Go.
  * [ants-go](https://github.com/wcong/ants-go) - A distributed, RESTful crawler engine in Golang.
* Full Featured Scrapers
  * [colly](https://github.com/gocolly/colly) - Fast and elegant scraping framework.
  * [dataflow kit](https://github.com/slotix/dataflowkit) - Extracts structured data from websites.
* Other
  * [ferret](https://github.com/MontFerret/ferret) - A web scraping tool with a declarative query language.

## HTML/XML Parsing

* [encoding/xml](https://golang.org/pkg/encoding/xml/) - Built-in package that implements a simple XML 1.0 parser.

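A minimal sketch of parsing with `encoding/xml`, assuming an RSS-like document. The `sampleFeed` data, the `item` and `feed` types, and the `parseFeed` helper are made up for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleFeed is invented data used only for this illustration.
const sampleFeed = `<rss><channel>
  <item><title>First</title><link>https://example.com/1</link></item>
  <item><title>Second</title><link>https://example.com/2</link></item>
</channel></rss>`

// item maps one <item> element onto Go fields via struct tags.
type item struct {
	Title string `xml:"title"`
	Link  string `xml:"link"`
}

// feed collects every <item> nested under <channel>.
type feed struct {
	Items []item `xml:"channel>item"`
}

// parseFeed decodes the document and returns its items.
func parseFeed(data []byte) ([]item, error) {
	var f feed
	if err := xml.Unmarshal(data, &f); err != nil {
		return nil, err
	}
	return f.Items, nil
}

func main() {
	items, err := parseFeed([]byte(sampleFeed))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%s -> %s\n", it.Title, it.Link)
	}
}
```

Note that `encoding/xml` expects well-formed XML. Real-world HTML is usually not, so an HTML parser is the safer choice for web pages.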
## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
  * [regexp](https://golang.org/pkg/regexp/) - Built-in package that implements regular expression search.

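For quick extraction from text, `regexp` is often enough. The sketch below pulls double-quoted `href` values out of a snippet. The pattern and the `extractLinks` helper are illustrative assumptions, and a regular expression is not a robust substitute for an HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefPattern captures the value of a double-quoted href attribute.
// It is deliberately simple and will miss single-quoted or unquoted values.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// extractLinks returns every captured href value in order of appearance.
func extractLinks(html string) []string {
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1]) // m[0] is the full match, m[1] the capture group
	}
	return links
}

func main() {
	snippet := `<a href="/a">A</a> <a href="https://example.com/b">B</a>`
	for _, link := range extractLinks(snippet) {
		fmt.Println(link)
	}
}
```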
## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* General
  * [encoding/json](https://golang.org/pkg/encoding/json/) - Built-in package that implements encoding and decoding of JSON as defined in RFC 7159.
  * [allot](https://github.com/sbstjn/allot) - Placeholder and wildcard text parsing for CLI tools and bots.
  * [bbConvert](https://github.com/CalebQ42/bbConvert) - Converts bbCode to HTML and allows you to add support for custom bbCode tags.
  * [blackfriday](https://github.com/russross/blackfriday) - Markdown processor in Go.
  * [bluemonday](https://github.com/microcosm-cc/bluemonday) - HTML sanitizer.
  * [editorconfig-core-go](https://github.com/editorconfig/editorconfig-core-go) - Editorconfig file parser and manipulator for Go.
  * [enca](https://github.com/endeveit/enca) - Minimal cgo bindings for [libenca](http://cihar.com/software/enca/).
  * [genex](https://github.com/alixaxel/genex) - Count and expand regular expressions into all matching strings.
  * [github_flavored_markdown](https://godoc.org/github.com/shurcooL/github_flavored_markdown) - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting and clickable header anchor links.
  * [go-humanize](https://github.com/dustin/go-humanize) - Formatters for time, numbers, and memory size to human-readable format.
  * [go-nmea](https://github.com/adrianmo/go-nmea) - NMEA parser library for the Go language.
  * [go-pkg-rss](https://github.com/jteeuwen/go-pkg-rss) - Reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
  * [go-pkg-xmlx](https://github.com/jteeuwen/go-pkg-xmlx) - Extension to the standard Go XML package. Maintains a node tree that allows forward/backward browsing and exposes some simple single/multi-node search functions.
  * [go-runewidth](https://github.com/mattn/go-runewidth) - Functions to get the fixed width of a character or string.
  * [go-slugify](https://github.com/mozillazg/go-slugify) - Makes pretty slugs with multi-language support.
  * [go-vcard](https://github.com/emersion/go-vcard) - Parse and format vCard.
  * [gofeed](https://github.com/mmcdole/gofeed) - Parse RSS and Atom feeds in Go.
  * [gographviz](https://github.com/awalterschulze/gographviz) - Parses the Graphviz DOT language.
  * [gommon/bytes](https://github.com/labstack/gommon/tree/master/bytes) - Format bytes to string.
  * [gonameparts](https://github.com/polera/gonameparts) - Parses human names into individual name parts.
  * [GoQuery](https://github.com/PuerkitoBio/goquery) - Brings a syntax and a set of features similar to jQuery to the Go language.
  * [goregen](https://github.com/zach-klippenstein/goregen) - A library for generating random strings from regular expressions.
  * [gotext](https://github.com/leonelquinteros/gotext) - GNU gettext utilities for Go.
  * [guesslanguage](https://github.com/endeveit/guesslanguage) - Functions to determine the natural language of a Unicode text.
  * [inject](https://github.com/facebookgo/inject) - Package inject provides a reflect-based injector.
  * [mxj](https://github.com/clbanning/mxj) - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
  * [sh](https://github.com/mvdan/sh) - A shell parser and formatter.
  * [slug](https://github.com/gosimple/slug) - URL-friendly slugify with multi-language support.
  * [Slugify](https://github.com/avelino/slugify) - A Go slugify application that handles strings.
  * [toml](https://github.com/BurntSushi/toml) - TOML configuration format (encoder/decoder with reflection).
  * [xpath](https://github.com/antchfx/xpath) - XPath package for Go.
  * [xquery](https://github.com/antchfx/xquery) - XQuery lets you extract data from HTML/XML documents using XPath expressions.

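Many scraped endpoints return JSON rather than HTML. As a small sketch of decoding such a payload with `encoding/json`, assuming a hypothetical product list (the `product` type, its field names, and `decodeProducts` are invented for this example):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// product mirrors one object of the hypothetical payload.
type product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// decodeProducts turns a JSON array into a slice of product values.
func decodeProducts(data []byte) ([]product, error) {
	var products []product
	if err := json.Unmarshal(data, &products); err != nil {
		return nil, err
	}
	return products, nil
}

func main() {
	payload := []byte(`[{"name":"tea","price":3.5},{"name":"mug","price":8}]`)
	products, err := decodeProducts(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, p := range products {
		fmt.Printf("%s costs %.2f\n", p.Name, p.Price)
	}
}
```

Fields missing from the payload are left at their zero values, and unknown keys are ignored by default.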
## Natural Language Processing

*Libraries for working with human languages.*

* [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser.
* [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
* [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text.
* [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem, a Russian morphology analyzer.
* [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
* [go-stem](https://github.com/agonopol/go-stem) - Implementation of the Porter stemming algorithm.
* [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text.
* [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings.
* [gojieba](https://github.com/yanyiwu/gojieba) - A Go implementation of [jieba](https://github.com/fxsjy/jieba), which is a Chinese word-splitting algorithm.
* [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the Snowball libstemmer library, including Porter 2.
* [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go.
* [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
* [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
* [MMSEGO](https://github.com/awsong/MMSEGO) - A Go implementation of [MMSEG](http://technology.chtsai.org/mmseg/), which is a Chinese word-splitting algorithm.
* [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk stemming algorithm.
* [porter](https://github.com/a2800276/porter) - A fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
* [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer.
* [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
* [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction algorithm (RAKE).
* [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/).
* [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences.
* [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality via [Snowball native](http://snowball.tartarus.org/).
* [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for the Go programming language. Includes English and German stemmers.
* [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for UTF-8 and raw text.
* [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems, e.g. Latin, Cyrillic).
* [when](https://github.com/olebedev/when) - A natural-language date/time parser for EN and RU with pluggable rules.

## Browser automation and emulation

* [chromedp](https://github.com/chromedp/chromedp) - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol.

## Multiprocessing

* TODO

## Asynchronous

*Libraries for asynchronous networking programming.*

* TODO

## Queue

* [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform.
* [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system.

## Email

*Libraries for parsing email.*

* [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails.
* [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go.
* [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email.
* [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers.
* [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages.
* [Gomail](https://github.com/go-gomail/gomail/) - A very simple and powerful package to send emails.
* [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API.
* [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails.
* [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface.
* [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email.
* [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine.

## URL and Network Address Manipulation

*Libraries for parsing/modifying URLs and network addresses.*

* URL
  * [net/url](https://golang.org/pkg/net/url/)
* Network Address
  * TODO

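A common scraping task for `net/url` is turning relative links found on a page into absolute ones. The sketch below shows one way to do that. The `resolve` helper and the example URLs are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve interprets ref relative to base and returns the absolute URL.
func resolve(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	// ResolveReference applies RFC 3986 rules, including ".." segments.
	return b.ResolveReference(r).String(), nil
}

func main() {
	page := "https://example.com/catalog/page1.html"
	for _, href := range []string{"item2.html", "../img/a.png?x=1", "https://other.example/z"} {
		abs, err := resolve(page, href)
		if err != nil {
			fmt.Println("bad link:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```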
## Web Content Extracting

*Libraries for extracting web content.*

* Text and Meta Data from HTML pages
  * [x/net/html](https://pkg.go.dev/golang.org/x/net/html)

## WebSocket

*Libraries for working with WebSocket.*

* [gorilla/websocket](https://github.com/gorilla/websocket)

## DNS Resolving

* [net](https://golang.org/pkg/net/) - Built-in package that includes DNS lookup functions.
* [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go.

## Computer Vision

* TODO

## Proxy Server

* [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers.
* [Caddy](https://github.com/mholt/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS; can also serve as a reverse proxy server.

## Other Golang lists

* TODO

* Something
  * TODO