1
0
mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-12-10 10:40:14 +02:00
awesome-web-scraping/golang.md

305 lines
18 KiB
Markdown
Raw Normal View History

2017-05-07 07:34:47 +02:00
# Golang Web Scraping
This list contains Golang libraries related to web scraping and data processing
* [Golang Web Scraping](#javascript-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other Golang Lists](#other-Golang-lists)
## Network
* General
* [net](https://golang.org/pkg/net/) - built-in package manipulating networking
* [net/http](https://golang.org/pkg/net/http/) - build-in package capable of HTTP programming
2017-05-07 07:34:47 +02:00
* Asynchronous
* [goroutine](https://tour.golang.org/concurrency/1) - primitive green thread in Golang
2017-05-07 07:34:47 +02:00
## Web-Scraping Frameworks
* Full Featured Crawlers
* [Pholcus](https://github.com/henrylee2cn/pholcus) - Pholcus is a distributed, high concurrency and powerful web crawler software.
* [go_spider](https://github.com/hu17889/go_spider) - An flexible, modular and expansible Go concurrent Crawler(spider) framework.
* [ants-go](https://github.com/wcong/ants-go) - A distributed, restful crawler engine in golang.
2017-11-15 20:58:47 +02:00
* Full Featured Scrapers
* [colly](https://github.com/gocolly/colly) - Fast and elegant scraping framework
2018-11-01 21:17:42 +02:00
* [dataflow kit](https://github.com/slotix/dataflowkit) - Dataflow Kit - extract structured data from web sites.
2017-05-07 07:34:47 +02:00
* Other
2018-11-14 06:36:00 +02:00
* [ferret](https://github.com/MontFerret/ferret) - A web scraping tool with a declarative query language.
2017-05-07 07:34:47 +02:00
## HTML/XML Parsing
2018-09-06 15:00:07 +02:00
* [encoding/xml](https://golang.org/pkg/encoding/xml/) - A built-in package implements a simple XML 1.0 parser.
2017-05-07 07:34:47 +02:00
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* [regexp](https://golang.org/pkg/regexp/) - A built-in package implements regular expression search.
2017-05-07 07:34:47 +02:00
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* [encoding/json](https://golang.org/pkg/encoding/json/) - A built-in package implements encoding and decoding of JSON as defined in RFC 4627.
* [allot](https://github.com/sbstjn/allot) - Placeholder and wildcard text parsing for CLI tools and bots
* [bbConvert](https://github.com/CalebQ42/bbConvert) - Converts bbCode to HTML that allows you to add support for custom bbCode tags
* [blackfriday](https://github.com/russross/blackfriday) - Markdown processor in Go
* [bluemonday](https://github.com/microcosm-cc/bluemonday) - HTML Sanitizer
* [editorconfig-core-go](https://github.com/editorconfig/editorconfig-core-go) - Editorconfig file parser and manipulator for Go
* [enca](https://github.com/endeveit/enca) - Minimal cgo bindings for [libenca](http://cihar.com/software/enca/).
* [genex](https://github.com/alixaxel/genex) - Count and expand Regular Expressions into all matching Strings
* [github_flavored_markdown](https://godoc.org/github.com/shurcooL/github_flavored_markdown) - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting, clickable header anchor links.
* [go-humanize](https://github.com/dustin/go-humanize) - Formatters for time, numbers, and memory size to human readable format.
* [go-nmea](https://github.com/adrianmo/go-nmea) - NMEA parser library for the Go language.
* [go-pkg-rss](https://github.com/jteeuwen/go-pkg-rss) - This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
* [go-pkg-xmlx](https://github.com/jteeuwen/go-pkg-xmlx) - Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions.
* [go-runewidth](https://github.com/mattn/go-runewidth) - Functions to get fixed width of the character or string.
* [go-slugify](https://github.com/mozillazg/go-slugify) - Make pretty slug with multiple languages support.
* [go-vcard](https://github.com/emersion/go-vcard) - Parse and format vCard
* [gofeed](https://github.com/mmcdole/gofeed) - Parse RSS and Atom feeds in Go
* [gographviz](https://github.com/awalterschulze/gographviz) - Parses the Graphviz DOT language.
* [gommon/bytes](https://github.com/labstack/gommon/tree/master/bytes) - Format bytes to string.
* [gonameparts](https://github.com/polera/gonameparts) - Parses human names into individual name parts
* [GoQuery](https://github.com/PuerkitoBio/goquery) - GoQuery brings a syntax and a set of features similar to jQuery to the Go language.
* [goregen](https://github.com/zach-klippenstein/goregen) - A library for generating random strings from regular expressions.
* [gotext](https://github.com/leonelquinteros/gotext) - GNU gettext utilities for Go.
* [guesslanguage](https://github.com/endeveit/guesslanguage) - Functions to determine the natural language of a unicode text.
* [inject](https://github.com/facebookgo/inject) - Package inject provides a reflect based injector.
* [mxj](https://github.com/clbanning/mxj) - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
* [sh](https://github.com/mvdan/sh) - A shell parser and formatter
* [slug](https://github.com/gosimple/slug) - URL-friendly slugify with multiple languages support.
* [Slugify](https://github.com/avelino/slugify) - A Go slugify application that handles string.
* [toml](https://github.com/BurntSushi/toml) - TOML configuration format (encoder/decoder with reflection).
* [xpath](https://github.com/antchfx/xpath) - XPath package for Go.
* [xquery](https://github.com/antchfx/xquery) - XQuery lets you extract data from HTML/XML documents using XPath expression.
2018-09-06 15:00:07 +02:00
## Natural Language Processing
*Libraries for working with human languages.*
* [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser.
* [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
* [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text.
* [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem - russian morphology analyzer.
* [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
* [go-stem](https://github.com/agonopol/go-stem) - Implementation of the porter stemming algorithm.
* [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text.
* [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings.
* [gojieba](https://github.com/yanyiwu/gojieba) - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba) which a Chinese word splitting algorithm.
* [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the snowball libstemmer library including porter 2
* [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go
* [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
* [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
* [MMSEGO](https://github.com/awsong/MMSEGO) - This is a GO implementation of [MMSEG](http://technology.chtsai.org/mmseg/) which a Chinese word splitting algorithm.
* [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk Stemming Algorithm
* [porter](https://github.com/a2800276/porter) - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
* [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer.
* [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
* [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE)
* [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/)
* [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences.
* [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality [Snowball native](http://snowball.tartarus.org/).
* [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for Go programming language. Includes English and German stemmers.
* [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for utf-8 and raw text
* [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc).
* [when](https://github.com/olebedev/when) - A natural EN and RU language date/time parser with pluggable rules
## Browser automation and emulation
* [chromedp](https://github.com/chromedp/chromedp) - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol
2018-09-06 15:00:07 +02:00
## Multiprocessing
* TODO
## Asynchronous
*Libraries for asynchronous networking programming.*
* TODO
## Queue
* [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform.
* [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system.
## Email
*Libraries for parsing email.*
* [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails.
* [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go.
* [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email.
* [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers
* [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages
* [Gomail](https://github.com/go-gomail/gomail/) - Gomail is a very simple and powerful package to send emails.
* [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API
* [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails
* [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface
* [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email
* [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* [net/url](https://golang.org/pkg/net/url/)
* Network Address
* TODO
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
* [x/net/html](golang.org/x/net/html)
## WebSocket
*Libraries for working with WebSocket.*
* [gorilla/websocket](https://github.com/gorilla/websocket)
## DNS Resolving
* [net](https://golang.org/pkg/net/) - Built-in some DNS related functions.
* [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go.
## Computer Vision
* TODO
## Proxy Server
* [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers.
* [Caddy](https://github.com/mholt/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS, also can serve as a reverse proxy server.
## Other Golang lists
* TODO
2017-05-07 07:34:47 +02:00
* Something
* TODO
## Natural Language Processing
*Libraries for working with human languages.*
2018-09-06 15:00:07 +02:00
* [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser.
* [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
* [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text.
* [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem - russian morphology analyzer.
* [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
* [go-stem](https://github.com/agonopol/go-stem) - Implementation of the porter stemming algorithm.
* [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text.
* [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings.
* [gojieba](https://github.com/yanyiwu/gojieba) - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba) which a Chinese word splitting algorithm.
* [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the snowball libstemmer library including porter 2
* [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go
* [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
* [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
* [MMSEGO](https://github.com/awsong/MMSEGO) - This is a GO implementation of [MMSEG](http://technology.chtsai.org/mmseg/) which a Chinese word splitting algorithm.
* [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk Stemming Algorithm
* [porter](https://github.com/a2800276/porter) - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
* [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer.
* [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
* [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE)
* [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/)
* [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences.
* [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality [Snowball native](http://snowball.tartarus.org/).
* [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for Go programming language. Includes English and German stemmers.
* [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for utf-8 and raw text
* [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc).
* [when](https://github.com/olebedev/when) - A natural EN and RU language date/time parser with pluggable rules
2017-05-07 07:34:47 +02:00
## Browser automation and emulation
2018-09-06 15:00:07 +02:00
* TODO
2017-05-07 07:34:47 +02:00
## Multiprocessing
2018-09-06 15:00:07 +02:00
* TODO
2017-05-07 07:34:47 +02:00
## Asynchronous
*Libraries for asynchronous networking programming.*
2018-09-06 15:00:07 +02:00
* TODO
2017-05-07 07:34:47 +02:00
## Queue
2018-09-06 15:00:07 +02:00
* [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform.
* [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system.
2017-05-07 07:34:47 +02:00
## Email
*Libraries for parsing email.*
2018-09-06 15:00:07 +02:00
* [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails.
* [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go.
* [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email.
* [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers
* [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages
* [Gomail](https://github.com/go-gomail/gomail/) - Gomail is a very simple and powerful package to send emails.
* [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API
* [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails
* [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface
* [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email
* [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine
2017-05-07 07:34:47 +02:00
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* [net/url](https://golang.org/pkg/net/url/)
* Network Address
* TODO
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
2018-09-06 15:00:07 +02:00
* [x/net/html](golang.org/x/net/html)
2017-05-07 07:34:47 +02:00
## WebSocket
*Libraries for working with WebSocket.*
2018-09-06 15:00:07 +02:00
* [gorilla/websocket](https://github.com/gorilla/websocket)
2017-05-07 07:34:47 +02:00
## DNS Resolving
2018-09-06 15:00:07 +02:00
* [net](https://golang.org/pkg/net/) - Built-in some DNS related functions.
* [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go.
2017-05-07 07:34:47 +02:00
## Computer Vision
2018-09-06 15:00:07 +02:00
* TODO
2017-05-07 07:34:47 +02:00
## Proxy Server
2018-09-06 15:00:07 +02:00
* [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers.
* [Caddy](https://github.com/mholt/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS, also can serve as a reverse proxy server.
2017-05-07 07:34:47 +02:00
2018-09-06 15:00:07 +02:00
## Other Golang lists
* TODO