# Golang Web Scraping This list contains Golang libraries related to web scraping and data processing * [Golang Web Scraping](#javascript-web-scraping) * [Network](#network) * [Web-scraping Frameworks](#web-scraping-frameworks) * [HTML/XML Parsing](#htmlxml-parsing) * [Text processing](#text-processing) * [Specific Formats Processing](#specific-formats-processing) * [Natural Language Processing](#natural-language-processing) * [Browser automation and emulation](#browser-automation-and-emulation) * [Multiprocessing](#multiprocessing) * [Queue](#queue) * [Email](#email) * [URL and Network Address Manipulation](#url-and-network-address-manipulation) * [Web Content Extracting](#web-content-extracting) * [Asynchronous](#asynchronous) * [WebSocket](#websocket) * [DNS Resolving](#dns-resolving) * [Computer Vision](#computer-vision) * [Proxy Server](#proxy-server) * [Other Golang Lists](#other-Golang-lists) ## Network * General * [net](https://golang.org/pkg/net/) - built-in package manipulating networking * [net/http](https://golang.org/pkg/net/http/) - build-in package capable of HTTP programming * Asynchronous * [goroutine](https://tour.golang.org/concurrency/1) - primitive green thread in Golang ## Web-Scraping Frameworks * Full Featured Crawlers * [Pholcus](https://github.com/henrylee2cn/pholcus) - Pholcus is a distributed, high concurrency and powerful web crawler software. * [go_spider](https://github.com/hu17889/go_spider) - An flexible, modular and expansible Go concurrent Crawler(spider) framework. * [ants-go](https://github.com/wcong/ants-go) - A distributed, restful crawler engine in golang. * Full Featured Scrapers * [colly](https://github.com/gocolly/colly) - Fast and elegant scraping framework * [dataflow kit](https://github.com/slotix/dataflowkit) - Dataflow Kit - extract structured data from web sites. * Other * [ferret](https://github.com/MontFerret/ferret) - A web scraping tool with a declarative query language. ## HTML/XML Parsing * [encoding/xml](https://golang.org/pkg/encoding/xml/) - A built-in package implements a simple XML 1.0 parser. ## Text Processing *Libraries for parsing and manipulating plain texts.* * General * [regexp](https://golang.org/pkg/regexp/) - A built-in package implements regular expression search. ## Specific Formats Processing *Libraries for parsing and manipulating specific text formats.* * General * [encoding/json](https://golang.org/pkg/encoding/json/) - A built-in package implements encoding and decoding of JSON as defined in RFC 4627. * [allot](https://github.com/sbstjn/allot) - Placeholder and wildcard text parsing for CLI tools and bots * [bbConvert](https://github.com/CalebQ42/bbConvert) - Converts bbCode to HTML that allows you to add support for custom bbCode tags * [blackfriday](https://github.com/russross/blackfriday) - Markdown processor in Go * [bluemonday](https://github.com/microcosm-cc/bluemonday) - HTML Sanitizer * [editorconfig-core-go](https://github.com/editorconfig/editorconfig-core-go) - Editorconfig file parser and manipulator for Go * [enca](https://github.com/endeveit/enca) - Minimal cgo bindings for [libenca](http://cihar.com/software/enca/). * [genex](https://github.com/alixaxel/genex) - Count and expand Regular Expressions into all matching Strings * [github_flavored_markdown](https://godoc.org/github.com/shurcooL/github_flavored_markdown) - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting, clickable header anchor links. * [go-humanize](https://github.com/dustin/go-humanize) - Formatters for time, numbers, and memory size to human readable format. * [go-nmea](https://github.com/adrianmo/go-nmea) - NMEA parser library for the Go language. * [go-pkg-rss](https://github.com/jteeuwen/go-pkg-rss) - This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs. * [go-pkg-xmlx](https://github.com/jteeuwen/go-pkg-xmlx) - Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions. * [go-runewidth](https://github.com/mattn/go-runewidth) - Functions to get fixed width of the character or string. * [go-slugify](https://github.com/mozillazg/go-slugify) - Make pretty slug with multiple languages support. * [go-vcard](https://github.com/emersion/go-vcard) - Parse and format vCard * [gofeed](https://github.com/mmcdole/gofeed) - Parse RSS and Atom feeds in Go * [gographviz](https://github.com/awalterschulze/gographviz) - Parses the Graphviz DOT language. * [gommon/bytes](https://github.com/labstack/gommon/tree/master/bytes) - Format bytes to string. * [gonameparts](https://github.com/polera/gonameparts) - Parses human names into individual name parts * [GoQuery](https://github.com/PuerkitoBio/goquery) - GoQuery brings a syntax and a set of features similar to jQuery to the Go language. * [goregen](https://github.com/zach-klippenstein/goregen) - A library for generating random strings from regular expressions. * [gotext](https://github.com/leonelquinteros/gotext) - GNU gettext utilities for Go. * [guesslanguage](https://github.com/endeveit/guesslanguage) - Functions to determine the natural language of a unicode text. * [inject](https://github.com/facebookgo/inject) - Package inject provides a reflect based injector. * [mxj](https://github.com/clbanning/mxj) - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages. * [sh](https://github.com/mvdan/sh) - A shell parser and formatter * [slug](https://github.com/gosimple/slug) - URL-friendly slugify with multiple languages support. * [Slugify](https://github.com/avelino/slugify) - A Go slugify application that handles string. * [toml](https://github.com/BurntSushi/toml) - TOML configuration format (encoder/decoder with reflection). * [xpath](https://github.com/antchfx/xpath) - XPath package for Go. * [xquery](https://github.com/antchfx/xquery) - XQuery lets you extract data from HTML/XML documents using XPath expression. ## Natural Language Processing *Libraries for working with human languages.* * [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser. * [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models. * [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text. * [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem - russian morphology analyzer. * [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work. * [go-stem](https://github.com/agonopol/go-stem) - Implementation of the porter stemming algorithm. * [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text. * [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings. * [gojieba](https://github.com/yanyiwu/gojieba) - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba) which a Chinese word splitting algorithm. * [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the snowball libstemmer library including porter 2 * [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go * [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1. * [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2. * [MMSEGO](https://github.com/awsong/MMSEGO) - This is a GO implementation of [MMSEG](http://technology.chtsai.org/mmseg/) which a Chinese word splitting algorithm. * [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk Stemming Algorithm * [porter](https://github.com/a2800276/porter) - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm. * [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer. * [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. * [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE) * [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/) * [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences. * [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality [Snowball native](http://snowball.tartarus.org/). * [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for Go programming language. Includes English and German stemmers. * [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for utf-8 and raw text * [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc). * [when](https://github.com/olebedev/when) - A natural EN and RU language date/time parser with pluggable rules ## Browser automation and emulation * [chromedp](https://github.com/chromedp/chromedp) - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol ## Multiprocessing * TODO ## Asynchronous *Libraries for asynchronous networking programming.* * TODO ## Queue * [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform. * [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system. ## Email *Libraries for parsing email.* * [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails. * [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go. * [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email. * [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers * [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages * [Gomail](https://github.com/go-gomail/gomail/) - Gomail is a very simple and powerful package to send emails. * [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API * [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails * [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface * [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email * [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine ## URL and Network Address Manipulation *Libraries for parsing/modifying URLs and network addresses.* * URL * [net/url](https://golang.org/pkg/net/url/) * Network Address * TODO ## Web Content Extracting *Libraries for extracting web contents.* * Text and Meta Data from HTML pages * [x/net/html](golang.org/x/net/html) ## WebSocket *Libraries for working with WebSocket.* * [gorilla/websocket](https://github.com/gorilla/websocket) ## DNS Resolving * [net](https://golang.org/pkg/net/) - Built-in some DNS related functions. * [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go. ## Computer Vision * TODO ## Proxy Server * [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers. * [Caddy](https://github.com/mholt/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS, also can serve as a reverse proxy server. ## Other Golang lists * TODO * Something * TODO ## Natural Language Processing *Libraries for working with human languages.* * [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser. * [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models. * [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text. * [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem - russian morphology analyzer. * [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work. * [go-stem](https://github.com/agonopol/go-stem) - Implementation of the porter stemming algorithm. * [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text. * [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings. * [gojieba](https://github.com/yanyiwu/gojieba) - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba) which a Chinese word splitting algorithm. * [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the snowball libstemmer library including porter 2 * [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go * [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1. * [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2. * [MMSEGO](https://github.com/awsong/MMSEGO) - This is a GO implementation of [MMSEG](http://technology.chtsai.org/mmseg/) which a Chinese word splitting algorithm. * [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk Stemming Algorithm * [porter](https://github.com/a2800276/porter) - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm. * [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer. * [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. * [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE) * [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/) * [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences. * [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality [Snowball native](http://snowball.tartarus.org/). * [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for Go programming language. Includes English and German stemmers. * [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for utf-8 and raw text * [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc). * [when](https://github.com/olebedev/when) - A natural EN and RU language date/time parser with pluggable rules ## Browser automation and emulation * TODO ## Multiprocessing * TODO ## Asynchronous *Libraries for asynchronous networking programming.* * TODO ## Queue * [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform. * [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system. ## Email *Libraries for parsing email.* * [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails. * [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go. * [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email. * [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers * [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages * [Gomail](https://github.com/go-gomail/gomail/) - Gomail is a very simple and powerful package to send emails. * [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API * [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails * [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface * [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email * [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine ## URL and Network Address Manipulation *Libraries for parsing/modifying URLs and network addresses.* * URL * [net/url](https://golang.org/pkg/net/url/) * Network Address * TODO ## Web Content Extracting *Libraries for extracting web contents.* * Text and Meta Data from HTML pages * [x/net/html](golang.org/x/net/html) ## WebSocket *Libraries for working with WebSocket.* * [gorilla/websocket](https://github.com/gorilla/websocket) ## DNS Resolving * [net](https://golang.org/pkg/net/) - Built-in some DNS related functions. * [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go. ## Computer Vision * TODO ## Proxy Server * [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers. * [Caddy](https://github.com/mholt/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS, also can serve as a reverse proxy server. ## Other Golang lists * TODO