mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-12-10 10:40:14 +02:00
awesome-web-scraping/golang.md
Gregory Petukhov a765a4ee19
Merge pull request #74 from andriyor/add-golang-chromedp
add chromedp to Golang Browser automation and emulation
2019-01-28 13:49:28 +03:00

Golang Web Scraping

This list contains Golang libraries related to web scraping and data processing.

Network

  • General
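    • net - Built-in package providing a portable interface for network I/O (TCP/IP, UDP, DNS resolution, Unix domain sockets).
    • net/http - Built-in package providing HTTP client and server implementations.
  • Asynchronous
    • goroutine - Lightweight thread managed by the Go runtime.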

Web-Scraping Frameworks

  • Full Featured Crawlers
    • Pholcus - A distributed, high-concurrency, and powerful web crawler.
    • go_spider - A flexible, modular, and extensible concurrent crawler (spider) framework in Go.
    • ants-go - A distributed, RESTful crawler engine in Golang.
  • Full Featured Scrapers
    • colly - Fast and elegant scraping framework
    • Dataflow Kit - Extracts structured data from web sites.
  • Other
    • ferret - A web scraping tool with a declarative query language.

HTML/XML Parsing

  • encoding/xml - A built-in package that implements a simple XML 1.0 parser.
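
A minimal sketch of encoding/xml in a scraping context, assuming a sitemap-like document: it unmarshals the XML into a struct and collects the `<loc>` values. The type `urlSet`, the helper `parseLocs`, and the sample document are illustrative names and data, not part of any listed library.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlSet mirrors the shape of a sitemap-like document (illustrative type).
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseLocs (hypothetical helper) returns the text of every <url><loc> element.
func parseLocs(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, u.Loc)
	}
	return locs, nil
}

func main() {
	// Sample input made up for the demo.
	doc := []byte(`<urlset>
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>`)

	locs, err := parseLocs(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, l := range locs {
		fmt.Println(l)
	}
}
```

Note that encoding/xml expects well-formed XML; real-world HTML usually needs a more tolerant parser.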

Text Processing

Libraries for parsing and manipulating plain texts.

  • General
    • regexp - A built-in package that implements regular expression search.
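
For plain-text extraction, a short sketch with regexp might look like the following. The helper `extractPrices`, the pattern, and the sample text are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// priceRe matches dollar amounts such as $19.99 and captures the number.
var priceRe = regexp.MustCompile(`\$(\d+\.\d{2})`)

// extractPrices (hypothetical helper) returns every captured amount in text.
func extractPrices(text string) []string {
	matches := priceRe.FindAllStringSubmatch(text, -1)
	prices := make([]string, 0, len(matches))
	for _, m := range matches {
		prices = append(prices, m[1]) // m[0] is the full match, m[1] the first group
	}
	return prices
}

func main() {
	fmt.Println(extractPrices("Widget $19.99, gadget $5.00")) // prints [19.99 5.00]
}
```

Compiling the pattern once at package level avoids recompiling it on every call.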

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • encoding/json - A built-in package that implements encoding and decoding of JSON as defined in RFC 4627.
    • allot - Placeholder and wildcard text parsing for CLI tools and bots
    • bbConvert - Converts bbCode to HTML that allows you to add support for custom bbCode tags
    • blackfriday - Markdown processor in Go
    • bluemonday - HTML Sanitizer
    • editorconfig-core-go - Editorconfig file parser and manipulator for Go
    • enca - Minimal cgo bindings for libenca.
    • genex - Count and expand Regular Expressions into all matching Strings
    • github_flavored_markdown - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting, clickable header anchor links.
    • go-humanize - Formatters for time, numbers, and memory size to human readable format.
    • go-nmea - NMEA parser library for the Go language.
    • go-pkg-rss - This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
    • go-pkg-xmlx - Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions.
    • go-runewidth - Functions to get fixed width of the character or string.
    • go-slugify - Make pretty slug with multiple languages support.
    • go-vcard - Parse and format vCard
    • gofeed - Parse RSS and Atom feeds in Go
    • gographviz - Parses the Graphviz DOT language.
    • gommon/bytes - Format bytes to string.
    • gonameparts - Parses human names into individual name parts
    • GoQuery - GoQuery brings a syntax and a set of features similar to jQuery to the Go language.
    • goregen - A library for generating random strings from regular expressions.
    • gotext - GNU gettext utilities for Go.
    • guesslanguage - Functions to determine the natural language of a unicode text.
    • inject - Package inject provides a reflect based injector.
    • mxj - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
    • sh - A shell parser and formatter
    • slug - URL-friendly slugify with multiple languages support.
    • Slugify - A Go slugify application that handles strings.
    • toml - TOML configuration format (encoder/decoder with reflection).
    • xpath - XPath package for Go.
    • xquery - XQuery lets you extract data from HTML/XML documents using XPath expression.

Natural Language Processing

Libraries for working with human languages.

  • dpar - Transition-based statistical dependency parser.
  • go-eco - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
  • go-i18n - A package and an accompanying tool to work with localized text.
  • go-mystem - CGo bindings to Yandex.Mystem, a Russian morphology analyzer.
  • go-nlp - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
  • go-stem - Implementation of the porter stemming algorithm.
  • go-unidecode - ASCII transliterations of Unicode text.
  • go2vec - Reader and utility functions for word2vec embeddings.
  • gojieba - A Go implementation of jieba, a Chinese word splitting algorithm.
  • golibstemmer - Go bindings for the snowball libstemmer library including porter 2
  • gounidecode - Unicode transliterator (also known as unidecode) for Go
  • icu - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
  • libtextcat - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
  • MMSEGO - A Go implementation of MMSEG, a Chinese word splitting algorithm.
  • paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm
  • porter - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
  • porter2 - Really fast Porter 2 stemmer.
  • prose - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
  • RAKE.go - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE)
  • segment - A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29
  • sentences - A sentence tokenizer: converts text into a list of sentences.
  • snowball - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction using the native Snowball library.
  • stemmer - Stemmer packages for Go programming language. Includes English and German stemmers.
  • textcat - A Go package for n-gram based text categorization, with support for utf-8 and raw text
  • whatlanggo - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc).
  • when - A natural EN and RU language date/time parser with pluggable rules

Browser automation and emulation

  • chromedp - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol

Multiprocessing

  • TODO

Asynchronous

Libraries for asynchronous networking programming.

  • TODO

Queue

  • NSQ - A realtime distributed messaging platform.
  • NATS - Golang client for NATS, the cloud native messaging system.

Email

Libraries for parsing email.

  • douceur - CSS inliner for your HTML emails.
  • email - A robust and flexible email library for Go.
  • go-dkim - A DKIM library, to sign & verify email.
  • go-imap - An IMAP library for clients and servers
  • go-message - A streaming library for the Internet Message Format and mail messages
  • Gomail - Gomail is a very simple and powerful package to send emails.
  • Hectane - Lightweight SMTP client providing an HTTP API
  • hermes - Golang package that generates clean, responsive HTML e-mails
  • MailHog - Email and SMTP testing with web and API interface
  • SendGrid - SendGrid's Go library for sending email
  • smtp - SMTP server protocol state machine

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

Web Content Extracting

Libraries for extracting web contents.

WebSocket

Libraries for working with WebSocket.

DNS Resolving

  • net - Built-in package with some DNS-related functions.
  • miekg/dns - A DNS library in Go.

Computer Vision

  • TODO

Proxy Server

  • gin - Live reload utility for Go web servers.
  • Caddy - Fast, cross-platform HTTP/2 web server with automatic HTTPS; can also serve as a reverse proxy server.

Other Golang lists

  • TODO