mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-24 08:32:19 +02:00
Golang Web Scraping
This list contains Golang libraries related to web scraping and data processing.
- Golang Web Scraping
- Network
- Web-Scraping Frameworks
- HTML/XML Parsing
- Text Processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- URL and Network Address Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Other Golang Lists
Network
- General
- Asynchronous
- goroutine - A lightweight thread managed by the Go runtime.
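As a minimal sketch of how goroutines are typically used when scraping, the example below fans out one goroutine per URL and waits for all of them with `sync.WaitGroup`. The `fetchAll` helper and the `fetch` callback are hypothetical names introduced for illustration; a real scraper would put an HTTP request where the callback is.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll runs fetch for every URL in its own goroutine and returns the
// results in input order. fetch is a hypothetical stand-in for a real
// HTTP request.
func fetchAll(urls []string, fetch func(string) string) []string {
	results := make([]string, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			results[i] = fetch(u) // each goroutine writes only to its own slot
		}(i, u)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	pages := fetchAll(urls, func(u string) string { return "body of " + u })
	for _, p := range pages {
		fmt.Println(p)
	}
}
```

Writing each result to a pre-allocated slot avoids the need for a mutex; a channel would work equally well when the order of results does not matter.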
Web-Scraping Frameworks
- Full Featured Crawlers
- Full Featured Scrapers
- geziyor - A fast web scraping framework that supports JS rendering.
- colly - Fast and elegant scraping framework
- dataflow kit - Extracts structured data from websites.
- Other
- ferret - A web scraping tool with a declarative query language.
HTML/XML Parsing
- encoding/xml - A built-in package that implements a simple XML 1.0 parser.
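A short sketch of parsing with `encoding/xml`: struct tags map element names onto fields, and `xml.Unmarshal` fills the struct. The `Feed` and `Item` types, their element names, and the `parseFeed` helper are assumptions made up for this example, not part of any listed library.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Item and Feed describe a hypothetical document layout.
type Item struct {
	Title string `xml:"title"`
	Link  string `xml:"link"`
}

type Feed struct {
	XMLName xml.Name `xml:"feed"`
	Items   []Item   `xml:"item"`
}

// parseFeed decodes an XML document whose root element is <feed>.
func parseFeed(data []byte) (Feed, error) {
	var f Feed
	err := xml.Unmarshal(data, &f)
	return f, err
}

func main() {
	doc := []byte(`<feed><item><title>First</title><link>https://example.com/1</link></item></feed>`)
	f, err := parseFeed(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range f.Items {
		fmt.Println(it.Title, it.Link)
	}
}
```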
Text Processing
Libraries for parsing and manipulating plain texts.
- General
- regexp - A built-in package that implements regular expression search.
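The sketch below shows one way `regexp` might be used in a scraper: pulling `href` values out of a text fragment with a capture group. The pattern and the `extractLinks` helper are illustrative assumptions; a regular expression is brittle against real-world HTML, so a proper parser is usually the safer choice for anything beyond quick extraction.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRe matches double-quoted href attributes and captures the value.
var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

// extractLinks returns every captured href value in order of appearance.
func extractLinks(text string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(text, -1) {
		links = append(links, m[1]) // m[0] is the full match, m[1] the group
	}
	return links
}

func main() {
	page := `<a href="/about">About</a> <a href="https://example.com/">Home</a>`
	for _, l := range extractLinks(page) {
		fmt.Println(l)
	}
}
```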
Specific Formats Processing
Libraries for parsing and manipulating specific text formats.
- General
- encoding/json - A built-in package that implements encoding and decoding of JSON as defined in RFC 7159.
- allot - Placeholder and wildcard text parsing for CLI tools and bots
- bbConvert - Converts bbCode to HTML that allows you to add support for custom bbCode tags
- blackfriday - Markdown processor in Go
- bluemonday - HTML Sanitizer
- editorconfig-core-go - Editorconfig file parser and manipulator for Go
- enca - Minimal cgo bindings for libenca.
- genex - Count and expand Regular Expressions into all matching Strings
- github_flavored_markdown - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting and clickable header anchor links.
- go-humanize - Formatters for time, numbers, and memory size to human readable format.
- go-nmea - NMEA parser library for the Go language.
- go-pkg-rss - This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
- go-pkg-xmlx - Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions.
- go-runewidth - Functions to get fixed width of the character or string.
- go-slugify - Make pretty slug with multiple languages support.
- go-vcard - Parse and format vCard
- gofeed - Parse RSS and Atom feeds in Go
- gographviz - Parses the Graphviz DOT language.
- gommon/bytes - Format bytes to string.
- gonameparts - Parses human names into individual name parts
- GoQuery - GoQuery brings a syntax and a set of features similar to jQuery to the Go language.
- goregen - A library for generating random strings from regular expressions.
- gotext - GNU gettext utilities for Go.
- guesslanguage - Functions to determine the natural language of a Unicode text.
- inject - Package inject provides a reflect based injector.
- mxj - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
- sh - A shell parser and formatter
- slug - URL-friendly slugify with multiple languages support.
- Slugify - A Go slugify application that handles strings.
- toml - TOML configuration format (encoder/decoder with reflection).
- xpath - XPath package for Go.
- xquery - XQuery lets you extract data from HTML/XML documents using XPath expressions.
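Of the entries above, `encoding/json` ships with Go, so it is easy to sketch without extra dependencies. The example decodes a JSON array, such as one returned by a site's API endpoint, into typed structs. The `Product` type, its field names, and the `decodeProducts` helper are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Product mirrors one object of a hypothetical JSON response.
type Product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// decodeProducts unmarshals a JSON array of product objects.
func decodeProducts(data []byte) ([]Product, error) {
	var products []Product
	if err := json.Unmarshal(data, &products); err != nil {
		return nil, err
	}
	return products, nil
}

func main() {
	raw := []byte(`[{"name":"widget","price":9.5},{"name":"gadget","price":20}]`)
	products, err := decodeProducts(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, p := range products {
		fmt.Printf("%s costs %.2f\n", p.Name, p.Price)
	}
}
```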
Natural Language Processing
Libraries for working with human languages.
- dpar - Transition-based statistical dependency parser.
- go-eco - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
- go-i18n - A package and an accompanying tool to work with localized text.
- go-mystem - CGo bindings to Yandex.Mystem, a Russian morphology analyzer.
- go-nlp - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
- go-stem - Implementation of the porter stemming algorithm.
- go-unidecode - ASCII transliterations of Unicode text.
- go2vec - Reader and utility functions for word2vec embeddings.
- gojieba - A Go implementation of jieba, a Chinese word segmentation algorithm.
- golibstemmer - Go bindings for the snowball libstemmer library including porter 2
- gounidecode - Unicode transliterator (also known as unidecode) for Go
- icu - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
- libtextcat - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
- MMSEGO - A Go implementation of MMSEG, a Chinese word segmentation algorithm.
- paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm
- porter - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
- porter2 - Really fast Porter 2 stemmer.
- prose - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
- RAKE.go - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE)
- segment - A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29
- sentences - A sentence tokenizer: converts text into a list of sentences.
- snowball - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction using the native Snowball library.
- stemmer - Stemmer packages for Go programming language. Includes English and German stemmers.
- textcat - A Go package for n-gram based text categorization, with support for utf-8 and raw text
- whatlanggo - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc).
- when - A natural-language date/time parser for English and Russian with pluggable rules
Browser automation and emulation
- chromedp - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol
Multiprocessing
- TODO
Asynchronous
Libraries for asynchronous networking programming.
- TODO
Queue
- NSQ - A realtime distributed messaging platform.
- NATS - Golang client for NATS, the cloud native messaging system.
Email
Libraries for parsing email.
- douceur - CSS inliner for your HTML emails.
- email - A robust and flexible email library for Go.
- go-dkim - A DKIM library, to sign & verify email.
- go-imap - An IMAP library for clients and servers
- go-message - A streaming library for the Internet Message Format and mail messages
- Gomail - Gomail is a very simple and powerful package to send emails.
- Hectane - Lightweight SMTP client providing an HTTP API
- hermes - Golang package that generates clean, responsive HTML e-mails
- MailHog - Email and SMTP testing with web and API interface
- SendGrid - SendGrid's Go library for sending email
- smtp - SMTP server protocol state machine
URL and Network Address Manipulation
Libraries for parsing/modifying URLs and network addresses.
- URL
- Network Address
- TODO
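This section does not name a library yet. As a hedged sketch of the kind of URL manipulation a scraper needs, the example below uses the standard library's `net/url` to turn a relative link found on a page into an absolute URL. The `resolve` helper and the example URLs are assumptions introduced for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve interprets ref relative to base and returns the absolute URL.
func resolve(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	abs, err := resolve("https://example.com/a/b.html", "../c?page=2")
	if err != nil {
		fmt.Println("resolve error:", err)
		return
	}
	fmt.Println(abs)
}
```

Resolving every extracted link against the page URL before queueing it keeps a crawler from treating `/about` and `https://example.com/about` as different pages.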
Web Content Extracting
Libraries for extracting web contents.
- Text and Meta Data from HTML pages
- x/net/html
WebSocket
Libraries for working with WebSocket.
DNS Resolving
Computer Vision
- TODO
Proxy Server
- gin - Live reload utility for Go web servers.
- Caddy - Fast, cross-platform HTTP/2 web server with automatic HTTPS that can also serve as a reverse proxy server.
Other Golang Lists
- TODO