Ruby Web Scraping
This list contains Ruby libraries related to web scraping and data processing.
- Ruby Web Scraping
- Network
- Web-Scraping Frameworks
- HTML/XML Parsing
- Text Processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Asynchronous
- Queue
- URL Manipulation
- Web Content Extracting
- WebSocket
- DNS Resolving
- Computer Vision
- Geolocation
- Other Ruby Lists
Network
- httparty - Makes http fun again!
- faraday - Simple, but flexible HTTP client library, with support for multiple backends.
- http - A simple Ruby DSL for making HTTP requests
- excon - Usable, fast, simple HTTP(S) 1.1 for Ruby
- nestful - Simple Ruby HTTP/REST client with a sane API
- EM-HTTP-Request - EventMachine based asynchronous HTTP client
- excon - Usable, fast, simple Ruby HTTP 1.1. It works great as a general HTTP(s) client and is particularly well suited to usage in API clients.
- Faraday - an HTTP client lib that provides a common interface over many adapters (such as Net::HTTP) and embraces the concept of Rack middleware when processing the request/response cycle.
- Http Client - Gives something like the functionality of libwww-perl (LWP) in Ruby.
- HTTP - The HTTP Gem: a simple Ruby DSL for making HTTP requests.
- Http-2 - Pure Ruby implementation of HTTP/2 protocol
- Patron - Patron is a Ruby HTTP client library based on libcurl.
- RESTClient - Simple HTTP and REST client for Ruby, inspired by microframework syntax for specifying actions.
- Savon - Savon is a SOAP client for the Ruby programming language.
- Sawyer - Secret user agent of HTTP, built on top of Faraday.
- Spyke - Interact with REST services in an ActiveRecord-like manner.
- Typhoeus - Typhoeus wraps libcurl in order to make fast and reliable requests.
- Mechanize - Mechanize is a ruby library that makes automated web interaction easy.
Web-Scraping Frameworks
- upton - A batteries-included framework for easy web-scraping
- Wombat - Web scraper with an elegant DSL that parses structured data from web pages.
- Anemone - web spider framework that can spider a domain and collect useful information about the pages it visits
HTML/XML Parsing
- nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
- loofah - HTML/XML manipulation and sanitization based on Nokogiri
- HappyMapper - allows you to parse XML data and convert it quickly and easily into ruby data structures.
- HTML::Pipeline - HTML processing filters and utilities.
- Oga - An XML/HTML parser written in Ruby. Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms.
- Ox - A fast XML parser and Object marshaller.
- ROXML - Custom mapping and bidirectional marshalling between Ruby and XML using annotation-style class methods, via Nokogiri or LibXML.
- equivalent-xml - Easy tests of equivalency of XML documents for Nokogiri::XML
Text Processing
Libraries for parsing and manipulating plain texts.
- General
- Kiba - library for writing reliable, concise, well-tested & maintainable data-processing code
- diffy - a convenient way to generate a diff from two strings or files
- CommonRegexRuby - find many kinds of common information in a string
- Phone number
- GlobalPhone - Parse, validate, and format phone numbers in Ruby using Google's libphonenumber database.
- Country names
- i18n_data - country/language names and 2-letter-code pairs, in 85 languages, for country/language i18n.
- normalize_country - Convert country names and codes to a standard, includes a conversion program for XMLs, CSVs and DBs.
- User agent
- Device Detector - A precise and fast user agent parser and device detector, backed by the largest and most up-to-date user agent database.
- General parser
- Date & time
- Chronic - A natural language date/time parser written in pure Ruby.
- yymmdd - Tiny DSL for idiomatic date parsing and formatting.
- Chronic Between - a simple Ruby natural language parser for date and time ranges
- Chronic Duration - a simple Ruby natural language parser for elapsed time
- Kronic - a dirt simple library for parsing and formatting human readable dates
- Nickel - extracts date, time, and message information from naturally worded text
- Tickle - a natural language parser for recurring events
- Human Names
- nameable - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching
- N-grams
- Text Similarity
- FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
- fuzzy-string-match - fuzzy string matching library for ruby
- FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
- Going the Distance - contains scripts that do various distance calculations
- hotwater - Fast Ruby FFI string edit distance algorithms
- levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
- TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
- tf-idf-similarity - calculate the similarity between texts using tf*idf
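Several of the Text Similarity entries above compute string edit distance. As a rough illustration of the idea, here is a minimal pure-Ruby sketch of plain Levenshtein distance (insertions, deletions, substitutions). The helper name `levenshtein` is hypothetical and is not the API of any gem listed here; it also does not handle transpositions the way a Damerau-Levenshtein implementation would, and the FFI-based gems will be much faster.

```ruby
# Dynamic-programming Levenshtein distance, keeping only two rows in memory.
# `levenshtein` is an illustrative helper, not part of any listed gem.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # prev[j] = distance between the processed prefix of a and the first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning i chars of a into an empty string takes i deletions
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,        # deletion
        curr[j - 1] + 1,    # insertion
        prev[j - 1] + cost  # substitution (or match)
      ].min
    end
    prev = curr
  end
  prev.last
end
```

A distance of 0 means the strings are identical; for fuzzy matching of scraped records, the distance is often normalised by the length of the longer string before applying a threshold.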
Specific Formats Processing
Libraries for parsing and manipulating specific text formats.
- General
- markup - GitHub library to convert Markdown, rst, Creole, etc. into HTML
- Office
- Yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
- spreadsheet - The Spreadsheet Library is designed to read and write Spreadsheet Documents.
- roo - Roo implements read access for all spreadsheet types and read/write access for Google spreadsheets.
- google-spreadsheet-ruby - This is a library to read/write Google Spreadsheet.
- rubyXL - rubyXL is a gem which allows the parsing, creation, and manipulation of Microsoft Excel (.xlsx/.xlsm) Documents
- remote_table - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
- sheets - Work with spreadsheets easily in a native ruby format.
- workbook - Models a workbook as tables that contain rows that contain cells; reads and writes Excel, ODS, CSV, and tab-separated files.
- oxcelix - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
- wrap_excel - WrapExcel wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
- libpcap PacketFul - A library for reading and writing packets to an interface or to a libpcap-formatted file.
- JSON
- JsonCompare - Returns the difference between two JSON files
- JSON - includes pure Ruby and C implementation for JSON.
- JSON::Stream - a streaming JSON parser that generates SAX-like events.
- YAJL - a streaming JSON parsing and encoding library for Ruby (C bindings to YAJL).
- OJ - Optimized JSON, as the name implies, was written to provide speed optimized JSON handling. So far it has achieved that, and is about 2 times faster than any other Ruby JSON parser, and 3 or more times faster at serializing JSON.
- Markdown
- ATOM/RSS
- Feed normalizer - Extensible Ruby wrapper for Atom and RSS parsers.
- Feedjira - A feed fetching and parsing library.
- Ratom - A fast, libxml based, Ruby Atom library.
- Simple rss - A simple, flexible, extensible, and liberal RSS and Atom reader.
- BSON
- BSON - Ruby implementation of the BSON Specification (2.0.0+), http://bsonspec.org
- MessagePack
- MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves. See http://msgpack.org
- Protobuf
- Protobuf - Ruby implementation for Protocol Buffers.
- RDF
- rdf - pure-Ruby library for working with Resource Description Framework (RDF) data
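For the JSON entries above, a minimal sketch of parsing and re-encoding a scraped record with the `json` library that ships with Ruby may be useful. The record contents and field names below are made up for illustration; the faster parsers listed above expose their own APIs, which are not shown here.

```ruby
require 'json'

# Hypothetical scraped record; the field names are illustrative only.
raw = '{"title": "Example", "links": ["/a", "/b"], "price": 9.5}'

record     = JSON.parse(raw)                        # hash with string keys
record_sym = JSON.parse(raw, symbolize_names: true) # hash with symbol keys

links = record["links"]
title = record_sym[:title]

# Serialize back to a JSON string, e.g. before writing results to disk.
encoded = JSON.generate(record)
```

`JSON.parse` raises `JSON::ParserError` on malformed input, so scrapers that consume untrusted responses usually wrap the call in a `rescue`.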
Natural Language Processing
Libraries for working with human languages.
- General
- Treat - Treat is a toolkit for natural language processing and computational linguistics in Ruby
- Pragmatic Segmenter - Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
- Text - A collection of text algorithms including Levenshtein distance, Metaphone, Soundex 2, Porter stemming & White similarity.
- whatlanguage - a language detection library for Ruby that uses bloom filters for speed
- nlp - NLP tools for the Polish language
- NlpToolz - Basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer, and POS tagger, plus Berkeley Parser
- Open NLP (Ruby bindings)
- Stanford Core NLP (Ruby bindings)
- ve - a linguistic framework that's easy to use
- zipf - a collection of various NLP tools and libraries
- ruby-ner - named entity recognition with Stanford NER and Ruby
- ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer
- linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
- Part-of-Speech Tagger
- engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
- rbtagger - a simple ruby rule-based part of speech tagger
- TreeTagger for Ruby - Ruby based wrapper for the TreeTagger by Helmut Schmid
- Sentence segmentation
- Stemmers
- Greek stemmer - a Greek stemmer
- Ruby-Stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby
- Turkish stemmer - a Turkish stemmer
- uea-stemmer - a conservative stemmer for search and indexing
- Summarization
- Tokenizers
- Jieba - Chinese tokenizer and segmenter (jRuby)
- MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
- NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
- rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
- thailang4r - Thai tokenizer
- tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
- tokenizer - a simple multilingual tokenizer
- Word Count
- wc - a rubygem to count word occurrences in a given text
- word_count - a word counter for String and Hash in Ruby
- Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
- WordsCounted - a highly customisable Ruby text analyser
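The Word Count entries above all rest on the same basic step: tokenize the text, then tally the tokens. A minimal sketch of that step in plain Ruby follows. The helper name `word_frequencies` and the tokenization rule (runs of letters, with internal apostrophes kept) are assumptions made for this example; different rules give different counts, which is the kind of discrepancy Word Count Analyzer is described as detecting.

```ruby
# Count word occurrences, case-insensitively.
# `word_frequencies` is an illustrative helper, not the API of any listed gem.
def word_frequencies(text)
  counts = Hash.new(0)
  # A "word" here is a run of letters, optionally joined by apostrophes (don't).
  text.downcase.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/) { |word| counts[word] += 1 }
  counts
end
```

For instance, `word_frequencies("The cat saw the other cat.")` counts `"the"` and `"cat"` twice each and the remaining words once.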
Browser automation and emulation
- selenium - A browser automation framework and ecosystem
- watir-webdriver - Watir implementation built on WebDriver's Ruby bindings
- capybara-webkit - A Capybara driver for headless WebKit to test JavaScript web apps
- poltergeist - A PhantomJS driver for Capybara
Multiprocessing
- Celluloid - Actor-based concurrent object framework for Ruby
- Parallel - Run any code in parallel Processes (> use all CPUs) or Threads (> speedup blocking operations).
- Concurrent Ruby - Modern concurrency tools including agents, futures, promises, thread pools, supervisors, and more.
- childprocess - Cross-platform ruby library for managing child processes.
- forkoff - brain-dead simple parallel processing for ruby.
- posix-spawn - Fast Process::spawn for Rubys >= 1.8.7 based on the posix_spawn() system interfaces.
- thread - extensions to the thread library (includes thread pool).
- Sprawling - spawn gem for Rails to easily fork or thread long-running code blocks.
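To make the thread-pool idea behind several of the entries above concrete, here is a minimal sketch built only on Ruby's built-in `Thread` and `Queue`. The helper name `parallel_map` and its `workers:` parameter are hypothetical and do not correspond to the API of any gem listed here. Under MRI, threads like these mainly help with blocking work such as HTTP requests rather than CPU-bound code.

```ruby
# Fixed-size thread pool that maps a block over items, preserving order.
# `parallel_map` is an illustrative helper, not part of any listed gem.
def parallel_map(items, workers: 4)
  jobs = Queue.new
  items.each_with_index { |item, i| jobs << [item, i] }
  jobs.close # pop on a closed, empty queue returns nil, which stops each worker

  results = Array.new(items.size)
  threads = Array.new(workers) do
    Thread.new do
      while (job = jobs.pop)
        item, i = job
        results[i] = yield(item) # each job writes to its own slot
      end
    end
  end
  threads.each(&:join)
  results
end
```

A typical use in a scraper would be `parallel_map(urls, workers: 8) { |url| fetch(url) }`, where `fetch` stands for whatever download routine the project already has. An exception raised inside the block surfaces when the corresponding thread is joined.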
Asynchronous
Libraries for asynchronous networking programming.
- EventMachine - event-driven I/O and lightweight concurrency library
Queue
- Resque - A Redis-backed Ruby library for creating background jobs, placing them on multiple queues.
- Delayed::Job - Database backed asynchronous priority queue.
- Qu - A Ruby library for queuing and processing background jobs.
- Sidekiq - A full-featured background processing framework for Ruby. It aims to be simple to integrate with any modern Rails application and much higher performance than other existing solutions.
- Sneakers - A fast background processing framework for Ruby and RabbitMQ
- Backburner - Backburner is a beanstalkd-powered job queue that can handle a very high volume of jobs.
- Que - A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.
- Shoryuken - A super efficient AWS SQS thread based message processor for Ruby.
- Sucker Punch - A single process background processing library using Celluloid. Aimed to be Sidekiq's little brother.
Email
Libraries for parsing email.
- mail - A Really Ruby Mail Library
URL Manipulation
Libraries for parsing URLs.
- addressable - Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to RFC 3986, RFC 3987, and RFC 6570 (level 4), providing support for IRIs and URI templates.
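A common URL task in scraping is turning the relative links found on a page into absolute URLs. The sketch below uses the `URI` module from Ruby's standard library, which is the implementation addressable is described as replacing; addressable's own API is not shown. The helper name `absolutize` and the example URLs are assumptions made for illustration.

```ruby
require 'uri'

# Resolve an href against the URL of the page it was found on,
# dropping the fragment so that "#top" variants do not count as new pages.
# `absolutize` is an illustrative helper, not part of any listed gem.
def absolutize(page_url, href)
  uri = URI.join(page_url, href)
  uri.fragment = nil
  uri.to_s
end

page = "https://example.com/catalog/page.html?id=1"
logo = absolutize(page, "../images/logo.png?size=2#top")
item = absolutize(page, "item?id=2")

# Query strings can be decoded into key/value pairs.
params = URI.decode_www_form(URI.parse(logo).query)
```

`URI.join` raises `URI::InvalidURIError` for hrefs it cannot parse, which is common with hand-written HTML, so crawlers usually rescue that error and skip the link.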
Web Content Extracting
Libraries for extracting web contents.
- Metainspector - scrapes a given URL, and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc.
- LinkThumbnailer - Ruby gem that generates thumbnail images and videos from a given URL, much like the link previews on popular social websites.
- docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
- Ruby Readability - a tool for extracting the primary readable content of a webpage
WebSocket
Libraries for working with WebSocket.
- em-websocket - EventMachine based WebSocket server
- Faye - A set of tools for simple publish-subscribe messaging between web clients.
- Firehose - Build realtime Ruby web applications.
- Slanger - Open Pusher implementation compatible with Pusher libraries.
DNS Resolving
- em-resolve-replace - EventMachine-aware pure Ruby DNS resolution
- Celluloid::DNS - a high-performance DNS client resolver and server which can be easily integrated into other projects or used as a stand-alone daemon. It was forked from RubyDNS which is now implemented in terms of this library.
Computer Vision
- ruby-opencv - An OpenCV wrapper for Ruby.
Geolocation
- geocoder - A complete geocoding solution for Ruby. With Rails it adds geocoding (by street or IP address), reverse geocoding (find street address based on given coordinates), and distance queries.
- Geokit - Geokit gem provides geocoding and distance/heading calculations.
- geoip - Searches a GeoIP database for a given host or IP address, and returns information about the country where the IP address is allocated, and the city, ISP and other information.
Other Ruby Lists
- awesome-ruby by markets
- awesome-ruby by Sdogruyol
- ruby-nlp - a collection of Natural Language Processing (NLP) Ruby libraries, tools and software