mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-24 08:32:19 +02:00
# Ruby Web Scraping
This list contains Ruby libraries related to web scraping and data processing.
* [Ruby Web Scraping](#ruby-web-scraping)
* [Network](#network)
* [Web-Scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text Processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Downloader](#downloader)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Cloud Computing](#cloud-computing)
* [Email](#email)
* [URL Manipulation](#url-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Geolocation](#geolocation)
* [Other Ruby Lists](#other-ruby-lists)
## Network
* [httparty](https://github.com/jnunemaker/httparty) - Makes HTTP fun again!
* [faraday](https://github.com/lostisland/faraday) - Simple but flexible HTTP client library with support for multiple backends.
* [http](https://github.com/tarcieri/http) - A simple Ruby DSL for making HTTP requests
* [excon](https://github.com/excon/excon) - Usable, fast, simple HTTP(S) 1.1 for Ruby
* [nestful](https://github.com/maccman/nestful) - Simple Ruby HTTP/REST client with a sane API
* [EM-HTTP-Request](https://github.com/igrigorik/em-http-request) - EventMachine based asynchronous HTTP client
## Web-Scraping Frameworks
* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping
## HTML/XML Parsing
* [nokogiri](https://github.com/sparklemotion/nokogiri) - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
* [loofah](https://github.com/flavorjones/loofah) - HTML/XML manipulation and sanitization based on Nokogiri
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
    * TODO
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* Office
    * [Yomu](https://github.com/Erol) - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    * [spreadsheet](https://github.com/zdavatz/spreadsheet) - The Spreadsheet Library is designed to read and write spreadsheet documents.
    * [roo](https://github.com/Empact/roo) - Roo implements read access for all spreadsheet types and read/write access for Google spreadsheets.
    * [google-spreadsheet-ruby](https://github.com/gimite/google-spreadsheet-ruby) - A library to read and write Google Spreadsheets.
    * [rubyXL](https://github.com/weshatheleopard/rubyXL) - rubyXL is a gem which allows the parsing, creation, and manipulation of Microsoft Excel (.xlsx/.xlsm) documents
    * [remote_table](https://github.com/seamusabshere/remote_table) - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
    * [sheets](https://github.com/bspaulding/Sheets) - Work with spreadsheets easily in a native Ruby format.
    * [workbook](https://github.com/murb/workbook) - Workbook contains workbooks, as in a table, contains rows, contains cells, reads/writes Excel, ODS, CSV and tab-separated files...
    * [oxcelix](https://github.com/gbiczo/oxcelix) - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    * [wrap_excel](https://github.com/tomiacannondale/wrap_excel) - WrapExcel wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
* libpcap
    * [PacketFu](https://github.com/packetfu/packetfu) - A library for reading and writing packets to an interface or to a libpcap-formatted file.
* JSON
    * [JsonCompare](https://github.com/a2design-company/json-compare) - Returns the difference between two JSON files
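
To illustrate what "the difference between two JSON files" can look like in practice, here is a minimal sketch that uses only Ruby's standard `json` library. It does not use JsonCompare and is not its API: the `json_diff` helper, the path format, and the sample documents are assumptions made for this example.

```ruby
require "json"

# Hypothetical helper (not part of JsonCompare): recursively collect the
# paths at which two parsed JSON values differ. Only nested objects are
# walked; arrays and scalars are compared as whole values. A missing key
# is reported with nil on the side where it is absent.
def json_diff(left, right, path = "")
  if left.is_a?(Hash) && right.is_a?(Hash)
    (left.keys | right.keys).flat_map do |key|
      json_diff(left[key], right[key], "#{path}/#{key}")
    end
  elsif left == right
    []
  else
    [{ path: path, old: left, new: right }]
  end
end

# Sample documents, invented for illustration.
before_doc = JSON.parse('{"title": "Widget", "price": 10, "tags": ["a"]}')
after_doc  = JSON.parse('{"title": "Widget", "price": 12, "tags": ["a"], "stock": 3}')

changes = json_diff(before_doc, after_doc)
changes.each do |change|
  puts "#{change[:path]}: #{change[:old].inspect} -> #{change[:new].inspect}"
end
```

A dedicated gem is likely to handle arrays and absent keys more carefully than this sketch does.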
## Natural Language Processing
*Libraries for working with human languages.*
* [Treat](https://github.com/louismullie/treat) - Treat is a toolkit for natural language processing and computational linguistics in Ruby
## Downloader
*Libraries for downloading.*
* TODO
## Browser automation and emulation
* TODO
## Multiprocessing
* [Celluloid](https://github.com/celluloid/celluloid) - Actor-based concurrent object framework for Ruby
* [Parallel](https://github.com/grosser/parallel) - Ruby parallel processing made simple and fast
## Asynchronous
*Libraries for asynchronous networking programming.*
* [EventMachine](https://github.com/eventmachine/eventmachine) - event-driven I/O and lightweight concurrency library
## Queue
* [Resque](https://github.com/resque/resque) - A Redis-backed Ruby library for creating background jobs and placing them on multiple queues.
* [Delayed::Job](https://github.com/tobi/delayed_job) - Database-backed asynchronous priority queue.
* [Qu](https://github.com/bkeepers/qu) - A Ruby library for queuing and processing background jobs.
* [Sidekiq](https://github.com/mperham/sidekiq) - Simple, efficient background processing for Ruby
* [Sneakers](https://github.com/jondot/sneakers) - A fast background processing framework for Ruby and RabbitMQ
## Cloud Computing
* TODO
## Email
*Libraries for parsing email.*
* [mail](https://github.com/mikel/mail) - A Really Ruby Mail Library
## URL Manipulation
*Libraries for parsing URLs.*
* TODO
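
Basic URL handling in a scraper can often be done with Ruby's standard `uri` library before reaching for a gem. The sketch below resolves a relative link against the page it was found on and reads the query string; the URLs and variable names are invented for illustration.

```ruby
require "uri"

# Hypothetical page URL and relative link, used only for illustration.
page_url = URI.parse("https://example.com/catalog/page.html?sort=asc#top")
relative_link = "../images/logo.png?size=2"

# Resolve the relative link against the page it appeared on.
absolute = URI.join(page_url.to_s, relative_link)

# Decode the page's query string into a Hash of parameters.
params = URI.decode_www_form(page_url.query).to_h

puts absolute
puts absolute.host
puts params.inspect
```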
## Web Content Extracting
*Libraries for extracting web contents.*
* [Metainspector](https://github.com/jaimeiniesta/metainspector) - Scrapes a given URL and returns its title, meta description, meta keywords, an array of all its links, all its images, etc.
## WebSocket
*Libraries for working with WebSocket.*
* [em-websocket](https://github.com/igrigorik/em-websocket) - EventMachine based WebSocket server
## DNS Resolving
* TODO
## Computer Vision
* TODO
## Geolocation
* [geocoder](https://github.com/alexreisner/geocoder) - Complete Ruby geocoding solution
* [Geokit](https://github.com/geokit/geokit) - Geokit gem provides geocoding and distance/heading calculations.
## Other Ruby Lists
* TODO