mirror of https://github.com/lorien/awesome-web-scraping.git synced 2025-02-15 13:33:11 +02:00
Gregory Petukhov 485b95b5a7 Create ruby.md
2015-08-16 21:16:59 +05:00

Ruby Web Scraping

This list contains Ruby libraries related to web scraping and data processing.

Network

  • httparty - Makes HTTP fun again!
  • faraday - Simple but flexible HTTP client library with support for multiple backends.
  • http - A simple Ruby DSL for making HTTP requests
  • excon - Usable, fast, simple HTTP(S) 1.1 for Ruby
  • nestful - Simple Ruby HTTP/REST client with a sane API
  • EM-HTTP-Request - EventMachine-based asynchronous HTTP client

Web-Scraping Frameworks

  • TODO

HTML/XML Parsing

  • nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
  • loofah - HTML/XML manipulation and sanitization based on Nokogiri

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • TODO

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • Office
    • Yomu - Reads text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    • spreadsheet - Reads and writes spreadsheet documents.
    • roo - Read access for all spreadsheet types and read/write access for Google spreadsheets.
    • google-spreadsheet-ruby - A library for reading and writing Google Spreadsheets.
    • rubyXL - Parses, creates, and manipulates Microsoft Excel (.xlsx/.xlsm) documents
    • remote_table - Opens local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited and fixed-width files, and Google Docs.
    • sheets - Work with spreadsheets easily in a native Ruby format.
    • workbook - Models a workbook as tables of rows and cells; reads and writes Excel, ODS, CSV, and tab-separated files.
    • oxcelix - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    • wrap_excel - Wraps win32ole to make Excel operations easy to use from Ruby; see the README for details.

Natural Language Processing

Libraries for working with human languages.

  • Treat - A toolkit for natural language processing and computational linguistics in Ruby

Downloader

Libraries for downloading.

  • TODO

Browser automation and emulation

  • TODO

Multiprocessing

  • Celluloid - Actor-based concurrent object framework for Ruby
  • Parallel - Ruby parallel processing made simple and fast

Asynchronous

Libraries for asynchronous network programming.

  • EventMachine - Event-driven I/O and lightweight concurrency library

Queue

  • Resque - A Redis-backed Ruby library for creating background jobs and placing them on multiple queues.
  • Delayed::Job - Database-backed asynchronous priority queue.
  • Qu - A Ruby library for queuing and processing background jobs.
  • Sidekiq - Simple, efficient background processing for Ruby

Cloud Computing

  • TODO

Email

Libraries for parsing email.

  • mail - A Really Ruby Mail Library

URL Manipulation

Libraries for parsing URLs.

  • TODO
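
While this section has no entries yet, basic URL parsing can be sketched with Ruby's standard-library `uri` module. This is only an illustration, not a recommendation from the list: the `absolutize` helper and the `example.com` URLs are hypothetical names chosen for the example.

```ruby
require "uri"

# Hypothetical helper (not part of any listed library): resolve a link
# found on a page against the page's own URL and drop the fragment,
# which is a common normalization step before queuing a URL for scraping.
def absolutize(page_url, href)
  resolved = URI.join(page_url, href)
  resolved.fragment = nil
  resolved.to_s
end

# Split a URL into its components.
uri = URI.parse("https://example.com/catalog/page.html?sort=asc#top")
host   = uri.host                          # host name part
path   = uri.path                          # path part
params = URI.decode_www_form(uri.query)    # array of [key, value] pairs

# Turn a relative link into an absolute URL.
absolute = absolutize("https://example.com/catalog/page.html",
                      "../item/1?ref=x#reviews")
```

`URI.join` follows the usual relative-reference resolution rules, so `../` segments are collapsed against the base path.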

Web Content Extracting

Libraries for extracting web contents.

  • TODO

WebSocket

Libraries for working with WebSocket.

DNS Resolving

  • TODO

Computer Vision

  • TODO

Geolocation

  • geocoder - Complete Ruby geocoding solution
  • Geokit - Provides geocoding and distance/heading calculations.

Other Ruby lists

  • TODO