mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2025-02-15 13:33:11 +02:00
Ruby Web Scraping
This list contains Ruby libraries related to web scraping and data processing.
- Ruby Web Scraping
- Network
- Web-scraping Frameworks
- HTML/XML Parsing
- Text processing
- Specific Formats Processing
- Natural Language Processing
- Downloader
- Browser automation and emulation
- Multiprocessing
- Queue
- Cloud Computing
- URL Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Geolocation
- Other Ruby Lists
Network
- httparty - Makes http fun again!
- faraday - Simple, but flexible HTTP client library, with support for multiple backends.
- http - A simple Ruby DSL for making HTTP requests
- excon - Usable, fast, simple HTTP(S) 1.1 for Ruby
- nestful - Simple Ruby HTTP/REST client with a sane API
- EM-HTTP-Request - EventMachine based asynchronous HTTP client
Web-Scraping Frameworks
- TODO
HTML/XML Parsing
- nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
- loofah - HTML/XML manipulation and sanitization based on Nokogiri
Text Processing
Libraries for parsing and manipulating plain texts.
- General
- TODO
Specific Formats Processing
Libraries for parsing and manipulating specific text formats.
- Office
- Yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
- spreadsheet - The Spreadsheet Library is designed to read and write Spreadsheet Documents.
- roo - Roo implements read access for all spreadsheet types and read/write access for Google spreadsheets.
- google-spreadsheet-ruby - This is a library to read/write Google Spreadsheet.
- rubyXL - rubyXL is a gem which allows the parsing, creation, and manipulation of Microsoft Excel (.xlsx/.xlsm) Documents
- remote_table - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
- sheets - Work with spreadsheets easily in a native ruby format.
- workbook - A workbook contains tables, which contain rows, which contain cells; reads/writes Excel, ODS, CSV and tab-separated files...
- oxcelix - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
- wrap_excel - WrapExcel wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
Natural Language Processing
Libraries for working with human languages.
- Treat - Treat is a toolkit for natural language processing and computational linguistics in Ruby
Downloader
Libraries for downloading.
- TODO
Browser automation and emulation
- TODO
Multiprocessing
- Celluloid - Actor-based concurrent object framework for Ruby
- Parallel - Ruby parallel processing made simple and fast
Asynchronous
Libraries for asynchronous networking programming.
- EventMachine - event-driven I/O and lightweight concurrency library
Queue
- Resque - A Redis-backed Ruby library for creating background jobs, placing them on multiple queues.
- Delayed::Job - Database backed asynchronous priority queue.
- Qu - A Ruby library for queuing and processing background jobs.
- Sidekiq - Simple, efficient background processing for Ruby
Cloud Computing
- TODO
Email
Libraries for parsing email.
- mail - A Really Ruby Mail Library
URL Manipulation
Libraries for parsing URLs.
- TODO
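Until a library is listed here, Ruby's standard `uri` module may be enough for common scraping chores. The sketch below resolves a relative link against the page it was found on and decodes a query string; the URL and the link are hypothetical values chosen only for illustration.

```ruby
require "uri"

# Hypothetical page URL and relative link, used only for illustration.
page_url = "https://example.com/catalog/page1.html?sort=asc"
link     = "../images/item.png"

# Parse the page URL into its components (scheme, host, path, query).
base = URI.parse(page_url)

# Resolve the relative link against the page it was found on.
absolute = URI.join(page_url, link)

# Decode the query string into a Hash of key/value pairs.
params = URI.decode_www_form(base.query).to_h

puts base.host
puts absolute
puts params.inspect
```

This covers only the standard library; a dedicated gem may handle edge cases such as non-ASCII hosts more gracefully.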
Web Content Extracting
Libraries for extracting web contents.
- TODO
WebSocket
Libraries for working with WebSocket.
- em-websocket - EventMachine based WebSocket server
DNS Resolving
- TODO
Computer Vision
- TODO
Geolocation
- geocoder - Complete Ruby geocoding solution
- Geokit - Geokit gem provides geocoding and distance/heading calculations.
Other Ruby Lists
- TODO