mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-24 08:32:19 +02:00
# Ruby Web Scraping

This list contains Ruby libraries related to web scraping and data processing.

* [Ruby Web Scraping](#ruby-web-scraping)
    * [Network](#network)
    * [Web-Scraping Frameworks](#web-scraping-frameworks)
    * [HTML/XML Parsing](#htmlxml-parsing)
    * [Text Processing](#text-processing)
    * [Specific Formats Processing](#specific-formats-processing)
    * [Natural Language Processing](#natural-language-processing)
    * [Downloader](#downloader)
    * [Browser automation and emulation](#browser-automation-and-emulation)
    * [Multiprocessing](#multiprocessing)
    * [Queue](#queue)
    * [Cloud Computing](#cloud-computing)
    * [Email](#email)
    * [URL Manipulation](#url-manipulation)
    * [Web Content Extracting](#web-content-extracting)
    * [Asynchronous](#asynchronous)
    * [WebSocket](#websocket)
    * [DNS Resolving](#dns-resolving)
    * [Computer Vision](#computer-vision)
    * [Geolocation](#geolocation)
    * [Other Ruby Lists](#other-ruby-lists)

## Network

* [httparty](https://github.com/jnunemaker/httparty) - Makes http fun again!
* [faraday](https://github.com/lostisland/faraday) - Simple, but flexible HTTP client library, with support for multiple backends.
* [http](https://github.com/tarcieri/http) - A simple Ruby DSL for making HTTP requests
* [excon](https://github.com/excon/excon) - Usable, fast, simple HTTP(S) 1.1 for Ruby
* [nestful](https://github.com/maccman/nestful) - Simple Ruby HTTP/REST client with a sane API
* [EM-HTTP-Request](https://github.com/igrigorik/em-http-request) - EventMachine based asynchronous HTTP client

## Web-Scraping Frameworks

* TODO

## HTML/XML Parsing

* [nokogiri](https://github.com/sparklemotion/nokogiri) - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
* [loofah](https://github.com/flavorjones/loofah) - HTML/XML manipulation and sanitization based on Nokogiri

## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
    * TODO

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* Office
    * [Yomu](https://github.com/Erol) - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    * [spreadsheet](https://github.com/zdavatz/spreadsheet) - Reads and writes spreadsheet documents.
    * [roo](https://github.com/Empact/roo) - Read access for all spreadsheet types and read/write access for Google spreadsheets.
    * [google-spreadsheet-ruby](https://github.com/gimite/google-spreadsheet-ruby) - A library to read and write Google Spreadsheets.
    * [rubyXL](https://github.com/weshatheleopard/rubyXL) - A gem for parsing, creating, and manipulating Microsoft Excel (.xlsx/.xlsm) documents
    * [remote_table](https://github.com/seamusabshere/remote_table) - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
    * [sheets](https://github.com/bspaulding/Sheets) - Work with spreadsheets easily in a native Ruby format.
    * [workbook](https://github.com/murb/workbook) - Models a workbook as tables that contain rows that contain cells; reads and writes Excel, ODS, CSV, and tab-separated files.
    * [oxcelix](https://github.com/gbiczo/oxcelix) - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    * [wrap_excel](https://github.com/tomiacannondale/wrap_excel) - Wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.

## Natural Language Processing

*Libraries for working with human languages.*

* [Treat](https://github.com/louismullie/treat) - Treat is a toolkit for natural language processing and computational linguistics in Ruby

## Downloader

*Libraries for downloading.*

* TODO

## Browser automation and emulation

* TODO

## Multiprocessing

* [Celluloid](https://github.com/celluloid/celluloid) - Actor-based concurrent object framework for Ruby
* [Parallel](https://github.com/grosser/parallel) - Ruby parallel processing made simple and fast

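
For a rough idea of the pattern these gems build on, here is a minimal worker-pool sketch that uses only Ruby's built-in `Thread` and `Queue`, not Celluloid or Parallel. The `fetch_length` helper and the example URLs are hypothetical stand-ins for real per-URL work such as downloading a page.

```ruby
# Hypothetical stand-in for per-URL work (e.g. an HTTP request).
def fetch_length(url)
  url.length
end

# Placeholder URLs, chosen only for illustration.
urls = %w[https://example.com/a https://example.com/bb https://example.com/ccc]

# Fill a thread-safe job queue, then close it so that workers
# receive nil from pop once every job has been taken.
jobs = Queue.new
urls.each { |url| jobs << url }
jobs.close

# Two worker threads pull URLs until the queue is drained.
results = Queue.new
workers = Array.new(2) do
  Thread.new do
    while (url = jobs.pop)
      results << [url, fetch_length(url)]
    end
  end
end
workers.each(&:join)

# Collect the [url, length] pairs into a hash.
lengths = Array.new(results.size) { results.pop }.to_h
lengths.each { |url, length| puts "#{url}: #{length}" }
```

On CRuby, threads mainly help with I/O-bound work such as network requests; CPU-bound work generally needs separate processes.
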
## Asynchronous

*Libraries for asynchronous network programming.*

* [EventMachine](https://github.com/eventmachine/eventmachine) - Event-driven I/O and lightweight concurrency library

## Queue

* [Resque](https://github.com/resque/resque) - A Redis-backed Ruby library for creating background jobs, placing them on multiple queues.
* [Delayed::Job](https://github.com/tobi/delayed_job) - Database backed asynchronous priority queue.
* [Qu](https://github.com/bkeepers/qu) - A Ruby library for queuing and processing background jobs.
* [Sidekiq](https://github.com/mperham/sidekiq) - Simple, efficient background processing for Ruby

## Cloud Computing

* TODO

## Email

*Libraries for parsing email.*

* [mail](https://github.com/mikel/mail) - A Really Ruby Mail Library

## URL Manipulation

*Libraries for parsing URLs.*

* TODO

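
Ruby's standard library ships `uri`, which covers the URL parsing and joining tasks that come up most often in scraping. A minimal sketch, using placeholder URLs chosen only for illustration:

```ruby
require "uri"

# Hypothetical page URL and a relative link found on that page.
base = URI.parse("https://example.com/catalog/page.html?sort=price&page=2")
href = "../images/item.png"

# Split the URL into its components.
puts base.scheme
puts base.host
puts base.path

# Decode the query string into a hash of string keys and values.
params = URI.decode_www_form(base.query).to_h
puts params.inspect

# Resolve the relative link against the page URL.
absolute = URI.join(base.to_s, href)
puts absolute
```
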
## Web Content Extracting

*Libraries for extracting web content.*

* TODO

## WebSocket

*Libraries for working with WebSocket.*

* [em-websocket](https://github.com/igrigorik/em-websocket) - EventMachine based WebSocket server

## DNS Resolving

* TODO

## Computer Vision

* TODO

## Geolocation

* [geocoder](https://github.com/alexreisner/geocoder) - Complete Ruby geocoding solution
* [Geokit](https://github.com/geokit/geokit) - Geokit gem provides geocoding and distance/heading calculations.

## Other Ruby Lists

* TODO