# Ruby Web Scraping

This list contains Ruby libraries related to web scraping and data processing.

* [Ruby Web Scraping](#ruby-web-scraping)
    * [Network](#network)
    * [Web-Scraping Frameworks](#web-scraping-frameworks)
    * [HTML/XML Parsing](#htmlxml-parsing)
    * [Text Processing](#text-processing)
    * [Specific Formats Processing](#specific-formats-processing)
    * [Natural Language Processing](#natural-language-processing)
    * [Downloader](#downloader)
    * [Browser automation and emulation](#browser-automation-and-emulation)
    * [Multiprocessing](#multiprocessing)
    * [Queue](#queue)
    * [Cloud Computing](#cloud-computing)
    * [Email](#email)
    * [URL Manipulation](#url-manipulation)
    * [Web Content Extracting](#web-content-extracting)
    * [Asynchronous](#asynchronous)
    * [WebSocket](#websocket)
    * [DNS Resolving](#dns-resolving)
    * [Computer Vision](#computer-vision)
    * [Geolocation](#geolocation)
    * [Other Ruby Lists](#other-ruby-lists)

## Network

* [httparty](https://github.com/jnunemaker/httparty) - Makes http fun again!
* [faraday](https://github.com/lostisland/faraday) - Simple but flexible HTTP client library, with support for multiple backends.
* [http](https://github.com/tarcieri/http) - A simple Ruby DSL for making HTTP requests
* [excon](https://github.com/excon/excon) - Usable, fast, simple HTTP(S) 1.1 for Ruby
* [nestful](https://github.com/maccman/nestful) - Simple Ruby HTTP/REST client with a sane API
* [EM-HTTP-Request](https://github.com/igrigorik/em-http-request) - EventMachine-based asynchronous HTTP client

## Web-Scraping Frameworks

* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping

## HTML/XML Parsing

* [nokogiri](https://github.com/sparklemotion/nokogiri) - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
* [loofah](https://github.com/flavorjones/loofah) - HTML/XML manipulation and sanitization based on Nokogiri

## Text Processing

*Libraries for parsing and manipulating plain text.*

* General
    * TODO

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* Office
    * [Yomu](https://github.com/Erol) - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    * [spreadsheet](https://github.com/zdavatz/spreadsheet) - Reads and writes spreadsheet documents.
    * [roo](https://github.com/Empact/roo) - Read access for all spreadsheet types and read/write access for Google spreadsheets.
    * [google-spreadsheet-ruby](https://github.com/gimite/google-spreadsheet-ruby) - A library to read/write Google Spreadsheets.
    * [rubyXL](https://github.com/weshatheleopard/rubyXL) - A gem for parsing, creating, and manipulating Microsoft Excel (.xlsx/.xlsm) documents
    * [remote_table](https://github.com/seamusabshere/remote_table) - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
    * [sheets](https://github.com/bspaulding/Sheets) - Work with spreadsheets easily in a native Ruby format.
    * [workbook](https://github.com/murb/workbook) - Models workbooks as tables that contain rows that contain cells; reads/writes Excel, ODS, CSV, and tab-separated files...
    * [oxcelix](https://github.com/gbiczo/oxcelix) - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    * [wrap_excel](https://github.com/tomiacannondale/wrap_excel) - Wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
* libpcap
    * [PacketFu](https://github.com/packetfu/packetfu) - A library for reading and writing packets to an interface or to a libpcap-formatted file.
* JSON
    * [JsonCompare](https://github.com/a2design-company/json-compare) - Returns the difference between two JSON files

## Natural Language Processing

*Libraries for working with human languages.*

* [Treat](https://github.com/louismullie/treat) - A toolkit for natural language processing and computational linguistics in Ruby

## Downloader

*Libraries for downloading.*

* TODO

## Browser automation and emulation

* TODO

## Multiprocessing

* [Celluloid](https://github.com/celluloid/celluloid) - Actor-based concurrent object framework for Ruby
* [Parallel](https://github.com/grosser/parallel) - Ruby parallel processing made simple and fast

## Asynchronous

*Libraries for asynchronous network programming.*

* [EventMachine](https://github.com/eventmachine/eventmachine) - Event-driven I/O and lightweight concurrency library

## Queue

* [Resque](https://github.com/resque/resque) - A Redis-backed Ruby library for creating background jobs and placing them on multiple queues.
* [Delayed::Job](https://github.com/tobi/delayed_job) - Database-backed asynchronous priority queue.
* [Qu](https://github.com/bkeepers/qu) - A Ruby library for queuing and processing background jobs.
* [Sidekiq](https://github.com/mperham/sidekiq) - Simple, efficient background processing for Ruby
* [Sneakers](https://github.com/jondot/sneakers) - A fast background processing framework for Ruby and RabbitMQ

## Cloud Computing

* TODO

## Email

*Libraries for parsing email.*

* [mail](https://github.com/mikel/mail) - A Really Ruby Mail Library

## URL Manipulation

*Libraries for parsing URLs.*

* TODO

## Web Content Extracting

*Libraries for extracting web contents.*

* [Metainspector](https://github.com/jaimeiniesta/metainspector) - Scrapes a given URL and returns its title, meta description, meta keywords, an array of all its links, all its images, etc.

## WebSocket

*Libraries for working with WebSocket.*

* [em-websocket](https://github.com/igrigorik/em-websocket) - EventMachine-based WebSocket server

## DNS Resolving

* TODO

## Computer Vision

* TODO

## Geolocation

* [geocoder](https://github.com/alexreisner/geocoder) - Complete Ruby geocoding solution
* [Geokit](https://github.com/geokit/geokit) - Provides geocoding and distance/heading calculations.

## Other Ruby Lists

* TODO