2015-08-16 15:46:49 +02:00
# PHP Web Scraping
This list contains PHP libraries related to web scraping and data processing.
* [PHP Web Scraping](#php-web-scraping)
    * [Network](#network)
    * [Web-Scraping Frameworks](#web-scraping-frameworks)
    * [HTML/XML Parsing](#htmlxml-parsing)
    * [Text Processing](#text-processing)
    * [Specific Formats Processing](#specific-formats-processing)
    * [Natural Language Processing](#natural-language-processing)
    * [Browser Automation and Emulation](#browser-automation-and-emulation)
    * [Multiprocessing](#multiprocessing)
    * [Queue](#queue)
    * [Cloud Computing](#cloud-computing)
    * [Email](#email)
    * [URL Manipulation](#url-manipulation)
    * [Web Content Extracting](#web-content-extracting)
    * [Asynchronous](#asynchronous)
    * [WebSocket](#websocket)
    * [DNS Resolving](#dns-resolving)
    * [Computer Vision](#computer-vision)
    * [Geocoding](#geocoding)
    * [API Clients](#api-clients)
* [Other PHP Lists](#other-php-lists)
## Network
* [Guzzle](https://github.com/guzzle/guzzle) - A comprehensive HTTP client.
* [Buzz](https://github.com/kriswallsmith/Buzz) - Another HTTP client.
* [Requests](https://github.com/rmccue/Requests) - A simple HTTP library.
* [HTTPFul](https://github.com/nategood/httpful) - A chainable HTTP client.
* [Goutte](https://github.com/fabpot/Goutte) - A simple web scraper.
## Web-Scraping Frameworks
* TODO
## HTML/XML Parsing
* [HTML5 PHP](https://github.com/Masterminds/html5-php) - An HTML5 parser and serializer library.
* [QueryPath](https://github.com/technosophos/querypath) - A jQuery-like library for working with XML and HTML documents in PHP. It supports HTML5 via the HTML5-PHP project.
## Text Processing
*Libraries for parsing and manipulating plain text.*
* General
    * [ANSI to HTML5](https://github.com/sensiolabs/ansi-to-html) - An ANSI to HTML5 converter library.
    * [Patchwork UTF-8](https://github.com/nicolas-grekas/Patchwork-UTF8) - A portable library for working with UTF-8 strings.
    * [Hoa String](https://github.com/hoaproject/Ustring) - Another UTF-8 string library.
    * [Stringy](https://github.com/danielstjules/Stringy) - A string manipulation library with multibyte support.
    * [Color Jizz](https://github.com/mikeemoo/ColorJizz-PHP) - A library for manipulating and converting colours.
    * [Text](https://github.com/kzykhys/Text) - A text manipulation library.
    * [Flux](https://github.com/selvinortiz/flux) - A regular expression building library.
* Transliteration
    * [Urlify](https://github.com/jbroadway/urlify) - A PHP port of Django's URLify.js.
    * [Slugify](https://github.com/cocur/slugify) - A library to convert strings to slugs.
* User-agent
    * [Device Detector](https://github.com/piwik/device-detector) - A library for parsing user agent strings.
    * [Mobile-Detect](https://github.com/serbanghita/Mobile-Detect) - A lightweight PHP class for detecting mobile devices (including tablets).
    * [UA Parser](https://github.com/tobie/ua-parser/tree/master/php) - Another library for parsing user agent strings.
* Units of measure
    * [ByteUnits](https://github.com/gabrielelana/byte-units) - A library to parse, format and convert byte units in binary and metric systems.
    * [PHP Units of Measure](https://github.com/triplepoint/php-units-of-measure) - A library for converting between units of measure.
    * [PHP Conversion](https://github.com/Crisu83/php-conversion) - Another library for converting between units of measure.
* Phone number
    * [LibPhoneNumber for PHP](https://github.com/giggsey/libphonenumber-for-php) - A PHP implementation of Google's phone number handling library.
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* CSV
    * [CSV](https://github.com/thephpleague/csv) - A CSV data manipulation library.
* Office
    * [PHPWord](https://github.com/PHPOffice/PHPWord) - A library for working with Microsoft Word documents.
    * [PHPExcel](https://github.com/PHPOffice/PHPExcel) - A library for working with Microsoft Excel documents.
    * [PHPPowerPoint](https://github.com/PHPOffice/PHPPowerPoint) - A library for working with Microsoft PowerPoint documents.
    * [ExcelAnt](https://github.com/Wisembly/ExcelAnt) - A library for manipulating Microsoft Excel documents.
* Markdown
    * [PHP Markdown](https://github.com/michelf/php-markdown) - A Markdown parser.
    * [CommonMark PHP](https://github.com/thephpleague/commonmark) - A Markdown parser which supports the full [CommonMark spec](https://jgm.github.io/stmd/spec.html).
    * [Parsedown](https://github.com/erusev/parsedown) - Another Markdown parser.
    * [Ciconia](https://github.com/kzykhys/Ciconia) - Another Markdown parser that supports GitHub Flavoured Markdown.
    * [Cebe Markdown](https://github.com/cebe/markdown) - A fast and extensible Markdown parser.
* BBCode
    * [Decoda](https://github.com/milesj/decoda) - A lightweight lexical string parser for BBCode styled markup.
* JSON
    * [JsonMapper](https://github.com/netresearch/jsonmapper) - A library that maps nested JSON structures onto PHP classes.
* vCard
    * [vobject](https://github.com/fruux/sabre-vobject) - A library for parsing and manipulating iCalendar and vCard objects.
* File Type Detection
    * [Hoa Mime](https://github.com/hoaproject/Mime) - A MIME detection library.
    * [Canal](https://github.com/dflydev/dflydev-canal) - A library to determine internet media types.
    * [Apache MIME Types](https://github.com/dflydev/dflydev-apache-mime-types) - A library that parses Apache MIME types.
* GeoJSON
    * [GeoJSON](https://github.com/jmikola/geojson) - A GeoJSON implementation.
## Natural Language Processing
*Libraries for working with human languages.*
* [PHP NlpTools](https://github.com/angeloskath/php-nlp-tools) - Natural language processing tools in PHP.
* [nlpTools](https://github.com/atrilla/nlptools) - A natural language processing toolkit for PHP.
## Browser Automation and Emulation
* [php-webdriver](https://github.com/facebook/php-webdriver) - A PHP client for WebDriver.
* [PHP PhantomJS](https://github.com/jonnnnyw/php-phantomjs) - A library for executing PhantomJS commands through PHP.
* [Mink](https://github.com/minkphp/Mink) - A universal API for multiple browser emulators (Selenium, Zombie.js, Goutte).
## Multiprocessing
* [Spork](https://github.com/kriswallsmith/spork) - A process forking library.
## Asynchronous
*Libraries for asynchronous network programming.*
* [React](https://github.com/reactphp/react) - An event driven non-blocking I/O library.
* [Rx.PHP](https://github.com/asm89/Rx.PHP) - A reactive extension library.
* [Hoa EventSource](https://github.com/hoaproject/Eventsource) - An event source library.
* [Evenement](https://github.com/igorw/evenement) - An event dispatcher library.
* [Event](https://github.com/thephpleague/event) - An event library with a focus on domain events.
* [Broadway](https://github.com/qandidate-labs/broadway) - An event source and CQRS library.
## Queue
* [Pheanstalk](https://github.com/pda/pheanstalk) - A Beanstalkd client library.
* [PHP AMQP](https://github.com/videlalvaro/php-amqplib) - A pure PHP AMQP library.
* [Thumper](https://github.com/videlalvaro/Thumper) - A RabbitMQ pattern library.
* [Bernard](https://github.com/bernardphp/bernard) - A multibackend abstraction library.
## Cloud Computing
* TODO
## Email
*Libraries for parsing email.*
* [Email Reply Parser](https://github.com/willdurand/EmailReplyParser) - An email reply parser library.
* [Email Validator](https://github.com/nojacko/email-validator) - A small email address validation library.
## URL Manipulation
*Libraries for parsing URLs.*
* [Purl](https://github.com/jwage/purl) - A URL manipulation library.
* [PHP Domain Parser](https://github.com/jeremykendall/php-domain-parser) - A domain suffix parser library.
* [Url](https://github.com/thephpleague/url) - A simple URL manipulation library.
## Web Content Extracting
* Text and Meta Data from Web Documents
    * [Essence](https://github.com/felixgirault/essence) - A library for extracting web media.
    * [Embera](https://github.com/mpratt/Embera) - An oEmbed consumer library.
* Video
    * [Youtube-Downloader](https://github.com/jeckman/YouTube-Downloader) - A PHP script for downloading videos from YouTube; it also parses YouTube feeds into RSS enclosures for podcatchers.
## WebSocket
*Libraries for working with WebSocket.*
* [Ratchet](https://github.com/cboden/Ratchet) - A web socket library.
* [Hoa WebSocket](https://github.com/hoaproject/Websocket) - Another web socket library.
* [Elephant.io](https://github.com/Wisembly/Elephant.io) - Yet another web socket library.
## DNS Resolving
* [Net_DNS2](https://github.com/mikepultz/netdns2) - A native PHP DNS resolver and updater.
## Computer Vision
* [OpenCV-for-PHP](https://github.com/mgdm/OpenCV-for-PHP) - An OpenCV binding for PHP.
## Geocoding
* [GeoCoder](http://geocoder-php.org/) - A geocoding library.
* [GeoTools](https://github.com/php-loep/Geotools) - A library of geo-related tools.
## API Clients
*Libraries for working with remote web-scraping APIs.*
* [diffbot-php-client](https://github.com/Swader/diffbot-php-client/) - A Diffbot API client.
## Other PHP Lists
* [awesome-php](https://github.com/ziadoz/awesome-php)