mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-21 17:17:03 +02:00
awesome-web-scraping/php.md
2024-08-23 10:07:40 +03:00

PHP Web Scraping

This list contains PHP libraries related to web scraping and data processing.

Network

  • Guzzle - A comprehensive HTTP client.
  • Buzz - Another HTTP client.
  • Requests - A simple HTTP library.
  • HTTPFul - A chainable HTTP client.
  • Goutte - A simple web scraper.
  • PHP Spider - A comprehensive web spider.

Web-Scraping Frameworks

  • Crawler - (crwlr) - Library for Rapid (Web) Crawler and Scraper Development
  • Roach - A port of the popular Scrapy package for Python. Includes adapters for Laravel and Symfony.

HTML/XML Parsing

  • HTML5 PHP - An HTML5 parser and serializer library.
  • QueryPath - A jQuery-like library for working with XML and HTML documents in PHP. It now supports HTML5 via the HTML5-PHP project.
  • DiDOM - A very fast HTML parser (because it is built on top of plain PHP).
  • PHPScraper - A highly opinionated web interface.
  • DomCrawler - (Symfony) - The DomCrawler component eases DOM navigation for HTML and XML documents.

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • ANSI to HTML5 - An ANSI to HTML5 converter library.
    • Patchwork UTF-8 - A portable library for working with UTF-8 strings.
    • Hoa String - Another UTF-8 string library.
    • Stringy - A string manipulation library with multibyte support.
    • Color Jizz - A library for manipulating and converting colours.
    • Text - A text manipulation library.
    • Flux - A regular expression building library.
  • Transliteration
    • Urlify - A PHP port of Django's URLify.js.
    • Slugify - A library to convert strings to slugs.
  • User-agent
    • CrawlerDetect - A PHP class for detecting bots/crawlers/spiders via the user agent and http_from header.
    • PHPUserAgent - A simple, streamlined PHP user-agent parser!
    • AgentZero - A fast library for extracting information from User-Agent strings.
    • Device Detector - Another library for parsing user agent strings.
    • Mobile-Detect - A lightweight PHP class for detecting mobile devices (including tablets).
    • UA Parser - A library for parsing user agent strings.
  • Units of measure
    • ByteUnits - A library to parse, format and convert byte units in binary and metric systems.
    • PHP Units of Measure - A library for converting between units of measure.
    • PHP Conversion - Another library for converting between units of measure.
  • Phone number

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • CSV
    • CSV - A CSV data manipulation library.
  • Office
    • PHPWord - A library for working with Microsoft Word documents.
    • PHPExcel - A library for working with Microsoft Excel documents.
    • PHPPowerPoint - A library for working with Microsoft PowerPoint documents.
    • ExcelAnt - A library for manipulating Microsoft Excel documents.
  • Markdown
  • BBCode
    • Decoda - A lightweight lexical string parser for BBCode styled markup.
  • JSON
    • JsonMapper - A library that maps nested JSON structures onto PHP classes.
  • vCard
    • vobject - The VObject library allows you to easily parse and manipulate iCalendar and vCard objects.
  • File Type Detection
    • Hoa Mime - Another MIME detection library.
    • Canal - A library to determine internet media types.
    • Apache MIME Types - A library that parses Apache MIME types.
  • GeoJSON
    • GeoJSON - A GeoJSON implementation.

Natural Language Processing

Libraries for working with human languages.

  • PHP NlpTools - Natural Language Processing Tools in PHP
  • nlpTools - Natural Language Processing Toolkit for PHP

Browser automation and emulation

  • php-webdriver - A PHP client for WebDriver.
  • PHP PhantomJS - Execute PhantomJS commands through PHP.
  • Mink - A universal API for multiple browser emulators (Selenium, Zombie.js, Goutte).

Multiprocessing

  • Spork - A process forking library.

Asynchronous

Libraries for asynchronous network programming.

  • React - An event driven non-blocking I/O library.
  • Rx.PHP - A reactive extension library.
  • Hoa EventSource - An event source library.
  • Evenement - An event dispatcher library.
  • Event - An event library with a focus on domain events.
  • Broadway - An event source and CQRS library.

Queue

  • Pheanstalk - A Beanstalkd client library.
  • PHP AMQP - A pure PHP AMQP library.
  • Thumper - A RabbitMQ pattern library.
  • Bernard - A multibackend abstraction library.

Cloud Computing

  • TODO

Email

Libraries for parsing email.

URL Manipulation

Libraries for parsing URLs.

  • Purl - A URL manipulation library.
  • PHP Domain Parser - A domain suffix parser library.
  • Uri (The PHP League) - A simple URL manipulation library (PSR-7 compatible).
  • Url (crwlr) - A Swiss Army knife for URLs.

Web Content Extracting

  • Text and Meta Data from Web Documents
    • Essence - A library for extracting web media.
    • Embera - An Oembed consumer library.
    • Embed - An awesome library for getting useful information from a webpage.
  • Video
    • Youtube-Downloader - A PHP script for downloading videos from YouTube; it also parses YouTube feeds into RSS enclosures for podcatchers.

WebSocket

Libraries for working with WebSocket.

DNS Resolving

  • Net_DNS2 - A native PHP DNS resolver and updater.

Computer Vision

Geocoding

Other PHP lists