mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-21 17:17:03 +02:00
JavaScript Web Scraping
This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that run in Node.js, without a real web browser.
- JavaScript Web Scraping
- Network
- Web-Scraping Frameworks
- HTML/XML Parsing
- Text Processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- URL and Network Address Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Other JavaScript Lists
- Data Structure
Network
- request - Simplified HTTP request client.
- socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
- rest - RESTful HTTP client for JavaScript
- wreck - HTTP Client Utilities
- got - Simplified HTTP requests
- node-fetch - A light-weight module that brings window.fetch to Node.js
- bent - Functional HTTP client for Node.js w/ async/await
- axios - Promise based HTTP client for the browser and node.js
- superagent - Ajax for Node.js and browsers (JS HTTP client)
- urllib - Request HTTP(s) URLs in a complex world
- needle - Nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support
Web-Scraping Frameworks
- webparsy - NodeJS lib and cli for scraping websites using Puppeteer and YAML
- node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery
- node-simplecrawler - Flexible event driven crawler for node
- Crawlee - Node.js and TypeScript library that crawls with Cheerio, JSDOM, Playwright and Puppeteer while enhancing them with anti-blocking features, queue, storages and more.
- Ayakashi - The next generation web scraping framework. Features all the necessary tools to create reliable and maintainable scraping and automation systems.
- pjscrape - A web-scraping framework written in Javascript, using PhantomJS and jQuery
HTML/XML Parsing
- General
- parse5 - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
- htmlparser2 - forgiving html and xml parser
- sax-js - A sax style parser for JS
- cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
- Sanitizing
Text Processing
Libraries for parsing and manipulating plain texts.
- General
- string.js - Extra JavaScript string methods.
- accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
- validator.js - String validation and sanitization.
- Date and time
- moment - Parse, validate, manipulate, and display dates in javascript.
- moment-timezone - Timezone support for moment.js.
- date - Date() for humans.
- ms.js - Tiny millisecond conversion utility.
- HTML entities
- he - A robust HTML entity encoder/decoder written in JavaScript.
- Money
- money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
- Color
- User Agent
- UAParser.js - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
- Semantic Version
- node-semver - The semver parser for node
Specific Formats Processing
Libraries for parsing and manipulating specific text formats.
- General
- jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
- Office
- js-xlsx - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
- CSV
- JSON
- json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.
- EXIF
- exif-js - JavaScript library for reading EXIF image metadata
- CSS
- parse-css - Standards-based CSS Parser
- parser-lib CSS parser - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
- Torrent
- parse-torrent - Parse a torrent identifier (magnet uri, .torrent file, info hash)
- SQL
- SQL Parser - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
- YAML
- JS-YAML - JavaScript YAML parser and dumper. Very fast.
- Markdown
- markdown-it - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
- Atom/RSS
- node-feedparser - Robust RSS, Atom, and RDF feed parsing in Node.js
- Netscape Bookmarks(Firefox, Google Chrome, ...)
- node-bookmarks-parser - Parses Firefox/Chrome HTML bookmarks files
Natural Language Processing
Libraries for working with human languages.
- General
- natural - general natural language facilities for node
- nlp_compromise - natural language processing
- Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
- salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
- node-summary - Node module that summarizes text using a naive summarization algorithm
- Stemmer
- snowball-js - javascript implementation of the popular snowball word stemming nlp algorithm
- porter-stemmer - Martin Porter's stemmer for node.js
- Porter-Stemmer - A Javascript Implementation of the Porter Stemmer
- lunr-languages - A collection of language stemmers and stopwords for the Lunr JavaScript library
- Language detection
- franc - Natural language detection
- guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js
Browser automation and emulation
- phantomjs - Scriptable Headless WebKit.
- slimerjs - A PhantomJS-like tool running Gecko.
- casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
- zombie - Insanely fast, full-stack, headless browser testing using node.js.
- nightmare - Nightmare is a high-level browser automation library. Versions before 2.0 wrapped PhantomJS; later versions are built on Electron.
- puppeteer - Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
- headless-chrome-crawler - Distributed crawler powered by Headless Chrome
- puppeteer-recorder - Puppeteer recorder is a Chrome extension that records your browser interactions and generates a Puppeteer script.
- wendigo - Test-oriented headless browser, built on top of Puppeteer.
- Playwright - Node.js library to automate Chromium, Firefox and WebKit with a single API
Multiprocessing
- nexpect - spawn and control child processes in node.js with ease
- respawn - Spawn a process and restart it if it crashes
- node-webworker - A WebWorkers implementation for NodeJS
Asynchronous
Libraries for asynchronous network programming.
- socket.io - Realtime application framework (Node.JS server)
- engine.io - Engine.IO is the implementation of transport-based cross-browser/cross-device bi-directional communication layer for Socket.IO
- async - Async utilities for node and the browser
Queue
- kue - Kue is a priority job queue backed by redis, built for node.js
- bull - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.
Email
Libraries for parsing email.
- mailparser - Decode mime formatted e-mails
URL and Network Address Manipulation
Libraries for parsing/modifying URLs and network addresses.
- URL
- query-string - Parse and stringify URL query strings.
- URI.js - Javascript URL mutation library.
- jsurl - Lightweight URL manipulation with JavaScript.
- arg.js - Lightweight URL argument and parameter parser
- Network Address
- node-ip - IP address tools for node.js
- ip-address - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript
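For simple cases, Node.js's built-in WHATWG `URL` and `URLSearchParams` classes may be enough before reaching for one of the libraries above. The sketch below resolves a link found on a page against that page's address and normalizes it for deduplication. The function name `normalizeLink`, the example URLs, and the rule of dropping `utm_*` parameters are illustrative assumptions, not part of any listed library.

```javascript
// Minimal sketch using only Node.js built-ins (URL and URLSearchParams are globals).
// normalizeLink is a hypothetical helper name chosen for this example.
function normalizeLink(href, baseUrl) {
  // Resolve a possibly relative href against the page it was found on.
  const url = new URL(href, baseUrl);

  // Assumption: tracking parameters starting with "utm_" are irrelevant for crawling.
  // Copy the keys first so deleting does not disturb the iteration.
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_')) url.searchParams.delete(key);
  }

  // Sort the remaining parameters so equivalent URLs compare equal.
  url.searchParams.sort();

  // Drop the fragment, which is never sent to the server.
  url.hash = '';

  return url.href;
}

console.log(
  normalizeLink('../item?id=42&utm_source=feed#reviews',
                'https://example.com/catalog/list?page=2')
);
```

A crawler could use the returned string as the key in its set of visited URLs, so the same page is not fetched twice under slightly different links.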
Web Content Extracting
Libraries for extracting web contents.
- node-read - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
- node-ytdl-core - Youtube video downloader in javascript
- ImageResolver - Does its best to determine the main image on a URL without loading all images.
WebSocket
Libraries for working with WebSocket.
- websocket.io - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
- WebSocket-Node - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)
DNS Resolving
- multicast-dns - Low level multicast-dns implementation in pure javascript
- node-dns - Replacement dns module in pure javascript for node.js
Computer Vision
- tracking.js - A modern approach for Computer Vision on the web.
- ocrad.js - OCR in Javascript via Emscripten.
Proxy Server
- toxy - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions
- proxy-chain - Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining
Data Structure
- immutable - Immutable persistent data collections for Javascript which increase efficiency and simplicity.
- lodash - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects