mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-24 08:32:19 +02:00
awesome-web-scraping/javascript.md
Ondra Urban c437605b7b
Replace Apify SDK with Crawlee, its successor
The crawling part of Apify SDK is now named Crawlee and its new version is out with a bunch of improvements.
2022-08-17 22:48:10 +02:00

JavaScript Web Scraping

This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that run in Node.js (without a real web browser).

Network

  • request - Simplified HTTP request client.
  • socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
  • rest - RESTful HTTP client for JavaScript
  • wreck - HTTP Client Utilities
  • got - Simplified HTTP requests
  • node-fetch - A light-weight module that brings window.fetch to Node.js
  • bent - Functional HTTP client for Node.js w/ async/await
  • axios - Promise based HTTP client for the browser and node.js
  • superagent - Ajax for Node.js and browsers (JS HTTP client)
  • urllib - Request HTTP(s) URLs in a complex world
  • needle - Nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support

Web-Scraping Frameworks

  • webparsy - NodeJS lib and cli for scraping websites using Puppeteer and YAML
  • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery
  • node-simplecrawler - Flexible event driven crawler for node
  • Crawlee - Node.js and TypeScript library that crawls with Cheerio, JSDOM, Playwright and Puppeteer while enhancing them with anti-blocking features, queue, storages and more.
  • Ayakashi - The next generation web scraping framework. Features all the necessary tools to create reliable and maintainable scraping and automation systems.
  • pjscrape - A web-scraping framework written in Javascript, using PhantomJS and jQuery

HTML/XML Parsing

  • General
    • parse5 - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
    • htmlparser2 - forgiving html and xml parser
    • sax-js - A sax style parser for JS
    • cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
  • Sanitizing
    • js-xss - Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.
    • surgeon - Declarative DOM extraction expression evaluator

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • string.js - Extra JavaScript string methods.
    • accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
    • validator.js - String validation and sanitization.
  • Date and time
    • moment - Parse, validate, manipulate, and display dates in javascript.
    • date - Date() for humans.
    • ms.js - Tiny millisecond conversion utility.
  • HTML entities
    • he - A robust HTML entity encoder/decoder written in JavaScript.
  • Money
    • money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
  • Color
    • chroma.js - JavaScript library for all kinds of color manipulations.
    • color - JavaScript color conversion and manipulation library.
    • TinyColor - Fast, small color manipulation and conversion for JavaScript.
  • User Agent
    • UAParser.js - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
  • Semantic Version

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
  • Office
    • js-xlsx - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
  • CSV
    • BabyParse - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
    • CSV - A simple, blazing-fast CSV parser and encoder. Full RFC 4180 compliance.
  • JSON
    • json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.
  • EXIF
    • exif-js - JavaScript library for reading EXIF image metadata
  • CSS
    • parse-css - Standards-based CSS Parser
    • parser-lib CSS parser - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
  • Torrent
    • parse-torrent - Parse a torrent identifier (magnet uri, .torrent file, info hash)
  • SQL
    • SQL Parser - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
  • YAML
    • JS-YAML - JavaScript YAML parser and dumper. Very fast.
  • Markdown
    • markdown-it - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
  • Atom/RSS
  • Netscape Bookmarks (Firefox, Google Chrome, ...)

Natural Language Processing

Libraries for working with human languages.

  • General
    • natural - general natural language facilities for node
    • nlp_compromise - natural language processing
    • Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
    • salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
    • node-summary - Node module that summarizes text using a naive summarization algorithm
  • Stemmer
    • snowball-js - javascript implementation of the popular snowball word stemming nlp algorithm
    • porter-stemmer - Martin Porter's stemmer for node.js
    • Porter-Stemmer - A Javascript Implementation of the Porter Stemmer
    • lunr-languages - a collection of language stemmers and stopwords for the Lunr JavaScript library
  • Language detection
    • franc - Natural language detection
    • guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js

Browser automation and emulation

  • phantomjs - Scriptable Headless WebKit.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - High-level browser automation library for scripting browser tasks. Originally a wrapper for PhantomJS; later versions run on Electron
  • puppeteer - Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
  • headless-chrome-crawler - Distributed crawler powered by Headless Chrome
  • puppeteer-recorder - Puppeteer recorder is a Chrome extension that records your browser interactions and generates a Puppeteer script.
  • wendigo - Test-oriented headless browser, built on top of Puppeteer.
  • Playwright - Node.js library to automate Chromium, Firefox and WebKit with a single API

Multiprocessing

  • nexpect - spawn and control child processes in node.js with ease
  • respawn - Spawn a process and restart it if it crashes
  • node-webworker - A WebWorkers implementation for NodeJS

Asynchronous

Libraries for asynchronous network programming.

  • socket.io - Realtime application framework (Node.JS server)
  • engine.io - Engine.IO is the implementation of transport-based cross-browser/cross-device bi-directional communication layer for Socket.IO
  • async - Async utilities for node and the browser

Queue

  • kue - Kue is a priority job queue backed by redis, built for node.js
  • bull - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.

Email

Libraries for parsing email.

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

  • URL
    • query-string - Parse and stringify URL query strings.
    • URI.js - Javascript URL mutation library.
    • jsurl - Lightweight URL manipulation with JavaScript.
    • arg.js - Lightweight URL argument and parameter parser
  • Network Address
    • node-ip - IP address tools for node.js
    • ip-address - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript
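
As a rough illustration of the tasks this category covers, the sketch below uses only Node.js built-ins (the WHATWG `URL` class and `net.isIP`), not any of the libraries listed above. The URLs and addresses are hypothetical values chosen for the example.

```javascript
// Sketch using only Node.js built-ins; none of the listed libraries is used.
const { isIP } = require('net');

// Hypothetical URL, chosen only for illustration.
const url = new URL('https://example.com/search?q=web+scraping&page=2#results');

// Read individual components.
const host = url.hostname;
const query = url.searchParams.get('q'); // '+' in a query string decodes to a space

// Modify the query string and drop the fragment; the URL re-serializes itself.
url.searchParams.set('page', '3');
url.hash = '';
const nextPage = url.href;

// Resolve a relative link against the page it was found on.
const absolute = new URL('../about', 'https://example.com/docs/intro').href;

// Classify address strings: 4 for IPv4, 6 for IPv6, 0 for neither.
const kinds = ['192.168.0.1', '::1', 'example.com'].map((s) => isIP(s));

console.log({ host, query, nextPage, absolute, kinds });
```

The listed libraries typically add conveniences on top of this, such as richer query-string handling or subnet arithmetic, so the built-ins are a reasonable first stop before pulling in a dependency.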

Web Content Extracting

Libraries for extracting web content.

  • node-read - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
  • node-ytdl-core - YouTube video downloader in JavaScript
  • ImageResolver - Does its best to determine the main image on a URL without loading all images.

WebSocket

Libraries for working with WebSocket.

  • websocket.io - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
  • WebSocket-Node - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)

DNS Resolving

  • multicast-dns - Low level multicast-dns implementation in pure javascript
  • node-dns - Replacement dns module in pure javascript for node.js

Computer Vision

  • tracking.js - A modern approach for Computer Vision on the web.
  • ocrad.js - OCR in Javascript via Emscripten.

Proxy Server

  • toxy - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions
  • proxy-chain - Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining

Data Structure

  • immutable - Immutable persistent data collections for Javascript which increase efficiency and simplicity.
  • lodash - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects

Other JavaScript lists