mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-28 08:48:58 +02:00
awesome-web-scraping/javascript.md
2015-08-21 22:50:46 +05:00

JavaScript Web Scraping

This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that run in Node.js, without a real web browser.

Network

  • node-http2 - An HTTP/2 client and server implementation for node.js
  • httpinvoke - A no-dependencies HTTP client library for browsers and Node.js with a promise-based or Node.js-style callback-based API covering progress events, text and binary file upload and download, partial response bodies, request and response headers, and status codes.
  • request - Simplified HTTP request client.
  • socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
  • rest - RESTful HTTP client for JavaScript
  • wreck - HTTP Client Utilities

Web-Scraping Frameworks

HTML/XML Parsing

  • General
    • TODO
  • Sanitizing
    • js-xss - Sanitize untrusted HTML (to prevent XSS) using a configurable whitelist.

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • string.js - Extra JavaScript string methods.
    • accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
    • validator.js - String validation and sanitization.
  • Date and time
    • moment - Parse, validate, manipulate, and display dates in JavaScript.
    • date - Date() for humans.
    • ms.js - Tiny millisecond conversion utility.
  • HTML entities
    • he - A robust HTML entity encoder/decoder written in JavaScript.
  • Money
    • money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
  • Color
    • chroma.js - JavaScript library for all kinds of color manipulations.
    • color - JavaScript color conversion and manipulation library.
    • TinyColor - Fast, small color manipulation and conversion for JavaScript.

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
  • CSV
    • BabyParse - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
  • JSON
    • json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.

Natural Language Processing

Libraries for working with human languages.

  • General
    • natural - General natural language facilities for Node.js.
    • nlp_compromise - Natural language processing.
    • Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
    • salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
  • Stemmer
    • snowball-js - JavaScript implementation of the popular Snowball word-stemming NLP algorithm.
    • porter-stemmer - Martin Porter's stemmer for node.js
    • Porter-Stemmer - A Javascript Implementation of the Porter Stemmer
    • lunr-languages - A collection of language stemmers and stopwords for the Lunr JavaScript library.
  • Language detection
    • franc - Natural language detection
    • guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js

Browser automation and emulation

  • phantomjs - Scriptable Headless WebKit.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - A high-level wrapper for PhantomJS that lets you automate browser tasks.

Multiprocessing

  • nexpect - spawn and control child processes in node.js with ease
  • respawn - Spawn a process and restart it if it crashes
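
The "restart it if it crashes" idea can be sketched with Node's built-in child_process module alone. This is a minimal, blocking illustration and not how respawn or nexpect are implemented; the helper name runWithRetries and the maxAttempts parameter are hypothetical.

```javascript
// Built-in module only; none of the libraries listed above are used here.
const { spawnSync } = require('node:child_process');

// Hypothetical helper: run a command and rerun it while it exits with a
// non-zero status, up to maxAttempts times.
function runWithRetries(command, args, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = spawnSync(command, args, { encoding: 'utf8' });
    if (result.status === 0) {
      // Clean exit: report which attempt succeeded and what was printed.
      return { attempt, stdout: result.stdout };
    }
  }
  throw new Error(`${command} failed after ${maxAttempts} attempts`);
}

// Spawn the current Node.js binary as a stand-in for a real worker process.
const outcome = runWithRetries(process.execPath, ['-e', 'console.log("done")']);
console.log(outcome.attempt, outcome.stdout.trim());
```

A long-running scraper would more likely use the asynchronous spawn call and listen for the child's exit event, so the parent is not blocked while the child runs.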

Asynchronous

Libraries for asynchronous network programming.

  • TODO

Queue

  • TODO

Email

Libraries for parsing email.

  • TODO

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

  • URL
    • query-string - Parse and stringify URL query strings.
    • URI.js - Javascript URL mutation library.
    • jsurl - Lightweight URL manipulation with JavaScript.
  • Network Address
    • TODO
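
As a rough illustration of the kind of URL parsing and modification these libraries cover, the sketch below resolves a relative link and rewrites a query parameter. It relies only on the WHATWG URL API built into Node.js, not on query-string, URI.js, or jsurl; the helper name buildUrl and the example.com addresses are hypothetical.

```javascript
// Built-in WHATWG URL API; the libraries listed above are not used here.
const { URL } = require('node:url');

// Hypothetical helper: resolve a link found on a page against the page URL,
// then set (or override) the given query parameters.
function buildUrl(base, relative, params = {}) {
  const url = new URL(relative, base);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Move from page 1 to page 2 of a paginated listing.
const next = buildUrl('https://example.com/catalog/', 'items?page=1&sort=asc', { page: 2 });
console.log(next); // https://example.com/catalog/items?page=2&sort=asc
```

Resolving against the page URL first matters in scraping, since links in HTML are often relative and a plain string concatenation would mishandle paths such as "../items" or "/items".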

Web Content Extracting

Libraries for extracting web content.

  • Text and Meta Data from HTML pages
    • TODO

WebSocket

Libraries for working with WebSocket.

  • TODO

DNS Resolving

  • TODO

Computer Vision

  • tracking.js - A modern approach for Computer Vision on the web.
  • ocrad.js - OCR in Javascript via Emscripten.

Proxy Server

  • TODO

Data Structure

  • immutable - Immutable persistent data collections for JavaScript which increase efficiency and simplicity.
  • lodash - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects.

Other JavaScript lists