mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-28 08:48:58 +02:00
7.9 KiB
JavaScript Web Scraping
This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that can run in Node.js, without a real web browser.
- JavaScript Web Scraping
- Network
- Web-scraping Frameworks
- HTML/XML Parsing
- Text processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- URL and Network Address Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Other JavaScript Lists
- Data Structure
Network
- node-http2 - An HTTP/2 client and server implementation for Node.js.
- httpinvoke - A no-dependencies HTTP client library for browsers and Node.js with a promise-based or Node.js-style callback-based API. It supports progress events, text and binary file upload and download, partial response bodies, request and response headers, and status codes.
- request - Simplified HTTP request client.
- socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
- rest - RESTful HTTP client for JavaScript
- wreck - HTTP Client Utilities
Web-Scraping Frameworks
- node-crawler - Web crawler/spider for Node.js with server-side jQuery.
- node-simplecrawler - Flexible event-driven crawler for Node.js.
HTML/XML Parsing
- General
- TODO
- Sanitizing
- js-xss - Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a whitelist.
Text Processing
Libraries for parsing and manipulating plain texts.
- General
- string.js - Extra JavaScript string methods.
- accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
- validator.js - String validation and sanitization.
- Date and time
- moment - Parse, validate, manipulate, and display dates in JavaScript.
- moment-timezone - Timezone support for moment.js.
- date - Date() for humans.
- ms.js - Tiny millisecond conversion utility.
- HTML entities
- he - A robust HTML entity encoder/decoder written in JavaScript.
- Money
- money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
- Color
Specific Formats Processing
Libraries for parsing and manipulating specific text formats.
- General
- jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
- CSV
- BabyParse - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
- JSON
- json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.
Natural Language Processing
Libraries for working with human languages.
- General
- natural - General natural language facilities for Node.js.
- nlp_compromise - Natural language processing.
- Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
- salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
- Stemmer
- snowball-js - JavaScript implementation of the popular Snowball word-stemming NLP algorithm.
- porter-stemmer - Martin Porter's stemmer for Node.js.
- Porter-Stemmer - A JavaScript implementation of the Porter stemmer.
- lunr-languages - A collection of language stemmers and stop words for the Lunr JavaScript library.
- Language detection
- franc - Natural language detection
- guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js
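Trigram-based detection, as mentioned for guessLanguage.js, starts by counting overlapping three-character sequences in the input. The sketch below shows only that counting step; a real detector would then compare the resulting profile against per-language reference profiles, which is not shown here. The helper names `trigramProfile` and `topTrigrams` are hypothetical and are not the API of any library listed above.

```javascript
// Sketch of the trigram-counting step behind trigram-based language detection.
// `trigramProfile` and `topTrigrams` are hypothetical helpers for illustration.

// Count overlapping 3-character sequences in lowercased, whitespace-normalized text.
// The text is padded with one space on each side so word boundaries are captured.
function trigramProfile(text) {
  const normalized = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    const gram = normalized.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Return up to `limit` trigrams ordered by descending frequency.
function topTrigrams(text, limit) {
  return [...trigramProfile(text).entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([gram]) => gram);
}

console.log(topTrigrams('the theory of the thing', 3));
```

Padding with spaces is one common design choice: it lets trigrams such as " th" record that a sequence occurs at the start of a word.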
Browser automation and emulation
- phantomjs - Scriptable Headless WebKit.
- slimerjs - A PhantomJS-like tool running Gecko.
- casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
- zombie - Insanely fast, full-stack, headless browser testing using node.js.
- nightmare - A high-level wrapper for PhantomJS that lets you automate browser tasks.
Multiprocessing
- nexpect - Spawn and control child processes in Node.js with ease.
- respawn - Spawn a process and restart it if it crashes.
Asynchronous
Libraries for asynchronous network programming.
- TODO
Queue
- TODO
Email
Libraries for parsing email.
- TODO
URL and Network Address Manipulation
Libraries for parsing/modifying URLs and network addresses.
- URL
- query-string - Parse and stringify URL query strings.
- URI.js - JavaScript URL mutation library.
- jsurl - Lightweight URL manipulation with JavaScript.
- Network Address
- TODO
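For simple cases, URL parsing and query-string modification can be sketched with Node's built-in WHATWG `URL` API rather than one of the libraries above. The example URL and the helper name `withQuery` are assumptions made for illustration only.

```javascript
// Sketch using Node's built-in WHATWG URL API (not one of the libraries listed above).
// The example URL and the helper name `withQuery` are hypothetical.

// Return a new URL string with the given query parameters set.
// Existing parameters with the same key are overwritten.
function withQuery(urlString, params) {
  const url = new URL(urlString);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

const parsed = new URL('https://example.com/search?q=scraping&page=1#results');
console.log(parsed.hostname);               // host name without port
console.log(parsed.searchParams.get('q'));  // value of the "q" parameter

// Build the URL of the next results page without touching the other parts.
const nextPage = withQuery(parsed.href, { page: 2 });
console.log(nextPage);
```

The listed libraries may still be preferable when you need a chainable mutation API or query-string handling outside of a full URL.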
Web Content Extracting
Libraries for extracting web content.
- Text and Meta Data from HTML pages
- TODO
WebSocket
Libraries for working with WebSocket.
- TODO
DNS Resolving
- TODO
Computer Vision
- tracking.js - A modern approach for Computer Vision on the web.
- ocrad.js - OCR in JavaScript via Emscripten.
Proxy Server
- TODO
Data Structure
- immutable - Immutable persistent data collections for JavaScript which increase efficiency and simplicity.
- lodash - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects