mirror of https://github.com/lorien/awesome-web-scraping.git synced 2025-02-21 19:06:39 +02:00
# JavaScript Web Scraping
This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that can run in Node.js, without a real web browser.

* [JavaScript Web Scraping](#javascript-web-scraping)
    * [Network](#network)
    * [Web-Scraping Frameworks](#web-scraping-frameworks)
    * [HTML/XML Parsing](#htmlxml-parsing)
    * [Text Processing](#text-processing)
    * [Specific Formats Processing](#specific-formats-processing)
    * [Natural Language Processing](#natural-language-processing)
    * [Browser Automation and Emulation](#browser-automation-and-emulation)
    * [Multiprocessing](#multiprocessing)
    * [Asynchronous](#asynchronous)
    * [Queue](#queue)
    * [Email](#email)
    * [URL and Network Address Manipulation](#url-and-network-address-manipulation)
    * [Web Content Extracting](#web-content-extracting)
    * [WebSocket](#websocket)
    * [DNS Resolving](#dns-resolving)
    * [Computer Vision](#computer-vision)
    * [Proxy Server](#proxy-server)
    * [Data Structure](#data-structure)
    * [Other JavaScript Lists](#other-javascript-lists)

## Network
* [node-http2](https://github.com/molnarg/node-http2) - An HTTP/2 client and server implementation for node.js
* [httpinvoke](https://github.com/jakutis/httpinvoke) - A no-dependencies HTTP client library for browsers and Node.js with a promise-based or Node.js-style callback-based API to progress events, text and binary file upload and download, partial response body, request and response headers, status code.
* [request](https://github.com/request/request) - Simplified HTTP request client.
* [socks5-http-client](https://github.com/mattcg/socks5-http-client) - SOCKS v5 HTTP client implementation in JavaScript for Node.js
* [rest](https://github.com/cujojs/rest) - RESTful HTTP client for JavaScript
* [wreck](https://github.com/hapijs/wreck) - HTTP Client Utilities
## Web-Scraping Frameworks
* [node-crawler](https://github.com/sylvinus/node-crawler) - Web Crawler/Spider for NodeJS + server-side jQuery
* [node-simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Flexible event driven crawler for node
## HTML/XML Parsing
* General
    * [parse5](https://github.com/inikulin/parse5) - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
    * [htmlparser2](https://github.com/fb55/htmlparser2) - forgiving html and xml parser
    * [sax-js](https://github.com/isaacs/sax-js) - A sax style parser for JS
    * [cheerio](https://github.com/cheeriojs/cheerio) - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
* Sanitizing
    * [js-xss](https://github.com/leizongmin/js-xss) - Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.

## Text Processing
*Libraries for parsing and manipulating plain text.*

* General
    * [string.js](https://github.com/jprichardson/string.js) - Extra JavaScript string methods.
    * [accounting.js](https://github.com/openexchangerates/accounting.js) - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
    * [validator.js](https://github.com/chriso/validator.js) - String validation and sanitization.
* Date and time
    * [moment](https://github.com/moment/moment) - Parse, validate, manipulate, and display dates in javascript.
    * [moment-timezone](https://github.com/moment/moment-timezone) - Timezone support for moment.js.
    * [date](https://github.com/MatthewMueller/date) - Date() for humans.
    * [ms.js](https://github.com/guille/ms.js) - Tiny millisecond conversion utility.
* HTML entities
    * [he](https://github.com/mathiasbynens/he) - A robust HTML entity encoder/decoder written in JavaScript.
* Money
    * [money.js](https://github.com/openexchangerates/money.js) - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
* Color
    * [chroma.js](https://github.com/gka/chroma.js) - JavaScript library for all kinds of color manipulations.
    * [color](https://github.com/harthur/color) - JavaScript color conversion and manipulation library.
    * [TinyColor](https://github.com/bgrins/TinyColor) - Fast, small color manipulation and conversion for JavaScript.
* User Agent
    * [UAParser.js](https://github.com/faisalman/ua-parser-js) - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
* Semantic Version
    * [node-semver](https://github.com/npm/node-semver) - The semver parser for node

## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*

* General
    * [jBinary](https://github.com/jDataView/jBinary) - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
* Office
    * [js-xlsx](https://github.com/SheetJS/js-xlsx) - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
* CSV
    * [BabyParse](https://github.com/Rich-Harris/BabyParse) - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
    * [CSV](https://github.com/knrz/CSV.js) - A simple, blazing-fast CSV parser and encoder. Full RFC 4180 compliance.
* JSON
    * [json3](https://github.com/bestiejs/json3) - A modern JSON implementation compatible with nearly all JavaScript platforms.
* EXIF
    * [exif-js](https://github.com/exif-js/exif-js) - JavaScript library for reading EXIF image metadata
* CSS
    * [parse-css](https://github.com/tabatkins/parse-css) - Standards-based CSS Parser
    * [parser-lib CSS parser](https://github.com/CSSLint/parser-lib) - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
* Torrent
    * [parse-torrent](https://github.com/feross/parse-torrent) - Parse a torrent identifier (magnet uri, .torrent file, info hash)
* SQL
    * [SQL Parser](https://github.com/forward/sql-parser) - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
* YAML
    * [JS-YAML](https://github.com/nodeca/js-yaml) - JavaScript YAML parser and dumper. Very fast.
* Markdown
    * [markdown-it](https://github.com/markdown-it/markdown-it) - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
* Atom/RSS
    * [node-feedparser](https://github.com/danmactough/node-feedparser) - Robust RSS, Atom, and RDF feed parsing in Node.js

## Natural Language Processing
*Libraries for working with human languages.*

* General
    * [natural](https://github.com/NaturalNode/natural) - general natural language facilities for node
    * [nlp_compromise](https://github.com/spencermountain/nlp_compromise) - natural language processing
    * [Hanzi](https://github.com/nieldlr/Hanzi) - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
    * [salient](https://github.com/nyxtom/salient) - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
    * [node-summary](https://github.com/jbrooksuk/node-summary) - Node module that summarizes text using a naive summarization algorithm
* Stemmer
    * [snowball-js](https://github.com/fortnightlabs/snowball-js) - javascript implementation of the popular snowball word stemming nlp algorithm
    * [porter-stemmer](https://github.com/jedp/porter-stemmer) - Martin Porter's stemmer for node.js
    * [Porter-Stemmer](https://github.com/kristopolous/Porter-Stemmer) - A Javascript Implementation of the Porter Stemmer
    * [lunr-languages](https://github.com/MihaiValentin/lunr-languages) - a collection of languages stemmers and stopwords for Lunr Javascript library
* Language detection
    * [franc](https://github.com/wooorm/franc) - Natural language detection
    * [guessLanguage.js](https://github.com/richtr/guessLanguage.js) - A natural language detection library based on trigram statistical analysis for Node.js

## Browser Automation and Emulation
* [phantomjs](https://github.com/ariya/phantomjs) - Scriptable Headless WebKit.
* [slimerjs](https://github.com/laurentj/slimerjs) - A PhantomJS-like tool running Gecko.
* [casperjs](https://github.com/n1k0/casperjs) - Navigation scripting & testing utility for PhantomJS and SlimerJS.
* [zombie](https://github.com/assaf/zombie) - Insanely fast, full-stack, headless browser testing using node.js.
* [nightmare](https://github.com/segmentio/nightmare) - Nightmare is a high level wrapper for PhantomJS that lets you automate browser tasks
## Multiprocessing
* [nexpect](https://github.com/nodejitsu/nexpect) - spawn and control child processes in node.js with ease
* [respawn](https://github.com/mafintosh/respawn) - Spawn a process and restart it if it crashes
* [node-webworker](https://github.com/pgriess/node-webworker) - A WebWorkers implementation for NodeJS
## Asynchronous
*Libraries for asynchronous network programming.*
* [socket.io](https://github.com/socketio/socket.io) - Realtime application framework (Node.JS server)
* [engine.io](https://github.com/socketio/engine.io) - Engine.IO is the implementation of transport-based cross-browser/cross-device bi-directional communication layer for Socket.IO
* [async](https://github.com/caolan/async) - Async utilities for node and the browser
## Queue
* [kue](https://github.com/Automattic/kue) - Kue is a priority job queue backed by redis, built for node.js
* [bull](https://github.com/OptimalBits/bull) - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.
## Email
*Libraries for parsing email.*
* [mailparser](https://github.com/andris9/mailparser) - Decode mime formatted e-mails
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
    * [query-string](https://github.com/sindresorhus/query-string) - Parse and stringify URL query strings.
    * [URI.js](https://github.com/medialize/URI.js/) - Javascript URL mutation library.
    * [jsurl](https://github.com/Mikhus/jsurl) - Lightweight URL manipulation with JavaScript.
    * [arg.js](https://github.com/stretchr/arg.js) - Lightweight URL argument and parameter parser
* Network Address
    * [node-ip](https://github.com/indutny/node-ip) - IP address tools for node.js
    * [ip-address](https://github.com/beaugunderson/ip-address) - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript

## Web Content Extracting
*Libraries for extracting web contents.*
* [node-read](https://github.com/bndr/node-read) - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
* [node-ytdl-core](https://github.com/fent/node-ytdl-core) - Youtube video downloader in javascript
* [ImageResolver](https://github.com/mauricesvay/ImageResolver) - Does its best to determine the main image on a URL without loading all images.
## WebSocket
*Libraries for working with WebSocket.*
* [websocket.io](https://github.com/LearnBoost/websocket.io) - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
* [WebSocket-Node](https://github.com/theturtle32/WebSocket-Node) - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)
## DNS Resolving
* [multicast-dns](https://github.com/mafintosh/multicast-dns) - Low level multicast-dns implementation in pure javascript
* [node-dns](https://github.com/tjfontaine/node-dns) - Replacement dns module in pure javascript for node.js
## Computer Vision
* [tracking.js](https://github.com/eduardolundgren/tracking.js) - A modern approach for Computer Vision on the web.
* [ocrad.js](https://github.com/antimatter15/ocrad.js) - OCR in Javascript via Emscripten.
## Proxy Server
* [toxy](https://github.com/h2non/toxy) - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions
## Data Structure
* [immutable](https://github.com/facebook/immutable-js) - Immutable persistent data collections for Javascript which increase efficiency and simplicity.
* [lodash](https://github.com/lodash/lodash) - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects
## Other JavaScript Lists
* [awesome-javascript](https://github.com/sorrycc/awesome-javascript)