mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-24 08:32:19 +02:00
6ec0ce6da0
File manuals.md contains a list of articles and books teaching base things of web scraping
# Web Scraping Manuals

## Table of Contents

- [About the List](#about-the-list)
    - [Base Things](#base-things)
    - [Information Availability](#information-availability)
    - [Information Granularity](#information-granularity)
    - [How to Contribute](#how-to-contribute)
- [Web Scraping Articles and Topics](#web-scraping-articles-and-topics)
    - [HTML](#html)
    - [HTTP](#http)
    - [DNS](#dns)
    - [TCP](#tcp)
    - [TLS](#tls)
    - [WebSocket](#websocket)
    - [Concurrency](#concurrency)
    - [Text Encoding](#text-encoding)
    - [URL](#url)
    - [XMLHttpRequest](#xmlhttprequest)
    - [Security](#security)
    - [IP Address](#ip-address)
    - [Data Structures](#data-structures)

## About the List

This is a list of articles and books teaching web scraping.

### Base Things

Knowing the fundamentals is more important than knowing particular tools or implementations.

It is important to know what HTTP, TCP, TLS, DNS, HTML, XML, XPath, CSS, the DOM, and network request proxying are.

It is LESS important to know how to build a crawler with SuperScrapingFramework or which function of PowerfulHTMLParsingLibrary
lets you extract text from a selected element of the HTML DOM tree. These things are very specific. You do not have to know how
to operate every scraping framework or HTML parsing package in the world. If you know the fundamentals, it takes only a short
time to learn how to apply them with a particular programming package.

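As a rough illustration of the point above, the sketch below extracts text from a selected element and resolves relative links using only Python's standard library (`html.parser` and `urllib.parse`), with no scraping framework involved. The class name `LinkAndTitleParser`, the helper `extract`, the sample HTML, and the `example.com` URLs are all made up for this example; it is a minimal sketch, not a recommended parser for real-world pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and absolute link URLs (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self.in_title:
            self.title += data


def extract(html, base_url):
    """Return (title, links) for an HTML string; base_url is the page address."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Made-up sample page, so the example runs without any network access.
SAMPLE_HTML = """<html><head><title> Example page </title></head>
<body><a href="/docs">Docs</a> <a href="https://example.org/x">X</a> <a>no href</a></body></html>"""

title, links = extract(SAMPLE_HTML, "https://example.com/start")
print(title)
print(links)
```

Everything the sketch relies on (elements, attributes, relative URL resolution) is covered by the HTML and URL topics listed further down; a dedicated parsing package mostly wraps the same ideas in a more convenient interface.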
### Information Availability

The list must provide information that is instantly accessible. The list does not accept books whose content is not available online.

### Information Granularity

If a book covers a number of topics, it makes sense to link to a particular topic of the book from the corresponding section of
this list.

### How to Contribute

You may submit a new issue with an article or book you want to add. I will read the article, or take a look at the animals on
the cover picture of the book, and decide whether it is worth including in the list.

## Web Scraping Articles and Topics

### HTML

- [WHATWG / HTML](https://html.spec.whatwg.org/multipage/)

### HTTP

- [High Performance Browser Networking / HTTP/1.X](https://hpbn.co/http1x/)
- [High Performance Browser Networking / HTTP/2](https://hpbn.co/http2/)
- [HTTP Working Group HTTP Specs](https://httpwg.org/specs/)

### DNS

Nothing here yet.

### TCP

- [High Performance Browser Networking / Building Blocks of TCP](https://hpbn.co/building-blocks-of-tcp/)

### TLS

- [High Performance Browser Networking / Transport Layer Security (TLS)](https://hpbn.co/transport-layer-security-tls/)

### WebSocket

- [High Performance Browser Networking / WebSocket](https://hpbn.co/websocket/)
- [WHATWG / WebSockets](https://websockets.spec.whatwg.org/)

### Concurrency

- [The Little Book of Semaphores](https://greenteapress.com/wp/semaphores/)

### Text Encoding

- [WHATWG / Encoding](https://encoding.spec.whatwg.org/)

### URL

- [WHATWG / URL](https://url.spec.whatwg.org/)

### XMLHttpRequest

- [WHATWG / XMLHttpRequest](https://xhr.spec.whatwg.org/)
- [High Performance Browser Networking / XMLHttpRequest](https://hpbn.co/xmlhttprequest/)

### Security

- [OWASP Web Security Testing Guide](https://owasp.org/www-project-web-security-testing-guide/latest/)

### IP Address

- [Understanding IP Addressing](http://pages.di.unipi.it/ricci/501302.pdf)

### Data Structures

- [Probabilistic Data Structures for Web Analytics and Data Mining](https://dirtysalt.github.io/html/probabilistic-data-structures-for-web-analytics-and-data-mining.html)