mirror of https://github.com/lorien/awesome-web-scraping.git synced 2024-11-21 17:17:03 +02:00

Create manuals.md

The file manuals.md contains a list of articles and books that teach the base things of web scraping.
lorien 2023-08-07 13:22:39 +06:00 committed by GitHub
parent ad05430f52
commit 6ec0ce6da0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

manuals.md Normal file

@ -0,0 +1,110 @@
# Web Scraping Manuals
## Table of Contents
- [About the List](#about-the-list)
  - [Base Things](#base-things)
  - [Information Availability](#information-availability)
  - [Information Granularity](#information-granularity)
  - [How to Contribute](#how-to-contribute)
- [Web Scraping Articles and Topics](#web-scraping-articles-and-topics)
  - [HTML](#html)
  - [HTTP](#http)
  - [DNS](#dns)
  - [TCP](#tcp)
  - [TLS](#tls)
  - [WebSocket](#websocket)
  - [Concurrency](#concurrency)
  - [Text Encoding](#text-encoding)
  - [URL](#url)
  - [XMLHttpRequest](#xmlhttprequest)
  - [Security](#security)
  - [IP Address](#ip-address)
  - [Data Structures](#data-structures)
## About the List
This is a list of articles and books teaching web scraping.
### Base Things
Knowing the base things is more important than knowing particular tools or implementations.
It is important to know what HTTP, TCP, TLS, DNS, HTML, XML, XPath, CSS, DOM, and the proxying of network requests are.
It is LESS important to know how to build a crawler with SuperScrapingFramework or which function of PowerfulHTMLParsingLibrary lets
you extract text from a selected element of the HTML DOM tree. These things are very specific. You do not have to know how to operate
every scraping framework or HTML parsing package in the world. If you know the base things, it takes only a short time
to learn how to apply them with a particular programming package.
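
To make the point concrete, here is a minimal sketch that uses only the Python standard library, with no scraping framework. It splits a raw HTTP/1.x response into status, headers, and body, parses the HTML, and resolves a relative URL. The response bytes, the `example.com` base URL, and the names `parse_http_response` and `TitleAndLinks` are made up for illustration. The sketch also assumes a UTF-8 body and ignores `Content-Length` and chunked transfer encoding, which a real client would have to handle.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_http_response(raw: bytes):
    """Split a raw HTTP/1.x response into status code, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# Hypothetical response bytes, as they might arrive over a TCP socket.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><head><title>Example</title></head>"
    b'<body><a href="/docs/page.html">Docs</a></body></html>'
)

status, headers, body = parse_http_response(RAW)

# Simplification: the charset is assumed to be UTF-8 instead of being
# read from the Content-Type header.
parser = TitleAndLinks()
parser.feed(body.decode("utf-8"))

# Relative links are resolved against the (assumed) URL of the page.
absolute = [urljoin("https://example.com/index.html", href) for href in parser.links]
```

Every step here maps to one of the base things: the HTTP message format, text encoding, the HTML tree, and URL resolution. A scraping framework wraps the same steps behind its own function names.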
### Information Availability
The list must provide information that is instantly accessible. The list does not accept books whose content is not available online.
### Information Granularity
If a book covers several topics, it makes sense to link to the particular topic of the book in the corresponding section of
this list.
### How to Contribute
You may submit a new issue with an article or book you want to add. I will read the article, or take a look at the animals on
the cover picture of the book, and decide whether it is worth including in the list.
## Web Scraping Articles and Topics
### HTML
- [WHATWG / HTML](https://html.spec.whatwg.org/multipage/)
### HTTP
- [High Performance Browser Networking / HTTP/1.X](https://hpbn.co/http1x/)
- [High Performance Browser Networking / HTTP/2](https://hpbn.co/http2/)
- [HTTP Working Group HTTP Specs](https://httpwg.org/specs/)
### DNS
Nothing here yet.
### TCP
- [High Performance Browser Networking / Building Blocks of TCP](https://hpbn.co/building-blocks-of-tcp/)
### TLS
- [High Performance Browser Networking / Transport Layer Security (TLS)](https://hpbn.co/transport-layer-security-tls/)
### WebSocket
- [High Performance Browser Networking / WebSocket](https://hpbn.co/websocket/)
- [WHATWG / WebSockets](https://websockets.spec.whatwg.org/)
### Concurrency
- [The Little Book of Semaphores](https://greenteapress.com/wp/semaphores/)
### Text Encoding
- [WHATWG / Encoding](https://encoding.spec.whatwg.org/)
### URL
- [WHATWG / URL](https://url.spec.whatwg.org/)
### XMLHttpRequest
- [WHATWG / XMLHttpRequest](https://xhr.spec.whatwg.org/)
- [High Performance Browser Networking / XMLHttpRequest](https://hpbn.co/xmlhttprequest/)
### Security
- [OWASP Web Security Testing Guide](https://owasp.org/www-project-web-security-testing-guide/latest/)
### IP Address
- [Understanding IP Addressing](http://pages.di.unipi.it/ricci/501302.pdf)
### Data Structures
- [Probabilistic Data Structures for Web Analytics and Data Mining](https://dirtysalt.github.io/html/probabilistic-data-structures-for-web-analytics-and-data-mining.html)