mirror of
https://github.com/lorien/awesome-web-scraping.git
synced 2024-11-21 17:17:03 +02:00
Create manuals.md
The file manuals.md contains a list of articles and books that teach the base things of web scraping.
parent ad05430f52
commit 6ec0ce6da0
110
manuals.md
Normal file
@@ -0,0 +1,110 @@
# Web Scraping Manuals

## Table of Contents

- [About the List](#about-the-list)
    - [Base Things](#base-things)
    - [Information Availability](#information-availability)
    - [Information Granularity](#information-granularity)
    - [How to Contribute](#how-to-contribute)
- [Web Scraping Articles and Topics](#web-scraping-articles-and-topics)
    - [HTML](#html)
    - [HTTP](#http)
    - [DNS](#dns)
    - [TCP](#tcp)
    - [TLS](#tls)
    - [WebSocket](#websocket)
    - [Concurrency](#concurrency)
    - [Text Encoding](#text-encoding)
    - [URL](#url)
    - [XMLHttpRequest](#xmlhttprequest)
    - [Security](#security)
    - [IP Address](#ip-address)
    - [Data Structures](#data-structures)

## About the List

This is a list of articles and books that teach web scraping.

### Base Things

Knowing the base things is more important than knowing particular tools or implementations.

It is important to know what HTTP, TCP, TLS, DNS, HTML, XML, XPath, CSS, and the DOM are, and how network requests are proxied.

It is LESS important to know how to build a crawler with SuperScrapingFramework, or which function of PowerfulHTMLParsingLibrary
extracts text from a selected element of the HTML DOM tree. These things are very specific. You do not have to know how to operate
every scraping framework or HTML parsing package in the world. If you know the base things, it takes only a short time
to learn how to apply them with a particular programming package.
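
As a rough illustration of the point above, the sketch below extracts the text of one element using only Python's standard-library `html.parser`, without any scraping framework. The class name `ElementTextExtractor`, the helper `extract_text`, and the sample markup are all hypothetical, invented for this example. It is a minimal sketch that assumes well-formed markup, not a replacement for a real HTML parsing library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # a nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())


def extract_text(html, element_id):
    """Return the normalized text of the element with the given id, or ''."""
    parser = ElementTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample document, used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <div id="intro"><p>Hello, <b>web</b>
     scraping!</p><br></div>
  <div id="other">ignored</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML, "intro"))
```

The same idea (walk the element tree, track nesting, collect text nodes) is what a dedicated parsing package does behind a shorter API, which is why the concept transfers from one package to another.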

### Information Availability

The list must provide information that is instantly accessible. It does not accept books whose content is not available online.

### Information Granularity

If a book covers several topics, it makes sense to link to the particular topic of the book from the corresponding section of
the Learning Web Scraping list.

### How to Contribute

You may submit a new issue with an article or book you want to add. I will read the article, or take a look at the animals on
the book's cover picture, and decide whether it is worth including in the list.

## Web Scraping Articles and Topics

### HTML

- [WHATWG / HTML](https://html.spec.whatwg.org/multipage/)

### HTTP

- [High Performance Browser Networking / HTTP/1.X](https://hpbn.co/http1x/)
- [High Performance Browser Networking / HTTP/2](https://hpbn.co/http2/)
- [HTTP Working Group HTTP Specs](https://httpwg.org/specs/)

### DNS

Nothing here yet.

### TCP

- [High Performance Browser Networking / Building Blocks of TCP](https://hpbn.co/building-blocks-of-tcp/)

### TLS

- [High Performance Browser Networking / Transport Layer Security (TLS)](https://hpbn.co/transport-layer-security-tls/)

### WebSocket

- [High Performance Browser Networking / WebSocket](https://hpbn.co/websocket/)
- [WHATWG / Websocket](https://websockets.spec.whatwg.org/)

### Concurrency

- [The Little Book of Semaphores](https://greenteapress.com/wp/semaphores/)

### Text Encoding

- [WHATWG / Encoding](https://encoding.spec.whatwg.org/)

### URL

- [WHATWG / URL](https://url.spec.whatwg.org/)

### XMLHttpRequest

- [WHATWG / XMLHttpRequest](https://xhr.spec.whatwg.org/)
- [High Performance Browser Networking / XMLHttpRequest](https://hpbn.co/xmlhttprequest/)

### Security

- [OWASP Web Security Testing Guide](https://owasp.org/www-project-web-security-testing-guide/latest/)

### IP Address

- [Understanding IP Addressing](http://pages.di.unipi.it/ricci/501302.pdf)

### Data Structures

- [Probabilistic Data Structures for Web Analytics and Data Mining](https://dirtysalt.github.io/html/probabilistic-data-structures-for-web-analytics-and-data-mining.html)