mirror of https://github.com/MontFerret/ferret.git (synced 2025-03-19 21:28:32 +02:00)
Reorganize README (#565)
Re-order and change heading level to adjust grouping
parent 060f3de07b
commit 0890f94c24
87
README.md
@@ -29,7 +29,14 @@ It is extremely portable, extensible and fast.
 
 [Read the introductory blog post about Ferret here!](https://medium.com/@ziflex/say-hello-to-ferret-a-modern-web-scraping-tool-5c9cc85ba183)
 
-## Show me some code
+### Features
+
+* Declarative language
+* Support of both static and dynamic web pages
+* Embeddable
+* Extensible
+
+### Show me some code
 The following example demonstrates the use of dynamic pages.
 We load the main Google Search page, type search criteria into an input box and then click a search button.
 The click action triggers a redirect, so we wait until its end.
@@ -59,20 +66,14 @@ FOR result IN ELEMENTS(google, '.g')
 
 More examples you can find [here](./examples)
 
-## Features
 
-* Declarative language
-* Support of both static and dynamic web pages
-* Embeddable
-* Extensible
-
-## Motivation
+### Motivation
 Nowadays data is everything and who owns data - owns the world.
 I have worked on multiple data-driven projects where data was an essential part of a system and I realized how cumbersome writing tons of scrapers is.
 After some time looking for a tool that would let me to not write a code, but just express what data I need, decided to come up with my own solution.
 ```ferret``` project is an ambitious initiative trying to bring the universal platform for writing scrapers without any hassle.
 
-## Inspiration
+### Inspiration
 FQL (Ferret Query Language) is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
 But due to the domain specifics, there are some differences in syntax and how things work.
 
@@ -96,7 +97,7 @@ You can download latest binaries from [here](https://github.com/MontFerret/ferre
 go get github.com/MontFerret/ferret
 ```
 
-## Environment
+### Environment
 
 In order to use all Ferret features, you will need to have Chrome either installed locally or running in Docker.
 For ease of use we recommend to run [Chromium inside a Docker container](https://github.com/MontFerret/chromium):
@@ -300,7 +301,9 @@ func getTopTenTrendingTopics() ([]*Topic, error) {
 
 ```
 
-## Extensibility
+## Extras
+
+### Extensibility
 
 That said, ```ferret``` is a very modular system which also allows not only embed it, but extend its standard library.
 
@@ -417,7 +420,7 @@ func main() {
 }
 ```
 
-## Proxy
+### Proxy
 
 By default, Ferret does not use any proxies. Partially, due to inability to force Chrome/Chromium (or any other Chrome Devtools Protocol compatible browser) to use a particular proxy. It should be done during a browser launch.
 
@@ -461,21 +464,45 @@ func run(q string) ([]byte, error) {
 
 ```
 
-## Cookies
+### Cookies
 
-### Non-incognito mode
+#### Get, Set, Delete
+For more precise work, you can set/get/delete cookies manually before and after loading the page:
+
+```
+LET doc = DOCUMENT("https://www.google.com", {
+    driver: "cdp",
+    cookies: [
+        {
+            name: "foo",
+            value: "bar"
+        }
+    ]
+})
+
+COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
+COOKIE_DEL(doc, "foo")
+
+LET c = COOKIE_GET(doc, "baz")
+
+FOR cookie IN doc.cookies
+    RETURN cookie.name
+
+```
+
+#### Access previously-set cookies (non-incognito mode)
 
 By default, ``CDP`` driver execute each query in an incognito mode in order to avoid any collisions related to some persisted cookies from previous queries.
 However, sometimes it might not be a desirable behavior and a query needs to be executed within a Chrome tab with earlier persisted cookies.
 In order to do that, we need to inform the driver to execute all queries in regular tabs. Here is how to do that:
 
-#### CLI
+##### CLI
 
 ```sh
 ferret --cdp-keep-cookies my-query.fql
 ```
 
-#### Code
+##### Code
 
 ```go
 package main
@@ -509,7 +536,7 @@ func run(q string) ([]byte, error) {
 }
 ```
 
-#### Query
+##### Query
 ```
 LET doc = DOCUMENT("https://www.google.com", {
     driver: "cdp",
@@ -517,31 +544,7 @@ LET doc = DOCUMENT("https://www.google.com", {
 })
 ```
 
-### Cookies manipulation
-For more precise work, you can set/get/delete cookies manually during and after page load:
-
-```
-LET doc = DOCUMENT("https://www.google.com", {
-    driver: "cdp",
-    cookies: [
-        {
-            name: "foo",
-            value: "bar"
-        }
-    ]
-})
-
-COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
-COOKIE_DEL(doc, "foo")
-
-LET c = COOKIE_GET(doc, "baz")
-
-FOR cookie IN doc.cookies
-    RETURN cookie.name
-
-```
-
-## File System
+### File System
 
 #### Write
 ```