mirror of https://github.com/MontFerret/ferret.git synced 2025-03-19 21:28:32 +02:00

Reorganize README (#565)

Re-order and change heading level to adjust grouping
David Landry 2020-10-31 11:04:20 -04:00 committed by GitHub
parent 060f3de07b
commit 0890f94c24
GPG Key ID: 4AEE18F83AFDEB23

@@ -29,7 +29,14 @@ It is extremely portable, extensible and fast.
 [Read the introductory blog post about Ferret here!](https://medium.com/@ziflex/say-hello-to-ferret-a-modern-web-scraping-tool-5c9cc85ba183)
-## Show me some code
+### Features
+* Declarative language
+* Support of both static and dynamic web pages
+* Embeddable
+* Extensible
+### Show me some code
 The following example demonstrates the use of dynamic pages.
 We load the main Google Search page, type search criteria into an input box and then click a search button.
 The click action triggers a redirect, so we wait until its end.
@@ -59,20 +66,14 @@ FOR result IN ELEMENTS(google, '.g')
 More examples you can find [here](./examples)
-## Features
-* Declarative language
-* Support of both static and dynamic web pages
-* Embeddable
-* Extensible
-## Motivation
+### Motivation
 Nowadays data is everything and who owns data - owns the world.
 I have worked on multiple data-driven projects where data was an essential part of a system and I realized how cumbersome writing tons of scrapers is.
 After some time looking for a tool that would let me to not write a code, but just express what data I need, decided to come up with my own solution.
 ```ferret``` project is an ambitious initiative trying to bring the universal platform for writing scrapers without any hassle.
-## Inspiration
+### Inspiration
 FQL (Ferret Query Language) is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
 But due to the domain specifics, there are some differences in syntax and how things work.
@@ -96,7 +97,7 @@ You can download latest binaries from [here](https://github.com/MontFerret/ferre
 go get github.com/MontFerret/ferret
 ```
-## Environment
+### Environment
 In order to use all Ferret features, you will need to have Chrome either installed locally or running in Docker.
 For ease of use we recommend to run [Chromium inside a Docker container](https://github.com/MontFerret/chromium):
@@ -300,7 +301,9 @@ func getTopTenTrendingTopics() ([]*Topic, error) {
 ```
-## Extensibility
+## Extras
+### Extensibility
 That said, ```ferret``` is a very modular system which also allows not only embed it, but extend its standard library.
@@ -417,7 +420,7 @@ func main() {
 }
 ```
-## Proxy
+### Proxy
 By default, Ferret does not use any proxies. Partially, due to inability to force Chrome/Chromium (or any other Chrome Devtools Protocol compatible browser) to use a particular proxy. It should be done during a browser launch.
@@ -461,21 +464,45 @@ func run(q string) ([]byte, error) {
 ```
-## Cookies
+### Cookies
-### Non-incognito mode
+#### Get, Set, Delete
+For more precise work, you can set/get/delete cookies manually before and after loading the page:
+```
+LET doc = DOCUMENT("https://www.google.com", {
+driver: "cdp",
+cookies: [
+{
+name: "foo",
+value: "bar"
+}
+]
+})
+COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
+COOKIE_DEL(doc, "foo")
+LET c = COOKIE_GET(doc, "baz")
+FOR cookie IN doc.cookies
+RETURN cookie.name
+```
+#### Access previously-set cookies (non-incognito mode)
 By default, ``CDP`` driver execute each query in an incognito mode in order to avoid any collisions related to some persisted cookies from previous queries.
 However, sometimes it might not be a desirable behavior and a query needs to be executed within a Chrome tab with earlier persisted cookies.
 In order to do that, we need to inform the driver to execute all queries in regular tabs. Here is how to do that:
-#### CLI
+##### CLI
 ```sh
 ferret --cdp-keep-cookies my-query.fql
 ```
-#### Code
+##### Code
 ```go
 package main
@@ -509,7 +536,7 @@ func run(q string) ([]byte, error) {
 }
 ```
-#### Query
+##### Query
 ```
 LET doc = DOCUMENT("https://www.google.com", {
 driver: "cdp",
@@ -517,31 +544,7 @@ LET doc = DOCUMENT("https://www.google.com", {
 })
 ```
-### Cookies manipulation
-For more precise work, you can set/get/delete cookies manually during and after page load:
-```
-LET doc = DOCUMENT("https://www.google.com", {
-driver: "cdp",
-cookies: [
-{
-name: "foo",
-value: "bar"
-}
-]
-})
-COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
-COOKIE_DEL(doc, "foo")
-LET c = COOKIE_GET(doc, "baz")
-FOR cookie IN doc.cookies
-RETURN cookie.name
-```
-## File System
+### File System
 #### Write
 ```