mirror of https://github.com/MontFerret/ferret.git
synced 2025-03-17 21:18:37 +02:00

Minor grammar and rewording changes of README (#566)

* Minor grammar and rewording changes of README
* Fix typo in README
* Add link to montferret.dev docs

parent 0890f94c24
commit 83d93016bf

90 README.md
@@ -25,7 +25,7 @@
 ## What is it?
 ```ferret``` is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more.
 ```ferret``` allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language.
-It is extremely portable, extensible and fast.
+It is extremely portable, extensible, and fast.

 [Read the introductory blog post about Ferret here!](https://medium.com/@ziflex/say-hello-to-ferret-a-modern-web-scraping-tool-5c9cc85ba183)

@@ -38,10 +38,9 @@ It is extremely portable, extensible and fast.

 ### Show me some code
 The following example demonstrates the use of dynamic pages.
-We load the main Google Search page, type search criteria into an input box and then click a search button.
-The click action triggers a redirect, so we wait until its end.
-Once the page gets loaded, we iterate over all elements in search results and assign the output to a variable.
-The final for loop filters out empty elements that might be because of inaccurate use of selectors.
+We load the main Google Search page, type search criteria into the input box, and then click the search button.
+The click action triggers a redirect, so we wait until the page we were redirected to finishes loading.
+Once the results page is loaded, we iterate over all elements in the search results and assign the output to a variable.

 ```aql
 LET google = DOCUMENT("https://www.google.com/", {
@@ -64,17 +63,19 @@ FOR result IN ELEMENTS(google, '.g')
 }
 ```

-More examples you can find [here](./examples)
+You can find more examples [here](./examples).


 ### Motivation
-Nowadays data is everything and who owns data - owns the world.
-I have worked on multiple data-driven projects where data was an essential part of a system and I realized how cumbersome writing tons of scrapers is.
-After some time looking for a tool that would let me to not write a code, but just express what data I need, decided to come up with my own solution.
-```ferret``` project is an ambitious initiative trying to bring the universal platform for writing scrapers without any hassle.
+Nowadays, data is everything, and whoever owns the data owns the world.
+I have worked on multiple data-driven projects where data was an essential part of a system, and I realized how repetitive it is to write scraping code.
+Other scraping libraries require lots of boilerplate code and tend to encourage an imperative approach to extracting data.
+After some time looking for a tool that would let me declare which data I needed (instead of imperatively instructing it how to extract it), I decided to build my own solution.
+The ```ferret``` project is an ambitious initiative that aims to provide a universal platform for writing scrapers without the hassle of other scrapers.

 ### Inspiration
-FQL (Ferret Query Language) is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
+FQL (Ferret Query Language) is meant to feel like writing a database query.
+It is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
 But due to the domain specifics, there are some differences in syntax and how things work.


@@ -100,14 +101,15 @@ go get github.com/MontFerret/ferret
 ### Environment

 In order to use all Ferret features, you will need to have Chrome either installed locally or running in Docker.
-For ease of use we recommend to run [Chromium inside a Docker container](https://github.com/MontFerret/chromium):
+For ease of use, we recommend running Chromium inside a Docker container.
+You can probably use most Chromium-based headless images, but we've put together [an image that's ready to go](https://github.com/MontFerret/chromium):

 ```sh
 docker pull montferret/chromium
 docker run -d -p 9222:9222 montferret/chromium
 ```
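
Once the container (or a local Chrome) is running, it can be handy to confirm that the remote debugging port is reachable before pointing ```ferret``` at it. The sketch below is not part of ```ferret```. It assumes Chrome's standard `/json/version` DevTools HTTP endpoint on `127.0.0.1:9222`, and the helper names (`versionURL`, `parseVersion`, `probeCDP`) are made up for illustration.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// cdpVersion holds the two fields of the /json/version response used here.
type cdpVersion struct {
	Browser              string `json:"Browser"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// versionURL builds the version endpoint URL from a base debugging address.
func versionURL(base string) string {
	return strings.TrimRight(base, "/") + "/json/version"
}

// parseVersion decodes the JSON body returned by the version endpoint.
func parseVersion(data []byte) (cdpVersion, error) {
	var v cdpVersion
	err := json.Unmarshal(data, &v)
	return v, err
}

// probeCDP asks the browser listening at base for its version information.
func probeCDP(ctx context.Context, base string) (cdpVersion, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, versionURL(base), nil)
	if err != nil {
		return cdpVersion{}, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return cdpVersion{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return cdpVersion{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return cdpVersion{}, err
	}
	return parseVersion(body)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Assumed address: the port published by the docker run command above.
	v, err := probeCDP(ctx, "http://127.0.0.1:9222")
	if err != nil {
		fmt.Println("browser not reachable:", err)
		return
	}
	fmt.Println("browser:", v.Browser)
	fmt.Println("websocket endpoint:", v.WebSocketDebuggerURL)
}
```

If the probe fails, the usual suspects are a container that is not running or a port other than 9222 being published.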

-But if you want to see what's happening during query execution, just start your Chrome with remote debugging port:
+If you'd rather see what's happening during query execution, just launch Chrome from your host with the remote debugging port set:

 ```sh
 chrome.exe --remote-debugging-port=9222
@@ -117,7 +119,8 @@ chrome.exe --remote-debugging-port=9222

 ### Browserless mode

-If you want to play with ```fql``` and check its syntax, you can run CLI with the following commands:
+If you want to try out ```fql```, you can get started without Chrome or a Chromium container.
+Executing the `ferret` CLI without any options will open `ferret` in REPL mode.
 ```
 ferret
 ```
@@ -154,17 +157,18 @@ ferret < ./docs/examples/static-page.fql

 ### Browser mode

-By default, ``ferret`` loads HTML pages via HTTP protocol, because it's faster.
-But nowadays, there are more and more websites rendered with JavaScript, and therefore, this 'old school' approach does not really work.
-For such cases, you may fetch documents using Chrome or Chromium via Chrome DevTools protocol (aka CDP).
-First, you need to make sure that you launched Chrome with ```remote-debugging-port=9222``` flag.
+By default, ``ferret`` loads HTML pages directly via the HTTP protocol, because it's faster.
+But nowadays, more and more websites are rendered with JavaScript, and this 'old school' approach does not really work.
+For these dynamic websites, you may fetch documents using Chrome or Chromium via the Chrome DevTools Protocol (aka CDP).
+First, you need to make sure that you launched Chrome with the ```remote-debugging-port=9222``` flag (see "Environment" in this README for instructions on setting this up).
 Second, you need to pass the address to ```ferret``` CLI.

 ```
 ferret --cdp http://127.0.0.1:9222
 ```

-**NOTE:** By default, ```ferret``` will try to use this local address as a default one, so it makes sense to explicitly pass the parameter only in case of either different port number or remote address.
+**NOTE:** By default, ```ferret``` will try to use this local address.
+You only need to explicitly pass the parameter if you are using a different port number or a remote address.

 Alternatively, you can tell CLI to launch Chrome for you.

@@ -174,7 +178,7 @@ ferret --cdp-launch

 **NOTE:** Launch command is currently broken on MacOS.

-Once ```ferret``` knows how to communicate with Chrome, you can use a function ```DOCUMENT(url, isDynamic)``` with ```true``` boolean value for dynamic pages:
+Once ```ferret``` knows how to communicate with Chrome, you can use the function ```DOCUMENT(url, isDynamic)```, setting ```isDynamic``` to ```{driver: "cdp"}``` for dynamic pages:

 ```shell
 Welcome to Ferret REPL
@@ -211,7 +215,11 @@ Please use `exit` or `Ctrl-D` to exit this program.

 ### Embedded mode

-```ferret``` is a very modular system and therefore, can be easily be embedded into your Go application.
+```ferret``` is a very modular system.
+It can be embedded into your Go application in only a few lines of code.
+
+Here is an example of a short Go application that defines an `fql` query, compiles it, executes it, then returns the results.
+

 ```go
 package main
@@ -305,7 +313,9 @@ func getTopTenTrendingTopics() ([]*Topic, error) {

 ### Extensibility

-That said, ```ferret``` is a very modular system which also allows not only embed it, but extend its standard library.
+With ```ferret```'s modular system, you can also extend its standard library.
+
+In this example, we define a `transform` function in Go, then register that function with ```ferret```, making it available for use in ```fql``` queries.

 ```go
 package main
@@ -395,15 +405,15 @@ func getStrings() ([]string, error) {
 }
 ```

-On top of that, you can completely turn off the standard library, bypassing the following option:
+You can completely turn off the ```ferret``` standard library, as follows:

 ```go
 comp := compiler.New(compiler.WithoutStdlib())
 ```

-And after that, you can easily provide your own implementation of functions from standard library.
+After disabling ```stdlib```, you can register your own implementation of functions from the standard library.

-If you don't need a particular set of functions from standard library, you can turn off the entire ```stdlib``` and register separate packages from that:
+If you only need a subset of the ```stdlib``` functions, you can enable just those by disabling the entire ```stdlib```, then registering the individual packages that are needed:

 ```go
 package main
@@ -422,17 +432,17 @@ func main() {

 ### Proxy

-By default, Ferret does not use any proxies. Partially, due to inability to force Chrome/Chromium (or any other Chrome Devtools Protocol compatible browser) to use a particular proxy. It should be done during a browser launch.
+By default, ```ferret``` does not attempt to use a proxy. This is due to an inability to force CDP-compatible browsers to use an arbitrary proxy. If you need to use a proxy, it should be defined while launching the browser.

-But you can pass an address of a proxy server you want to use for static pages.
+However, if you are querying static pages, you can define a proxy while launching ```ferret``` from the CLI or from embedded applications.
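
To make "a proxy for static pages" concrete: a static page is fetched with a plain HTTP client, so the proxy is simply a setting on that client. The following is a minimal sketch using only Go's standard `net/http` package. It does not show how ```ferret``` wires its own proxy option internally, and `newProxiedClient` and `proxyFor` are hypothetical helper names chosen for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxiedClient returns an HTTP client that routes every request through proxyAddr.
func newProxiedClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, fmt.Errorf("invalid proxy address %q: %w", proxyAddr, err)
	}
	if proxyURL.Scheme == "" || proxyURL.Host == "" {
		return nil, fmt.Errorf("proxy address %q needs a scheme and a host", proxyAddr)
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   30 * time.Second,
	}, nil
}

// proxyFor reports which proxy the client would use for the target URL.
// It returns an empty string when no proxy applies.
func proxyFor(c *http.Client, target string) (string, error) {
	tr, ok := c.Transport.(*http.Transport)
	if !ok || tr.Proxy == nil {
		return "", nil
	}
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	u, err := tr.Proxy(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// Same illustrative address as the CLI example in this section.
	client, err := newProxiedClient("http://localhost:8888")
	if err != nil {
		fmt.Println("cannot build client:", err)
		return
	}
	via, err := proxyFor(client, "https://www.google.com/")
	if err != nil {
		fmt.Println("cannot resolve proxy:", err)
		return
	}
	fmt.Println("static requests would be routed via:", via)
}
```

A browser-driven page never passes through a client like this one, which is why the proxy for dynamic pages has to be configured on the browser itself.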

-#### CLI
+#### CLI example

 ```sh
 ferret --proxy=http://localhost:8888 my-query.fql
 ```

-#### Code
+#### Embedded example

 ```go
 package main
@@ -492,17 +502,19 @@ FOR cookie IN doc.cookies

 #### Access previously-set cookies (non-incognito mode)

-By default, ``CDP`` driver execute each query in an incognito mode in order to avoid any collisions related to some persisted cookies from previous queries.
-However, sometimes it might not be a desirable behavior and a query needs to be executed within a Chrome tab with earlier persisted cookies.
-In order to do that, we need to inform the driver to execute all queries in regular tabs. Here is how to do that:
+By default, the ``CDP`` driver executes each query in incognito mode in order to avoid collisions from cookies persisted by previous queries.
+However, sometimes you might want access to persisted cookies (e.g. to avoid re-authenticating with a site).
+In order to do that, we need to configure the driver to execute all queries in non-incognito tabs.

-##### CLI
+Here is how to do that:
+
+##### CLI example

 ```sh
 ferret --cdp-keep-cookies my-query.fql
 ```

-##### Code
+##### Embedded example

 ```go
 package main
@@ -546,7 +558,9 @@ LET doc = DOCUMENT("https://www.google.com", {

 ### File System

-#### Write
+```ferret``` can also read from and write to the file system.
+
+#### Write example
 ```
 USE IO::FS

@@ -555,7 +569,7 @@ LET favicon = DOWNLOAD("https://www.google.com/favicon.ico")
 RETURN WRITE("google.favicon.ico", favicon)
 ```

-#### Read
+#### Read example
 ```
 USE IO::FS

@@ -566,6 +580,10 @@ FOR url IN urls
 RETURN DOCUMENT(url)
 ```

+## References
+
+Further documentation is available [at our website](https://www.montferret.dev/docs/introduction/).
+
 ## Contributors

 Thanks to everyone who contributed.