# Ferret

<p align="center">
	<a href="https://goreportcard.com/report/github.com/MontFerret/ferret">
		<img alt="Go Report Status" src="https://goreportcard.com/badge/github.com/MontFerret/ferret">
	</a>
	<a href="https://github.com/MontFerret/ferret/actions">
		<img alt="Build Status" src="https://github.com/MontFerret/ferret/workflows/build/badge.svg">
	</a>
	<a href="https://codecov.io/gh/MontFerret/ferret">
		<img src="https://codecov.io/gh/MontFerret/ferret/branch/master/graph/badge.svg" />
	</a>
	<a href="https://discord.gg/kzet32U">
		<img alt="Discord Chat" src="https://img.shields.io/discord/501533080880676864.svg">
	</a>
	<a href="https://github.com/MontFerret/ferret/releases">
		<img alt="Ferret release" src="https://img.shields.io/github/release/MontFerret/ferret.svg">
	</a>
	<a href="http://opensource.org/licenses/MIT">
		<img alt="MIT License" src="http://img.shields.io/badge/license-MIT-brightgreen.svg">
	</a>
</p>

![ferret](https://raw.githubusercontent.com/MontFerret/ferret/master/assets/intro.jpg)

## What is it?
```ferret``` is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more.
```ferret``` allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language.
It is extremely portable, extensible, and fast.

[Read the introductory blog post about Ferret here!](https://medium.com/@ziflex/say-hello-to-ferret-a-modern-web-scraping-tool-5c9cc85ba183)

### Features
* Declarative language
* Support for both static and dynamic web pages
* Embeddable
* Extensible
### Show me some code
The following example demonstrates the use of dynamic pages.
We load the main Google Search page, type the search criteria into the input box, and then click the search button.
The click action triggers a redirect, so we wait until the page we were redirected to finishes loading.
Once the results page is loaded, we iterate over all elements in the search results and return the title, description, and URL of each one.

```aql
LET google = DOCUMENT("https://www.google.com/", {
    driver: "cdp",
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})

HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
WAIT_ELEMENT(google, '.UUbT9')
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')

WAIT_NAVIGATION(google)

FOR result IN ELEMENTS(google, '.g')
    // filter out extra elements like videos and 'People also ask'
    FILTER TRIM(result.attributes.class) == 'g'
    RETURN {
        title: INNER_TEXT(result, 'h3'),
        description: INNER_TEXT(result, '.rc > div:nth-child(2) span'),
        url: INNER_TEXT(result, 'cite')
    }
```

You can find more examples [here](./examples).

### Motivation
Nowadays, data is everything: whoever owns the data owns the world.
I have worked on multiple data-driven projects where data was an essential part of a system, and I realized how repetitive it is to write scraping code.
Other scraping libraries require lots of boilerplate code and tend to encourage an imperative approach to extracting data.
After some time looking for a tool that would let me declare which data I needed (instead of imperatively instructing it how to extract it), I decided to build my own solution.
The ```ferret``` project is an ambitious initiative that aims to provide a universal platform for writing scrapers without the hassle of other scraping tools.

### Inspiration
FQL (Ferret Query Language) is meant to feel like writing a database query.
It is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
However, due to the domain specifics, there are some differences in syntax and how things work.

## Installation
### Binary
You can download the latest binaries from [here](https://github.com/MontFerret/ferret/releases).

### Source code
#### Production
* Go >=1.11
* Chrome or Docker

#### Development
* GNU Make
* ANTLR4 >=4.8

```sh
go get github.com/MontFerret/ferret
```

### Environment
In order to use all Ferret features, you will need to have Chrome either installed locally or running in Docker.
For ease of use, we recommend running Chromium inside a Docker container.
You can probably use most Chromium-based headless images, but we've put together [an image that's ready to go](https://github.com/MontFerret/chromium):

```sh
docker pull montferret/chromium
docker run -d -p 9222:9222 montferret/chromium
```

If you'd rather see what's happening during query execution, just launch Chrome from your host with the remote debugging port set:
```sh
chrome.exe --remote-debugging-port=9222
```
## Quick start
### Browserless mode
If you want to try out ```fql```, you can get started without Chrome or a Chromium container.
Executing the `ferret` CLI without any options will open `ferret` in REPL mode.

```
ferret
```

```shell
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT('https://news.ycombinator.com/')
>FOR post IN ELEMENTS(doc, '.storylink')
>RETURN post.attributes.href
>%
```

**Note:** the ```%``` symbol is used to start and end multi-line queries. You can also use the heredoc format.

If you want to execute a query stored in a file, just pass a file name:

```
ferret ./docs/examples/static-page.fql
```

You can also pipe or redirect the query into standard input:

```
cat ./docs/examples/static-page.fql | ferret
```

```
ferret < ./docs/examples/static-page.fql
```
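
Because the CLI accepts a query on standard input, it can also be driven from another program. The following is a minimal sketch, assuming the `ferret` binary is available on your `PATH`; `newFerretCmd` is a hypothetical helper written for this illustration, and the only flag it uses is `--cdp`, described under "Browser mode" in this README.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// newFerretCmd prepares (but does not start) a `ferret` invocation that reads
// the query from standard input. An empty cdpAddr leaves the CLI default.
func newFerretCmd(query, cdpAddr string) *exec.Cmd {
	var args []string
	if cdpAddr != "" {
		args = append(args, "--cdp", cdpAddr)
	}
	cmd := exec.Command("ferret", args...)
	cmd.Stdin = strings.NewReader(query)
	return cmd
}

func main() {
	// A trivial query that needs neither a browser nor network access.
	cmd := newFerretCmd(`RETURN 1 + 1`, "")

	out, err := cmd.Output()
	if err != nil {
		fmt.Println("could not run ferret:", err)
		return
	}
	fmt.Printf("%s", out)
}
```

Keeping command construction separate from execution makes it easy to inspect the arguments before anything is run.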
### Browser mode
By default, ``ferret`` loads HTML pages directly via the HTTP protocol, because it's faster.
However, more and more websites are rendered with JavaScript, and this 'old school' approach does not really work for them.
For these dynamic websites, you can fetch documents using Chrome or Chromium via the Chrome DevTools Protocol (aka CDP).
First, make sure that you launched Chrome with the ```--remote-debugging-port=9222``` flag (see "Environment" in this README for instructions on setting this up).
Second, pass the address to the ```ferret``` CLI.

```
ferret --cdp http://127.0.0.1:9222
```

**NOTE:** By default, ```ferret``` will try to use this local address.
You only need to explicitly pass the parameter if you are using a different port number or a remote address.

Alternatively, you can tell the CLI to launch Chrome for you.

```shell
ferret --cdp-launch
```

Once ```ferret``` knows how to communicate with Chrome, you can use the ```DOCUMENT``` function, passing ```{driver: "cdp"}``` as its second argument for dynamic pages:

```shell
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT('https://soundcloud.com/charts/top', { driver: "cdp" })
>WAIT_ELEMENT(doc, '.chartTrack__details', 5000)
>LET tracks = ELEMENTS(doc, '.chartTrack__details')
>FOR track IN tracks
>    LET username = ELEMENT(track, '.chartTrack__username')
>    LET title = ELEMENT(track, '.chartTrack__title')
>    RETURN {
>       artist: username.innerText,
>        track: title.innerText
>    }
>%
```
```shell
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT("https://github.com/", { driver: "cdp" })
>LET btn = ELEMENT(doc, ".HeaderMenu a")
>CLICK(btn)
>WAIT_NAVIGATION(doc)
>WAIT_ELEMENT(doc, '.IconNav')
>FOR el IN ELEMENTS(doc, '.IconNav a')
> RETURN TRIM(el.innerText)
>%
```
### Embedded mode
```ferret``` is a very modular system.
It can be embedded into your Go application in only a few lines of code.
Here is an example of a short Go application that defines an `fql` query, compiles it, executes it, then returns the results.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/cdp"
	"github.com/MontFerret/ferret/pkg/drivers/http"
)

type Topic struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	URL         string `json:"url"`
}

func main() {
	topics, err := getTopTenTrendingTopics()

	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	for _, topic := range topics {
		fmt.Printf("%s: %s %s\n", topic.Name, topic.Description, topic.URL)
	}
}

func getTopTenTrendingTopics() ([]*Topic, error) {
	query := `
		LET doc = DOCUMENT("https://github.com/topics")

		FOR el IN ELEMENTS(doc, ".py-4.border-bottom")
			LIMIT 10
			LET url = ELEMENT(el, "a")
			LET name = ELEMENT(el, ".f3")
			LET description = ELEMENT(el, ".f5")

			RETURN {
				name: TRIM(name.innerText),
				description: TRIM(description.innerText),
				url: "https://github.com" + url.attributes.href
			}
	`

	comp := compiler.New()

	program, err := comp.Compile(query)

	if err != nil {
		return nil, err
	}

	// create a root context
	ctx := context.Background()

	// enable HTML drivers
	// by default, Ferret Runtime does not know about any HTML drivers
	// all HTML manipulations are done via functions from standard library
	// that assume that at least one driver is available
	ctx = drivers.WithContext(ctx, cdp.NewDriver())
	ctx = drivers.WithContext(ctx, http.NewDriver(), drivers.AsDefault())

	out, err := program.Run(ctx)

	if err != nil {
		return nil, err
	}

	res := make([]*Topic, 0, 10)

	err = json.Unmarshal(out, &res)

	if err != nil {
		return nil, err
	}

	return res, nil
}
```
## Extras
### Extensibility

With ```ferret```'s modular system, you can also extend its standard library.
In this example, we define a `transform` function in Go, then register that function with ```ferret```, making it available for use in ```fql``` queries.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/runtime/core"
	"github.com/MontFerret/ferret/pkg/runtime/values"
)

func main() {
	strs, err := getStrings()

	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	for _, str := range strs {
		fmt.Println(str)
	}
}

func getStrings() ([]string, error) {
	// this function has the signature that ferret expects of a runtime function
	transform := func(ctx context.Context, args ...core.Value) (core.Value, error) {
		// helper that validates the number of passed args
		err := core.ValidateArgs(args, 1, 1)

		if err != nil {
			// it's recommended to return the built-in None type instead of nil
			return values.None, err
		}

		// another helper that validates the type of an argument
		err = core.ValidateType(args[0], core.StringType)

		if err != nil {
			return values.None, err
		}

		// cast to the built-in string type
		str := args[0].(values.String)

		return values.NewString(strings.ToUpper(str.String() + "_ferret")), nil
	}

	query := `
		FOR el IN ["foo", "bar", "qaz"]
			// conventionally all functions are registered in upper case
			RETURN TRANSFORM(el)
	`

	comp := compiler.New()

	if err := comp.RegisterFunction("transform", transform); err != nil {
		return nil, err
	}

	program, err := comp.Compile(query)

	if err != nil {
		return nil, err
	}

	out, err := program.Run(context.Background())

	if err != nil {
		return nil, err
	}

	res := make([]string, 0, 3)

	err = json.Unmarshal(out, &res)

	if err != nil {
		return nil, err
	}

	return res, nil
}
```

You can completely turn off the ```ferret``` standard library, as follows:

```go
comp := compiler.New(compiler.WithoutStdlib())
```

After disabling ```stdlib```, you can register your own implementations of functions from the standard library.

If you only need a subset of the ```stdlib``` functions, you can disable the entire ```stdlib``` and then register only the individual packages that you need:

```go
package main

import (
	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/stdlib/strings"
)

func main() {
	comp := compiler.New(compiler.WithoutStdlib())

	comp.RegisterFunctions(strings.NewLib())
}
```

### Proxy

By default, ```ferret``` does not attempt to use a proxy. This is due to the inability of CDP-compatible browsers to use an arbitrary proxy. If you need to use a proxy, it should be defined while launching the browser.

However, if you are querying static pages, you can define a proxy while launching ```ferret``` from the CLI or from embedded applications.

#### CLI example

```sh
ferret --proxy=http://localhost:8888 my-query.fql
```

#### Embedded example

```go
package main

import (
	"context"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/http"
)

func run(q string) ([]byte, error) {
	proxy := "http://localhost:8888"
	comp := compiler.New()
	program := comp.MustCompile(q)

	// create a root context
	ctx := context.Background()

	// we inform the driver what proxy to use
	ctx = drivers.WithContext(ctx, http.NewDriver(http.WithProxy(proxy)), drivers.AsDefault())

	return program.Run(ctx)
}
```
### Cookies
#### Get, Set, Delete
For more precise work, you can set/get/delete cookies manually before and after loading the page:
```
LET doc = DOCUMENT("https://www.google.com", {
    driver: "cdp",
    cookies: [
        {
            name: "foo",
            value: "bar"
        }
    ]
})

COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
COOKIE_DEL(doc, "foo")

LET c = COOKIE_GET(doc, "baz")

FOR cookie IN doc.cookies
    RETURN cookie.name
```
#### Access previously-set cookies (non-incognito mode)

By default, the ``CDP`` driver executes each query in incognito mode in order to avoid collisions with cookies persisted by previous queries.
However, sometimes you might want access to persisted cookies (e.g. to avoid re-authenticating with a site).
In order to do that, you need to configure the driver to execute all queries in non-incognito tabs.
Here is how to do that:

##### CLI example

```sh
ferret --cdp-keep-cookies my-query.fql
```

##### Embedded example

```go
package main

import (
	"context"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/cdp"
)

func run(q string) ([]byte, error) {
	comp := compiler.New()
	program := comp.MustCompile(q)

	// create a root context
	ctx := context.Background()

	// we inform the driver to keep cookies between queries
	ctx = drivers.WithContext(
		ctx,
		cdp.NewDriver(cdp.WithKeepCookies()),
		drivers.AsDefault(),
	)

	return program.Run(ctx)
}
```
##### Query
```
LET doc = DOCUMENT("https://www.google.com", {
    driver: "cdp",
    keepCookies: true
})
```
### File System

```ferret``` can also read from and write to the file system.

#### Write example

```
USE IO::FS
LET favicon = DOWNLOAD("https://www.google.com/favicon.ico")
RETURN WRITE("google.favicon.ico", favicon)
```

#### Read example

```
USE IO::FS
LET urls_data = READ("urls.json")
LET urls = JSON_PARSE(urls_data)
FOR url IN urls
RETURN DOCUMENT(url)
```
## References
Further documentation is available [at our website](https://www.montferret.dev/docs/introduction/).