1
0
mirror of https://github.com/MontFerret/ferret.git synced 2025-01-22 03:39:08 +02:00
ferret/README.md

561 lines
14 KiB
Markdown
Raw Normal View History

2018-09-18 16:42:38 -04:00
# Ferret
<p align="center">
<a href="https://goreportcard.com/report/github.com/MontFerret/ferret">
<img alt="Go Report Status" src="https://goreportcard.com/badge/github.com/MontFerret/ferret">
</a>
<a href="https://travis-ci.com/MontFerret/ferret">
<img alt="Build Status" src="https://travis-ci.com/MontFerret/ferret.svg?branch=master">
</a>
<a href="https://codecov.io/gh/MontFerret/ferret">
<img src="https://codecov.io/gh/MontFerret/ferret/branch/master/graph/badge.svg" />
</a>
<a href="https://discord.gg/kzet32U">
<img alt="Discord Chat" src="https://img.shields.io/discord/501533080880676864.svg">
</a>
<a href="https://github.com/MontFerret/ferret/releases">
<img alt="Ferret release" src="https://img.shields.io/github/release/MontFerret/ferret.svg">
</a>
<a href="http://opensource.org/licenses/MIT">
<img alt="MIT License" src="http://img.shields.io/badge/license-MIT-brightgreen.svg">
</a>
</p>
2018-09-25 18:12:12 -04:00
![ferret](https://raw.githubusercontent.com/MontFerret/ferret/master/assets/intro.jpg)
2018-09-21 23:48:35 -04:00
## What is it?
```ferret``` is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more.
```ferret``` allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language.
It is extremely portable, extensible and fast.
2018-09-18 21:41:16 -04:00
2019-02-08 13:47:04 -05:00
[Read the introductory blog post about Ferret here!](https://medium.com/@ziflex/say-hello-to-ferret-a-modern-web-scraping-tool-5c9cc85ba183)
2018-09-28 00:34:24 -04:00
## Show me some code
2018-09-28 13:43:29 -04:00
The following example demonstrates the use of dynamic pages.
We load the main Google Search page, type search criteria into an input box and then click a search button.
The click action triggers a redirect, so we wait until its end.
2018-10-05 14:47:22 +05:30
Once the page gets loaded, we iterate over all elements in search results and assign the output to a variable.
2018-09-27 19:43:33 -04:00
The final for loop filters out empty elements that might be because of inaccurate use of selectors.
2018-09-27 13:37:37 -04:00
```aql
2019-07-23 14:55:26 -04:00
LET google = DOCUMENT("https://www.google.com/", {
2019-08-04 14:01:03 -04:00
driver: "cdp",
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
2019-07-23 14:55:26 -04:00
})
2018-09-27 19:43:33 -04:00
2019-08-04 14:01:03 -04:00
INPUT(google, 'input[name="q"]', "ferret")
CLICK(google, 'input[name="btnK"]')
2018-09-27 19:43:33 -04:00
WAIT_NAVIGATION(google)
2018-09-27 19:43:33 -04:00
2018-10-08 23:28:15 -04:00
FOR result IN ELEMENTS(google, '.g')
// filter out extra elements like videos and 'People also ask'
FILTER TRIM(result.attributes.class) == 'g'
RETURN {
title: INNER_TEXT(result, 'h3'),
description: INNER_TEXT(result, '.st'),
url: INNER_TEXT(result, 'cite')
}
2018-09-27 13:37:37 -04:00
```
2018-10-05 23:36:23 -04:00
More examples you can find [here](./examples)
2018-09-18 21:41:16 -04:00
## Features
* Declarative language
2018-09-21 23:48:35 -04:00
* Support of both static and dynamic web pages
2018-09-18 21:41:16 -04:00
* Embeddable
* Extensible
2018-09-18 16:42:38 -04:00
## Motivation
2018-09-18 21:47:54 -04:00
Nowadays data is everything and who owns data - owns the world.
I have worked on multiple data-driven projects where data was an essential part of a system and I realized how cumbersome writing tons of scrapers is.
After some time looking for a tool that would let me to not write a code, but just express what data I need, decided to come up with my own solution.
2018-10-05 14:47:22 +05:30
```ferret``` project is an ambitious initiative trying to bring the universal platform for writing scrapers without any hassle.
2018-09-23 02:34:26 -04:00
2018-09-18 16:42:38 -04:00
## Inspiration
2018-09-18 21:47:54 -04:00
FQL (Ferret Query Language) is heavily inspired by [AQL](https://www.arangodb.com/) (ArangoDB Query Language).
But due to the domain specifics, there are some differences in how things work.
2018-09-18 16:42:38 -04:00
## WIP
2018-10-05 14:47:22 +05:30
Be aware, that the project is under heavy development. There is no documentation and some things may change in the final release.
2018-10-03 00:10:11 +02:00
For query syntax, you may go to [ArangoDB web site](https://docs.arangodb.com/3.3/AQL/index.html) and use AQL docs as docs for FQL - since they are identical.
2018-09-18 16:42:38 -04:00
2018-09-21 23:48:35 -04:00
## Installation
2018-10-13 21:58:05 -04:00
### Binary
You can download latest binaries from [here](https://github.com/MontFerret/ferret/releases).
### Source code
2018-10-05 10:10:19 -04:00
#### Production
* Go >=1.11
2018-10-05 10:10:19 -04:00
* Chrome or Docker
#### Development
2018-09-21 23:48:35 -04:00
* GNU Make
2018-10-05 10:10:19 -04:00
* ANTLR4 >=4.7.1
2018-09-21 23:48:35 -04:00
```sh
2018-10-04 22:16:03 -04:00
go get github.com/MontFerret/ferret
2018-09-21 23:48:35 -04:00
```
2018-10-13 21:58:05 -04:00
## Environment
In order to use all Ferret features, you will need to have Chrome either installed locally or running in Docker.
For ease of use we recommend to run Chrome inside a Docker container:
2018-09-28 23:47:40 -04:00
```sh
2019-03-06 22:14:57 -05:00
docker pull alpeware/chrome-headless-stable
docker run -d -p=0.0.0.0:9222:9222 --name=chrome-headless -v /tmp/chromedata/:/data alpeware/chrome-headless-stable
2018-09-28 23:47:40 -04:00
```
2018-10-04 16:19:26 -04:00
But if you want to see what's happening during query execution, just start your Chrome with remote debugging port:
```sh
chrome.exe --remote-debugging-port=9222
```
2018-09-18 21:41:16 -04:00
## Quick start
2018-09-18 16:42:38 -04:00
### Browserless mode
2018-09-18 21:41:16 -04:00
If you want to play with ```fql``` and check its syntax, you can run CLI with the following commands:
2018-09-18 16:42:38 -04:00
```
2018-10-04 22:16:03 -04:00
ferret
2018-09-18 16:42:38 -04:00
```
2018-09-18 21:41:16 -04:00
```ferret``` will run in REPL mode.
2018-09-18 16:42:38 -04:00
```shell
Welcome to Ferret REPL
Please use `Ctrl-D` to exit this program.
2018-09-23 02:34:26 -04:00
>%
>LET doc = DOCUMENT('https://news.ycombinator.com/')
>FOR post IN ELEMENTS(doc, '.storylink')
2018-09-18 16:42:38 -04:00
>RETURN post.attributes.href
2018-09-23 02:34:26 -04:00
>%
2018-09-18 16:42:38 -04:00
```
2018-10-05 14:47:22 +05:30
**Note:** symbol ```%``` is used to start and end multi-line queries. You also can use the heredoc format.
2018-09-18 16:42:38 -04:00
2018-09-18 21:41:16 -04:00
If you want to execute a query stored in a file, just pass a file name:
2018-09-18 16:42:38 -04:00
```
2018-10-04 22:17:33 -04:00
ferret ./docs/examples/static-page.fql
2018-09-18 16:42:38 -04:00
```
2018-09-23 02:34:26 -04:00
```
2018-10-04 22:17:33 -04:00
cat ./docs/examples/static-page.fql | ferret
2018-09-23 02:34:26 -04:00
```
```
2018-10-04 22:17:33 -04:00
ferret < ./docs/examples/static-page.fql
2018-09-23 02:34:26 -04:00
```
2018-09-18 16:42:38 -04:00
### Browser mode
2018-10-05 14:47:22 +05:30
By default, ``ferret`` loads HTML pages via HTTP protocol, because it's faster.
2018-09-18 21:47:54 -04:00
But nowadays, there are more and more websites rendered with JavaScript, and therefore, this 'old school' approach does not really work.
For such cases, you may fetch documents using Chrome or Chromium via Chrome DevTools protocol (aka CDP).
First, you need to make sure that you launched Chrome with ```remote-debugging-port=9222``` flag.
Second, you need to pass the address to ```ferret``` CLI.
2018-09-18 16:42:38 -04:00
2018-09-18 21:41:16 -04:00
```
2018-10-04 22:16:03 -04:00
ferret --cdp http://127.0.0.1:9222
2018-09-18 16:42:38 -04:00
```
2018-09-18 21:47:54 -04:00
**NOTE:** By default, ```ferret``` will try to use this local address as a default one, so it makes sense to explicitly pass the parameter only in case of either different port number or remote address.
2018-09-18 16:42:38 -04:00
2018-09-18 21:41:16 -04:00
Alternatively, you can tell CLI to launch Chrome for you.
2018-09-18 16:42:38 -04:00
2018-09-18 21:41:16 -04:00
```shell
2018-10-04 22:16:03 -04:00
ferret --cdp-launch
2018-09-18 16:42:38 -04:00
```
2018-09-18 21:47:54 -04:00
**NOTE:** Launch command is currently broken on MacOS.
2018-09-18 21:41:16 -04:00
2018-09-23 04:33:20 -04:00
Once ```ferret``` knows how to communicate with Chrome, you can use a function ```DOCUMENT(url, isDynamic)``` with ```true``` boolean value for dynamic pages:
2018-09-18 16:42:38 -04:00
```shell
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
2018-09-23 02:34:26 -04:00
>%
2019-07-23 14:55:26 -04:00
>LET doc = DOCUMENT('https://soundcloud.com/charts/top', { driver: "cdp" })
2018-09-23 04:33:20 -04:00
>WAIT_ELEMENT(doc, '.chartTrack__details', 5000)
2018-09-18 16:42:38 -04:00
>LET tracks = ELEMENTS(doc, '.chartTrack__details')
>FOR track IN tracks
> LET username = ELEMENT(track, '.chartTrack__username')
> LET title = ELEMENT(track, '.chartTrack__title')
> RETURN {
> artist: username.innerText,
> track: title.innerText
> }
2018-09-23 02:34:26 -04:00
>%
2018-09-18 21:41:16 -04:00
```
2018-09-25 18:12:12 -04:00
```shell
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
2019-07-23 14:55:26 -04:00
>LET doc = DOCUMENT("https://github.com/", { driver: "cdp" })
2018-09-25 18:12:12 -04:00
>LET btn = ELEMENT(doc, ".HeaderMenu a")
>CLICK(btn)
>WAIT_NAVIGATION(doc)
>WAIT_ELEMENT(doc, '.IconNav')
>FOR el IN ELEMENTS(doc, '.IconNav a')
> RETURN TRIM(el.innerText)
>%
```
2018-09-18 21:41:16 -04:00
### Embedded mode
2018-09-18 21:47:54 -04:00
```ferret``` is a very modular system and therefore, can be easily be embedded into your Go application.
2018-09-18 21:41:16 -04:00
```go
package main
import (
"context"
"encoding/json"
"fmt"
"os"
2018-12-22 11:24:06 -05:00
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/drivers"
"github.com/MontFerret/ferret/pkg/drivers/cdp"
"github.com/MontFerret/ferret/pkg/drivers/http"
2018-09-18 21:41:16 -04:00
)
type Topic struct {
Name string `json:"name"`
Description string `json:"description"`
URL string `json:"url"`
2018-09-18 21:41:16 -04:00
}
func main() {
topics, err := getTopTenTrendingTopics()
if err != nil {
fmt.Println(err)
os.Exit(1)
}
for _, topic := range topics {
fmt.Println(fmt.Sprintf("%s: %s %s", topic.Name, topic.Description, topic.URL))
2018-09-18 21:41:16 -04:00
}
}
func getTopTenTrendingTopics() ([]*Topic, error) {
query := `
LET doc = DOCUMENT("https://github.com/topics")
FOR el IN ELEMENTS(doc, ".py-4.border-bottom")
LIMIT 10
LET url = ELEMENT(el, "a")
LET name = ELEMENT(el, ".f3")
LET desc = ELEMENT(el, ".f5")
RETURN {
name: TRIM(name.innerText),
description: TRIM(desc.innerText),
url: "https://github.com" + url.attributes.href
}
`
comp := compiler.New()
program, err := comp.Compile(query)
if err != nil {
return nil, err
}
// create a root context
ctx := context.Background()
// enable HTML drivers
2018-12-22 11:24:06 -05:00
// by default, Ferret Runtime does not know about any HTML drivers
// all HTML manipulations are done via functions from standard library
2018-12-22 11:24:06 -05:00
// that assume that at least one driver is available
2019-02-21 09:46:36 -05:00
ctx = drivers.WithContext(ctx, cdp.NewDriver())
ctx = drivers.WithContext(ctx, http.NewDriver(), drivers.AsDefault())
out, err := program.Run(ctx)
2018-09-18 21:41:16 -04:00
if err != nil {
return nil, err
}
res := make([]*Topic, 0, 10)
err = json.Unmarshal(out, &res)
if err != nil {
return nil, err
}
return res, nil
}
2018-09-18 21:41:16 -04:00
```
## Extensibility
2018-09-18 21:47:54 -04:00
That said, ```ferret``` is a very modular system which also allows not only embed it, but extend its standard library.
2018-09-18 21:41:16 -04:00
2018-09-18 21:49:26 -04:00
```go
2018-09-18 21:41:16 -04:00
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"strings"
2018-09-18 21:41:16 -04:00
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/runtime/core"
"github.com/MontFerret/ferret/pkg/runtime/values"
)
func main() {
strs, err := getStrings()
if err != nil {
fmt.Println(err)
os.Exit(1)
}
for _, str := range strs {
fmt.Println(str)
}
}
func getStrings() ([]string, error) {
// function implements is a type of a function that ferret supports as a runtime function
transform := func(ctx context.Context, args ...core.Value) (core.Value, error) {
// it's just a helper function which helps to validate a number of passed args
err := core.ValidateArgs(args, 1, 1)
2018-09-18 21:41:16 -04:00
if err != nil {
// it's recommended to return built-in None type, instead of nil
return values.None, err
}
// this is another helper functions allowing to do type validation
err = core.ValidateType(args[0], core.StringType)
if err != nil {
return values.None, err
}
// cast to built-in string type
str := args[0].(values.String)
return values.NewString(strings.ToUpper(str.String() + "_ferret")), nil
2018-09-18 21:41:16 -04:00
}
query := `
FOR el IN ["foo", "bar", "qaz"]
// conventionally all functions are registered in upper case
RETURN TRANSFORM(el)
`
comp := compiler.New()
if err := comp.RegisterFunction("transform", transform); err != nil {
return nil, err
}
2018-09-18 21:41:16 -04:00
program, err := comp.Compile(query)
if err != nil {
return nil, err
}
out, err := program.Run(context.Background())
if err != nil {
return nil, err
}
res := make([]string, 0, 3)
err = json.Unmarshal(out, &res)
if err != nil {
return nil, err
}
return res, nil
}
```
2018-10-05 14:47:22 +05:30
On top of that, you can completely turn off the standard library, bypassing the following option:
2018-09-18 21:41:16 -04:00
```go
comp := compiler.New(compiler.WithoutStdlib())
```
2018-09-18 21:47:54 -04:00
And after that, you can easily provide your own implementation of functions from standard library.
2018-09-18 21:41:16 -04:00
2018-09-18 21:47:54 -04:00
If you don't need a particular set of functions from standard library, you can turn off the entire ```stdlib``` and register separate packages from that:
2018-09-18 21:41:16 -04:00
```go
package main
import (
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/stdlib/strings"
)
func main() {
comp := compiler.New(compiler.WithoutStdlib())
comp.RegisterFunctions(strings.NewLib())
}
```
2018-12-10 14:09:25 -05:00
## Proxy
2018-12-10 14:17:19 -05:00
By default, Ferret does not use any proxies. Partially, due to inability to force Chrome/Chromium (or any other Chrome Devtools Protocol compatible browser) to use a prticular proxy. It should be done during a browser launch.
But you can pass an address of a proxy server you want to use for static pages.
2018-12-10 14:09:25 -05:00
#### CLI
```sh
ferret --proxy=http://localhost:8888 my-query.fql
```
#### Code
```go
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"github.com/MontFerret/ferret/pkg/compiler"
2018-12-22 11:24:06 -05:00
"github.com/MontFerret/ferret/pkg/drivers"
"github.com/MontFerret/ferret/pkg/drivers/http"
2018-12-10 14:09:25 -05:00
)
func run(q string) ([]byte, error) {
2018-12-22 11:24:06 -05:00
proxy := "http://localhost:8888"
comp := compiler.New()
2019-03-02 21:00:57 +03:00
program := comp.MustCompile(q)
2018-12-10 14:09:25 -05:00
2019-03-02 21:00:57 +03:00
// create a root context
ctx := context.Background()
2019-02-21 19:08:33 -05:00
2019-03-02 21:00:57 +03:00
// we inform the driver what proxy to use
ctx = drivers.WithContext(ctx, http.NewDriver(http.WithProxy(proxy)), drivers.AsDefault())
2018-12-10 14:09:25 -05:00
2019-03-02 21:00:57 +03:00
return program.Run(ctx)
2018-12-10 14:09:25 -05:00
}
```
## Cookies
### Non-incognito mode
By default, ``CDP`` driver execute each query in an incognito mode in order to avoid any collisions related to some persisted cookies from previous queries.
However, sometimes it might not be a desirable behavior and a query needs to be executed within a Chrome tab with earlier persisted cookies.
In order to do that, we need to inform the driver to execute all queries in regular tabs. Here is how to do that:
#### CLI
```sh
ferret --cdp-keep-cookies my-query.fql
```
#### Code
```go
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/drivers"
"github.com/MontFerret/ferret/pkg/drivers/cdp"
)
func run(q string) ([]byte, error) {
comp := compiler.New()
program := comp.MustCompile(q)
// create a root context
ctx := context.Background()
// we inform the driver to keep cookies between queries
ctx = drivers.WithContext(
ctx,
cdp.NewDriver(cdp.WithKeepCookies()),
drivers.AsDefault(),
)
return program.Run(ctx)
}
```
#### Query
```
LET doc = DOCUMENT("https://www.google.com", {
driver: "cdp",
keepCookies: true
})
```
### Cookies manipulation
For more precise work, you can set/get/delete cookies manually during and after page load:
```
LET doc = DOCUMENT("https://www.google.com", {
driver: "cdp",
cookies: [
{
name: "foo",
value: "bar"
}
]
})
COOKIE_SET(doc, { name: "baz", value: "qaz"}, { name: "daz", value: "gag" })
COOKIE_DEL(doc, "foo")
LET c = COOKIE_GET(doc, "baz")
FOR cookie IN doc.cookies
RETURN cookie.name
```
## Sponsors
Support this project by becoming a sponsor. Your logo will show up here with a link to your website.
2019-05-14 12:56:44 -04:00
<p>
<a href="https://opencollective.com/ferret/sponsor/0/website" target="_blank"><img src="https://opencollective.com/ferret/sponsor/0/avatar.svg"></a>
</p>
<p align="center">
<a href="https://opencollective.com/ferret/donate" target="_blank">
<img src="https://opencollective.com/ferret/donate/button@2x.png?color=blue" width="300" />
</a>
</p>