* add pkg/stdlib/objects Length function * rename lenght.go -> length.go * fix tests according to other tests * add new tests to length tests * delete objects method Length * add objects method Has * add objects function Keys * small fixes in Keys and Has functions * change Has function * unit tests for Keys function * add unit tests for merge. also little change in lib.go * add doc to Keys function * Merge function prototype * add unit tests for KEEP function * added KEEP function * added doc for KEYS function * update lib.go * update lib.go * upd merge prototype * addded isEqualObjects function to objects tests * change object method Compare * added unit tests for Compare method * changed Compare method * fix Compare method * rename method Clone to Copy * added Cloneable interface * added Value to Cloneable interface * implemented Cloneable intefrace by array * added some more unit tests for values.Array * fix values.Array.Compare method * added one more unit test * implemented Cloneable interface by Object * unit tests for Object.Clone * move core.IsCloneable to value.go * change Clone function * move IsClonable to package values * updated MERGE unit tests * added MERGE function * added MERGE to lib * added one more test * changed MERGE function * rewrite a few comments according to Go Best Practices * rewrite comments * fix bug when result of the KEEP function was dependent on source object * some more changes in KEEP function * init VALUES function * push test with bug * add stress test * small changes in stress tests * changes in object.Comapare * change object.Compare * add more tests for object.Compare * added comments to object.Compare function * change object.Comapare * delete useless comment * one more change in object.Compare * init datetime * added test for datetime * added lib.go * add helpers functions * made values.DefaultTimeLayout public * added DATE function * added DATE_DAYOFWEEK function * added DATE_YEAR function * added DATE_MONTH function * 
added one more testCase for DATE_MONTH * added DATE_DAY function * added DateDay to lib * added DATE_HOUR, DATE_MINUTE and DATE_SECOND functions * added DATE_DAYOFYEAR, DATE_LEAPYEAR, DATE_MILLISECOND functions * fix names in tests * one more case into dayofyear_test * added DATE_QUARTER function * added DATE_DAYS_IN_MONTH function * added DATE_FORMAT function * added -v flag into go test * update DATE_FORMAT test cases * added one more test case * add helpers functions * made values.DefaultTimeLayout public * added DATE function * added DATE_DAYOFWEEK function * added DATE_YEAR function * added DATE_MONTH function * added one more testCase for DATE_MONTH * added DATE_DAY function * added DateDay to lib * added DATE_HOUR, DATE_MINUTE and DATE_SECOND functions * added DATE_DAYOFYEAR, DATE_LEAPYEAR, DATE_MILLISECOND functions * fix names in tests * one more case into dayofyear_test * added DATE_QUARTER function * added DATE_DAYS_IN_MONTH function * added DATE_FORMAT function * added -v flag into go test * Set codecov support for all branches * update DATE_FORMAT test cases * Updated codecov settings * Added panic recovery mechanism (#158) * Bump github.com/mafredri/cdp from 0.19.0 to 0.20.0 (#159) Bumps [github.com/mafredri/cdp](https://github.com/mafredri/cdp) from 0.19.0 to 0.20.0. - [Release notes](https://github.com/mafredri/cdp/releases) - [Commits](https://github.com/mafredri/cdp/compare/v0.19.0...v0.20.0) Signed-off-by: dependabot[bot] <support@dependabot.com> * Bump github.com/gofrs/uuid from 3.1.1 to 3.1.2 (#160) Bumps [github.com/gofrs/uuid](https://github.com/gofrs/uuid) from 3.1.1 to 3.1.2. 
- [Release notes](https://github.com/gofrs/uuid/releases) - [Commits](https://github.com/gofrs/uuid/compare/v3.1.1...v3.1.2) Signed-off-by: dependabot[bot] <support@dependabot.com> * added one more test case * sorter instead Compare now * rename utils.LOG -> utils.PRINT * rename utils.Logs -> utils.Print * added DATE_ADD, DATE_SUBTRACT functions * use keyed fields now * added DATE_DIFF function * delete unused var * delete useless type cast * fixed a bug when adding/subtrating did not take an amount of units * added DateCompare function * renames * fix small bug * fix * init autocompleter * init autocomplete * delete init tokens and add fql.LiteralNames in autocomplete
Ferret
What is it?
ferret
is a web scraping system that aims to simplify data extraction from the web for things such as UI testing, machine learning and analytics.
Having its own declarative language, ferret
abstracts away technical details and complexity of the underlying technologies, helping to focus on the data itself.
It's extremely portable, extensible and fast.
Show me some code
The following example demonstrates the use of dynamic pages.
First of all, we load the main Google Search page, type search criteria into an input box and then click a search button.
The click action triggers a redirect, so we wait until the navigation completes.
Once the page is loaded, we iterate over all elements in the search results and assign the output to a variable.
The final for loop filters out empty elements that may appear when selectors are imprecise.
LET google = DOCUMENT("https://www.google.com/", true)
INPUT(google, 'input[name="q"]', "ferret", 25)
CLICK(google, 'input[name="btnK"]')
WAIT_NAVIGATION(google)
FOR result IN ELEMENTS(google, '.g')
// filter out extra elements like videos and 'People also ask'
FILTER TRIM(result.attributes.class) == 'g'
RETURN {
title: INNER_TEXT(result, 'h3'),
description: INNER_TEXT(result, '.st'),
url: INNER_TEXT(result, 'cite')
}
You can find more examples here
Features
- Declarative language
- Support of both static and dynamic web pages
- Embeddable
- Extensible
Motivation
Nowadays data is everything, and whoever owns data owns the world.
I have worked on multiple data-driven projects where data was an essential part of the system, and I realized how cumbersome writing tons of scrapers is.
After spending some time looking for a tool that would let me simply express what data I need instead of writing code, I decided to come up with my own solution.
ferret
project is an ambitious initiative that aims to provide a universal platform for writing scrapers without any hassle.
Inspiration
FQL (Ferret Query Language) is heavily inspired by AQL (ArangoDB Query Language).
But due to the domain specifics, there are some differences in how things work.
WIP
Be aware that the project is under heavy development. There is no documentation yet, and some things may change in the final release.
For query syntax, you may refer to the AQL docs on the ArangoDB website and use them as docs for FQL, since the two languages are largely the same.
Installation
Binary
You can download the latest binaries from here.
Source code
Production
- Go >=1.10
- Chrome or Docker
Development
- GoDep
- GNU Make
- ANTLR4 >=4.7.1
go get github.com/MontFerret/ferret
Environment
To use all Ferret features, you need Chrome either installed locally or running in Docker. For ease of use, we recommend running Chrome inside a Docker container:
docker pull alpeware/chrome-headless-trunk
docker run -d -p=0.0.0.0:9222:9222 --name=chrome-headless -v /tmp/chromedata/:/data alpeware/chrome-headless-trunk
But if you want to see what's happening during query execution, just start Chrome with the remote debugging port enabled:
chrome.exe --remote-debugging-port=9222
Quick start
Browserless mode
If you want to play with fql
and check its syntax, you can run the CLI with the following command:
ferret
ferret
will run in REPL mode.
Welcome to Ferret REPL
Please use `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT('https://news.ycombinator.com/')
>FOR post IN ELEMENTS(doc, '.storylink')
>RETURN post.attributes.href
>%
Note: symbol %
is used to start and end multi-line queries. You can also use the heredoc format.
If you want to execute a query stored in a file, just pass a file name:
ferret ./docs/examples/static-page.fql
cat ./docs/examples/static-page.fql | ferret
ferret < ./docs/examples/static-page.fql
Browser mode
By default, ferret
loads HTML pages via HTTP protocol, because it's faster.
But nowadays, there are more and more websites rendered with JavaScript, and therefore, this 'old school' approach does not really work.
For such cases, you may fetch documents using Chrome or Chromium via Chrome DevTools protocol (aka CDP).
First, you need to make sure that you launched Chrome with the --remote-debugging-port=9222
flag.
Second, you need to pass the address to ferret
CLI.
ferret --cdp http://127.0.0.1:9222
NOTE: By default, ferret
uses this local address, so you only need to pass the parameter explicitly when Chrome listens on a different port or on a remote address.
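Before pointing the CLI at a CDP address, it can help to confirm that Chrome is actually listening there. The sketch below is independent of Ferret and uses only the Go standard library: it queries the `/json/version` HTTP endpoint that Chrome exposes on its remote debugging port. The names `versionURL`, `probe` and `versionInfo` are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// versionInfo holds the subset of /json/version fields read here.
type versionInfo struct {
	Browser              string `json:"Browser"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// versionURL builds the version endpoint URL for a CDP address.
func versionURL(cdpAddr string) (string, error) {
	u, err := url.Parse(cdpAddr)
	if err != nil {
		return "", err
	}
	u.Path = "/json/version"
	return u.String(), nil
}

// probe asks the browser at cdpAddr for its version info.
func probe(cdpAddr string) (*versionInfo, error) {
	endpoint, err := versionURL(cdpAddr)
	if err != nil {
		return nil, err
	}
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var info versionInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return nil, err
	}
	return &info, nil
}

func main() {
	info, err := probe("http://127.0.0.1:9222")
	if err != nil {
		fmt.Println("Chrome is not reachable:", err)
		return
	}
	fmt.Println("connected to", info.Browser)
}
```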
Alternatively, you can tell CLI to launch Chrome for you.
ferret --cdp-launch
NOTE: The launch command is currently broken on macOS.
Once ferret
knows how to communicate with Chrome, you can use the function DOCUMENT(url, isDynamic)
with true
as the second argument for dynamic pages:
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT('https://soundcloud.com/charts/top', true)
>WAIT_ELEMENT(doc, '.chartTrack__details', 5000)
>LET tracks = ELEMENTS(doc, '.chartTrack__details')
>FOR track IN tracks
> LET username = ELEMENT(track, '.chartTrack__username')
> LET title = ELEMENT(track, '.chartTrack__title')
> RETURN {
> artist: username.innerText,
> track: title.innerText
> }
>%
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>%
>LET doc = DOCUMENT("https://github.com/", true)
>LET btn = ELEMENT(doc, ".HeaderMenu a")
>CLICK(btn)
>WAIT_NAVIGATION(doc)
>WAIT_ELEMENT(doc, '.IconNav')
>FOR el IN ELEMENTS(doc, '.IconNav a')
> RETURN TRIM(el.innerText)
>%
Embedded mode
ferret
is a very modular system and can therefore be easily embedded into your Go application.
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/drivers"
"github.com/MontFerret/ferret/pkg/drivers/cdp"
"github.com/MontFerret/ferret/pkg/drivers/http"
)
type Topic struct {
Name string `json:"name"`
Description string `json:"description"`
URL string `json:"url"`
}
func main() {
topics, err := getTopTenTrendingTopics()
if err != nil {
fmt.Println(err)
os.Exit(1)
}
for _, topic := range topics {
fmt.Printf("%s: %s %s\n", topic.Name, topic.Description, topic.URL)
}
}
func getTopTenTrendingTopics() ([]*Topic, error) {
query := `
LET doc = DOCUMENT("https://github.com/topics")
FOR el IN ELEMENTS(doc, ".py-4.border-bottom")
LIMIT 10
LET url = ELEMENT(el, "a")
LET name = ELEMENT(el, ".f3")
LET desc = ELEMENT(el, ".f5")
RETURN {
name: TRIM(name.innerText),
description: TRIM(desc.innerText),
url: "https://github.com" + url.attributes.href
}
`
comp := compiler.New()
program, err := comp.Compile(query)
if err != nil {
return nil, err
}
// create a root context
ctx := context.Background()
// enable HTML drivers
// by default, Ferret Runtime does not know about any HTML drivers
// all HTML manipulations are done via functions from standard library
// that assume that at least one driver is available
ctx = drivers.WithDynamic(ctx, cdp.NewDriver())
ctx = drivers.WithStatic(ctx, http.NewDriver())
out, err := program.Run(ctx)
if err != nil {
return nil, err
}
res := make([]*Topic, 0, 10)
err = json.Unmarshal(out, &res)
if err != nil {
return nil, err
}
return res, nil
}
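The last step of the example relies on the program output being JSON whose keys match the struct tags of `Topic`. The following standalone sketch isolates that decoding step. The payload is a made-up sample, not real Ferret output, and `decodeTopics` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Topic mirrors the object returned by the query's RETURN clause.
type Topic struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	URL         string `json:"url"`
}

// decodeTopics unmarshals a JSON array of objects into typed values.
func decodeTopics(out []byte) ([]*Topic, error) {
	var res []*Topic
	if err := json.Unmarshal(out, &res); err != nil {
		return nil, err
	}
	return res, nil
}

func main() {
	// made-up sample shaped like the RETURN object in the query
	out := []byte(`[{"name":"Go","description":"A programming language","url":"https://github.com/topics/go"}]`)

	topics, err := decodeTopics(out)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, t := range topics {
		fmt.Printf("%s: %s %s\n", t.Name, t.Description, t.URL)
	}
}
```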
Extensibility
That said, ferret
is a very modular system, which allows you not only to embed it but also to extend its standard library.
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"strings"
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/runtime/core"
"github.com/MontFerret/ferret/pkg/runtime/values"
)
func main() {
strs, err := getStrings()
if err != nil {
fmt.Println(err)
os.Exit(1)
}
for _, str := range strs {
fmt.Println(str)
}
}
func getStrings() ([]string, error) {
// transform has the signature that ferret expects from a runtime function
transform := func(ctx context.Context, args ...core.Value) (core.Value, error) {
// it's just a helper function which helps to validate a number of passed args
err := core.ValidateArgs(args, 1, 1)
if err != nil {
// it's recommended to return built-in None type, instead of nil
return values.None, err
}
// this is another helper function that performs type validation
err = core.ValidateType(args[0], core.StringType)
if err != nil {
return values.None, err
}
// cast to built-in string type
str := args[0].(values.String)
return values.NewString(strings.ToUpper(str.String() + "_ferret")), nil
}
query := `
FOR el IN ["foo", "bar", "qaz"]
// conventionally all functions are registered in upper case
RETURN TRANSFORM(el)
`
comp := compiler.New()
if err := comp.RegisterFunction("transform", transform); err != nil {
return nil, err
}
program, err := comp.Compile(query)
if err != nil {
return nil, err
}
out, err := program.Run(context.Background())
if err != nil {
return nil, err
}
res := make([]string, 0, 3)
err = json.Unmarshal(out, &res)
if err != nil {
return nil, err
}
return res, nil
}
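Stripped of the Ferret types, the custom function above appends a suffix and upper-cases the result. The plain-Go sketch below reproduces only that string logic, assuming `values.String.String()` yields the raw string, so you can see what each element of the query should turn into.

```go
package main

import (
	"fmt"
	"strings"
)

// transform mirrors the string manipulation of the custom function:
// append "_ferret", then convert everything to upper case.
func transform(s string) string {
	return strings.ToUpper(s + "_ferret")
}

func main() {
	for _, el := range []string{"foo", "bar", "qaz"} {
		fmt.Println(transform(el))
	}
	// prints:
	// FOO_FERRET
	// BAR_FERRET
	// QAZ_FERRET
}
```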
On top of that, you can completely turn off the standard library by passing the following option:
comp := compiler.New(compiler.WithoutStdlib())
After that, you can easily provide your own implementation of functions from the standard library.
If you only need a particular set of functions from the standard library, you can turn off the entire stdlib
and register individual packages from it:
package main
import (
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/stdlib/strings"
)
func main() {
comp := compiler.New(compiler.WithoutStdlib())
comp.RegisterFunctions(strings.NewLib())
}
Proxy
By default, Ferret does not use any proxies. This is partly because there is no way to force Chrome/Chromium (or any other Chrome DevTools Protocol compatible browser) to use a particular proxy at runtime; it has to be configured when the browser is launched.
But you can pass the address of a proxy server you want to use for static pages.
CLI
ferret --proxy=http://localhost:8888 my-query.fql
Code
package main
import (
"context"
"github.com/MontFerret/ferret/pkg/compiler"
"github.com/MontFerret/ferret/pkg/drivers"
"github.com/MontFerret/ferret/pkg/drivers/http"
)
func run(q string) ([]byte, error) {
proxy := "http://localhost:8888"
comp := compiler.New()
program := comp.MustCompile(q)
// create a root context
ctx := context.Background()
// we inform the driver what proxy to use
ctx = drivers.WithStatic(
ctx,
http.NewDriver(http.WithProxy(proxy)),
)
return program.Run(ctx)
}