1
0
mirror of https://github.com/MontFerret/ferret.git synced 2024-12-14 11:23:02 +02:00
ferret/README.md
2018-09-18 16:42:38 -04:00

2.9 KiB

Ferret

Web scraping query language

Motivation

Nowadays data is everything and who owns data - owns the world. I have worked on multiple data-driven projects where data was an essential part of the system where I realized how cumbersome writing tons of scrapers is. I was looking for some kind of tool that would let me to not write a code, but just express what data I need. Unfortunately, I didn't find anything, and therefore decided to create one. Ferret project is an ambitious initiative to bring universal platform for writing scrapers without any hassle.

Inspiration

FQL (Ferret Query Language) is heavily inspired by AQL (ArangoDB Query Language). But due to domain specifics, there are some differences in how things work.

WIP

Be aware, the the project is under heavy development. There is no documentation and some things may change in the final release. For query syntax, you may go to ArrangoDB web site and use AQL docs as a docs for FQL - since they are identical.

Quick stark

Browserless mode

If you want to play with fql and check its syntax, run CLI with the following commands:

go run ./cmd/cli/main.go

ferret will run REPL.

Welcome to Ferret REPL
Please use `Ctrl-D` to exit this program.
>LET doc = DOCUMENT('https://news.ycombinator.com/')\
>FOR post IN ELEMENTS(doc, '.storylink')\
>RETURN post.attributes.href

Note: blackslash is used for multiline queries.

If you want to execute a query store in a file, just type a file name

go run ./cmd/cli/main.go ./docs/examples/hackernews.fql

Browser mode

By default, ferret loads HTML pages via http protocol since it's faster. But nowadays, there are more and more websites rendered with JavaScript, and therefore, this 'old school' approach does not really work. For this case, you may fetch documents using Chrome or Chromium via Chrome DevTools protocol (aka CDP).

go run ./cmd/cli/main.go --cdp-launch

Note: Launch command is currently broken on MacOS.

Alternatively, you may open Chrome manually with remote-debugging-port=9222 arguments and bass the address to ferret:

./bin/ferret --cdp http://127.0.0.1:9222

In this case, you can use function DOCUMENT(url, isJsRendered) with true for loading JS rendered pages:

Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
>LET doc = DOCUMENT('https://soundcloud.com/charts/top', true)
>SLEEP(2000) // WAIT WHEN THE PAGE GETS RENDERED
>LET tracks = ELEMENTS(doc, '.chartTrack__details')
>LOG("found", LENGTH(tracks), "tracks")
>FOR track IN tracks
>    LET username = ELEMENT(track, '.chartTrack__username')
>    LET title = ELEMENT(track, '.chartTrack__title')
>    RETURN {
>       artist: username.innerText,
>        track: title.innerText
>    }