mirror of https://github.com/MontFerret/ferret.git synced 2024-12-04 10:35:08 +02:00

Moved examples folder

This commit is contained in:
Tim Voronov 2018-10-05 23:36:23 -04:00
parent ec2d6a659b
commit e8d82d9396
18 changed files with 14 additions and 543 deletions

View File

@ -38,6 +38,8 @@ RETURN (
)
```
You can find more examples [here](./examples)
## Features
* Declarative language

View File

@ -1 +0,0 @@
# Docs

View File

@ -1,11 +0,0 @@
# High-level operations
The following high-level operations are described below:
- [FOR](operations/for.md): Iterate over all elements of an array or an object.
- [RETURN](operations/return.md): Produce the result of a query.
- [FILTER](operations/filter.md): Restrict the results to elements that match arbitrary logical conditions.
- [SORT](operations/sort.md): Force a sort of the array of already produced intermediate results.
- [LIMIT](operations/limit.md): Reduce the number of elements in the result to at most the specified number, optionally skipping elements (pagination).
- [LET](operations/let.md): Assign an arbitrary value to a variable.
- [COLLECT](operations/collect.md): Group an array by one or multiple group criteria. Can also count and aggregate.

View File

@ -1,210 +0,0 @@
# COLLECT
The ```COLLECT``` keyword can be used to group an array by one or multiple group criteria.
The ```COLLECT``` statement will eliminate all local variables in the current scope. After ```COLLECT``` only the variables introduced by ```COLLECT``` itself are available.
The general syntaxes for ```COLLECT``` are:
```
COLLECT variableName = expression options
COLLECT variableName = expression INTO groupsVariable options
COLLECT variableName = expression INTO groupsVariable = projectionExpression options
COLLECT variableName = expression INTO groupsVariable KEEP keepVariable options
COLLECT variableName = expression WITH COUNT INTO countVariable options
COLLECT variableName = expression AGGREGATE variableName = aggregateExpression options
COLLECT AGGREGATE variableName = aggregateExpression options
COLLECT WITH COUNT INTO countVariable options
```
```options``` is optional in all variants.
## Grouping syntaxes
The first syntax form of ```COLLECT``` only groups the result by the defined group criteria specified in expression. In order to further process the results produced by COLLECT, a new variable (specified by variableName) is introduced. This variable contains the group value.
Here's an example query that finds the distinct values in u.city and makes them available in the variable city:
```
FOR u IN users
  COLLECT city = u.city
  RETURN {
    "city" : city
  }
```
The second form does the same as the first form, but additionally introduces a variable (specified by groupsVariable) that contains all elements that fell into the group. This works as follows: The groupsVariable variable is an array containing as many elements as there are in the group. Each member of that array is a JSON object in which the value of every variable that is defined in the AQL query is bound to the corresponding attribute. Note that this considers all variables that are defined before the ```COLLECT``` statement, but not those on the top level (outside of any FOR), unless the ```COLLECT``` statement is itself on the top level, in which case all variables are taken. Furthermore note that it is possible that the optimizer moves LET statements out of FOR statements to improve performance.
```
FOR u IN users
  COLLECT city = u.city INTO groups
  RETURN {
    "city" : city,
    "usersInCity" : groups
  }
```
In the above example, the array users will be grouped by the attribute city. The result is a new array of documents, with one element per distinct u.city value. The elements from the original array (here: users) per city are made available in the variable groups. This is due to the INTO clause.
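The grouping behaviour described above can be modelled outside the query language. The following is a rough Python sketch, not the engine's implementation: the sample data and the helper name `collect_into` are made up for illustration, and each group member is wrapped as `{"u": row}` to mirror how INTO binds the loop variable `u` as an attribute.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the `users` array.
users = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Bob", "city": "Rome"},
    {"name": "Cid", "city": "Oslo"},
]


def collect_into(rows, key):
    """Group rows by row[key]; every member is stored as {"u": row}."""
    groups = defaultdict(list)
    for u in rows:
        groups[u[key]].append({"u": u})
    # One output element per distinct key, ordered by key for readability.
    return [
        {"city": city, "usersInCity": members}
        for city, members in sorted(groups.items())
    ]


result = collect_into(users, "city")
```

In this model `result` has one element per distinct city, and `result[0]["usersInCity"]` holds both Oslo users.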
```COLLECT``` also allows specifying multiple group criteria. Individual group criteria can be separated by commas:
```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups
  RETURN {
    "country" : country,
    "city" : city,
    "usersInCity" : groups
  }
```
In the above example, the array users is grouped by country first and then by city, and for each distinct combination of country and city, the users will be returned.
## Discarding obsolete variables
The third form of ```COLLECT``` allows rewriting the contents of the groupsVariable using an arbitrary projectionExpression:
```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = u.name
  RETURN {
    "country" : country,
    "city" : city,
    "userNames" : groups
  }
```
In the above example, the projectionExpression is just u.name. Therefore, only this attribute is copied into the groupsVariable for each document. This is typically much more efficient than copying all variables from the scope into the groupsVariable, as would happen without a projectionExpression.
The expression following INTO can also be used for arbitrary computations:
```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = {
    "name" : u.name,
    "isActive" : u.status == "active"
  }
  RETURN {
    "country" : country,
    "city" : city,
    "usersInCity" : groups
  }
```
```COLLECT``` also provides an optional KEEP clause that can be used to control which variables will be copied into the variable created by INTO. If no KEEP clause is specified, all variables from the scope will be copied as sub-attributes into the groupsVariable. This is safe but can have a negative impact on performance if there are many variables in scope or the variables contain massive amounts of data.
The following example limits the variables that are copied into the groupsVariable to just name. The variables u and someCalculation, which are also present in the scope, will not be copied into groupsVariable because they are not listed in the KEEP clause:
```
FOR u IN users
  LET name = u.name
  LET someCalculation = u.value1 + u.value2
  COLLECT city = u.city INTO groups KEEP name
  RETURN {
    "city" : city,
    "userNames" : groups[*].name
  }
```
```KEEP``` is only valid in combination with INTO. Only valid variable names can be used in the KEEP clause. KEEP supports the specification of multiple variable names.
## Group length calculation
```COLLECT``` also provides a special ```WITH COUNT``` clause that can be used to determine the number of group members efficiently.
The simplest form just returns the number of items that made it into the COLLECT:
```
FOR u IN users
  COLLECT WITH COUNT INTO length
  RETURN length
```
The above is equivalent to, but less efficient than:
```
RETURN LENGTH(users)
```
The ```WITH COUNT``` clause can also be used to efficiently count the number of items in each group:
```
FOR u IN users
  COLLECT age = u.age WITH COUNT INTO length
  RETURN {
    "age" : age,
    "count" : length
  }
```
Note: the ```WITH COUNT``` clause can only be used together with an INTO clause.
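As a rough illustration of per-group counting, the sketch below models the `WITH COUNT INTO` query above in Python. The sample data and the function name `count_by_age` are assumptions made for this example; the point is only that a counter per group is kept, not the group members themselves.

```python
from collections import Counter

# Hypothetical sample data standing in for the `users` array.
users = [{"age": 30}, {"age": 25}, {"age": 30}]


def count_by_age(rows):
    """Count rows per distinct age without storing the rows themselves."""
    counts = Counter(u["age"] for u in rows)
    return [{"age": age, "count": n} for age, n in sorted(counts.items())]


result = count_by_age(users)
```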
## Aggregation
A ```COLLECT``` statement can be used to perform aggregation of data per group. To only determine group lengths, the WITH COUNT INTO variant of ```COLLECT``` can be used as described before.
For other aggregations, it is possible to run aggregate functions on the ```COLLECT``` results:
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 INTO g
  RETURN {
    "ageGroup" : ageGroup,
    "minAge" : MIN(g[*].u.age),
    "maxAge" : MAX(g[*].u.age)
  }
The above however requires storing all group values during the collect operation for all groups, which can be inefficient.
The special AGGREGATE variant of ```COLLECT``` allows building the aggregate values incrementally during the collect operation, and is therefore often more efficient.
With the AGGREGATE variant the above query becomes:
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5
  AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    ageGroup,
    minAge,
    maxAge
  }
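To make the efficiency argument concrete, here is a small Python sketch of incremental aggregation. It is an illustration under assumptions (hypothetical data, made-up function name `aggregate_min_max`), not the engine's code: only a running minimum and maximum are kept per group, so no group members need to be stored.

```python
import math

# Hypothetical sample data standing in for the `users` array.
users = [{"age": 21}, {"age": 24}, {"age": 33}, {"age": 20}]


def aggregate_min_max(rows):
    """Keep one running (min, max) pair per age group instead of all members."""
    acc = {}
    for u in rows:
        age_group = math.floor(u["age"] / 5) * 5
        if age_group not in acc:
            acc[age_group] = {"minAge": u["age"], "maxAge": u["age"]}
        else:
            slot = acc[age_group]
            slot["minAge"] = min(slot["minAge"], u["age"])
            slot["maxAge"] = max(slot["maxAge"], u["age"])
    return [{"ageGroup": g, **v} for g, v in sorted(acc.items())]


result = aggregate_min_max(users)
```

Memory use in this model grows with the number of groups rather than with the number of rows.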
The AGGREGATE keyword can only be used after the ```COLLECT``` keyword. If used, it must directly follow the declaration of the grouping keys. If no grouping keys are used, it must follow the ```COLLECT``` keyword directly:
FOR u IN users
  COLLECT AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    minAge,
    maxAge
  }
Only specific expressions are allowed on the right-hand side of each AGGREGATE assignment:
- On the top level, an aggregate expression must be a call to one of the supported aggregation functions: LENGTH, MIN, MAX, SUM, AVERAGE, STDDEV_POPULATION, STDDEV_SAMPLE, VARIANCE_POPULATION, or VARIANCE_SAMPLE.
- An aggregate expression must not refer to variables introduced by the COLLECT itself.

## COLLECT variants
Since ArangoDB 2.6, there are two variants of ```COLLECT``` that the optimizer can choose from: the sorted variant and the hash variant. The hash variant only becomes a candidate for ```COLLECT``` statements that do not use an INTO clause.
The optimizer will always generate a plan that employs the sorted method. The sorted method requires its input to be sorted by the group criteria specified in the ```COLLECT``` clause. To ensure correctness of the result, the AQL optimizer will automatically insert a SORT statement into the query in front of the ```COLLECT``` statement. The optimizer may be able to optimize away that SORT statement later if a sorted index is present on the group criteria.
In case a ```COLLECT``` qualifies for using the hash variant, the optimizer will create an extra plan for it at the beginning of the planning phase. In this plan, no extra SORT statement will be added in front of the COLLECT. This is because the hash variant of ```COLLECT``` does not require sorted input. Instead, a SORT statement will be added after the ```COLLECT``` to sort its output. This SORT statement may be optimized away again in later stages. If the sort order of the ```COLLECT``` is irrelevant to the user, adding the extra instruction SORT null after the ```COLLECT``` will allow the optimizer to remove the sorts altogether:
FOR u IN users
  COLLECT age = u.age
  SORT null /* note: will be optimized away */
  RETURN age
Which ```COLLECT``` variant is used by the optimizer depends on the optimizer's cost estimations. The created plans with the different ```COLLECT``` variants will be shipped through the regular optimization pipeline. In the end, the optimizer will pick the plan with the lowest estimated total cost as usual.
In general, the sorted variant of ```COLLECT``` should be preferred in cases when there is a sorted index present on the group criteria. In this case the optimizer can eliminate the SORT statement in front of the COLLECT, so that no SORT will be left.
If there is no sorted index available on the group criteria, the up-front sort required by the sorted variant can be expensive. In this case it is likely that the optimizer will prefer the hash variant of COLLECT, which does not require its input to be sorted.
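The difference between the two variants can be sketched in Python. This is a simplified model, not the optimizer's implementation; the function names `collect_sorted` and `collect_hash` and the sample data are invented for this example.

```python
from itertools import groupby

# Hypothetical group keys, e.g. the u.age values of five users.
ages = [30, 25, 30, 41, 25]


def collect_sorted(values):
    """Sorted variant: sort the input first, then emit a group whenever
    the key changes between neighbouring elements."""
    return [key for key, _ in groupby(sorted(values))]


def collect_hash(values):
    """Hash variant: no input sort; distinct keys are gathered in a hash
    table and only the (usually smaller) output is sorted afterwards."""
    seen = set()
    for v in values:
        seen.add(v)
    return sorted(seen)


sorted_result = collect_sorted(ages)
hash_result = collect_hash(ages)
```

Both functions return the same groups. They differ in where the sorting cost is paid, which is the trade-off the optimizer's cost estimation weighs.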
Which variant of ```COLLECT``` was actually used can be figured out by looking into the execution plan of a query, specifically the AggregateNode and its aggregationOptions attribute.
## Setting COLLECT options
options can be used in a ```COLLECT``` statement to inform the optimizer about the preferred ```COLLECT``` method. When specifying the following appendix to a ```COLLECT``` statement, the optimizer will always use the sorted variant of ```COLLECT``` and not even create a plan using the hash variant:
OPTIONS { method: "sorted" }
Note that specifying hash as method will not make the optimizer use the hash variant. This is because the hash variant is not eligible for all queries. Instead, if no options or any other method than sorted are specified in OPTIONS, the optimizer will use its regular cost estimations.
## COLLECT vs. RETURN DISTINCT
In order to make a result set unique, one can either use ```COLLECT``` or RETURN DISTINCT. Behind the scenes, both variants will work by creating an AggregateNode. For both variants, the optimizer may try the sorted and the hashed variant of COLLECT. The difference is therefore mainly syntactical, with RETURN DISTINCT saving a bit of typing when compared to an equivalent COLLECT:
FOR u IN users
  RETURN DISTINCT u.age

FOR u IN users
  COLLECT age = u.age
  RETURN age
However, ```COLLECT``` is vastly more flexible than RETURN DISTINCT. Additionally, the order of results is undefined for a RETURN DISTINCT, whereas for a ```COLLECT``` the results will be sorted.

View File

@ -1,73 +0,0 @@
# FILTER
The FILTER statement can be used to restrict the results to elements that match an arbitrary logical condition.
## General syntax
```
FILTER condition
```
```condition``` must be a condition that evaluates to either false or true. If the condition result is false, the current element is skipped, so it will not be processed further and not be part of the result.
If the ```condition``` is true, the current element is not skipped and can be further processed.
See Operators for a list of comparison operators, logical operators etc. that you can use in conditions.
```
FOR u IN users
FILTER u.active == true && u.age < 39
RETURN u
```
It is allowed to specify multiple FILTER statements in a query, even in the same block. If multiple FILTER statements are used, their results will be combined with a logical AND, meaning all filter conditions must be true to include an element.
```
FOR u IN users
FILTER u.active == true
FILTER u.age < 39
RETURN u
```
In the above example, all array elements of users that have an attribute active with value true and that have an attribute age with a value less than 39 (including null ones) will be included in the result. All other elements of users will be skipped and not be included in the result produced by RETURN. You may refer to the chapter Accessing Data from Collections for a description of the impact of non-existent or null attributes.
## Order of operations
Note that the positions of FILTER statements can influence the result of a query. There are 16 active users in the test data for instance:
```
FOR u IN users
FILTER u.active == true
RETURN u
```
We can limit the result set to 5 users at most:
```
FOR u IN users
FILTER u.active == true
LIMIT 5
RETURN u
```
This may return the user documents of Jim, Diego, Anthony, Michael and Chloe for instance. Which ones are returned is undefined, since there is no SORT statement to ensure a particular order. If we add a second FILTER statement to only return women...
```
FOR u IN users
FILTER u.active == true
LIMIT 5
FILTER u.gender == "f"
RETURN u
```
... it might just return the Chloe document, because the LIMIT is applied before the second FILTER. No more than 5 documents arrive at the second FILTER block, and not all of them fulfill the gender criterion, even though there are more than 5 active female users in the collection. A more deterministic result can be achieved by adding a SORT block:
FOR u IN users
FILTER u.active == true
SORT u.age ASC
LIMIT 5
FILTER u.gender == "f"
RETURN u
This will return the users Mariah and Mary. If sorted by age in DESC order, then the Sophia, Emma and Madison documents are returned. A FILTER after a LIMIT is not very common however, and you probably want such a query instead:
FOR u IN users
FILTER u.active == true AND u.gender == "f"
SORT u.age ASC
LIMIT 5
RETURN u
The significance of where FILTER blocks are placed allows that this single keyword can assume the roles of two SQL keywords, WHERE as well as HAVING. AQL's FILTER thus works with COLLECT aggregates the same as with any other intermediate result, document attribute etc.
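The effect of statement order described in this section can be reproduced with a small Python model. The data below is hypothetical and, unlike a collection, a Python list has a fixed order; the sketch only shows that limiting before filtering and filtering before limiting are different pipelines.

```python
# Hypothetical sample data; a Python list keeps insertion order.
users = [
    {"name": "Jim", "active": True, "gender": "m"},
    {"name": "Diego", "active": True, "gender": "m"},
    {"name": "Chloe", "active": True, "gender": "f"},
    {"name": "Mary", "active": True, "gender": "f"},
    {"name": "Emma", "active": True, "gender": "f"},
]


def limit_then_filter(rows, limit):
    """FILTER active, LIMIT, then FILTER gender: the limit cuts first."""
    active = [u for u in rows if u["active"]]
    return [u["name"] for u in active[:limit] if u["gender"] == "f"]


def filter_then_limit(rows, limit):
    """FILTER active AND gender, then LIMIT: the limit cuts last."""
    matches = [u for u in rows if u["active"] and u["gender"] == "f"]
    return [u["name"] for u in matches[:limit]]


early_limit = limit_then_filter(users, 3)
late_limit = filter_then_limit(users, 3)
```

With this data the first pipeline yields only Chloe, while the second yields Chloe, Mary and Emma.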

View File

@ -1,43 +0,0 @@
# FOR
The FOR keyword can be used to iterate over all elements of an array. The general syntax is:
```
FOR variableName IN expression
```
There is also a special variant for graph traversals:
```
FOR valueVariableName, keyVariableName IN traversalExpression
```
Each array element returned by expression is visited exactly once. It is required that expression returns an array in all cases. The empty array is allowed, too. The current array element is made available for further processing in the variable specified by variableName.
```
FOR u IN users
RETURN u
```
This will iterate over all elements from the array users and make the current array element available in variable u.
The variable introduced by FOR is available until the scope the FOR is placed in is closed.
Another example that uses a statically declared array of values to iterate over:
```
FOR year IN [ 2011, 2012, 2013 ]
RETURN { "year" : year, "isLeapYear" : year % 4 == 0 && (year % 100 != 0 || year % 400 == 0) }
```
Nesting of multiple FOR statements is allowed, too. When FOR statements are nested, a cross product of the array elements returned by the individual FOR statements will be created.
```
FOR u IN users
FOR l IN locations
RETURN { "user" : u, "location" : l }
```
In this example, there are two array iterations: an outer iteration over the array users plus an inner iteration over the array locations. The inner array is traversed as many times as there are elements in the outer array.
For each iteration, the current values of users and locations are made available for further processing in the variables u and l.
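The cross product produced by nested FOR statements corresponds to a nested loop in most languages. A minimal Python sketch, with made-up sample values for both arrays:

```python
from itertools import product

# Hypothetical stand-ins for the `users` and `locations` arrays.
users = ["ann", "bob"]
locations = ["oslo", "rome", "kyiv"]

# product() iterates the second sequence fully for each element of the first,
# just like an inner FOR nested in an outer FOR.
pairs = [{"user": u, "location": l} for u, l in product(users, locations)]
```

The result has `len(users) * len(locations)` elements, here six.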

View File

@ -1,167 +0,0 @@
# RETURN
The RETURN statement can be used to produce the result of a query.
It is mandatory to specify a RETURN statement at the end of each block in a data-selection query, otherwise the query result would be undefined.
Using RETURN on the main level in data-modification queries is optional.
The general syntax for RETURN is:
```
RETURN expression
```
The expression returned by RETURN is produced for each iteration in the block the RETURN statement is placed in.
That means the result of a RETURN statement is always an array. This includes an empty array if no documents matched the query and a single return value returned as array with one element.
To return all elements from the currently iterated array without modification, the following simple form can be used:
```
FOR variableName IN expression
RETURN variableName
```
As RETURN allows specifying an expression, arbitrary computations can be performed to calculate the result elements.
Any of the variables valid in the scope the RETURN is placed in can be used for the computations.
To iterate over all elements of an array, you can write:
```
FOR u IN [{name: "Bob", age: 30}, {name: "Tom", age: 31}, {name: "Jeff", age: 38}]
RETURN u
```
In each iteration of the for-loop, an element of the array is assigned to a variable u and returned unmodified in this example.
To return only one attribute of each element, you could use a different return expression:
```
FOR u IN [{name: "Bob", age: 30}, {name: "Tom", age: 31}, {name: "Jeff", age: 38}]
RETURN u.name
```
Or to return multiple attributes, an object can be constructed like this:
```
FOR u IN [{name: "Bob", age: 30}, {name: "Tom", age: 31}, {name: "Jeff", age: 38}]
RETURN { name: u.name, age: u.age }
```
Note: RETURN will close the current scope and eliminate all local variables in it.
This is important to remember when working with subqueries.
Dynamic attribute names are supported as well:
```
FOR u IN [{id: "1", name: "Bob", age: 30}, {id: "2", name: "Tom", age: 31}, {id: "3", name: "Jeff", age: 38}]
RETURN { [ u.id ]: u.age }
```
The id of every user is used as the expression to compute the attribute key in this example:
```json
[
{
"1": 30
},
{
"2": 31
},
{
"3": 38
}
]
```
The result contains one object per user with a single key/value pair each.
This is usually not desired.
For a single object that maps user IDs to ages, the individual results need to be merged and returned with another RETURN:
```
RETURN MERGE(
FOR u IN users
RETURN { [ u._id ]: u.age }
)
```
```json
[
{
"users/10074": 69,
"users/9883": 32,
"users/9915": 27
}
]
```
Keep in mind that if the key expression evaluates to the same value multiple times, only one of the key/value pairs with the duplicate name will survive MERGE(). To avoid this, you can go without dynamic attribute names, use static names instead and return all document properties as attribute values:
```
FOR u IN users
RETURN { name: u.name, age: u.age }
```
```json
[
{
"name": "John Smith",
"age": 32
},
{
"name": "James Hendrix",
"age": 69
},
{
"name": "Katie Foster",
"age": 27
}
]
```
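The key-collision caveat for MERGE() mentioned above can be illustrated with ordinary dictionaries. This Python sketch assumes that the last value for a duplicate key wins, which is how `dict.update` behaves; the text above only states that one of the pairs survives. The data and the helper name `merge` are hypothetical.

```python
# Hypothetical users; note that the id "1" occurs twice.
users = [
    {"id": "1", "age": 30},
    {"id": "2", "age": 31},
    {"id": "1", "age": 38},
]


def merge(objects):
    """Merge single-key objects into one; later duplicates overwrite earlier ones."""
    merged = {}
    for obj in objects:
        merged.update(obj)
    return merged


result = merge({u["id"]: u["age"]} for u in users)
```

Three input objects produce only two keys, so one age is silently lost.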
## RETURN DISTINCT
RETURN can optionally be followed by the DISTINCT keyword. The DISTINCT keyword will ensure uniqueness of the values returned by the RETURN statement:
```
FOR variableName IN expression
RETURN DISTINCT expression
```
If the DISTINCT is applied on an expression that itself is an array or a subquery, the DISTINCT will not make the values in each array or subquery result unique, but instead ensure that the result contains only distinct arrays or subquery results. To make the result of an array or a subquery unique, simply apply the DISTINCT for the array or the subquery.
For example, the following query will apply DISTINCT on its subquery results, but not inside the subquery:
```
FOR what IN 1..2
RETURN DISTINCT (
FOR i IN [ 1, 2, 3, 4, 1, 3 ]
RETURN i
)
```
Here we'll have a FOR loop with two iterations that each execute a subquery. The DISTINCT here is applied on the two subquery results. Both subqueries return the same result value (that is [ 1, 2, 3, 4, 1, 3 ]), so after DISTINCT there will only be one occurrence of the value [ 1, 2, 3, 4, 1, 3 ] left:
```json
[
[ 1, 2, 3, 4, 1, 3 ]
]
```
If the goal is to apply the DISTINCT inside the subquery, it needs to be moved there:
```
FOR what IN 1..2
LET sub = (
FOR i IN [ 1, 2, 3, 4, 1, 3 ]
RETURN DISTINCT i
)
RETURN sub
```
In the above case, the DISTINCT will make the subquery results unique, so that each subquery will return a unique array of values ([ 1, 2, 3, 4 ]). As the subquery is executed twice and there is no DISTINCT on the top-level, that array will be returned twice:
```json
[
[ 1, 2, 3, 4 ],
[ 1, 2, 3, 4 ]
]
```
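The two placements of DISTINCT can be compared in a short Python model. The helper `distinct` is an assumption made for this sketch: it de-duplicates while preserving first-seen order, which is only one possible ordering.

```python
def distinct(values):
    """Order-preserving de-duplication that also works for unhashable lists."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen


items = [1, 2, 3, 4, 1, 3]

# DISTINCT on the outer RETURN: the two identical subquery results collapse.
outer = distinct([list(items) for _ in range(2)])

# DISTINCT inside the subquery: each result is unique, but both are returned.
inner = [distinct(items) for _ in range(2)]
```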
Note: the order of results was undefined for RETURN DISTINCT before ArangoDB 3.3. Starting with ArangoDB 3.3, RETURN DISTINCT will not change the order of the results it is applied on.
Note: RETURN DISTINCT is not allowed on the top-level of a query if there is no FOR loop preceding it.

View File

@ -1,38 +0,0 @@
# SORT
The SORT statement will force a sort of the array of already produced intermediate results in the current block. SORT allows specifying one or multiple sort criteria and directions. The general syntax is:
```
SORT expression direction
```
Example query that is sorting by lastName (in ascending order), then firstName (in ascending order), then by id (in descending order):
```
FOR u IN users
SORT u.lastName, u.firstName, u.id DESC
RETURN u
```
Specifying the direction is optional. The default (implicit) direction for a sort expression is ascending. To explicitly specify the sort direction, the keywords ASC (ascending) and DESC (descending) can be used. Multiple sort criteria can be separated using commas. In this case the direction is specified for each expression separately. For example:
```
SORT doc.lastName, doc.firstName
```
will first sort documents by lastName in ascending order and then by firstName in ascending order.
```
SORT doc.lastName DESC, doc.firstName
```
will first sort documents by lastName in descending order and then by firstName in ascending order.
```
SORT doc.lastName, doc.firstName DESC
```
will first sort documents by lastName in ascending order and then by firstName in descending order.
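Per-criterion directions can be sketched in Python by chaining stable sorts, applying the least significant criterion first. The sample documents and the function name are made up for this illustration, which models `SORT doc.lastName, doc.firstName DESC`.

```python
# Hypothetical documents.
docs = [
    {"lastName": "Smith", "firstName": "Zoe"},
    {"lastName": "Adams", "firstName": "Bob"},
    {"lastName": "Smith", "firstName": "Al"},
]


def sort_last_asc_first_desc(rows):
    """lastName ascending, then firstName descending within equal lastNames."""
    # Python's sort is stable, so sort by the secondary criterion first.
    by_first = sorted(rows, key=lambda d: d["firstName"], reverse=True)
    return sorted(by_first, key=lambda d: d["lastName"])


ordered = sort_last_asc_first_desc(docs)
names = [(d["lastName"], d["firstName"]) for d in ordered]
```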
Note: when iterating over collection-based arrays, the order of documents is always undefined unless an explicit sort order is defined using SORT.
Note that constant SORT expressions can be used to indicate that no particular sort order is desired. Constant SORT expressions will be optimized away by the AQL optimizer during optimization, but specifying them explicitly may enable further optimizations if the optimizer does not need to take into account any particular sort order. This is especially the case after a COLLECT statement, which is supposed to produce a sorted result. Specifying an extra SORT null after the COLLECT statement allows the AQL optimizer to remove the post-sorting of the collect results altogether.

12
examples/crawler.fql Normal file
View File

@ -0,0 +1,12 @@
LET doc = DOCUMENT('https://www.theverge.com/tech', true)
WAIT_ELEMENT(doc, '.c-compact-river__entry', 5000)
LET articles = ELEMENTS(doc, '.c-entry-box--compact__image-wrapper')
LET links = (
    FOR article IN articles
        RETURN article.attributes.href
)
FOR link IN links
    NAVIGATE(doc, link)
    WAIT_ELEMENT(doc, '.c-entry-content', 5000)
    LET texter = ELEMENT(doc, '.c-entry-content')
    RETURN texter.innerText