The AQL query optimizer
AQL queries are sent through an optimizer before execution that creates an initial execution plan, looks for optimization opportunities, and applies them
AQL queries are parsed and planned. The optimizer might produce multiple execution plans for a single query. It then calculates the costs for all plans and picks the plan with the lowest total cost. This resulting plan is considered to be the optimal plan, which is then executed.
The optimizer is designed to only perform optimizations if they are safe, in the sense that an optimization should not modify the result of a query. A notable exception to this is that the optimizer is allowed to change the order of results for queries that do not explicitly specify how results should be sorted.
Execution plans
The explain
command can be used to query the optimal executed plan or even all plans
the optimizer has generated. Additionally, explain
can reveal some more information
about the optimizer’s view of the query.
Inspecting plans using the explain helper
The explain
method of ArangoStatement
as shown in the next chapters creates very verbose output.
You can work on the output programmatically, or use this handsome tool that we created
to generate a more human readable representation.
You may use it like this: (we disable syntax highlighting here)
var coll = db._create("test");
for (i = 0; i < 100; ++i) { db.test.save({ value: i }); }
var idx = db.test.ensureIndex({ type: "persistent", fields: [ "value" ] });
var explain = require("@arangodb/aql/explainer").explain;
explain("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value", {colors:false});
Execution plans in detail
Let’s have a look at the raw json output of the same execution plan
using the explain
method of ArangoStatement
:
var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
stmt.explain();
As you can see, the result details are very verbose. They are not covered in detail for brevity in the next sections. Instead, let’s take a closer look at the results step by step.
Execution nodes of a query
In general, an execution plan can be considered to be a pipeline of processing steps. Each processing step is carried out by a so-called execution node
The nodes
attribute of the explain
result contains these execution nodes in
the execution plan. The output is still very verbose, so here’s a shorted form of it:
stmt.explain().plan.nodes.map(function (node) { return node.type; });
Note that the list of nodes might slightly change in future versions of ArangoDB if new execution node types get added or the optimizer create somewhat more optimized plans.
When a plan is executed, the query execution engine starts with the node at
the bottom of the list (i.e. the ReturnNode
).
The ReturnNode
’s purpose is to return data to the caller. It does not produce
data itself, but it asks the node above itself, which is the CalculationNode
in our example.
CalculationNode
s are responsible for evaluating arbitrary expressions. In our
example query, the CalculationNode
evaluates the value of i.value
, which
is needed by the ReturnNode
. The calculation is applied for all data the
CalculationNode
gets from the node above it, in our example the IndexNode
.
Finally, all of this needs to be done for documents of collection test
. This is
where the IndexNode
enters the game. It uses an index (thus its name)
to find certain documents in the collection and ships it down the pipeline in the
order required by SORT i.value
. The IndexNode
itself has a SingletonNode
as its input. The sole purpose of a SingletonNode
node is to provide a single empty
document as input for other processing steps. It is always the end of the pipeline.
Here is a summary:
- SingletonNode: produces an empty document as input for other processing steps.
- IndexNode: iterates over the index on attribute
value
in collectiontest
in the order required bySORT i.value
. - CalculationNode: evaluates the result of the calculation
i.value > 97
totrue
orfalse
- CalculationNode: calculates return value
i.value
- ReturnNode: returns data to the caller
Optimizer rules used for a query
Note that in the example, the optimizer has optimized the SORT
statement away.
It can do it safely because there is a sorted persistent index on i.value
, which it has
picked in the IndexNode
. As the index values are iterated over in sorted order
anyway, the extra SortNode
would have been redundant and was removed.
Additionally, the optimizer has done more work to generate an execution plan that avoids as much expensive operations as possible. Here is the list of optimizer rules that were applied to the plan:
stmt.explain().plan.rules;
Here is the meaning of these rules in context of this query:
move-calculations-up
: Moves aCalculationNode
and subqueries, when independent from the outer node, as far up in the processing pipeline as possible.move-filters-up
: Moves aFilterNode
as far up in the processing pipeline as possible.remove-redundant-calculations
: Replaces references to variables with references to other variables that contain the exact same result. In the example query,i.value
is calculated multiple times, but each calculation inside a loop iteration would produce the same value. Therefore, the expression result is shared by several nodes.remove-unnecessary-calculations
: RemovesCalculationNode
s whose result values are not used in the query. In the example this happens due to theremove-redundant-calculations
rule having made some calculations unnecessary.use-indexes
: Use an index to iterate over a collection instead of performing a full collection scan. In the example case this makes sense, as the index can be used for filtering and sorting.remove-filter-covered-by-index
: Remove an unnecessary filter whose functionality is already covered by an index. In this case the index only returns documents matching the filter.use-index-for-sort
: Removes aSORT
operation if it is already satisfied by traversing over a sorted index.
Note that some rules may appear multiple times in the list, with number suffixes. This is due to the same rule being applied multiple times, at different positions in the optimizer pipeline.
Also see the full list of optimizer rules below.
Collections used in a query
The list of collections used in a plan (and query) is contained in the collections
attribute of a plan:
stmt.explain().plan.collections
The name
attribute contains the name of the collection
, and type
is the
access type, which can be either read
or write
.
Variables used in a query
The optimizer returns a list of variables used in a plan (and query). This list contains auxiliary variables created by the optimizer itself. You can ignore this list in most cases.
Cost of a query
For each plan the optimizer generates, it calculates the total cost. The plan with the lowest total cost is considered to be the optimal plan. Costs are estimates only, as the actual execution costs are unknown to the optimizer. Costs are calculated based on heuristics that are hard-coded into execution nodes. Cost values do not have any unit.
Retrieving all execution plans
To retrieve not just the optimal plan but a list of all plans the optimizer has
generated, set the option allPlans
to true
:
This returns a list of all plans in the plans
attribute instead of in the
plan
attribute:
stmt.explain({ allPlans: true });
Retrieving the plan as it was generated by the parser / lexer
To retrieve the plan which closely matches your query, you may turn off most
optimization rules (i.e. cluster rules cannot be disabled if you’re running
the explain on a cluster Coordinator) set the option rules
to -all
:
This returns an unoptimized plan in the plan
:
stmt.explain({ optimizer: { rules: [ "-all" ] } });
Note that some optimizations are already done at parse time (i.e. evaluate simple constant
calculation as 1 + 1
)
Turning specific optimizer rules off
Optimizer rules can also be turned on or off individually, using the rules
attribute.
This can be used to enable or disable one or multiple rules. Rules that shall be enabled
need to be prefixed with a +
, rules to be disabled should be prefixed with a -
. The
pseudo-rule all
matches all rules.
Rules specified in rules
are evaluated from left to right, so the following works to
turn on just the one specific rule:
stmt.explain({ optimizer: { rules: [ "-all", "+use-index-range" ] } });
By default, all rules are turned on. To turn off just a few specific rules, use something like this:
stmt.explain({ optimizer: { rules: [ "-use-index-range", "-use-index-for-sort" ] } });
The maximum number of plans created by the optimizer can also be limited using the
maxNumberOfPlans
attribute:
stmt.explain({ maxNumberOfPlans: 1 });
Optimizer statistics
The optimizer provides statistics as a part of an explain
result.
The following attributes are returned in the stats
attribute:
plansCreated
: The total number of plans created by the optimizer.rulesExecuted
: The number of rules executed. Note that an executed rule does not indicate that a plan has actually been modified by a rule.rulesSkipped
: The number of rules skipped by the optimizer.executionTime
: The (wall-clock) time in seconds needed to explain the query.peakMemoryUsage
: The maximum memory usage of the query during explain.
Warnings
For some queries, the optimizer may produce warnings. These are returned in
the warnings
attribute of the explain
result:
var stmt = db._createStatement("FOR i IN 1..10 RETURN 1 / 0")
stmt.explain().warnings;
There is an upper bound on the number of warnings a query may produce. If that bound is reached, no further warnings are returned.
Optimization in a cluster
When you are running AQL in the cluster, the parsing of the query is done on the
Coordinator. The Coordinator then chops the query into snippets, which are either
to remain on the Coordinator or need to be distributed to the shards on the
DB-Servers over the network. The cutting sites are interconnected via ScatterNode
s,
GatherNode
s and RemoteNode
s. These nodes mark the network borders of the snippets.
The optimizer strives to reduce the amount of data transferred via these network
interfaces by pushing FILTER
s out to the shards, as it is vital to the query
performance to reduce that data amount to transfer over the network links.
Using a cluster, there is a Site column if you explain a query. Snippets marked with DBS are executed on DB-Servers, COOR ones are executed on the respective Coordinator.
Query String (57 chars, cacheable: false):
FOR doc IN test UPDATE doc WITH { updated: true } IN test
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode DBS 1 * ROOT
3 CalculationNode DBS 1 - LET #3 = { "updated" : true }
13 IndexNode DBS 1000000 - FOR doc IN test /* primary index scan, index only, projections: `_key`, 5 shard(s) */
4 UpdateNode DBS 0 - UPDATE doc WITH #3 IN test
7 RemoteNode COOR 0 - REMOTE
8 GatherNode COOR 0 - GATHER
Execution nodes
List of execution nodes
The following execution node types appear in the output of explain
:
CalculationNode: Evaluates an expression. The expression result may be used by other nodes, e.g.
FilterNode
,EnumerateListNode
,SortNode
etc.CollectNode: Aggregates its input and produces new output variables. This appears once per
COLLECT
statement.EnumerateCollectionNode: Enumeration over documents of a collection (given in its collection attribute) without using an index.
EnumerateListNode: Enumeration over a list of (non-collection) values.
EnumerateViewNode: Enumeration over documents of a View.
FilterNode: Only lets values pass that satisfy a filter condition. Appears once per
FILTER
statement.IndexNode: Enumeration over one or many indexes (given in its indexes attribute) of a collection. The index ranges are specified in the condition attribute of the node.
InsertNode: Inserts documents into a collection (given in its collection attribute). Appears exactly once in a query that contains an INSERT statement.
JoinNode: Replaces
EnumerateCollectionNode
andIndexNode
for eligible join queries to perform optimized joins.KShortestPathsNode: Indicates a traversal for k Shortest Paths (
K_SHORTEST_PATHS
in AQL).KPathsNode: Indicates a traversal for k Paths (
K_PATHS
in AQL).LimitNode: Limits the number of results passed to other processing steps. Appears once per
LIMIT
statement.MaterializeNode: The presence of this node means that the query is not fully covered by indexes and therefore needs to involve the storage engine.
RemoveNode: Removes documents from a collection (given in its collection attribute). Appears exactly once in a query that contains a
REMOVE
statement.ReplaceNode: Replaces documents in a collection (given in its collection attribute). Appears exactly once in a query that contains a
REPLACE
statement.ReturnNode: Returns data to the caller. Appears in each read-only query at least once. Subqueries also contain
ReturnNode
s.SingletonNode: The purpose of a
SingletonNode
is to produce an empty document that is used as input for other processing steps. Each execution plan contains exactly oneSingletonNode
as its top node.ShortestPathNode: Indicates a traversal for a Shortest Path (
SHORTEST_PATH
in AQL).SortNode: Performs a sort of its input values.
SubqueryEndNode: End of a spliced (inlined) subquery.
SubqueryNode: Executes a subquery.
SubqueryStartNode: Beginning of a spliced (inlined) subquery.
TraversalNode: Indicates a regular graph traversal, as opposed to a shortest path(s) traversal.
UpdateNode: Updates documents in a collection (given in its collection attribute). Appears exactly once in a query that contains an
UPDATE
statement.UpsertNode: Upserts documents in a collection (given in its collection attribute). Appears exactly once in a query that contains an
UPSERT
statement.
List of cluster execution nodes
For queries in the cluster, the following additional nodes may appear in execution plans:
DistributeNode: Used on a Coordinator to fan-out data to one or multiple shards, taking into account a collection’s shard key.
GatherNode: Used on a Coordinator to aggregate results from one or many shards into a combined stream of results. Parallelizes work for certain types of queries when there are multiple DB-Servers involved (shown as
GATHER /* parallel */
in query explain).RemoteNode: A
RemoteNode
performs communication with another ArangoDB instances in the cluster. For example, the cluster Coordinator needs to communicate with other servers to fetch the actual data from the shards. It does so viaRemoteNode
s. The data servers themselves might again pull further data from the Coordinator, and thus might also employRemoteNode
s. So, all of the above cluster relevant nodes are accompanied by aRemoteNode
.ScatterNode: Used on a Coordinator to fan-out data to one or multiple shards.
SingleRemoteOperationNode: Used on a Coordinator to directly work with a single document on a DB-Server that is referenced by its
_key
.MultipleRemoteExecutionNode: Used to optimize bulk
INSERT
operations in cluster deployments, reducing the setup and shutdown overhead and the number of internal network requests.
Optimizer rules
List of optimizer rules
The following user-facing optimizer rules exist and are enabled by default unless noted otherwise. You can enable and disable optimizer rules except for a few required rules.
Some rules exist multiple times with number suffixes like -2
,
(e.g. remove-unnecessary-calculations-2
). This is due to the same rule being
applied multiple times at different optimization stages.
replace-function-with-index
This rule cannot be turned off.
Replace deprecated index functions such as FULLTEXT()
,
NEAR()
, WITHIN()
, or WITHIN_RECTANGLE()
with a regular subquery.
inline-subqueries
Try to pull subqueries out into their surrounding scope, e.g.
FOR x IN (FOR y IN collection FILTER y.value >= 5 RETURN y.test) RETURN x.a
becomes FOR tmp IN collection FILTER tmp.value >= 5 LET x = tmp.test RETURN x.a
.
replace-like-with-range
This rule cannot be turned off.
Replace LIKE() function with range scans where possible.
simplify-conditions
Replace parts in CalculationNode
expressions with
simpler expressions.
move-calculations-up
Move calculations up in the processing pipeline as far as possible (ideally out of enumerations) so they are not executed in loops if not required. It is quite common that this rule enables further optimizations.
move-filters-up
Move filters up in the processing pipeline as far as possible (ideally out of inner loops) so they filter results as early as possible.
remove-redundant-calculations
Replace references to redundant calculations (expressions with the exact same result) with a single reference, allowing other rules to remove no longer needed calculations.
remove-unnecessary-filters
Remove FILTER
conditions that always evaluate to true
.
remove-unnecessary-calculations
Remove all calculations whose result is not referenced in the query. This can be a consequence of applying other optimizations.
remove-redundant-sorts
Try to merge multiple SORT
statements into fewer sorts.
optimize-subqueries
Apply optimizations to subqueries.
This rule adds a LIMIT
statement to qualifying subqueries to make them return
less data. It also modifies the result value of subqueries in case only the
number of subquery results is checked later. This saves copying the document
data from the subquery to the outer scope and may enable follow-up
optimizations.
interchange-adjacent-enumerations
Try out permutations of FOR
statements in queries that contain
multiple loops, which may enable further optimizations by other rules.
move-calculations-up-2
Second pass of moving calculations up in the processing pipeline as far as possible, to pull them out of inner loops etc.
move-filters-up-2
Second pass of moving filters up in the processing pipeline as far as possible so they filter results as early as possible.
replace-equal-attribute-accesses
This rule is disabled by default.
Replace attribute accesses that are equal due to a filter statement with the same value. This might enable other optimizations later on.
remove-redundant-sorts-2
Second pass of trying to merge multiple SORT
statements
into fewer sorts.
remove-sort-rand-limit-1
Remove SORT RAND() LIMIT 1
constructs by moving the random iteration
into EnumerateCollectionNode
.
The RocksDB storage engine doesn't allow to seek random documents efficiently. This optimization picks a pseudo-random document based on a limited number of seeks within the collection's key range, selecting a random start key in the key range, and then going a few steps before or after that.
remove-collect-variables
Remove INTO
and AGGREGATE
clauses from COLLECT
statements if the result is not used.
propagate-constant-attributes
Insert constant values into FILTER
conditions, replacing
dynamic attribute values.
remove-data-modification-out-variables
Avoid setting the pseudo-variables OLD
and NEW
if they
are not used in data modification queries.
replace-or-with-in
Combine multiple OR
equality conditions on the same
variable or attribute with an IN
condition.
remove-redundant-or
Combine multiple OR
conditions for the same variable or
attribute into a single condition.
geo-index-optimizer
Utilize geo-spatial indexes.
use-indexes
Use indexes to iterate over collections, replacing
EnumerateCollectionNode
with IndexNode
in the query plan.
remove-filter-covered-by-index
Replace or remove FilterNode
if the filter conditions are
already covered by IndexNode
.
remove-unnecessary-filters-2
Second pass of removing FILTER
conditions that always
evaluate to true
.
use-index-for-sort
Use indexes to avoid SORT
operations, removing SortNode
from the query plan.
sort-in-values
Use a binary search for in-list lookups with a logarithmic
complexity instead of the default linear complexity in-list lookup if the
comparison array on the right-hand side of an IN
operator is pre-sorted by an
extra function call.
optimize-traversal-last-element-access
Transform accesses to the last vertex or edge of the path
output variable (p.vertices[-1]
and p.edges[-1]
) emitted by AQL traversals
(FOR v, e, p IN ...
) with accesses to the vertex or edge variable
(v
and e
). This can avoid computing the path variable at all and enable
further optimizations that are not possible on the path variable p
.
optimize-traversals
Try to move FILTER
conditions into TraversalNode
for
early pruning of results, apply traversal projections, and avoid calculating
edge and path output variables that are not declared in the query for the
AQL traversal.
optimize-enumerate-path-filters
Move FILTER
conditions on the path output variable into the path search
optimize-paths
Check how the output variables of K_PATHS
, K_SHORTEST_PATHS
,
and ALL_SHORTEST_PATHS
path search graph algorithms are used and avoid
loading the vertex documents if they are not accessed in the query.
remove-filter-covered-by-traversal
Replace or remove FilterNode
if the filter conditions are
already covered by TraversalNode
.
handle-arangosearch-views
This rule cannot be turned off.
Appears whenever an arangosearch
or search-alias
View is accessed
in a query.
arangosearch-constrained-sort
Make nodes of type EnumerateViewNode
aware of SORT
with a
subsequent LIMIT
when using Views to reduce memory usage and avoid unnecessary
sorting that has already been carried out by ArangoSearch internally.
remove-unnecessary-calculations-2
Second pass of removing all calculations whose result is not referenced in the query. This can be a consequence of applying other optimizations
remove-redundant-path-var
Avoid computing the variables emitted by AQL traversals if they are declared but unused in the query, or only used in filters that are pulled into the traversal, significantly reducing overhead.
optimize-cluster-single-document-operations
Only available in cluster deployments.
Let a Coordinator work with a document directly if you
reference a document by its _key
. In this case, no AQL is executed on the
DB-Servers.
optimize-cluster-multiple-document-operations
Only available in cluster deployments.
For bulk INSERT
operations in cluster deployments, avoid
unnecessary overhead that AQL queries typically require for the setup and
shutdown in clusters, as well as for the internal batching.
This optimization also decreases the number of HTTP requests to the DB-Servers.
The following patterns are recognized:
FOR doc IN @docs INSERT doc INTO collection
, where@docs
is a bind parameter with an array of documents to be insertedFOR doc IN [ { … }, { … }, … ] INSERT doc INTO collection
, where theFOR
loop iterates over an array of input documents known at query compile timeLET docs = [ { … }, { … }, … ] FOR doc IN docs INSERT doc INTO collection
, where thedocs
variable is a static array of input documents known at query compile time
If a query has such a pattern, and all of the following restrictions are met, then the optimization is triggered:
- There are no following
RETURN
nodes (including anyRETURN OLD
orRETURN NEW
) - The
FOR
loop is not contained in another outerFOR
loop or subquery - There are no other operations (e.g.
LET
,FILTER
) betweenFOR
andINSERT
INSERT
is not used on a SmartGraph edge collection- The
FOR
loop iterates over a constant, deterministic expression
The optimization then replaces the InsertNode
and EnumerateListNode
with a
MultipleRemoteExecutionNode
in the query execution plan, which takes care of
inserting all documents into the collection in one go. Further optimizer rules
are skipped if the optimization is triggered.
move-calculations-down
Move calculations down in the processing pipeline as far as
possible (below FILTER
, LIMIT
and SUBQUERY
nodes) so they are executed as
late as possible and not before their results are required.
fuse-filters
Merges adjacent FILTER
nodes together into a single
FILTER
node.
cluster-one-shard
Enterprise Edition only
Only available in cluster deployments.
Offload the entire query to the DB-Server (except the client communication via a Coordinator). This saves all the back and forth that normally exists in regular cluster queries, benefitting traversals and joins in particular.
Only for eligible queries in the OneShard deployment mode as well as for
queries that only involve collection(s) with a single shard (and identical
sharding in case of multiple collections, e.g. via distributeShardsLike
).
Queries involving V8 / JavaScript (e.g. user-defined AQL functions) or
SmartGraphs cannot be optimized.
cluster-lift-constant-for-disjoint-graph-nodes
Enterprise Edition only
This rule cannot be turned off.
Only available in cluster deployments.
Detect SmartGraph traversals with a constant start vertex to prepare follow-up optimizations that can determine the shard location and push down calculations to a DB-Server.
distribute-in-cluster
This rule cannot be turned off.
Only available in cluster deployments.
Appears if query parts get distributed in a cluster.
smart-joins
Enterprise Edition only
Only available in cluster deployments.
Reduce inter-node joins to server-local joins. This rule is only employed when joining two collections with identical sharding setup via their shard keys.
scatter-in-cluster
This rule cannot be turned off.
Only available in cluster deployments.
Appears if nodes of the types ScatterNode
, GatherNode
,
and RemoteNode
are inserted into a distributed query plan.
scatter-satellite-graphs
Enterprise Edition only
Only available in cluster deployments.
Execute nodes of the types TraversalNode
,
ShortestPathNode
, and KShortestPathsNode
on a DB-Server instead of on a
Coordinator if the nodes operate on SatelliteGraphs, removing the need to
transfer data for these nodes.
remove-satellite-joins
Enterprise Edition only
Only available in cluster deployments.
Optimize nodes of the types ScatterNode
, GatherNode
, and
RemoteNode
for SatelliteCollections and SatelliteGraphs away. Execute the
respective query parts on each participating DB-Server independently, so that
the results become available locally without network communication.
Depends on the remove-unnecessary-remote-scatter
rule.
remove-distribute-nodes
Enterprise Edition only
Only available in cluster deployments.
Combine multiples nodes of type DistributeNode
into one if
two adjacent DistributeNode
nodes share the same input variables and
therefore can be optimized into a single DistributeNode
.
distribute-offset-info-to-cluster
Enterprise Edition only
This rule cannot be turned off.
Only available in cluster deployments.
Push the calculation of search highlighting information to DB-Servers where the data for determining the offsets is stored.
distribute-filtercalc-to-cluster
Only available in cluster deployments.
Move filters up in a distributed execution plan. Filters are moved as far up in the plan as possible to make result sets as small as possible, as early as possible.
distribute-sort-to-cluster
Only available in cluster deployments.
Move sort operations up in a distributed query. Sorts are moved as far up in the query plan as possible to make result sets as small as possible, as early as possible.
move-filters-into-enumerate
Move filters on non-indexed collection attributes into
IndexNode
or EnumerateCollectionNode
to allow early pruning of
non-matching documents. This optimization can help to avoid a lot of temporary
document copies. The optimization can also be applied to enumerations over
non-collection array.
remove-unnecessary-remote-scatter
Only available in cluster deployments.
Avoid distributing calculations and handle them centrally if
a RemoteNode
is followed by a ScatterNode
, and the ScatterNode
is only
followed by calculations or a SingletonNode
.
undistribute-remove-after-enum-coll
Only available in cluster deployments.
Push nodes of type RemoveNode
into the same query part that
enumerates over the documents of a collection. This saves inter-cluster
roundtrips between the EnumerateCollectionNode
and the RemoveNode
.
It includes simple UPDATE
and REPLACE
operations that modify multiple
documents and do not use LIMIT
.
collect-in-cluster
Only available in cluster deployments.
Perform the heavy processing for COLLECT
statements on
DB-Servers and only light-weight aggregation on a Coordinator. Both sides get
a CollectNode
in the query plan.
sort-limit
Make SORT
aware of a subsequent LIMIT
to enable
optimizations internal to the SortNode
that allow to reduce memory usage
and, in many cases, improve the sorting speed.
A SortNode
needs to be followed by a LimitNode
with no intervening nodes
that may change the element count (e.g. a FilterNode
which cannot be moved
before the sort, or a source node like EnumerateCollectionNode
).
The optimizer may choose not to apply the rule if it decides that it offers
little or no benefit. In particular, it does not apply the rule if the input
size is very small or if the output from the LimitNode
is similar in size to
the input. In exceptionally rare cases, this rule could result in some small
slowdown. If observed, you can disable the rule for the affected query at the
cost of increased memory usage.
reduce-extraction-to-projection
Modify EnumerationCollectionNode
and IndexNode
that would have
extracted entire documents to only return a projection of each document.
Projections are limited to at most 5 different document attributes by default.
The maximum number of projected attributes can optionally be adjusted by
setting the maxProjections
hint for an AQL FOR
operation since
ArangoDB 3.9.1.
restrict-to-single-shard
Only available in cluster deployments.
Restrict operations to a single shard instead of applying
them for all shards if a collection operation (IndexNode
or a
data modification node) only affects a single shard.
This optimization can be applied for queries that access a collection only once
in the query, and that do not use traversals, shortest path queries, and that
do not access collection data dynamically using the DOCUMENT()
, FULLTEXT()
,
NEAR()
or WITHIN()
AQL functions. Additionally, the optimizer can only
apply this optimization if it can safely determine the values of all the
collection's shard keys from the query, and when the shard keys are covered by
a single index (this is always true if the shard key is the default _key
).
optimize-count
Optimize subqueries to use an optimized code path for counting documents.
The requirements are that the subquery result must only be used with the
COUNT()
or LENGTH()
AQL function and not for anything else. The subquery
itself must be read-only (no data modification subquery), not use nested FOR
loops, no LIMIT
statement, and no FILTER
condition or calculation that
requires accessing document data. Accessing index data is supported for
filtering but not for further calculations.
parallelize-gather
Only available in cluster deployments.
Apply an optimization to execute Coordinator GatherNode
nodes in parallel. These notes cannot be parallelized if they depend on a
TraversalNode
, except for certain Disjoint SmartGraph traversals where the
traversal can run completely on the local DB-Server.
decay-unnecessary-sorted-gather
Only available in cluster deployments.
Avoid merge-sorting results on a Coordinator if they are all from a single shard and fully sorted by a DB-Server already.
push-subqueries-to-dbserver
Enterprise Edition only
Only available in cluster deployments.
Execute subqueries entirely on a DB-Server if possible. Subqueries need to contain exactly one distribute/gather section, and only one collection access or traversal, shortest path, or k-shortest paths query.
late-document-materialization-arangosearch
Try to read from the underlying collections of a View as late as possible if the involved attributes are covered by the View index.
late-document-materialization
Try to read from collections as late as possible if the involved attributes are covered by inverted indexes.
push-limit-into-index
Push limit into index node.
batch-materialize-documents
Batch document lookup from indexes.
late-materialization-offset-info
Enterprise Edition only
Get the search highlighting offsets as late as possible to avoid unnecessary reads.
join-index-nodes
Join adjacent index nodes and replace them with a join node in case the indexes qualify for it.
push-down-late-materialization
Push down late materialization.
materialize-into-separate-variable
This rule cannot be turned off.
Introduce a separate variable for late materialization.
optimize-projections
Remove projections that are no longer used and store projection results in separate output registers.
remove-unnecessary-calculations-4
Fourth pass of removing all calculations whose result is not referenced in the query. This can be a consequence of applying other optimizations
async-prefetch
Allow query execution nodes to asynchronously prefetch the next batch while processing the current batch, allowing parts of the query to run in parallel. This is only possible for certain operations in a query.
immutable-search-condition
Optimize immutable search condition for nested loops, we don't need to make real search many times, if we can cache results in bitset
splice-subqueries
This rule cannot be turned off.
Appears if subqueries are spliced into the surrounding query, reducing overhead for executing subqueries by inlining the execution. This mainly benefits queries which execute subqueries very often that only return a few results at a time.
This optimization is performed on all subqueries and is applied after all other optimizations.
Additional optimizations applied
Scan-Only Optimization
If a query iterates over a collection (for filtering or counting) but does not need
the actual document values later, the optimizer can apply a “scan-only” optimization
for EnumerateCollectionNode
and IndexNode
node types. In this case, it does not build up
a result with the document data at all, which may reduce work significantly.
In case the document data is actually not needed later on, it may be sensible to remove
it from query strings so the optimizer can apply the optimization.
If the optimization is applied, it shows up as scan only
in an AQL
query’s execution plan for an EnumerateCollectionNode
or an IndexNode
.
Index-Only Optimization
The optimizer can apply an “index-only” optimization for AQL queries that can satisfy the retrieval of all required document attributes directly from an index.
This optimization is triggered if an index is used
that covers all required attributes of the document used later on in the query.
If applied, it saves retrieving the actual document data (which would require
an extra lookup by the storage engine), but instead builds the document data solely
from the index values found. It only applies when using up to 5 (or
maxProjections
) attributes
from the document, and only if the rest of the document data is not used later
on in the query.
The optimization is available for the following index types: primary
,
edge
, and persistent
.
If the optimization is applied, it shows up as index only
in an AQL
query’s execution plan for an IndexNode
.
Filter Projections Optimizations
Introduced: v3.10.0
If an index is used that does not cover all required attributes for the query,
but if it is followed by filter conditions that only access attributes that are
part of the index, then an optimization is applied, to only fetch matching
documents. “Part of the index” here means, that all attributes referred to in
the post-filter conditions are contained in the fields
or storedValues
attributes of the index definition.
For example, the optimization is applied in the following case:
- There is a persistent index on the attributes
[ "value1", "value2" ]
(in this order), or there is a persistent index on just["value1"]
and with astoredValues
definition of["value2"]
. - There is a filter condition on
value1
that can use the index, and a filter condition onvalue2
that cannot use the index (post-filter condition).
Example query:
FOR doc IN collection
FILTER doc.value1 == @value1 /* uses the index */
FILTER ABS(doc.value2) != @value2 /* does not use the index */
RETURN doc
This query’s execution plan looks as follows:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexNode 0 - FOR doc IN collection /* persistent index scan (filter projections: `value2`) */ FILTER (ABS(doc.`value2`) != 2) /* early pruning */
7 ReturnNode 0 - RETURN doc
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Ranges
8 idx_1737498319258648576 persistent collection false false false 99.96 % [ `value1`, `value2` ] (doc.`value1` == 1)
The first filter condition is transformed to an index lookup, as you can tell
from the persistent index scan
comment and the Indexes used
section that
shows the range doc.`value` == 1
. The post-filter condition
FILTER ABS(doc.value2) != 2
can be recognized as such by the early pruning
comment that follows it.
The filter projections
mentioned in the above execution plan is an indicator
of the optimization being triggered.
Instead of fetching the full documents from the storage engine for all index entries that matched the index lookup condition, only those that also satisfy the index lookup post-filter condition are fetched. If the post-filter condition filters out a lot of documents, this optimization can significantly speed up queries that produce large result sets from index lookups but filter many of the documents away with post-filter conditions.
Note that the optimization can also be combined with regular projections, e.g. for the following query that returns a specific attribute from the documents only:
FOR doc IN collection
FILTER doc.value1 == @value1 /* uses the index */
FILTER ABS(doc.value2) != @value2 /* does not use the index */
RETURN doc.value3
That query’s execution plan combines projections from the index for the
post-filter condition (filter projections
) as well as regular projections
(projections
) for the processing parts of the query that follow the
post-filter condition:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
9 IndexNode 5000 - FOR doc IN collection /* persistent index scan (filter projections: `value2`) (projections: `value3`) */ FILTER (ABS(doc.`value2`) != 2) /* early pruning */
7 CalculationNode 5000 - LET #5 = doc.`value3` /* attribute expression */ /* collections used: doc : collection */
8 ReturnNode 5000 - RETURN #5
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Ranges
9 idx_1737498319258648576 persistent collection false false false 99.96 % [ `value1`, `value2` ] (doc.`value1` == 1)
The optimization is most effective for queries in which many documents would be selected by the index lookup condition, but many are filtered out by the post-filter condition.