ArangoDB v3.13 is under development and not released yet. This documentation is not final and potentially incomplete.
Working with inverted indexes
Create an inverted index
Creates an inverted index for the collection collection-name, if it does not already exist. The call expects an object containing the index details.

cache boolean
Enable this option to always cache the field normalization values in memory for all fields by default. This can improve the performance of scoring and ranking queries. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Normalization values are computed for fields which are processed with Analyzers that have the "norm" feature enabled. These values are used to score results more fairly when the same tokens occur repeatedly, giving such documents less emphasis.

You can also enable this option to always cache auxiliary data used for querying fields that are indexed with Geo Analyzers in memory for all fields by default. This can improve the performance of geo-spatial queries.
Default:
false
This property is available in the Enterprise Edition only.
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).

cleanupIntervalStep integer
Wait at least this many commits between removing unused files in the ArangoSearch data directory (default: 2; use 0 to disable). For the case where the consolidation policies merge segments often (i.e. a lot of commit+consolidate), a lower value causes a lot of disk space to be wasted. For the case where the consolidation policies rarely merge segments (i.e. few inserts/deletes), a higher value impacts performance without any added benefits.
Background: With every “commit” or “consolidate” operation, a new state of the inverted index’s internal data structures is created on disk. Old states/snapshots are released once there are no longer any users remaining. However, the files for the released states/snapshots are left on disk, and only removed by a “cleanup” operation.
commitIntervalMsec integer
Wait at least this many milliseconds between committing inverted index data store changes and making documents visible to queries (default: 1000; use 0 to disable). For the case where there are a lot of inserts/updates, a higher value causes the index not to account for them, and memory usage continues to grow until the commit. A lower value impacts performance, including the case where there are no or only a few inserts/updates, because of synchronous locking, and it wastes disk space for each commit call.
Background: For data retrieval, ArangoSearch follows the concept of “eventually-consistent”, i.e. eventually all the data in ArangoDB will be matched by corresponding query expressions. The concept of ArangoSearch “commit” operations is introduced to control the upper-bound on the time until document addition/removals are actually reflected by corresponding query expressions. Once a “commit” operation is complete, all documents added/removed prior to the start of the “commit” operation will be reflected by queries invoked in subsequent ArangoDB transactions, in-progress ArangoDB transactions will still continue to return a repeatable-read state.
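As a sketch, the cleanup and commit intervals can be set directly in the index definition. The field name is illustrative, and the values shown are the defaults, not tuning recommendations:

```json
{
  "type": "inverted",
  "fields": ["title"],
  "cleanupIntervalStep": 2,
  "commitIntervalMsec": 1000
}
```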
consolidationIntervalMsec integer
Wait at least this many milliseconds between applying consolidationPolicy to consolidate the inverted index data store and possibly release space on the filesystem (default: 1000; use 0 to disable). For the case where there are a lot of data modification operations, a higher value could potentially have the data store consume more space and file handles. For the case where there are a few data modification operations, a lower value impacts performance due to no segment candidates being available for consolidation.

Background: For data modification, ArangoSearch follows the concept of a “versioned data store”. Thus old versions of data may be removed once there are no longer any users of the old data. The frequency of the cleanup and compaction operations are governed by consolidationIntervalMsec and the candidates for compaction are selected via consolidationPolicy.

consolidationPolicy object
The consolidation policy to apply for selecting which segments should be merged (default: {}).
Background: With each ArangoDB transaction that inserts documents, one or more ArangoSearch-internal segments get created. Similarly, for removed documents, the segments that contain such documents have these documents marked as ‘deleted’. Over time, this approach causes a lot of small and sparse segments to be created. A “consolidation” operation selects one or more segments and copies all of their valid documents into a single new segment, thereby allowing the search algorithm to perform more optimally and for extra file handles to be released once old segments are no longer used.
type string
The segment candidates for the “consolidation” operation are selected based upon several possible configurable formulas as defined by their types. The supported types are:

"tier" (default): consolidate based on segment byte size and live document count as dictated by the customization attributes.
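As a sketch, the consolidation interval and policy might be set together like this. The field name is illustrative, and the values shown are the defaults:

```json
{
  "type": "inverted",
  "fields": ["description"],
  "consolidationIntervalMsec": 1000,
  "consolidationPolicy": { "type": "tier" }
}
```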
fields* array of objects
An array of attribute paths. You can use strings to index the fields with the default options, or objects to specify options for the fields (with the attribute path in the name property), or a mix of both.

cache boolean
Enable this option to always cache the field normalization values in memory for this specific field. This can improve the performance of scoring and ranking queries. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Normalization values are computed for fields which are processed with Analyzers that have the "norm" feature enabled. These values are used to score results more fairly when the same tokens occur repeatedly, giving such documents less emphasis.

You can also enable this option to always cache auxiliary data used for querying fields that are indexed with Geo Analyzers in memory for this specific field. This can improve the performance of geo-spatial queries.
Default: the value defined by the top-level cache option.

This property is available in the Enterprise Edition only.
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).

features array of strings
A list of Analyzer features to use for this field. You can set this option to overwrite what features are enabled for the analyzer. Possible features:

"frequency"
"norm"
"position"
"offset"

Default: the features as defined by the Analyzer itself, or inherited from the top-level features option if the analyzer option adjacent to this option is not set.

includeAllFields boolean
This option only applies if you use the inverted index in a search-alias View.

If set to true, then all sub-attributes of this field are indexed, excluding any sub-attributes that are configured separately by other elements in the fields array (and their sub-attributes). The analyzer and features properties apply to the sub-attributes.

If set to false, then sub-attributes are ignored.

Default: the value defined by the top-level includeAllFields option.

nested array of objects
Index the specified sub-objects that are stored in an array. Unlike with the fields property, the values get indexed in a way that lets you query for co-occurring values. For example, you can search the sub-objects so that all conditions need to be met by a single sub-object instead of across all of them.

This property is available in the Enterprise Edition only.
cache boolean
Enable this option to always cache the field normalization values in memory for this specific nested field. This can improve the performance of scoring and ranking queries. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Normalization values are computed for fields which are processed with Analyzers that have the "norm" feature enabled. These values are used to score results more fairly when the same tokens occur repeatedly, giving such documents less emphasis.

You can also enable this option to always cache auxiliary data used for querying fields that are indexed with Geo Analyzers in memory for this specific nested field. This can improve the performance of geo-spatial queries.

Default: the value defined by the top-level cache option.

This property is available in the Enterprise Edition only.
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).

features array of strings
A list of Analyzer features to use for this field. You can set this option to overwrite what features are enabled for the analyzer. Possible features:

"frequency"
"norm"
"position"
"offset"

Default: the features as defined by the Analyzer itself, or inherited from the parent field’s or top-level features option if no analyzer option is set at a deeper level, closer to this option.

searchField boolean
This option only applies if you use the inverted index in a search-alias View.

You can set the option to true to get the same behavior as with arangosearch Views regarding the indexing of array values for this field. If enabled, both array and primitive values (strings, numbers, etc.) are accepted. Every element of an array is indexed according to the trackListPositions option.

If set to false, it depends on the attribute path. If it explicitly expands an array ([*]), then the elements are indexed separately. Otherwise, the array is indexed as a whole, but only geopoint and aql Analyzers accept array inputs. You cannot use an array expansion if searchField is enabled.

Default: the value defined by the top-level searchField option.
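Putting the nested options together, a sketch of an index over sub-objects stored in an array (the attribute names are hypothetical; this requires the Enterprise Edition):

```json
{
  "type": "inverted",
  "fields": [
    {
      "name": "parts",
      "nested": [
        { "name": "color", "analyzer": "identity" },
        { "name": "size", "features": ["frequency", "norm"] }
      ]
    }
  ]
}
```

With such a definition, a search for parts with a given color and size can require both conditions to match within the same sub-object.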
searchField boolean
This option only applies if you use the inverted index in a search-alias View.

You can set the option to true to get the same behavior as with arangosearch Views regarding the indexing of array values for this field. If enabled, both array and primitive values (strings, numbers, etc.) are accepted. Every element of an array is indexed according to the trackListPositions option.

If set to false, it depends on the attribute path. If it explicitly expands an array ([*]), then the elements are indexed separately. Otherwise, the array is indexed as a whole, but only geopoint and aql Analyzers accept array inputs. You cannot use an array expansion if searchField is enabled.

Default: the value defined by the top-level searchField option.

trackListPositions boolean
This option only applies if you use the inverted index in a search-alias View, and searchField needs to be true.

If set to true, then track the value position in arrays for array values. For example, when querying a document like { attr: [ "valueX", "valueY", "valueZ" ] }, you need to specify the array element, e.g. doc.attr[1] == "valueY".

If set to false, all values in an array are treated as equal alternatives. You don’t specify an array element in queries, e.g. doc.attr == "valueY", and all elements are searched for a match.

Default: the value defined by the top-level trackListPositions option.
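A sketch of per-field options in the fields array (the attribute names are hypothetical; text_en and identity are built-in Analyzers):

```json
{
  "type": "inverted",
  "fields": [
    {
      "name": "tags",
      "analyzer": "identity",
      "searchField": true,
      "trackListPositions": true
    },
    {
      "name": "description",
      "analyzer": "text_en",
      "features": ["frequency", "norm", "position"]
    }
  ]
}
```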
includeAllFields boolean
This option only applies if you use the inverted index in a search-alias View.

If set to true, then all document attributes are indexed, excluding any sub-attributes that are configured in the fields array (and their sub-attributes). The analyzer and features properties apply to the sub-attributes.

Default: false

Using includeAllFields for a lot of attributes in combination with complex Analyzers may significantly slow down the indexing process.

optimizeTopK array of strings
This option only applies if you use the inverted index in a search-alias View.

An array of strings defining sort expressions that you want to optimize. This is also known as WAND optimization (introduced in v3.12.0).

If you query a View with the SEARCH operation in combination with a SORT and LIMIT operation, search results can be retrieved faster if the SORT expression matches one of the optimized expressions.

Only sorting by highest rank is supported, that is, sorting by the result of a scoring function in descending order (DESC). Use @doc in the expression where you would normally pass the document variable emitted by the SEARCH operation to the scoring function.

You can define up to 64 expressions per View.

Example: ["BM25(@doc) DESC", "TFIDF(@doc, true) DESC"]

Default: []
This property is available in the Enterprise Edition only.
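A sketch of an index definition using the optimization expressions from the example above (the attribute names are hypothetical; Enterprise Edition only):

```json
{
  "type": "inverted",
  "fields": ["title", "description"],
  "optimizeTopK": ["BM25(@doc) DESC", "TFIDF(@doc, true) DESC"]
}
```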
primaryKeyCache boolean
Enable this option to always cache the primary key column in memory. This can improve the performance of queries that return many documents. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Default:
false
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).

primarySort object
You can define a primary sort order to enable an AQL optimization. If a query iterates over all documents of a collection, wants to sort them by attribute values, and the (left-most) fields to sort by, as well as their sorting direction, match with the primarySort definition, then the SORT operation is optimized away.

cache boolean
Enable this option to always cache the primary sort columns in memory. This can improve the performance of queries that utilize the primary sort order. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Default:
false
This property is available in the Enterprise Edition only.
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).
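A sketch of a primary sort definition (the attribute names are hypothetical, and the fields/direction structure shown is an assumption about the expected shape of the primarySort object):

```json
{
  "type": "inverted",
  "fields": ["price", "name"],
  "primarySort": {
    "fields": [
      { "field": "price", "direction": "desc" },
      { "field": "name", "direction": "asc" }
    ],
    "cache": true
  }
}
```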
searchField boolean
This option only applies if you use the inverted index in a search-alias View.

You can set the option to true to get the same behavior as with arangosearch Views regarding the indexing of array values as the default. If enabled, both array and primitive values (strings, numbers, etc.) are accepted. Every element of an array is indexed according to the trackListPositions option.

If set to false, it depends on the attribute path. If it explicitly expands an array ([*]), then the elements are indexed separately. Otherwise, the array is indexed as a whole, but only geopoint and aql Analyzers accept array inputs. You cannot use an array expansion if searchField is enabled.

Default: false
storedValues array of objects
The optional storedValues attribute can contain an array of objects with paths to additional attributes to store in the index. These additional attributes cannot be used for index lookups or for sorting, but they can be used for projections. This allows an index to fully cover more queries and avoid extra document lookups.

You may use the following shorthand notations on index creation instead of an array of objects. The default compression and cache settings are used in this case:

An array of strings, like ["attr1", "attr2"], to place each attribute into a separate column of the index (introduced in v3.10.3).

An array of arrays of strings, like [["attr1", "attr2"]], to place the attributes into a single column of the index, or [["attr1"], ["attr2"]] to place each attribute into a separate column. You can also mix it with the full form: [ ["attr1"], ["attr2", "attr3"], { "fields": ["attr4", "attr5"], "cache": true } ]
cache boolean
Enable this option to always cache stored values in memory. This can improve the query performance if stored values are involved. Otherwise, these values are memory-mapped and it is up to the operating system to load them from disk into memory and to evict them from memory.
Default:
false
This property is available in the Enterprise Edition only.
See the --arangosearch.columns-cache-limit startup option to control the memory consumption of this cache. You can reduce the memory usage of the column cache in cluster deployments by only using the cache for leader shards, see the --arangosearch.columns-cache-only-leader startup option (introduced in v3.10.6).
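A sketch combining the full form of storedValues with the cache option (the attribute names are hypothetical):

```json
{
  "type": "inverted",
  "fields": ["categories"],
  "storedValues": [
    { "fields": ["title"], "cache": true },
    { "fields": ["price", "stock"] }
  ]
}
```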
trackListPositions boolean
This option only applies if you use the inverted index in a search-alias View, and searchField needs to be true.

If set to true, then track the value position in arrays for array values. For example, when querying a document like { attr: [ "valueX", "valueY", "valueZ" ] }, you need to specify the array element, e.g. doc.attr[1] == "valueY".

If set to false, all values in an array are treated as equal alternatives. You don’t specify an array element in queries, e.g. doc.attr == "valueY", and all elements are searched for a match.

writebufferSizeMax integer

Maximum memory byte size per writer (segment) before a writer (segment) flush is triggered. A value of 0 turns off this limit for any writer (buffer), and data is flushed periodically based on the value defined for the flush thread (an ArangoDB server startup option). A value of 0 should be used carefully due to the high potential memory consumption (default: 33554432; use 0 to disable).
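As a sketch, the top-level array handling options and the write buffer limit might be combined as follows (the attribute name is hypothetical; the buffer size shown is the default):

```json
{
  "type": "inverted",
  "fields": ["tags"],
  "searchField": true,
  "trackListPositions": false,
  "writebufferSizeMax": 33554432
}
```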
Examples
Creating an inverted index:
curl -X POST --header 'accept: application/json' --data-binary @- --dump - http://localhost:8529/_api/index?collection=products
{
"type": "inverted",
"fields": [
"a",
{
"name": "b",
"analyzer": "text_en"
}
]
}