ArangoDB v3.10 reached End of Life (EOL) and is no longer supported.
This documentation is outdated. Please see the most recent stable version.
ArangoSearch functions in AQL
ArangoSearch offers various AQL functions for search queries to control the search context, for filtering, and for scoring.
You can form search expressions by composing ArangoSearch function calls, logical operators and comparison operators. This allows you to filter Views as well as to utilize inverted indexes to filter collections.
The AQL SEARCH operation accepts search expressions, such as PHRASE(doc.text, "foo bar", "text_en"), for querying Views. You can combine ArangoSearch filter and context functions as well as operators like AND and OR to form complex search conditions. Similarly, the FILTER operation accepts such search expressions when using inverted indexes.
Scoring functions allow you to rank matches and to sort results by relevance. They are limited to Views.
Search highlighting functions let you retrieve the string positions of matches. They are limited to Views.
You can also use most functions without an inverted index or a View and the SEARCH keyword, but then they are not accelerated by an index.
See Information Retrieval with ArangoSearch for an introduction.
Context Functions
ANALYZER()
ANALYZER(expr, analyzer) → retVal
Sets the Analyzer for the given search expression.
The ANALYZER() function is only applicable for queries against arangosearch Views.
In queries against search-alias Views and inverted indexes, you don’t need to specify Analyzers because every field can be indexed with a single Analyzer only and they are inferred from the index definition.
The default Analyzer is identity for any search expression that is used for filtering arangosearch Views. This utility function can be used to wrap a complex expression to set a particular Analyzer. It also sets it for all the nested functions which require such an argument to avoid repeating the Analyzer parameter. If an Analyzer argument is passed to a nested function regardless, then it takes precedence over the Analyzer set via ANALYZER().
The TOKENS() function is an exception. It requires the Analyzer name to be passed in all cases even if wrapped in an ANALYZER() call, because it is not an ArangoSearch function but a regular string function which can be used outside of SEARCH operations.
- expr (expression): any valid search expression
- analyzer (string): name of an Analyzer.
- returns retVal (any): the expression result that it wraps
Example: Using a custom Analyzer
Assuming a View definition with an Analyzer whose name and type is delimiter:
{
"links": {
"coll": {
"analyzers": [ "delimiter" ],
"includeAllFields": true
}
},
...
}
… with the Analyzer properties { "delimiter": "|" } and an example document { "text": "foo|bar|baz" } in the collection coll, the following query would return the document:
FOR doc IN viewName
SEARCH ANALYZER(doc.text == "bar", "delimiter")
RETURN doc
The expression doc.text == "bar" has to be wrapped by ANALYZER() in order to set the Analyzer to delimiter. Otherwise the expression would be evaluated with the default identity Analyzer. "foo|bar|baz" == "bar" would not match, but the View does not even process the indexed fields with the identity Analyzer. The following query would also return an empty result because of the Analyzer mismatch:
FOR doc IN viewName
SEARCH doc.text == "foo|bar|baz"
//SEARCH ANALYZER(doc.text == "foo|bar|baz", "identity")
RETURN doc
Example: Setting the Analyzer context with and without ANALYZER()
In the query below, the search expression is wrapped by ANALYZER() to set the text_en Analyzer for both PHRASE() functions:
FOR doc IN viewName
SEARCH ANALYZER(PHRASE(doc.text, "foo") OR PHRASE(doc.text, "bar"), "text_en")
RETURN doc
Without the usage of ANALYZER():
FOR doc IN viewName
SEARCH PHRASE(doc.text, "foo", "text_en") OR PHRASE(doc.text, "bar", "text_en")
RETURN doc
Example: Analyzer precedence and specifics of the TOKENS() function
In the following example ANALYZER() is used to set the Analyzer text_en, but in the second call to PHRASE() a different Analyzer is set (identity) which overrules ANALYZER(). Therefore, the text_en Analyzer is used to find the phrase foo and the identity Analyzer to find bar:
FOR doc IN viewName
SEARCH ANALYZER(PHRASE(doc.text, "foo") OR PHRASE(doc.text, "bar", "identity"), "text_en")
RETURN doc
Despite the wrapping ANALYZER() function, the Analyzer name cannot be omitted in calls to the TOKENS() function. Both occurrences of text_en are required, to set the Analyzer for the expression doc.text IN ... and for the TOKENS() function itself. This is because the TOKENS() function is a regular string function that does not take the Analyzer context into account:
FOR doc IN viewName
SEARCH ANALYZER(doc.text IN TOKENS("foo", "text_en"), "text_en")
RETURN doc
BOOST()
BOOST(expr, boost) → retVal
Overrides the boost in the context of a search expression with a specified value, making it available for scorer functions. By default, the context has a boost value equal to 1.0.
- expr (expression): any valid search expression
- boost (number): numeric boost value
- returns retVal (any): the expression result that it wraps
Example: Boosting a search sub-expression
FOR doc IN viewName
SEARCH ANALYZER(BOOST(doc.text == "foo", 2.5) OR doc.text == "bar", "text_en")
LET score = BM25(doc)
SORT score DESC
RETURN { text: doc.text, score }
Assuming a View with the following documents indexed and processed by the text_en Analyzer:
{ "text": "foo bar" }
{ "text": "foo" }
{ "text": "bar" }
{ "text": "foo baz" }
{ "text": "baz" }
… the result of above query would be:
[
{
"text": "foo bar",
"score": 2.787301540374756
},
{
"text": "foo baz",
"score": 1.6895781755447388
},
{
"text": "foo",
"score": 1.525835633277893
},
{
"text": "bar",
"score": 0.9913395643234253
}
]
Filter Functions
EXISTS()
With arangosearch Views, the EXISTS() function only matches values if you set the storeValues link property to "id" in the View definition (the default is "none").
Testing for attribute presence
EXISTS(path)
Match documents where the attribute at path is present.
- path (attribute path expression): the attribute to test in the document
- returns nothing: the function evaluates to a boolean, but this value cannot be returned. The function can only be called in a search expression. It throws an error if used outside of a SEARCH operation or a FILTER operation that uses an inverted index.
FOR doc IN viewName
SEARCH EXISTS(doc.text)
RETURN doc
Testing for attribute type
EXISTS(path, type)
Match documents where the attribute at path is present and is of the specified data type.
- path (attribute path expression): the attribute to test in the document
- type (string): data type to test for, can be one of:
  - "null"
  - "bool" / "boolean"
  - "numeric"
  - "type" (matches null, boolean, and numeric values)
  - "string"
  - "analyzer" (see below)
- returns nothing: the function evaluates to a boolean, but this value cannot be returned. The function can only be called in a search expression. It throws an error if used outside of a SEARCH operation or a FILTER operation that uses an inverted index.
FOR doc IN viewName
SEARCH EXISTS(doc.text, "string")
RETURN doc
Testing for Analyzer index status
EXISTS(path, "analyzer", analyzer)
Match documents where the attribute at path is present and was indexed by the specified analyzer.
- path (attribute path expression): the attribute to test in the document
- type (string): string literal "analyzer"
- analyzer (string, optional): name of an Analyzer. Uses the Analyzer of a wrapping ANALYZER() call if not specified or defaults to "identity"
- returns nothing: the function evaluates to a boolean, but this value cannot be returned. The function can only be called in a search expression. It throws an error if used outside of a SEARCH operation or a FILTER operation that uses an inverted index.
FOR doc IN viewName
SEARCH EXISTS(doc.text, "analyzer", "text_en")
RETURN doc
Testing for nested fields
EXISTS(path, "nested")
Match documents where the attribute at path is present and is indexed as a nested field for nested search with Views or inverted indexes.
- path (attribute path expression): the attribute to test in the document
- type (string): string literal "nested"
- returns nothing: the function evaluates to a boolean, but this value cannot be returned. The function can only be called in a search expression. It throws an error if used outside of a SEARCH operation or a FILTER operation that uses an inverted index.
Examples
Only return documents from the View viewName whose text attribute is indexed as a nested field:
FOR doc IN viewName
SEARCH EXISTS(doc.text, "nested")
RETURN doc
Only return documents whose attr attribute and its nested text attribute are indexed as nested fields:
FOR doc IN viewName
SEARCH doc.attr[? FILTER EXISTS(CURRENT.text, "nested")]
RETURN doc
Only return documents from the collection coll whose text attribute is indexed as a nested field by an inverted index:
FOR doc IN coll OPTIONS { indexHint: "inv-idx", forceIndexHint: true }
FILTER EXISTS(doc.text, "nested")
RETURN doc
Only return documents whose attr attribute and its nested text attribute are indexed as nested fields:
FOR doc IN coll OPTIONS { indexHint: "inv-idx", forceIndexHint: true }
FILTER doc.attr[? FILTER EXISTS(CURRENT.text, "nested")]
RETURN doc
IN_RANGE()
IN_RANGE(path, low, high, includeLow, includeHigh) → included
Match documents where the attribute at path is greater than (or equal to) low and less than (or equal to) high.
You can use IN_RANGE() for searching more efficiently compared to an equivalent expression that combines two comparisons with a logical conjunction:
- IN_RANGE(path, low, high, true, true) instead of low <= value AND value <= high
- IN_RANGE(path, low, high, true, false) instead of low <= value AND value < high
- IN_RANGE(path, low, high, false, true) instead of low < value AND value <= high
- IN_RANGE(path, low, high, false, false) instead of low < value AND value < high
low and high can be numbers or strings (technically also null, true and false), but the data type must be the same for both.
The alphabetical order of characters is not taken into account by ArangoSearch, i.e. range queries in SEARCH operations against Views do not follow the language rules as per the defined Analyzer locale (except for the collation Analyzer) nor the server language (startup option --default-language)! Also see Known Issues.
There is a corresponding IN_RANGE() Miscellaneous Function that is used outside of SEARCH operations.
- path (attribute path expression): the path of the attribute to test in the document
- low (number|string): minimum value of the desired range
- high (number|string): maximum value of the desired range
- includeLow (bool): whether the minimum value shall be included in the range (left-closed interval) or not (left-open interval)
- includeHigh (bool): whether the maximum value shall be included in the range (right-closed interval) or not (right-open interval)
- returns included (bool): whether value is in the range
If low and high are the same, but includeLow and/or includeHigh is set to false, then nothing will match. If low is greater than high nothing will match either.
Example: Using numeric ranges
To match documents with the attribute value >= 3 and value <= 5 using the default "identity" Analyzer you would write the following query:
FOR doc IN viewName
SEARCH IN_RANGE(doc.value, 3, 5, true, true)
RETURN doc.value
This will also match documents which have an array of numbers as value
attribute where at least one of the numbers is in the specified boundaries.
Example: Using string ranges
Using string boundaries and a text Analyzer allows you to match documents which have at least one token within the specified character range:
FOR doc IN valView
SEARCH ANALYZER(IN_RANGE(doc.value, "a","f", true, false), "text_en")
RETURN doc
This will match { "value": "bar" } and { "value": "foo bar" } because the b of bar is in the range ("a" <= "b" < "f"), but not { "value": "foo" } because the f of foo is excluded (high is "f" but includeHigh is false).
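The range semantics described in this section can be modelled outside the database. The following is a minimal Python sketch, not ArangoDB code: the helper names in_range and any_in_range are hypothetical, the token lists stand in for whatever an Analyzer would actually produce, and Python's code point string comparison is only an assumed approximation of how indexed terms are ordered.

```python
def in_range(value, low, high, include_low, include_high):
    """Model of the interval test: bounds are closed or open per flag."""
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below


def any_in_range(values, low, high, include_low, include_high):
    """An array attribute or tokenized text matches if any element is in range."""
    return any(in_range(v, low, high, include_low, include_high) for v in values)


# Numeric range with both bounds included
print(in_range(4, 3, 5, True, True))
print(any_in_range([1, 5, 9], 3, 5, True, True))

# Equal bounds with one bound excluded, or low greater than high: no match
print(in_range(3, 3, 3, True, False))
print(in_range(4, 5, 3, True, True))

# String range from "a" (included) to "f" (excluded) over assumed tokens
print(any_in_range(["foo", "bar"], "a", "f", True, False))
print(any_in_range(["foo"], "a", "f", True, False))
```

The sketch only illustrates the inclusive and exclusive bound logic. It says nothing about how the View index evaluates the range internally.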
MIN_MATCH()
MIN_MATCH(expr1, ... exprN, minMatchCount) → fulfilled
Match documents where at least minMatchCount of the specified search expressions are satisfied.
There is a corresponding MIN_MATCH() Miscellaneous function that is used outside of SEARCH operations.
- expr (expression, repeatable): any valid search expression
- minMatchCount (number): minimum number of search expressions that should be satisfied
- returns fulfilled (bool): whether at least minMatchCount of the specified expressions are true
Example: Matching a subset of search sub-expressions
Assuming a View with a text Analyzer, you may use it to match documents where the attribute contains at least two out of three tokens:
LET t = TOKENS("quick brown fox", "text_en")
FOR doc IN viewName
SEARCH ANALYZER(MIN_MATCH(doc.text == t[0], doc.text == t[1], doc.text == t[2], 2), "text_en")
RETURN doc.text
This will match { "text": "the quick brown fox" } and { "text": "some brown fox" }, but not { "text": "snow fox" } which only fulfills one of the conditions.
Note that you can also use the AT LEAST array comparison operator in the specific case of matching a subset of tokens against a single attribute:
FOR doc IN viewName
SEARCH ANALYZER(TOKENS("quick brown fox", "text_en") AT LEAST (2) == doc.text, "text_en")
RETURN doc.text
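To make the counting rule concrete, here is a small Python sketch of the MIN_MATCH() logic. It is an illustration under assumptions: the function names are hypothetical, and the documents are represented as plain token sets rather than real Analyzer output.

```python
def min_match(conditions, min_match_count):
    """True if at least min_match_count of the boolean conditions hold."""
    return sum(1 for c in conditions if c) >= min_match_count


def matches(doc_tokens, search_tokens, min_match_count):
    """One condition per search token: is the token present in the document?"""
    return min_match([t in doc_tokens for t in search_tokens], min_match_count)


search_tokens = ["quick", "brown", "fox"]

print(matches({"the", "quick", "brown", "fox"}, search_tokens, 2))
print(matches({"some", "brown", "fox"}, search_tokens, 2))
print(matches({"snow", "fox"}, search_tokens, 2))
```

The first two token sets satisfy at least two of the three conditions, the last one satisfies only one.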
MINHASH_MATCH()
MINHASH_MATCH(path, target, threshold, analyzer) → fulfilled
Match documents with an approximate Jaccard similarity of at least the threshold, approximated with the specified minhash Analyzer.
To only compute the MinHash signatures, see the MINHASH() Miscellaneous function.
- path (attribute path expression|string): the path of the attribute in a document or a string
- target (string): the string to hash with the specified Analyzer and to compare against the stored attribute
- threshold (number, optional): a value between 0.0 and 1.0.
- analyzer (string): the name of a minhash Analyzer.
- returns fulfilled (bool): true if the approximate Jaccard similarity is greater than or equal to the specified threshold, false otherwise
Example: Find documents with a text similar to a target text
Assuming a View with a minhash Analyzer, you can use the stored MinHash signature to find candidates for the more expensive Jaccard similarity calculation:
LET target = "the quick brown fox jumps over the lazy dog"
LET targetSignature = TOKENS(target, "myMinHash")
FOR doc IN viewName
SEARCH MINHASH_MATCH(doc.text, target, 0.5, "myMinHash") // approximation
LET jaccard = JACCARD(targetSignature, TOKENS(doc.text, "myMinHash"))
FILTER jaccard > 0.75
SORT jaccard DESC
RETURN doc.text
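For reference, the exact Jaccard similarity that MinHash approximates is the size of the intersection of two sets divided by the size of their union. The Python sketch below computes it on whitespace-separated words, which is an assumption made for illustration. The query above applies it to MinHash signatures instead, and the return value for two empty inputs is a convention chosen for this sketch.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


target = "the quick brown fox jumps over the lazy dog".split()
candidate = "the quick brown dog".split()

# 4 shared words out of 8 distinct words overall
print(jaccard(target, candidate))
```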
NGRAM_MATCH()
Introduced in: v3.7.0
NGRAM_MATCH(path, target, threshold, analyzer) → fulfilled
Match documents whose attribute value has an n-gram similarity higher than the specified threshold compared to the target value.
The similarity is calculated by counting how long the longest sequence of matching n-grams is, divided by the target’s total n-gram count. Only fully matching n-grams are counted.
The n-grams for both attribute and target are produced by the specified Analyzer. Increasing the n-gram length will increase accuracy, but reduce error tolerance. In most cases a size of 2 or 3 will be a good choice.
Also see the String Functions NGRAM_POSITIONAL_SIMILARITY() and NGRAM_SIMILARITY() for calculating n-gram similarity that cannot be accelerated by a View index.
- path (attribute path expression|string): the path of the attribute in a document or a string
- target (string): the string to compare against the stored attribute
- threshold (number, optional): a value between 0.0 and 1.0. Defaults to 0.7 if none is specified.
- analyzer (string): the name of an Analyzer.
- returns fulfilled (bool): true if the evaluated n-gram similarity value is greater than or equal to the specified threshold, false otherwise
Use an Analyzer of type ngram with preserveOriginal: false and min equal to max. Otherwise, the similarity score calculated internally will be lower than expected.
The Analyzer must have the "position" and "frequency" features enabled or the NGRAM_MATCH() function will not find anything.
Example: Using a custom bigram Analyzer
Given a View indexing an attribute text, a custom n-gram Analyzer "bigram" (min: 2, max: 2, preserveOriginal: false, streamType: "utf8") and a document { "text": "quick red fox" }, the following query would match it (with a threshold of 1.0):
FOR doc IN viewName
SEARCH NGRAM_MATCH(doc.text, "quick fox", "bigram")
RETURN doc.text
The following will also match (note the low threshold value):
FOR doc IN viewName
SEARCH NGRAM_MATCH(doc.text, "quick blue fox", 0.4, "bigram")
RETURN doc.text
The following will not match (note the high threshold value):
FOR doc IN viewName
SEARCH NGRAM_MATCH(doc.text, "quick blue fox", 0.9, "bigram")
RETURN doc.text
Example: Using constant values
NGRAM_MATCH() can be called with constant arguments, but for such calls the analyzer argument is mandatory (even for calls inside of a SEARCH clause):
FOR doc IN viewName
SEARCH NGRAM_MATCH("quick fox", "quick blue fox", 0.9, "bigram")
RETURN doc.text
RETURN NGRAM_MATCH("quick fox", "quick blue fox", "bigram")
PHRASE()
PHRASE(path, phrasePart, analyzer)
PHRASE(path, phrasePart1, skipTokens1, ... phrasePartN, skipTokensN, analyzer)
PHRASE(path, [ phrasePart1, skipTokens1, ... phrasePartN, skipTokensN ], analyzer)
Search for a phrase in the referenced attribute. It only matches documents in which the tokens appear in the specified order. To search for tokens in any order use TOKENS() instead.
The phrase can be expressed as an arbitrary number of phraseParts separated by skipTokens number of tokens (wildcards), either as separate arguments or as an array as second argument.
- path (attribute path expression): the attribute to test in the document
- phrasePart (string|array|object): text to search for in the tokens. Can also be an array comprised of string, array and object tokens, or tokens interleaved with numbers of skipTokens. The specified analyzer is applied to string and array tokens, but not for object tokens.
- skipTokens (number, optional): amount of tokens to treat as wildcards
- analyzer (string, optional): name of an Analyzer. Uses the Analyzer of a wrapping ANALYZER() call if not specified or defaults to "identity"
- returns nothing: the function evaluates to a boolean, but this value cannot be returned. The function can only be called in a search expression. It throws an error if used outside of a SEARCH operation or a FILTER operation that uses an inverted index.
The selected Analyzer must have the "position" and "frequency" features enabled. The PHRASE() function will otherwise not find anything.
Object tokens
Introduced in v3.7.0
- {IN_RANGE: [low, high, includeLow, includeHigh]}: see IN_RANGE(). low and high can only be strings.
- {LEVENSHTEIN_MATCH: [token, maxDistance, transpositions, maxTerms, prefix]}:
  - token (string): a string to search
  - maxDistance (number): maximum Levenshtein / Damerau-Levenshtein distance
  - transpositions (bool, optional): if set to false, a Levenshtein distance is computed, otherwise a Damerau-Levenshtein distance (default)
  - maxTerms (number, optional): consider only a specified number of the most relevant terms. One can pass 0 to consider all matched terms, but it may impact performance negatively. The default value is 64.
  - prefix (string, optional): if defined, then a search for the exact prefix is carried out, using the matches as candidates. The Levenshtein / Damerau-Levenshtein distance is then computed for each candidate using the remainders of the strings. This option can improve performance in cases where there is a known common prefix. The default value is an empty string (introduced in v3.7.13, v3.8.1).
- {STARTS_WITH: [prefix]}: see STARTS_WITH(). Array brackets are optional
- {TERM: [token]}: equal to token but without Analyzer tokenization. Array brackets are optional
- {TERMS: [token1, ..., tokenN]}: one of token1, ..., tokenN can be found in specified position. Inside an array the object syntax can be replaced with the object field value, e.g., [..., [token1, ..., tokenN], ...].
- {WILDCARD: [token]}: see LIKE(). Array brackets are optional
An array token inside an array can be used in the TERMS case only.
Also see Example: Using object tokens.
Example: Using a text Analyzer for a phrase search
Given a View indexing an attribute text with the "text_en" Analyzer and a document { "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit" }, the following query would match it:
FOR doc IN viewName
SEARCH PHRASE(doc.text, "lorem ipsum", "text_en")
RETURN doc.text
However, this search expression does not because the tokens "ipsum" and "lorem" do not appear in this order:
PHRASE(doc.text, "ipsum lorem", "text_en")
Example: Skip tokens for a proximity search
To match "ipsum" and "amet" with any two tokens in between, you can use the following search expression:
PHRASE(doc.text, "ipsum", 2, "amet", "text_en")
The skipTokens value of 2 defines how many wildcard tokens have to appear between ipsum and amet. A skipTokens value of 0 means that the tokens must be adjacent. Negative values are allowed, but not very useful. These three search expressions are equivalent:
PHRASE(doc.text, "lorem ipsum", "text_en")
PHRASE(doc.text, "lorem", 0, "ipsum", "text_en")
PHRASE(doc.text, "ipsum", -1, "lorem", "text_en")
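The positional rule for non-negative skipTokens values can be sketched in Python as follows. This is a simplified model and not how ArangoSearch is implemented: phrase_match is a hypothetical helper, the tokens are a lowercase whitespace split rather than real text_en output, and negative skipTokens values are deliberately not modelled.

```python
def phrase_match(tokens, parts):
    """parts alternates tokens and skip counts, e.g. ["ipsum", 2, "amet"]."""
    words = parts[0::2]
    skips = parts[1::2]
    if any(s < 0 for s in skips):
        raise ValueError("negative skipTokens are not modelled in this sketch")
    for start, token in enumerate(tokens):
        if token != words[0]:
            continue
        pos = start
        matched = True
        for skip, word in zip(skips, words[1:]):
            pos += skip + 1  # skip wildcard tokens, then land on the next word
            if pos >= len(tokens) or tokens[pos] != word:
                matched = False
                break
        if matched:
            return True
    return False


tokens = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

print(phrase_match(tokens, ["ipsum", 2, "amet"]))   # two tokens in between
print(phrase_match(tokens, ["lorem", 0, "ipsum"]))  # adjacent tokens
print(phrase_match(tokens, ["ipsum", 0, "lorem"]))  # wrong order
```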
Example: Using PHRASE() with an array of tokens
The PHRASE() function also accepts an array as second argument with phrasePart and skipTokens parameters as elements.
FOR doc IN myView SEARCH PHRASE(doc.title, ["quick brown fox"], "text_en") RETURN doc
FOR doc IN myView SEARCH PHRASE(doc.title, ["quick", "brown", "fox"], "text_en") RETURN doc
This syntax variation enables the usage of computed expressions:
LET proximityCondition = [ "foo", ROUND(RAND()*10), "bar" ]
FOR doc IN viewName
SEARCH PHRASE(doc.text, proximityCondition, "text_en")
RETURN doc
LET tokens = TOKENS("quick brown fox", "text_en") // ["quick", "brown", "fox"]
FOR doc IN myView SEARCH PHRASE(doc.title, tokens, "text_en") RETURN doc
Above example is equivalent to the more cumbersome and static form:
FOR doc IN myView SEARCH PHRASE(doc.title, "quick", 0, "brown", 0, "fox", "text_en") RETURN doc
You can optionally specify the number of skipTokens in the array form before every string element:
FOR doc IN myView SEARCH PHRASE(doc.title, ["quick", 1, "fox", "jumps"], "text_en") RETURN doc
It is the same as the following:
FOR doc IN myView SEARCH PHRASE(doc.title, "quick", 1, "fox", 0, "jumps", "text_en") RETURN doc
Example: Handling of arrays with no members
Empty arrays are skipped:
FOR doc IN myView SEARCH PHRASE(doc.title, "quick", 1, [], 1, "jumps", "text_en") RETURN doc
The query is equivalent to:
FOR doc IN myView SEARCH PHRASE(doc.title, "quick", 2, "jumps", "text_en") RETURN doc
Providing only empty arrays is valid, but will yield no results.
Example: Using object tokens
Using object tokens STARTS_WITH, WILDCARD, LEVENSHTEIN_MATCH, TERMS and IN_RANGE:
FOR doc IN myView SEARCH PHRASE(doc.title,
{STARTS_WITH: ["qui"]}, 0,
{WILDCARD: ["b%o_n"]}, 0,
{LEVENSHTEIN_MATCH: ["foks", 2]}, 0,
{TERMS: ["jump", "run"]}, 0, // Analyzer not applied!
{IN_RANGE: ["over", "through", true, false]},
"text_en") RETURN doc
Note that the text_en Analyzer has stemming enabled, but for object tokens the Analyzer isn’t applied. {TERMS: ["jumps", "runs"]} would not match the indexed (and stemmed!) attribute value. Therefore, the trailing s which would be stemmed away is removed from both words manually in the example.
Above example is equivalent to:
FOR doc IN myView SEARCH PHRASE(doc.title,
[
{STARTS_WITH: "qui"}, 0,
{WILDCARD: "b%o_n"}, 0,
{LEVENSHTEIN_MATCH: ["foks", 2]}, 0,
["jumps", "runs"], 0, // Analyzer is applied using this syntax
{IN_RANGE: ["over", "through", true, false]}
], "text_en") RETURN doc
STARTS_WITH()
STARTS_WITH(path, prefix) → startsWith
Match the value of the attribute that starts with prefix. If the attribute is processed by a tokenizing Analyzer (type "text" or "delimiter") or if it is an array, then a single token/element starting with the prefix is sufficient to match the document.
The alphabetical order of characters is not taken into account by ArangoSearch, i.e. range queries in SEARCH operations against Views do not follow the language rules as per the defined Analyzer locale (except for the collation Analyzer) nor the server language (startup option --default-language)! Also see Known Issues.
There is a corresponding STARTS_WITH() String function that is used outside of SEARCH operations.
- path (attribute path expression): the path of the attribute to compare against in the document
- prefix (string): a string to search at the start of the text
- returns startsWith (bool): whether the specified attribute starts with the given prefix
STARTS_WITH(path, prefixes, minMatchCount) → startsWith
Introduced in: v3.7.1
Match the value of the attribute that starts with one of the prefixes, or optionally with at least minMatchCount of the prefixes.
- path (attribute path expression): the path of the attribute to compare against in the document
- prefixes (array): an array of strings to search at the start of the text
- minMatchCount (number, optional): minimum number of search prefixes that should be satisfied (see example). The default is 1
- returns startsWith (bool): whether the specified attribute starts with at least minMatchCount of the given prefixes
Example: Searching for an exact value prefix
To match a document { "text": "lorem ipsum..." } using a prefix and the "identity" Analyzer you can use it like this:
FOR doc IN viewName
SEARCH STARTS_WITH(doc.text, "lorem ip")
RETURN doc
Example: Searching for a prefix in text
This query will match { "text": "lorem ipsum" } as well as { "text": [ "lorem", "ipsum" ] } given a View which indexes the text attribute and processes it with the "text_en" Analyzer:
FOR doc IN viewName
SEARCH ANALYZER(STARTS_WITH(doc.text, "ips"), "text_en")
RETURN doc.text
Note that it will not match { "text": "IPS (in-plane switching)" } without modification to the query. The prefixes were passed to STARTS_WITH() as-is, but the built-in text_en Analyzer used for indexing has stemming enabled. So the indexed values are the following:
RETURN TOKENS("IPS (in-plane switching)", "text_en")
[
[
"ip",
"in",
"plane",
"switch"
]
]
The s is removed from ips, which leads to the prefix ips not matching the indexed token ip. You may either create a custom text Analyzer with stemming disabled to avoid this issue, or apply stemming to the prefixes:
FOR doc IN viewName
SEARCH ANALYZER(STARTS_WITH(doc.text, TOKENS("ips", "text_en")), "text_en")
RETURN doc.text
Example: Searching for one or multiple prefixes
The STARTS_WITH() function accepts an array of prefix alternatives of which only one has to match:
FOR doc IN viewName
SEARCH ANALYZER(STARTS_WITH(doc.text, ["something", "ips"]), "text_en")
RETURN doc.text
It will match a document { "text": "lorem ipsum" } but also { "text": "that is something" }, as at least one of the words starts with a given prefix.
The same query again, but with an explicit minMatchCount:
FOR doc IN viewName
SEARCH ANALYZER(STARTS_WITH(doc.text, ["wrong", "ips"], 1), "text_en")
RETURN doc.text
The number can be increased to require that at least this many prefixes must be present:
FOR doc IN viewName
SEARCH ANALYZER(STARTS_WITH(doc.text, ["lo", "ips", "something"], 2), "text_en")
RETURN doc.text
This will still match { "text": "lorem ipsum" } because at least two prefixes (lo and ips) are found, but not { "text": "that is something" } which only contains one of the prefixes (something).
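The minMatchCount rule counts how many of the given prefixes are found at the start of at least one token. A minimal Python sketch of that counting, assuming hypothetical pre-tokenized input (real Analyzer output may differ, for example through stemming) and a made-up helper name:

```python
def starts_with(tokens, prefixes, min_match_count=1):
    """True if at least min_match_count prefixes start some token."""
    if isinstance(prefixes, str):
        prefixes = [prefixes]
    hits = sum(1 for p in prefixes if any(t.startswith(p) for t in tokens))
    return hits >= min_match_count


# Assumed token lists, not actual text_en output
lorem = ["lorem", "ipsum"]
something = ["that", "is", "something"]

print(starts_with(lorem, ["something", "ips"]))
print(starts_with(something, ["something", "ips"]))
print(starts_with(lorem, ["lo", "ips", "something"], 2))
print(starts_with(something, ["lo", "ips", "something"], 2))
```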
LEVENSHTEIN_MATCH()
Introduced in: v3.7.0
LEVENSHTEIN_MATCH(path, target, distance, transpositions, maxTerms, prefix) → fulfilled
Match documents with a Damerau-Levenshtein distance lower than or equal to distance between the stored attribute value and target. It can optionally match documents using a pure Levenshtein distance.
See LEVENSHTEIN_DISTANCE() if you want to calculate the edit distance of two strings.
- path (attribute path expression|string): the path of the attribute to compare against in the document or a string
- target (string): the string to compare against the stored attribute
- distance (number): the maximum edit distance, which can be between 0 and 4 if transpositions is false, and between 0 and 3 if it is true
- transpositions (bool, optional): if set to false, a Levenshtein distance is computed, otherwise a Damerau-Levenshtein distance (default)
- maxTerms (number, optional): consider only a specified number of the most relevant terms. One can pass 0 to consider all matched terms, but it may impact performance negatively. The default value is 64.
- prefix (string, optional): if defined, then a search for the exact prefix is carried out, using the matches as candidates. The Levenshtein / Damerau-Levenshtein distance is then computed for each candidate using the target value and the remainders of the strings, which means that the prefix needs to be removed from target (see example). This option can improve performance in cases where there is a known common prefix. The default value is an empty string (introduced in v3.7.13, v3.8.1).
- returns fulfilled (bool): true if the calculated distance is less than or equal to distance, false otherwise
Example: Matching with and without transpositions
The Levenshtein distance between quick and quikc is 2 because it requires two operations to go from one to the other (remove k, insert k at a different position).
FOR doc IN viewName
SEARCH LEVENSHTEIN_MATCH(doc.text, "quikc", 2, false) // matches "quick"
RETURN doc.text
The Damerau-Levenshtein distance is 1 (move k to the end).
FOR doc IN viewName
SEARCH LEVENSHTEIN_MATCH(doc.text, "quikc", 1) // matches "quick"
RETURN doc.text
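The two distances can be checked with a short dynamic programming sketch in Python. Note the assumptions: with transpositions enabled it computes the optimal string alignment variant of the Damerau-Levenshtein distance, which may differ from what ArangoSearch uses internally, and the function name edit_distance is made up for this illustration.

```python
def edit_distance(a, b, transpositions=True):
    """Levenshtein distance, optionally counting adjacent swaps as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("quick", "quikc", transpositions=False))  # 2
print(edit_distance("quick", "quikc", transpositions=True))   # 1
```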
Example: Matching with prefix search
Match documents with a Damerau-Levenshtein distance of 1 with the prefix qui. The edit distance is calculated using the search term kc (quikc with the prefix qui removed) and the stored value without the prefix (e.g. ck). The prefix qui is constant. Transpositions have to be enabled here, because the pure Levenshtein distance between kc and ck is 2.
FOR doc IN viewName
SEARCH LEVENSHTEIN_MATCH(doc.text, "kc", 1, true, 64, "qui") // matches "quick"
RETURN doc.text
You may compute the prefix and suffix from the input string as follows:
LET input = "quikc"
LET prefixSize = 3
LET prefix = LEFT(input, prefixSize)
LET suffix = SUBSTRING(input, prefixSize)
FOR doc IN viewName
SEARCH LEVENSHTEIN_MATCH(doc.text, suffix, 1, true, 64, prefix) // matches "quick"
RETURN doc.text
Example: Basing the edit distance on string length
You may want to pick the maximum edit distance based on string length. If the stored attribute is the string quick and the target string is quicksands, then the Levenshtein distance is 5, with 50% of the characters mismatching. If the inputs are q and qu, then the distance is only 1, although it is also a 50% mismatch.
LET target = "input"
LET targetLength = LENGTH(target)
LET maxDistance = (targetLength > 5 ? 2 : (targetLength >= 3 ? 1 : 0))
FOR doc IN viewName
SEARCH LEVENSHTEIN_MATCH(doc.text, target, maxDistance, true)
RETURN doc.text
LIKE()
Introduced in: v3.7.2
LIKE(path, search) → bool
Check whether the pattern search is contained in the attribute denoted by path, using wildcard matching.
- _: A single arbitrary character
- %: Zero, one or many arbitrary characters
- \\_: A literal underscore
- \\%: A literal percent sign
Literal backslashes require different amounts of escaping depending on the context:
- \ in bind variables (Table view mode) in the web interface (automatically escaped to \\ unless the value is wrapped in double quotes and already escaped properly)
- \\ in bind variables (JSON view mode) and queries in the web interface
- \\ in bind variables in arangosh
- \\\\ in queries in arangosh
- Double the amount compared to arangosh in shells that use backslashes for escaping (\\\\ in bind variables and \\\\\\\\ in queries)
Searching with the LIKE()
function in the context of a SEARCH
operation
is backed by View indexes. The String LIKE()
function
is used in other contexts such as in FILTER
operations and cannot be
accelerated by any sort of index on the other hand. Another difference is that
the ArangoSearch variant does not accept a third argument to enable
case-insensitive matching. This can be controlled with Analyzers instead.
- path (attribute path expression): the path of the attribute to compare against in the document
- search (string): a search pattern that can contain the wildcard characters % (meaning any sequence of characters, including none) and _ (any single character). Literal % and _ must be escaped with backslashes.
- returns bool (bool): true if the pattern is contained in the attribute denoted by path, and false otherwise
Example: Searching with wildcards
FOR doc IN viewName
SEARCH ANALYZER(LIKE(doc.text, "foo%b_r"), "text_en")
RETURN doc.text
LIKE can also be used in operator form:
FOR doc IN viewName
SEARCH ANALYZER(doc.text LIKE "foo%b_r", "text_en")
RETURN doc.text
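The wildcard rules described above can be approximated in plain JavaScript. This is a rough sketch only: likeToRegExp is a hypothetical helper, not an ArangoDB function, and it tests a single string without modeling Analyzers or View indexes. Letting the wildcards span line breaks (the s flag) is an assumption.

```javascript
// Hypothetical helper (not an ArangoDB API): translate a LIKE-style pattern
// into an anchored regular expression.
function likeToRegExp(pattern) {
  // Escape characters that have a special meaning in regular expressions.
  const escape = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\" && i + 1 < pattern.length) {
      source += escape(pattern[++i]); // escaped character is taken literally
    } else if (ch === "%") {
      source += ".*"; // zero, one or many arbitrary characters
    } else if (ch === "_") {
      source += "."; // exactly one arbitrary character
    } else {
      source += escape(ch);
    }
  }
  // Assumption: "s" lets wildcards span line breaks, "u" matches by code point.
  return new RegExp("^" + source + "$", "su");
}

console.log(likeToRegExp("foo%b_r").test("foobar")); // true
console.log(likeToRegExp("foo%b_r").test("foobr"));  // false
console.log(likeToRegExp("100\\%").test("100%"));    // true
console.log(likeToRegExp("100\\%").test("1000"));    // false
```

Note that the JavaScript string literal "100\\%" holds a single backslash, in line with the escaping rules for bind variables in arangosh listed above.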
Geo functions
The following functions can be accelerated by View indexes. There are corresponding Geo Functions for the regular geo index type, but also general purpose functions such as GeoJSON constructors that can be used in conjunction with ArangoSearch.
GEO_CONTAINS()
Introduced in: v3.8.0
GEO_CONTAINS(geoJsonA, geoJsonB) → bool
Checks whether the GeoJSON object geoJsonA fully contains geoJsonB (every point in B is also in A).
- geoJsonA (object|array): first GeoJSON object or coordinate array (in longitude, latitude order)
- geoJsonB (object|array): second GeoJSON object or coordinate array (in longitude, latitude order)
- returns bool (bool): true when every point in B is also contained in A, false otherwise
GEO_DISTANCE()
Introduced in: v3.8.0
GEO_DISTANCE(geoJsonA, geoJsonB) → distance
Returns the distance between two GeoJSON objects, measured from the centroid of each shape.
- geoJsonA (object|array): first GeoJSON object or coordinate array (in longitude, latitude order)
- geoJsonB (object|array): second GeoJSON object or coordinate array (in longitude, latitude order)
- returns distance (number): the distance between the centroid points of the two objects on the reference ellipsoid
GEO_IN_RANGE()
Introduced in: v3.8.0
GEO_IN_RANGE(geoJsonA, geoJsonB, low, high, includeLow, includeHigh) → bool
Checks whether the distance between two GeoJSON objects lies within a given interval. The distance is measured from the centroid of each shape.
- geoJsonA (object|array): first GeoJSON object or coordinate array (in longitude, latitude order)
- geoJsonB (object|array): second GeoJSON object or coordinate array (in longitude, latitude order)
- low (number): minimum value of the desired range
- high (number): maximum value of the desired range
- includeLow (bool, optional): whether the minimum value shall be included in the range (left-closed interval) or not (left-open interval). The default value is true
- includeHigh (bool, optional): whether the maximum value shall be included in the range (right-closed interval) or not (right-open interval). The default value is true
- returns bool (bool): whether the evaluated distance lies within the range
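The interval semantics of includeLow and includeHigh can be illustrated with a short JavaScript sketch. The function distanceInRange is a hypothetical helper, not an ArangoDB API. It only models the range test applied to an already computed distance, not the distance calculation itself.

```javascript
// Hypothetical helper (not an ArangoDB API): the interval test applied to a
// computed distance, following the parameter descriptions above.
function distanceInRange(distance, low, high, includeLow = true, includeHigh = true) {
  const aboveLow = includeLow ? distance >= low : distance > low;
  const belowHigh = includeHigh ? distance <= high : distance < high;
  return aboveLow && belowHigh;
}

console.log(distanceInRange(1000, 1000, 5000));              // true  (closed interval by default)
console.log(distanceInRange(1000, 1000, 5000, false));       // false (left-open interval)
console.log(distanceInRange(5000, 1000, 5000, true, false)); // false (right-open interval)
```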
GEO_INTERSECTS()
Introduced in: v3.8.0
GEO_INTERSECTS(geoJsonA, geoJsonB) → bool
Checks whether the GeoJSON object geoJsonA intersects with geoJsonB (i.e. at least one point of B is in A or vice versa).
- geoJsonA (object|array): first GeoJSON object or coordinate array (in longitude, latitude order)
- geoJsonB (object|array): second GeoJSON object or coordinate array (in longitude, latitude order)
- returns bool (bool): true if A and B intersect, false otherwise
Scoring Functions
Scoring functions return a ranking value for the documents found by a SEARCH operation. The better the documents match the search expression, the higher the returned number.
The first argument to any scoring function is always the document emitted by a FOR operation over an arangosearch View.
To sort the result set by relevance, with the more relevant documents coming first, sort in descending order by the score (e.g. SORT BM25(...) DESC).
You may calculate custom scores based on a scoring function using document attributes and numeric functions (e.g. TFIDF(doc) * LOG(doc.value)):
FOR movie IN imdbView
SEARCH PHRASE(movie.title, "Star Wars", "text_en")
SORT BM25(movie) * LOG(movie.runtime + 1) DESC
RETURN movie
Sorting by more than one score is allowed. You may also sort by a mix of scores and attributes from multiple Views as well as collections:
FOR a IN viewA
FOR c IN coll
FOR b IN viewB
SORT TFIDF(b), c.name, BM25(a)
...
BM25()
BM25(doc, k, b) → score
Scores documents using the Best Matching 25 algorithm (Okapi BM25), so that results can be sorted by relevance.
- doc (document): must be emitted by FOR ... IN viewName
- k (number, optional): calibrates the text term frequency scaling. The value needs to be non-negative (0.0 or higher), or the returned score is an undefined value that may cause unpredictable results. The default is 1.2. A k value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency
- b (number, optional): determines the scaling by the total text length. The value needs to be between 0.0 and 1.0 (inclusive), or the returned score is an undefined value that may cause unpredictable results. The default is 0.75. At the extreme values of the coefficient b, BM25 turns into the ranking functions known as:
  - BM11 for b = 1 (corresponds to fully scaling the term weight by the total text length)
  - BM15 for b = 0 (corresponds to no length normalization)
- returns score (number): computed ranking value
The Analyzers used for indexing must have the "frequency" feature enabled. The BM25() function will otherwise return a score of 0. The Analyzers should have the "norm" feature enabled, too, or normalization will be disabled, which is not meaningful for BM25 and BM11. BM15 does not need the "norm" feature as it has no length normalization.
Example: Sorting by default BM25() score
Sorting by relevance with BM25 at default settings:
FOR doc IN viewName
SEARCH ...
SORT BM25(doc) DESC
RETURN doc
Example: Sorting with tuned BM25() ranking
Sorting by relevance, with double-weighted term frequency and with full text length normalization:
FOR doc IN viewName
SEARCH ...
SORT BM25(doc, 2.4, 1) DESC
RETURN doc
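To get a feeling for what k and b do, the following JavaScript sketch evaluates the textbook Okapi BM25 weight of a single term in a single document. It is an illustration under stated assumptions: bm25TermScore and its parameters tf, docLen, avgDocLen, and idf are hypothetical names, and the formula is the commonly published one, which is not necessarily identical to what ArangoSearch computes internally.

```javascript
// Textbook Okapi BM25 weight of one term in one document.
// Hypothetical helper for illustration; ArangoSearch's internal formula may differ.
function bm25TermScore(tf, docLen, avgDocLen, idf, k = 1.2, b = 0.75) {
  if (tf === 0) return 0; // the term does not occur in the document
  // b blends between no length scaling (b = 0) and full length scaling (b = 1)
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return idf * (tf * (k + 1)) / (tf + k * lengthNorm);
}

// k = 0: binary model, the term frequency no longer matters.
console.log(bm25TermScore(1, 100, 100, 1, 0) === bm25TermScore(9, 100, 100, 1, 0)); // true
// b = 0 (BM15): the document length has no influence.
console.log(bm25TermScore(3, 50, 100, 1, 1.2, 0) === bm25TermScore(3, 500, 100, 1, 1.2, 0)); // true
// b = 1 (BM11): longer documents score lower for the same term frequency.
console.log(bm25TermScore(3, 50, 100, 1, 1.2, 1) > bm25TermScore(3, 500, 100, 1, 1.2, 1)); // true
```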
TFIDF()
TFIDF(doc, normalize) → score
Scores documents using the term frequency–inverse document frequency algorithm (TF-IDF), so that results can be sorted by relevance.
- doc (document): must be emitted by FOR ... IN viewName
- normalize (bool, optional): specifies whether scores should be normalized. The default is false.
- returns score (number): computed ranking value
The Analyzers used for indexing must have the "frequency" feature enabled. The TFIDF() function will otherwise return a score of 0. The Analyzers need to have the "norm" feature enabled, too, if you want to use TFIDF() with the normalize parameter set to true.
Example: Sorting by default TFIDF() score
Sort by relevance using the TF-IDF score:
FOR doc IN viewName
SEARCH ...
SORT TFIDF(doc) DESC
RETURN doc
Example: Sorting by TFIDF() score with normalization
Sort by relevance using a normalized TF-IDF score:
FOR doc IN viewName
SEARCH ...
SORT TFIDF(doc, true) DESC
RETURN doc
Example: Sort by value and TFIDF()
Sort by the value of the text attribute in ascending order, then by the TFIDF score in descending order where the attribute values are equivalent:
FOR doc IN viewName
SEARCH ...
SORT doc.text, TFIDF(doc) DESC
RETURN doc
Search Highlighting Functions
Available in: ArangoDB Enterprise Edition, ArangoGraph
OFFSET_INFO()
OFFSET_INFO(doc, paths) → offsetInfo
Returns the attribute paths and substring offsets of matched terms, phrases, or n-grams for search highlighting purposes.
- doc (document): must be emitted by FOR ... IN viewName
- paths (string|array): a string or an array of strings, each describing an attribute and array element path you want to get the offsets for. Use . to access nested objects, and [n] with n being an array index to specify array elements. The attributes need to be indexed by Analyzers with the offset feature enabled.
- returns offsetInfo (array): an array of objects, limited to a default of 10 offsets per path. Each object has the following attributes:
  - name (array): the attribute and array element path as an array of strings and numbers. You can pass this name to the VALUE() function to dynamically look up the value.
  - offsets (array): an array of arrays with the matched positions. Each inner array has two elements with the start offset and the length of a match.
    The offsets describe the positions in bytes, not characters. You may need to account for characters encoded using multiple bytes.
OFFSET_INFO(doc, rules) → offsetInfo
- doc (document): must be emitted by FOR ... IN viewName
- rules (array): an array of objects with the following attributes:
  - name (string): an attribute and array element path you want to get the offsets for. Use . to access nested objects, and [n] with n being an array index to specify array elements. The attributes need to be indexed by Analyzers with the offset feature enabled.
  - options (object): an object with the following attributes:
    - maxOffsets (number, optional): the total number of offsets to collect per path. Default: 10.
    - limits (object, optional): an object with the following attributes:
      - term (number, optional): the total number of term offsets to collect per path. Default: 2^32.
      - phrase (number, optional): the total number of phrase offsets to collect per path. Default: 2^32.
      - ngram (number, optional): the total number of n-gram offsets to collect per path. Default: 2^32.
- returns offsetInfo (array): an array of objects, each with the following attributes:
  - name (array): the attribute and array element path as an array of strings and numbers. You can pass this name to the VALUE() function to dynamically look up the value.
  - offsets (array): an array of arrays with the matched positions, capped to the specified limits. Each inner array has two elements with the start offset and the length of a match.
    The start offsets and lengths describe the positions in bytes, not characters. You may need to account for characters encoded using multiple bytes.
Examples
Search a View and get the offset information for the matches:
db._query(`
FOR doc IN food_view
SEARCH ANALYZER(TOKENS("avocado tomato", "text_en_offset") ANY == doc.description.en, "text_en_offset")
RETURN OFFSET_INFO(doc, ["description.en"])`);
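Because the offsets are byte positions, client code cannot apply them directly to a JavaScript string, which is indexed by UTF-16 code units. The following sketch shows one way to resolve them. It assumes the offsets refer to the UTF-8 encoding of the attribute value, and extractMatches as well as the sample text and offsets are made up for illustration, not output of a real query.

```javascript
// Hypothetical helper (not an ArangoDB API): resolve [start, length] pairs,
// given in bytes, against a string. Assumes the offsets refer to UTF-8 bytes.
function extractMatches(text, offsets) {
  const bytes = new TextEncoder().encode(text); // UTF-8 encoded Uint8Array
  const decoder = new TextDecoder("utf-8");
  return offsets.map(([start, length]) =>
    decoder.decode(bytes.subarray(start, start + length))
  );
}

// "\u00e8" (e with grave accent) occupies two bytes in UTF-8 but one string index.
const sampleText = "cr\u00e8me avocado";
console.log(extractMatches(sampleText, [[7, 7]])); // [ 'avocado' ]
console.log(sampleText.slice(7, 14));              // 'vocado' (string index, off by one)
```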
For full examples, see Search Highlighting.