Wildcard Search with ArangoSearch
Search for strings with placeholders that stand for one or many arbitrary characters
You can use the LIKE()
function and LIKE
operator for this search technique
to find strings that start with, contain, or end with a certain substring. You
can also search for complex patterns with multiple placeholders. Place the
special characters _
and %
as wildcards for any single or zero-or-more
characters in the search string to match multiple partial strings.
prefix%
%infix%
%suffix
%complex%pat_ern
Wildcard searching can be an alternative to tokenizing text into words and then searching for words in a particular order (Phrase and Proximity Search). It is especially useful if you want to search for substrings that include characters that are considered word boundaries like punctuation and whitespace and would normally get removed when tokenizing text.
Index acceleration
The ArangoSearch LIKE()
function
is backed by View indexes. In contrast, the
string LIKE()
function cannot utilize any
sort of index. This applies to the LIKE
operator,
too, which you can use instead of the function.
Another difference is that the ArangoSearch variant of the LIKE()
function does
not accept a third argument to make matching case-insensitive. You can control
this via Analyzers instead, also see
Case-insensitive Search with ArangoSearch.
Which of the two equally named functions is used is determined by the context.
The ArangoSearch variant is used in SEARCH
operations. It is also used when
you have an inverted index and use the LIKE()
function with two arguments or
the LIKE
operator in FILTER
operations. The string variant is used everywhere
else, including using the LIKE()
function with three arguments in FILTER
operations together with an inverted index.
You can further accelerate wildcard queries by creating a specialized wildcard
Analyzer and using it in a View or inverted index on the desired document
attributes. The wildcard
Analyzer creates n-grams that can internally be
used to quickly find candidates, followed by post-filtering to remove
false-positive matches. The n-gram size should be configured depending on the
data and expected wildcard search strings. The substrings between wildcards of a
search string that are at least as long as the n-gram size are used to find
candidates. For example, the longest substring of %up%if%ref%
is ref
.
With an n-gram size of 3
, the ref
substring can be looked up quickly to
find candidates, achieving a faster search as with a differently indexed attribute,
e.g. using the identity
Analyzer. With an n-gram size of 2
, it is also
fast to find candidates, as up
, if
, re
, and ef
can be used to look up
potential matches. However, with an n-gram size of 4
or greater, ArangoSearch
has to treat all entries as candidates for this search string because the
substrings between wildcards are shorter than the n-gram size. It needs to
post-filter more, which is slower as with a differently indexed attribute.
Therefore, the n-gram size should generally be set to a small number so that
virtually all wildcard searches can get a speedup from an attribute being indexed
with the wildcard
Analyzer. On the other hand, it should not be set
unnecessarily small, as the lookup performance degrades with many duplicate
n-grams in the index, which is more likely with shorter n-grams.
Wildcard Syntax
_
: A single arbitrary character%
: Zero, one or many arbitrary characters\\_
: A literal underscore\\%
: A literal percent sign
Literal backlashes require different amounts of escaping depending on the context:
\
in bind variables (Table view mode) in the web interface (automatically escaped to\\
unless the value is wrapped in double quotes and already escaped properly)\\
in bind variables (JSON view mode) and queries in the web interface\\
in bind variables in arangosh\\\\
in queries in arangosh- Double the amount compared to arangosh in shells that use backslashes for
escaping (
\\\\
in bind variables and\\\\\\\\
in queries)
Wildcard Search Examples
Dataset
View definition
db.imdb_vertices.ensureIndex({ name: "inv-exact", type: "inverted", fields: [ "title" ] });
db._createView("imdb", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-exact" } ] });
{
"links": {
"imdb_vertices": {
"fields": {
"title": {
"analyzers": [
"identity"
]
}
}
}
}
}
AQL queries
Match all titles that starts with The Matr
using LIKE()
,
where _
stands for a single wildcard character and %
for an arbitrary amount:
FOR doc IN imdb
SEARCH LIKE(doc.title, "The Matr%")
RETURN doc.title
Result |
---|
The Matrix Revisited |
The Matrix |
The Matrix Reloaded |
The Matrix Revolutions |
The Matrix Trilogy |
You can achieve the same with the STARTS_WITH()
function:
FOR doc IN imdb
SEARCH STARTS_WITH(doc.title, "The Matr")
RETURN doc.title
Match all titles that contain Mat
using LIKE()
:
FOR doc IN imdb
SEARCH LIKE(doc.title, "%Mat%")
RETURN doc.title
Result |
---|
The Matrix Revisited |
Gray Matters |
Show: A Night In The Life of Matchbox Twenty |
The Mating Habits of the Earthbound Human |
Dark Matter |
Dave Matthews & Tim Reynolds: Live at Radio City |
Once Upon A Mattress |
Tarzan and His Mate |
Donald in Mathmagic Land |
Das Geheimnis der Materie |
… |
Match all titles that end with rix
using LIKE()
:
FOR doc IN imdb
SEARCH LIKE(doc.title, "%rix")
RETURN doc.title
Result |
---|
Ben 10: Secret of the Omnitrix |
Pinchcliffe Grand Prix |
Hendrix |
The Matrix |
The Animatrix |
Les Douze travaux d’Astérix |
Vercingétorix |
Match all titles that have an H
as first letter, followed by two arbitrary
characters, followed by ry
and any amount of characters after that. It will
match titles starting with Harry
and Henry
:
FOR doc IN imdb
SEARCH LIKE(doc.title, "H__ry%")
RETURN doc.title
Result |
---|
Henry & June |
Henry Rollins: Live in the Conversation Pit |
Henry Rollins: Uncut from NYC |
Harry Potter and the Sorcerer’s Stone |
Harry Potter and the Chamber Of Secrets |
… |
Use a bind parameter as input, but escape the characters with special meaning and perform a contains-style search by prepending and appending a percent sign:
FOR doc IN imdb
SEARCH LIKE(doc.title, CONCAT("%", SUBSTITUTE(@term, ["_", "%"], ["\\_", "\\%"]), "%"))
RETURN doc.title
Bind parameters:
{ "term": "y_" }
The query constructs the wildcard string %y\\_%
and will match Cry_Wolf
.
Wildcard Analyzer Examples
You can index attributes with a wildcard
Analyzer to improve the wildcard
search performance.
Dataset
Custom Analyzer
Create a wildcard
Analyzer with a small n-gram size, for example, in arangosh.
You can let the Analyzer pre-process the inputs, like with a norm
Analyzer to
enable case-insensitive and accent-insensitive wildcard search. Create a matching
standalone Analyzer so that you can pre-process wildcard search strings in the
same fashion:
var analyzers = require("@arangodb/analyzers");
var wildcardNormAnalyzer = analyzers.save("wildcard_norm", "wildcard", { ngramSize: 3, analyzer: { type: "norm", properties: { locale: "en", accent: false, case: "lower" } } }, ["frequency", "norm"]);
var matchingNormAnalyzer = analyzers.save("only_norm", "norm", { locale: "en", accent: false, case: "lower" });
The frequency
and norm
Analyzer features
are set for improved performance of simple wildcard searches at the expense of
higher memory usage.
View definition
db.imdb_vertices.ensureIndex({ name: "inv-wildcard", type: "inverted", fields: [ { name: "description", analyzer: "wildcard_norm" } ] });
db._createView("imdb", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-wildcard" } ] });
{
"links": {
"imdb_vertices": {
"fields": {
"description": {
"analyzers": [
"wildcard_norm"
]
}
}
}
}
}
AQL queries
Match movie descriptions that end with AY!
case-insensitively by taking the
search string and applying the norm
Analyzer that matches the one of the
wildcard
Analyzer (both normalizing to lowercase), and then using the LIKE
operator to perform a wildcard search:
LET search = "%AY!"
LET searchNormalized = TOKENS(search, "only_norm")[0]
FOR doc IN imdb
SEARCH doc.description LIKE searchNormalized
RETURN doc.description
LET search = "%AY!"
LET searchNormalized = TOKENS(search, "only_norm")[0]
FOR doc IN imdb
SEARCH ANALYZER(doc.description LIKE searchNormalized, "wildcard_norm")
RETURN doc.description
Result |
---|
Pitch, the mean-spirited devil, is trying to ruin Christmas. Santa Claus teams up … to save the day! |
Join Little Joe … all the way to Dodge Ball City -- with Little Joe’s faith being tested every step of the way! |
Surrounded by huge walls … they have to decide whether it’s better to do things their way or God’s way! |
Tautly directed by Jack Sholder … It’s Back to the Future meets Groundhog Day! |
Match movie descriptions that contain the following pattern (case-insensitive):
a letter C
, an arbitrary character, the letter T
, an arbitrary character,
a space, two hyphens, and another space.
LET search = "%C_T_ -- %"
LET searchNormalized = TOKENS(search, "only_norm")[0]
FOR doc IN imdb
SEARCH doc.description LIKE searchNormalized
RETURN {d: doc.description }
LET search = "%C_T_ -- %"
LET searchNormalized = TOKENS(search, "only_norm")[0]
FOR doc IN imdb
SEARCH ANALYZER(doc.description LIKE searchNormalized, "wildcard_norm")
RETURN {d: doc.description }
Result |
---|
New York detective John McClane is back … -- and his beloved city -- in a deadly game that demands their concentration. |
When Madame Adelaide Bonfamille leaves her … domesticated house cats -- the butler plots to steal the money and kidnaps the heirs, … |
Join Little Joe … all the way to Dodge Ball City -- with Little Joe’s faith being tested every step of the way! |