Wildcard Search with ArangoSearch

Search for strings with placeholders that stand for one or many arbitrary characters

You can use the LIKE() function and LIKE operator for this search technique to find strings that start with, contain, or end with a certain substring. You can also search for complex patterns with multiple placeholders. Place the special characters _ and % as wildcards for any single or zero-or-more characters in the search string to match multiple partial strings.

prefix%
%infix%
%suffix
%complex%pat_ern

Wildcard searching can be an alternative to tokenizing text into words and then searching for words in a particular order (Phrase and Proximity Search). It is especially useful if you want to search for substrings that include characters that are considered word boundaries like punctuation and whitespace and would normally get removed when tokenizing text.

Index acceleration

The ArangoSearch LIKE() function is backed by View indexes. In contrast, the string LIKE() function cannot utilize any sort of index. This applies to the LIKE operator, too, which you can use instead of the function. Another difference is that the ArangoSearch variant of the LIKE() function does not accept a third argument to make matching case-insensitive. You can control this via Analyzers instead, also see Case-insensitive Search with ArangoSearch. Which of the two equally named functions is used is determined by the context. The ArangoSearch variant is used in SEARCH operations. It is also used when you have an inverted index and use the LIKE() function with two arguments or the LIKE operator in FILTER operations. The string variant is used everywhere else, including using the LIKE() function with three arguments in FILTER operations together with an inverted index.

Wildcard Syntax

  • _: A single arbitrary character
  • %: Zero, one or many arbitrary characters
  • \\_: A literal underscore
  • \\%: A literal percent sign

Literal backlashes require different amounts of escaping depending on the context:

  • \ in bind variables (Table view mode) in the web interface (automatically escaped to \\ unless the value is wrapped in double quotes and already escaped properly)
  • \\ in bind variables (JSON view mode) and queries in the web interface
  • \\ in bind variables in arangosh
  • \\\\ in queries in arangosh
  • Double the amount compared to arangosh in shells that use backslashes for escaping (\\\\ in bind variables and \\\\\\\\ in queries)

Wildcard Search Examples

Dataset

IMDB movie dataset

View definition

db.imdb_vertices.ensureIndex({ name: "inv-exact", type: "inverted", fields: [ "title" ] });
db._createView("imdb", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-exact" } ] });
{
  "links": {
    "imdb_vertices": {
      "fields": {
        "title": {
          "analyzers": [
            "identity"
          ]
        }
      }
    }
  }
}

AQL queries

Match all titles that starts with The Matr using LIKE(), where _ stands for a single wildcard character and % for an arbitrary amount:

FOR doc IN imdb
  SEARCH LIKE(doc.title, "The Matr%")
  RETURN doc.title
Result
The Matrix Revisited
The Matrix
The Matrix Reloaded
The Matrix Revolutions
The Matrix Trilogy

You can achieve the same with the STARTS_WITH() function:

FOR doc IN imdb
  SEARCH STARTS_WITH(doc.title, "The Matr")
  RETURN doc.title

Match all titles that contain Mat using LIKE():

FOR doc IN imdb
  SEARCH LIKE(doc.title, "%Mat%")
  RETURN doc.title
Result
The Matrix Revisited
Gray Matters
Show: A Night In The Life of Matchbox Twenty
The Mating Habits of the Earthbound Human
Dark Matter
Dave Matthews & Tim Reynolds: Live at Radio City
Once Upon A Mattress
Tarzan and His Mate
Donald in Mathmagic Land
Das Geheimnis der Materie

Match all titles that end with rix using LIKE():

FOR doc IN imdb
  SEARCH LIKE(doc.title, "%rix")
  RETURN doc.title
Result
Ben 10: Secret of the Omnitrix
Pinchcliffe Grand Prix
Hendrix
The Matrix
The Animatrix
Les Douze travaux d’Astérix
Vercingétorix

Match all titles that have an H as first letter, followed by two arbitrary characters, followed by ry and any amount of characters after that. It will match titles starting with Harry and Henry:

FOR doc IN imdb
  SEARCH LIKE(doc.title, "H__ry%")
  RETURN doc.title
Result
Henry & June
Henry Rollins: Live in the Conversation Pit
Henry Rollins: Uncut from NYC
Harry Potter and the Sorcerer’s Stone
Harry Potter and the Chamber Of Secrets

Use a bind parameter as input, but escape the characters with special meaning and perform a contains-style search by prepending and appending a percent sign:

FOR doc IN imdb
  SEARCH LIKE(doc.title, CONCAT("%", SUBSTITUTE(@term, ["_", "%"], ["\\_", "\\%"]), "%"))
  RETURN doc.title

Bind parameters:

{ "term": "y_" }

The query constructs the wildcard string %y\\_% and will match Cry_Wolf.