ArangoDB v3.13 is under development and not released yet. This documentation is not final and potentially incomplete.

Using Joins in AQL

Query examples for joining documents with one-to-many and many-to-many relationships

The two common scenarios when you want to join documents of collections are:

  • One-to-Many: You may have a users collection and a cities collection. A user lives in a city and you need the city information during a query about the user.

  • Many-To-Many: You may have a authors collection and a books collection. An author can write many books and a book can have many authors. You want to return a list of books with their authors. Therefore you need to join the authors and books.

Unlike many NoSQL databases, ArangoDB does support joins in AQL queries. This is similar to the way traditional relational databases handle this. However, because documents allow for more flexibility, joins are also more flexible. The following sections provide solutions for common questions.

So far, we have only dealt with one collection (users) at a time. We also have a collection relations that stores relationships between users. We now use this extra collection to create a result from two collections.

First of all, we query a few users together with their friends’ IDs. For that, we use all relations that have a value of friend in their type attribute. Relationships are established by using the friendOf and thisUser attributes in the relations collection, which point to the userId values in the users collection.

One-To-Many

You have a collection called users. Users live in city and a city is identified by its primary key. In principle you can embedded the city document into the users document and be happy with it.

{
  "_id" : "users/2151975421",
  "_key" : "2151975421",
  "_rev" : "2151975421",
  "name" : {
    "first" : "John",
    "last" : "Doe"
  },
  "city" : {
    "name" : "Metropolis"
  }
}

This works well for many use cases. Now assume that you have additional information about the city, like the number of people living in it. It would be impractical to change each and every user document if this numbers changes. Therefore it is good idea to hold the city information in a separate collection.

arangosh> db.cities.document("cities/2241300989");
{ 
  "population" : 1000, 
  "name" : "Metropolis", 
  "_id" : "cities/2241300989", 
  "_rev" : "2241300989", 
  "_key" : "2241300989" 
}

Instead of embedding the city directly in the user document, you can use the key of the city.

arangosh> db.users.document("users/2290649597");
{ 
  "name" : { 
    "first" : "John", 
    "last" : "Doe" 
  }, 
  "city" : "cities/2241300989", 
  "_id" : "users/2290649597", 
  "_rev" : "2290649597", 
  "_key" : "2290649597" 
}

We can now join these two collections very easily.

arangosh> db._query(
........>"FOR u IN users " + 
........>"  FOR c IN cities " + 
........>"    FILTER u.city == c._id RETURN { user: u, city: c }"
........>).toArray()
[ 
  { 
    "user" : { 
      "name" : { 
        "first" : "John", 
        "last" : "Doe" 
      }, 
      "city" : "cities/2241300989", 
      "_id" : "users/2290649597", 
      "_rev" : "2290649597", 
      "_key" : "2290649597" 
    }, 
    "city" : { 
      "population" : 1000, 
      "name" : "Metropolis", 
      "_id" : "cities/2241300989", 
      "_rev" : "2241300989", 
      "_key" : "2241300989" 
    } 
  } 
]

Unlike in SQL, there is no special JOIN keyword. The optimizer ensures that the primary index is used in the above query.

However, very often it is much more convenient for the client of the query if a single document would be returned, where the city information is embedded in the user document - as in the simple example above. With AQL, you do not need to forgo this simplification.

arangosh> db._query(
........>"FOR u IN users " + 
........>"  FOR c IN cities " + 
........>"    FILTER u.city == c._id RETURN merge(u, {city: c})"
........>).toArray()
[ 
  { 
    "_id" : "users/2290649597", 
    "_key" : "2290649597", 
    "_rev" : "2290649597", 
    "name" : { 
      "first" : "John", 
      "last" : "Doe" 
    }, 
    "city" : { 
      "_id" : "cities/2241300989", 
      "_key" : "2241300989", 
      "_rev" : "2241300989", 
      "population" : 1000, 
      "name" : "Metropolis" 
    } 
  } 
]

You can have both: the convenient representation of the result for your client and the flexibility of joins for your data model.

Many-To-Many

In the relational world, you need a third table to model the many-to-many relation. In ArangoDB, you have a choice depending on the information you are going to store and the type of questions you are going to ask.

Assume that authors are stored in one collection and books in a second. If all you need is “who are the authors of a book”, then you can easily model this as a list attribute in users.

If you want to store more information, for example, which author wrote which page in a conference proceeding, or if you also want to know “which books were written by which author”, you can use edge collections. This is very similar to the “join table” from the relational world.

Embedded Lists

If you only want to store the authors of a book, you can embed them as list in the book document. There is no need for a separate collection.

arangosh> db.authors.toArray()
[ 
  { 
    "_id" : "authors/2661190141", 
    "_key" : "2661190141", 
    "_rev" : "2661190141", 
    "name" : { 
      "first" : "Maxima", 
      "last" : "Musterfrau" 
    } 
  }, 
  { 
    "_id" : "authors/2658437629", 
    "_key" : "2658437629", 
    "_rev" : "2658437629", 
    "name" : { 
      "first" : "John", 
      "last" : "Doe" 
    } 
  } 
]

You can query books:

arangosh> db._query("FOR b IN books RETURN b").toArray();
[ 
  { 
    "_id" : "books/2681506301", 
    "_key" : "2681506301", 
    "_rev" : "2681506301", 
    "title" : "The beauty of JOINS", 
    "authors" : [ 
      "authors/2661190141", 
      "authors/2658437629" 
    ] 
  } 
]

And you can join the authors in a very similar manner given in the one-to-many section:

arangosh> db._query(
........>"FOR b IN books " +
........>"  LET a = (FOR x IN b.authors " + 
........>"             FOR a IN authors FILTER x == a._id RETURN a) " +
........>"   RETURN { book: b, authors: a }"
........>).toArray();
[ 
  { 
    "book" : { 
      "title" : "The beauty of JOINS", 
      "authors" : [ 
        "authors/2661190141", 
        "authors/2658437629" 
      ], 
      "_id" : "books/2681506301", 
      "_rev" : "2681506301", 
      "_key" : "2681506301" 
    }, 
    "authors" : [ 
      { 
        "name" : { 
          "first" : "Maxima", 
          "last" : "Musterfrau" 
        }, 
        "_id" : "authors/2661190141", 
        "_rev" : "2661190141", 
        "_key" : "2661190141" 
      }, 
      { 
        "name" : { 
          "first" : "John", 
          "last" : "Doe" 
        }, 
        "_id" : "authors/2658437629", 
        "_rev" : "2658437629", 
        "_key" : "2658437629" 
      } 
    ] 
  } 
]

Or you can embed the authors directly:

arangosh> db._query(
........>"FOR b IN books LET a = (" + 
........>"     FOR x IN b.authors " + 
........>"        FOR a IN authors FILTER x == a._id RETURN a)" +
........>"  RETURN merge(b, { authors: a })"
........>).toArray();
[ 
  { 
    "_id" : "books/2681506301", 
    "_key" : "2681506301", 
    "_rev" : "2681506301", 
    "title" : "The beauty of JOINS", 
    "authors" : [ 
      { 
        "_id" : "authors/2661190141", 
        "_key" : "2661190141", 
        "_rev" : "2661190141", 
        "name" : { 
          "first" : "Maxima", 
          "last" : "Musterfrau" 
        } 
      }, 
      { 
        "_id" : "authors/2658437629", 
        "_key" : "2658437629", 
        "_rev" : "2658437629", 
        "name" : { 
          "first" : "John", 
          "last" : "Doe" 
        } 
      } 
    ] 
  } 
]

Using Edge Collections

If you also want to query which books are written by a given author, embedding authors in the book document is possible, but it is more efficient to use a edge collections for speed.

Or you are publishing a proceeding, then you want to store the pages the author has written as well. This information can be stored in the edge document.

First off, create the users:

arangosh> db._create("authors");
[ArangoCollection 2926807549, "authors" (type document, status loaded)]
arangosh> db.authors.save({ name: { first: "John", last: "Doe" } })
{ 
  "error" : false, 
  "_id" : "authors/2935261693", 
  "_rev" : "2935261693", 
  "_key" : "2935261693" 
}
arangosh> db.authors.save({ name: { first: "Maxima", last: "Musterfrau" } })
{ 
  "error" : false, 
  "_id" : "authors/2938210813", 
  "_rev" : "2938210813", 
  "_key" : "2938210813" 
}

Now, create the books without any author information:

arangosh> db._create("books");
[ArangoCollection 2928380413, "books" (type document, status loaded)]
arangosh> db.books.save({ title: "The beauty of JOINS" });
{ 
  "error" : false, 
  "_id" : "books/2980088317", 
  "_rev" : "2980088317", 
  "_key" : "2980088317" 
}

An edge collection is now used to link authors and books:

arangosh> db._createEdgeCollection("written");
[ArangoCollection 2931132925, "written" (type edge, status loaded)]
arangosh> db.written.save("authors/2935261693",
........>"books/2980088317",
........>{ pages: "1-10" })
{ 
  "error" : false, 
  "_id" : "written/3006237181", 
  "_rev" : "3006237181", 
  "_key" : "3006237181" 
}
arangosh> db.written.save("authors/2938210813",
........>"books/2980088317",
........>{ pages: "11-20" })
{ 
  "error" : false, 
  "_id" : "written/3012856317", 
  "_rev" : "3012856317", 
  "_key" : "3012856317" 
}

In order to get all books with their authors, you can use a graph traversal:

arangosh> db._query(
...> "FOR b IN books " +
...> "LET authorsByBook = ( " +
...> "    FOR author, writtenBy IN INBOUND b written " +
...> "    RETURN { " +
...> "        vertex: author, " +
...> "        edge: writtenBy " +
...> "    } " +
...> ") " +
...> "RETURN { " +
...> "    book: b, " +
...> "    authors: authorsByBook " +
...> "} "
...> ).toArray();
[
  {
    "book" : {
      "_key" : "2980088317",
      "_id" : "books/2980088317",
      "_rev" : "2980088317",
      "title" : "The beauty of JOINS"
    },
    "authors" : [
      {
        "vertex" : {
          "_key" : "2935261693",
          "_id" : "authors/2935261693",
          "_rev" : "2935261693",
          "name" : {
            "first" : "John",
            "last" : "Doe"
          }
        },
        "edge" : {
          "_key" : "2935261693",
          "_id" : "written/2935261693",
          "_from" : "authors/2935261693",
          "_to" : "books/2980088317",
          "_rev" : "3006237181",
          "pages" : "1-10"
        }
      },
      {
        "vertex" : {
          "_key" : "2938210813",
          "_id" : "authors/2938210813",
          "_rev" : "2938210813",
          "name" : {
            "first" : "Maxima",
            "last" : "Musterfrau"
          }
        },
        "edge" : {
          "_key" : "6833274",
          "_id" : "written/6833274",
          "_from" : "authors/2938210813",
          "_to" : "books/2980088317",
          "_rev" : "3012856317",
          "pages" : "11-20"
        }
      }
    ]
  }
]

Or if you want only the information stored in the vertices, you can use this query:

arangosh> db._query(
...> "FOR b IN books " +
...> "LET authorsByBook = ( " +
...> "    FOR author IN INBOUND b written " +
...> "    OPTIONS { " +
...> "        order: 'bfs', " +
...> "        uniqueVertices: 'global' " +
...> "    } " +
...> "    RETURN author " +
...> ") " +
...> "RETURN { " +
...> "    book: b, " +
...> "    authors: authorsByBook " +
...> "} "
...> ).toArray();
[
  {
    "book" : {
      "_key" : "2980088317",
      "_id" : "books/2980088317",
      "_rev" : "2980088317",
      "title" : "The beauty of JOINS"
    },
    "authors" : [
      {
        "_key" : "2938210813",
        "_id" : "authors/2938210813",
        "_rev" : "2938210813",
        "name" : {
          "first" : "Maxima",
          "last" : "Musterfrau"
        }
      },
      {
        "_key" : "2935261693",
        "_id" : "authors/2935261693",
        "_rev" : "2935261693",
        "name" : {
          "first" : "John",
          "last" : "Doe"
        }
      }
    ]
  }
]

Or again embed the authors directly into the book document:

arangosh> db._query(
...> "FOR b IN books " +
...> "LET authors = ( " +
...> "    FOR author IN INBOUND b written " +
...> "    OPTIONS { " +
...> "        order: 'bfs', " +
...> "        uniqueVertices: 'global' " +
...> "    } " +
...> "    RETURN author " +
...> ") " +
...> "RETURN MERGE(b, {authors: authors}) "
...> ).toArray();
[
  {
    "_id" : "books/2980088317",
    "_key" : "2980088317",
    "_rev" : "2980088317",
    "title" : "The beauty of JOINS",
    "authors" : [
      {
        "_key" : "2938210813",
        "_id" : "authors/2938210813",
        "_rev" : "2938210813",
        "name" : {
          "first" : "Maxima",
          "last" : "Musterfrau"
        }
      },
      {
        "_key" : "2935261693",
        "_id" : "authors/2935261693",
        "_rev" : "2935261693",
        "name" : {
          "first" : "John",
          "last" : "Doe"
        }
      }
    ]
  }
]

If you need the authors and their books, simply reverse the direction:

> db._query(
...> "FOR a IN authors " +
...> "LET booksByAuthor = ( " +
...> "    FOR b IN OUTBOUND a written " +
...> "    OPTIONS { " +
...> "        order: 'bfs', " +
...> "        uniqueVertices: 'global' " +
...> "    } " +
...> "    RETURN b" +
...> ") " +
...> "RETURN MERGE(a, {books: booksByAuthor}) "
...> ).toArray();
[
  {
    "_id" : "authors/2935261693",
    "_key" : "2935261693",
    "_rev" : "2935261693",
    "name" : {
      "first" : "John",
      "last" : "Doe"
    },
    "books" : [
      {
        "_key" : "2980088317",
        "_id" : "books/2980088317",
        "_rev" : "2980088317",
        "title" : "The beauty of JOINS"
      }
    ]
  },
  {
    "_id" : "authors/2938210813",
    "_key" : "2938210813",
    "_rev" : "2938210813",
    "name" : {
      "first" : "Maxima",
      "last" : "Musterfrau"
    },
    "books" : [
      {
        "_key" : "2980088317",
        "_id" : "books/2980088317",
        "_rev" : "2980088317",
        "title" : "The beauty of JOINS"
      }
    ]
  }
]

More examples

Join tuples

We will start with a SQL-ish result set and return each tuple (user name, friends userId) separately. The AQL query to generate such result is:

FOR u IN users
  FILTER u.active == true
  LIMIT 0, 4
  FOR f IN relations
    FILTER f.type == @friend && f.friendOf == u.userId
    RETURN {
      "user" : u.name,
      "friendId" : f.thisUser
    }
Bind Parameters
{
  "friend": "friend"
}
Show output

We iterate over the collection users. Only the ‘active’ users will be examined. For each of these users we will search for up to 4 friends. We locate friends by comparing the userId of our current user with the friendOf attribute of the relations document. For each of those relations found we return the users name and the userId of the friend.

Horizontal lists

Note that in the above result, a user can be returned multiple times. This is the SQL way of returning data. If this is not desired, the friends’ ids of each user can be returned in a horizontal list. This will return each user at most once.

The AQL query for doing so is:

FOR u IN users
  FILTER u.active == true LIMIT 0, 4
  RETURN {
    "user" : u.name,
    "friendIds" : (
      FOR f IN relations
        FILTER f.friendOf == u.userId && f.type == "friend"
        RETURN f.thisUser
    )
  }
[
  {
    "user" : "Abigail",
    "friendIds" : [
      108,
      102,
      106
    ]
  },
  {
    "user" : "Fred",
    "friendIds" : [
      209
    ]
  },
  {
    "user" : "Mary",
    "friendIds" : [
      207,
      104
    ]
  },
  {
    "user" : "Mariah",
    "friendIds" : [
      203,
      205
    ]
  }
]

In this query we are still iterating over the users in the users collection and for each matching user we are executing a subquery to create the matching list of related users.

Self joins

To not only return friend ids but also the names of friends, we could “join” the users collection once more (something like a “self join”):

FOR u IN users
  FILTER u.active == true
  LIMIT 0, 4
  RETURN {
    "user" : u.name,
    "friendIds" : (
      FOR f IN relations
        FILTER f.friendOf == u.userId && f.type == "friend"
        FOR u2 IN users
          FILTER f.thisUser == u2.useId
          RETURN u2.name
    )
  }
[
  {
    "user" : "Abigail",
    "friendIds" : [
      "Jim",
      "Jacob",
      "Daniel"
    ]
  },
  {
    "user" : "Fred",
    "friendIds" : [
      "Mariah"
    ]
  },
  {
    "user" : "Mary",
    "friendIds" : [
      "Isabella",
      "Michael"
    ]
  },
  {
    "user" : "Mariah",
    "friendIds" : [
      "Madison",
      "Eva"
    ]
  }
]

This query will then again in term fetch the clear text name of the friend from the users collection. So here we iterate the users collection, and for each hit the relations collection, and for each hit once more the users collection.

Outer joins

Lets find the lonely people in our database - those without friends.

FOR user IN users
  LET friendList = (
    FOR f IN relations
      FILTER f.friendOf == u.userId
      RETURN 1
  )
  FILTER LENGTH(friendList) == 0
  RETURN { "user" : user.name }
[
  {
    "user" : "Abigail"
  },
  {
    "user" : "Fred"
  }
]

So, for each user we pick the list of their friends and count them. The ones where count equals zero are the lonely people. Using RETURN 1 in the subquery saves even more precious CPU cycles and gives the optimizer more alternatives.

Index usage

For joins in particular, you should make sure indexes can be utilized to speed up your queries.

Note that sparse indexes don’t qualify for joins. Often, You also want to join documents not containing the property you join with. However, sparse indexes don’t contain references to documents that don’t contain the indexed attributes - thus they would be missing from the join operation. For this reason, you should provide non-sparse indexes.

The join-index-nodes AQL optimizer rule automatically recognizes whether a better strategy for joining collections can be used if suitable indexes are present.

If two or more collections are joined using nested FOR loops and the attributes you join on are indexed by primary indexes or persistent indexes, then a merge join can be performed. This is possible because these indexes are always sorted.

Note that returning document attributes from the outer loop is limited to attributes covered by the index, or the improved join strategy cannot be used. The outer loop can be a different FOR operation in the execution plan than defined in your query.

The following example query shows an inner join between orders and users on user ID. Each document in the orders collection references a user, and the users collection stores the user ID in the _key attribute. The query returns the total attribute of every order along with the user information:

FOR o IN orders
  FOR u IN users
    FILTER o.user == u._key
    RETURN { orderTotal: o.total, user: u }

The _key attribute is covered by the primary index of the users collection. If the orders collection has a persistent index defined over the user attribute and additionally includes the total attribute in storedValues, then the query is eligible for a merge join. You can check the query explain output for JoinNode entries:

Execution plan:
 Id   NodeType          Par     Est.   Comment
  1   SingletonNode                1   * ROOT
 10   JoinNode               500000     - JOIN
 10   JoinNode                500000       - FOR o IN orders   LET #8 = o.`total`   /* index scan (projections: `total`) */
 10   JoinNode                     1       - FOR u IN users   /* index scan + document lookup */
  6   CalculationNode        500000     - LET #4 = { "orderTotal" : #8, "user" : u }   /* simple expression */   /* collections used: u : users */
  7   ReturnNode              500000     - RETURN #4

Indexes used:
 By   Name                      Type         Collection   Unique   Sparse   Cache   Selectivity   Fields       Stored values   Ranges
 10   idx_1784521139132825600   persistent   orders       false    false    false      100.00 %   [ `user` ]   [ `total` ]     *
 10   primary                   primary      users        true     false    false      100.00 %   [ `_key` ]   [  ]            (o.`user` == u.`_key`)

More complex merge joins are supported as well, for example, joining three or two-by-two collections, also with additional filters, sort and limit. Optimizations like early pruning and late materialization may get applied.

FOR doc1 IN coll1
  FOR doc2 IN coll2
    FOR doc3 IN coll3
      FOR doc4 IN coll4
      FILTER doc1.x == doc3.y AND doc3.z IN ["foo", "bar"]
      FILTER doc2.x == doc4.y
      SORT doc4.y
      LIMIT 20
      RETURN [doc1.x, doc2.x, doc3.y, doc4.y]

In exceptional cases, merge joins can be slower than the regular joining strategy. For example, if the values of two collections are interleaved, it can better to disable the join-index-nodes optimizer rule. For example, in arangosh:

db._query(
  `FOR o IN odd FOR e IN even FILTER o.val == e.val RETURN e.val`,
  { },
  { optimizer: { rules: ["-join-index-nodes"] } }
)

Pitfalls

Since we’re free of schemata, there is by default no way to tell the format of the documents. So, if your documents don’t contain an attribute, it defaults to null. We can however check our data for accuracy like this:

RETURN LENGTH(FOR u IN users FILTER u.userId == null RETURN 1)
[
  10000
]
RETURN LENGTH(FOR f IN relations FILTER f.friendOf == null RETURN 1)
[
  10000
]

So if the above queries return 10k matches each, the result of the Join tuples query will become 100,000,000 items larger and use much memory plus computation time. So it is generally a good idea to revalidate that the criteria for your join conditions exist.

Using indexes on the properties can speed up the operation significantly. You can use the explain helper to revalidate your query actually uses them.

If you work with joins on edge collections you would typically aggregate over the internal fields _id, _from and _to (where _id equals userId, _from friendOf and _to would be thisUser in our examples). ArangoDB implicitly creates indexes on them.