Graph Analytics

ArangoGraph offers Graph Analytics Engines to run graph algorithms on your data separately from your ArangoDB deployments

Graph analytics is a branch of data science that deals with analyzing information networks known as graphs, and extracting information from the data relationships. It ranges from basic measures that characterize graphs, over PageRank, to complex algorithms. Common use cases include fraud detection, recommender systems, and network flow analysis.

ArangoDB offers a feature for running algorithms on your graph data, called Graph Analytics Engines (GAEs). It is available on request for the ArangoGraph Insights Platform .

Key features:

  • Separation of storage and compute: GAEs are a solution that lets you run graph analytics independent of your ArangoDB deployments on dedicated machines optimized for compute tasks. This separation of OLAP and OLTP workloads avoids affecting the performance of the transaction-oriented database systems.

  • Fast data loading: You can easily and efficiently import graph data from ArangoDB and export results back to ArangoDB.

  • In-memory processing: All imported data is held and processed in the main memory of the compute machines for very fast execution of graph algorithms such as connected components, label propagation, and PageRank.

Workflow

The following lists outlines how you can use Graph Analytics Engines (GAEs). How to perform the steps is detailed in the subsequent sections.

Before you can use Graph Analytics Engines, you need to request the feature via Request help in the ArangoGraph dashboard for a deployment.

The deployment needs to use AWS as the cloud provider.

Single server deployments using ArangoDB version 3.11 are not supported.

  1. Determine the approximate size of the data that you will load into the GAE to select an engine size with sufficient memory. The data as well as the temporarily needed space for computations and results needs to fit in memory.
  2. Deploy an engine of the desired size and of type gral. It only takes a few seconds until the engine can be used. The engine runs adjacent to a particular ArangoGraph deployment.
  3. Load graph data from ArangoDB into the engine. You can load named graphs or sets of vertex and edge collections. This loads the edge information and a configurable subset of the vertex attributes.
  4. Run graph algorithms on the data. You only need to load the data once per engine and can then run various algorithms with different settings.
  5. Write the computation results back to ArangoDB.
  6. Delete the engine once you are done.

Authentication

The Management API for deploying and deleting engines requires an ArangoGraph API key. See Generating an API Key on how to create one.

You then need to generate an access token using the API key. See Authenticating with Oasisctl on how to do so using oasisctl login.

The Engine API uses one of two authentication methods, depending on the auto login to database UI setting in ArangoGraph:

  • Enabled: You can use an ArangoGraph access token created with an API key (see above), allowing you to use one token for both the Management API and the Engine API.
  • Disabled: You need use a JWT user token created from ArangoDB credentials. These session tokens need to be renewed every hour by default. See HTTP API Authentication for details.

Management API

You can save an ArangoGraph access token created with oasisctl login in a variable to ease scripting. Note that this should be the token string only and not include quote marks. The following examples assume Bash as the shell and that the curl and jq commands are available.

ARANGO_GRAPH_TOKEN="$(oasisctl login --key-id "<AG_KEY_ID>" --key-secret "<AG_KEY_SECRET>")"

To determine the base URL of the management API, use the ArangoGraph dashboard and copy the APPLICATION ENDPOINT of the deployment that holds the graph data you want to analyze. Replace the port with 8829 and append /graph-analytics/api/graphanalytics/v1, e.g. https://123456abcdef.arangodb.cloud:8829/graph-analytics/api/graphanalytics/v1.

Store the base URL in a variable called BASE_URL:

BASE_URL='https://...'

To authenticate requests, you need to use the following HTTP header:

Authorization: bearer <ARANGO_GRAPH_TOKEN>

For example, with cURL and using the token variable:

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/api-version"

Request and response payloads are JSON-encoded in the management API.

Get the API version

GET <BASE_URL>/api-version

Retrieve the version information of the management API.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/api-version"

List engine sizes

GET <BASE_URL>/enginesizes

List the available engine sizes, which is a combination of the number of cores and the size of the RAM, starting at 1 CPU and 4 GiB of memory (e4).

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/enginesizes"

List engine types

GET <BASE_URL>/enginetypes

List the available engine types. The only type supported for GAE workloads is called gral.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/enginetypes"

Deploy an engine

POST <BASE_URL>/engines

Set up a GAE adjacent to the ArangoGraph deployment, for example, using an engine size of e4.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"type_id":"gral","size_id":"e4"}' "$BASE_URL/engines"

List all engines

GET <BASE_URL>/engines

List all deployed GAEs of a ArangoGraph deployment.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/engines"

Get an engine

GET <BASE_URL>/engines/<ENGINE_ID>

List the detailed information about a specific GAE.

ENGINE_ID="zYxWvU9876"
curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/engines/$ENGINE_ID"

Delete an engine

DELETE <BASE_URL>/engines/<ENGINE_ID>

Delete a no longer needed GAE, freeing any data it holds in memory.

ENGINE_ID="zYxWvU9876"
curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X DELETE "$BASE_URL/engines/$ENGINE_ID"

Engine API

To determine the base URL of the engine API, use the ArangoGraph dashboard and copy the APPLICATION ENDPOINT of the deployment that holds the graph data you want to analyze. Replace the port with 8829 and append /graph-analytics/engines/<ENGINE_ID>, e.g. https://<123456abcdef>.arangodb.cloud:8829/graph-analytics/engines/zYxWvU9876.

Store the base URL in a variable called ENGINE_URL:

ENGINE_URL='https://...'

To authenticate requests, you need to use a bearer token in HTTP header:

Authorization: bearer <TOKEN>
  • If Auto login to database UI is enabled for the ArangoGraph deployment, this can be the same access token as used for the management API.
  • If it is disabled, use an ArangoDB session token (JWT user token) instead.

You can save the token in a variable to ease scripting. Note that this should be the token string only and not include quote marks. The following examples assume Bash as the shell and that the curl and jq commands are available.

An example of authenticating a request using cURL and a session token:

APPLICATION_ENDPOINT="https://123456abcdef.arangodb.cloud:8529"

ADB_TOKEN=$(curl -X POST -d "{\"username\":\"<ADB_USER>\",\"password\":\"<ADB_PASS>\"}" "$APPLICATION_ENDPOINT/_open/auth" | jq -r '.jwt')

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs"

All requests to the engine API start jobs, each representing an operation. You can check the progress of operations and check if errors occurred. You can submit jobs concurrently and they also run concurrently.

You can find the API reference documentation with detailed descriptions of the request and response data structures at https://arangodb.github.io/graph-analytics .

Request and response payloads are JSON-encoded in the engine API.

Load data

POST <ENGINE_URL>/v1/loaddata

Import graph data from a database of the ArangoDB deployment. You can import named graphs as well as sets of vertex and edge collections (see Managed and unmanaged graphs).

curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d '{"database":"_system","graph_name":"connectedComponentsGraph"}' "$ENGINE_URL/v1/loaddata"

Run algorithms

PageRank

POST <ENGINE_URL>/v1/pagerank

PageRank is a well known algorithm to rank vertices in a graph: the more important a vertex, the higher rank it gets. It goes back to L. Page and S. Brin’s paper  and is used to rank pages in in search engines (hence the name). The algorithm runs until the execution converges. To run for a fixed number of iterations, use the maximum_supersteps parameter.

The rank of a vertex is a positive real number. The algorithm starts with every vertex having the same rank (one divided by the number of vertices) and sends its rank to its out-neighbors. The computation proceeds in iterations. In each iteration, the new rank is computed according to the formula ( (1 - damping_factor) / total number of vertices) + (damping_factor * the sum of all incoming ranks). The value sent to each of the out-neighbors is the new rank divided by the number of those neighbors, thus every out-neighbor gets the same part of the new rank.

The algorithm stops when at least one of the two conditions is satisfied:

  • The maximum number of iterations is reached. This is the same maximum_supersteps parameter as for the other algorithms.
  • Every vertex changes its rank in the last iteration by less than a certain threshold. The threshold is hardcoded to 0.0000001.

It is possible to specify an initial distribution for the vertex documents in your graph. To define these seed ranks / centralities, you can specify a seeding_attribute in the properties for this algorithm. If the specified field is set on a document and the value is numeric, then it is used instead of the default initial rank of 1 / numVertices.

Parameters:

  • graph_id
  • damping_factor
  • maximum_supersteps
  • seeding_attribute (optional, for seeded PageRank)

Result: the rank of each vertex

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"damping_factor\":0.85,\"maximum_supersteps\":500,\"seeding_attribute\":\"seed_attr\"}" "$ENGINE_URL/v1/pagerank"

Weakly Connected Components (WCC)

POST <ENGINE_URL>/v1/wcc

The weakly connected component algorithm partitions a graph into maximal groups of vertices, so that within a group, all vertices are reachable from each vertex by following the edges, ignoring their direction.

In other words, each weakly connected component is a maximal subgraph such that there is a path between each pair of vertices where one can also follow edges against their direction in a directed graph.

Parameters:

  • graph_id

Result: a component ID for each vertex. All vertices from the same component obtain the same component ID, every two vertices from different components obtain different IDs.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID}" "$ENGINE_URL/v1/wcc"

Strongly Connected Components (SCC)

POST <ENGINE_URL>/v1/scc

The strongly connected components algorithm partitions a graph into maximal groups of vertices, so that within a group, all vertices are reachable from each vertex by following the edges in their direction.

In other words, a strongly connected component is a maximal subgraph, where for every two vertices, there is a path from one of them to the other, forming a cycle. In contrast to a weakly connected component, one cannot follow edges against their direction.

Parameters:

  • graph_id

Result: a component ID for each vertex. All vertices from the same component obtain the same component ID, every two vertices from different components obtain different IDs.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID}" "$ENGINE_URL/v1/scc"

Vertex Centrality

Centrality measures help identify the most important vertices in a graph. They can be used in a wide range of applications: to identify influencers in social networks, or middlemen in terrorist networks.

There are various definitions for centrality, the simplest one being the vertex degree. These definitions were not designed with scalability in mind. It is probably impossible to discover an efficient algorithm which computes them in a distributed way. Fortunately there are scalable substitutions available, which should be equally usable for most use cases.

Illustration of an execution of different centrality measures (Freeman 1977)

LineRank

POST <ENGINE_URL>/v1/linerank

Another common measure is the betweenness centrality : It measures the number of times a vertex is part of shortest paths between any pairs of vertices. For a vertex v betweenness is defined as:

Vertex Betweenness Formula

Where the σ represents the number of shortest paths between x and y, and σ(v) represents the number of paths also passing through a vertex v. By intuition a vertex with higher betweenness centrality has more information passing through it.

LineRank approximates the random walk betweenness of every vertex in a graph. This is the probability that someone, starting on an arbitrary vertex, visits this node when they randomly choose edges to visit.

The algorithm essentially builds a line graph out of your graph (switches the vertices and edges), and then computes a score similar to PageRank. This can be considered a scalable equivalent to vertex betweenness, which can be executed distributedly in ArangoDB. The algorithm is from the paper Centralities in Large Networks: Algorithms and Observations (U Kang et.al. 2011).

Parameters:

  • graph_id
  • damping_factor
  • maximum_supersteps

Result: the line rank of each vertex

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"damping_factor\":0.0000001,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/linerank"

Community Detection

Graphs based on real world networks often have a community structure. This means it is possible to find groups of vertices such that each vertex group is internally more densely connected than outside the group. This has many applications when you want to analyze your networks, for example Social networks include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc.

Label Propagation

POST <ENGINE_URL>/v1/labelpropagation

Label Propagation  can be used to implement community detection on large graphs.

The algorithm assigns an initial community identifier to every vertex in the graph using a user-defined attribute. The idea is that each vertex should be in the community that most of its neighbors are in at the end of the computation.

In each iteration of the computation, a vertex sends its current community ID to all its neighbor vertices, inbound and outbound (ignoring edge directions). After that, each vertex adopts the community ID it received most frequently in the last step.

It can happen that a vertex receives multiple most frequent community IDs. In this case, one is chosen either randomly or using a deterministic choice depending on a setting for the algorithm. The rules for a deterministic tiebreak are as follows:

  • If a vertex obtains only one community ID and the ID of the vertex from the previous step, its old ID, is less than the obtained ID, the old ID is kept.
  • If a vertex obtains more than one ID, its new ID is the lowest ID among the most frequently obtained IDs. For example, if the initial IDs are numbers and the obtained IDs are 1, 2, 2, 3, 3, then 2 is the new ID.
  • If, however, no ID arrives more than once, the new ID is the minimum of the lowest obtained IDs and the old ID. For example, if the old ID is 5 and the obtained IDs are 3, 4, 6, then the new ID is 3. If the old ID is 2, it is kept.

The algorithm runs until it converges or reaches the maximum iteration bound. It may not converge on large graphs if the synchronous variant is used.

  • Synchronous: The new community ID of a vertex is based on the community IDs of its neighbors from the previous iteration. With (nearly) bipartite  subgraphs, this may lead to the community IDs changing back and forth in each iteration within the two halves of the subgraph.
  • Asynchronous: A vertex determines the new community ID using the most up-to-date community IDs of its neighbors, whether those updates occurred in the current iteration or the previous one. The order in which vertices are updated in each iteration is chosen randomly. This leads to more stable community IDs.

Parameters:

  • graph_id
  • start_label_attribute
  • synchronous
  • random_tiebreak
  • maximum_supersteps

Result: a community ID for each vertex

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"start_label_attribute\":\"start_attr\",\"synchronous\":false,\"random_tiebreak\":false,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/labelpropagation"
Attribute Propagation

POST <ENGINE_URL>/v1/attributepropagation

The attribute propagation algorithm can be used to implement community detection. It works similar to the label propagation algorithm, but every node additionally accumulates a memory of observed labels instead of forgetting all but one label.

The algorithm assigns an initial value to every vertex in the graph using a user-defined attribute. The attribute value can be a list of strings to initialize the set of labels with multiple labels.

In each iteration of the computation, the following steps are executed:

  1. Each vertex propagates its set of labels along the edges to all direct neighbor vertices. Whether inbound or outbound edges are followed depends on an algorithm setting.
  2. Each vertex adds the labels it receives to its own set of labels.

After a specified maximal number of iterations or if no label set changes any more, the algorithm stops.

If there are many labels and the graph is well-connected, the result set can be very large.

Parameters:

  • graph_id
  • start_label_attribute: The attribute to initialize labels with. Use "@id" to use the document IDs of the vertices.
  • synchronous: Whether synchronous or asynchronous label propagation is used.
  • backwards: Whether labels are propagated in edge direction (false) or the opposite direction (true).
  • maximum_supersteps: Maximum number of iterations.

Result: The set of accumulated labels of each vertex.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"start_label_attribute\":\"start_attr\",\"synchronous\":false,\"backwards\":false,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/attributepropagation"

Store job results

POST <ENGINE_URL>/v1/storeresults

You need to specify to which ArangoDB database and target_collection to save the results to. They need to exist already.

You also need to specify a list of job_ids with one or more jobs that have run graph algorithms.

Each algorithm outputs one value for each vertex, and you can define the target attribute to store the information in with attribute_names. It has to be a list with one attribute name for every job in the job_ids list.

You can optionally set the degree of parallelism and the batch_size for saving the data.

Parameters:

  • database
  • target_collection
  • job_ids
  • attribute_names
  • parallelism
  • batch_size
JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" -X POST -d "{\"database\":\"_system\",\"target_collection\":\"coll\",\"job_ids\":[$JOB_ID],\"attribute_names\":[\"attr\"]}" "$ENGINE_URL/v1/storeresults"

List all jobs

GET <ENGINE_URL>/v1/jobs

List all active and finished jobs.

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs"

Get a job

GET <ENGINE_URL>/v1/jobs/<JOB_ID>

Get detailed information about a specific job.

JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs/$JOB_ID"

Delete a job

DELETE <ENGINE_URL>/v1/jobs/<JOB_ID>

Delete a specific job.

JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" -X DELETE "$ENGINE_URL/v1/jobs/$JOB_ID"

List all graphs

GET <ENGINE_URL>/v1/graphs

List all loaded sets of graph data that reside in the memory of the engine node.

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/graphs"

Get a graph

GET <ENGINE_URL>/v1/graphs/<GRAPH_ID>

Get detailed information about a specific set of graph data.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/graphs/$GRAPH_ID"

Delete a graph

DELETE <ENGINE_URL>/v1/graphs/<GRAPH_ID>

Delete a specific set of graph data, removing it from the memory of the engine node.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -X DELETE "$ENGINE_URL/v1/graphs/$GRAPH_ID"