Manually Upgrading an Active Failover Deployment

This page describes how to manually upgrade an Active Failover setup. The nodes can be upgraded one at a time, without prolonged downtime of the entire system. The downtime of each individual node should also stay fairly low.

The manual upgrade procedure described in this section can be used to upgrade to a new hotfix version, or to perform an upgrade to a new minor version of ArangoDB. Please refer to the Upgrade Paths section for detailed information.

Preparations

The ArangoDB installation packages (e.g. for Debian or Ubuntu) set up a convenient standalone instance of arangod. During installation, this instance's database will be upgraded (see --database.auto-upgrade) and the service will be (re)started.

You have to make sure that your Active Failover deployment is independent of this standalone instance. Specifically, make sure that the database directory and the socket used by the standalone instance provided by the package are separate from the ones in your Active Failover configuration. Also make sure that you haven't modified the init script or systemd unit file for the standalone instance in a way that would make it start or stop your Active Failover instance instead.

Install the new ArangoDB version binary

The first step is to install the new ArangoDB package.

Note: you do not have to stop the Active Failover (arangod) processes before installing the new package.

For example, if you want to upgrade to 3.3.16 on Debian or Ubuntu, either call

$ apt install arangodb3=3.3.16-1

(use apt-get on older versions) if you have added the ArangoDB repository, or install a specific package using

$ dpkg -i arangodb3-3.3.16-1_amd64.deb

after you have downloaded the corresponding file from https://download.arangodb.com/.

Stop the Standalone Instance

The package automatically starts the standalone instance, which can cause confusion later. Since you start the Active Failover processes manually, you do not need this standalone instance and should stop it via:

$ service arangodb3 stop

You might also want to remove the standalone instance from the default runlevels to prevent it from starting when the machine reboots. How this is done depends on your distribution and init system. For example, on older Debian and Ubuntu systems using a SystemV-compatible init, you can use:

$ update-rc.d -f arangodb3 remove
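
On systemd-based distributions, the equivalent steps would typically go through systemctl. The following is a minimal sketch, not an official procedure: the function name disable_standalone is made up for illustration, the unit is assumed to be called arangodb3 like the service above, and the presence of the systemctl command is used as a rough proxy for systemd being the init system.

```shell
# Hypothetical helper: stop the packaged standalone instance and keep it
# from starting at boot. Assumes the unit/service is named "arangodb3".
disable_standalone() {
  if command -v systemctl >/dev/null 2>&1; then
    # systemd-based systems (assumption: systemctl present means systemd)
    systemctl stop arangodb3 &&
    systemctl disable arangodb3
  else
    # SystemV-compatible init, same commands as shown above
    service arangodb3 stop &&
    update-rc.d -f arangodb3 remove
  fi
}
```

Calling disable_standalone as root runs whichever pair of commands matches the system.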

Set supervision into maintenance mode

Important: Supervision maintenance mode is supported in ArangoDB versions 3.3.8 / 3.2.14 and higher.

You have two main choices when performing an upgrade of the Active Failover setup:

  • Upgrade with a leader-to-follower switch (with reduced downtime)
  • Upgrade without a leader-to-follower switch

Turning on maintenance mode enables the latter option. You might have a short downtime while the leader is upgraded, but there will be no potential loss of acknowledged operations.

Enabling maintenance mode essentially disables the Agency supervision for a limited amount of time during the upgrade procedure. The following API calls activate and deactivate the maintenance mode of the supervision. You can use curl to send them. The examples assume that an Active Failover node is running on localhost on port 7002.

Activate Maintenance mode

curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"on"'

For example:

curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"on"'

{"error":false,"warning":"Cluster supervision deactivated. It will be reactivated automatically in 60 minutes unless this call is repeated until then."}

Note: In case the manual upgrade takes longer than 60 minutes, the API call has to be resent.
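
For long upgrades, the call can be wrapped and repeated automatically. The sketch below is only an illustration: the function names set_maintenance and keep_maintenance_on, the default credentials "root:" and the 50-minute interval are assumptions chosen for this example, not part of ArangoDB. Only the endpoint and request body are taken from the calls shown on this page.

```shell
# Hypothetical helper: switch supervision maintenance mode on or off.
# $1 = base URL of an Active Failover node, e.g. http://localhost:7002
# $2 = on | off
# $3 = user:password (optional, defaults to "root:")
set_maintenance() {
  case "$2" in
    on|off) ;;
    *) echo "usage: set_maintenance <url> on|off [user:password]" >&2
       return 2 ;;
  esac
  curl -s -u "${3:-root:}" "$1/_admin/cluster/maintenance" -XPUT -d "\"$2\""
}

# Hypothetical helper: re-send "on" every 50 minutes (assumed interval) so
# that the 60-minute timeout does not expire. The loop ends only when curl
# itself fails, for example because the node is unreachable.
keep_maintenance_on() {
  while set_maintenance "$1" on "$2" >/dev/null; do
    sleep 3000
  done
}
```

One way to use this is to start keep_maintenance_on http://localhost:7002 in the background before the upgrade, stop it with kill when the upgrade is done, and then call set_maintenance http://localhost:7002 off.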

Deactivate Maintenance mode

Cluster supervision resumes automatically 60 minutes after it was disabled. It can be manually reactivated earlier at any point using the following API call:

curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"off"'

For example:

curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"off"'

{"error":false,"warning":"Cluster supervision reactivated."}

Upgrade the Active Failover processes

Now all the Active Failover (Agents, Single-Server) processes (arangod) have to be upgraded on each node.

Note: Please read the section regarding the maintenance mode above.

To stop an arangod process, send it the termination signal using a command like kill -15:

kill -15 <pid-of-arangod-process>

The pids of the arangod processes in your Active Failover setup can be looked up using a command like ps:

ps -C arangod -fww

The output of this command shows not only the process ids of all arangod processes but also the commands used to start them, which you will need when restarting the processes.

The output below is from a test machine where three Agents and two Single-Servers were running locally. In a more production-like scenario, you will find only one instance of each type running per machine:

ps -C arangod -fww
UID        PID  PPID  C STIME TTY          TIME CMD
max      29075  8072  0 13:50 pts/2    00:00:42 arangod --server.endpoint tcp://0.0.0.0:5001 --agency.my-address=tcp://127.0.0.1:5001 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a1 --javascript.app-path /tmp --database.directory agent1
max      29208  8072  2 13:51 pts/2    00:02:08 arangod --server.endpoint tcp://0.0.0.0:5002 --agency.my-address=tcp://127.0.0.1:5002 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a2 --javascript.app-path /tmp --database.directory agent2
max      29329 16224  0 13:51 pts/3    00:00:42 arangod --server.endpoint tcp://0.0.0.0:5003 --agency.my-address=tcp://127.0.0.1:5003 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a3 --javascript.app-path /tmp --database.directory agent3
max      29824 16224  1 13:55 pts/3    00:01:53 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7001 --cluster.my-address tcp://127.0.0.1:7001 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c1 --javascript.app-path /tmp --database.directory single1
max      29938 16224  2 13:56 pts/3    00:02:13 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7002 --cluster.my-address tcp://127.0.0.1:7002 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c2 --javascript.app-path /tmp --database.directory single2

Note: The start commands of Agent and Single Server are required for restarting the processes later.
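
Because the start commands disappear from the ps output once a process is stopped, it can help to save them to a file first. This is a minimal sketch assuming the procps version of ps found on most Linux systems; the function names save_arangod_commands and start_command_for are hypothetical.

```shell
# Hypothetical helper: write "<pid> <full command line>" for every running
# arangod process to the file given as $1 (no header line, no truncation).
save_arangod_commands() {
  ps -C arangod -o pid= -o args= -ww > "$1"
}

# Hypothetical helper: print the saved start command for one pid.
# $1 = pid, $2 = file written by save_arangod_commands
start_command_for() {
  awk -v pid="$1" '$1 == pid { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }' "$2"
}
```

For example, save_arangod_commands arangod-commands.txt followed by start_command_for 29075 arangod-commands.txt would print the command line of the process that had pid 29075 when the file was written.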

The recommended procedure for upgrading an Active Failover setup is to stop, upgrade and restart the arangod instances one by one on all participating servers, starting first with all Agent instances, and then following with the Active Failover instances themselves. When upgrading the Active Failover instances, the followers should be upgraded first.

To find out which nodes are followers, you can query the cluster endpoints API:

curl http://<single-server>:7002/_api/cluster/endpoints

This will yield a list of endpoints, the first of which is always the leader node.
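
The leader and the followers can be separated from that list with standard text tools. The sketch below assumes the response is a JSON object with an "endpoints" array whose entries look like {"endpoint":"tcp://..."}; this shape and the function names are assumptions for illustration, so check them against the actual output of your version.

```shell
# Hypothetical helper: read the JSON response from stdin and print one
# endpoint per line, in the order returned by the server.
list_endpoints() {
  grep -o '"endpoint"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/^"endpoint"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# The first endpoint in the list is the leader, all others are followers.
leader_endpoint()    { list_endpoints | sed -n '1p'; }
follower_endpoints() { list_endpoints | sed '1d'; }
```

For example, piping the curl call above (with -s) into follower_endpoints would list the nodes to upgrade first.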

Stopping, upgrading and restarting an instance

To stop an instance, the currently running process has to be identified using the ps command above.

Let's assume we are about to upgrade an Agent instance, so we have to look in the ps output for an agent instance first, and note its process id (pid) and start command.

The process can then be stopped using the following command:

kill -15 <pid-of-agent>

The instance then has to be upgraded using the same command that was used before (in the ps output), but with the additional option:

--database.auto-upgrade=true

After the upgrade procedure has finished successfully, the instance will remain stopped. It therefore has to be restarted using the command from the ps output (this time without the --database.auto-upgrade option).

Once an Agent has been upgraded and restarted successfully, repeat the procedure for the other Agent instances in the setup. Then repeat it for the Active Failover instances, starting with the followers.
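
The stop, upgrade and restart steps for a single instance can be sketched as one shell function. This is only an illustration under several assumptions: the name upgrade_instance is hypothetical, the function must run as the same user and from the same working directory as the original process (the example commands use relative paths), and no timeout is applied while waiting for shutdown.

```shell
# Hypothetical helper: upgrade one arangod instance.
# $1 = pid of the running process
# remaining arguments = its start command, exactly as in the ps output
upgrade_instance() {
  pid=$1
  shift

  # Ask the process to shut down, then wait until it is gone.
  kill -15 "$pid" || return 1
  while kill -0 "$pid" 2>/dev/null; do
    sleep 1
  done

  # Run the database upgrade; the instance stops again when it is done.
  "$@" --database.auto-upgrade=true || return 1

  # Restart with the original command, without the upgrade option.
  "$@" &
}
```

The function would be called once per instance with the pid and the full start command noted from the ps output, first for each Agent and then for the followers and the leader.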

Final words

If maintenance mode was enabled, the Agency supervision needs to be reactivated once all instances have been upgraded, by issuing the following API call to the leader:

curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"off"'
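
As a final check, each node can be asked for the version it is running. The sketch below assumes that every arangod instance answers GET /_api/version with a JSON object containing a "version" attribute; the function name node_version and the default credentials "root:" are assumptions for this example.

```shell
# Hypothetical helper: print the version reported by one node.
# $1 = base URL, e.g. http://localhost:7002
# $2 = user:password (optional, defaults to "root:")
node_version() {
  curl -s -u "${2:-root:}" "$1/_api/version" |
    grep -o '"version"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/'
}
```

Running node_version against every Agent and Single-Server endpoint should print the new version for each of them.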