Metrics
You can use ArangoDB server metrics to monitor the health and performance of the system.
arangod exports metrics in the Prometheus format. The thresholds for alerts are described for relevant metrics.
Whether the /_admin/metrics* endpoints are available depends on the setting of the --server.export-metrics-api startup option.
For additional document read and write metrics, the --server.export-read-write-metrics startup option needs to be enabled.
Metrics API v2
Get the metrics
Returns the instance’s current metrics in Prometheus format. The returned document collects all instance metrics as measured at the time of the request and exposes them for collection by Prometheus.
The document contains different metrics and metric groups depending on the role of the queried instance. All exported metrics are published with the prefix arangodb_ or rocksdb_ to distinguish them from other collected data.
The API then needs to be added to the Prometheus configuration file for collection.
Examples
curl --header 'accept: application/json' --dump - 'http://localhost:8529/_admin/metrics/v2'
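As a rough illustration of what a consumer of this endpoint does with the response, the following Python sketch parses Prometheus text output and keeps only the samples with the arangodb_ or rocksdb_ prefix. The sample payload, the rocksdb_example_metric name, and its label are made up for the example; real output differs by instance role.

```python
import re

# Matches `name{labels} value` or `name value`; labels are kept as raw text.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')

def parse_metrics(text):
    """Return (name, raw_labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples

def by_prefix(samples, prefixes=("arangodb_", "rocksdb_")):
    """Keep only samples whose metric name starts with one of the prefixes."""
    return [s for s in samples if s[0].startswith(prefixes)]

# Hypothetical payload, for illustration only.
EXAMPLE = """\
# HELP arangodb_aql_current_query Current number of AQL queries executing
# TYPE arangodb_aql_current_query gauge
arangodb_aql_current_query 3
rocksdb_example_metric{role="hypothetical"} 12.5
process_cpu_seconds_total 1.0
"""

if __name__ == "__main__":
    for name, labels, value in by_prefix(parse_metrics(EXAMPLE)):
        print(name, labels, value)
```

In practice, Prometheus itself performs this parsing when it scrapes the endpoint; a hand-written parser like this is mainly useful for ad-hoc checks.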
AQL
Total number of AQL queries finished
arangodb_aql_all_query_total
Total number of AQL queries finished.
This metric was named arangodb_aql_all_query in previous versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_aql_all_query
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Current number of AQL queries executing
arangodb_aql_current_query
Current number of AQL queries executing.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total memory limit for all AQL queries combined
arangodb_aql_global_memory_limit
Total memory limit for all AQL queries combined, in bytes.
If this value is reported as 0, it means there is no total memory limit in place for AQL queries. The value can be adjusted by setting the --query.global-memory-limit startup option.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total memory usage of all AQL queries executing; granularity: 32768 bytes steps
arangodb_aql_global_memory_usage
Total memory usage of all AQL queries currently executing.
The granularity of this metric is steps of 32768 bytes. The current memory usage of all AQL queries is compared against the limit configured in the --query.global-memory-limit startup option.
If the startup option has a value of 0, then no global memory limit is enforced. If the startup option has a non-zero value, queries are aborted once the total query memory usage goes above the configured limit.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
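The relationship between arangodb_aql_global_memory_usage and arangodb_aql_global_memory_limit can be sketched as follows. This is a minimal example that assumes both gauge values have already been read from the metrics output; the function name is hypothetical.

```python
def global_memory_fraction(usage_bytes, limit_bytes):
    """Return the share of the global AQL memory limit currently in use.

    Returns None when the limit is 0, which means that no global
    memory limit is configured.
    """
    if limit_bytes == 0:
        return None
    return usage_bytes / limit_bytes

if __name__ == "__main__":
    # Example values only: 2 steps of 32768 bytes used out of 8 allowed.
    print(global_memory_fraction(2 * 32768, 8 * 32768))
```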
Number of times the global query memory limit threshold was reached
arangodb_aql_global_query_memory_limit_reached_total
Total number of times the global query memory limit threshold was reached.
This can happen if all running AQL queries in total try to use more memory than configured via the --query.global-memory-limit startup option.
Every time this counter increases, an AQL query was aborted with a “resource limit exceeded” error.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
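Because each increase of this counter corresponds to one aborted query, the number of aborts in a time window is the counter's increase over that window. The sketch below assumes that the counter starts again from zero when the server restarts, as is usual for in-process counters; the function name and the sample values are made up.

```python
def counter_increase(samples):
    """Sum the increase of a counter over consecutive scrapes.

    A drop between two scrapes is treated as a restart, after which
    the counter is assumed to have started again from zero.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter was reset between the two scrapes
    return total

if __name__ == "__main__":
    # Hypothetical scrape values, with a restart between 12 and 1.
    print(counter_increase([10, 12, 1, 3]))
```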
Number of times a local query memory limit threshold was reached
arangodb_aql_local_query_memory_limit_reached_total
Total number of times a local query memory limit threshold was reached, i.e. a single query tried to allocate more memory than configured in the query’s memoryLimit attribute or the value configured via the startup option --query.memory-limit.
Every time this counter increases, an AQL query was aborted with a “resource limit exceeded” error.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Execution time histogram for all AQL queries
arangodb_aql_query_time (basename)
arangodb_aql_query_time_bucket
arangodb_aql_query_time_sum
arangodb_aql_query_time_count
Execution time histogram for all AQL queries, in seconds. The histogram includes all slow queries.
Introduced in: v3.6.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
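The _bucket, _sum and _count series of a histogram can be combined into summary figures. The following sketch assumes that bucket counts are cumulative, as in the standard Prometheus histogram format, and that each bucket is identified by its upper bound; the bucket boundaries and values are invented for illustration.

```python
def histogram_mean(total_sum, total_count):
    """Mean observation, computed from the _sum and _count series."""
    return total_sum / total_count if total_count else 0.0

def quantile_upper_bound(buckets, q):
    """Return the upper bound of the bucket that contains quantile q.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted
    by upper bound, with float("inf") as the last bound.
    """
    total = buckets[-1][1]
    target = q * total
    for upper, cumulative in buckets:
        if cumulative >= target:
            return upper
    return float("inf")

if __name__ == "__main__":
    # Hypothetical query time buckets in seconds.
    buckets = [(0.1, 50), (1.0, 90), (10.0, 99), (float("inf"), 100)]
    print(histogram_mean(120.0, 100))
    print(quantile_upper_bound(buckets, 0.95))
```

The result is only as precise as the bucket boundaries allow: the function reports the bound of the bucket, not an interpolated value.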
Execution time histogram for slow AQL queries
arangodb_aql_slow_query_time (basename)
arangodb_aql_slow_query_time_bucket
arangodb_aql_slow_query_time_sum
arangodb_aql_slow_query_time_count
Execution time histogram for slow AQL queries, in seconds.
Queries are considered “slow” if their execution time is above the threshold configured in the startup options --query.slow-threshold or --query.slow-streaming-threshold, respectively.
Introduced in: v3.6.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total execution time of all AQL queries
arangodb_aql_total_query_time_msec_total
Total execution time of all AQL queries, in milliseconds, including all slow queries.
Introduced in: v3.8.0
Renamed from: arangodb_aql_total_query_time_msec
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of AQL queries which have been executed with dirty reads
arangodb_dirty_read_queries_total
This counter exposes the number of AQL queries which have been executed with “dirty reads”. A dirty read is one which may also use follower shards and not only leader shards. Note that the transaction in whose context the AQL query runs determines whether dirty reads are allowed.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators |
Agency
Agency RAFT follower append time histogram
arangodb_agency_append_hist (basename)
arangodb_agency_append_hist_bucket
arangodb_agency_append_hist_sum
arangodb_agency_append_hist_count
This measures the time an Agency follower needs for individual append operations resulting from AppendEntriesRPC requests.
Every event contributes a measurement to the histogram, which
also exposes the number of events and the total sum of all
measurements.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Agents |
Threshold: Normally these times should be clearly sub-second.
Troubleshoot: If you see delays here, the Agents might not have enough IO bandwidth or might be overloaded. Try to provision more IOPS or more CPU capacity, potentially moving Agents to separate machines.
Current number of entries in Agency cache callbacks table
arangodb_agency_cache_callback_number
This reflects the current number of callbacks the local AgencyCache has registered.
This metric was named arangodb_agency_cache_callback_count in previous versions of ArangoDB.
Note that on single servers this metric only has a non-zero value
in the Active Failover deployment mode.
Introduced in: v3.8.0
Renamed from: arangodb_agency_cache_callback_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Coordinators and Single Servers |
Threshold: This number is usually very low, something between 10 and 20.
Troubleshoot: If this number is considerably higher, this should be investigated. Please contact support.
Current number of Agency callbacks registered
arangodb_agency_callback_number
This metric reflects the current number of Agency callbacks being
registered, including Agency cache callbacks.
This metric was named arangodb_agency_callback_count in previous versions of ArangoDB.
Note that on single servers this metric only has a non-zero value
in the Active Failover deployment mode.
Introduced in: v3.8.0
Renamed from: arangodb_agency_callback_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | Coordinators, DB-Servers and Single Servers |
Threshold: This number is usually low, between 10 and 20. It can temporarily increase while there are ongoing DDL operations in the cluster. The number should go down again once the DDL operations have finished.
Troubleshoot: If this number is considerably higher, this should be investigated. Please contact support.
Total number of Agency callbacks ever registered
arangodb_agency_callback_registered_total
This metric was named arangodb_agency_callback_registered in previous versions of ArangoDB.
Note that on single servers this metric only has a non-zero value
in the Active Failover deployment mode.
Introduced in: v3.8.0
Renamed from: arangodb_agency_callback_registered
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers and Single Servers |
Current number of entries in Agency client id lookup table
arangodb_agency_client_lookup_table_size
Current number of entries in Agency client id lookup table. The lookup table is used internally for Agency inquire operations and should be compacted at the same time when the Agency’s in-memory log is compacted.
Introduced in: v3.6.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | Agents |
Agency RAFT commit histogram
arangodb_agency_commit_hist (basename)
arangodb_agency_commit_hist_bucket
arangodb_agency_commit_hist_sum
arangodb_agency_commit_hist_count
Agency RAFT commit time histogram. Provides a distribution of commit times for all Agency write operations.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Agents |
Agency compaction time histogram
arangodb_agency_compaction_hist (basename)
arangodb_agency_compaction_hist_bucket
arangodb_agency_compaction_hist_sum
arangodb_agency_compaction_hist_count
Agency compaction time histogram. Provides a distribution of Agency compaction run times. Compactions are triggered after --agency.compaction-keep-size entries have accumulated in the RAFT log.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Agents |
Troubleshoot: If compaction takes too long, it may be useful to reduce the number of log entries to keep in --agency.compaction-keep-size.
This Agent’s commit index
arangodb_agency_local_commit_index
This Agent’s commit index (i.e. the index until it has advanced in the Agency’s RAFT protocol).
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Agents |
Agency replicated log size
arangodb_agency_log_size_bytes
Size of the Agency’s in-memory part of replicated log in bytes.
The replicated log grows in memory until a certain number of log entries have been accumulated. Then the in-memory log is compacted. The number of in-memory log entries to keep before log compaction kicks in can be controlled via the startup option --agency.compaction-keep-size.
Introduced in: v3.6.9
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Agents |
Agency read operations with no leader or on followers
arangodb_agency_read_no_leader_total
Total number of Agency read operations with no leader or on followers.
Introduced in: v3.8.0
Renamed from: arangodb_agency_read_no_leader
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents |
Threshold: This should normally not happen. If it happens regularly, the Agency is reelecting its leader often.
Troubleshoot: The latency of the network between the Agents might be too high or the Agents may be overloaded. It might help to move Agent instances to separate machines.
Number of successful Agency read operations
arangodb_agency_read_ok_total
Number of Agency read operations which were successful (i.e. completed without any error). Successful reads can only be executed on the leader, so this metric is supposed to increase only on Agency leaders, but not on followers. Read requests that are executed on followers are rejected and can be tracked via the metric arangodb_agency_read_no_leader_total.
This metric was named arangodb_agency_read_ok in previous versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_agency_read_ok
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents |
Counter for FailedServer jobs
arangodb_agency_supervision_failed_server_total
Counter for FailedServer jobs. This counter is increased whenever a
supervision run encounters a failed server and starts a FailedServer job.
This metric was named arangodb_agency_supervision_failed_server_count
in previous versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_agency_supervision_failed_server_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Agents |
Threshold: Many FailedServer jobs indicate frequent failures of DB-Servers. This is generally not good.
Troubleshoot: Find the root cause of server failures. Overload and bad network latency can lead to misdetected server failures.
Agency supervision runtime histogram
arangodb_agency_supervision_runtime_msec (basename)
arangodb_agency_supervision_runtime_msec_bucket
arangodb_agency_supervision_runtime_msec_sum
arangodb_agency_supervision_runtime_msec_count
Agency supervision runtime histogram. A new value is recorded for each run of the supervision.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | simple | Agents |
Threshold: The supervision runtime goes up linearly with the number of collections and shards.
Agency supervision wait for replication time
arangodb_agency_supervision_runtime_wait_for_replication_msec (basename)
arangodb_agency_supervision_runtime_wait_for_replication_msec_bucket
arangodb_agency_supervision_runtime_wait_for_replication_msec_sum
arangodb_agency_supervision_runtime_wait_for_replication_msec_count
Agency supervision replication time histogram. Whenever the Agency supervision carries out changes, it writes them to the leader’s log and replicates the changes to followers. This metric provides a histogram of the time it took to replicate the supervision changes to followers.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Agents |
Agency’s term
arangodb_agency_term
The Agency’s current term.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Agents |
Threshold: This number should usually not grow. If it does, the Agency is doing repeated reelections, which suggests overload or bad network latency between Agents.
Troubleshoot: It might help to reduce network latency between Agents or move Agent instances to separate machines.
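Since the term is a gauge that should stay flat in a healthy Agency, repeated reelections show up as steps in a series of scraped values. A minimal sketch for counting such steps follows, assuming the values of arangodb_agency_term have already been collected in scrape order; the function name and sample values are hypothetical.

```python
def count_increases(samples):
    """Count how often a series rose between two consecutive scrapes."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)

if __name__ == "__main__":
    # Hypothetical term values from six scrapes: two reelection steps.
    print(count_increases([5, 5, 5, 6, 6, 8]))
```

A step can cover more than one election if several happened between two scrapes, so the count is a lower bound.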
Agency write time histogram
arangodb_agency_write_hist (basename)
arangodb_agency_write_hist_bucket
arangodb_agency_write_hist_sum
arangodb_agency_write_hist_count
Agency write time histogram. This histogram provides the distribution of the times spent in Agency write operations, in milliseconds. This only includes the time required to write the data into the leader’s log, but does not include the time required to replicate the writes to the followers.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Agents |
Agency write operations with no leader or on followers
arangodb_agency_write_no_leader_total
Total number of Agency write operations with no leader or on followers.
Introduced in: v3.8.0
Renamed from: arangodb_agency_write_no_leader
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Agents |
Threshold: This should normally not happen. If it happens regularly, the Agency is reelecting its leader often.
Troubleshoot: The latency of the network between the Agents might be too high or the Agents may be overloaded. It might help to move Agent instances to separate machines.
Number of successful Agency write operations
arangodb_agency_write_ok_total
Number of Agency write operations which were successful (i.e. completed without any error). Successful writes can only be executed on the leader, so this metric is supposed to increase only on Agency leaders, but not on followers. Write requests that are executed on followers are rejected and can be tracked via the metric arangodb_agency_write_no_leader_total.
This metric was named arangodb_agency_write_ok in previous versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_agency_write_ok
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents |
ArangoSearch
Average time of the last few cleanups
arangodb_search_cleanup_time
Average time of the last few cleanups.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ms | advanced | DB-Servers and Single Servers |
ArangoSearch columns cache usage
arangodb_search_columns_cache_size
Size of all ArangoSearch columns currently loaded into the cache.
Introduced in: v3.9.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers and Single Servers |
Troubleshoot: If this metric contains a value close to the configured --arangosearch.columns-cache-limit, there might be columns that are marked to be cached but do not fit into the cache. That may result in query performance degradation. Check the log for the pattern “Failed to allocate memory for buffered column”.
Average time of the last few commits
arangodb_search_commit_time
Average time of the last few commits.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ms | advanced | DB-Servers and Single Servers |
Average time of the last few consolidations
arangodb_search_consolidation_time
Average time of the last few consolidations.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ms | advanced | DB-Servers and Single Servers |
Number of ArangoSearch parallel execution threads requested by queries
arangodb_search_execution_threads_demand
Number of ArangoSearch parallel execution threads requested by queries.
Introduced in: v3.11.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Troubleshoot: If this metric contains a value lower than the configured --arangosearch.execution-threads-limit (number of cores * 2 by default), then there are enough threads for running queries with parallel execution. The value of the metric represents the number of currently used threads. If the value is greater than --arangosearch.execution-threads-limit, some queries currently cannot get enough threads to achieve the requested parallelism. In that state, queries run with reduced parallelism, down to single-threaded execution, and query performance might degrade.
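The comparison described above can be expressed as a small check. This sketch assumes that the metric value and the configured limit are already known; the function names and return strings are hypothetical, and the case where demand equals the limit is not covered by the description, so it is reported separately.

```python
def default_thread_limit(cores):
    """Default of --arangosearch.execution-threads-limit: number of cores * 2."""
    return cores * 2

def search_thread_state(demand, limit):
    """Classify the thread demand against the execution threads limit."""
    if demand < limit:
        return "enough threads"
    if demand > limit:
        return "queries lose parallelism"
    return "at limit"  # boundary case, not described in the text

if __name__ == "__main__":
    limit = default_thread_limit(8)  # assuming a server with 8 cores
    print(search_thread_state(20, limit))
```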
Size of the index in bytes for current snapshot
arangodb_search_index_size
Size of the index in bytes for current snapshot.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of documents for current snapshot
arangodb_search_num_docs
Number of documents for current snapshot.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of failed cleanups
arangodb_search_num_failed_cleanups
Number of failed cleanups.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of failed commits
arangodb_search_num_failed_commits
Number of failed commits.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of failed consolidations
arangodb_search_num_failed_consolidations
Number of failed consolidations.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of files for current snapshot
arangodb_search_num_files
Number of files for current snapshot.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of live documents for current snapshot
arangodb_search_num_live_docs
Number of live documents for current snapshot.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of out-of-sync arangosearch links/inverted indexes
arangodb_search_num_out_of_sync_links
Number of arangosearch View links and inverted indexes that are currently out of sync. A link or inverted index is out of sync if the recovery for it is intentionally skipped or a commit operation on the link/index has failed.
Introduced in: v3.9.4
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Troubleshoot: If this metric contains a value greater than zero, the log files should be checked to find out which links/indexes are affected. The out-of-sync links/indexes should then be dropped and recreated.
Number of primary documents for current snapshot
arangodb_search_num_primary_docs
Number of primary documents for current snapshot.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Number of segments for current snapshot
arangodb_search_num_segments
Number of segments for current snapshot.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers and Single Servers |
Connectivity
Total number of connections created for connection pool
arangodb_connection_pool_connections_created_total
Total number of connections created for connection pool. There are two pools, one for the Agency communication with label AgencyComm and one for the other cluster internal communication with label ClusterComm.
Introduced in: v3.8.0
Renamed from: arangodb_connection_pool_connections_created
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators and DB-Servers |
Threshold: Because of idle timeouts, the total number of connections ever created grows. However, under high load, most connections should usually be reused and a fast growth of this number can indicate underlying connectivity issues.
Current number of connections in pool
arangodb_connection_pool_connections_current
Current number of connections in pool. There are two pools, one for the Agency communication with label AgencyComm and one for the other cluster internal communication with label ClusterComm.
Introduced in: v3.8.0
Renamed from: arangodb_connection_connections_current
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators and DB-Servers |
Threshold: Normally, one should not see an excessive amount of open connections here, unless a very high amount of operations happens concurrently.
Time to lease a connection from the connection pool
arangodb_connection_pool_lease_time_hist (basename)
arangodb_connection_pool_lease_time_hist_bucket
arangodb_connection_pool_lease_time_hist_sum
arangodb_connection_pool_lease_time_hist_count
Time to lease a connection from the connection pool. There are two pools, one for the Agency communication with label AgencyComm and one for the other cluster internal communication with label ClusterComm.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | simple | Coordinators and DB-Servers |
Threshold: Leasing connections from the pool should be fast, unless a new connection has to be formed, which can easily take several milliseconds (in particular with TLS). If times are a lot higher, there might be an underlying network problem.
Total number of failed connection leases
arangodb_connection_pool_leases_failed_total
Total number of failed connection leases. There are two pools, one for the Agency communication with label AgencyComm and one for the other cluster internal communication with label ClusterComm.
Introduced in: v3.8.0
Renamed from: arangodb_connection_pool_leases_failed
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators and DB-Servers |
Threshold: A failed lease can happen if a connection has been terminated by some idle timeout or if it is already in use by some other request. Since this can happen under concurrent load, failed leases are not actually very worrying.
Total number of successful connection leases from connection pool
arangodb_connection_pool_leases_successful_total
Total number of successful connection leases from connection pool. There are two pools, one for the Agency communication with label AgencyComm and one for the other cluster internal communication with label ClusterComm.
Introduced in: v3.8.0
Renamed from: arangodb_connection_leases_successful
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators and DB-Servers |
Threshold: It is normal that this number is growing rapidly when there is any kind of activity in the cluster.
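One way to judge whether connections are being reused is to relate the growth of arangodb_connection_pool_connections_created_total to the growth of arangodb_connection_pool_leases_successful_total over the same window. The sketch below assumes that both increases have been computed per pool beforehand; the function name and numbers are made up, and no particular alert threshold is implied.

```python
def new_connection_share(created_increase, leased_increase):
    """Share of successful leases that coincided with a newly created connection.

    Values near 0 suggest that most leases reused pooled connections.
    Returns 0.0 when there were no successful leases in the window.
    """
    if leased_increase == 0:
        return 0.0
    return created_increase / leased_increase

if __name__ == "__main__":
    # Hypothetical window: 5 new connections, 1000 successful leases.
    print(new_connection_share(5, 1000))
```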
Total number of HTTP/1.1 connections accepted
arangodb_http1_connections_total
Total number of connections accepted for HTTP/1.1. Note that this can include connections that are negotiated to be upgraded to HTTP/2.
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Total number of HTTP/2 connections accepted
arangodb_http2_connections_total
Total number of connections accepted for HTTP/2. These can be connections upgraded from HTTP/1.1 or connections negotiated to be HTTP/2 during the TLS handshake.
Introduced in: v3.7.15
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Total number of VST connections accepted
arangodb_vst_connections_total
Total number of connections accepted for VST. These are connections upgraded from HTTP/1.1.
Introduced in: v3.7.15
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Errors
Total number of errors logged
arangodb_logger_errors_total
Total number of errors (ERR messages) logged by the logger.
If a problem is encountered which is fatal to some operation, but not for the service or the application as a whole, then an error is logged.
Reasons for log entries of this severity include, for example, missing data, inability to open required files, incorrect connection strings, and missing services.
If an error is logged then it should be taken seriously as it may require user intervention to solve.
Introduced in: v3.9.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents, Coordinators, DB-Servers and Single Servers |
Total number of warnings logged
arangodb_logger_warnings_total
Total number of warnings (WARN messages) logged by the logger, including startup warnings.
Warnings might indicate problems, or might not. For example, expected transient environmental conditions such as short loss of network or database connectivity are logged as warnings, not errors.
Introduced in: v3.9.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents, Coordinators, DB-Servers and Single Servers |
Health
Number of drop-follower events
arangodb_dropped_followers_total
Total number of drop-follower events. This metric is increased on leaders
whenever a write operation cannot be replicated to a follower during
synchronous replication, and it would be unsafe in terms of data consistency
to keep that follower.
This metric was named arangodb_dropped_followers_count
in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_dropped_followers_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Threshold: Usually, drop-follower events should only happen if servers are restarted or if there are real problems on followers.
Total number of failed heartbeat transmissions
arangodb_heartbeat_failures_total
Total number of failed heartbeat transmissions. Servers in a cluster periodically send their heartbeats to the Agency to report their own liveness. This counter gets increased whenever sending such a heartbeat fails. In the single server, this counter is only used in the Active Failover deployment mode.
Introduced in: v3.8.0
Renamed from: arangodb_heartbeat_failures
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers and Single Servers |
Threshold: It is a bad sign for health if heartbeat transmissions fail. This can lead to failover actions which are ultimately bad for the service.
Troubleshoot: This can be a sign of overload or of bad network connectivity. Potentially move the Agent instances to separate machines.
Time required to send a heartbeat
arangodb_heartbeat_send_time_msec (basename)
arangodb_heartbeat_send_time_msec_bucket
arangodb_heartbeat_send_time_msec_sum
arangodb_heartbeat_send_time_msec_count
Histogram of times required to send heartbeats. For every heartbeat sent, the time is measured and an event is put into the histogram. In the single server, this histogram is only used in the Active Failover deployment mode.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Coordinators, DB-Servers and Single Servers |
Threshold: It is a bad sign for health if heartbeat transmissions are not fast. If there are heartbeats which frequently take longer than a few hundred milliseconds, or even seconds, this can eventually lead to failover actions which are ultimately bad for the service.
Troubleshoot: High heartbeat send times can be a sign of overload or of bad network connectivity. Potentially move the Agent instances to separate machines.
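To see how often heartbeats exceed a given duration, the share of observations above a bucket boundary can be derived from the histogram buckets. This is a sketch that assumes cumulative bucket counts as in the standard Prometheus histogram format; the bucket boundaries, counts, and the 200 ms cutoff are invented for the example.

```python
def fraction_above(buckets, threshold):
    """Share of observations above the first bucket bound >= threshold.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted
    by upper bound, with float("inf") as the last bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return 0.0
    for upper, cumulative in buckets:
        if upper >= threshold:
            return (total - cumulative) / total
    return 0.0

if __name__ == "__main__":
    # Hypothetical heartbeat send time buckets in milliseconds.
    buckets = [(50.0, 90), (200.0, 97), (1000.0, 99), (float("inf"), 100)]
    print(fraction_above(buckets, 200.0))
```

If the threshold falls between two bucket bounds, the result refers to the next higher bound and therefore undercounts slow heartbeats slightly.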
Number of delays in the io heartbeat test
arangodb_ioheartbeat_delays_total
This counter is increased whenever the io heartbeat encounters a delay of at least 1s when writing a small file to the database directory, reading it and then removing it again.
This test is done periodically to ensure that the underlying volume is usable and performs reasonably well. The test can be switched off explicitly with the flag --database.io-heartbeat=false, but the default is true. Furthermore, every such failure leads to a line in the log at INFO level for the ENGINES topic.
Introduced in: v3.8.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers, Agents and Single Servers |
Histogram of execution times of a single IO heartbeat check
arangodb_ioheartbeat_duration (basename)
arangodb_ioheartbeat_duration_bucket
arangodb_ioheartbeat_duration_sum
arangodb_ioheartbeat_duration_count
This histogram is updated whenever the io heartbeat runs its test in the database directory. It writes a small file, syncs it to durable storage, reads it, and then unlinks the file again. This test is done periodically to ensure that the underlying volume is usable and performs reasonably well. The test can be switched off explicitly with the flag --database.io-heartbeat=false, but the default is true.
Introduced in: v3.8.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | medium | DB-Servers, Agents and Single Servers |
Number of failures in the io heartbeat test
arangodb_ioheartbeat_failures_total
This counter is increased whenever the io heartbeat encounters a problem when writing a small file to the database directory, reading it and then removing it again. This test is done periodically to ensure that the underlying volume is usable. The test can be switched off explicitly with the flag --database.io-heartbeat=false, but the default is true. Furthermore, every such failure leads to a line in the log at INFO level for the ENGINES topic.
Introduced in: v3.8.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers, Agents and Single Servers |
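Because this is a counter, the absolute value matters less than how much it grew between two scrapes. A minimal Python sketch of that comparison follows. The function name counter_increase and the sample values are illustrative assumptions, and the reset handling mirrors the common Prometheus convention that a counter starts again from zero when the server restarts.

```python
def counter_increase(previous, current):
    """Increase of a monotonic counter between two scrapes.

    If the current value is lower than the previous one, the counter is
    assumed to have been reset (for example by a server restart), so the
    whole current value counts as the increase.
    """
    if current >= previous:
        return current - previous
    return current

# Hypothetical values of arangodb_ioheartbeat_failures_total from two scrapes.
new_failures = counter_increase(previous=3, current=5)
if new_failures > 0:
    print(f"{new_failures} new io heartbeat failure(s) since the last scrape")
```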
License
This instance’s license expiry in days
arangodb_license_expires
This instance’s remaining license validity time.
Introduced in: v3.9.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators and DB-Servers |
Maintenance
Current loading runtimes
arangodb_load_current_runtime (basename)
arangodb_load_current_runtime_bucket
arangodb_load_current_runtime_sum
arangodb_load_current_runtime_count
Histogram of Current loading runtimes, i.e. the runtimes
of the ClusterInfo::loadCurrent internal method. Provides a
distribution of all loading times for the Current section of the
Agency data. The Current section gets
loaded on server startup, and then gets reloaded on servers
only for any databases in which there have been recent structural
changes (i.e. DDL changes).
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Coordinators and DB-Servers |
Troubleshoot: In case this histogram contains very high loading times, this may be due to using many collections or many shards inside a database for which there are often structural changes. It then may make sense to reduce the number of collections or number of shards. Note that this can have other effects, so it requires an informed decision.
Plan loading runtimes
arangodb_load_plan_runtime (basename)
arangodb_load_plan_runtime_bucket
arangodb_load_plan_runtime_sum
arangodb_load_plan_runtime_count
Histogram of Plan loading runtimes, i.e. the runtimes
of the ClusterInfo::loadPlan internal method. Provides a
distribution of all loading times for the Plan section of the
Agency data. The Plan section normally gets
loaded on server startup, and then gets reloaded on servers
only for any databases in which there have been recent structural
changes (i.e. DDL changes).
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Coordinators and DB-Servers |
Troubleshoot: In case this histogram contains very high loading times, this may be due to using many collections or many shards inside a database for which there are often structural changes. It then may make sense to reduce the number of collections or number of shards. Note that this can have other effects, so it requires an informed decision.
Counter of actions that are done and have been removed from the registry
arangodb_maintenance_action_done_total
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
This metric counts the number of actions that are done and have been removed.
Introduced in: v3.8.0
Renamed from: arangodb_maintenance_action_done_counter
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Counter of actions that have been discarded because of a duplicate
arangodb_maintenance_action_duplicate_total
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
This metric counts the number of actions that have been created but found to be a duplicate of an already queued action.
Introduced in: v3.8.0
Renamed from: arangodb_maintenance_action_duplicate_counter
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Failure counter for the maintenance actions
arangodb_maintenance_action_failure_total
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
These actions can fail for different reasons. This metric counts the failed actions and can thus provide hints to investigate a malfunction.
Introduced in: v3.8.0
Renamed from: arangodb_maintenance_action_failure_counter
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Time spent in the queue before execution for maintenance actions
arangodb_maintenance_action_queue_time_msec (basename)
arangodb_maintenance_action_queue_time_msec_bucket
arangodb_maintenance_action_queue_time_msec_sum
arangodb_maintenance_action_queue_time_msec_count
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
This metric tracks the time actions spend waiting in the queue in a histogram.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | advanced | DB-Servers |
Counter of actions that have been registered in the action registry
arangodb_maintenance_action_registered_total
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
This metric counts the number of actions that are queued or active.
Introduced in: v3.8.0
Renamed from: arangodb_maintenance_action_registered_counter
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Time spent executing a maintenance action
arangodb_maintenance_action_runtime_msec (basename)
arangodb_maintenance_action_runtime_msec_bucket
arangodb_maintenance_action_runtime_msec_sum
arangodb_maintenance_action_runtime_msec_count
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. Actions are created, registered, queued and executed. Once they are done, they are eventually removed.
This metric tracks the time actions spend executing in a histogram.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | advanced | DB-Servers |
Total time spent on Agency sync
arangodb_maintenance_agency_sync_runtime_msec (basename)
arangodb_maintenance_agency_sync_runtime_msec_bucket
arangodb_maintenance_agency_sync_runtime_msec_sum
arangodb_maintenance_agency_sync_runtime_msec_count
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. To identify the target state, differences in the metadata store provided by the Agency are investigated and local changes are reported. This process is called Agency sync and is executed in regular intervals.
This metric tracks the runtime of individual Agency syncs in a histogram. During DDL operations the runtime can increase but should generally be below 1s.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | simple | DB-Servers |
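The guideline that Agency syncs should generally stay below 1s can be checked against the cumulative buckets of this histogram. The sketch below assumes hypothetical bucket boundaries in milliseconds and a made-up helper named fraction_at_or_below. When the threshold is not itself a bucket boundary, it falls back to the nearest lower boundary, which gives a conservative estimate.

```python
import math

def fraction_at_or_below(buckets, threshold):
    """Fraction of observations known to be <= threshold.

    buckets: list of (upper_bound, cumulative_count), as exported in the
    _bucket series of a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return 1.0  # no observations, so nothing exceeded the threshold
    covered = 0.0
    for bound, cumulative in buckets:
        if bound <= threshold:
            covered = cumulative
    return covered / total

# Hypothetical buckets for arangodb_maintenance_agency_sync_runtime_msec.
sync_buckets = [(100.0, 180.0), (1000.0, 196.0), (10000.0, 200.0), (math.inf, 200.0)]
share = fraction_at_or_below(sync_buckets, 1000.0)
print(f"{share:.1%} of Agency syncs finished within 1s")
```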
Maintenance Phase 1 runtime histogram
arangodb_maintenance_phase1_runtime_msec (basename)
arangodb_maintenance_phase1_runtime_msec_bucket
arangodb_maintenance_phase1_runtime_msec_sum
arangodb_maintenance_phase1_runtime_msec_count
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. To identify the target state, differences in the metadata store provided by the Agency are investigated and local changes are reported. This process is called Agency sync and is executed in regular intervals.
This metric tracks the runtime of phase1 of an Agency sync. Phase1 calculates the difference between the local and the target state.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | advanced | DB-Servers |
Maintenance Phase 2 runtime histogram
arangodb_maintenance_phase2_runtime_msec (basename)
arangodb_maintenance_phase2_runtime_msec_bucket
arangodb_maintenance_phase2_runtime_msec_sum
arangodb_maintenance_phase2_runtime_msec_count
DB-Servers execute reconciliation actions to let the cluster converge to the desired state. To identify the target state, differences in the metadata store provided by the Agency are investigated and local changes are reported. This process is called Agency sync and is executed in regular intervals.
This metric tracks the runtime of phase2 of an Agency sync. Phase2 calculates what actions to execute given the difference of the local and target state.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | advanced | DB-Servers |
Network
Request time for Agency requests
arangodb_agencycomm_request_time_msec (basename)
arangodb_agencycomm_request_time_msec_bucket
arangodb_agencycomm_request_time_msec_sum
arangodb_agencycomm_request_time_msec_count
This histogram shows how long requests to the Agency took.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | ms | medium | Coordinators and DB-Servers |
Threshold: Usually, such requests should be relatively quick, mostly clearly sub-second.
Troubleshoot: If the network or the Agents are overloaded, it can help to move Agent instances to separate machines.
Number of failed connectivity check requests sent to Coordinators
arangodb_network_connectivity_failures_coordinators_total
Number of failed connectivity check requests sent by this instance
to Coordinators.
The metric will be increased if a cluster-internal connection to a
Coordinator cannot be established within 10 seconds. In this case the
instance will also log a warning.
Connectivity checks run with a configurable frequency, adjustable via
the --cluster.connectivity-check-interval
startup option.
Introduced in: v3.11.4
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators and DB-Servers |
Troubleshoot: If this metric keeps increasing permanently or for longer periods, there are likely network connectivity issues or misconfigurations, which can prevent the instance from connecting to every Coordinator in the cluster (potentially connecting to itself). In this case it is advised to check the connectivity manually, and ensure that the instance can make connections to all Coordinators, potentially including itself.
If there are temporary network glitches or instance restarts, this metric can also increase. But increases should stop once the network has stabilized and/or instances have restarted successfully.
Number of failed connectivity check requests sent to DB-Servers
arangodb_network_connectivity_failures_dbservers_total
Number of failed connectivity check requests sent by this instance
to DB-Servers.
The metric will be increased if a cluster-internal connection to a
DB-Server cannot be established within 10 seconds. In this case the
instance will also log a warning.
Connectivity checks run with a configurable frequency, adjustable via
the --cluster.connectivity-check-interval
startup option.
Introduced in: v3.11.4
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators and DB-Servers |
Troubleshoot: If this metric keeps increasing permanently or for longer periods, there are likely network connectivity issues or misconfigurations, which can prevent the instance from connecting to every DB-Server in the cluster (potentially connecting to itself). In this case it is advised to check the connectivity manually, and ensure that the instance can make connections to all DB-Servers, potentially including itself.
If there are temporary network glitches or instance restarts, this metric can also increase. But increases should stop once the network has stabilized and/or instances have restarted successfully.
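The distinction between a temporary glitch and a persistent connectivity problem can be approximated by checking whether the counter rose in every recent scrape interval. This is a rough Python sketch. The function name keeps_increasing, the window size, and the sample series are assumptions for illustration, not a recommended alert rule. The same check applies to the Coordinator variant of this metric.

```python
def keeps_increasing(samples, intervals):
    """True if the counter rose in each of the last `intervals` scrape intervals."""
    if len(samples) < intervals + 1:
        return False  # not enough history to decide
    recent = samples[-(intervals + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

# Hypothetical scrapes of arangodb_network_connectivity_failures_dbservers_total.
glitch = [0, 2, 4, 4, 4, 4]       # failures stopped after the network stabilized
persistent = [0, 1, 3, 4, 6, 9]   # failures continue in every interval

print(keeps_increasing(glitch, intervals=3))
print(keeps_increasing(persistent, intervals=3))
```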
Internal request duration for the dequeue in seconds
arangodb_network_dequeue_duration (basename)
arangodb_network_dequeue_duration_bucket
arangodb_network_dequeue_duration_sum
arangodb_network_dequeue_duration_count
Histogram providing the time from submitting an internal request until the IO thread in the fuerte driver starts working on it. Times are in seconds.
Introduced in: v3.10.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | seconds | advanced | Coordinators, DB-Servers and Agents |
Troubleshoot: Counts in the high brackets indicate that the IO threads cannot keep up with the work.
Number of requests forwarded to another Coordinator
arangodb_network_forwarded_requests_total
Number of requests forwarded to another Coordinator. Request forwarding can happen in load-balanced setups, when one Coordinator receives and forwards requests that can only be handled by a different Coordinator. This includes requests for streaming transactions, AQL, query cursors, Pregel jobs and some others.
Introduced in: v3.8.0
Renamed from: arangodb_network_forwarded_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators |
Internal request round-trip time as a percentage of timeout
arangodb_network_request_duration_as_percentage_of_timeout
Histogram providing the round-trip time of internal requests as a percentage
of the respective request timeout.
This metric provides values between 0 and 100.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | percentage | advanced | Coordinators, DB-Servers and Agents |
Troubleshoot: High values indicate problems with requests that have timed out or have come close to running into timeouts. If many requests time out, this is normally a symptom of overload. This can normally be mitigated by reducing the workload or adjusting the type of operations that are causing the high response times. If the timeouts happen as a result of not enough processing power, it may be useful to scale up the cluster.
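As a worked example of the unit, a request that takes 3 seconds against a 30 second timeout corresponds to 10 percent. The Python sketch below only illustrates this arithmetic. The function name is hypothetical, and the clamping to 100 is an assumption based on the stated value range, not a description of the server's internal code.

```python
def duration_as_percentage_of_timeout(duration_s, timeout_s):
    """Round-trip time expressed as a percentage of the request timeout."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    # Clamp to the documented 0..100 range (assumption for timed-out requests).
    return max(0.0, min(100.0, 100.0 * duration_s / timeout_s))

print(duration_as_percentage_of_timeout(3.0, 30.0))
print(duration_as_percentage_of_timeout(45.0, 30.0))
```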
Number of internal requests that have timed out
arangodb_network_request_timeouts_total
Number of internal requests that have timed out. This metric is increased whenever any cluster-internal request executed in the underlying connection pool runs into a timeout.
Introduced in: v3.8.0
Renamed from: arangodb_network_request_timeouts
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers and Agents |
Troubleshoot: Request timeouts can be caused by the destination servers being overloaded and thus slow to respond, or by network errors. If this counter increases, it is advised to check network connectivity and server loads.
Number of outgoing internal requests in flight
arangodb_network_requests_in_flight
Number of outgoing internal requests in flight. This metric is increased whenever any cluster-internal request is about to be sent via the underlying connection pool, and is decreased whenever a response for such a request is received or the request runs into a timeout. This metric provides an estimate of the fan-out of operations. For example, a user operation on a collection with a single shard normally leads to a single internal request (plus replication), whereas an operation on a collection with 10 shards may lead to a fan-out of 10 (plus replication).
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators, DB-Servers and Agents |
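The way this gauge moves can be pictured with a toy model: it goes up when a request is about to be sent and down when a response arrives or the request times out. The Python sketch below is purely illustrative. The class InFlightGauge and its method names are hypothetical and do not reflect the server's implementation, and replication traffic is left out.

```python
class InFlightGauge:
    """Toy model of a gauge that tracks outgoing requests in flight."""

    def __init__(self):
        self.value = 0

    def request_sent(self):
        self.value += 1

    def request_finished(self):
        # Called both when a response is received and when a timeout occurs.
        self.value -= 1

gauge = InFlightGauge()
for _ in range(10):  # one internal request per shard of a 10-shard collection
    gauge.request_sent()
peak = gauge.value
for _ in range(10):
    gauge.request_finished()
print(peak, gauge.value)
```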
Internal request duration from fully sent till response received in seconds
arangodb_network_response_duration (basename)
arangodb_network_response_duration_bucket
arangodb_network_response_duration_sum
arangodb_network_response_duration_count
Histogram providing the time, in seconds, from when an internal request was fully sent off until its response has been received.
Introduced in: v3.10.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | seconds | advanced | Coordinators, DB-Servers and Agents |
Troubleshoot: Counts in the high brackets indicate problems with the network infrastructure.
Internal request send duration in seconds
arangodb_network_send_duration (basename)
arangodb_network_send_duration_bucket
arangodb_network_send_duration_sum
arangodb_network_send_duration_count
Histogram providing the sending time of internal requests in seconds.
Introduced in: v3.10.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | seconds | advanced | Coordinators, DB-Servers and Agents |
Troubleshoot: Counts in the high brackets indicate problems with the network infrastructure.
Number of internal requests for which sending has not finished
arangodb_network_unfinished_sends_total
Number of internal requests for which sending has not finished. This is usually due to some connection problem or to a timeout in case the receiving side did not receive the data fast enough.
Introduced in: v3.10.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers and Agents |
Troubleshoot: If this counter moves, it is a sign that either there are delays in the networking infrastructure or on the receiving side.
Pregel
Number of loading Pregel conductors
arangodb_pregel_conductors_loading_number
Number of loading Pregel conductors.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of Pregel conductors
arangodb_pregel_conductors_number
Number of Pregel conductors.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of running Pregel conductors
arangodb_pregel_conductors_running_number
Number of running Pregel conductors.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of storing Pregel conductors
arangodb_pregel_conductors_storing_number
Number of storing Pregel conductors.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Memory allocated by Pregel for graph storage
arangodb_pregel_graph_memory_bytes_number
The number of bytes allocated by Pregel.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | DB-Servers, Single Servers and Coordinators |
Number of messages received by Pregel
arangodb_pregel_messages_received_total
The number of messages received by Pregel.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of messages sent by Pregel
arangodb_pregel_messages_sent_total
Number of messages sent by Pregel.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of threads running for Pregel
arangodb_pregel_threads_number
Number of threads running for Pregel.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of loading Pregel workers
arangodb_pregel_workers_loading_number
Number of loading Pregel workers.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of Pregel workers
arangodb_pregel_workers_number
Number of Pregel workers.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of running Pregel workers
arangodb_pregel_workers_running_number
Number of running Pregel workers.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Number of storing Pregel workers
arangodb_pregel_workers_storing_number
Number of storing Pregel workers.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Single Servers and Coordinators |
Replication
Total number of collection truncate operations by synchronous replication
arangodb_collection_truncates_replication_total
Total number of collection truncate operations by synchronous
replication on followers. Note that this metric is only present when the command
line option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Renamed from: arangodb_collection_truncates_replication
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Total number of document write operations by synchronous replication
arangodb_document_writes_replication_total
Total number of document write operations (insert, update, replace, remove)
executed by the synchronous replication on followers.
This metric is only present if the option --server.export-read-write-metrics
is set to true.
Introduced in: v3.8.0
Renamed from: arangodb_document_writes_replication
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of refusal answers from a follower during synchronous replication
arangodb_refused_followers_total
Number of refusal answers from a follower during synchronous replication.
A refusal answer is only sent by a follower if the follower is under
the impression that the replication request was not sent by the current
shard leader. This can happen if replication requests to the follower are
delayed or the follower is slow to process incoming requests and there was
a leader change for the shard.
If such a refusal answer is received by the shard leader, it drops the
follower from the list of followers.
This metric was named arangodb_refused_followers_count in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_refused_followers_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Threshold: Usually, refusal answers only occur if request processing on followers is delayed and there was a recent leadership change. This should not be a common case and normally indicates a problem with the setup or with the load.
Histogram of the RTT of appendEntries requests in microseconds
arangodb_replication2_replicated_log_append_entries_rtt (basename)
arangodb_replication2_replicated_log_append_entries_rtt_bucket
arangodb_replication2_replicated_log_append_entries_rtt_sum
arangodb_replication2_replicated_log_append_entries_rtt_count
The leader of a replicated log replicates the log entries by sending appendEntries requests to its followers. This is a histogram keeping track of the number of such requests this server sends (for any log it is a leader of), as well as their respective round-trip times in microseconds.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | medium | DB-Servers and Single Servers |
Number of replicated log instances created on this server since its start
arangodb_replication2_replicated_log_creation_total
Number of replicated log instances created on this server since its start.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers and Single Servers |
Number of replicated log instances deleted on this server since its start
arangodb_replication2_replicated_log_deletion_total
Number of replicated log instances deleted on this server since its start.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers and Single Servers |
Histogram of the duration of appendEntries calls in microseconds
arangodb_replication2_replicated_log_follower_append_entries_rt (basename)
arangodb_replication2_replicated_log_follower_append_entries_rt_bucket
arangodb_replication2_replicated_log_follower_append_entries_rt_sum
arangodb_replication2_replicated_log_follower_append_entries_rt_count
Followers of a replicated log, including the local followers for persistence each leader has, receive appendEntries requests. These contain log entries to be replicated. Serving these requests means writing them to persistence.
This histogram counts the number of such requests served by this server, plus their respective runtime in microseconds.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | advanced | DB-Servers and Single Servers |
Current number of followers of replicated logs
arangodb_replication2_replicated_log_follower_number
Current number of replicated logs this server is a follower of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers and Single Servers |
Current number of inactive replicated logs
arangodb_replication2_replicated_log_inactive_number
Current number of replicated logs this server is a participant of, but is not yet configured to be either a leader or a follower. This number should usually be zero, except during server startup until the configuration from the Agency is fetched and applied, and for a short time when a log is newly created.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers and Single Servers |
Number of bytes inserted into replicated logs
arangodb_replication2_replicated_log_inserts_bytes (basename)
arangodb_replication2_replicated_log_inserts_bytes_bucket
arangodb_replication2_replicated_log_inserts_bytes_sum
arangodb_replication2_replicated_log_inserts_bytes_count
For all replicated logs this server is a leader of, this counts the number of bytes of raw payload inserted into it.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | medium | DB-Servers and Single Servers |
Round-trip time of inserts into replicated logs in microseconds
arangodb_replication2_replicated_log_inserts_rtt (basename)
arangodb_replication2_replicated_log_inserts_rtt_bucket
arangodb_replication2_replicated_log_inserts_rtt_sum
arangodb_replication2_replicated_log_inserts_rtt_count
Inserts into replicated logs this server is a leader of are counted by this histogram, plus their respective round-trip time. This includes the time for replication, at least until the writeConcern is satisfied.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | medium | DB-Servers and Single Servers |
Current number of replicated log leaders
arangodb_replication2_replicated_log_leader_number
Number of replicated logs this server is currently a leader of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers and Single Servers |
Number of leader takeovers
arangodb_replication2_replicated_log_leader_took_over_total
Number of times this server took over the leadership of a replicated log since server start.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Number of replicated logs
arangodb_replication2_replicated_log_number
Number of replicated logs this server is a participant of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers and Single Servers |
Number of accepted (not yet committed) log entries
arangodb_replication2_replicated_log_number_accepted_entries_total
The entries have been inserted into the log and are being replicated but a quorum has not yet been reached.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of committed log entries
arangodb_replication2_replicated_log_number_committed_entries_total
Number of committed log entries.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of compacted log entries
arangodb_replication2_replicated_log_number_compacted_entries_total
Number of compacted log entries. If log entries have been replicated to every participant and applied durably, they can be compacted, i.e. deleted.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of meta log entries
arangodb_replication2_replicated_log_number_meta_entries_total
Number of meta log entries. A meta log entry is used to create artificial write barriers, for example during a configuration change or move shard operation.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of times started following a replicated log
arangodb_replication2_replicated_log_started_following_total
Number of times this server started following a replicated log since server start.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Number of errors during an acquire snapshot operation
arangodb_replication2_replicated_state_acquire_snapshot_errors_total
Number of errors during an acquire snapshot operation.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Number of log entries applied to the internal state
arangodb_replication2_replicated_state_applied_entries_total
Number of log entries applied to the internal state.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Number of errors during an apply entries operation
arangodb_replication2_replicated_state_apply_entries_errors_total
Number of errors during an apply entries operation.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Histogram of the duration of acquireSnapshot calls in microseconds
arangodb_replication2_replicated_state_follower_acquire_snapshot_rt (basename)
arangodb_replication2_replicated_state_follower_acquire_snapshot_rt_bucket
arangodb_replication2_replicated_state_follower_acquire_snapshot_rt_sum
arangodb_replication2_replicated_state_follower_acquire_snapshot_rt_count
Histogram of the duration of acquireSnapshot calls in microseconds. This measures the time a new follower takes to acquire a snapshot from the leader.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | advanced | DB-Servers |
Histogram of the duration of applyEntries calls in microseconds
arangodb_replication2_replicated_state_follower_apply_entries_rt (basename)
arangodb_replication2_replicated_state_follower_apply_entries_rt_bucket
arangodb_replication2_replicated_state_follower_apply_entries_rt_sum
arangodb_replication2_replicated_state_follower_apply_entries_rt_count
Histogram of the duration of applyEntries calls in microseconds. This measures the time a follower takes to apply log entries to its local state after they have been committed.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | advanced | DB-Servers |
Number of follower replicated states
arangodb_replication2_replicated_state_follower_number
Number of replicated states this server is a follower of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of followers waiting for the leader to acknowledge the current term
arangodb_replication2_replicated_state_follower_waiting_for_leader_number
Number of followers waiting for the leader to acknowledge the current term.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers |
Number of followers waiting for a snapshot transfer to complete
arangodb_replication2_replicated_state_follower_waiting_for_snapshot_number
Number of followers waiting for a snapshot transfer to complete.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers |
Number of leader replicated states
arangodb_replication2_replicated_state_leader_number
Number of replicated states this server is a leader of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Histogram of the duration of recoverEntries calls in microseconds
arangodb_replication2_replicated_state_leader_recover_entries_rt (basename)
arangodb_replication2_replicated_state_leader_recover_entries_rt_bucket
arangodb_replication2_replicated_state_leader_recover_entries_rt_sum
arangodb_replication2_replicated_state_leader_recover_entries_rt_count
Histogram of the duration of recoverEntries calls in microseconds. This measures the time a new leader takes to recover from existing log entries after it was elected.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | us | advanced | DB-Servers |
Number of leaders waiting for recovery to be complete
arangodb_replication2_replicated_state_leader_waiting_for_recovery_number
Number of leaders waiting for recovery to be complete.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers |
Number of replicated states
arangodb_replication2_replicated_state_number
Number of replicated states this server is a participant of.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Number of log entries processed by the follower
arangodb_replication2_replicated_state_processed_entries_total
Number of log entries processed by the follower.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Number of currently connected/active replication clients
arangodb_replication_clients
This metric contains the number of currently active/connected replication clients that have started or are currently receiving data from this server for replication purposes.
Introduced in: v3.10.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Single Servers and DB-Servers |
(DC-2-DC only) Number of times the database and collection overviews have been requested
arangodb_replication_cluster_inventory_requests_total
When using a DC-2-DC configuration of ArangoDB this metric is active on both data-centers. It indicates that the follower data-center periodically matches the available databases and collections in order to mirror them. If no DC-2-DC is set up this value is expected to be 0.
Introduced in: v3.8.0
Renamed from: arangodb_replication_cluster_inventory_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators |
Troubleshoot: If you have a DC-2-DC installation and this metric stays constant over a longer period of time in either of the two data centers, the follower data center is no longer properly connected. The issue is most likely within the sync process on either of the two data centers, as they no longer compare their inventories. This metric gives no information about the health of the ArangoDB cluster itself; check other metrics for that.
Accumulated time needed to apply asynchronously replicated data on initial synchronization of shards
arangodb_replication_dump_apply_time_total
Measures the time required to clone the existing leader copy of the data onto a new replica shard. It is only measured on the follower server. This time is expected to increase whenever new followers are created, for example when increasing the replication factor or redistributing shards.
Introduced in: v3.8.0
Renamed from: arangodb_replication_dump_apply_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Troubleshoot: This metric measures a typical operation that keeps the cluster resilient, so no reaction is required. In a stable cluster situation (no outages, no collection modification), this metric should also be stable.
Total number of bytes replicated in initial asynchronous phase
arangodb_replication_dump_bytes_received_total
During initial replication the existing data from the leader is copied asynchronously over to new shards. The amount of data transported to this server, as a replica for a shard, is counted here in bytes.
Introduced in: v3.8.0
Renamed from: arangodb_replication_dump_bytes_received
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | bytes | medium | DB-Servers |
Total number of documents replicated in initial asynchronous phase
arangodb_replication_dump_documents_total
During initial replication the existing data from the leader is copied asynchronously over to new shards. The amount of documents transported to this server, as a replica for a shard, is counted here.
Introduced in: v3.8.0
Renamed from: arangodb_replication_dump_documents
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers |
Accumulated wait time for replication requests in initial asynchronous phase
arangodb_replication_dump_request_time_total
During initial replication the existing data from the leader is copied asynchronously over to new shards. The accumulated time the follower waited for the leader to send the data is counted here.
Introduced in: v3.8.0
Renamed from: arangodb_replication_dump_request_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Number of requests used in initial asynchronous replication phase
arangodb_replication_dump_requests_total
During initial replication the existing data from the leader is copied asynchronously over to new shards. The number of requests required to transport data to this server, as a replica for a shard, is counted here.
Introduced in: v3.8.0
Renamed from: arangodb_replication_dump_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of failed connection attempts and response errors during initial asynchronous replication
arangodb_replication_failed_connects_total
During initial replication the existing data from the leader is copied asynchronously over to new shards. Whenever there is a communication issue between the follower and the leader of the shard, it is counted here for the follower. These communication issues cover failed connections and HTTP errors, but they also cover invalid or unexpected data formats received on the follower.
Introduced in: v3.8.0
Renamed from: arangodb_replication_failed_connects
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Threshold: In an ideal situation, this counter should be 0. It is expected to increase if there is a server or network outage. However, it is not guaranteed that this metric increases in such a situation.
Troubleshoot: If this counter increases, this typically indicates an issue with the communication between servers. An occasional increase of one can be a simple network hiccup, but constant increases indicate serious issues. It also means that a shard is trying to get in sync with the existing data but cannot make progress, so you have only replicationFactor - 1 copies of the data right now. If more servers suffer an outage, you may lose data in this case.
- First thing to check: Network connectivity. Make sure all servers are online and the machines can communicate with one another.
- Second: Check the ArangoDB logs of this server for more details. Most likely you see WARN or ERROR messages in the replication log topic. If you contact ArangoDB support for this issue, it helps to include this server's logs as well.
- Third: (Unlikely) If the logs contain unexpected format or value entries, check whether all ArangoDB DB-Servers are running the same version of ArangoDB. Only upgrades of one minor version at a time are supported in general, so if one server runs a much newer or older version, upgrade all servers to the newest version.
- Fourth: If none of the above applies, contact ArangoDB Support.
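The distinction between an occasional increase and constant increases can be checked mechanically on a series of scraped counter values. The following Python sketch is only an illustration: the function names, the three labels, and the rule that "persistent" means an increase in every scrape interval are assumptions made for this example, not thresholds defined by ArangoDB.

```python
def counter_increases(values):
    """Return the increase between consecutive scrapes of a counter.

    A value lower than its predecessor is treated as a counter reset
    (for example after a server restart), so the new value itself is
    taken as the increase.
    """
    return [cur if cur < prev else cur - prev
            for prev, cur in zip(values, values[1:])]


def classify(values):
    """Classify scraped values of arangodb_replication_failed_connects_total."""
    increases = counter_increases(values)
    if not any(increases):
        return "ok"  # no failed connects in the observed window
    if len(increases) >= 2 and all(inc > 0 for inc in increases):
        return "persistent"  # the counter grew in every interval
    return "occasional"  # isolated increases, possibly a network hiccup


print(classify([3, 3, 3, 3]))  # ok
print(classify([5, 5, 6, 6]))  # occasional
print(classify([0, 2, 5, 9]))  # persistent
```

Because the metric is a counter, only the change between scrapes is meaningful. A large absolute value that no longer grows reflects past problems, not current ones.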
Accumulated wait time for replication key chunks determination requests
arangodb_replication_initial_chunks_requests_time_total
This counter exhibits the accumulated wait time for replication key chunks determination requests, in milliseconds. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the time used for the initial step of getting the checksums for the key chunks.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_chunks_requests_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Accumulated time needed to request replication docs data
arangodb_replication_initial_docs_requests_time_total
This counter exhibits the accumulated wait time for requesting actual documents for the initial replication, in milliseconds. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the time used for the final step of actually getting the needed documents.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_docs_requests_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Accumulated time needed to apply replication initial sync insertions
arangodb_replication_initial_insert_apply_time_total
Accumulated time needed to apply replication initial sync insertions. This counter exhibits the accumulated wait time for actually inserting documents for the initial synchronization, in milliseconds. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the time used for the actual insertion of replicated documents on the follower.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_insert_apply_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Accumulated wait time for replication keys requests
arangodb_replication_initial_keys_requests_time_total
This counter exhibits the accumulated wait time for fetching key lists for a chunk, in milliseconds. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the time used for the second step of getting lists of key/revision pairs for each chunk.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_keys_requests_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Accumulated time needed to apply replication initial sync removals
arangodb_replication_initial_remove_apply_time_total
This counter exhibits the accumulated wait time for removing local documents during initial synchronization of a shard on the follower, in milliseconds. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the time used for the intermediate step of removing unneeded documents on the follower.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_remove_apply_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Accumulated amount of bytes received in initial sync
arangodb_replication_initial_sync_bytes_received_total
This counter exhibits the accumulated number of bytes received for initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the number of bytes received for all three steps.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_bytes_received
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | bytes | medium | DB-Servers |
Number of documents inserted by replication initial sync
arangodb_replication_initial_sync_docs_inserted_total
This counter exhibits the total number of documents inserted on the follower during initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the total number of documents inserted in the third step.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_docs_inserted
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of documents removed by replication initial sync
arangodb_replication_initial_sync_docs_removed_total
This counter exhibits the total number of documents removed on the follower during initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the total number of documents removed in the third step.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_docs_removed
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of documents requested by replication initial sync
arangodb_replication_initial_sync_docs_requested_total
This counter exhibits the total number of documents fetched on the follower from the leader during initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the total number of documents fetched from the leader in the third step.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_docs_requested
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of replication initial sync docs requests
arangodb_replication_initial_sync_docs_requests_total
This counter exhibits the total number of times documents have been fetched on the follower from the leader during initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric accumulates the total number of times documents have been fetched from the leader in the third step.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_docs_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of replication initial sync keys requests
arangodb_replication_initial_sync_keys_requests_total
This counter exhibits the accumulated number of keys requests for initial synchronization of shards. This is part of the older (pre 3.8) initial replication protocol, which might still be used in 3.8 for collections which have been created by older versions.
In this older protocol, the follower first fetches an overview over a shard from the leader. This does a full collection scan and divides the primary keys in the collection into equal sized chunks. Then, a checksum for each chunk is returned. The same is then done on the follower and the checksums are compared, chunk by chunk. For each chunk, for which the checksums do not match, the list of keys and revisions is fetched from the leader. This then enables the follower to fetch the actually needed documents and remove superfluous ones locally.
This metric counts the number of times the follower fetches a list of keys for some chunk.
Introduced in: v3.8.0
Renamed from: arangodb_replication_initial_sync_keys_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Total number of synchronous replication requests
arangodb_replication_synchronous_requests_total_number_total
The total number of synchronous replication requests made between DB-Servers.
Introduced in: v3.8.0
Renamed from: arangodb_replication_synchronous_requests_total_number
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Total time needed for all synchronous replication requests
arangodb_replication_synchronous_requests_total_time_total
The total time needed for all synchronous replication requests made between DB-Servers.
Introduced in: v3.8.0
Renamed from: arangodb_replication_synchronous_requests_total_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
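Dividing the change of arangodb_replication_synchronous_requests_total_time_total by the change of arangodb_replication_synchronous_requests_total_number_total between two scrapes gives an average time per synchronous replication request. This derived value is not something ArangoDB exports itself; the Python sketch below, including the function name and the sample numbers, is an assumption for illustration.

```python
def avg_request_time_ms(prev, cur):
    """Average time per request in milliseconds between two scrapes.

    prev and cur are (requests_total, time_total_ms) tuples. Returns None
    if no requests were made in between or a counter was reset.
    """
    delta_requests = cur[0] - prev[0]
    delta_time_ms = cur[1] - prev[1]
    if delta_requests <= 0 or delta_time_ms < 0:
        return None
    return delta_time_ms / delta_requests


# Hypothetical scrapes: 50 requests took 750 ms in total.
print(avg_request_time_ms((100, 5000), (150, 5750)))  # 15.0
```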
Accumulated time needed to apply replication tailing data
arangodb_replication_tailing_apply_time_total
The accumulated time needed to locally process the continuous replication log on a follower received from a replication leader.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_apply_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | DB-Servers |
Troubleshoot: If you see unusual spikes here, the follower might not have enough IO bandwidth or might be overloaded. Try to provision more IOPS or more CPU capacity. Additionally, it could make sense to compare the value with all other available follower DB-Servers to detect potential differences.
Accumulated number of bytes received for replication tailing requests
arangodb_replication_tailing_bytes_received_total
The accumulated number of bytes received from a leader for replication tailing requests. The higher the amount of bytes is, the more data is being processed afterwards on the follower DB-Server.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_bytes_received
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | bytes | medium | DB-Servers |
Troubleshoot: Compare this metric with all other related participating follower DB-Servers. If the given value on a DB-Server is considerably higher, you might want to think about rebalancing your data, as the overall work might not be evenly distributed.
Accumulated number of replication tailing document inserts/replaces processed
arangodb_replication_tailing_documents_total
The accumulated number of replication tailing document inserts/replaces processed on a follower.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_documents
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Troubleshoot: Compare this metric with all other related participating follower DB-Servers. If the given value on a DB-Server is considerably higher, you might want to think about rebalancing your data, as the overall work might not be evenly distributed. Note that this metric only counts documents and does not take document sizes into account. Even if values vary compared to other DB-Servers, the workload could be fine. Therefore, also check the metric arangodb_replication_tailing_bytes_received_total to get a more complete and precise picture.
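The comparison across follower DB-Servers can be sketched as follows. This Python example is an assumption-laden illustration: the server names, the counter values, the helper `above_median`, and the factor of 1.5 times the median are all invented here and are not recommendations from ArangoDB.

```python
from statistics import median


def above_median(per_server, factor=1.5):
    """Return the servers whose value exceeds factor times the median."""
    med = median(per_server.values())
    return {name for name, value in per_server.items() if value > factor * med}


# Hypothetical counter values scraped from three follower DB-Servers.
documents = {"dbserver-1": 1000, "dbserver-2": 1100, "dbserver-3": 2500}
received_bytes = {"dbserver-1": 4.0e6, "dbserver-2": 4.2e6, "dbserver-3": 9.5e6}

# Flag a server only if it stands out in both document count and bytes.
suspects = sorted(above_median(documents) & above_median(received_bytes))
print(suspects)  # ['dbserver-3']
```

Requiring both metrics to stand out reflects the point that a high document count alone does not necessarily mean a high workload.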
Number of replication tailing failures due to missing tick on leader
arangodb_replication_tailing_follow_tick_failures_total
The number of replication tailing failures due to missing tick on leader.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_follow_tick_failures
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Troubleshoot: If this is non-zero, action is required. A required follower tick is not present (potentially removed) on a leader DB-Server. Check the log files of the related leader DB-Server to identify the cause. It may be required to do a full re-sync and/or increase the number of historic logfiles on the leader(s).
Number of replication tailing markers processed
arangodb_replication_tailing_markers_total
The number of replication tailing markers processed on a follower DB-Server. Markers are specific operations which are part of the write-ahead log (WAL). Example actions which are being used in markers: Create or drop a database. Create, drop, rename, change or truncate a collection. Create or drop an index. Create, drop, change a view. Start, commit or abort a transaction.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_markers
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of replication tailing document removals processed
arangodb_replication_tailing_removals_total
The number of document removal marker operations processed on a follower DB-Server.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_removals
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Aggregated wait time for replication tailing requests
arangodb_replication_tailing_request_time_total
Aggregated wait time for replication tailing requests.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_request_time
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | advanced | DB-Servers |
Troubleshoot: If you see unusual spikes here, please inspect potential network issues. It may help to increase network bandwidth and/or reduce network latency. In case there are no network issues, also check the load of the serving leader DB-Server, as well as the follower DB-Server, as they could potentially be overloaded and reaching hardware-based limits.
Number of replication tailing requests
arangodb_replication_tailing_requests_total
The total number of network replication tailing requests.
Introduced in: v3.8.0
Renamed from: arangodb_replication_tailing_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of revision tree hibernations
arangodb_revision_tree_hibernations_total
The revision trees of collections/shards are normally present in RAM in an uncompressed state. However, to reduce the memory usage of keeping all revision trees in RAM at the same time, revision trees can be put into “hibernation” mode. Any inactive revision tree is automatically hibernated by ArangoDB after a while. For the hibernation step, a revision tree is compressed in RAM, and only the compressed version is then kept. Later accesses of a compressed revision tree require uncompressing the tree again. This metric is increased whenever a revision tree is hibernated. This can happen many times during the lifetime of a revision tree.
Introduced in: v3.8.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Total memory usage of all revision trees (both hibernated and uncompressed)
arangodb_revision_tree_memory_usage
Total memory usage of all revision trees for collections/shards. The revision trees of collections/shards are normally present in RAM in an uncompressed state. However, to reduce the memory usage of keeping all revision trees in RAM at the same time, revision trees can be put into “hibernation” mode. Any inactive revision tree is automatically hibernated by ArangoDB after a while. For the hibernation step, a revision tree is compressed in RAM, and only the compressed version is then kept. Later accesses of a compressed revision tree require uncompressing the tree again. This metric reports the total memory usage of all revision trees, including both the hibernated and uncompressed forms.
Introduced in: v3.8.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | DB-Servers, Agents and Single Servers |
Number of failed revision tree rebuilds
arangodb_revision_tree_rebuilds_failure_total
Number of failed background revision tree rebuilds. Ideally this value stays at 0, because if a revision tree rebuild fails, the system may stall and not be able to make progress in terms of WAL file collection. When the counter increases, an error message is also logged to the server logfile.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Number of successful revision tree rebuilds
arangodb_revision_tree_rebuilds_success_total
Number of successful background revision tree rebuilds. Ideally this value stays at 0, because a revision tree rebuild indicates a problem with a collection/shard’s revision tree that has happened before.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Number of revision tree resurrections
arangodb_revision_tree_resurrections_total
The revision trees of collections/shards are normally present in RAM in an uncompressed state. However, to reduce the memory usage of keeping all revision trees in RAM at the same time, revision trees can be put into “hibernation” mode. Any inactive revision tree is automatically hibernated by ArangoDB after a while. For the hibernation step, a revision tree is compressed in RAM, and only the compressed version is then kept. Later accesses of a compressed revision tree require uncompressing the tree again. This metric is increased whenever a revision tree is restored from its hibernated state back into an uncompressed form in RAM. This can happen many times during the lifetime of a revision tree.
Introduced in: v3.8.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Number of leader shards on this machine
arangodb_shards_leader_number
Number of leader shards on this machine. Every shard has a leader and potentially multiple followers.
Introduced in: v3.8.0
Renamed from: arangodb_shards_leader_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers and Agents |
Troubleshoot: Since the leaders perform all the read and write operations and the followers only replicate the writes, one should usually have a relatively even distribution of leader shards across DB-Servers. An exception can be one-shard deployments, in which every collection has a single shard and all shards in a database must have the same leader. If you have few databases in a one-shard deployment, then an uneven distribution of leader shards is natural.
You can either move shards manually, use the Rebalance shards button
in the ArangoDB web interface, or use the cluster maintenance tools
(create-move-plan and execute-move-plan specifically). In the latter
case, contact ArangoDB customer support.
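To judge how even the leader distribution is, one option is to compare the largest per-server value of arangodb_shards_leader_number with the mean across DB-Servers. The following is a minimal sketch under that assumption. The function name, the server names, and the sample values are hypothetical, and which ratio counts as too uneven is left to the operator.

```python
def leader_imbalance(leaders_per_server):
    """Return max/mean of the per-server leader shard counts.

    A result of 1.0 means a perfectly even distribution; larger values
    mean that one server leads more shards than the average server.
    """
    counts = list(leaders_per_server.values())
    if not counts or sum(counts) == 0:
        return 1.0  # nothing to compare
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical values of arangodb_shards_leader_number, one per DB-Server.
samples = {"dbserver1": 40, "dbserver2": 38, "dbserver3": 42}
print(leader_imbalance(samples))
```

In a one-shard deployment with few databases, a high ratio is expected and does not by itself call for rebalancing.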
Number of shards not replicated at all
arangodb_shards_not_replicated
Number of shards not replicated at all. This is counted for all shards for which this server is currently the leader. The number is increased by one for every shard for which no follower is in sync.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers and Agents |
Troubleshoot: Such a situation is very bad for resilience, since it indicates a single point of failure. If this number is greater than 0, action is required. During an upgrade or when a DB-Server is restarted, shard followers can temporarily be out of sync. Normally, shards get in sync on their own, so observing and waiting is a good first measure. However, if the situation persists, something is wrong, for example a recurring server crash (possibly caused by running out of memory) or another situation preventing shards from getting in sync. Contact ArangoDB customer support in this case.
Number of shards on this machine
arangodb_shards_number
Number of shards on this machine. Every shard has a leader and potentially multiple followers. This metric counts both leader and follower shards.
Introduced in: v3.8.0
Renamed from: arangodb_shards_total_count
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers |
Troubleshoot:
Since both leader and follower shards use memory and disk space,
the total number of shards should be approximately balanced
evenly across the DB-Servers. To achieve this, you can either
move shards manually, use the Rebalance shards button in the
ArangoDB web interface, or use the cluster maintenance tools
(create-move-plan and execute-move-plan specifically). In the latter
case, contact ArangoDB customer support.
Number of leader shards not fully replicated
arangodb_shards_out_of_sync
Number of leader shards not fully replicated. This is counted for all shards for which this server is currently the leader. The number is increased by one for every shard for which not all followers are in sync.
Introduced in: v3.7.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers and Agents |
Troubleshoot:
Such a situation is not good for resilience, since there are not
as many copies of the data as the replicationFactor
prescribes. If this metric has a value greater than 0, action
is required. During an upgrade or when a DB-Server is
restarted, shard followers can temporarily be out of sync.
Normally, shards get in sync on their own, so observing
and waiting is a good first measure. However, if the situation
persists, something is wrong, for example a recurring server crash
(possibly caused by running out of memory) or another situation
preventing shards from getting in sync. Contact ArangoDB customer
support in this case.
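The advice to wait first and act only if the situation persists can be expressed as a check over a series of scraped values of arangodb_shards_out_of_sync (or arangodb_shards_not_replicated). The sketch below is one possible way to do this. The function name, the sample data, and the grace period are hypothetical; no specific grace period is prescribed here.

```python
def persistently_nonzero(samples, grace_seconds):
    """Tell whether a gauge has stayed above zero for at least grace_seconds.

    samples is a list of (unix_time, value) tuples sorted by time. Only the
    uninterrupted run of non-zero values ending at the last sample counts.
    """
    if not samples or samples[-1][1] <= 0:
        return False
    last_time = samples[-1][0]
    run_start = last_time
    for timestamp, value in reversed(samples):
        if value <= 0:
            break  # the run of non-zero values started after this sample
        run_start = timestamp
    return last_time - run_start >= grace_seconds


# Hypothetical scrapes, one per minute: out of sync since t=60.
scrapes = [(0, 0), (60, 2), (120, 2), (180, 1)]
print(persistently_nonzero(scrapes, grace_seconds=100))
```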
Number of times a follower shard needed to be completely rebuilt because of too many synchronization failures
arangodb_sync_rebuilds_total
Number of times a follower shard needed to be completely rebuilt because of too many subsequent shard synchronization failures. This metric is always zero from version 3.9.3 onwards. In previous releases, a non-zero value indicates that a follower shard could not get in sync with the leader even after many attempts. When the metric got increased, the follower shard was dropped and completely rebuilt from leader data, in order to increase its chances of getting in sync.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Troubleshoot:
This number is always 0 from version 3.9.3 onwards. If it is
non-zero in previous versions, then something is wrong. Please
contact ArangoDB customer support in this case.
Number of times a follower shard synchronization attempt ran into a timeout
arangodb_sync_timeouts_total
Number of times a follower shard synchronization attempt ran into the
timeout controlled by the startup option --cluster.shard-synchronization-attempt-timeout.
Running into this timeout is not an error. The timeout simply restricts
individual shard synchronization attempts to a certain maximum runtime.
When it happens, the shard synchronization attempt is aborted by the
follower, but immediately retried afterwards. This abort-and-retry
operation allows the leader DB-Servers to purge their archived WAL files
for the aborted snapshots in a timely manner, so that long-running shard
synchronization attempts do not lead to overly long WAL file retention
periods on leaders.
Introduced in: v3.9.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers |
Number of times a revision tree for a shard was completely rebuilt because of too many subsequent failures in the shard synchronization
arangodb_sync_tree_rebuilds_total
Number of times a revision tree for a shard was completely rebuilt because of too many subsequent failures in the shard synchronization. If shards cannot get in sync after several attempts, the shard’s revision tree is first rebuilt on the follower, and then on the leader. If the value is greater than zero, it means there have been shard synchronization failures.
Introduced in: v3.10.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Number of times a mismatching shard checksum was detected when syncing shards
arangodb_sync_wrong_checksum_total
Number of times a mismatching shard checksum was detected when syncing shards. This is a very special metric which is rarely used. When followers of shards get in sync with their leaders, a final checksum is taken as an additional precaution just when everything is completed. If this checksum differs between leader and follower, the incremental resync process starts from scratch.
Introduced in: v3.8.0
Renamed from: arangodb_sync_wrong_checksum
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Troubleshoot:
Normally, this number is always 0. If it is not, then usually
something is wrong. Please contact ArangoDB customer support in this
case.
RocksDB
Collection lock acquisition time histogram
arangodb_collection_lock_acquisition_time (basename)
arangodb_collection_lock_acquisition_time_bucket
arangodb_collection_lock_acquisition_time_sum
arangodb_collection_lock_acquisition_time_count
Histogram of the collection/shard lock acquisition times. Locks are acquired for all write operations, but not for read operations. The values here are measured in seconds.
Introduced in: v3.6.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | medium | DB-Servers, Agents and Single Servers |
Troubleshoot: In case these values are considered too high, check if there are AQL queries or transactions that use exclusive locks on collections, and try to reduce them. Operations using exclusive locks may lock out other queries/transactions temporarily, which leads to an increase in lock acquisition times.
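To decide whether the lock acquisition times are too high, the average can be derived from the histogram's _sum and _count series. The following is a minimal sketch that parses a metrics payload and computes that average. The helper names are hypothetical, the sample payload is made up rather than real server output, and the parser assumes sample lines without labels and without a trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    Assumes each sample line is "<series> <value>" without a trailing
    timestamp; comment lines starting with "#" are skipped.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, raw = line.rpartition(" ")
        values[series] = float(raw)
    return values


def mean_lock_wait(values, base="arangodb_collection_lock_acquisition_time"):
    """Average lock acquisition time in seconds, or None without observations."""
    count = values.get(base + "_count", 0)
    if count == 0:
        return None
    return values[base + "_sum"] / count


# Made-up sample payload, not real server output.
SAMPLE = """\
# TYPE arangodb_collection_lock_acquisition_time histogram
arangodb_collection_lock_acquisition_time_sum 1.5
arangodb_collection_lock_acquisition_time_count 300
"""

print(mean_lock_wait(parse_metrics(SAMPLE)))
```

Because both series are cumulative since server start, the result is a lifetime average. Taking the difference of two scrapes before dividing gives the average for that interval instead.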
Number of currently active flush subscriptions
arangodb_flush_subscriptions
This metric exposes the number of currently active flush subscriptions.
Flush subscriptions can be created by arangosearch View links and by
background index creation.
Introduced in: v3.9.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of times RocksDB has entered a stalled (slowed) write state
arangodb_rocksdb_write_stalls_total
This counter reflects the number of times RocksDB was observed by ArangoDB to have entered a stalled (slowed) write state.
If the RocksDB background threads which do cleanup and compaction cannot keep up with the writing, then RocksDB first throttles its write rate (“write stall”) and later stops the writing entirely (“write stop”). Both are suboptimal, since the write rate is too high.
Introduced in: v3.8.0
Renamed from: rocksdb_write_stalls
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers, Agents and Single Servers |
Threshold: If this number grows, you are probably writing faster to ArangoDB than RocksDB can keep up with its background processing. This is OK for a while, but might eventually lead to actual write stops, which are bad since they can lead to unavailability.
Troubleshoot: Quite often, RocksDB is limited by the available I/O bandwidth. Sometimes, it is not the bandwidth itself, but the number of I/O operations per second (IOPS) which is limited. If you are in a cloud environment, IOPS are often scarce (or expensive) and you might be able to deploy more.
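Whether this counter grows can be checked by comparing two consecutive scrapes. The sketch below computes the increase per second and treats a decrease as a counter reset, on the assumption that counters start again from zero when the server restarts. The function name and the numbers are hypothetical.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second increase of a counter between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_value < prev_value:
        # Assumed reset: the counter restarted from zero after a restart.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / elapsed_seconds


# Hypothetical values of arangodb_rocksdb_write_stalls_total, 60 s apart.
print(counter_rate(prev_value=10, curr_value=16, elapsed_seconds=60))
```

A rate that stays above zero over several scrape intervals matches the "number grows" condition described in the threshold.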
Number of times RocksDB has entered a stopped write state
arangodb_rocksdb_write_stops_total
This counter reflects the number of times RocksDB was observed by ArangoDB to have entered a stopped write state.
If the RocksDB background threads which do cleanup and compaction cannot keep up with the writing, then RocksDB first throttles its write rate (“write stall”) and later stops the writing entirely (“write stop”). Both are suboptimal, since the write rate is too high, but write stops are considerably worse, since they can lead to service unavailability.
Introduced in: v3.8.0
Renamed from: rocksdb_write_stops
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | DB-Servers, Agents and Single Servers |
Threshold: If this number grows, you are probably writing a lot faster to ArangoDB than RocksDB can keep up with its background processing. This has led to actual write stops, which are bad since they can lead to unavailability. If you see this number grow, you need to act. If in doubt, contact ArangoDB support.
Troubleshoot: Quite often, RocksDB is limited by the available I/O bandwidth. Sometimes, it is not the bandwidth itself, but the number of I/O operations per second (IOPS) which is limited. If you are in a cloud environment, IOPS are often scarce (or expensive) and you might be able to deploy more.
Actual delayed RocksDB write rate
rocksdb_actual_delayed_write_rate
This metric exhibits the RocksDB metric rocksdb-actual-delayed-write-rate.
It shows the current actual delayed write rate. The value 0 means no delay.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB WAL files in the archive
rocksdb_archived_wal_files
This metric exhibits the total number of RocksDB WAL files in the “archive” subdirectory. These are WAL files that can be garbage-collected eventually, when they are not used anymore by replication, WAL tailing or other purposes.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | DB-Servers, Agents and Single Servers |
Cumulated size of RocksDB WAL files in the archive
rocksdb_archived_wal_files_size
This metric exhibits the cumulated size of RocksDB WAL files in the
“archive” subdirectory on disk. These are WAL files that can be
garbage-collected eventually, when they are not used anymore by
replication, WAL tailing or other purposes.
Introduced in: v3.9.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | medium | DB-Servers, Agents and Single Servers |
Total number of RocksDB background errors
rocksdb_background_errors
This metric exhibits the RocksDB metric background-errors. It shows
the accumulated number of background errors.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
RocksDB base level
rocksdb_base_level
This metric exhibits the RocksDB metric rocksdb-base-level.
It shows the number of the level to which L0 data is compacted.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
RocksDB block cache capacity
rocksdb_block_cache_capacity
This metric exhibits the RocksDB metric rocksdb-block-cache-capacity.
It shows the block cache capacity in bytes. This can be configured with
the --rocksdb.block-cache-size startup option.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Size of pinned RocksDB block cache entries
rocksdb_block_cache_pinned_usage
This metric exhibits the RocksDB metric rocksdb-block-cache-pinned-usage.
It shows the memory size for the RocksDB block cache for the entries
which are pinned, in bytes.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Cumulated size of RocksDB block cache entries
rocksdb_block_cache_usage
This metric exhibits the RocksDB metric rocksdb-block-cache-usage.
It shows the total memory size for the entries residing in the block cache,
in bytes.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Global current number of hash tables in ArangoDB cache
rocksdb_cache_active_tables
This metric reflects the current number of active hash tables used by the in-memory cache which sits in front of RocksDB. Active tables are used for caching index entries. There should be one active table per index per shard for each index that has in-memory caching enabled. There can also be additional active tables while an existing hash table is migrated to a larger table.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Global current memory allocation of ArangoDB in-memory caches
rocksdb_cache_allocated
This metric reflects the current global allocation for the ArangoDB
in-memory cache which sits in front of RocksDB. For example, the edge cache
counts towards this allocation. All these caches together have a
global limit which can be controlled with the --cache.size startup option.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Total number of dropped entries in automatic in-memory index cache refilling
rocksdb_cache_auto_refill_dropped_total
This metric shows the total number of entries for which no automatic refilling
happened in the in-memory index caches.
This counter increases only for insert, update, replace, and remove operations
affecting edge indexes and cache-enabled persistent indexes and with requested
automatic refilling,
if no refill operation could be queued due to capacity constraints.
A refill operation request can be rejected if the number of currently queued
refill operations exceeds the maximum value configured via the
--rocksdb.auto-refill-index-cache-queue-capacity startup option.
Correctness of index lookups is not affected if this metric is non-zero, as
it only reports the number of failed refilling attempts in the in-memory caches
of any index. These in-memory caches are optional and their fill grade does not
affect correctness.
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Troubleshoot:
If this metric keeps increasing, it indicates that the index refill background
thread can’t keep up with the incoming data modification requests.
In this case, consider increasing the background thread’s queueing capacity via
the --rocksdb.auto-refill-index-cache-queue-capacity startup option.
Increasing the capacity helps to handle bursts of requests, but does not help
if the background thread is overwhelmed by a continuous high load.
Total number of automatically refilled in-memory index cache entries
rocksdb_cache_auto_refill_loaded_total
This metric shows the total number of automatically refilled in-memory
index cache entries. Entries in the in-memory index caches are automatically
refilled for edge indexes and cache-enabled persistent indexes if an insert,
update, replace, or remove operation requests the cache refilling, or if the
--rocksdb.auto-refill-index-caches startup option is enabled.
On Agents and Coordinators, the values reported by this metric are always zero.
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Total number of in-memory index cache refill operations for a complete
index
rocksdb_cache_full_index_refills_total
This metric shows the total number of refill operations to in-memory index caches for entire edge indexes and cache-enabled persistent indexes. On DB-Servers, a full index reload can increase this metric by more than one, as counting is done per shard. On Coordinators and Agents, this metric always contains a value of zero.
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Lifetime hit rate of the ArangoDB cache in front of RocksDB
rocksdb_cache_hit_rate_lifetime
This metric reflects the lifetime hit rate of the ArangoDB in-memory
cache which is sitting in front of RocksDB. For example, the edge
cache is a part of this. The value is a ratio between 0 and 1.
“Lifetime” means here that accounting is done from the most recent
start of the arangod instance.
If the hit rate is too low, you might have too little RAM available
for the in-memory caches.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
Recent hit rate of the ArangoDB cache in front of RocksDB
rocksdb_cache_hit_rate_recent
This metric reflects the recent hit rate of the ArangoDB in-memory
cache which is sitting in front of RocksDB. For example, the edge
cache is a part of this. The value is a ratio between 0 and 1.
If the hit rate is too low, you might have too little RAM available
for the in-memory caches.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
Global allocation limit for the ArangoDB cache in front of RocksDB
rocksdb_cache_limit
This metric reflects the current global allocation limit for the
ArangoDB caches which sit in front of RocksDB. For example, the
edge cache counts towards this allocation. This global limit can
be controlled with the --cache.size startup option.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
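Since rocksdb_cache_allocated and rocksdb_cache_limit are both reported in bytes, dividing one by the other gives the share of the configured cache budget that is currently in use. The following is a minimal sketch of that calculation. The function name and the sample values are hypothetical.

```python
def cache_utilization(allocated_bytes, limit_bytes):
    """Fraction of the global cache limit currently allocated, or None
    if no positive limit is reported."""
    if limit_bytes <= 0:
        return None
    return allocated_bytes / limit_bytes


# Hypothetical scrape: 192 MiB allocated out of a 256 MiB limit.
allocated = 192 * 1024 ** 2  # rocksdb_cache_allocated
limit = 256 * 1024 ** 2      # rocksdb_cache_limit
print(cache_utilization(allocated, limit))
```

A utilization close to 1 together with a low hit rate is consistent with the caches having too little RAM, in which case raising --cache.size may be worth considering.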
Global peak memory allocation of ArangoDB in-memory caches
rocksdb_cache_peak_allocated
This metric reflects the peak global allocation for the ArangoDB
in-memory cache which sits in front of RocksDB. It records the peak value
of the metric rocksdb_cache_allocated during the lifetime of the
arangod instance.
Introduced in: v3.10.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Global current memory allocation of inactive/reserve hash tables in ArangoDB cache
rocksdb_cache_unused_memory
This metric reflects the current memory allocation for unused hash tables used by the in-memory cache which sits in front of RocksDB. Unused tables can be kept as backups to provide new, readily initialized tables for new caches. The overall memory usage of unused tables is capped by the system, so it does not grow overly large.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Global current number of inactive/reserve hash tables in ArangoDB cache
rocksdb_cache_unused_tables
This metric reflects the current number of unused hash tables used by the in-memory cache which sits in front of RocksDB. Unused tables can be kept as backups to provide new, readily initialized tables for new caches. Unused tables can consume some memory, but the overall memory usage of unused tables is capped.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB column families with pending compaction
rocksdb_compaction_pending
This metric exhibits the RocksDB metric compaction-pending.
It shows the number of column families for which at least one compaction
is pending.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 0
rocksdb_compression_ratio_at_level0
This metric exhibits the compression ratio of data at level 0 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 0.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 1
rocksdb_compression_ratio_at_level1
This metric exhibits the compression ratio of data at level 1 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 1.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 2
rocksdb_compression_ratio_at_level2
This metric exhibits the compression ratio of data at level 2 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 2.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 3
rocksdb_compression_ratio_at_level3
This metric exhibits the compression ratio of data at level 3 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 3.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 4
rocksdb_compression_ratio_at_level4
This metric exhibits the compression ratio of data at level 4 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 4.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 5
rocksdb_compression_ratio_at_level5
This metric exhibits the compression ratio of data at level 5 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 5.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
RocksDB compression ratio at level 6
rocksdb_compression_ratio_at_level6
This metric exhibits the compression ratio of data at level 6 in RocksDB’s
log structured merge tree. Here, compression
ratio is defined as uncompressed data size / compressed file size.
Returns -1.0 if there are no open files at level 6.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ratio | advanced | DB-Servers, Agents and Single Servers |
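When evaluating the seven rocksdb_compression_ratio_at_level metrics, the -1.0 value has to be treated as "no data" rather than as a ratio. The sketch below filters it out and converts each remaining ratio into the fraction of space saved, which follows from the definition uncompressed data size / compressed file size. The function names and the sample values are hypothetical.

```python
def valid_compression_ratios(metrics):
    """Return {level: ratio} for levels 0..6, skipping missing values and
    the -1.0 marker that means there are no open files at that level."""
    ratios = {}
    for level in range(7):
        value = metrics.get(f"rocksdb_compression_ratio_at_level{level}")
        if value is None or value == -1.0:
            continue
        ratios[level] = value
    return ratios


def space_saved(ratio):
    """Fraction of space saved for ratio = uncompressed size / compressed size."""
    return 1.0 - 1.0 / ratio


# Hypothetical scrape: level 0 has no open files, levels 1 and 6 hold data.
scrape = {
    "rocksdb_compression_ratio_at_level0": -1.0,
    "rocksdb_compression_ratio_at_level1": 2.0,
    "rocksdb_compression_ratio_at_level6": 4.0,
}
for level, ratio in valid_compression_ratios(scrape).items():
    print(level, ratio, space_saved(ratio))
```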
Approximate size of RocksDB’s active memtable
rocksdb_cur_size_active_mem_table
This metric exhibits the RocksDB metric rocksdb-cur-size-active-mem-table.
It shows the approximate size of the active memtable in bytes, summed
over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Approximate size of all active and unflushed RocksDB memtables
rocksdb_cur_size_all_mem_tables
This metric exhibits the RocksDB metric rocksdb-cur-size-all-mem-tables.
It shows the approximate size of active and unflushed immutable memtables
in bytes, summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Current rate of the RocksDB throttle in bytes per second
rocksdb_engine_throttle_bps
This metric exposes the current write rate limit of the ArangoDB
RocksDB throttle. The throttle limits the write rate to allow
RocksDB’s background threads to catch up with compactions and not
fall behind too much, since this would in the end lead to nasty
write stops in RocksDB and incur considerable delays. If 0 is
shown, no throttling happens. Otherwise, you see the current
write rate limit in bytes per second. Also see the --rocksdb.*
startup options.
Introduced in: v3.8.0
Renamed from: rocksdbengine_throttle_bps
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes per second | advanced | DB-Servers, Agents and Single Servers |
Estimated amount of live RocksDB data
rocksdb_estimate_live_data_size
This metric exhibits the RocksDB metric rocksdb-estimate-live-data-size.
It shows an estimate of the amount of live data in bytes, summed over
all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Estimated number of RocksDB keys
rocksdb_estimate_num_keys
This metric exhibits the RocksDB metric rocksdb-estimate-num-keys.
It shows the estimated number of total keys in the active and unflushed
immutable memtables and storage, summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Estimated number of bytes awaiting RocksDB compaction
rocksdb_estimate_pending_compaction_bytes
This metric exhibits the RocksDB metric
rocksdb-estimate-pending-compaction-bytes.
It shows the estimated total number of bytes that compaction needs to
rewrite to get all levels down to under their target size. It is only
valid for level-based compaction. This value is summed over all
column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Estimated memory usage for reading RocksDB SST tables
rocksdb_estimate_table_readers_mem
This metric exhibits the RocksDB metric
rocksdb-estimate-table-readers-mem.
It shows the estimated memory used for reading SST tables, excluding
memory used in block cache (e.g. filter and index blocks), summed over
all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Free disk space in bytes on volume used by RocksDB
rocksdb_free_disk_space
This metric shows the currently free disk space in bytes on the volume which is used by RocksDB. Since RocksDB does not like out of disk space scenarios, please make sure that there is enough free disk space available at all times! Note that this metric is only available/populated on some platforms.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Number of free inodes on the volume used by RocksDB
rocksdb_free_inodes
This metric shows the number of currently free inodes on the disk volume used by RocksDB. Since RocksDB does not cope well with running out of inodes, please make sure that there are enough free inodes available at all times! Note that this metric is only available/populated on some platforms.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Whether RocksDB file deletion is enabled
rocksdb_is_file_deletions_enabled
This metric exhibits the RocksDB metric rocksdb-is-file-deletions-enabled.
It shows 0 if deletion of obsolete files is enabled, and otherwise,
it shows a non-zero number. Note that for ArangoDB, this is supposed
to always return 1, since the deletion of obsolete WAL files is done
from ArangoDB, externally to RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Whether RocksDB writes are stopped
rocksdb_is_write_stopped
This metric exhibits the RocksDB metric rocksdb-is-write-stopped.
It shows 1 if writing to RocksDB has been stopped, and otherwise 0.
If 1 is shown, this usually means that there are too many uncompacted
files and the RocksDB background threads have not managed to keep up
with their compaction work. This situation should be avoided, since
nasty delays in database operations are incurred. If in doubt,
contact ArangoDB support.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Size of live RocksDB SST files
rocksdb_live_sst_files_size
This metric exhibits the RocksDB metric rocksdb-live-sst-files-size.
It shows the total size in bytes of all SST files belonging to the latest
LSM tree, summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Number of live RocksDB WAL files
rocksdb_live_wal_files
This metric exhibits the total number of live RocksDB WAL files. These are WAL files that cannot be garbage-collected until they are moved over to the archive.
Introduced in: v3.9.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | DB-Servers, Agents and Single Servers |
Cumulated size of live RocksDB WAL files
rocksdb_live_wal_files_size
This metric exhibits the cumulated size of live RocksDB WAL files on disk. These are WAL files that cannot be garbage-collected until they are moved over to the archive.
Introduced in: v3.9.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | medium | DB-Servers, Agents and Single Servers |
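Because live WAL files are moved over to the archive before they can be garbage-collected, the live and archived files are separate sets, and adding rocksdb_live_wal_files_size to rocksdb_archived_wal_files_size gives an estimate of the total WAL footprint on disk. The following is a minimal sketch based on that reading. The function name and the sample values are hypothetical.

```python
def wal_disk_usage(metrics):
    """Summarize WAL disk usage from the live and archived size gauges."""
    live = metrics.get("rocksdb_live_wal_files_size", 0)
    archived = metrics.get("rocksdb_archived_wal_files_size", 0)
    total = live + archived
    # Share of the WAL footprint that is already eligible for eventual
    # garbage collection (archived); None if there is no WAL data at all.
    archived_share = archived / total if total > 0 else None
    return {"total_bytes": total, "archived_share": archived_share}


# Hypothetical scrape: 64 MiB live, 192 MiB archived.
scrape = {
    "rocksdb_live_wal_files_size": 64 * 1024 ** 2,
    "rocksdb_archived_wal_files_size": 192 * 1024 ** 2,
}
print(wal_disk_usage(scrape))
```

A large and growing archived share can indicate that replication, WAL tailing, or another consumer still holds on to old WAL files.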
Number of RocksDB column families awaiting memtable flush
rocksdb_mem_table_flush_pending
This metric exhibits the RocksDB metric mem-table-flush-pending. It
shows the number of column families for which a memtable flush is
pending.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Minimum number of RocksDB log files to keep
rocksdb_min_log_number_to_keep
This metric exhibits the RocksDB metric rocksdb-min-log-number-to-keep.
It shows the minimum log number of the log files that should be kept.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of deletes in active RocksDB memtable
rocksdb_num_deletes_active_mem_table
This metric exhibits the RocksDB metric
rocksdb-num-deletes-active-mem-table.
It shows the total number of delete entries in the active memtable,
summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of deletes in unflushed immutable RocksDB memtables
rocksdb_num_deletes_imm_mem_tables
This metric exhibits the RocksDB metric
rocksdb-num-deletes-imm-mem-tables.
It shows the total number of delete entries in the unflushed immutable
memtables, summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of entries in the active RocksDB memtable
rocksdb_num_entries_active_mem_table
This metric exhibits the RocksDB metric
rocksdb-num-entries-active-mem-table.
It shows the total number of entries in the active memtable,
summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of entries in unflushed immutable RocksDB memtables
rocksdb_num_entries_imm_mem_tables
This metric exhibits the RocksDB metric
rocksdb-num-entries-imm-mem-tables.
It shows the total number of entries in the unflushed immutable memtables,
summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 0
rocksdb_num_files_at_level0
This metric reports the number of files at level 0 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 1
rocksdb_num_files_at_level1
This metric reports the number of files at level 1 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 2
rocksdb_num_files_at_level2
This metric reports the number of files at level 2 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 3
rocksdb_num_files_at_level3
This metric reports the number of files at level 3 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 4
rocksdb_num_files_at_level4
This metric reports the number of files at level 4 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 5
rocksdb_num_files_at_level5
This metric reports the number of files at level 5 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of RocksDB files at level 6
rocksdb_num_files_at_level6
This metric reports the number of files at level 6 in the log structured merge tree of RocksDB.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of unflushed immutable RocksDB memtables
rocksdb_num_immutable_mem_table
This metric exhibits the RocksDB metric num-immutable-mem-table,
which shows the number of immutable memtables that have not yet been
flushed. This value is the sum over all column families.
Memtables are sorted tables of key/value pairs which begin to be built up in memory. At some stage they are closed and become immutable, and some time later they are flushed to disk.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of flushed immutable RocksDB memtables
rocksdb_num_immutable_mem_table_flushed
This metric exhibits the RocksDB metric num-immutable-mem-table-flushed,
which shows the number of immutable memtables that have already been
flushed. This value is the sum over all column families.
Memtables are sorted tables of key/value pairs which begin to be built up in memory. At some stage they are closed and become immutable, and some time later they are flushed to disk.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of live RocksDB versions
rocksdb_num_live_versions
This metric exhibits the RocksDB metric rocksdb-num-live-versions.
It shows the number of live versions. Version is an internal data
structure. See version_set.h in the RocksDB source for details. More
live versions often mean that more SST files are kept from being deleted
by iterators or unfinished compactions. The value is the sum over all
column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of running RocksDB compactions
rocksdb_num_running_compactions
This metric exhibits the RocksDB metric rocksdb-num-running-compactions.
It shows the number of currently running compactions.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of running RocksDB flushes
rocksdb_num_running_flushes
This metric exhibits the RocksDB metric rocksdb-num-running-flushes.
It shows the number of currently running flushes.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Number of unreleased RocksDB snapshots
rocksdb_num_snapshots
This metric exhibits the RocksDB metric rocksdb-num-snapshots.
It shows the number of unreleased snapshots of the database.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Timestamp of oldest unreleased RocksDB snapshot
rocksdb_oldest_snapshot_time
This metric exhibits the RocksDB metric rocksdb-oldest-snapshot-time.
It shows a number representing the Unix timestamp of the oldest
unreleased snapshot.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
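Because the value is a Unix timestamp, it is usually easier to alert on the age of the oldest snapshot than on the raw number. The following is a minimal sketch, not part of ArangoDB; the function name is hypothetical, and treating a value of 0 as "no unreleased snapshot" is an assumption.

```python
import time


def oldest_snapshot_age(oldest_snapshot_time, now=None):
    """Return the age in seconds of the oldest unreleased snapshot.

    Returns None when the metric is 0 or negative, which this sketch
    assumes to mean that no snapshot is currently held.
    """
    if oldest_snapshot_time <= 0:
        return None
    if now is None:
        now = time.time()
    # Clamp to zero in case of small clock differences between hosts.
    return max(0.0, now - oldest_snapshot_time)


# Example with fixed values so the result does not depend on the clock.
print(oldest_snapshot_age(1_700_000_000, now=1_700_000_090))
```

A steadily growing age suggests that something is holding a snapshot open for a long time.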
Number of prunable RocksDB WAL files in the archive
rocksdb_prunable_wal_files
This metric exhibits the total number of RocksDB WAL files in the “archive” subdirectory that can be pruned. These are WAL files that can be pruned by a background thread to reclaim disk space.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | DB-Servers, Agents and Single Servers |
Whether RocksDB is in read-only mode due to a background error
rocksdb_read_only
This metric indicates whether RocksDB currently is in read-only
mode, due to a background error. If RocksDB is in read-only mode,
this metric has a value of 1. When in read-only mode, all
writes into RocksDB fail. When RocksDB is in normal operations
mode, this metric has a value of 0.
Introduced in: v3.8.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | DB-Servers, Agents and Single Servers |
Troubleshoot: If this value is non-zero, all write operations in RocksDB fail until the RocksDB background error is resolved. Inspect the arangod server log file first, as it should show more details about the exact errors. RocksDB can set a background error when an I/O operation fails. This is often caused by a lack of disk space, so freeing disk space or increasing the disk capacity often helps. Under some conditions, RocksDB can automatically recover from the background error and return to normal operation. However, if the background error happens during certain RocksDB operations, it cannot resume automatically, and the instance needs a manual restart after the error condition is removed.
Approximate size of all RocksDB memtables
rocksdb_size_all_mem_tables
This metric exhibits the RocksDB metric rocksdb-size-all-mem-tables.
It shows the approximate size of all active, unflushed immutable, and
pinned immutable memtables in bytes, summed over all column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Used disk space in bytes on volume used by RocksDB
rocksdb_total_disk_space
This metric shows the currently used disk space in bytes on the volume which is used by RocksDB. RocksDB does not handle out-of-disk-space situations well, so make sure that enough free disk space is available at all times. Note that this metric is only populated on some platforms.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Number of used inodes on the volume used by RocksDB
rocksdb_total_inodes
This metric shows the currently used number of inodes on the disk volume used by RocksDB. RocksDB does not handle out-of-disk-space situations well, and exhausting the inodes prevents new files from being created, so make sure that enough free inodes are available at all times. Note that this metric is only populated on some platforms.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Total number of RocksDB SST files, aggregated over all levels
rocksdb_total_sst_files
This metric reports the number of SST files of the log structured merge tree of RocksDB, aggregated over all levels.
Introduced in: v3.11.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
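On versions that do not expose rocksdb_total_sst_files, a similar figure can be derived from the per-level gauges. The sketch below parses Prometheus text output and sums rocksdb_num_files_at_levelN. The sample text is made up for illustration, and it is an assumption that the per-level sum matches the aggregated metric.

```python
import re

# Matches sample lines such as "rocksdb_num_files_at_level0 3", with or
# without a label set in curly braces. Comment lines (# HELP, # TYPE) do
# not match because the pattern is anchored at the metric name.
_LEVEL_LINE = re.compile(
    r"^rocksdb_num_files_at_level(\d+)(?:\{[^}]*\})?\s+(\S+)"
)


def files_per_level(metrics_text):
    """Return {level: file count} parsed from Prometheus text output."""
    levels = {}
    for line in metrics_text.splitlines():
        match = _LEVEL_LINE.match(line)
        if match:
            levels[int(match.group(1))] = int(float(match.group(2)))
    return levels


# Hypothetical excerpt of a metrics response.
sample = """\
# HELP rocksdb_num_files_at_level0 Number of RocksDB files at level 0
# TYPE rocksdb_num_files_at_level0 gauge
rocksdb_num_files_at_level0 3
rocksdb_num_files_at_level1 0
rocksdb_num_files_at_level6 12
"""

levels = files_per_level(sample)
print(levels, sum(levels.values()))
```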
Size of all RocksDB SST files
rocksdb_total_sst_files_size
This metric exhibits the RocksDB metric rocksdb-total-sst-files-size.
It shows the total size in bytes of all SST files, summed over all
column families.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Whether or not the pruning of archived RocksDB WAL files is currently activated
rocksdb_wal_pruning_active
This metric contains a value of 0 if the pruning of archived RocksDB WAL
files is not activated, and 1 if it is activated.
WAL file pruning is normally deactivated for the first few minutes after
an instance is started, so that other instances in the cluster can start
replicating from the instance before all archived WAL files are deleted.
The value should flip from 0 to 1 a few minutes after server start.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | DB-Servers, Agents and Single Servers |
Lower bound sequence number from which WAL files need to be kept because of external flushing needs
rocksdb_wal_released_tick_flush
This metric exposes the RocksDB WAL sequence number from which onwards
WAL files have to be kept because the WAL data could be used by external
flushing needs. WAL files with sequence numbers higher than this value
are not garbage-collected.
The candidates that can keep WAL files from being garbage-collected are
arangosearch View links or inverted indexes that are still syncing data,
and background index creation.
Introduced in: v3.9.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Lower bound sequence number from which WAL files need to be kept because of replication
rocksdb_wal_released_tick_replication
This metric exposes the RocksDB WAL sequence number from which onwards WAL files have to be kept in the archive because the WAL data could be used by replication. WAL files with sequence numbers higher than this value are not garbage-collected.
Introduced in: v3.9.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Current RocksDB WAL sequence number
rocksdb_wal_sequence
This metric exposes the current RocksDB WAL sequence number. Any write operation into the database increases the sequence number.
Introduced in: v3.8.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
RocksDB sequence number until which the background sync thread has caught up
rocksdb_wal_sequence_lower_bound
This metric exposes the RocksDB WAL sequence number up to which the ArangoDB background sync thread has fully caught up. The value exposed here should increase monotonically and always progress while write operations are executing.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | advanced | DB-Servers, Agents and Single Servers |
Scheduler
Total number of REST handler tasks created for the scheduler
arangodb_scheduler_handler_tasks_created_total
Total number of REST handler tasks that were created for execution via the scheduler. This counter is increased for each incoming request for which a REST handler mapping exists and that does not need to be forwarded to another Coordinator in the cluster.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Current queue length of the high priority queue in the scheduler
arangodb_scheduler_high_prio_queue_length
The number of jobs currently queued on the scheduler’s high priority queue.
The capacity of the high priority queue can be configured via the startup
option --server.prio1-size.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Total number of jobs dequeued
arangodb_scheduler_jobs_dequeued_total
The total number of jobs dequeued from all scheduler queues. Calculating the difference between arangodb_scheduler_jobs_submitted_total and arangodb_scheduler_jobs_dequeued_total gives the total number of currently queued jobs. Calculating the difference between arangodb_scheduler_jobs_dequeued_total and arangodb_scheduler_jobs_done_total gives the number of jobs currently being processed.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
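The two differences described above can be computed from a single scrape of the three counters. This is a small illustrative sketch; the class and field names are hypothetical, and the values are assumed to come from the same metrics response.

```python
from dataclasses import dataclass


@dataclass
class SchedulerCounters:
    """Values of the three scheduler job counters from one scrape."""

    submitted: int  # arangodb_scheduler_jobs_submitted_total
    dequeued: int   # arangodb_scheduler_jobs_dequeued_total
    done: int       # arangodb_scheduler_jobs_done_total

    @property
    def queued(self):
        # Jobs submitted but not yet taken off a queue.
        return self.submitted - self.dequeued

    @property
    def in_progress(self):
        # Jobs taken off a queue but not yet finished.
        return self.dequeued - self.done


counters = SchedulerCounters(submitted=1000, dequeued=990, done=985)
print(counters.queued, counters.in_progress)
```

Since the counters are updated independently, the derived values may be slightly off at the moment of the scrape.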
Total number of queue jobs done
arangodb_scheduler_jobs_done_total
The total number of queue jobs done. Calculating the difference between arangodb_scheduler_jobs_dequeued_total and arangodb_scheduler_jobs_done_total gives the total number of jobs currently being processed.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Total number of jobs submitted to the scheduler
arangodb_scheduler_jobs_submitted_total
Total number of jobs submitted to the scheduler. Calculating the difference between arangodb_scheduler_jobs_submitted_total and arangodb_scheduler_jobs_dequeued_total gives the total number of currently queued jobs.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Last recorded dequeue time for a low priority queue item
arangodb_scheduler_low_prio_queue_last_dequeue_time
Last recorded dequeue time for a low priority queue item, i.e., the amount of time the job was sitting in the queue. If there is nothing to do for a long time, this metric is reset to zero. A large value for this metric indicates that the server is under heavy load and low priority jobs cannot be dequeued in a timely manner.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | ms | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: Normally this time should be clearly sub-second.
Troubleshoot: If you see larger values here, in particular over a longer period of time, you should consider reducing the load of the server (if possible), scaling up (bigger machine) or scaling out (more Coordinators). Otherwise requests cannot be processed in a timely manner and you run the risk that the queue becomes full and requests are declined.
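The "over a longer period of time" condition can be expressed as a check over the most recent samples instead of a single reading. The following sketch assumes a 1000 ms threshold (derived from the "sub-second" guidance and the ms unit) and a window of five samples; both defaults and the function name are illustrative choices, not ArangoDB recommendations.

```python
def sustained_high_dequeue_time(samples_ms, threshold_ms=1000.0, min_samples=5):
    """Return True if the last `min_samples` readings all exceed the threshold.

    `samples_ms` holds successive values of
    arangodb_scheduler_low_prio_queue_last_dequeue_time, oldest first.
    """
    recent = list(samples_ms)[-min_samples:]
    if len(recent) < min_samples:
        return False  # Not enough data to call it sustained.
    return all(value > threshold_ms for value in recent)


print(sustained_high_dequeue_time([20, 1500, 1800, 2100, 1900, 2500]))
```

A single spike does not trigger the check, and neither does a value that was reset to zero during an idle period.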
Current queue length of the low priority queue in the scheduler
arangodb_scheduler_low_prio_queue_length
The number of jobs currently queued on the scheduler’s low priority queue.
The capacity of the low priority queue can be configured via the startup
option --server.maximal-queue-size.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Current queue length of the maintenance priority queue in the scheduler
arangodb_scheduler_maintenance_prio_queue_length
The number of jobs currently queued on the scheduler’s maintenance priority
queue. These are the jobs with the highest priority and are mainly used for
cluster internal operations. The capacity of the maintenance priority queue
can be configured via the startup option --server.scheduler-queue-size.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Current queue length of the medium priority queue in the scheduler
arangodb_scheduler_medium_prio_queue_length
The number of jobs currently queued on the scheduler’s medium priority queue.
The capacity of the medium priority queue can be configured via the startup
option --server.prio2-size.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Number of awake worker threads
arangodb_scheduler_num_awake_threads
The number of worker threads currently working on some job or spinning while waiting for new work (i.e., not sleeping).
Introduced in: v3.8.0
Renamed from: arangodb_scheduler_awake_threads
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Current number of detached worker threads
arangodb_scheduler_num_detached_threads
The number of worker threads currently started and detached from the scheduler. Worker threads which perform potentially long running tasks like waiting for a lock can detach themselves from the scheduler to allow new scheduler threads to be started and avoid server blockage.
Introduced in: v3.11.5
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Current number of worker threads
arangodb_scheduler_num_worker_threads
The number of worker threads currently started. This includes detached worker threads. Worker threads can be started and stopped dynamically based on the server load.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Current number of working threads
arangodb_scheduler_num_working_threads
The current number of threads actually working on some job (i.e., not spinning while waiting for new work).
Introduced in: v3.6.10
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total number of ongoing RestHandlers coming from the low priority queue
arangodb_scheduler_ongoing_low_prio
Total number of low priority jobs currently being processed.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of tasks dropped and not added to internal queue
arangodb_scheduler_queue_full_failures_total
Number of tasks dropped because the queue was already full. The queue capacities
can be configured via the startup options --server.scheduler-queue-size,
--server.prio1-size, --server.prio2-size and --server.maximal-queue-size.
Introduced in: v3.8.0
Renamed from: arangodb_scheduler_queue_full_failures
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Server’s internal queue length
arangodb_scheduler_queue_length
The total number of currently queued jobs in all queues.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of tasks/requests dropped and not added to internal queue due to the client-specified queue time requirements not being satisfiable
arangodb_scheduler_queue_time_violations_total
Number of tasks/requests dropped because the client-specified queue time
requirements, as indicated by client applications in the
x-arango-queue-time-seconds HTTP request header, could not be satisfied by
the receiving server instance. This happens when the actual time needed to
queue/dequeue requests on the scheduler queue exceeds the maximum time value
that the client has specified in the request.
Whenever this happens, the client application gets an HTTP 412 error
response back with error code 21004 (“queue time violated”).
Although the metric is exposed on all instance types, it is likely
always 0 on DB-Servers, simply because Coordinators do not forward the
x-arango-queue-time-seconds header when they send internal requests to DB-Servers.
Introduced in: v3.9.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
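From the client side, the mechanism can be sketched as follows: attach the header to a request and recognize the rejection when it comes back. The helper names and the example URL are hypothetical. Reading the error code from an errorNum field in a JSON body is an assumption about the usual shape of ArangoDB error responses.

```python
import json
import urllib.request

QUEUE_TIME_HEADER = "x-arango-queue-time-seconds"


def build_request(url, max_queue_seconds):
    """Create a request that states the maximum acceptable queue time."""
    request = urllib.request.Request(url)
    request.add_header(QUEUE_TIME_HEADER, str(max_queue_seconds))
    return request


def is_queue_time_violation(status, body):
    """Return True for an HTTP 412 response carrying error code 21004."""
    if status != 412:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # Body is not JSON, so the error code is unknown.
    return isinstance(payload, dict) and payload.get("errorNum") == 21004


# The request is only constructed here, not sent.
request = build_request("http://localhost:8529/_api/version", 1.5)
print(request.header_items())
```

A client that sees such a violation would typically back off or retry against another Coordinator.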
Total accumulated number of scheduler threads started
arangodb_scheduler_threads_started_total
Total accumulated number of scheduler threads started. Worker threads can be started and stopped dynamically based on the server load.
Introduced in: v3.8.0
Renamed from: arangodb_scheduler_threads_started
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Accumulated total number of scheduler threads stopped
arangodb_scheduler_threads_stopped_total
Total accumulated number of scheduler threads stopped. Worker threads can be started and stopped dynamically based on the server load.
Introduced in: v3.8.0
Renamed from: arangodb_scheduler_threads_stopped
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Statistics
Bytes received for requests
arangodb_client_connection_statistics_bytes_received (basename)
arangodb_client_connection_statistics_bytes_received_bucket
arangodb_client_connection_statistics_bytes_received_sum
arangodb_client_connection_statistics_bytes_received_count
Histogram of the received request sizes in bytes.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
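The _sum and _count series of a histogram allow computing an average over a scrape interval, here the mean request size. This is a generic sketch for Prometheus-style histograms; the function name and the sample numbers are made up.

```python
def mean_over_interval(sum_before, count_before, sum_after, count_after):
    """Mean observation between two scrapes of a histogram's _sum and _count.

    Returns None if no new observations were recorded, or if the counters
    went backwards (for example after a process restart).
    """
    delta_count = count_after - count_before
    delta_sum = sum_after - sum_before
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count


# 10 requests totalling 3000 bytes arrived between the two scrapes.
print(mean_over_interval(1000.0, 10, 4000.0, 20))
```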
Bytes sent for responses
arangodb_client_connection_statistics_bytes_sent (basename)
arangodb_client_connection_statistics_bytes_sent_bucket
arangodb_client_connection_statistics_bytes_sent_sum
arangodb_client_connection_statistics_bytes_sent_count
Histogram of the sent response sizes in bytes.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
The number of client connections that are currently open
arangodb_client_connection_statistics_client_connections
The number of client connections that are currently open. Note: this metric considers only HTTP and HTTP/2 connections, but not VST connections.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total connection time of a client
arangodb_client_connection_statistics_connection_time (basename)
arangodb_client_connection_statistics_connection_time_bucket
arangodb_client_connection_statistics_connection_time_sum
arangodb_client_connection_statistics_connection_time_count
Histogram of the connection’s total lifetime, i.e., the time from the point when the connection was established until it was closed. Smaller numbers indicate that there is not a lot of load and/or that connections are not reused for multiple requests. Consider using the keep-alive header, HTTP/2, or VST.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | advanced | Coordinators, DB-Servers, Agents and Single Servers |
I/O time needed to answer a request
arangodb_client_connection_statistics_io_time (basename)
arangodb_client_connection_statistics_io_time_bucket
arangodb_client_connection_statistics_io_time_sum
arangodb_client_connection_statistics_io_time_count
Histogram of I/O times needed to answer a request. This includes the time required to read the incoming request as well as the time required to send the response.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Queueing time needed for requests
arangodb_client_connection_statistics_queue_time (basename)
arangodb_client_connection_statistics_queue_time_bucket
arangodb_client_connection_statistics_queue_time_sum
arangodb_client_connection_statistics_queue_time_count
Histogram of the time requests are spending on a queue waiting to be processed. The overwhelming majority of these times should be clearly sub-second.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
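Whether "the overwhelming majority" of queue times is sub-second can be estimated from the cumulative _bucket series. The sketch below assumes the buckets have already been read into a dictionary keyed by their le upper bound. The bucket boundaries shown are invented for illustration and may differ from what arangod exposes.

```python
import math


def fraction_at_or_below(buckets, bound):
    """Fraction of observations in buckets whose upper bound is <= `bound`.

    `buckets` maps each `le` upper bound (math.inf for "+Inf") to its
    cumulative count. If `bound` is not a bucket boundary, the result is
    a lower estimate based on the next smaller boundary.
    """
    total = buckets[math.inf]
    if total == 0:
        return None  # No observations recorded yet.
    eligible = [le for le in buckets if le <= bound]
    if not eligible:
        return 0.0
    return buckets[max(eligible)] / total


# Hypothetical cumulative bucket counts, in seconds.
queue_time = {0.01: 900, 0.1: 950, 1.0: 990, 10.0: 999, math.inf: 1000}
print(fraction_at_or_below(queue_time, 1.0))
```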
Request time needed to answer a request
arangodb_client_connection_statistics_request_time (basename)
arangodb_client_connection_statistics_request_time_bucket
arangodb_client_connection_statistics_request_time_sum
arangodb_client_connection_statistics_request_time_count
Histogram of the time required to actually process a request. This does not include the time required to read the incoming request, the time the request is sitting on the queue, or the time required to send the response.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total time needed to answer a request
arangodb_client_connection_statistics_total_time (basename)
arangodb_client_connection_statistics_total_time_bucket
arangodb_client_connection_statistics_total_time_sum
arangodb_client_connection_statistics_total_time_count
Histogram of the total times required to process a request. This includes the time required to read the incoming request, the time the request is sitting in the queue, the time to actually process the request, and the time required to send the response.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Bytes received for requests, only user traffic
arangodb_client_user_connection_statistics_bytes_received (basename)
arangodb_client_user_connection_statistics_bytes_received_bucket
arangodb_client_user_connection_statistics_bytes_received_sum
arangodb_client_user_connection_statistics_bytes_received_count
Histogram of the received request sizes in bytes, only considering user traffic, i.e. traffic authenticated with a real user (or role).
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Bytes sent for responses, only user traffic
arangodb_client_user_connection_statistics_bytes_sent (basename)
arangodb_client_user_connection_statistics_bytes_sent_bucket
arangodb_client_user_connection_statistics_bytes_sent_sum
arangodb_client_user_connection_statistics_bytes_sent_count
Histogram of the sent response sizes in bytes, only considering user traffic, i.e. traffic authenticated with a real user (or role).
Introduced in: v3.10.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total memory usage of connection statistics
arangodb_connection_statistics_memory_usage
Total memory usage of connection statistics.
If the startup option --server.statistics is set to true, then some
connection statistics are built up in memory for every connection that is
made to the arangod server.
It is expected that the memory usage reported by this metric remains
relatively constant over time. It should grow only when there are bursts of
new connections.
Some memory is pre-allocated at startup for higher efficiency.
No memory is allocated for connection statistics if the startup option
is set to false.
Introduced in: v3.11.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Number of file descriptors currently opened by the arangod process
arangodb_file_descriptors_current
Number of file descriptors currently opened by the arangod process.
This includes regular files as well as sockets.
The metric is only available on platforms that support it.
As counting the number of file descriptors can be expensive and can have
an impact on performance, the metric is only updated at a configurable
interval. The interval can be adjusted via the startup option
--server.count-descriptors-interval. Shorter intervals mean more up-to-date
count values, at the potential expense of more I/O operations that are
required for counting.
It is possible to turn off counting of file descriptors by setting the
interval to 0. In this case, the metric reports a value of 0.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Agents, Coordinators, DB-Servers and Single Servers |
System limit for the number of open file descriptors for the arangod process
arangodb_file_descriptors_limit
System limit for the number of open file descriptors for the arangod process. It is only available on platforms that support it. The limit is imposed by the operating system or system configuration. The value of this metric will remain constant for the entire lifetime of the process.
Introduced in: v3.11.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | medium | Agents, Coordinators, DB-Servers and Single Servers |
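The two file descriptor metrics are most useful in combination, as a utilization ratio. A minimal sketch follows; the function name is hypothetical. It treats a current value of 0 as "counting disabled or unsupported", in line with the description of arangodb_file_descriptors_current.

```python
def fd_utilization(current, limit):
    """Return open file descriptors as a fraction of the limit.

    Returns None if either value is 0 or negative, since a current value
    of 0 is reported when counting is turned off.
    """
    if current <= 0 or limit <= 0:
        return None
    return current / limit


print(fd_utilization(512, 1024))
```

Which ratio warrants an alert is a deployment decision and is not specified here.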
Number of asynchronously executed HTTP requests
arangodb_http_request_statistics_async_requests_total
This counter reflects the total number of asynchronous HTTP (or VST)
requests which have hit this particular instance of arangod. Asynchronous
refers to the fact that the result is not sent with the HTTP response,
but is rather queried separately using the /_api/jobs API.
Note that this counter keeps growing during the lifetime of the arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_async_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
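Because the HTTP request counters only grow and restart from zero with the process, they are usually read as a rate. The following is a minimal sketch of such a calculation over a window of scraped samples; the function name `per_second_rate` and the sample values are illustrative assumptions, and treating any drop in value as a restart is a simplification.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, counter_value) samples.

    A drop in the counter value is treated as a process restart, in which
    case the new value is counted as the increase since the restart.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# One-minute window with a scrape every 30 seconds (made-up values).
steady = [(0, 100), (30, 160), (60, 220)]
restarted = [(0, 100), (30, 160), (60, 60)]  # counter reset between scrapes

print(per_second_rate(steady))
print(per_second_rate(restarted))
```

The same calculation applies to each of the per-method request counters.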
Number of HTTP DELETE requests
arangodb_http_request_statistics_http_delete_requests_total
This counter reflects the total number of HTTP (or VST) DELETE
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_delete_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP GET requests
arangodb_http_request_statistics_http_get_requests_total
This counter reflects the total number of HTTP (or VST) GET
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_get_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP HEAD requests
arangodb_http_request_statistics_http_head_requests_total
This counter reflects the total number of HTTP (or VST) HEAD
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_head_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP OPTIONS requests
arangodb_http_request_statistics_http_options_requests_total
This counter reflects the total number of HTTP (or VST) OPTIONS
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_options_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP PATCH requests
arangodb_http_request_statistics_http_patch_requests_total
This counter reflects the total number of HTTP (or VST) PATCH
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_patch_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP POST requests
arangodb_http_request_statistics_http_post_requests_total
This counter reflects the total number of HTTP (or VST) POST
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_post_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of HTTP PUT requests
arangodb_http_request_statistics_http_put_requests_total
This counter reflects the total number of HTTP (or VST) PUT
requests which have hit this particular instance of arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_http_put_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of other HTTP requests
arangodb_http_request_statistics_other_http_requests_total
This counter reflects the total number of HTTP (or VST) other
or ILLEGAL requests which have hit this particular instance of
arangod
. These are all requests that are not one of the following:
DELETE
, GET
, HEAD
, POST
, PUT
, OPTIONS
, PATCH
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_other_http_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Total number of HTTP requests executed by superuser/JWT
arangodb_http_request_statistics_superuser_requests_total
This counter reflects the total number of HTTP (or VST)
requests that have been authenticated with the JWT superuser token,
which have hit this particular instance of
arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_superuser_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Total number of HTTP requests
arangodb_http_request_statistics_total_requests_total
This counter reflects the total number of HTTP (or VST) requests which
have hit this particular instance of arangod
. Note that this counter
is ever growing during the lifetime of the arangod
process. However,
when the process is restarted, it starts from scratch. In the Grafana
dashboards, it is usually visualized as a rate per second, averaged
with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_total_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Total number of HTTP requests executed by user clients
arangodb_http_request_statistics_user_requests_total
This counter reflects the total number of HTTP (or VST) requests
that have been authenticated for some user (as opposed to with the
JWT superuser token), which have hit this particular instance of
arangod
.
Note that this counter is ever growing during the lifetime of the
arangod
process. However, when the process is restarted, it starts
from scratch. In the Grafana dashboards, it is usually visualized as a
rate per second, averaged with a sliding window of a minute.
Introduced in: v3.8.0
Renamed from: arangodb_http_request_statistics_user_requests
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric reflects the performance of an instance in a certain way. Note that your mileage may vary according to the available resources as well as the complexity of the requests the client sends.
Number of intermediate commits performed in transactions
arangodb_intermediate_commits_total
Number of intermediate commits performed in transactions.
An intermediate commit happens if a logical transaction needs to be
split into multiple physical transactions because of the volume of data
handled in the transaction. The thresholds for when to perform an
intermediate commit can be controlled by startup options
--rocksdb.intermediate-commit-count
(number of write operations after
which an intermediate commit is triggered) and --rocksdb.intermediate-commit-size
(cumulated size of write operations after which an intermediate commit is triggered).
The values can also be overridden for individual transactions.
This metric was named arangodb_intermediate_commits
in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_intermediate_commits
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers and Single Servers |
Troubleshoot: If this value is non-zero, it doesn’t necessarily indicate a problem. It can happen for large transactions and large data-loading jobs. However, as modifications performed by intermediate commits are persisted and cannot simply be rolled back in memory, it should be monitored whether the intermediate commits only happen for operations where they are expected. If they also happen for operations that are supposed to be atomic, then the intermediate commit size and count parameters need to be adjusted, or larger operations should be broken up into smaller ones in the client application.
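To build intuition for how the two thresholds interact, the following sketch models a single logical transaction as a sequence of write sizes. It is a simplified model and not the server's actual bookkeeping: the function name, the use of `>=` comparisons, and the sample numbers are all assumptions made for illustration.

```python
def count_intermediate_commits(write_sizes, max_count, max_size):
    """Count how often a transaction would be split in this simplified model.

    max_count stands in for --rocksdb.intermediate-commit-count and
    max_size for --rocksdb.intermediate-commit-size (in bytes).
    """
    commits = 0
    pending_ops = 0
    pending_bytes = 0
    for size in write_sizes:
        pending_ops += 1
        pending_bytes += size
        if pending_ops >= max_count or pending_bytes >= max_size:
            commits += 1  # everything written so far is persisted
            pending_ops = 0
            pending_bytes = 0
    return commits


writes = [100] * 10  # ten writes of 100 bytes each (made-up workload)

print(count_intermediate_commits(writes, max_count=4, max_size=10_000))
print(count_intermediate_commits(writes, max_count=1000, max_size=250))
```

In this model, an operation that is supposed to be atomic needs both thresholds to be larger than its total write count and write volume.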
Number of document reads which have been executed with dirty reads
arangodb_potentially_dirty_document_reads_total
This counter exposes the number of document reads (single or batch to shards in the cluster) which have been executed with “dirty reads”. A dirty read is one which may also use follower shards and not only leader shards. Note that the transaction in whose context the read runs determines whether dirty reads are allowed.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators |
Number of major page faults
arangodb_process_statistics_major_page_faults_total
On Windows, this figure contains the total number of page faults. On other systems, this figure contains the number of major faults the process has made which have required loading a memory page from disk.
Introduced in: v3.8.0
Renamed from: arangodb_process_statistics_major_page_faults
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of minor page faults
arangodb_process_statistics_minor_page_faults_total
The number of minor faults the process has made which have not required loading a memory page from disk. This figure is not reported on Windows.
Introduced in: v3.8.0
Renamed from: arangodb_process_statistics_minor_page_faults
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of threads
arangodb_process_statistics_number_of_threads
Number of threads in the arangod process.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Resident set size
arangodb_process_statistics_resident_set_size
The total size of the pages the process has in real memory. This only covers the pages which count toward text, data, or stack space. It does not include pages which have not been demand-loaded in, or which are swapped out. The resident set size is reported in bytes.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Resident set size as fraction of system memory
arangodb_process_statistics_resident_set_size_percent
The size of the pages the process has in real memory, relative to the system memory. This only covers the pages which count toward text, data, or stack space. It does not include pages which have not been demand-loaded in, or which are swapped out. The value is a ratio between 0.00 and 1.00.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This value can be consistently relatively high, even when not under load, due to different caches like the RocksDB block cache or the edge cache. There should be some safety margin left, so it should not get too close to 1.
Process system time
arangodb_process_statistics_system_time
Amount of time that this process has been scheduled in kernel mode, measured in seconds.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric can vary significantly depending on the workload. If the rate is consistently very high, it could be an indication of some problem.
Process user time
arangodb_process_statistics_user_time
Amount of time that this process has been scheduled in user mode, measured in seconds.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: This metric can vary significantly depending on the workload. If the rate is consistently very high, it could be an indication of some problem.
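Since both process time metrics are cumulative seconds, the rate mentioned in the thresholds can be read as the average number of CPU cores the process kept busy between two scrapes. The following is a minimal sketch with made-up sample values; the function name `cpu_cores_busy` is an assumption.

```python
def cpu_cores_busy(t0, user0, system0, t1, user1, system1):
    """Average number of cores kept busy between two scrapes.

    t0/t1 are scrape timestamps in seconds, user*/system* are the values of
    arangodb_process_statistics_user_time and
    arangodb_process_statistics_system_time at those times.
    """
    elapsed = t1 - t0
    if elapsed <= 0:
        return None
    return ((user1 - user0) + (system1 - system0)) / elapsed


# 90 s of user time and 30 s of kernel time accumulated within 60 s.
print(cpu_cores_busy(0, 100.0, 20.0, 60, 190.0, 50.0))
```

Comparing the result with `arangodb_server_statistics_cpu_cores` gives a rough sense of how much headroom is left.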
Virtual memory size
arangodb_process_statistics_virtual_memory_size
On Windows, this figure contains the total amount of memory that the memory manager has committed for the arangod process. On other systems, this figure contains the size of the virtual memory the process is using.
Introduced in: v3.6.1
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Body size in bytes for HTTP/1.1 requests
arangodb_request_body_size_http1
(basename)arangodb_request_body_size_http1_bucket
arangodb_request_body_size_http1_sum
arangodb_request_body_size_http1_count
Histogram of the body sizes of the received HTTP/1.1 requests in bytes. Note that this does not account for the header.
Introduced in: v3.7.15
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | medium | Coordinators, DB-Servers, Agents and Single Servers |
Body size in bytes for HTTP/2 requests
arangodb_request_body_size_http2
(basename)arangodb_request_body_size_http2_bucket
arangodb_request_body_size_http2_sum
arangodb_request_body_size_http2_count
Histogram of the body sizes of the received HTTP/2 requests in bytes. Note that this does not account for the header.
Introduced in: v3.7.15
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | medium | Coordinators, DB-Servers, Agents and Single Servers |
Body size in bytes for VST requests
arangodb_request_body_size_vst
(basename)arangodb_request_body_size_vst_bucket
arangodb_request_body_size_vst_sum
arangodb_request_body_size_vst_count
Histogram of the body sizes of the received VST requests in bytes. Note that this does include the binary header.
Introduced in: v3.7.15
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | bytes | medium | Coordinators, DB-Servers, Agents and Single Servers |
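For each of the body size histograms, the `_sum` and `_count` series are enough to derive a mean body size. The following is a minimal sketch, assuming the two values have already been scraped; the function names and the numbers are illustrative only.

```python
def mean_body_size(sum_bytes, count):
    """Mean request body size since server start, or None without requests."""
    return sum_bytes / count if count else None


def mean_body_size_between(earlier, later):
    """Mean body size of the requests received between two scrapes.

    Both arguments are dicts with the scraped 'sum' and 'count' values.
    """
    delta_count = later["count"] - earlier["count"]
    if delta_count <= 0:
        return None  # no new requests, or the process was restarted
    return (later["sum"] - earlier["sum"]) / delta_count


print(mean_body_size(2048, 4))
print(mean_body_size_between({"sum": 1000, "count": 10},
                             {"sum": 4000, "count": 20}))
```

The `_bucket` series are needed for anything beyond the mean, such as percentiles.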
Total memory usage of request statistics
arangodb_request_statistics_memory_usage
Total memory usage of request statistics.
If the startup option --server.statistics
is set to true
, then some
request statistics are built up in memory for incoming requests.
Some memory is pre-allocated at startup for higher efficiency.
It is expected that the memory usage reported by this metric remains
relatively constant over time. It should grow only when there are bursts of
incoming requests.
No memory will be allocated for request statistics if the startup option
is set to false
.
Introduced in: v3.11.6
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | Coordinators, DB-Servers, Agents and Single Servers |
Number of CPU cores visible to the arangod process
arangodb_server_statistics_cpu_cores
Number of CPU cores visible to the arangod process, unless the
environment variable ARANGODB_OVERRIDE_DETECTED_NUMBER_OF_CORES
is set. In that case, the environment variable’s value is reported.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Percentage of time that the system CPUs have been idle
arangodb_server_statistics_idle_percent
Percentage of time that the system CPUs have been idle, as
a value between 0
and 100
, and as reported by the operating system.
This metric is only reported on some operating systems.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | percentage | simple | Coordinators, DB-Servers, Agents and Single Servers |
Percentage of time that the system CPUs have been waiting for I/O
arangodb_server_statistics_iowait_percent
Percentage of time that the system CPUs have been waiting for I/O, as
a value between 0
and 100
, and as reported by the operating system.
This metric is only reported on some operating systems.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | percentage | simple | Coordinators, DB-Servers, Agents and Single Servers |
Physical memory in bytes
arangodb_server_statistics_physical_memory
Physical memory of the system in bytes, as reported by the operating system
unless the environment variable ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY
is set. In that case, the environment variable’s value is reported.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of seconds elapsed since server start
arangodb_server_statistics_server_uptime_total
Number of seconds elapsed since server start, including fractional
seconds.
This metric was named arangodb_server_statistics_server_uptime
in previous versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_server_statistics_server_uptime
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | s | simple | Coordinators, DB-Servers, Agents and Single Servers |
Percentage of time that the system CPUs have spent in kernel mode
arangodb_server_statistics_system_percent
Percentage of time that the system CPUs have spent in kernel mode, as
a value between 0
and 100
, and as reported by the operating system.
This metric is only reported on some operating systems.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | percentage | simple | Coordinators, DB-Servers, Agents and Single Servers |
Percentage of time that the system CPUs have spent in user mode
arangodb_server_statistics_user_percent
Percentage of time that the system CPUs have spent in user mode, as
a value between 0
and 100
, and as reported by the operating system.
This metric is only reported on some operating systems.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | percentage | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of V8 contexts currently alive
arangodb_v8_context_alive
Number of V8 contexts currently alive. Normally, only Coordinators and single servers should have V8 contexts. For DB-Servers and Agents, the value is always zero.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: If this number is close to the maximum allowed number of V8 contexts, there might be a shortage. This can delay Foxx queries and AQL user defined functions. On the other hand, V8 contexts can use quite a lot of memory, so one should not have too many if RAM is scarce.
Number of V8 contexts currently busy
arangodb_v8_context_busy
Number of V8 contexts currently busy, that is, currently working on some JavaScript task. Normally, only Coordinators and single servers should have V8 contexts. For DB-Servers and Agents, the value is always zero.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: If this number is close to the maximum allowed number of V8 contexts, there might be a shortage. This can delay Foxx queries and AQL user defined functions. On the other hand, V8 contexts can use quite a lot of memory, so one should not have too many if RAM is scarce.
Number of V8 contexts currently dirty
arangodb_v8_context_dirty
This gauge reflects the number of V8 contexts that are currently dirty. A V8 context is dirty if it has executed JavaScript for some time and is due for a garbage collection.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of V8 contexts currently free
arangodb_v8_context_free
This gauge reflects the number of V8 contexts that are currently free.
If this number drops to 0
there might be a shortage of V8 contexts.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Maximum number of concurrent V8 contexts
arangodb_v8_context_max
This is the maximum number of concurrent V8 contexts. This is limited by a server option, since V8 contexts can use a lot of RAM. V8 contexts are created and destroyed as needed up to the limit shown in this metric.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Minimum number of concurrent V8 contexts
arangodb_v8_context_min
This is the minimum number of concurrent V8 contexts. This is limited by a server option. V8 contexts are created and destroyed as needed but there are never fewer than the limit shown in this metric.
Introduced in: v3.6.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
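The V8 context gauges are easiest to interpret in combination. The following sketch turns the hints from the descriptions above into a simple check; the function name and the 90% cutoff for "close to the maximum" are assumptions, not documented thresholds.

```python
def v8_shortage_hints(alive, busy, free, maximum, near=0.9):
    """Return a list of human-readable hints about a possible context shortage."""
    hints = []
    if maximum <= 0:
        return hints  # no V8 contexts configured, nothing to check
    if alive >= near * maximum:
        hints.append("alive contexts close to maximum")
    if busy >= near * maximum:
        hints.append("busy contexts close to maximum")
    if free == 0:
        hints.append("no free contexts")
    return hints


# Made-up readings of arangodb_v8_context_alive, _busy, _free and _max.
print(v8_shortage_hints(alive=16, busy=15, free=0, maximum=16))
print(v8_shortage_hints(alive=4, busy=1, free=3, maximum=16))
```

A single reading says little on its own; the hints matter when they persist across many scrapes.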
Total number of compressed inserts into the in-memory edge cache
rocksdb_cache_edge_compressed_inserts_total
Total number of compressed inserts into the in-memory edge cache. This metric is increased whenever a payload value for the edge cache was compressed and then inserted.
Introduced in: v3.11.3
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Overall effective compression ratio of all edge cache entries ever stored in the edge cache
rocksdb_cache_edge_compression_ratio
Overall effective compression ratio of all edge cache entries ever stored in the in-memory edge cache. The value should be 0 if compression is not used.
Introduced in: v3.11.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
gauge | bytes | advanced | DB-Servers, Agents and Single Servers |
Total number of insertions into the in-memory edge cache for non-connected edges
rocksdb_cache_edge_empty_inserts_total
Total number of insertions into the in-memory edge cache for edges that were not connected to other edges. In this case, the edge cache will store the information that there are no connected edges.
Introduced in: v3.11.4
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Total effective memory size of all edge cache entries ever stored in the edge cache
rocksdb_cache_edge_inserts_effective_entries_size_total
Total effective memory size of all edge cache data that were stored in the edge cache of any collection/shard since the server start. The size is calculated after any potential compression, so the compression efficiency can be estimated by comparing this metric with the size of the uncompressed edge cache entries. Note that this metric is incremented upon every edge cache insert attempt. It is also increased when data cannot be inserted into the cache (e.g. because the cache had no memory left). The metric is not decreased when data gets evicted from the cache.
Introduced in: v3.11.2
Renamed from: rocksdb_cache_edge_effective_entries_size
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | bytes | advanced | DB-Servers, Agents and Single Servers |
Total number of insertions into the in-memory edge cache
rocksdb_cache_edge_inserts_total
Total number of insertions into the in-memory edge cache. This metric is increased whenever a value is inserted into the in-memory edge cache, regardless of whether the value was compressed or not.
Introduced in: v3.11.4
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Total gross memory size of all edge cache entries ever stored in the edge cache
rocksdb_cache_edge_inserts_uncompressed_entries_size_total
Total gross memory size of all edge cache data that were stored in the edge cache of any collection/shard since the server start. The size is calculated before any potential compression. Note that this metric is incremented upon every cache insert attempt. It is increased also when data cannot be inserted into the cache (e.g. because the cache had no memory left). The metric is not decreased when data gets evicted from the cache.
Introduced in: v3.11.2
Renamed from: rocksdb_cache_edge_uncompressed_entries_size
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | bytes | advanced | DB-Servers, Agents and Single Servers |
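The effective and uncompressed size counters can be combined into a rough estimate of how well edge cache compression works. The following is a minimal sketch with made-up values; the function name is an assumption, and the result is not claimed to match `rocksdb_cache_edge_compression_ratio` exactly.

```python
def edge_cache_size_ratio(effective_total, uncompressed_total):
    """Fraction of the uncompressed size that remains after compression.

    effective_total: rocksdb_cache_edge_inserts_effective_entries_size_total
    uncompressed_total: rocksdb_cache_edge_inserts_uncompressed_entries_size_total
    """
    if uncompressed_total <= 0:
        return None  # nothing was inserted yet
    return effective_total / uncompressed_total


ratio = edge_cache_size_ratio(250, 1000)
print(ratio)      # share of the original size that is kept
print(1 - ratio)  # share that compression saved
```

Because both counters also grow on failed insert attempts, the estimate describes the data offered to the cache rather than the data currently held in it.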
Total amount of time spent in ‘free memory’ tasks of the in-memory cache subsystem
rocksdb_cache_free_memory_tasks_duration_total
Total amount of time spent inside ‘free memory’ tasks of the in-memory cache subsystem. ‘free memory’ tasks are scheduled by the cache subsystem to free up memory in existing cache hash tables.
Introduced in: v3.10.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | us | advanced | DB-Servers, Agents and Single Servers |
Total number of ‘free memory’ tasks scheduled by the in-memory cache subsystem
rocksdb_cache_free_memory_tasks_total
Total number of ‘free memory’ tasks that were scheduled by the in-memory edge cache subsystem. This metric will be increased whenever the cache subsystem schedules a task to free up memory in one of the managed in-memory caches. It is expected to see this metric rising when the cache subsystem hits its global memory budget.
Introduced in: v3.10.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
Total amount of time spent in ‘migrate’ tasks of the in-memory cache subsystem
rocksdb_cache_migrate_tasks_duration_total
Total amount of time spent inside ‘migrate’ tasks of the in-memory cache subsystem. ‘migrate’ tasks are scheduled by the cache subsystem to migrate existing cache hash tables to a bigger or smaller table.
Introduced in: v3.10.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | us | advanced | DB-Servers, Agents and Single Servers |
Total number of ‘migrate’ tasks scheduled by the in-memory cache subsystem
rocksdb_cache_migrate_tasks_total
Total number of ‘migrate’ tasks that were scheduled by the in-memory edge cache subsystem. This metric will be increased whenever the cache subsystem schedules a task to migrate an existing cache hash table to a bigger or smaller size.
Introduced in: v3.10.11
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers, Agents and Single Servers |
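Each pair of task counters (duration and number of tasks) can be combined into an average duration per task. The following is a minimal sketch, assuming the `us` unit of the duration counters stands for microseconds; the function name and the numbers are illustrative.

```python
def mean_task_duration_ms(duration_total_us, tasks_total):
    """Average duration of a cache task in milliseconds, or None without tasks."""
    if tasks_total <= 0:
        return None
    return duration_total_us / tasks_total / 1000.0


# Made-up readings of rocksdb_cache_migrate_tasks_duration_total
# and rocksdb_cache_migrate_tasks_total.
print(mean_task_duration_ms(5_000_000, 100))
```

The same helper works for the ‘free memory’ task counters.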
Transactions
Number of read operation requests on leaders
arangodb_collection_leader_reads_total
This metric exposes the number of per-shard read operation requests on DB-Servers. It is increased by AQL queries and single-/multi-document read operations.
An AQL query increases the counter exactly once for a shard that is involved in the query in read-only mode, regardless of whether and how many documents/edges the query actually reads from the shard. For shards that are accessed by an AQL query in read/write mode, only the write counter is increased.
For every single- or multi-document read operation, the counter is increased exactly once for each shard that is affected by the operation, even if multiple documents are read from the same shard.
This metric is not exposed by default. It is only present if the startup option --server.export-shard-usage-metrics is set to either enabled-per-shard or enabled-per-shard-per-user. With the former setting, the metric has different labels for each shard that was read from. With the latter setting, the metric has different labels for each combination of shard and user that accessed the shard.
The metric is only exposed on DB-Servers and not on Coordinators or single servers.
Note that internal operations, such as internal queries executed for statistics gathering, internal garbage collection, and TTL index cleanup are not counted in these metrics. Additionally, all requests that use the superuser JWT for authentication and that do not have a specific user set, are not counted. Requests are also only counted if they have an ArangoDB user associated with them, so that the cluster must also be running with authentication turned on.
Introduced in: v3.11.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
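Because this counter carries one label set per shard (or per shard and user), a consumer usually has to aggregate it. The following is a minimal sketch of such an aggregation over Prometheus text output. The label names `shard` and `user` and all sample values are assumptions made up for illustration; check the actual output of your deployment for the real label names.

```python
import re
from collections import defaultdict

# Matches sample lines such as: metric_name{key="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def sum_by_label(text, metric, label):
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)

# Made-up scrape output; label names and values are illustrative only.
SAMPLE = """\
# TYPE arangodb_collection_leader_reads_total counter
arangodb_collection_leader_reads_total{shard="s1",user="alice"} 5
arangodb_collection_leader_reads_total{shard="s1",user="bob"} 2
arangodb_collection_leader_reads_total{shard="s2",user="alice"} 4
"""

reads_per_shard = sum_by_label(SAMPLE, "arangodb_collection_leader_reads_total", "shard")
print(reads_per_shard)
```

Grouping by a different label (for example the assumed `user` label) only requires changing the last argument.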
Number of write operation requests on leaders
arangodb_collection_leader_writes_total
This metric exposes the number of per-shard write operation requests on DB-Servers. It is increased by AQL queries and single-/multi-document write operations.
An AQL query increases the counter exactly once for each shard that is involved in the query in write-only or read-write mode, regardless of whether and how many documents/edges are inserted or modified in the shard.
For every single- or multi-document write operation, the counter is increased exactly once for each shard that is affected by the operation, even if multiple documents are inserted, modified, or removed from the same shard.
For collection truncate operations, the counter is also increased exactly once for each shard affected by the truncate.
This metric is not exposed by default. It is only present if the startup option --server.export-shard-usage-metrics is set to either enabled-per-shard or enabled-per-shard-per-user. With the former setting, the metric has different labels for each shard that was written to. With the latter setting, the metric has different labels for each combination of shard and user that accessed the shard.
The metric is only exposed on DB-Servers and not on Coordinators or single servers.
Note that internal operations, such as internal queries executed for statistics gathering, internal garbage collection, and TTL index cleanup are not counted in these metrics. Additionally, all requests that use the superuser JWT for authentication and that do not have a specific user set, are not counted. Requests are also only counted if they have an ArangoDB user associated with them, so that the cluster must also be running with authentication turned on.
Introduced in: v3.11.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Total amount of collection lock acquisition time
arangodb_collection_lock_acquisition_micros_total
Total amount of time it took to acquire collection/shard locks for write operations, summed up for all collections/shards. Does not increase for any read operations. The value is measured in microseconds.
Introduced in: v3.8.0
Renamed from: arangodb_collection_lock_acquisition_micros
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | us | medium | DB-Servers, Agents and Single Servers |
Troubleshoot: In case this value is considered too high, check if there are AQL queries or transactions that use exclusive locks on collections, and try to reduce them. Operations using exclusive locks may lock out other queries/transactions temporarily, which leads to an increase in lock acquisition time.
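Since this metric is a counter, its absolute value matters less than how fast it grows. The sketch below turns two readings into "seconds spent waiting for write locks per second of wall-clock time". The function names, the readings, and the 60-second interval are hypothetical; the reset handling assumes that a counter going backwards means the server was restarted.

```python
def counter_increase(previous, current):
    """Increase of a monotonic counter between two scrapes.

    If the counter went backwards, the server was presumably restarted and
    the counter started again from zero, so the current value is the increase.
    """
    return current - previous if current >= previous else current

def lock_wait_per_second(prev_us, curr_us, interval_s):
    """Seconds spent acquiring write locks per second of wall-clock time."""
    return counter_increase(prev_us, curr_us) / 1_000_000 / interval_s

# Hypothetical readings of arangodb_collection_lock_acquisition_micros_total
# taken 60 seconds apart.
wait = lock_wait_per_second(prev_us=4_000_000, curr_us=10_000_000, interval_s=60)
print(f"{wait:.2f} s of lock waiting per second")
```

What counts as "too high" depends on the workload, so this value is best compared against a baseline from normal operation.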
Number of transactions using sequential locking of collections to avoid deadlocking
arangodb_collection_lock_sequential_mode_total
Number of transactions using sequential locking of collections to avoid deadlocking. By default, a Coordinator tries to lock all shards of a collection in parallel. This approach is normally fast but can cause deadlocks with other transactions that lock the same shards in a different order. In case such a deadlock is detected, the Coordinator aborts the lock round and starts a new one that locks all shards in sequential order. This avoids deadlocks, but has a higher setup overhead.
Introduced in: v3.8.0
Renamed from: arangodb_collection_lock_sequential_mode
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | Coordinators |
Troubleshoot: In case this value is increasing, check if there are AQL queries or transactions that use exclusive locks on collections, and try to reduce them. Operations using exclusive locks may lock out other queries/transactions temporarily, which can lead to (temporary) deadlocks in case the queries/transactions are run on multiple shards on different servers.
Number of timeouts when trying to acquire collection exclusive locks
arangodb_collection_lock_timeouts_exclusive_total
Number of timeouts when trying to acquire collection exclusive locks. This counter increases whenever an exclusive collection lock cannot be acquired within the configured lock timeout.
Introduced in: v3.8.0
Renamed from: arangodb_collection_lock_timeouts_exclusive
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers, Agents and Single Servers |
Troubleshoot: In case this value is considered too high, check if there are AQL queries or transactions that use exclusive locks on collections, and try to reduce them. Operations using exclusive locks may lock out other queries/transactions temporarily, which can lead to other operations running into timeouts waiting for the same locks.
Number of timeouts when trying to acquire collection write locks
arangodb_collection_lock_timeouts_write_total
Number of timeouts when trying to acquire collection write locks. This counter increases whenever a collection write lock cannot be acquired within the configured lock timeout. This can only happen if writes on a collection are locked out by other operations on the collection that use an exclusive lock. Writes are not locked out by other, non-exclusively locked writes.
Introduced in: v3.8.0
Renamed from: arangodb_collection_lock_timeouts_write
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | DB-Servers, Agents and Single Servers |
Troubleshoot: In case this value is considered too high, check if there are AQL queries or transactions that use exclusive locks on collections, and try to reduce them. Operations using exclusive locks may lock out other queries/transactions temporarily, which can lead to other operations running into timeouts waiting for the same locks.
Number of bytes read in read requests on DB-Servers
arangodb_collection_requests_bytes_read_total
This metric exposes the per-shard number of bytes read by read operation requests on DB-Servers. It is increased by AQL queries that read documents or edges and for single- or multi-document read operations. The metric is normally increased only on the leader, but it can also increase on followers if “reads from followers” are enabled.
For every read operation, the metric is increased by the approximate number of bytes read to retrieve the underlying document or edge data. This is also true if a document or edge is served from an in-memory cache. If an operation reads multiple documents/edges, it increases the counter multiple times, each time with the approximate number of bytes read for the particular document/edge.
The numbers reported by this metric normally relate to the cumulated sizes of documents/edges read. The metric is also increased for transactions that are started but later aborted. However, metrics updates may be buffered until the end of a transaction. Note that the metric is not increased for secondary index point lookups or scans, or for scans in a collection that iterate over documents but do not read them.
This metric is not exposed by default. It is only present if the startup option --server.export-shard-usage-metrics is set to either enabled-per-shard or enabled-per-shard-per-user. With the former setting, the metric has different labels for each shard that was read from. With the latter setting, the metric has different labels for each combination of shard and user that accessed the shard.
The metric is only exposed on DB-Servers and not on Coordinators or single servers.
Note that internal operations, such as internal queries executed for statistics gathering, internal garbage collection, and TTL index cleanup are not counted in these metrics. Additionally, all requests that use the superuser JWT for authentication and that do not have a specific user set, are not counted. Requests are also only counted if they have an ArangoDB user associated with them, so that the cluster must also be running with authentication turned on.
Enabling this metric via the startup option will likely result in a small latency overhead of a few percent for read operations. The exact overhead depends on several factors, such as the type of operation (single- or multi-document operation), replication factor, and network latency.
Introduced in: v3.11.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Number of bytes written in write requests on DB-Servers
arangodb_collection_requests_bytes_written_total
This metric exposes the per-shard number of bytes written by write operation requests on DB-Servers, on both leaders and followers. It is increased by AQL queries and single-/multi-document write operations. The metric is first increased only on the leader. With traditional replication, the metric is also increased on followers for every replication request that the leader makes to the followers. With the new replication protocol (“replication2”), the metric is not increased on followers.
For every write operation, the metric is increased by the approximate number of bytes written for the document or edge in question. If an operation writes multiple documents/edges, it increases the counter multiple times, each time with the approximate number of bytes written for the particular document/edge.
An AQL query also increases the counter for every document or edge written, each time with the approximate number of bytes written for the document/edge.
The numbers reported by this metric normally relate to the cumulated sizes of documents/edges written. For remove operations, however, only a fixed number of bytes is counted per removed document/edge. For truncate operations, the metrics will be affected differently depending on how the truncate is executed internally. For truncates on smaller shards, the truncate operation will be executed as the removal of the individual documents in the shard. Thus the metric will also be increased as if the documents were removed individually. Truncate operations on larger shards however will be executed via a special operation in the storage engine, which marks a whole range of documents as removed, but defers the actual removal until much later (compaction process). If a truncate is executed like this, the metric will not be increased at all. Writes into secondary indexes are not counted at all. The metric is also increased for transactions that are started but later aborted. However, metrics updates may be buffered until the end of a transaction.
This metric is not exposed by default. It is only present if the startup option --server.export-shard-usage-metrics is set to either enabled-per-shard or enabled-per-shard-per-user. With the former setting, the metric has different labels for each shard that was written to. With the latter setting, the metric has different labels for each combination of shard and user that accessed the shard.
The metric is only exposed on DB-Servers and not on Coordinators or single servers.
Note that internal operations, such as internal queries executed for statistics gathering, internal garbage collection, and TTL index cleanup are not counted in these metrics. Additionally, all requests that use the superuser JWT for authentication and that do not have a specific user set, are not counted. Requests are also only counted if they have an ArangoDB user associated with them, so that the cluster must also be running with authentication turned on.
Enabling this metric via the startup option will likely result in a small latency overhead of a few percent for write operations. The exact overhead depends on several factors, such as the type of operation (single- or multi-document operation), replication factor, and network latency.
Introduced in: v3.11.7
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | advanced | DB-Servers |
Total time spent in collection truncate operations
arangodb_collection_truncate_time (basename)
arangodb_collection_truncate_time_bucket
arangodb_collection_truncate_time_sum
arangodb_collection_truncate_time_count
Total time spent in collection truncate operations, including both
user-initiated truncate operations and truncate operations
executed by the synchronous replication on followers.
Note that this metric is only present when the command
line option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | DB-Servers, Agents and Single Servers |
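Histogram metrics like this one are exported as cumulative `_bucket` counts plus `_sum` and `_count` series. As a rough sketch of how these can be turned into a mean and an approximate quantile, the code below applies linear interpolation within a bucket, similar in spirit to what Prometheus does for quantile estimation. The bucket boundaries and counts are invented for illustration and are not the boundaries ArangoDB actually uses.

```python
import math

def histogram_mean(total_sum, total_count):
    """Mean observation, computed from the _sum and _count series."""
    return total_sum / total_count if total_count else math.nan

def histogram_quantile(q, buckets):
    """Approximate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last upper bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            span = cumulative - lower_count
            if math.isinf(upper_bound) or span == 0:
                return lower_bound  # nothing to interpolate within
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

# Invented values standing in for the _bucket, _sum and _count series.
buckets = [(0.1, 50), (1.0, 90), (10.0, 100), (math.inf, 100)]
mean = histogram_mean(total_sum=120.0, total_count=100)
p95 = histogram_quantile(0.95, buckets)
print(f"mean={mean:.2f}s p95={p95:.2f}s")
```

The quantile is only as precise as the bucket layout allows, so treat it as an estimate rather than an exact latency.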
Total number of collection truncate operations (excluding synchronous replication)
arangodb_collection_truncates_total
Total number of collection truncate operations on leaders (excluding synchronous
replication). Note that this metric is only present when the command
line option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Renamed from: arangodb_collection_truncates
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Agents, DB-Servers and Single Servers |
Number of read transactions, which are allowed to do dirty reads
arangodb_dirty_read_transactions_total
Total number of read-only transactions that allow dirty reads
(reads from followers). This metric is only collected for
transactions on Coordinators in a cluster. Other instances may expose
the value as 0.
Introduced in: v3.10.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total time spent in document insert operations
arangodb_document_insert_time (basename)
arangodb_document_insert_time_bucket
arangodb_document_insert_time_sum
arangodb_document_insert_time_count
Total time spent in document insert operations, including both
user-initiated insert operations and insert operations executed by
the synchronous replication on followers. This metric
is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Agents, DB-Servers and Single Servers |
Total time spent in document read-by-primary-key operations
arangodb_document_read_time (basename)
arangodb_document_read_time_bucket
arangodb_document_read_time_sum
arangodb_document_read_time_count
Total time spent in document read-by-primary-key operations. This metric
is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | DB-Servers, Single Servers and Agents |
Total time spent in document remove operations
arangodb_document_remove_time (basename)
arangodb_document_remove_time_bucket
arangodb_document_remove_time_sum
arangodb_document_remove_time_count
Total time spent in document remove operations, including both
user-initiated remove operations and remove operations executed by
the synchronous replication on followers. This metric
is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Agents, DB-Servers and Single Servers |
Total time spent in document replace operations
arangodb_document_replace_time (basename)
arangodb_document_replace_time_bucket
arangodb_document_replace_time_sum
arangodb_document_replace_time_count
Total time spent in document replace operations, including both
user-initiated replace operations and replace operations executed by
the synchronous replication on followers. This metric
is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Agents, DB-Servers and Single Servers |
Total time spent in document update operations
arangodb_document_update_time (basename)
arangodb_document_update_time_bucket
arangodb_document_update_time_sum
arangodb_document_update_time_count
Total time spent in document update operations, including both
user-initiated update operations and update operations executed by
the synchronous replication on followers. This metric
is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Type | Unit | Complexity | Exposed by |
---|---|---|---|
histogram | s | simple | Agents, DB-Servers and Single Servers |
Total number of document write operations (excluding synchronous replication)
arangodb_document_writes_total
Total number of document write operations (insert, update, replace, remove) on
leaders, excluding writes by the synchronous replication on followers.
This metric is only present if the option --server.export-read-write-metrics is set to true.
Introduced in: v3.8.0
Renamed from: arangodb_document_writes
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Agents, DB-Servers and Single Servers |
Number of read transactions
arangodb_read_transactions_total
Total number of read-only transactions. In the cluster, this metric is collected separately for transactions on Coordinators and the transaction counterparts on leaders and followers.
Introduced in: v3.8.2
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of transactions aborted
arangodb_transactions_aborted_total
Total number of transactions aborted. In the cluster, this metric is
collected separately for transactions on Coordinators and the
transaction counterparts on leaders and followers.
This metric was named arangodb_transactions_aborted in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_transactions_aborted
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of transactions committed
arangodb_transactions_committed_total
Total number of transactions committed. In the cluster, this metric is
collected separately for transactions on Coordinators and the
transaction counterparts on leaders and followers.
This metric was named arangodb_transactions_committed in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_transactions_committed
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total number of expired transactions
arangodb_transactions_expired_total
Total number of expired transactions, i.e. transactions that were
started but then automatically garbage-collected due to
inactivity within the transactions’ time-to-live (TTL) period.
In the cluster, this metric is collected separately for transactions
on Coordinators and the transaction counterparts on leaders and followers.
This metric was named arangodb_transactions_expired in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_transactions_expired
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Number of transactions started
arangodb_transactions_started_total
Total number of transactions started/begun. In the cluster, this metric is
collected separately for transactions on Coordinators and the
transaction counterparts on leaders and followers.
This metric was named arangodb_transactions_started in previous
versions of ArangoDB.
Introduced in: v3.8.0
Renamed from: arangodb_transactions_started
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
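The started, committed, aborted, and expired counters are most informative when read together. As a hedged sketch, the function below compares two snapshots of these counters and reports each outcome as a fraction of the transactions started in between. All numbers are invented, and the ratios are approximate, because a transaction started in one interval may finish in a later one.

```python
STARTED = "arangodb_transactions_started_total"
OUTCOMES = (
    "arangodb_transactions_committed_total",
    "arangodb_transactions_aborted_total",
    "arangodb_transactions_expired_total",
)

def outcome_ratios(before, after):
    """Outcome counts between two snapshots, relative to transactions started."""
    started = after[STARTED] - before[STARTED]
    if started <= 0:
        return {}  # nothing started (or counters were reset) in this interval
    return {name: (after[name] - before[name]) / started for name in OUTCOMES}

# Invented snapshots of the four counters from a single instance.
before = {STARTED: 1000, OUTCOMES[0]: 900, OUTCOMES[1]: 80, OUTCOMES[2]: 5}
after = {STARTED: 1200, OUTCOMES[0]: 1080, OUTCOMES[1]: 100, OUTCOMES[2]: 6}

ratios = outcome_ratios(before, after)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

In a cluster, the counters are collected separately per instance, so snapshots from Coordinators and DB-Servers should not be mixed in one calculation.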
V8
Total number of V8 contexts ever created
arangodb_v8_context_created_total
This counter reflects the total number of V8 contexts ever created. It is OK if this number keeps growing since the V8 contexts are created and destroyed as needed. In rare cases a high fluctuation can indicate some unfortunate usage pattern.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_created
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Accumulated total time for creating V8 contexts
arangodb_v8_context_creation_time_msec_total
This counter reflects the accumulated total time for creating V8 contexts, in milliseconds. It is OK if this number keeps growing since the V8 contexts are created and destroyed as needed. In rare cases a high fluctuation can indicate some unfortunate usage pattern.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_creation_time_msec
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | ms | medium | Coordinators, DB-Servers, Agents and Single Servers |
Total number of V8 contexts ever destroyed
arangodb_v8_context_destroyed_total
This counter reflects the total number of V8 contexts ever destroyed. It is OK if this number keeps growing since the V8 contexts are created and destroyed as needed. In rare cases a high fluctuation can indicate some unfortunate usage pattern.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_destroyed
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | medium | Coordinators, DB-Servers, Agents and Single Servers |
Total number of V8 context enter failures
arangodb_v8_context_enter_failures_total
Total number of V8 context enter failures. A context receives a context enter event every time it begins to execute some JavaScript. If no context is available at such a time the system waits for 60s for a context to become free. If this does not happen within the 60s, the context enter event fails, a warning is logged and this counter is increased by one.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_enter_failures
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Threshold: If you see V8 context enter failures, then you do not have enough V8 contexts or the server is overloaded by JavaScript tasks. If some JavaScript code blocks V8 contexts for too long, the free V8 contexts can run out and these failures begin to happen.
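Because any increase of this counter signals a problem, a simple check is to look for scrape intervals in which it grew. The following sketch does this over a list of (timestamp, value) pairs. The scrape data is hypothetical, and a decreasing value is assumed to indicate a server restart.

```python
def failure_intervals(samples):
    """Return (start, end, increase) for each interval in which the counter grew.

    `samples` is a list of (timestamp, counter_value) pairs in scrape order.
    """
    intervals = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A lower value than before suggests a restart; count from zero then.
        increase = v1 - v0 if v1 >= v0 else v1
        if increase > 0:
            intervals.append((t0, t1, increase))
    return intervals

# Hypothetical scrapes of arangodb_v8_context_enter_failures_total,
# taken every 60 seconds.
scrapes = [(0, 0), (60, 0), (120, 3), (180, 3), (240, 4)]
alerts = failure_intervals(scrapes)
print(alerts)
```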
Total number of V8 context enter events
arangodb_v8_context_entered_total
Total number of V8 context enter events. A context receives a context enter event every time it begins to execute some JavaScript. This number is a rough estimate as to how much JavaScript the server executes.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_entered
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Total number of V8 context exit events
arangodb_v8_context_exited_total
This counter reflects the total number of V8 context exit events. A context receives a context exit event every time it finishes executing some JavaScript.
Introduced in: v3.8.0
Renamed from: arangodb_v8_context_exited
Type | Unit | Complexity | Exposed by |
---|---|---|---|
counter | number | simple | Coordinators, DB-Servers, Agents and Single Servers |
Get usage metrics
Returns detailed shard usage metrics on DB-Servers.
These metrics can be enabled by setting the --server.export-shard-usage-metrics
startup option to enabled-per-shard to make DB-Servers collect per-shard
usage metrics, or to enabled-per-shard-per-user to make DB-Servers collect
usage metrics per shard and per user whenever a shard is accessed.
Metrics API
Get the metrics (deprecated)
Use /_admin/metrics/v2 instead. From version 3.10.0 onward, /_admin/metrics returns the same metrics as /_admin/metrics/v2.
Returns the instance’s current metrics in Prometheus format. The returned document collects all instance metrics, which are measured at any given time and exposes them for collection by Prometheus.
The document contains different metrics and metrics groups dependent
on the role of the queried instance. All exported metrics are
published with the prefix arangodb_ or rocksdb_ to distinguish
them from other collected data.
The API then needs to be added to the Prometheus configuration file for collection.
Examples
curl --header 'accept: application/json' --dump - 'http://localhost:8529/_admin/metrics'
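For scripted access, the same request can be made from Python's standard library. The sketch below targets the non-deprecated /_admin/metrics/v2 endpoint and keeps only sample lines with the arangodb_ or rocksdb_ prefix. It assumes a server reachable at http://localhost:8529 without authentication; the function names and the example payload are made up for illustration.

```python
import urllib.request

def fetch_metrics(base_url="http://localhost:8529"):
    """Fetch the raw Prometheus text from the v2 metrics endpoint.

    Assumes the server requires no authentication.
    """
    url = f"{base_url}/_admin/metrics/v2"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def filter_by_prefix(text, prefixes=("arangodb_", "rocksdb_")):
    """Keep only sample lines whose metric name starts with one of the prefixes."""
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith("#") and line.startswith(prefixes)
    ]

# Made-up payload for illustration; against a running server you would call
# filter_by_prefix(fetch_metrics()) instead.
example = (
    "# TYPE arangodb_transactions_started_total counter\n"
    "arangodb_transactions_started_total 42\n"
    "other_metric 1\n"
)
kept = filter_by_prefix(example)
print(kept)
```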