68862: Acronis Cyber Infrastructure: empty graphs on WebCP and Grafana dashboards


Last update: 22-09-2021

Symptoms

The WebCP and Grafana monitoring dashboards show no performance data on any of the monitoring charts.

At the same time, all storage services are healthy and working properly, and no other issues with UI functionality are observed.

Cause

Acronis Cyber Infrastructure uses the open-source Prometheus monitoring system to monitor the performance and availability of the storage cluster, infrastructure nodes, and deployed services. Missing data on the UI charts is in most cases caused by issues with prometheus.service on the Management Node. In particular, part of the Prometheus data may become corrupted if the root partition of the Management Node ran out of space at some point. In that case, the corresponding messages can be found in the /var/log/messages log, e.g.:

[root@node01 ~]# grep prometheus /var/log/messages* | grep 'no space left on device'
Mar 26 11:40:39 node01 prometheus: level=warn ts=2021-03-26T11:40:39.031Z caller=manager.go:606 component="rule manager" group=fused msg="Rule sample appending failed" err="write to WAL: log samples: write /var/lib/prometheus/data/wal/00001066: no space left on device"
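
To see whether the root partition is still short on space, a check along the following lines may help. This is a minimal sketch: the function name root_usage_percent and the 90% threshold are illustrative assumptions and not Acronis recommendations. The data directory path is the one shown in the log message above.

```shell
# Print the percentage of used space on the given mount point (default: /).
# -P forces POSIX output so that each filesystem stays on a single line.
root_usage_percent() {
    df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

usage=$(root_usage_percent /)

# The 90% threshold is an arbitrary example value.
if [ "$usage" -ge 90 ]; then
    echo "WARNING: root partition is ${usage}% full"
else
    echo "root partition usage: ${usage}%"
fi

# Size of the Prometheus data directory (path taken from the log message above).
du -sh /var/lib/prometheus/data 2>/dev/null || echo "Prometheus data directory not found"
```

If the partition is still nearly full, free up space before proceeding, since newly collected data may otherwise be corrupted again.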

Solution

1. Check the Prometheus logs to confirm data segment corruption, e.g. with the journalctl tool:

[root@node01 ~]# journalctl -u prometheus | grep error

May 25 15:33:55 node01.vstoragedomain prometheus[3969]: level=error ts=2021-05-25T15:33:55.561Z caller=main.go:787 err="opening storage failed: mmap files, file: /var/lib/prometheus/data/chunks_head/000072: mmap: invalid argument"

Jun 16 07:00:03 node01.vstoragedomain prometheus[2989]: level=error ts=2021-06-16T07:00:03.255Z caller=db.go:730 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: read segments: corruption in segment /var/lib/prometheus/data/wal/checkpoint.013688/00000001 at 661: unexpected full record"
 

or via system logs:

[root@node01 ~]# grep error.*prometheus /var/log/messages

Jun 30 17:29:27 node01 prometheus: level=error ts=2021-06-30T15:29:27.537Z caller=main.go:787 err="opening storage failed: mmap files, file: /var/lib/prometheus/data/chunks_head/000108: mmap: invalid argument"
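
Since "grep error" matches any error line, the search can be narrowed to the corruption signatures quoted in the examples above. The following is a sketch only: the helper name is_corruption_line and the pattern list are assumptions derived from the sample messages in this article, and other corruption messages may exist.

```shell
# Signatures taken from the sample log lines in this article (not exhaustive).
CORRUPTION_PATTERN='mmap: invalid argument|corruption in segment|opening storage failed'

# Succeed if the text read from stdin contains one of the signatures.
is_corruption_line() {
    grep -Eq "$CORRUPTION_PATTERN"
}

# Scan the Prometheus unit journal; skipped on hosts without journalctl.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u prometheus --no-pager 2>/dev/null \
        | grep -E "$CORRUPTION_PATTERN" \
        || echo "no corruption signatures found in the journal"
fi
```

The helper can also be applied to a single line, for example: echo "$line" | is_corruption_line && echo "corruption suspected".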

2. Clean up the Prometheus data directory on the Management Node and restart the service so that Prometheus starts collecting metrics from scratch (previously collected monitoring data is removed):

[root@node01 ~]# rm -rf /var/lib/prometheus/data/*
[root@node01 ~]# systemctl restart prometheus
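
After the restart, it may be worth confirming that the service is running and that the data directory is being repopulated. The sketch below is an optional verification and not part of the official procedure. The helper name dir_has_entries is hypothetical.

```shell
# Succeed if the directory exists and contains at least one entry.
dir_has_entries() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# "systemctl is-active" prints "active" when the unit is running.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active prometheus || echo "prometheus is not active"
fi

# The data directory path is the one used in the cleanup command above.
if dir_has_entries /var/lib/prometheus/data; then
    echo "Prometheus has started writing new data"
else
    echo "data directory is still empty or missing"
fi
```

The charts are expected to fill in gradually as new metrics are collected.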

More information

Contact Acronis support if the monitoring charts remain empty after the Prometheus data directory cleanup.

 
