60632: Acronis Cyber Infrastructure: Chunk Server node is in 'Failed' state

Symptoms

You observe all or several of these symptoms:

  • In WebCP, some nodes or disks are marked as FAILED and the storage icon is shown in red.
  • In the output of the vstorage -c <cluster_name> top command, some CSs are in the Failed state.
  • One or more of the following events are found in the vstorage -c <cluster_name> get-event output:

MDS WRN: CS#1025 have reported IO error on pushing chunk 1cee of 'data.0', please check disks
MDS ERR CS#1026 detected back storage I/O failure
MDS ERR CS#1026 detected journal I/O failure
MDS WRN: Integrity failed accessing 'data.0' by the client at 192.168.1.11:42356
MDS WRN: CS#1025 is failed permanently and will not be used for new chunks allocation

Cause

If any disk returns an I/O error, the Chunk Server (CS) located on that disk is switched to the 'failed' state. Acronis Cyber Infrastructure does not automatically recover the CS from this state, even after the storage node is rebooted.

Right after an I/O error occurs, the file system is remounted in read-only mode and Acronis Cyber Infrastructure stops allocating new data chunks on this CS. If the drive is still readable, Acronis Cyber Infrastructure tries to replicate all chunks from it to other CSs.

Solution

The following workflow is recommended to troubleshoot the issue: 

  1. Determine the affected disk.
  2. Check its health status.
  3. Decide if the device needs replacement.
  4. Based on the information above, return the failed CS to active status or decommission it.

1. Determine the affected device

How to find the affected node and drive with WebCP
In the left menu, go to Nodes and click the node marked as Failed. Note the name of this node. Click Disks and find the disk marked as Failed. Note the device name for this disk (for example, sdc).

How to find the affected disk with SSH and CLI
Log in to any node of the Acronis Cyber Infrastructure cluster with SSH.

Issue the following command:
vstorage -c <cluster_name> stat | grep failed

Example output: 

[root@ ~]# vstorage -c PCKGW1 stat | grep failed
connected to MDS#2
CS nodes:  6 of 6 (5 avail, 0 inactive, 0 offline, 1 out of space, 1 failed), storage version: 122
  1026 failed     98.2GB     0B        6        2     0%       0/0    0.0  172.29.38.210 7.5.111-1.as7

Note the CS ID displayed in the first column (1026 in the example above) and the IP address of the node where the CS is located (172.29.38.210 in the example above).
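
The lookup above can also be scripted. The following is a minimal sketch, assuming the stat output keeps the column layout of the example (CS ID in the first column, state in the second, node IP address in the second-to-last column). The function name failed_cs is hypothetical and not part of the product.

```shell
# Print "<CS ID> <node IP>" for every CS row whose state column is "failed".
# Reads the output of `vstorage -c <cluster_name> stat` on stdin.
# Assumption: columns are laid out as in the example output above.
failed_cs() {
    awk '$1 ~ /^[0-9]+$/ && $2 == "failed" { print $1, $(NF-1) }'
}
```

To use it, pipe the stat output into the function: vstorage -c <cluster_name> stat | failed_cs. Matching on the state column, rather than grepping for the word "failed", skips the summary line, which also contains that word.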

Log in to the affected node. 

To determine the disk where the affected CS is located, use the following command:
vstorage -c <cluster_name> list-services

Example output: 

Note: although the CS is in the failed state, it is still running and, where possible, replicating data to other CSs. For this reason, the list-services output displays it as active.

[root@PCKGW1 ~]# vstorage -c PCKGW1 list-services
TYPE    ID      ENABLED  STATUS        DEVICE/VOLUME GROUP  DEVICE INFO             PATH
CS      1025    enabled  active [1297] /dev/sdd1            VMware   Virtual disk   /vstorage/df218335/cs
CS      1026    enabled  active [1288] /dev/sdc1            VMware   Virtual disk   /vstorage/12bb6baf/cs
MDS     1       enabled  active [1295] /dev/sdb1            VMware   Virtual disk   /vstorage/38b5fb92/mds

In the ID column, find the CS with the ID you noted in the previous step. Note the device (DEVICE/VOLUME GROUP column) and the path (PATH column) for this CS. The path is useful when you need to review the log files of this CS, which are located at PATH/logs (/vstorage/12bb6baf/cs/logs in the example above).
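
As an illustration, the same lookup can be done with a small filter. This is a sketch only, assuming the list-services output has the layout shown in the example (the device is the first field that starts with /dev/ and the path is the last field). The function name cs_device_and_path is hypothetical.

```shell
# Print "<device> <path>" for the CS with the given ID.
# Reads the output of `vstorage -c <cluster_name> list-services` on stdin.
# Assumption: the device is the first field starting with /dev/ and the
# path is the last field, as in the example output above.
cs_device_and_path() {
    awk -v id="$1" '
        $1 == "CS" && $2 == id {
            dev = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /^\/dev\//) { dev = $i; break }
            print dev, $NF
        }'
}
```

For the example above, vstorage -c PCKGW1 list-services | cs_device_and_path 1026 would print the device and the CS path. Appending /logs to the path gives the log directory.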

2. Check the affected disk health status

The goal of this step is to collect the information needed to decide whether the affected disk can remain in use or should be replaced.

The following information should be reviewed and analyzed for any data related to the issue: 

  • dmesg command output. Use dmesg -T to see human-readable timestamps.
  • /var/log/messages file
  • SMART status of the physical hard drive. It can be obtained with: smartctl -a <affected device>
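
When the logs are large, it helps to narrow them down to the affected disk first. The following is a minimal sketch of such a filter. The function name disk_log_lines is hypothetical, and the assumption is that kernel and system log lines refer to the disk by its short device name (for example, sdc or sdc1).

```shell
# Keep only log lines that mention the given disk or one of its partitions
# (for example, "sdc" matches sdc and sdc1 but not sdca).
# Reads log text on stdin; the device name is given without /dev/.
disk_log_lines() {
    grep -E "(^|[^[:alnum:]])$1[0-9]*([^[:alnum:]]|\$)"
}
```

Typical usage would be dmesg -T | disk_log_lines sdc and disk_log_lines sdc < /var/log/messages. SMART data is read separately with smartctl -a /dev/sdc, which requires the smartmontools package and a physical disk that exposes SMART.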

3. Decide if device needs replacement

Depending on the physical storage type (directly attached JBOD, iSCSI LUN, Fibre Channel, etc.) and the particular circumstances, the exact error messages and patterns vary greatly.

Here are some rules of thumb to help with the decision:

  • If the SMART status of the physical disk is unsatisfactory, this usually means the disk needs to be replaced.
  • Check whether similar issues or any other error messages were previously logged for this disk. If the issue appears for the first time, the CS can usually be reused without configuration changes. Nevertheless, pay special attention to this CS in the future.
  • If dmesg and/or /var/log/messages contain multiple error messages for several disks on a single backplane or RAID controller, the hardware itself could be the culprit. Contact your hardware vendor for additional review.
  • In the case of an iSCSI device, I/O errors could be a result of poor network connectivity or incorrect network configuration. Troubleshooting should start with a thorough network check.
  • If Acronis Cyber Infrastructure is installed in a virtual machine and the CS is located in a .vmdk or .vhd file stored on a NAS, such a system should be carefully checked for reliability before going to production. Acronis Cyber Infrastructure ships a special tool, vstorage-hwflush-check, for checking how a storage device flushes data to disk in an emergency such as a power outage. We strongly recommend using this tool to make sure your storage behaves correctly in case of power-off events. This article explains how to use the tool.

4. Return failed CS to Active status

If you decide to reuse the same CS on the same drive, follow the steps below:

  • Reboot the affected Acronis Cyber Infrastructure node.
  • Check dmesg | grep <disk name> (e.g. dmesg | grep sdc for the example above) for any messages about file system errors on the affected drive. If errors are reported, check the file system with fsck or e2fsck.

  • Use the following command to override the failed status for the CS:

    vstorage -c <cluster_name> rm-cs -U <CSID>

  • Verify that the CS is in the Active state with the following command:

    vstorage -c <cluster_name> stat | grep <CSID>
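
Because grep <CSID> also matches any other line that happens to contain the same digits, a stricter check can compare the first column exactly. The following is a minimal sketch, assuming the stat output keeps the layout shown in step 1 (CS ID in the first column, state in the second). The function name cs_state is hypothetical.

```shell
# Print the state column for the CS whose ID matches the first column exactly.
# Reads the output of `vstorage -c <cluster_name> stat` on stdin.
# Assumption: CS rows start with the CS ID followed by the state.
cs_state() {
    awk -v id="$1" '$1 == id { print $2 }'
}
```

For example, vstorage -c <cluster_name> stat | cs_state <CSID> prints only the state of that CS, which makes it easy to confirm that it is no longer reported as failed.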

 

More information

If in doubt, contact Acronis Technical Support for an additional review of the state of your Acronis Cyber Infrastructure.
