64145: Acronis Cyber Infrastructure: adding a new disk with the cache option causes failure of other storage disks using the same cache disk

Last update: 06-10-2020

Symptoms

The following behavior has been observed:

1. A new disk on the node is assigned the Storage role with the "Use SSD for caching and checksumming" option enabled (CS #1047 in this example).

2. After the new disk appears in the "Active" state, all other disks configured to use the SSD for caching are left in the "Failed" state:

[root@node01 ~]# vstorage -c cluster1 list-services
TYPE    ID      ENABLED                 STATUS                  DEVICE/VOLUME GROUP            DEVICE INFO                         PATH
CS      1029    enabled                 failed                  /dev/sdd1                      ATA      TOSHIBA HDWN160            /vstorage/9905d6a0/cs
CS      1030    enabled                 failed                  /dev/sdc1                      ATA      TOSHIBA HDWN160            /vstorage/01608bad/cs
CS      1038    enabled                 failed                  /dev/sde1                      ATA      INTEL SSDSC2KB48           /vstorage/156c23d5/cs
CS      1047    enabled                 active [29463]          /dev/sdf1                      ATA      TOSHIBA MG07ACA1           /vstorage/6c3918e5/cs
MDS     3       enabled                 active [2702]           /dev/sdb1                      ATA      Crucial_CT512MX1           /vstorage/a09ea656/mds

3. The cluster events show the affected CSs being stopped with the message "could not lock repository":

[root@node01 ~]# vstorage -c cluster1 get-events
...
2019-11-28 12:42:10.987 MON INF: CS#1038 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1030 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1029 was stopped: csd: could not lock repository
2019-11-28 12:42:43.588 MDS INF: New CS#1047 at 192.168.1.101:48408 (0.0.0.4b11ac89f7274355), tier=0
2019-11-28 12:42:45.199 MDS INF: CS#1047 is active
...
2019-11-28 12:56:55.276 MDS WRN: CS#1029 is offline
2019-11-28 12:56:56.276 MDS WRN: CS#1038, CS#1030 are offline

4. All affected CSs have their journals located on the same cache disk (/dev/sdb1 in this example):

[root@node01 ~]# ll /vstorage/a09ea656/journal/journal-cs-10*/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:27 /vstorage/a09ea656/journal/journal-cs-1029/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1030/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:12 /vstorage/a09ea656/journal/journal-cs-1038/journal
-rw------- 1 vstorage vstorage 90932514816 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1047/journal
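
The checks in steps 2 and 3 can be narrowed down to just the affected CS IDs. The following is a minimal sketch, not an official tool: the function names `failed_cs_ids` and `lock_failed_cs_ids` are hypothetical, and the parsing assumes the column layout and message wording shown in the sample outputs above.

```shell
# Hypothetical helpers; they only filter text passed on stdin.

# Print the IDs of chunk services (CS) whose STATUS column is "failed".
# Assumes the column order TYPE, ID, ENABLED, STATUS from the sample
# `list-services` output above.
failed_cs_ids() {
    awk '$1 == "CS" && $4 == "failed" { print $2 }'
}

# Print the IDs of CSs that were stopped with "could not lock repository".
# Assumes the event wording shown in the sample `get-events` output above.
lock_failed_cs_ids() {
    sed -n 's/.*CS#\([0-9][0-9]*\) was stopped: csd: could not lock repository.*/\1/p' |
        sort -nu
}

# Usage on a node (cluster name taken from the example above):
#   vstorage -c cluster1 list-services | failed_cs_ids
#   vstorage -c cluster1 get-events | lock_failed_cs_ids
```

If both commands report the same set of IDs, and those CSs keep their journals on the same cache disk, the node most likely matches the behavior described in this article.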

Cause

A software-related issue causes vstorage-csd services to remain in the 'failed' state after the cache journal is reconfigured.

Solution

The issue is fixed in Acronis Cyber Infrastructure 4.0.

For earlier versions, manually restart the failed vstorage-csd services on the affected node using the following command:

# (export CN=$(cat /mnt/vstorage/.vstorage.info/clustername); for i in $(vstorage -c $CN list-services | grep 'CS.*failed'| awk '{print$2}'); do systemctl restart vstorage-csd.$CN.$i.service; done)
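
To review what the one-liner above would restart before running it, the same logic can be split into a step that only prints the commands. This is a sketch under stated assumptions: the function name `failed_cs_restart_cmds` is hypothetical, and the parsing assumes the `list-services` column layout shown in the Symptoms section. The unit name pattern `vstorage-csd.<cluster>.<id>.service` is taken from the command above.

```shell
# Hypothetical helper: print one `systemctl restart` command per failed CS.
# $1    - cluster name
# stdin - output of `vstorage -c <cluster> list-services`
failed_cs_restart_cmds() {
    awk -v cn="$1" '$1 == "CS" && $4 == "failed" {
        printf "systemctl restart vstorage-csd.%s.%s.service\n", cn, $2
    }'
}

# Usage on the affected node:
#   CN=$(cat /mnt/vstorage/.vstorage.info/clustername)
#   vstorage -c "$CN" list-services | failed_cs_restart_cmds "$CN"        # review only
#   vstorage -c "$CN" list-services | failed_cs_restart_cmds "$CN" | sh   # run the restarts
```

Unlike the `grep 'CS.*failed'` filter, this variant matches the STATUS column exactly, so a line that happens to contain "failed" elsewhere (for example, in the device info) is not selected.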

More information

If the issue persists, contact Acronis support for assistance.