64145: Acronis Cyber Infrastructure: adding a new disk with the cache option causes failure of other storage disks using the same cache disk

Symptoms

The following behavior has been observed:

1. A new disk on the node has been assigned the Storage role with the Use SSD for caching and checksumming option enabled (CS #1047 in this example).

2. After the new disk appears in the "Active" state, all other disks configured to use SSD for caching are left in the "Failed" status:

[root@node01 ~]# vstorage -c cluster1 list-services
TYPE    ID      ENABLED                 STATUS                  DEVICE/VOLUME GROUP            DEVICE INFO                         PATH
CS      1029    enabled                 failed                  /dev/sdd1                      ATA      TOSHIBA HDWN160            /vstorage/9905d6a0/cs
CS      1030    enabled                 failed                  /dev/sdc1                      ATA      TOSHIBA HDWN160            /vstorage/01608bad/cs
CS      1038    enabled                 failed                  /dev/sde1                      ATA      INTEL SSDSC2KB48           /vstorage/156c23d5/cs
CS      1047    enabled                 active [29463]          /dev/sdf1                      ATA      TOSHIBA MG07ACA1           /vstorage/6c3918e5/cs
MDS     3       enabled                 active [2702]           /dev/sdb1                      ATA      Crucial_CT512MX1           /vstorage/a09ea656/mds

3. The cluster events show the following pattern, with the message "could not lock repository" logged for the affected CSs:

[root@node01 ~]# vstorage -c cluster1 get-events
...
2019-11-28 12:42:10.987 MON INF: CS#1038 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1030 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1029 was stopped: csd: could not lock repository
2019-11-28 12:42:43.588 MDS INF: New CS#1047 at 192.168.1.101:48408 (0.0.0.4b11ac89f7274355), tier=0
2019-11-28 12:42:45.199 MDS INF: CS#1047 is active
...
2019-11-28 12:56:55.276 MDS WRN: CS#1029 is offline
2019-11-28 12:56:56.276 MDS WRN: CS#1038, CS#1030 are offline

4. All affected CSs have their journals located on the same cache disk (/dev/sdb1 in this example):

[root@node01 ~]# ll /vstorage/a09ea656/journal/journal-cs-10*/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:27 /vstorage/a09ea656/journal/journal-cs-1029/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1030/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:12 /vstorage/a09ea656/journal/journal-cs-1038/journal
-rw------- 1 vstorage vstorage 90932514816 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1047/journal
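The check in step 4 can be reproduced without a live cluster. The sketch below builds a throwaway directory layout that mimics the journal paths shown above (the CS IDs and the a09ea656 mount name are taken from this article's example) and counts how many CS journals share one cache disk:

```shell
#!/bin/sh
# Minimal sketch: every journal path under one cache mount point belongs to
# a CS that depends on that cache disk. The directory layout is fabricated
# here for illustration; on a real node the mount point is the cache disk's
# /vstorage/<id> path (e.g. /vstorage/a09ea656 in this article).
mnt=$(mktemp -d)
for id in 1029 1030 1038 1047; do
    mkdir -p "$mnt/journal/journal-cs-$id"
    : > "$mnt/journal/journal-cs-$id/journal"
done

# List the journals and count how many CSs share this cache disk:
ls -d "$mnt"/journal/journal-cs-*/journal
shared=$(ls -d "$mnt"/journal/journal-cs-* | wc -l)
echo "$shared CS journals share this cache disk"
rm -rf "$mnt"
```

If the failed CSs all appear under the same cache mount point while unrelated CSs do not, the node matches the symptoms described here.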

Cause

A software-related issue causes vstorage-csd services to remain in the 'failed' state after cache journal reconfiguration.

Solution

The issue will be permanently fixed in future updates.

To fix the issue, manually restart all failed vstorage-csd services on the affected node using the following command:

# (export CN=$(cat /mnt/vstorage/.vstorage.info/clustername); for i in $(vstorage -c $CN list-services | grep 'CS.*failed'| awk '{print$2}'); do systemctl restart vstorage-csd.$CN.$i.service; done)
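The one-liner above reads the cluster name, filters the `list-services` output down to failed CS entries, and restarts the matching vstorage-csd unit for each. The sketch below demonstrates just the filtering step against sample rows copied from the Symptoms output (the data is hardcoded here for illustration; on a real node it comes from `vstorage -c $CN list-services`):

```shell
#!/bin/sh
# Sample list-services rows, taken from the Symptoms section above:
sample='CS      1029    enabled    failed            /dev/sdd1
CS      1030    enabled    failed            /dev/sdc1
CS      1047    enabled    active [29463]    /dev/sdf1
MDS     3       enabled    active [2702]     /dev/sdb1'

# Keep only CS rows in the "failed" state, then take the ID column ($2).
# Active CSs and the MDS row are left alone:
failed_ids=$(printf '%s\n' "$sample" | grep 'CS.*failed' | awk '{print $2}')
echo "$failed_ids"

# In the real command, each ID feeds a per-service restart:
#   systemctl restart vstorage-csd.$CN.$i.service
```

Only the failed services are restarted, so active CSs and the MDS continue serving without interruption.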

More information

If the issue persists, contact Acronis support for assistance.