Symptoms
The following behavior has been observed:
1. A new disk on the node has been assigned the Storage role with the Use SSD for caching and checksumming option enabled (CS #1047 in this example).
2. After the new disk appears in the "Active" state, all other disks configured to use SSD for caching are left in the "Failed" status:
[root@node01 ~]# vstorage -c cluster1 list-services
TYPE ID ENABLED STATUS DEVICE/VOLUME GROUP DEVICE INFO PATH
CS 1029 enabled failed /dev/sdd1 ATA TOSHIBA HDWN160 /vstorage/9905d6a0/cs
CS 1030 enabled failed /dev/sdc1 ATA TOSHIBA HDWN160 /vstorage/01608bad/cs
CS 1038 enabled failed /dev/sde1 ATA INTEL SSDSC2KB48 /vstorage/156c23d5/cs
CS 1047 enabled active [29463] /dev/sdf1 ATA TOSHIBA MG07ACA1 /vstorage/6c3918e5/cs
MDS 3 enabled active [2702] /dev/sdb1 ATA Crucial_CT512MX1 /vstorage/a09ea656/mds
3. The cluster events show the following pattern, where each affected CS was stopped with the message "could not lock repository":
[root@node01 ~]# vstorage -c cluster1 get-events
...
2019-11-28 12:42:10.987 MON INF: CS#1038 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1030 was stopped: csd: could not lock repository
2019-11-28 12:42:11.325 MON INF: CS#1029 was stopped: csd: could not lock repository
2019-11-28 12:42:43.588 MDS INF: New CS#1047 at 192.168.1.101:48408 (0.0.0.4b11ac89f7274355), tier=0
2019-11-28 12:42:45.199 MDS INF: CS#1047 is active
...
2019-11-28 12:56:55.276 MDS WRN: CS#1029 is offline
2019-11-28 12:56:56.276 MDS WRN: CS#1038, CS#1030 are offline
4. All affected CSs have their journals located on the same cache disk (/dev/sdb1 in this example):
[root@node01 ~]# ll /vstorage/a09ea656/journal/journal-cs-10*/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:27 /vstorage/a09ea656/journal/journal-cs-1029/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1030/journal
-rw------- 1 vstorage vstorage 90798297088 Nov 28 13:12 /vstorage/a09ea656/journal/journal-cs-1038/journal
-rw------- 1 vstorage vstorage 90932514816 Nov 28 13:28 /vstorage/a09ea656/journal/journal-cs-1047/journal
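To confirm which CS services are affected, the failed CS IDs can be extracted from the `list-services` output with awk. This is a minimal sketch that parses the sample output from step 2 via a here-document; on a live node the here-document would be replaced by the real `vstorage -c cluster1 list-services` command:

```shell
# Extract IDs of CS services whose STATUS column is "failed".
# The here-document stands in for: vstorage -c cluster1 list-services
failed_ids=$(awk '$1 == "CS" && $4 == "failed" { print $2 }' <<'EOF'
TYPE ID   ENABLED STATUS DEVICE/VOLUME GROUP DEVICE INFO          PATH
 CS  1029 enabled failed /dev/sdd1           ATA TOSHIBA HDWN160  /vstorage/9905d6a0/cs
 CS  1030 enabled failed /dev/sdc1           ATA TOSHIBA HDWN160  /vstorage/01608bad/cs
 CS  1038 enabled failed /dev/sde1           ATA INTEL SSDSC2KB48 /vstorage/156c23d5/cs
 CS  1047 enabled active [29463] /dev/sdf1   ATA TOSHIBA MG07ACA1 /vstorage/6c3918e5/cs
 MDS 3    enabled active [2702]  /dev/sdb1   ATA Crucial_CT512MX1 /vstorage/a09ea656/mds
EOF
)
echo "$failed_ids"
```

With the sample data above this prints the three affected IDs (1029, 1030, 1038), matching the journals listed in step 4.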
Cause
A software issue causes vstorage-csd services to remain in the 'failed' state after cache journal reconfiguration.
Solution
The issue is fixed in Acronis Cyber Infrastructure 4.0.
For earlier versions, manually restart all failed vstorage-csd services on the affected node using the following command:
# (export CN=$(cat /mnt/vstorage/.vstorage.info/clustername); for i in $(vstorage -c $CN list-services | grep 'CS.*failed'| awk '{print$2}'); do systemctl restart vstorage-csd.$CN.$i.service; done)
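Broken out with comments, the one-liner reads the cluster name, collects the IDs of the failed CSs, and restarts the matching systemd unit for each one. In this sketch the cluster name and the failed IDs are hard-coded with the values from this article, and the restart commands are only echoed rather than executed, so the logic can be inspected safely:

```shell
# Normally: CN=$(cat /mnt/vstorage/.vstorage.info/clustername)
CN=cluster1

# Normally: vstorage -c $CN list-services | grep 'CS.*failed' | awk '{print $2}'
failed_ids="1029 1030 1038"

units=""
for i in $failed_ids; do
    # Each CS runs as its own systemd unit: vstorage-csd.<cluster>.<CS id>.service
    units="$units vstorage-csd.$CN.$i.service"
done

# On a live node each unit would be restarted with: systemctl restart <unit>
echo "Would restart:$units"
```

The unit naming scheme vstorage-csd.<cluster>.<CS id>.service is why the cluster name must be exported first: the restart loop cannot construct the unit names without it.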
More information
If the issue persists, contact Acronis support for assistance.