68797: Acronis Cyber Infrastructure: Node is suspended with token "node_crash_per_hour_threshold" in shaman

Last update: 05-11-2021

Symptoms

One of the following scenarios is observed:

1. When you attempt to take a node out of maintenance mode, it gets stuck in the "Exiting maintenance halted" status. The following error, related to resuming shaman, may appear in /var/log/vstorage-ui-agent/messages.log on the affected node:

[root@storage-node-01 ~]# grep 'ERR.*node.*is not suspended with token' /var/log/vstorage-ui-agent/messages.log
ERROR 2021-06-04 12:14:02,420 r-94049c9ad0d443eb agent/presentation/api/ha/shaman.py:177:ResumeShaman.post status: 1 err:  out: Error: node "7e8808440b804f98" is not suspended with token ""

2. The eligibility check during the update to version 4.7 reports the following issue:

There are forcefully suspended nodes. Check "shaman stat -j"

A manual check of the shaman cluster status shows that there are nodes suspended with the node_crash_per_hour_threshold token, e.g.:

[root@storage-node-01 ~]# shaman stat -j | jq '.status.nodes[] | select (.status=="Suspended")'
{
  "annotations": {
    "node-capabilities": "basic,gc:rsp,ob:nsag,pr,ob:nst,ob:rlg",
    "node-disable-roles": "yes",
    "node-hostname": "storage-node-01.vstoragedomain",
    "node-roles": "VM:QEMU,CT:VZ7,ISCSI,S3",
    "package-version": "1.4.3"
  },
  "address": "192.168.3.21",
  "status": "Suspended",
  "suspend_tokens": [
    {
      "id": "node_crash_per_hour_threshold",
      "description": "Reached NODE_CRASH_PER_HOUR_THRESHOLD. See 'man shaman'.",
      "timestamp": "2021-06-04T07:04:11.793761331-04:00"
    }
  ]
}
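
When several nodes are involved, it can help to reduce this output to one line per suspended node. The following is a minimal sketch, assuming the JSON layout shown above (`.status.nodes[]` with `status`, `address`, `annotations` and `suspend_tokens[].id`). The function name `list_suspended` is hypothetical and not part of shaman:

```shell
# Hypothetical helper (not part of shaman): reads `shaman stat -j` output
# on stdin and prints one tab-separated line per suspended node:
# hostname, address, comma-separated suspend token IDs.
list_suspended() {
  jq -r '.status.nodes[]
         | select(.status == "Suspended")
         | [ .annotations["node-hostname"],
             .address,
             ([.suspend_tokens[]?.id] | join(",")) ]
         | @tsv'
}

# Usage on a cluster node:
#   shaman stat -j | list_suspended
```

A node that shows node_crash_per_hour_threshold in the last column is affected by the issue described in this article.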

Cause

At some point, the internal monitoring service shaman detected 3 consecutive failures of the same node within one hour and suspended that node's membership in the shaman HA cluster. While the node is suspended, it cannot exit maintenance mode, and the update to ACI 4.7 cannot start.

Solution

Resume the node's membership in the shaman HA cluster by running the following command on the affected node:

[root@storage-node-01 ~]# shaman resume --token node_crash_per_hour_threshold
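
To confirm that the token has been cleared, the `shaman stat -j` output can be checked again. The following is a minimal sketch, assuming the JSON layout shown in the Symptoms section. The function name `has_suspend_token` is hypothetical and not part of shaman:

```shell
# Hypothetical helper (not part of shaman): reads `shaman stat -j` output
# on stdin and returns 0 if any node still carries the given suspend token,
# non-zero otherwise.
has_suspend_token() {
  jq -e --arg t "$1" \
    'any(.status.nodes[]; any(.suspend_tokens[]?; .id == $t))' >/dev/null
}

# Usage on a cluster node:
#   if shaman stat -j | has_suspend_token node_crash_per_hour_threshold; then
#     echo "token still present on at least one node"
#   else
#     echo "no node is suspended with this token"
#   fi
```

The check covers every node in the cluster, so it can also be used before re-running the update eligibility check.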

If the node is stuck in the "Exiting maintenance halted" status, retry exiting maintenance mode via WebCP or the vinfra CLI tool, e.g.:

[root@storage-node-01 ~]# vinfra node maintenance stop storage-node-01 --wait

Contact Acronis support if the issue with the node persists or if you need assistance investigating the root cause of the previously detected crashes.

More information

For more information about shaman HA monitoring, see the manual page on any ACI node:

# man shaman
