64948: Acronis Cyber Infrastructure: Nodes with AMD Epyc Rome CPU and Mellanox NIC reboot unexpectedly


Symptoms

Cluster nodes that have an AMD 2nd generation Epyc (Rome) CPU and a Mellanox NIC reboot unexpectedly.

The kernel log (use the dmesg command to view it) contains messages like the following:

AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x002d address=0x00000000be6ca980 flags=0x0020]
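
To check a node for this symptom, the kernel log can be filtered for these events. The following is a minimal sketch, not an official Acronis tool: the function names find_iommu_faults and iommu_fault_devices are hypothetical, and the pattern assumes the message format shown above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any product).

# Print only the AMD-Vi IO_PAGE_FAULT events from kernel log text on stdin.
find_iommu_faults() {
    grep -E 'AMD-Vi: Event logged \[IO_PAGE_FAULT'
}

# Print the unique PCI addresses (the "device=" field) of the faulting devices.
iommu_fault_devices() {
    find_iommu_faults | sed -n 's/.*IO_PAGE_FAULT device=\([^ ]*\).*/\1/p' | sort -u
}

# Usage on a node (reading the kernel log may require root):
#   dmesg | find_iommu_faults
#   dmesg | iommu_fault_devices
```

Each address printed by iommu_fault_devices can be passed to lspci -s <address> to see which device it is, for example to check whether it is the Mellanox NIC.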

Cause

Known issue specific to this configuration. See the Troubleshooting section of this Mellanox community article.

Solution

  1. Enable SR-IOV in BIOS.
  2. Make sure that iommu=pt is set in the Linux GRUB configuration:
    1. In /etc/default/grub, add kernel parameter iommu=pt to the string GRUB_CMDLINE_LINUX.
       
      GRUB_CMDLINE_LINUX="<YOUR_PARAMS> iommu=pt"

       
      For example:
      Before:
      GRUB_CMDLINE_LINUX="crashkernel=auto tcache.enabled=0 rd.md.uuid=93606373:d5569557:322f4641:13d6fab3 rd.md.uuid=c0b44f6a:1efde5fe:51aace30:4627c299 rd.md.uuid=d8db1339:2fb46769:61385b6b:ba385aa7 quiet"
      After: 
      GRUB_CMDLINE_LINUX="crashkernel=auto tcache.enabled=0 rd.md.uuid=93606373:d5569557:322f4641:13d6fab3 rd.md.uuid=c0b44f6a:1efde5fe:51aace30:4627c299 rd.md.uuid=d8db1339:2fb46769:61385b6b:ba385aa7 quiet iommu=pt"

    2. Run:
      grub2-mkconfig -o /boot/grub2/grub.cfg
      This is the default location; it is different on EFI systems or if it was changed by the user.
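
The edit in step 2 can also be scripted. The following is a minimal sketch under stated assumptions: GRUB_CMDLINE_LINUX is a single double-quoted line, as in the example above, and the function names add_iommu_pt and has_iommu_pt are hypothetical. The first function appends iommu=pt only if it is not already present and keeps a .bak copy of the file. The second checks a kernel command line read from stdin.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any product).

# Append iommu=pt to GRUB_CMDLINE_LINUX in the given file, if it is missing.
# Assumes the value is on one line and enclosed in double quotes.
add_iommu_pt() {
    grub_file="$1"
    # Do nothing if the parameter is already present.
    if grep -Eq '^GRUB_CMDLINE_LINUX=.*[" ]iommu=pt[" ]' "$grub_file"; then
        return 0
    fi
    # Insert the parameter before the closing quote; the original is kept as <file>.bak.
    sed -i.bak -E 's/^(GRUB_CMDLINE_LINUX=".*)"[[:space:]]*$/\1 iommu=pt"/' "$grub_file"
}

# Succeed if the kernel command line on stdin contains iommu=pt.
has_iommu_pt() {
    grep -Eq '(^| )iommu=pt( |$)'
}

# Usage on a node (as root):
#   add_iommu_pt /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg   # adjust the path for EFI systems
#   # after the node has been restarted:
#   has_iommu_pt < /proc/cmdline && echo "iommu=pt is active"
```

Kernel parameters are read at boot, so the has_iommu_pt check is only meaningful after the node has been restarted with the regenerated configuration.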