65165: Acronis Cyber Infrastructure: Cluster network issues with Broadcom NIC bnx2x driver

Last update: Fri, 2020-08-07 08:21

Symptoms

Network communication works fine under low load, but packets start to drop as soon as the load increases. This may lead to various failures in provisioning scenarios:

1. MDS is not reachable from the host:

# vstorage -c aci_cluster get-event
connected to MDS#16
2020-06-19 11:32:17.180 MDS#16 responds with error 13 (Peer did not respond or did not hold deadline)
2020-06-19 11:32:17.180 Unable complete operation, timeout (30 sec) expired.
Operation failed

cs.log shows a timeout:

2020-06-19 14:49:25.108 timer_work: rpc timer expired, killing connection to MDS#16, 1
2020-06-19 14:49:25.108 sio_trace_health: Trouble on MDS#16 fd=24 st=1/0 bufs=235104/108604/332800/0 queue=0/0/0/0/3 retr=1/0/0:0 rtt=103/65/0 cwnd=10/7/26880
2020-06-19 14:49:25.108 rpc_abort: aborted msg to MDS#16, tmo=0, err=13, -30000
2020-06-19 14:49:25.108 mds_register_done: CSD register error 13

2. VM creation in an existing cluster fails. In /var/log/hci/nova/nova-compute.log, you see this message:

2020-06-09 01:38:08.644 6 ERROR nova.compute.resource_tracker 
Skipping removal of allocations for deleted instances: Unable to establish connection to https://compute-api.vstoragedomain.:35357/v3/auth/tokens: ('Connection aborted.', error(104, 'Connection reset by peer')): ConnectFailure: Unable to establish connection to https://compute-api.vstoragedomain.:35357/v3/auth/tokens: ('Connection aborted.', error(104, 'Connection reset by peer'))

3. Compute cluster creation fails. In celery.log, you see the stack failing with 504:

2020-06-05 16:08:26,851 ERROR [r-f736b935058341aa] backend/presentation/utils.py:85:make_error_response Request error
Traceback (most recent call last):
...
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/magnumclient/common/httpclient.py", line 362, in _http_request
error_json.get('debuginfo'), method, url)
magnumclient.common.apiclient.exceptions.GatewayTimeout: Gateway Timeout (HTTP 504)
2020-06-05 16:08:26,854 INFO [r-f736b935058341aa] backend/hooks/req_id.py:29:request_finished === END REQUEST: POST:/api/v2/compute/k8saas/ STATUS:504 ===

4. You are setting up Acronis Cyber Infrastructure and want to increase the MTU of a Broadcom NIC that uses the bnx2x driver to achieve the best performance, but you cannot set a value larger than 3616.

Solution

Network adapters that use the bnx2x driver (for example, Broadcom Limited BCM57840 NetXtreme II 10/20-Gigabit Ethernet / Hewlett-Packard Company FlexFabric 10Gb 2-port 536FLB Adapter) do not perform well with high MTU values; this is a known implementation issue in the driver. For network adapters with this driver, the Acronis Cyber Infrastructure kernel limits the MTU to 3616. However, when such an adapter is included in a bond (the default MTU value for bonding is 9000), the device drops all packets larger than its MTU limit.
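
The mismatch described above can be looked for from the command line. The following is a minimal sketch, not an official Acronis tool: it reads the standard Linux sysfs entries under /sys/class/net and lists bond slaves whose MTU is lower than the MTU of the bond itself. The function names and the SYSFS_NET variable are hypothetical helpers introduced for illustration, and the sketch assumes that the lower limit of the adapter is visible in the slave's sysfs mtu file, which this article does not confirm.

```shell
#!/bin/bash
# Sketch: report bond slaves whose MTU is lower than the MTU of the bond.
# SYSFS_NET is a hypothetical override; by default the real sysfs tree is read.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the kernel driver bound to interface $1, or "unknown" if the
# interface has no device/driver link (for example, a virtual device).
iface_driver() {
    local link="$SYSFS_NET/$1/device/driver"
    if [ -e "$link" ]; then
        basename "$(readlink -f "$link")"
    else
        echo "unknown"
    fi
}

# For bond $1, print "<slave> <driver> <slave_mtu> <bond_mtu>" for every
# slave whose MTU is smaller than the MTU of the bond.
bond_mtu_mismatches() {
    local bond="$1" bond_mtu slave slave_mtu
    bond_mtu=$(cat "$SYSFS_NET/$bond/mtu")
    for slave in $(cat "$SYSFS_NET/$bond/bonding/slaves"); do
        slave_mtu=$(cat "$SYSFS_NET/$slave/mtu")
        if [ "$slave_mtu" -lt "$bond_mtu" ]; then
            echo "$slave $(iface_driver "$slave") $slave_mtu $bond_mtu"
        fi
    done
}
```

After loading these functions into a shell, running bond_mtu_mismatches bond0 (where bond0 is an assumed bond name) prints one line per slave with a lower MTU; a line that names the bnx2x driver would point to the situation described in this article.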

To verify this, run ping with a packet size greater than the Broadcom limit:

# ping <IP address> -s 4000

The output will show 100% packet loss:

--- 10.XXX.XXX.XXX ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4003ms
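
To choose ping sizes on either side of the limit, it helps to know how the -s value relates to the MTU. The sketch below is only an illustration and assumes IPv4 without header options: the payload passed with -s is carried inside a 20-byte IPv4 header and an 8-byte ICMP header. The function name max_ping_payload is hypothetical.

```shell
#!/bin/bash
# Largest ICMP echo payload (ping -s value) that fits into a single
# unfragmented IPv4 packet of the given MTU, assuming a 20-byte IPv4
# header without options and an 8-byte ICMP header.
max_ping_payload() {
    local mtu="$1"
    echo $(( mtu - 20 - 8 ))
}

# 3616 is the driver limit and 9000 is the bond default named in this article.
max_ping_payload 3616   # prints 3588
max_ping_payload 9000   # prints 8972
```

Under these assumptions, a ping with -s 3588 still fits within the 3616 limit, while -s 4000 produces a 4028-byte packet that exceeds it. If only oversized packets are being dropped, the first would be expected to receive replies and the second would not.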

Because the MTU cannot be set to the recommended value of 9000, these adapters do not allow the best performance for our storage infrastructure solution and should not be used in an Acronis Cyber Infrastructure setup.

As a temporary workaround until the hardware is replaced, decrease the MTU of the bond connection to 3000. This aligns the MTU of the virtual bond device with what the physical interfaces can handle and allows communication.

More information

The Acronis Cyber Infrastructure core is designed to work with Red Hat drivers, and the correct operation of other drivers provided by the vendor cannot be guaranteed. We recommend selecting another network adapter for your infrastructure.