The only reason for that is to reach more people, especially those who Google for solutions when in trouble. This time I’m writing about a few rather frustrating issues that took me some time and energy to solve, so I’d like to share them with as many people as possible!
The background is a very common setup where we have HPE ProLiant Gen9 or Gen10 servers and use the HPE-customized VMware ESXi image, namely the “HPE Gen9 Plus Custom Image for ESXi 6.7 U2 Install CD”. It’s been around since April, and is probably quite widely used by now.
What doesn’t seem to be as well known are a couple of problems with this version!
I ran into the first one when a host refused to accept patches through VUM, reporting “unable to write to disk” and “full /tmp”.
Running vdf -h shows that /tmp IS full.
Going to /tmp and running ls -l shows the culprit: the file ams-bbUsg.txt, written by AMS.
AMS is HPE’s “Agentless Management Service”, used to channel info from iLO through the OS to applications showing hardware status.
Unfortunately, this is not the first time HPE have had problems with AMS in the OEM-bundle, and if we do a quick search there’s an advisory about this issue:
The fix is to upgrade AMS to 3.4.5 through the offline bundle, or the fantastic duct-tape solution of (frequently) deleting the file! A simple rm ams-bbUsg.txt will fix it temporarily.
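If you are stuck with the duct-tape approach for a while, it can be scripted. This is only a minimal sketch: the helper name remove_if_larger and the 1 MiB limit in the example call are my own assumptions, not anything from HPE's advisory, and how you schedule it on the host is left open.

```shell
#!/bin/sh
# Hypothetical helper (not an HPE or VMware tool): delete a file once it
# has grown past a size limit.
# Usage: remove_if_larger <path> <max_bytes>
remove_if_larger() {
    file="$1"
    max_bytes="$2"

    # Nothing to do when the file does not exist.
    [ -f "$file" ] || return 0

    # wc -c prints the size in bytes; the arithmetic expansion strips
    # any padding some wc implementations add.
    size=$(( $(wc -c < "$file") ))

    if [ "$size" -gt "$max_bytes" ]; then
        rm -f "$file" && echo "removed $file ($size bytes)"
    fi
}

# Example call with the file from this post and an assumed 1 MiB limit:
# remove_if_larger /tmp/ams-bbUsg.txt 1048576
```

This only buys time, just like the manual rm; the file grows back until AMS itself is upgraded.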
A few weeks later I ran into a similar problem when Veeam just stopped backing up about half of the VMs. The error in Veeam wasn’t very clear, but it at least pointed us in the right direction:
Error: DiskLib error: .The file is locked or in use -- File open failed: File not open
Failed to create NFC download stream. NFC path: [nfc://conn:<fqdn of vCenter>,nfchost:host-9095,stg:email@example.com/<name of VM-file>.vmx
Apparently only VMs residing on one specific host were failing, and it points to vCenter not being able to read the vmx-files of the VM being backed up.
Looking at the host in vCenter I see the following event:
The ramdisk 'var' is full. As a result, the file /var/run/vmware/.vvold-conflict-resolution-file.LOCK.531962696 could not be written
So, another RAM-disk getting filled by something! This time it’s the /var-disk:
That disk contains a lot of folders, files and links. There are several ways to find a big file or folder; I use du -h or du -h -d 1, depending on the depth you want to show.
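To save some scrolling, the du output can be sorted so the biggest entries end up on top. A minimal sketch, assuming the shell on the host has du, sort and head available; the function name largest_entries is my own invention.

```shell
#!/bin/sh
# Hypothetical helper: list the largest direct children of a directory,
# biggest first, with sizes in KiB.
# Usage: largest_entries <dir> [count]
largest_entries() {
    dir="$1"
    count="${2:-5}"

    # Note: the * glob does not match hidden (dot) files.
    for entry in "$dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$entry" ] || continue
        # -s: one summary line per entry, -k: size in KiB.
        du -sk "$entry"
    done | sort -rn | head -n "$count"
}

# Example: largest_entries /var 5
```

Run it on /var first, then on whatever lands on top, and repeat until you reach the offending file.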
Here you see that /log/EMU/mili is rather big:
The /EMU/mili-folder is for Emulex hardware (not militant, wingless birds) and used by that daemon.
This is a known bug with the Emulex drivers: even if (because?) you don’t have any Emulex hardware installed, the daemon fills up a log with errors, as seen here after a cat mili2d.log:
Tue Oct 15 07:50:03 2019,532350375, ERROR:MILI_enumerate_elxiscsi:Failed to initialize User Init with status = 19
Tue Oct 15 07:50:03 2019,532350375, ERROR:MILI_enumerate_elx_nics:Failed to initialize USer Init with status = 19
Tue Oct 15 07:50:03 2019,532350375, ERROR:could not open device node /vmfs/devices/char/vmkdriver/be_esx_nic
Tue Oct 15 07:50:03 2019,532350375, CRITICAL:backend_init:OneConnect Adapter Not Found.
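To see how repetitive the log really is, the messages can be counted per type. Again just a sketch: summarize_log is a name I made up, and it assumes every line follows the "timestamp,id, LEVEL:message" layout shown above and that cut, sed, sort and uniq exist on the host.

```shell
#!/bin/sh
# Hypothetical helper: count how often each distinct message occurs in a
# log whose lines look like "<timestamp>,<id>, <LEVEL>:<message>".
# Usage: summarize_log <logfile>
summarize_log() {
    # Drop the first two comma-separated fields (timestamp and id),
    # trim the leading space, then count identical messages.
    cut -d, -f3- "$1" | sed 's/^ *//' | sort | uniq -c | sort -rn
}

# Example: summarize_log /var/log/EMU/mili/mili2d.log
```

If the same handful of messages accounts for the whole file, you know the log carries nothing worth keeping before you delete it.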
The quick fix is to delete mili2d.log repeatedly; the workaround is to remove the corresponding VIB:
esxcli software vib remove --vibname elx-esx-libelxima.so
Another fix is to install the patch containing ESXi670-201904211-UG, which moves the log(!) rather than fixing the original problem:
“PR 2226688: Emulex drivers logs might fill up the /var file system logs
Emulex drivers might write logs at /var/log/EMU/mili/mili2d.log and fill up the 40 MB /var file system logs of RAM drives.
This issue is resolved in this release. The fix changes writes of Emulex drivers to the /scratch/log/ instead of the /var/log/.”
That’s all for now, but I’ll be back soon (after VMworld) with a few more “surprises” from the OEM-version of ESXi.
Until then, keep it virtual