Thursday, April 12, 2018

QFSDP JANUARY 2018 on Exadata with OVM: watch out if you have more than 8 VMs

Last month we installed the OS-related, firmware, ... parts of the QFSDP JANUARY 2018 (12.2.1.1.6, to be more specific) on the test and dev systems at my customer.

GI and DB still need to be patched.

After our last patching experience (http://pfierens.blogspot.be/2017/06/applying-april-2017-qfsdp-12102-exadata.html), this could only go better.


Well, to put a very long story short: be cautious. We have now run into 4 different bugs, causing instability of the RAC clusters, GI that refused to start up, loss of InfiniBand connectivity ...


So the Monday after the patching we were hit by instability of our Exadata OVM infrastructure for test, dev and qualification: dom0s rebooting ....

There also seemed to be an issue with the IB interfaces in the domU. Unfortunately
we didn't have a crash dump, so Support couldn't really do anything.


The only way to get GI and the DBs up again was to reboot the VM. crsctl stop crs followed by crsctl start crs didn't really work, and the logs showed IB issues.


Last time (forgot to blog about that) we ran into the gnttab_max_frames issue, and we had set that parameter to 512. After this patching it was put back to 256, so we thought that might have been the reason, also because in this release another parameter was introduced in grub.conf. So there are now two:



gnttab_max_maptrack_frames
gnttab_max_frames

The relation between the two was difficult to find, but in the end this seems not to have been the right diagnosis.

If you want some more information about gnttab_max_frames, please read this.
Shortly put: each virtual disk and each networking operation needs a number of granted frames to communicate. If this is not set correctly, you have issues ....
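
Since the patching silently put our value back from 512 to 256, it is worth checking both parameters after every patch. Below is a minimal sketch of such a check in Python. It only parses text, and it assumes both options appear as name=value pairs in grub.conf. The sample excerpt, the 1024 value and the function name are made up for illustration; they are not taken from our systems.

```python
import re

# Matches both gnttab_max_frames=N and gnttab_max_maptrack_frames=N
GNTTAB_OPTION = re.compile(r"\b(gnttab_max(?:_maptrack)?_frames)=(\d+)")


def grant_table_settings(grub_text):
    """Return {option_name: int_value} for the gnttab_* options in grub.conf text.

    Commented-out lines are skipped. If an option occurs more than once
    (for example in several boot entries), the last occurrence wins.
    """
    settings = {}
    for line in grub_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        for name, value in GNTTAB_OPTION.findall(stripped):
            settings[name] = int(value)
    return settings


# Hypothetical excerpt, NOT copied from a real dom0; the values are illustrative.
SAMPLE_GRUB = """\
# gnttab_max_frames=512 (old, commented-out setting)
title Sample dom0 entry
    kernel /xen.gz dom0_mem=8192M gnttab_max_frames=256 gnttab_max_maptrack_frames=1024
"""

if __name__ == "__main__":
    found = grant_table_settings(SAMPLE_GRUB)
    print(found)
    # 512 is the value we had set before this patching
    if found.get("gnttab_max_frames", 0) < 512:
        print("gnttab_max_frames is lower than the 512 we had set before")
```

On a real dom0 you would feed it the content of your own grub.conf instead of the sample string.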


Luckily, on the Friday of that same week we were in the same situation again, and this time we decided to let the dom0 crash so that we would have a crash dump.

After we uploaded that crash dump, Support was able to see that the issue was in the Mellanox HCA firmware layer. Between APR 2017 and January there were 4000 changes in that firmware, and it is not clear which one, or which combination, caused our issue.



Bottom line: there seems to be an issue with the Mellanox HCA firmware in this patch (it goes from 2.11.1280 to 2.35.5532).
You may encounter it if you have more than 8 VMs under one dom0; we had 9 ......



So basically we shut down one VM on each node and had stability again.

When it was confirmed in numerous conf calls that 8 was the magic number, we decided to move the Exadata monitoring functionality to another VM and shut down the monitoring VM, to be back at 8 VMs.
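
If you want to keep an eye on that magic number, here is a minimal sketch in Python that counts the guest VMs in the text output of xm list on a dom0. It assumes the usual layout: a header line starting with Name, one line for Domain-0, then one line per guest with the VM name in the first column. The sample output and VM names are made up, and the limit of 8 is simply what Support told us, not a documented constant.

```python
# The "magic number" confirmed to us in the support calls; treat it as an assumption.
MAX_VMS_PER_DOM0 = 8


def guest_domains(xm_list_text):
    """Return the guest VM names from `xm list` style text.

    The header line and the Domain-0 line are skipped; every other
    non-empty line is assumed to start with the VM name.
    """
    names = []
    for line in xm_list_text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("Name", "Domain-0"):
            continue
        names.append(fields[0])
    return names


def over_limit(xm_list_text, limit=MAX_VMS_PER_DOM0):
    """True when the dom0 hosts more guest VMs than the limit."""
    return len(guest_domains(xm_list_text)) > limit


# Hypothetical output with 9 guests, built here only for the example.
SAMPLE_XM_LIST = (
    "Name        ID   Mem VCPUs  State   Time(s)\n"
    "Domain-0     0  8192     4  r-----    100.0\n"
    + "".join(
        f"samplevm{i:02d}  {i}  16384     4  -b----     50.0\n"
        for i in range(1, 10)
    )
)

if __name__ == "__main__":
    vms = guest_domains(SAMPLE_XM_LIST)
    print(len(vms), "guest VMs on this dom0")
    if over_limit(SAMPLE_XM_LIST):
        print("More than", MAX_VMS_PER_DOM0, "VMs: you may hit the HCA firmware issue")
```

With the sample above, the check flags the dom0 because it hosts 9 guests, which is exactly the situation we were in.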


We had a stable situation until last Friday, when both IB switches became unresponsive and the second switch did not take the SM master role. This issue is still under investigation and is hopefully not related to the QFSDP JAN 2018 ...



If you have similar symptoms, point Support to these bugs:

  Bug 27724899 - Dom0 crashes with ib_umad_close with large no. of VMs 
  Bug 27691811 
  Bug 27267621 


UPDATE:

There seems to be a bug as well in IB switch version 2.2.7-1, solved in 2.2.10 (not released yet). Not everything is solved, though: only the logging issue is fixed, not the main root cause. Apparently there is a separate ER for that.





