Thursday, June 22, 2017

To Data Guard First or Not to Data Guard First, that is the question

In a previous post you could read about issues with IB switches and other problems with the April QFSDP for 12.1.

We had some more surprises.

All bundle patches we have installed so far are Data Guard Standby-First capable, meaning you can install them on the standby, do a switchover, bring your new standby to the same patch level at your own pace, and then run datapatch.
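The Standby-First flow can be sketched roughly as follows; the database names, staging path, and patch number are placeholders, not our actual values:

```shell
# Standby-First sketch (placeholders throughout; adapt to your environment).

# 1. Patch the binaries of the standby home while the primary keeps running.
srvctl stop database -d stbydb
$ORACLE_HOME/OPatch/opatch apply /u01/stage/25397136
srvctl start database -d stbydb

# 2. Switch over: the patched standby becomes the new primary.
dgmgrl sys@prim "switchover to stbydb"

# 3. Bring the old primary (now standby) to the same binary level at your
#    own pace, then run datapatch once against the new primary.
$ORACLE_HOME/OPatch/datapatch -verbose
```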


Well, we did exactly that, as we had for all the previous QFSDPs.

But after the switchover, all of a sudden our standbys stopped following and aborted recovery with an ORA-00600.

We had clearly run into this issue:

ORA-00600:[KTSLU_PUA_REMCHK-1] Could be generated after Applying April 2017 Database Bundle Patch (12.1.0.2.170418 DBBP) (Doc ID 2267842.1)


We fixed our issue by simply applying the BP on the unpatched home; we did not add the extra patch.

The key is to re-read the documentation several times. However, wouldn't it be nice if Oracle Support could send you a mail? They have records of everything you download anyway, judging by the sales people who call each time I download something new ;-)

That would be a great service!

Another great service would be to actually test patches, and to test whether a patch really is Data Guard Standby-First capable. Here, the follow-up patch was released a long time after the April bundle, which makes you wonder whether anyone actually tested the bundle upfront in a Data Guard environment and performed a switchover.

Applying April 2017 QFSDP 12.1.0.2 Exadata 12.2.1.1.1

UPDATE: see below for more info.

The customer I currently work for has Exadata X4, X5, and X6 machines, mostly in quarter or eighth rack configurations, running Exadata 12.1.2.3.2.160721 with OVM. That means every Exadata is divided into a couple of VMs. Pieter Van Puymbroeck and I have already talked about this setup a couple of times at user conferences.


You will be able to find the presentation online soon; link to be posted.


We decided to patch to Exadata 12.2.1.1.1.

After all, this was the second release of Exadata 12.2 ;-)



Why 12.2.1.1.1?


We want to start developing on DB 12.2 this year and make use of all the Exadata features (offloading, …), which is not the case if you run Exadata 12.1. From the note:


Exadata 12.2.1.1.1 release and patch (25512521) (Doc ID 2231247.1):

"Database servers using virtualization (OVM) require the following: Oracle Grid Infrastructure home in domU must be running release 12.1.0.2.161018 (Oct 2016) Proactive Bundle Patch or later before updating a domU to Exadata 12.2. All user domains (domUs) must be updated to Exadata 12.2 before updating the management domain (dom0) to Exadata 12.2."


This forced us to take a different approach, since that version requires DB + GI to be on at least the October 2016 12.1.0.2 BP and we were on the July BP. So we patched in the following order:
1. upgrade GI + DB to 12.1.0.2 APR 2017
2. upgrade the cells
3. upgrade the domUs
4. upgrade the dom0s
5. upgrade the IB switches
This went pretty smoothly on our X4 test system.
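For reference, the steps above map roughly onto patchmgr invocations like these; the group files and staging paths are illustrative examples, not our exact ones:

```shell
# Rolling cell patch, driven from a node with ssh equivalence to the cells.
./patchmgr -cells cell_group -patch -rolling

# Database node update (domUs first, then the dom0s) via the dbserver patchmgr.
./patchmgr -dbnodes dbs_group -upgrade \
    -iso_repo /u01/stage/p25512521_Linux-x86-64.zip \
    -target_version 12.2.1.1.1

# IB switches last.
./patchmgr -ibswitches ib_group -upgrade
```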
Our X6 system was something else: on that test/dev qualification machine there were about 10 VMs on each node, which made it pretty labour intensive.

I scripted as much as I could using dcli. We ran into a couple of issues on that Exadata (10 VMs …):
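dcli makes that kind of fan-out manageable; a hypothetical example (the vm_group file listing one VM hostname per line is an assumption):

```shell
# Run the same check on every VM in one go: -g points at the host-list
# file, -l sets the remote user.
dcli -g ~/vm_group -l root 'imageinfo -ver'
```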


  • a corrupt libserver.a file
  • snapshots that were still mounted on the OVM
  • patchmgr bricking an IB switch
  • an IB switch stuck in the pre-boot environment, rebooting
  • the IB switch patch making disk groups dismount on one node after a reboot (root cause analysis still ongoing)
All this meant that the patching we thought would be finished in about 12-14 hours lasted around 30 hours, with lots of lead time because of three severity 1 SRs that really didn't move despite their severity, not to mention the IB switch, which was patched more than a week after all the other components.

Libserver.a 


The libserver.a issue was resolved by copying a good version of the file into the $ORACLE_HOME/lib directory and reapplying the patch.
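In practice that came down to something like the following; the "good" home path and patch number are examples (any home at the same patch level with an intact archive will do):

```shell
# Replace the corrupt archive with a healthy copy, preserving ownership
# and permissions, then retry the failed patch step.
cd $ORACLE_HOME/lib
cp -p /u01/app/oracle/product/12.1.0.2/dbhome_good/lib/libserver.a .
$ORACLE_HOME/OPatch/opatch apply /u01/stage/25397136
```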

Mounted snapshot LVMs


Although support suggested dropping the LVM volume called LVMDoNotRemoveOrUse (size 1 GB), I didn't do that, for obvious reasons. Instead, I checked what was in the mounted snapshots and removed those. The issue is this: when a support engineer tells you to remove an LVM volume that is literally named LVMDoNotRemoveOrUse, that appears to be Exadata-internal, and that is present on every single Exadata we have, your confidence in the person helping you takes a hit.
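What I did instead, roughly (the snapshot and volume group names below are examples, not the real ones):

```shell
# List logical volumes: snapshot LVs show an 's' in the attribute column
# and name their origin volume.
lvs -o lv_name,lv_attr,origin,lv_size

# See which snapshots are still mounted, unmount them, and remove only
# those snapshot LVs -- never the origin volumes.
mount | grep snap
umount /mnt/snap_db01
lvremove /dev/VGExaDb/snap_db01
```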

IB Switch patch


You start checking the support responses for validity even more than you normally would, losing even more precious downtime-window time. Then finally Sunday evening arrived, and we would start patching what should be the easiest part of the whole rack, the one you don't touch much, the real black box: the IB switch.


We used patchmgr, and what happened next was not a pleasant experience: after the precheck completed successfully, patchmgr reported that the upgrade had FAILED.

As I was exhausted after the long patching session, I was confident I could log on with the console cable on Monday.

Well, that was too optimistic: the switch was completely dead, so I opened an SR in which I stated that a field engineer would be needed to replace it.

After waiting almost 8 hours before support was willing to believe the switch was bricked (while keeping me busy sending over info from the surviving switch), a field engineer was scheduled for the next day.

After a couple of hours the new switch was in the rack and patched. To our surprise, fwverify complained by default about file system permissions on the timezone files: they were set to 755 instead of 644.

However, this is due to a known bug:

"
16847481 Setting time zone changes permissions of time zone file.

 


After setting the time zone, the permissions of the time zone file are set to be highly restricted. Consequently, the fwverify command receives an error when it attempts to read the file.





Workaround: Open the permissions on the time zone file. After setting the time zone, become the root user and open the permissions for the time zone file:

# chmod 644 /conf/localtime

“
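Stripped of the switch specifics, the check-and-fix is trivial; a runnable illustration against a temp file instead of /conf/localtime:

```shell
# Simulate the complaint: a file whose mode drifted away from 644.
f=$(mktemp)
chmod 755 "$f"
stat -c '%a' "$f"   # prints 755: the mode fwverify trips over
chmod 644 "$f"      # the documented workaround
stat -c '%a' "$f"   # prints 644: fwverify is happy again
rm -f "$f"
```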

The field engineer left the switch on the same initial version as the surviving switch (2.1.8); the upgrade was again in our hands.




This time we took the manual upgrade route. This was way out of my comfort zone (a switch should just work), so my Exitas Solaris and ZFSSA specialist colleague Filip Francis proposed to help me. We followed the procedure to the letter and still ended up with a switch stuck in the pre-boot phase…

Luckily there was a note that described exactly our symptoms: Infiniband Gateway Switch Stays In Pre-boot Environment During Upgrade/Reboot (Doc ID 2202721.1).

On to switch ibb01… the same workaround was needed. We didn't dare to use patchmgr anymore on the other Exadatas we have patched so far.

To our surprise, although the subnet manager was running, after a reboot we had node evictions and disk group dismounts; investigations are still ongoing.

BTW: the IB patch is quite big; it brings the switch from CentOS 5.2, an 8-year-old version, to Oracle Enterprise Linux 6.7.



So last week we patched the production standby; more about this in a next blog post.



UPDATE: I was contacted by Mr. Kundersma, a Consulting Member of Technical Staff in the DB HA and MAA group, who asked me for more details about the LVM snapshot. He looked into the SR and very quickly found the root cause of the issue.

Big thumbs up to him and his team; thank you very much for reaching out!

Thank you to ExadataPM Gurmeet for reaching out on Twitter.



Thanks to Filip and Pieter Van Puymbroeck for your support.