Friday, April 20, 2018

Setting up standby of PDB 12.2.0.1.170818

We have a single-tenant production database, and we wanted to create a standby database for it.

That worked perfectly fine:





SET DECRYPTION WALLET OPEN IDENTIFIED BY "password";
run
{
  allocate channel target01 type disk;
  allocate channel target02 type disk;
  allocate channel target03 type disk;
  allocate channel target04 type disk;
  allocate auxiliary channel aux01 type disk;
  allocate auxiliary channel aux02 type disk;
  allocate auxiliary channel aux03 type disk;
  allocate auxiliary channel aux04 type disk;
  duplicate target database for standby from active database password file section size 2000M;
}
However ..... my datafiles were not put in place correctly on the standby. On the primary the structure was:
+DATA/db_unique_name/DATAFILE 

+DATA/db_unique_name/pdb_guid/DATAFILE

On the standby, on the other hand, everything, including the PDB datafiles, ended up in
+DATA/db_unique_name/DATAFILE
After checking all my parameters I went looking on MOS; apparently this is a known issue:




Datafile on Standby Database is created under incorrect directory (Doc ID 2346623.1)
Patches available
Update:
Installed the patch and still the same issue! :-( Opening an SR.
Update 2: Twitter is fantastic. Piotr Wrzosek pointed me to this bug:

Bug 17613474 - INCONSISTENT OMF BEHAVIOUR WITH ALTER DATABASE MOVE DATAFILE
https://twitter.com/pewu78/status/987335828958564352
It seems to have been present for some time already; however, no patch seems to be available for our version ;(
Patch 24836489: DATAFILES ARE CREATED WRONG LOCATION IN OMF DEFINED PDB DATABASE



Update and solution: remove SECTION SIZE from the RMAN duplicate; the files are then created correctly. I need to roll back the patch to see if just removing SECTION SIZE is enough or if the above-mentioned patch 25576813 is still needed.
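For reference, the working duplicate ended up looking roughly like this (a sketch; same wallet and channel setup as the script above, just without SECTION SIZE):

SET DECRYPTION WALLET OPEN IDENTIFIED BY "password";
run
{
  allocate channel target01 type disk;
  allocate auxiliary channel aux01 type disk;
  duplicate target database for standby from active database password file;
}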

Thursday, April 12, 2018

QFSDP JANUARY 2018 on Exadata with OVM with more than 8 VMs: watch out

Last month we installed the OS-related, firmware, .... parts of the QFSDP JANUARY 2018 (12.2.1.1.6 to be more specific) on our test and dev systems at my customer.

GI and DB still need to be patched.

After our last patching experience (http://pfierens.blogspot.be/2017/06/applying-april-2017-qfsdp-12102-exadata.html) this could only go better.


Well, to put a very long story short: be cautious. We have now run into 4 different bugs, causing instability of the RAC clusters, GI that refused to start up, loss of InfiniBand connectivity ...


So the Monday after the patching we were hit by instability of our Exadata OVM infrastructure for test, dev and qualification: dom0s rebooting ....

There also seemed to be an issue with the IB interfaces in the domUs; unfortunately
we didn't have a crash dump, so support couldn't really do anything.


The only way to get GI and the DBs up again was to reboot the VM; crsctl stop crs and crsctl start crs didn't really work, and the logs showed IB issues.


Last time (I forgot to blog about that) we ran into the gnttab_max_frames issue, which we had set to 512. After this patching it was put back to 256, so we thought that might have been the reason, because in this release another parameter was introduced in grub.conf:



gnttab_max_maptrack_frames
gnttab_max_frames

The relation between the two was difficult to find, but in the end this turned out not to be the right diagnosis.

If you want some more information about gnttab_max_frames, please read this.
Put shortly: each virtual disk and each networking operation needs a number of grant frames to communicate; if this is not set correctly, you have issues ....
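On the dom0 these are kernel parameters on the Xen line in grub.conf; a sketch of what that looks like (the values here are purely illustrative, take the recommended ones from the relevant Exadata/MOS notes):

# /boot/grub/grub.conf on the dom0 -- Xen line only, other options omitted
kernel /xen.gz <existing xen options> gnttab_max_frames=512 gnttab_max_maptrack_frames=1024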


Luckily, on the Friday of that same week we were in the same situation again, and we decided to let the dom0 crash so that we would have a crash dump.

After uploading that crash dump to support, they were able to see that the issue was in the Mellanox HCA firmware layer. Between APR 2017 and January there were 4000 changes in that firmware, one of which (or a combination of which) caused our issue.



Bottom line: there seems to be an issue with the Mellanox HCA firmware (going from 2.11.1280 to 2.35.5532)
in this patch. You may encounter it if you have more than 8 VMs under one dom0; we had 9......
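A quick way to check which HCA firmware you are actually running (a sketch; dbs_group is our dcli group file for the database nodes):

# firmware version as reported by the HCA on each node
dcli -g dbs_group -l root "ibstat | grep -i 'firmware version'"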



So basically we shut down one VM on each node and had stability again.

When it was confirmed in numerous conf calls that 8 was the magic number, we decided to move the Exadata monitoring VM functionality to another VM and shut down the monitoring VM, to be back at 8 VMs.


We had a stable situation until last Friday, when we had an issue with both IB switches being unresponsive and the second switch not taking the SM master role. This issue is still under investigation and is hopefully not related to the QFSDP JAN 2018 ...



If you have similar symptoms, point support to these bugs:

  Bug 27724899 - Dom0 crashes with ib_umad_close with large no. of VMs 
  Bug 27691811 
  Bug 27267621 


UPDATE :

There seems to be a bug as well in IB switch version 2.2.7-1, solved in 2.2.10 (not released yet). Not everything is solved though: only the logging issue, not the main root cause; apparently there is a separate ER for this.






Sunday, April 1, 2018

to_dog_year on Exadata

One of the new features which seems to be overlooked in all the publications I saw about 18c is the TO_DOG_YEAR() function. It seems obvious that this was missed, because it is fairly undocumented, as Frank Pachot, Pieter Van Puymbroeck, Brendan Tierney, Martin Berger and Oyvind Isene pointed out.


I wanted to know how it behaved on Exadata, especially on my customer's OVM on Exadata. Tested version: Exadata 18.1.4.0.0. My first tries were not successful and still are not; I see quite some strange behaviour.
desc to_dog_year

FUNCTION TO_DOG_YEAR RETURNS NUMBER
 Argument Name        Type        In/Out Default?
 ------------------------------ ----------------------- ------ --------
 DOB                            DATE                    IN
 FEMALE                         BOOLEAN                 IN      DEFAULT
 NLS_BREAD                      VARCHAR2                IN      DEFAULT
 OWN_SMOKE                      BOOLEAN                 IN      DEFAULT  

I tried the following:
select to_dog_year(to_date('01-04-2008','DD-MM-YYYY'), 'LABRADOR') from dual;


and this raised the following:

ORA-07445: exception encountered: core dump [DOG_BREED_VIOLATION] [+34] [PC:0xDEADBEEF] [ADDR:0x14] [UNABLE_TO_BARK] [] 

When choosing another breed it worked, although it gave a pretty bizarre result:
select to_dog_year(to_date('28-03-2013','DD-MM-YYYY'), 'POODLE') a from dual;

a
-------------------------
vulnerability not found

While according to the pedigree it should be 50. And not only that, it should RETURN a NUMBER? WTH. OK, what happens when we try to run the function on cats?
select to_dog_year(to_date('01-04-2008','DD-MM-YYYY'), 'GARFIELD') a from dual;


a
------------------------
is this a dog ?

Oracle, you have some work to do. I would expect a number to be returned, not a string. Does anybody else with an Exadata see this behaviour, preferably Bare Metal? Cloud?


Update 2-APR-2018: Before you think that PIO stands for Pets In Oracle, a little update for those who didn't realize this was posted on the 1st of April. It was an April Fools' joke, a common idea from some Oracle Community buddies on the post-UKOUG_TECH17 trip. And what remains true all year is how this community is full of awesome people. Special thanks to Connor, who added great ideas here :)

Thursday, October 19, 2017

OUD and 12.2 how to get your db registered

We are slowly moving to 12.2; some new products will start off with this version, single tenant for the moment.

The customer uses OUD and Global Roles to manage centrally access to databases and database objects.

One of the first things that needs to be done is to register the database in OUD.

I tried to do it and then I got this message:




OK, we are a bit stuck here.
The database needs to be in OUD before we can use global roles and global users .... "There are no listeners associated with this database" doesn't help here ;-)
I tried different things to work around it: I adapted the listener.ora file with static registration, but that didn't work.

Finally, after opening an SR, the engineer suggested putting a <> listener.ora in /tmp and pointing TNS_ADMIN to that location. And yes, that solved the issue: I was able to register the database in OUD.


LISTENER=
  (DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=tcp)(HOST=exa*******)(PORT=1521))
      (ADDRESS=(PROTOCOL=ipc)(KEY=extproc))))
SID_LIST_LISTENER=
  (SID_LIST=
    (SID_DESC=
      (GLOBAL_DBNAME=c20*****.acme.com)
      (ORACLE_HOME=/u01/app/product/12.2.0.1/dbhome1)
      (SID_NAME=c20*****1)
    )
    (SID_DESC=
      (GLOBAL_DBNAME=pdb*****.acme.com)
      (ORACLE_HOME=/u01/app/product/12.2.0.1/dbhome1)
      (SID_NAME=c20*****1)
    )
)
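In practice the workaround boiled down to something like this before retrying the registration (a sketch; the /tmp location comes from the SR suggestion, and the dbca step is the one shown below):

cp listener.ora /tmp/      # the static-registration listener.ora shown above
export TNS_ADMIN=/tmp
dbca                       # retry the directory-service registration; silent syntax below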


BTW, it is very easy to do this in a silent way:

dbca -silent -configureDatabase -sourceDB c20v01d01 -registerWithDirService true -dirServiceUserName "cn=Directory Manager" -dirServicePassword "" -walletPassword "" -sysDBAPassword "" -sysDBAUserName sys





Must say this time I had a very good experience with Oracle Support


Thursday, June 22, 2017

To Data Guard First or Not to Data Guard First that is the question

In a previous post you could read about issues with IB switches and other problems with the APR QFSDP 12.1.

We had some more surprises.

All BPs we have installed so far are Data Guard Standby-First enabled, meaning you can install them on the standby, do a switchover, bring your new standby to the same level at your own pace and then run datapatch.
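As a rough outline (a sketch only, not the full MOS procedure; the broker target name STBY is a placeholder):

# 1. apply the BP binaries to the standby GI/DB homes with opatch (no datapatch yet)
# 2. switch over so the patched home becomes the primary:
DGMGRL> switchover to 'STBY';
# 3. patch the former primary (now the standby) at your own pace
# 4. load the SQL changes once, on the patched primary:
$ datapatch -verbose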


Well, we did that, as for all the other QFSDPs so far.

But after the switchover, all of a sudden our standbys didn't follow anymore and aborted recovery with an ORA-00600 ....

We clearly ran into this issue:

ORA-00600:[KTSLU_PUA_REMCHK-1] Could be generated after Applying April 2017 Database Bundle Patch (12.1.0.2.170418 DBBP) (Doc ID 2267842.1)

Note :




We fixed our issue by just applying the BP on the unpatched home; we didn't add the extra patch.

The key is to re-read the documentation several times. However, wouldn't it be nice if Oracle Support could send you a mail? They have records of everything you downloaded anyway, judging by the sales people that call each time I download new stuff ;-)

This would be a great service!

Another great service would actually be to test patches ... and to test whether a patch really is DG-first. Here the second patch was released a long time after the APR bundle, making you wonder if they actually tested this patch up front in a DG environment and did a switchover.

Applying April 2017 QFSDP 12.1.0.2 Exadata 12.2.1.1.1

UPDATE: see below for more info.

The customer I currently work for has Exadata X4, X5 and X6, mostly in 1/4 or 1/8 configurations, running Exadata 12.1.2.3.2.160721 with OVM. That means that every Exadata is divided into a couple of VMs. Pieter Van Puymbroeck and I have already talked about this a couple of times at user conferences.


You can find the presentation online soon; link to be posted.


We decided to patch to Exadata 12.2.1.1.1.

After all, this was the second release of Exadata 12.2 ;-)



Why 12.2.1.1.1?

We want to start developing on DB 12.2 this year and make use of all the Exadata features (offloading, ….), which is not the case if you run Exadata 12.1….. From the note:


Exadata 12.2.1.1.1 release and patch (25512521) (Doc ID 2231247.1):

Database servers using virtualization (OVM) require the following:
  •  Oracle Grid Infrastructure home in domU must be running release 12.1.0.2.161018 (Oct 2016) Proactive Bundle Patch or later before updating a domU to Exadata 12.2.
  •  All user domains (domUs) must be updated to Exadata 12.2 before updating the management domain (dom0) to Exadata 12.2.


This forced us to take a different approach, since that version required at least DB + GI 12.1.0.2 BP OCT 2016 and we were on 12.1.0.2 JULY, so we patched in the following order:
1. upgrade GI + DB to 12.1.0.2 APR 2017
2. upgrade the cells
3. upgrade the domUs
4. upgrade the dom0s
5. upgrade the IB switches
This went pretty smoothly on our X4 test system.
Our X6 system was something else: on that test/dev/qualification machine there were about 10 VMs on each node, which made it pretty labour intensive.

I scripted as much as I could using dcli (a sketch of the pattern I used follows below). We ran into a couple of issues on that Exadata (10 VMs ….):


  •  a corrupt libserver.a file
  •  snapshots that were still mounted on the OVM
  •  patchmgr that bricked the IB switch
  •  an IB switch that stayed in the pre-boot environment while rebooting
  •  an IB switch that made disk groups dismount on one node, basically whenever we rebooted; root cause analysis still ongoing
All this meant that the patching, which we thought would be finished in about 12-14h, lasted around 30h, with lots of lead time because of 3 SEV 1 SRs open which really didn't move despite being SEV 1 issues, not to speak of the IB switch, which was patched more than a week after all the other components.
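For completeness, the dcli pattern I used was roughly the following (a sketch; the group file name and commands are illustrative, not the exact scripts we ran):

# run the same command on every node listed in a group file
dcli -g dbs_group -l root 'imageinfo -ver'
# copy a prepared script to every node and execute it there
dcli -g dbs_group -l root -x precheck.sh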

Libserver.a 


The libserver.a issue was resolved by copying over a good version of that file to the $ORACLE_HOME/lib directory and reapplying the patch.
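Roughly what that fix looked like (a sketch; the source home path is illustrative, and the exact archive name, libserver.a versus libserver12.a, depends on the release):

# copy a healthy archive from an identical, unpatched home, then retry the patch
cp /u01/app/oracle/product/12.1.0.2/dbhome_good/lib/libserver12.a $ORACLE_HOME/lib/
$ORACLE_HOME/OPatch/opatch apply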

 Mounted snapshot LVMs 


Although support suggested dropping the LVM image called LVMDoNotRemoveOrUse (size 1g),

I didn't do that, for obvious reasons, and instead checked what was in the mounted snapshots and removed those. The issue is that when you get a support engineer stating that you should remove an LVM named LVMDoNotRemoveOrUse, which seems Exadata-internal and is present on every single Exadata we have, your confidence in the person helping you takes a hit.
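What I actually did was along these lines (a sketch; volume group, LV and mount point names are placeholders, and LVMDoNotRemoveOrUse is of course left alone):

# see which snapshot LVs exist and which are still mounted
lvs
mount | grep -i snap
# unmount and drop only the stale snapshot LVs
umount /mnt/<snapshot_mountpoint>
lvremove /dev/<vg_name>/<snapshot_lv>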

IB Switch patch


You start checking the support responses for validity even more than you normally do, hence losing even more precious downtime-window time….. Then finally Sunday evening was there and we would start patching the easiest part of the whole rack, the one that you don't touch too much, the real black box: THE IB switch.


We used patchmgr. What happened next was not a pleasant experience: after successfully completing the precheck, patchmgr said that the upgrade had FAILED?
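For reference, the patchmgr run was along these lines (a sketch; the switch list file name is ours, and this is run as root from a database node):

# precheck first, then the actual upgrade
./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck
./patchmgr -ibswitches ibswitches.lst -upgrade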

As I was exhausted after the long patching, I was confident I could log on with the console cable on Monday.

Well, that was too optimistic: it was completely dead, so I opened an SR in which I stated that a field engineer would be necessary to change it.

After waiting almost 8h before support wanted to believe me that the switch was bricked, while keeping me busy sending over info from the surviving switch, a field engineer was scheduled for the next day.

After a couple of hours the new switch was put in the rack and patched. To our surprise, by default fwverify complained about file permissions on the timezone file, which was set to 755 instead of 644:

However, this is due to a bug:

"
16847481 Setting time zone changes permissions of time zone file.

 


After setting the time zone, the permissions of the time zone file are set to be highly restricted. Consequently, the fwverify command receives an error when it attempts to read the file.





 Workaround: Open the permissions on the time zone file.

After setting the time zone, become the root user.

Open the permissions for the time zone file.

# chmod 644 /conf/localtime

“

The field engineer left the switch on the same initial version as the surviving switch (2.1.8); the upgrade was again in our hands.




This time we took the manual upgrade route. This was way out of my comfort zone (a switch should just work), so my Exitas Solaris and ZFSSA specialist colleague Filip Francis proposed to help me. We followed the procedure to the letter and ended up with a switch stuck in the pre-boot phase ……

Luckily there was a note that described this …. exactly our symptoms: Infiniband Gateway Switch Stays In Pre-boot Environment During Upgrade/Reboot (Doc ID 2202721.1).

On to switch ibb01 .... the same workaround was needed. We didn't dare to use patchmgr anymore on the other EXAs we have patched so far.

Hmm, to our surprise, although the subnet manager was running, after a reboot we had node evictions and disk group dismounts; investigations are still ongoing.

BTW: the IB patch is quite big; it brings the switch from CentOS 5.2, an 8-year-old version, to Oracle Enterprise Linux 6.7.



So last week we did the production standby; more about this in a next blog post.



UPDATE: I was contacted by Mr Kundersma, a consulting member of technical staff of the DB HA and MAA group, who asked me for more details about the LVM snapshot. He looked in the SR and very quickly came to the root cause of the issue.

Big thumbs up to him and his team; thank you very much for reaching out!!!

Thank you to ExadataPM Gurmeet for reaching out on Twitter



Thanks Filip and Pieter Van Puymbroeck for your support

Friday, November 25, 2016

restore tablespace in a PDB


As said earlier, I am currently doing a POC on multitenant 12.1.0.2 for my customer.
Unfortunately 12.1 and not 12.2, which seems to have really nice features, especially with regard to FLASHBACK and limiting resources in the containers ....


One of the tasks I am investigating is backup and recovery. Here are some findings so far.

Situation :

CDB called CDB

container called ORA1


I have used OMF this time on a regular filesystem, just to be able to easily remove datafiles while the DB is running ....
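(For reference, OMF on a plain filesystem in this POC is nothing more than the following setting; the path is my test location, not a recommendation:)

SQL> alter system set db_create_file_dest='/oradata';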




SQL> !rm /oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k2nnz6_.dbf



SQL> alter session set container = ora1

  2  /



Session altered.



SQL>  select * from phil.test;



no rows selected



SQL>

SQL> create table phil.test2 as select * from dba_objects;

create table phil.test2 as select * from dba_objects

                                                   *

ERROR at line 1:

ORA-01116: error in opening database file 10

ORA-01110: data file 10:

'/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k2nnz6_.dbf'

ORA-27041: unable to open file

Linux-x86_64 Error: 2: No such file or directory

Additional information: 3






OK, the file is gone! Yippee.

Next step: connect to RMAN.

OK, connected to the CDB:




$ rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Nov 25 15:18:00 2016

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database: CDB (DBID=2009742972)

RMAN> restore tablespace ora1:users;

Starting restore at 25-NOV-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=17 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00010 to /oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k2nnz6_.dbf
channel ORA_DISK_1: reading from backup piece /oraarch/bck/CDB_0grlqcro_1_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 11/25/2016 15:18:11
ORA-19870: error while restoring backup piece /oraarch/bck/CDB_0grlqcro_1_1
ORA-19573: cannot obtain exclusive enqueue for datafile 10

RMAN>

RMAN>

RMAN> alter database datafile 10 offline;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of sql statement command at 11/25/2016 15:19:35
ORA-01516: nonexistent log file, data file, or temporary file "10"



RMAN> exit



OK, the CDB is not aware of datafile 10 ....

Connect to SQL*Plus:



 sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Fri Nov 25 15:21:54 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container = ora1;

Session altered.

SQL>  alter database datafile 10 offline;

Database altered.
OK, back to RMAN:



rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Nov 25 15:22:44 2016

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database: CDB (DBID=2009742972)

RMAN> restore tablespace ora1:users;

Starting restore at 25-NOV-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=267 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00010 to /oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k2nnz6_.dbf
channel ORA_DISK_1: reading from backup piece /oraarch/bck/CDB_0grlqcro_1_1
channel ORA_DISK_1: piece handle=/oraarch/bck/CDB_0grlqcro_1_1 tag=TAG20161125T151456
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 25-NOV-16

RMAN> recover tablespace ora1:users;

Starting recover at 25-NOV-16
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 25-NOV-16

RMAN> exit


OK, tablespace restored and recovered.

Where are the datafiles now?




SQL> alter session set container=ora1;

Session altered.

SQL> select name from v$datafile;

NAME
--------------------------------------------------------------------------------
/oradata/CDB/datafile/o1_mf_undotbs1_c1yz8bk7_.dbf
/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_system_c1yzs31m_.dbf
/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_sysaux_c1yzs31t_.dbf

/oradata/CDB/datafile/o1_mf_users_d3k7f3tk_.dbf
OK, so the users tablespace was restored into the CDB root directory ....
instead of here:




/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k2nnz6_.dbf


Hmm, let's check whether we see the same when connecting to the PDB instead of the CDB while recovering.

So, same scenario: remove the datafile ...


[oracle@12cR1 datafile]$ rman target sys@ora1

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Nov 25 15:42:42 2016

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

target database Password:
connected to target database: CDB (DBID=2009742972)

RMAN> restore tablespace users;

Starting restore at 25-NOV-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=21 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00010 to /oradata/CDB/datafile/o1_mf_users_d3k7f3tk_.dbf
channel ORA_DISK_1: reading from backup piece /oraarch/bck/CDB_0grlqcro_1_1
channel ORA_DISK_1: piece handle=/oraarch/bck/CDB_0grlqcro_1_1 tag=TAG20161125T151456
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 25-NOV-16

RMAN> recover tablespace users;

Starting recover at 25-NOV-16
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 25-NOV-16

RMAN> alter tablespace users online;

Statement processed

Okay, restored to where they were originally stored:
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select name from v$datafile;

NAME
--------------------------------------------------------------------------------
/oradata/CDB/datafile/o1_mf_undotbs1_c1yz8bk7_.dbf
/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_system_c1yzs31m_.dbf


/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_sysaux_c1yzs31t_.dbf

/oradata/CDB/2223196C0BE77A23E053144EA8C066EC/datafile/o1_mf_users_d3k8lbts_.dbf

For me, while this makes sense, I see lots of potential issues with it ;) so I hope it was worthwhile to note this down.