Wednesday, December 28, 2011

upgrading a RAC database from 10.2.0.3 to 11.2.0.3

I am busy planning, testing and working out the upgrade plan of a business-critical RAC database at my current client. One of the requirements is that at all times there is at least one standby per site (they have two sites).
Stay tuned for more in the coming weeks/months.



Wednesday, November 30, 2011

changes in 11gR2 compared to 10gR2 (1)

While preparing for the upgrade from 10.2.0.3 to 11.2.0.3 I noticed an interesting change introduced in 11.2.0.2:
if a datafile becomes unavailable for writes, the database now performs a shutdown abort unless _datafile_write_errors_crash_instance is set to false.

Before, if the database was running in archivelog mode and the datafile in question was not a "system" one, the datafile would simply be taken offline... now the whole instance stops.

I will test it soon and post some excerpts of the alert.log.
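For reference, the current value of the hidden parameter can be checked as SYS; hidden (underscore) parameters are not exposed in v$parameter, so this is a sketch using the x$ fixed tables:

```sql
-- run as SYS: hidden parameters live in the x$ fixed tables
select i.ksppinm  as parameter,
       v.ksppstvl as value
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_datafile_write_errors_crash_instance';
```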




Friday, November 4, 2011

CPU count and parallelism



Last week we had a weird issue: all of a sudden a batch job, which runs monthly, took a very long time to finish; in fact it didn't finish at all, since we decided to stop it.

A first investigation showed that the explain plan of the query had changed. This is normal and happens from time to time; it is what the CBO does, for good and sometimes, as in this case, for bad.

The plan used to execute the query used parallelism, which in our case wasn't beneficial at all.


This is how the plan looked:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|*  1 |  VIEW                         |                      |      1 |      1 |    26 |  4106   (1)| 00:00:50 |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|*  2 |   COUNT STOPKEY               |                      |      1 |        |       |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|*  3 |    TABLE ACCESS BY INDEX ROWID| R_J                  |      1 |      1 |    21 |     4   (0)| 00:00:01 |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|*  4 |     INDEX RANGE SCAN          | R_J_EPOCH            |      1 |      1 |       |     3   (0)| 00:00:01 |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|   5 |      SORT AGGREGATE           |                      |      1 |      1 |    15 |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|   6 |       PX COORDINATOR          |                      |      1 |        |       |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K|
|   7 |        PX SEND QC (RANDOM)    | :TQ10000             |      1 |      1 |    15 |            |          |  Q1,00 | P->S | QC (RAND)  |      1 |00:03:08.03 |     823K|    799K|
|   8 |         SORT AGGREGATE        |                      |      1 |      1 |    15 |            |          |  Q1,00 | PCWP |            |      1 |00:03:08.03 |     823K|    799K|
|   9 |          PX BLOCK ITERATOR    |                      |      1 |   5227 | 78405 |  4102   (1)| 00:00:50 |  Q1,00 | PCWC |            |      5 |00:02:09.54 |     823K|    799K|
|* 10 |           TABLE ACCESS FULL   | R_J                  |      1 |   5227 | 78405 |  4102   (1)| 00:00:50 |  Q1,00 | PCWP |            |      5 |00:02:09.54 |     823K|    799K|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------






The normal plan looked like this:




------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------------------------------------------------
|*  1 |  VIEW                            |                      |      1 |      1 |    26 |  4337   (1)| 00:00:53 |      1 |00:00:00.01 |      12 |      2 |
|*  2 |   COUNT STOPKEY                  |                      |      1 |        |       |            |          |      1 |00:00:00.01 |      12 |      2 |
|*  3 |    TABLE ACCESS BY INDEX ROWID   | R_J                  |      1 |      1 |    21 |     4   (0)| 00:00:01 |      1 |00:00:00.01 |      12 |      2 |
|*  4 |     INDEX RANGE SCAN             | R_J_EPOCH            |      1 |      1 |       |     3   (0)| 00:00:01 |      1 |00:00:00.01 |      11 |      2 |
|   5 |      SORT AGGREGATE              |                      |      1 |      1 |    15 |            |          |      1 |00:00:00.01 |       8 |      2 |
|*  6 |       TABLE ACCESS BY INDEX ROWID| R_J                  |      1 |   5227 | 78405 |  4333   (1)| 00:00:52 |      5 |00:00:00.01 |       8 |      2 |
|*  7 |        INDEX RANGE SCAN          | R_J_ID               |      1 |   5227 |       |    25   (0)| 00:00:01 |      6 |00:00:00.01 |       3 |      0 |
------------------------------------------------------------------------------------------------------------------------------------------------------------




The problem was that manually re-executing the query didn't show the messed-up plan; it took the right plan!

Sometimes good plan, sometimes bad plan...
So I couldn't set the 10053 event, because it wasn't reproducible.
For two days I tried to reproduce the issue, but I couldn't.

Then I tried it again with some random values while the load was higher,
and now I could reproduce it in a consistent way.
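Once it was reproducible, the optimizer trace could be captured with the standard 10053 event syntax:

```sql
-- capture a CBO trace for the problem statement in the current session
alter session set events '10053 trace name context forever, level 1';
-- ... run (or re-parse) the problem statement here ...
alter session set events '10053 trace name context off';
-- the trace file is written to user_dump_dest
```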


By comparing the CBO traces I saw that the cpu_count was different.

A couple of weeks ago the sysadmin had implemented Solaris CPU pools to dynamically assign CPU resources: sometimes more, sometimes less...

This caused the optimizer to pick the wrong plan whenever more than 11 CPUs were assigned.
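Since cpu_count is derived from the hardware the instance sees (and so changes when CPUs are added to or removed from the pool), the value the optimizer is costing with can be checked directly:

```sql
-- cpu_count feeds the default degree of parallelism the CBO uses
select name, value
from   v$parameter
where  name = 'cpu_count';
```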

Case closed ;-)



Wednesday, September 14, 2011

ORA-* errors during rediscovery of host : grid control

While installing the Oracle 10.2.0.5 agent and rediscovering the host, I received several errors in the monitored instances:


< ORA-00603: ORACLE server session terminated by fatal error
< ORA-27504: IPC error creating OSD context
< ORA-27300: OS system dependent operation:IPC init failed with status: 65
< ORA-27301: OS failure message: Package not installed
< ORA-27302: failure occurred at: skgxpcini
< ORA-27303: additional information: libskgxpd.so called

Further investigation pointed me to these Metalink notes:

Ora-00603 When Invoking Sqlplus Via Grid Control OS Command Job (Doc ID 1327675.1)
ORA-7445 [KSXPSIGOSDERR()+252] ERROR IN A RAC DATABASE (Doc ID 984963.1)



I basically adapted the emctl script of the agent and just added the following: 

LD_LIBRARY_PATH_64=$ORACLE_HOME/lib:$LD_LIBRARY_PATH_64

I did this straight after this section of the script

#
# Make sure certain environment variables are set
#
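The effect of that one line is simply to put the full $ORACLE_HOME/lib ahead of the default 64-bit library path, so the real skgxp library is found instead of the libskgxpd.so stub. A minimal sketch of what the added line does (the agent home path here is an assumption, adjust to your installation):

```shell
# assumption: the agent's ORACLE_HOME; adjust to your installation
ORACLE_HOME=/u01/app/oracle/agent10g
LD_LIBRARY_PATH_64=${LD_LIBRARY_PATH_64:-}
# the line added to emctl: prepend the home's lib dir to the 64-bit lib path
LD_LIBRARY_PATH_64=$ORACLE_HOME/lib:$LD_LIBRARY_PATH_64
export LD_LIBRARY_PATH_64
echo "$LD_LIBRARY_PATH_64"
```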

Tuesday, August 23, 2011

rman catalog start with changes the incarnation

Today I encountered a very weird, unexpected issue on our 10gR2 (10.2.0.3 EE) database running on Solaris SPARC 64-bit.

I wanted to clean up some archives from another incarnation in the archive disk group, so I decided to catalog them in RMAN; the plan was to delete them afterwards.

Everything went fine, until I saw this in the alert.log of the primary:


New incarnation branch detected in ArchiveLog, filename +ARCHDG1/DBPRODNL/archivelog/2011_05_26/thread_1_seq_837.8222.752112659
Inspection of file changed rdi from 1 to 4
Setting recovery target incarnation to 4
Mon Aug 22 14:57:11 2011
Setting recovery target incarnation to 4

RMAN list incarnation and v$database_incarnation also showed that we had moved to another incarnation:

DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN   Reset Time
------- ------- -------- ---------------- ------- ----------- ----------
1       1       DBPROD   2789041670       PARENT  1           29-NOV-07
5       5       DBPROD   2789041670       ORPHAN  6238221940  13-MAY-11
4       4       DBPROD   2789041670       CURRENT 6274083013  17-MAY-11
2       2       DBPROD   2789041670       ORPHAN  6468809687  20-JUN-11
3       3       DBPROD   2789041670       ORPHAN  6508613920  29-JUN-11

v$database_incarnation showed similar output.
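The same information can be pulled from the view directly (standard columns of v$database_incarnation):

```sql
-- one row per incarnation known to the controlfile
select incarnation#, status, resetlogs_change#, resetlogs_time
from   v$database_incarnation
order  by incarnation#;
```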

As soon as the primary found the archives, the RFS started to send them to the standby database, which immediately stopped recovery:

Mon Aug 22 15:04:38 2011
RFS[96]: Archived Log: '+ARCHDG1/DBPRODBE/archivelog/2011_08_22/thread_2_seq_42.11513.759855877'
Mon Aug 22 15:06:41 2011
RFS[94]: Archived Log: '+ARCHDG1/DBPRODBE/archivelog/2011_08_22/thread_2_seq_2.11991.759856001'
Mon Aug 22 15:06:43 2011
RFS[96]: Archived Log: '+ARCHDG1/DBPRODBE/archivelog/2011_08_22/thread_2_seq_1.11962.759856003'
Mon Aug 22 15:07:56 2011
MRP0: Incarnation has changed! Retry recovery...
Mon Aug 22 15:07:56 2011
Errors in file /u01/oracle/admin/DBPRODBE/bdump/DBPRODn1_mrp0_27062.trc:
ORA-19906: recovery target incarnation changed during recovery
Mon Aug 22 15:07:56 2011
Managed Standby Recovery not using Real Time Apply
Mon Aug 22 15:07:59 2011
Recovery interrupted!
Recovered data files to a consistent state at change 6792290587

Oh crap, now the standbys are not following anymore...

What does the incarnation look like on the standby?

RMAN> list incarnation;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN   Reset Time
------- ------- -------- ---------------- ------- ----------- ----------
1       1       DBPROD   2789041670       PARENT  1           29-NOV-07
2       2       DBPROD   2789041670       CURRENT 6274083013  17-MAY-11


After further investigation on the primary I noticed that the resetlogs time in v$database did not match the resetlogs time of the current incarnation reported by v$database_incarnation and list incarnation.


select RESETLOGS_TIME from v$database;

RESETLOGS
---------
29-NOV-07

That resetlogs time matches incarnation 1, while list incarnation in RMAN on the primary gives us the following:


Copyright (c) 1982, 2005, Oracle. All rights reserved.

connected to target database: DBPROD (DBID=2789041670)

RMAN> list incarnation;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN   Reset Time
------- ------- -------- ---------------- ------- ----------- ----------
1       1       DBPROD   2789041670       PARENT  1           29-NOV-07
5       5       DBPROD   2789041670       ORPHAN  6238221940  13-MAY-11
4       4       DBPROD   2789041670       CURRENT 6274083013  17-MAY-11
2       2       DBPROD   2789041670       ORPHAN  6468809687  20-JUN-11
3       3       DBPROD   2789041670       ORPHAN  6508613920  29-JUN-11

RMAN> exit



Weird things were happening; in the end, on the standby, we decided to change the incarnation back to the first one:

rman target /

RMAN> reset database to incarnation 1;



After which we could re-enable the recovery, and the application of redo continued.

The primary is still in a somewhat ambiguous state; Oracle support is working on this to see how we can correct it, but it is working correctly for the moment.

I'll keep you posted on the explanation of why exactly this happened...


UPDATE :

After lots of updates on the SR, Oracle support told me this is basically the expected behaviour: if RMAN finds archives of another, newer incarnation while cataloging, it will change the incarnation of the database.
The workaround for my issue with the primary is to reset the incarnation to 1 in RMAN.


Thursday, July 7, 2011

adding tnsping to the instant client 11.2

Apparently Oracle removed tnsping from the instant client 11.2.0.2.
This very handy troubleshooting command is crucial in our environment, so we tried to add it from a full-blown Oracle client installation.

In order to add tnsping to the instant client you must copy the tnsping binary to the bin directory of the instant client.

However, you also need to copy the
$ORACLE_HOME/network/mesg directory to the network subdirectory of the instant client.

Also, if you want to make use of LDAP name resolution, you need to copy the

$ORACLE_HOME/ldap directory to your instant client directory,
and the $ORACLE_HOME/lib directory to your instant client directory.
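The steps above can be sketched as a small script. The paths are assumptions (a full client in $ORACLE_HOME, the instant client in $IC_HOME); the mock layout under /tmp is only there so the sketch is self-contained:

```shell
ORACLE_HOME=${ORACLE_HOME:-/tmp/full_client}        # assumption: full client home
IC_HOME=${IC_HOME:-/tmp/instantclient_11_2}         # assumption: instant client dir

# mock a minimal full-client layout so this sketch can run standalone
mkdir -p "$ORACLE_HOME/bin" "$ORACLE_HOME/network/mesg" "$ORACLE_HOME/ldap" "$ORACLE_HOME/lib"
touch "$ORACLE_HOME/bin/tnsping"

# the actual steps from the post
mkdir -p "$IC_HOME/bin" "$IC_HOME/network"
cp    "$ORACLE_HOME/bin/tnsping"     "$IC_HOME/bin/"       # the tnsping binary
cp -r "$ORACLE_HOME/network/mesg"    "$IC_HOME/network/"   # message files tnsping needs
cp -r "$ORACLE_HOME/ldap"            "$IC_HOME/"           # only needed for LDAP lookup
cp -r "$ORACLE_HOME/lib"             "$IC_HOME/"
```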

Thursday, June 16, 2011

how to incrementally apply a backup on a data guard database

The Oracle documentation explains very nicely how to use RMAN incremental backups to roll forward a physical standby database (see here). What it doesn't tell you, however, is what to do if new datafiles were created in the meantime.
At least on 10.2.0.3, the procedure didn't work if new datafiles were created between the current_scn of the standby and the actual SCN of the primary.

To check if this is the case:

select file#, name from v$datafile where creation_change# > THE_SCN_FOUND;

One could think that the following would work:

back up the datafiles listed above and restore them on the standby.

However, the switch database to copy; then complains with the following:

RMAN-03002: failure of switch to copy command at 06/16/2011 21:17:53
RMAN-06571: datafile 736 does not have recoverable copy

"Just use catalog start with", you might think. Well, think again: it doesn't work.

What does work:

catalog datafilecopy 'datafile_names_of_the_datafiles_restored'

and then the switch database to copy; succeeds.
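Put together, the working sequence on the standby looks like this (the copy name is a placeholder, not an actual value from this system):

```
-- on the standby, after restoring the copies of the new datafiles:
RMAN> catalog datafilecopy '<full_name_of_restored_datafile_copy>';
RMAN> switch database to copy;
-- then continue with the documented incremental roll-forward procedure
```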