Wednesday, November 30, 2011

changes in 11gR2 compared to 10gR2 (1)

While preparing for the upgrade from 10.2.0.3 to 11.2.0.3  I noticed an

interesting change from  11.2.0.2 on ;
if a datafile is unavailable for write the  database performs a shutdown abort unless _datafile_write_errors_crash_instance is set.

Before  if the database was in running archive log and the datafile in question was not a "system" one the datafile would just be put offline.... now the database stops.

Will test it soon and put some excepts of the alert.log




Friday, November 4, 2011

CPU count and parallelism



Last week we had  a weird issue, all of sudden a batch job, which runs monthy, took lots of time to finish, in fact it didn't finish at all since we decided to stop it.

First investgation showed that the explain plan of the query changed, this is normal and happens from time to time, this is what the CBO does, for good and sometimes as in this case for bad.

the explain plan used to execute the query used parallelism, which in our case wasn't beneficial at all.


this is how the plan looked like :



-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation                     | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|*  1 |  VIEW                         |                      |      1 |      1 |    26 |  4106   (1)| 00:00:50 |        |      |            |      1 |00:03:08.04 |     823K|    799K| 

PLAN_TABLE_OUTPUT 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|*  2 |   COUNT STOPKEY               |                      |      1 |        |       |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K| 
|*  3 |    TABLE ACCESS BY INDEX ROWID| R_J       |      1 |      1 |    21 |     4   (0)| 00:00:01 |        |      |            |      1 |00:03:08.04 |     823K|    799K| 
|*  4 |     INDEX RANGE SCAN          | R_J_EPOCH |      1 |      1 |       |     3   (0)| 00:00:01 |        |      |            |      1 |00:03:08.04 |     823K|    799K| 
|   5 |      SORT AGGREGATE           |                      |      1 |      1 |    15 |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K| 
|   6 |       PX COORDINATOR          |                      |      1 |        |       |            |          |        |      |            |      1 |00:03:08.04 |     823K|    799K| 
|   7 |        PX SEND QC (RANDOM)    | :TQ10000             |      1 |      1 |    15 |            |          |  Q1,00 | P->S | QC (RAND)  |      1 |00:03:08.03 |     823K|    799K| 
|   8 |         SORT AGGREGATE        |                      |      1 |      1 |    15 |            |          |  Q1,00 | PCWP |            |      1 |00:03:08.03 |     823K|    799K| 
|   9 |          PX BLOCK ITERATOR    |                      |      1 |   5227 | 78405 |  4102   (1)| 00:00:50 |  Q1,00 | PCWC |            |      5 |00:02:09.54 |     823K|    799K| 
|* 10 |           TABLE ACCESS FULL   | R_J       |      1 |   5227 | 78405 |  4102   (1)| 00:00:50 |  Q1,00 | PCWP |            |      5 |00:02:09.54 |     823K|    799K| 


sorry for the bad formatting ...




the normal plan looked like this :




PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------------------------------------------------
|*  1 |  VIEW                            |                      |      1 |      1 |    26 |  4337   (1)| 00:00:53 |      1 |00:00:00.01 |      12 |      2 |
|*  2 |   COUNT STOPKEY                  |                      |      1 |        |       |            |          |      1 |00:00:00.01 |      12 |      2 |
|*  3 |    TABLE ACCESS BY INDEX ROWID   | R_J       |      1 |      1 |    21 |     4   (0)| 00:00:01 |      1 |00:00:00.01 |      12 |      2 |
|*  4 |     INDEX RANGE SCAN             | R_J_EPOCH |      1 |      1 |       |     3   (0)| 00:00:01 |      1 |00:00:00.01 |      11 |      2 |
|   5 |      SORT AGGREGATE              |                      |      1 |      1 |    15 |            |          |      1 |00:00:00.01 |       8 |      2 |
|*  6 |       TABLE ACCESS BY INDEX ROWID| RJ       |      1 |   5227 | 78405 |  4333   (1)| 00:00:52 |      5 |00:00:00.01 |       8 |      2 |
|*  7 |        INDEX RANGE SCAN          | R_J_ID    |      1 |   5227 |       |    25   (0)| 00:00:01 |      6 |00:00:00.01 |       3 |      0 |

PLAN_TABLE_OUTPUT




The problem was that manualy reexcuting the query, didn't show the messed up plan, it took the right plan !!!


sometimes good plan sometimes bad plan ….
so i couldn't set 10053 event  because it wasn't reproducible…
During two days I tried to reproduce the issue but i couldn't.

 i tried it again with some random values when the load was higher
and now  could in a consistent way reproduce it.


By comparing the CBO traces is saw that the cpu_count was different .

A couple of weeks ago the sysadmin implemented solaris cpu pools to dynamically assign CPU resources, sometime more sometime less ....

this caused the optimizer to take the wrong plan when more than 11 cpu's were assigned....

Case closed ;-)