2007/09/06

the trouble with blind faith

Oracle 10g introduced the new job scheduler (dbms_scheduler), presumably to replace the previous dbms_job functionality.


I wish someone had told all RDBMS developers about that, because there are still a lot of functions inside Oracle that use the previous dbms_job. But that is not the subject of this entry.


Now, one of the new pieces of functionality that makes use of the scheduler is the "maintenance window" and its automatic optimizer statistics gathering.


This can be a very convenient piece of functionality that automatically and regularly inspects the database and determines which objects need statistics gathering and then proceeds to do just that: gather them stats.


It uses the existence - or absence - of statistics for an object as one criterion to trigger stats gathering. It also uses "staleness of stats" as the other criterion, defined by default as "more than 10% of the table's rows have changed, therefore the stats must be stale".
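For the curious, here is roughly how you can see for yourself what the auto job would consider "stale" - a minimal sketch against the dba_tab_modifications and dba_tables views; the schema name is made up, adjust to taste:

-- flush the in-memory DML monitoring counters first, so the numbers are current
-- (table monitoring is on by default in 10g)
exec dbms_stats.flush_database_monitoring_info

-- roughly the input the auto job works from: how much each table has changed
-- since its stats were last gathered (the 10% rule)
select m.table_name,
       t.num_rows,
       m.inserts, m.updates, m.deletes, m.truncated,
       round(100 * (m.inserts + m.updates + m.deletes)
             / greatest(nvl(t.num_rows, 0), 1), 1) as pct_changed
  from dba_tab_modifications m
  join dba_tables t
    on t.owner = m.table_owner
   and t.table_name = m.table_name
 where m.table_owner = 'ETL_OWNER'        -- made-up schema name
   and m.partition_name is null           -- table-level rows only
 order by pct_changed desc;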


But like any automatic process that relies on fixed boundaries, it is only as smart as the criteria used to define those boundaries in the first place!


It is therefore up to each dba to help it along and ensure it can do the best job possible within its constraints. And I wish Oracle would make this a lot clearer in their documentation...


Now, what can possibly go wrong with a u-beaut automagic process, in line with the "we don't need no theenking dbas" mantra?


Well, let's look at this very simple and yet very common scenario:


1- the default maintenance window kicks in at 10pm every night and goes to 6am - and all day on weekends. During that window, a gather stats job runs looking for stats to update (you can check the window and job definitions with the queries sketched just after this list).


2- your site is a DW that has constant - 24X7 - ETL jobs running.


3- part of these jobs involves truncating work tables, filling them up with intermediate data, applying all sorts of calculations to that new data and then aggregating results into dimension tables for later processing of fact data.
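As promised, a quick peek at what the scheduler has actually set up - a minimal sketch against the 10g dictionary views; the window and job names mentioned in the comments are the stock 10g ones:

-- the default maintenance windows (10g ships with WEEKNIGHT_WINDOW,
-- 22:00 for 8 hours Mon-Fri, and WEEKEND_WINDOW covering the weekend)
select window_name, repeat_interval, duration, enabled
  from dba_scheduler_windows;

-- and the automatic stats job itself, GATHER_STATS_JOB, which runs
-- whenever a window in the maintenance window group opens
select job_name, schedule_name, enabled
  from dba_scheduler_jobs
 where job_name = 'GATHER_STATS_JOB';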


Hmmmm... So what does the gather stats job do when it hits one of these intermediate tables that, by coincidence, just had a truncate done on it?


Bingo! It analyzes it, because it has changed by more than 10% and now its stats must be "stale".


Now of course, when the ETL job starts pumping data into these tables and starts the aggregation and analysis phase, the optimizer is going to come up with some very funny execution plans, based on the assumption that these tables are empty!


If you then inspect the execution plans of the statements involved, you will typically find that any joins on these tables all of a sudden start using "merge join cartesian" and other nice constructs that work oh-so-efficiently on fairly loaded tables...
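If you want to catch this in the act, something along these lines will do - the table and column names are hypothetical, but the tell-tale signs are the tiny row estimates and the merge join cartesian step:

explain plan for
select /* hypothetical ETL aggregation step */
       d.customer_key, sum(w.amount)
  from work_dim_load w
  join dim_customer d
    on d.customer_id = w.customer_id
 group by d.customer_key;

select * from table(dbms_xplan.display);

-- with "empty" stats on work_dim_load the plan typically shows Rows=1
-- estimates and a MERGE JOIN CARTESIAN where a hash join was expected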


Of course, this is where blind faith in automation must stop and analytical thought should kick in.


Three immediate and obvious ways of addressing this problem, each sketched in code after the list:


1- Change your ETL processes to re-analyze the intermediate tables immediately after re-loading them with new data. This is my preferred approach since release 7 of Oracle as it puts the solution to the problem right at the source of it, meaning I don't have to figure out complex timing or scheduling to work around the problem. This is of course only possible if you have access to the ETL code.


2- Use the "stats lockout" feature of 10g whereby you can specify a lock, object by object, that will prevent auto stats gathering jobs from acting on these objects. You can override these locks with your own execution of the stats gathering procedure, tailored to each table and timed to perfection: when the tables have significant amounts of data in them. I'm starting to like this approach more and more.


3- Remove the stats from the tables altogether and rely on dynamic sampling to gather sample stats at SQL runtime. The BIG disadvantage of this method of course is that the dynamic sampling has a finite overhead and is volatile: it won't survive between SQL statements. So it will have to be done for every statement that needs to access the tables. If you have a lot of these statements, you have a lot of overhead right there. I prefer to leave dynamic sampling to where I feel it might be more useful: ad hoc and volatile DSS systems, rather than deterministic, scheduled ETL jobs.
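And here is the promised sketch of the three options in PL*SQL. This is illustration only: the schema and table names (ETL_OWNER, WORK_DIM_LOAD) are made up, and the parameters would of course be tailored to each table and site.

-- option 1: re-gather the stats from within the ETL job, straight after
-- each reload of the work table
begin
  dbms_stats.gather_table_stats(
    ownname          => 'ETL_OWNER',
    tabname          => 'WORK_DIM_LOAD',
    estimate_percent => dbms_stats.auto_sample_size,
    cascade          => true);
end;
/

-- option 2: gather representative stats once, when the table is nicely
-- loaded, then lock them so the automatic job leaves the table alone
begin
  dbms_stats.gather_table_stats('ETL_OWNER', 'WORK_DIM_LOAD');
  dbms_stats.lock_table_stats('ETL_OWNER', 'WORK_DIM_LOAD');
end;
/
-- a manual gather, timed by you, can still override the lock:
-- exec dbms_stats.gather_table_stats('ETL_OWNER', 'WORK_DIM_LOAD', force => true)

-- option 3: drop the stats, lock the (now absent) stats so they stay gone,
-- and let dynamic sampling size the table at parse time
begin
  dbms_stats.delete_table_stats('ETL_OWNER', 'WORK_DIM_LOAD');
  dbms_stats.lock_table_stats('ETL_OWNER', 'WORK_DIM_LOAD');
end;
/
-- if the system default sampling level isn't enough, hint the statements:
-- select /*+ dynamic_sampling(w 4) */ ... from work_dim_load w ...

Note the ordering in option 3: delete first, then lock - once the stats are locked, deleting them would itself require force => true.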


Take your pick of whichever one solves your case best, or use a combination of them if need be.


But never, ever believe that blind faith in an automatic process will produce ideal results in every case.



Speaking of ETL:




This is what opal miners use to sift through the large amounts of rubble they dig out of the shafts: the lot goes through the tube and is circulated in the big barrel. The heavy stuff drops to the bottom of the barrel and is removed every few minutes. The light stuff is pumped out to the big fan and the atmosphere. Put simply: a BIG vacuum cleaner! Quite "thunderdome", don't you think?





My first view of one of the nicest churches anywhere in the world: St Stephan, in downtown old Vienna.




and another view of local peace and quiet. Man, I wish I could afford this house...



Catchyalata, folks!

5 Comments:

Blogger SydOracle said...

A slight variation on (1) is to copy out the table stats once (at a time when the table is filled) and then load them back in when they are needed.
It would be faster if the fresh analyze takes a significant time.

Thursday, September 06, 2007 6:23:00 am  
Blogger Noons said...

gary:

yes, that works. but if I have to go to all that trouble, then I prefer to just gather stats once and then lock them in place so the automatic stats don't catch the table at an awkward moment.

Of course, all this should be done only for very specific instances. This is not a general purpose panacea!

Thursday, September 06, 2007 11:13:00 am  
Blogger Joel Garry said...

A few years back, there was this thing called "Street of Dreams" where developers would get together and build dream houses along a street, then open them up for tours before selling them. One house had an entire wall covered with a slab of opal substrate, full of raw opals. It was situated at the top of the stairs, with a clerestory window allowing sunlight to shine directly on it. Coming up the stairs was impressive, to say the least.

The house that Beavis and Butthead built (IIRC).

Saturday, September 08, 2007 4:31:00 am  
Blogger DomBrooks said...

> there are still a lot of functions inside Oracle that use the previous dbms_job
In terms of bespoke developed functionality, dbms_scheduler is not really suitable to be used to create a job as part of a bigger transaction as it autocommits.

Monday, September 10, 2007 6:02:00 pm  
Blogger Noons said...

Joel:

Nice!


Dom:

yes, you're absolutely right. Although I'm still of the mind that having two different mechanisms - different doco, different learning curve, different issues, different limitations - to implement what boils down to batch job functionality is probably not the best way to approach this.

It would have been much better if the option to commit or not was a toggle and scheduler - being the new product - had been designed to operate both ways.

As is, we now have what amounts to two different products to implement the same functionality...

Monday, September 10, 2007 10:05:00 pm  
