On Monday, August 04, 2014 10:38:57 PM Alan McKinnon wrote:
> On 04/08/2014 21:46, J. Roeleveld wrote:
> > On 4 August 2014 15:35:41 CEST, Alan McKinnon <alan.mckin...@gmail.com> > 
>> Either make the ETL tool pick up where it stopped and continue as it is
> >> the only that knows what it was doing and how far it got. Or, wrap the
> >> entire script in a single transaction.
> > 
> > Alan,
> > 
> > That would be the ideal solution.
> 
> You have the same concerns I do - how do you make a transaction around
> 500 million rows. So I asked the in-house expert - Mrs Alan :-)

Have a very large temporary tablespace on the database server.

> > However, a single transaction dealing with around 500,000,000 rows will
> > get me shot by the DBAs :) (Never mind that the performance of this will
> > be such that having it all done by an office full of secretaries might be
> > quicker.)
> She reckons an ETL job *must* be self-contained; if it isn't then it's
> broken by design. It must be idempotent too, which can be as simple as
> "Truncate, Load, Commit"

Most common tactic (done by humans):
- delete from <target table> where INS_PCS_ID = <crashed run-id>;
- update target table set VLD_TO = null where UPD_PCS_ID = <crashed run-id>;
Then, restart the crashed run-id.

For this, you need to know which command failed to know where to find the 
actual run-id you need to roll back.

> > Having the ETL process clever enough to be able to pick up from any point
> > requires a degree of forward thinking and planning that is never done in
> > real life. I would love to design it like that as it isn't too difficult.
> > But I always get brought into these projects when implementing these
> > structures will require a full rewrite and getting the original
> > architects to admit their design can't be made restartable without human
> > intervention.
> I agree with that design actually - it's the job of the hardware and OS
> guys to make stuff reliable that the application layer can rely on. When
> a SAN connection goes away, it usually comes back and the app layer just
> carries on (never mind that it retried 100 times meanwhile).

Yes, until you find out the clustered FS being used causes the crashes... 
(Yes, been in that situation...)

> Sometimes this doesn't work out. The easiest, cheapest and quickest way
> to handle it is to just restart the whole job from the beginning. This
> offends the engineer in us sometimes, but it really is the best way and
> all of Unix is built on this very idea :-)

Which is generally done. Usually, requiring a manual clean up prior to 
restart. If done properly, the ETL process has the capability to roll back the 
failed run prior to redoing it.
This, however, requires extensive planning and design at the initial 
implementation phase.

> If the SAn goes away too often and it causes issues, the manybe the best
> approach is to get the SAN and facilities guys to get their act together

Instead of finger-pointing.

> > At which point the business simply says it is acceptable to have people do
> > a manual rollback and restart the schedules from wherever it went wrong.
> Exactly. One of the few cases where business has the correct idea.
> There's only some many pennies to spend and so many dollars to be delivered.

Nightly processes that fail and then have to wait for the day-shift to arrive 
often cost the business more because the reports are delayed.

> > I'm sure your wife has similar experiences as this is why these projects
> > are always late to deliver and over budget.
> She says her projects are subject to the same universal inviolate rule
> as mine:
> 
> time and cost is always best engineering estimate times pi

"Overhead, testing, maintenance, ....", yes, it all adds to.

> We learn to deal with it. Which brings us back to Martin's initial
> statement: a scheduler cannot deal with any of this, the job itself
> must. It's an unpredictable event and schedulers can only deal with
> predictable events

True, but keeping the schedules and state stored in a way to make it easy to 
find out how far the whole process got makes recovery simpler.
Otherwise it's often quicker to simply roll back the entire schedule and 
restart. Even if only the last 2 of the 50 commands didn't run yet.

--
Joost

Reply via email to