snfctech wrote:
Does anyone have experience building a data warehouse in python?  Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go.  e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.

My experience is that if you enjoy hacking around databases and are proficient in Python, than for small scale solutions it is preferable to do it yourself. If you need a large scale solutions with advanced requirements, building it yourself is mostly the only way.

I have build a rather complex datawarehouse system in the past (well actually more like a centralised synchronisation hub, having input and reporting databases as satellite clients), shoving data from around 500 databases (typically 5 Gb in size each) spread over the world.

The only commercial solutions I reviewed was Business Objects Data Integrator and Oracle Warehouse Builder.

These tools where quite flexible and if you already have a license deal which includes these tools I would definitely recommend to have more than just a look at it.

If not and you are comfortably with using python to shovel data from A to B and transform it at the same time (moving relational data automatically into a EAV model and back again, for example) than building your own solution will probably save you money and time (as opposed to learn how to use that ETL tool).

This will require you to have at least interest in the following subjects:
- Authorization, may everybody use all data or should it be limited to a subset on the data depending on the data? (My solution was one centralised hub which contains all data but is only accessible to special 'client' servers strictly maintained by me which only sync the data relevant to them).

- Authenticity, if you have different values for the same thing, which one should be considered authoritative and if yes may it be pushed back to the un-authoritative?

-Synchronisation, you really don't want to push/pull all of the database content over every x times, so how can you delta it and is there a way to do this only when the data changes (push vs pull)?

-ATOMIC, how long may the data be out of date and is it allowed to partially update

-Using and maintaining multiple databases, hopefully spread over multiple systems. I had a server for each production DB, a server that mirrored that production DB with some added columns per table for external synchronization purposes and a master synchronisation server (so in essence all data was copied three times, not very efficient but good if you like to play it on the safe side).


--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to