Ashish,

I am required to use a non-perl workflow/ETL package on occasion at $work, and my experience has been that they all try to think for you, which is ultimately limiting. There is some business thinking out there that a "roll-your-own" perl ETL costs more to support than off-the-shelf software. So far, whenever I use these packages I find some data source or some transformation requirement that is not supported, and I go back to building in perl. Additionally, most of those packages don't link to perl very well, which makes a hybrid approach difficult. (perl talks with most everything; the other way around, not so much.)
I think the argument against "roll your own" is true for poorly written solutions. It may also be true for simple ETL cases and a newbie skill set. But I also think that perl unfairly gets lumped in with all scripting languages when this assessment is made at a business level for overall business needs.

I suspect the contribution that would be most welcome in the perl community would be a Cookbook-style repository demonstrating existing ETL implementations, so that ETL newbies could emulate best practices or even cut and paste baby steps into their own ETL development. This would allow the community to coalesce around some best practices as well as share new ones. I know there are different versions of this list, but for me the best ETL practices include: built-in end-to-end logging that can be turned on and off selectively (to second Jillian's recommendation); config-file management of specific flows rather than writing individual scripts; and an extraction platform where you manage the transformation and data-scrubbing elements so that they can be added to the codebase and called via config file without rewriting the whole ETL scheme you are using. There are a lot of other best practices dealing with database connections and architecture that are mostly baked into DBI and DBIx, which I don't cover in this list.

With all that said, my own code is substantially lacking in documentation and doesn't have a full test suite, so I haven't published all the pieces of what I do at work. I think your desire to bring something out should be supported. I look forward to your contributions.

Best Regards,

Jed

On Tue, May 19, 2015 at 8:52 PM, Ashish Mukherjee <ashish.mukher...@gmail.com> wrote:

> Jed,
>
> Thanks for your detailed reply.
> While I quite agree that perl indeed puts a number of powerful tools and modules at the developer's disposal for most data crunching, I was wondering if there may not be a gain in extending Nelson's modules further with the aim of providing some standardization (at least to serve the typical use cases more easily). One may even provide certain methods to override for specific processing.
>
> For even more specific cases, developers may still go the route of combining their own modules in a custom way.
>
> Also, I see a lot of people turning to other languages to leverage "Big Data" platforms in ETL. This may be an area where perl can become more friendly.
>
> Welcome your thoughts and those of others in the group.
>
> Regards,
> Ashish
>
> On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:
>
>> Ashish,
>>
>> In one perspective, all of perl and CPAN is a very powerful ETL package. I would argue that its ability to stitch practically any system to another system, while mangling the data in transit to your own specification, is not exceeded anywhere. There are languages and tools outside perl where the process is more standardized for specific industry flows. However, with that standardization comes some inflexibility. Many if not most of the pieces of those tools also exist in the CPAN-verse in a pretty standardized format; they just aren't collated together under an ETL or data-warehouse header. You will be most successful in building ETL flows for your needs with perl by searching for each piece of your ETL flow separately rather than looking for a one-stop shop. For instance, Nelson built his data stuff on DBI <https://metacpan.org/pod/DBI> and DBIx <https://metacpan.org/search?q=DBIx&size=20>. While Nelson's packages aren't doing so well in the CPAN testers universe, you may find that DBI and DBIx already provide sufficient functionality for your needs without his wrapper.
>> Both of those packages are battle-tested in many ways and link quite well to all of the other interesting things on CPAN.
>>
>> I myself have never had to go outside perl to execute ETL.
>>
>> Best Regards,
>>
>> Jed
>>
>> On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee <ashish.mukher...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have been searching for perl modules related to data-warehouse needs and came across a few elementary DataWarehouse modules on CPAN by Nelson Ferraz, which were last updated in 2010. I tried reaching out to him over the list previously, without success.
>>>
>>> I was wondering the following:
>>>
>>> 1. Do others also have similar requirements in perl, i.e. modules for ETL, data warehousing, etc.?
>>>
>>> 2. Do people on this list have certain other requirements which are not so apparent, but are perhaps being met by frameworks in other languages?
>>>
>>> If I am unable to reach Nelson, I was thinking of following the process to take custodianship of the modules and work on them further.
>>>
>>> Regards,
>>> Ashish
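[Editor's note: to make the config-driven approach Jed describes concrete, here is a minimal sketch. The in-memory %config hash, the transform names, and the log_step helper are all illustrative inventions, not from any published module; a real flow would load the config from a file, e.g. with Config::Tiny or JSON::PP.]

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical "config"; in practice this would come from a config file,
# so a new flow means a new config, not a new script.
my %config = (
    log_enabled => 1,
    transforms  => [ 'trim', 'upcase' ],
);

# Registry of reusable transforms; new scrubbing steps are added here
# and enabled per-flow via config, without rewriting the ETL driver.
my %transform = (
    trim   => sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s },
    upcase => sub { uc shift },
);

# End-to-end logging that can be switched on and off selectively.
sub log_step {
    my ($msg) = @_;
    print STDERR "[etl] $msg\n" if $config{log_enabled};
}

# Drive the configured transforms over each record.
my @rows = ( '  alice  ', '  bob  ' );
for my $name ( @{ $config{transforms} } ) {
    log_step("applying transform: $name");
    @rows = map { $transform{$name}->($_) } @rows;
}

print "$_\n" for @rows;    # ALICE / BOB
```

The dispatch-table pattern is what lets the transform library grow in the codebase while individual flows stay declarative.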
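[Editor's note: since the thread points at DBI as the battle-tested foundation, here is a minimal extract-transform-load pass using plain DBI against an in-memory SQLite database. The table names and trim step are illustrative; running it requires DBD::SQLite from CPAN.]

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# In-memory database; src and dst stand in for a real extract source
# and a warehouse target.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE src (name TEXT)');
$dbh->do('CREATE TABLE dst (name TEXT)');
$dbh->do( 'INSERT INTO src (name) VALUES (?)', undef, $_ )
    for '  alice  ', '  bob  ';

# Extract ...
my $select = $dbh->prepare('SELECT name FROM src');
my $insert = $dbh->prepare('INSERT INTO dst (name) VALUES (?)');
$select->execute;
while ( my ($name) = $select->fetchrow_array ) {
    # ... transform (trim whitespace) ...
    $name =~ s/^\s+|\s+$//g;
    # ... and load.
    $insert->execute($name);
}

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM dst');
print "loaded $count rows\n";    # loaded 2 rows
```

Swapping the DSN is all it takes to point the same code at Postgres, MySQL, Oracle, etc., which is the portability the thread credits DBI with.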