Hi Ashish,
In addition, have you seen this https://metacpan.org/pod/ETL::Yertl ?
Your email made me wonder if there were ETL-specific tools in perl, and
there you go. It doesn't seem to be written in Moose, which I find
unfortunate, but it does quite a bit of munging. It also takes config
files. ;)
Best,
Jillian
On 05/20/2015 09:01 PM, Jed Lund wrote:
Ashish,
I am required to use a non-perl workflow/ETL package on occasion at
$work, and my experience has been that they all try to think for you,
which is ultimately limiting. There is some business thinking out
there that a "roll-your-own" perl ETL costs more to support than an
off-the-shelf one. So far, whenever I use these packages I run into
some data source or transformation requirement that isn't supported,
and I go back to building in perl. Additionally, most of those
packages don't link to perl very well, which means a hybrid approach
is hard. (perl talks to most everything; the other way around, not so
much.)
I think the argument against "roll your own" is true for poorly
written solutions. It may also be true for simple ETL cases and a
newbie skill set. But I also think that perl unfairly gets lumped in
with all scripting languages when this assessment is made at a
business level for overall business needs. I suspect what would be
most welcome in the perl community is a Cookbook-style repository
demonstrating existing ETL implementations, so that ETL newbies could
emulate some best practices or even cut and paste baby steps into
their own ETL development. This would allow the community to coalesce
around some best practices as well as share some new ones. I know
there are different versions of this list, but for me best ETL
practices include: built-in end-to-end logging that can be turned on
and off selectively; (to second Jillian's recommendation) config-file
management of specific flows rather than writing individual scripts;
and some extraction platform where you manage the transformative and
data-scrubbing elements so that they can be added to the codebase and
called via config file without rewriting the whole ETL scheme you are
using. There are a lot of other best practices dealing with database
connections and architecture that are mostly baked into DBI and DBIx
that I don't cover in this list.
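To make the config-file idea a bit more concrete, here is a rough
sketch of the kind of dispatch I have in mind (the YAML layout and the
My::Transform::* package names are invented for illustration, not
taken from any published module):

    # sketch only - config layout and transform package names are made up
    use strict;
    use warnings;
    use YAML::XS 'LoadFile';
    use Module::Runtime 'require_module';

    my $config = LoadFile( $ARGV[0] );      # e.g. flows/customers.yml

    for my $step ( @{ $config->{steps} } ) {
        my $class = $step->{transform};     # e.g. My::Transform::TrimWhitespace
        require_module($class);
        my $t = $class->new( %{ $step->{args} // {} } );
        warn "running $class\n" if $config->{logging};   # selective logging switch
        $t->run();
    }

The point is that adding a new scrubbing step becomes a new package
plus a line in the config file, not a rewrite of the flow script.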
With all that said, my own code is substantially lacking in
documentation and doesn't have a full test suite, so I haven't
published all the pieces of what I do at work. So I think your desire
to bring something out should be supported. I look forward to your
contributions.
Best Regards,
Jed
On Tue, May 19, 2015 at 8:52 PM, Ashish Mukherjee
<ashish.mukher...@gmail.com> wrote:
Jed,
Thanks for your detailed reply.
While I quite agree that perl indeed puts a number of powerful
tools and modules at the developer's disposal for most data
crunching, I was wondering if there may not be a gain in extending
Nelson's modules further with the aim of providing some
standardization (at least to serve the typical use cases more
easily). One could even provide certain methods to override for
specific processing.
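For example, the kind of override hook I have in mind might look
roughly like this (ETL::Generic and its method names are purely
hypothetical here, just to illustrate the shape, not Nelson's actual
API):

    # hypothetical sketch - the base class and hook name are invented
    package My::ETL::Customers;
    use strict;
    use warnings;
    use parent -norequire, 'ETL::Generic';   # -norequire: imaginary base class

    # the framework would call transform_row() for every record,
    # so a subclass only writes the part that is specific to its data
    sub transform_row {
        my ( $self, $row ) = @_;
        $row->{email} = lc $row->{email};
        $row->{name}  =~ s/^\s+|\s+$//g;
        return $row;
    }

    1;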
For even more specific cases, developers may still go the route of
combining their own modules in a custom way.
Also, I see a lot of people turning to other languages to leverage
"Big Data" platforms in ETL. This may be an area where perl can
become more friendly.
I welcome your thoughts and those of others in the group.
Regards,
Ashish
On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:
Ashish,
From one perspective, all of perl and CPAN is a very powerful ETL
package. I would argue that its ability to stitch
practically any system to another system, while munging the
data in transit to your own specification, is not exceeded
anywhere. There are languages and tools outside perl where
the process is more standardized for specific industry flows;
however, with that standardization comes some inflexibility.
Many if not most of the pieces of those tools also exist in
the CPAN-verse in a pretty standardized format. They
just aren't collated together under an ETL or data warehouse
header. You will be most successful in building ETL flows for
your needs with perl by searching for each piece of your ETL
flow separately rather than looking for a one-stop shop. For
instance, Nelson built his data stuff on DBI
<https://metacpan.org/pod/DBI> and DBIx
<https://metacpan.org/search?q=DBIx&size=20>. While Nelson's
packages aren't doing so well in the CPAN Testers universe, you
may find that DBI and DBIx already provide sufficient
functionality for your needs without his wrapper. Both of
those packages are battle-tested in many ways and link quite
well to all of the other interesting things on CPAN.
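As a trivial example of what I mean by stitching pieces together
yourself, a CSV-to-database load really only needs DBI plus a CSV
parser; the file, table, and column names below are just placeholders:

    # minimal extract/load sketch - names are placeholders, error
    # handling is left to RaiseError for brevity
    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
    open my $fh, '<:encoding(UTF-8)', 'customers.csv' or die $!;
    my $header = $csv->getline($fh);            # skip the header row

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=etl.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    my $sth = $dbh->prepare(
        'INSERT INTO customers (id, name, email) VALUES (?, ?, ?)'
    );

    while ( my $row = $csv->getline($fh) ) {
        # the "transform" step lives here - scrub, reshape, validate
        $sth->execute( @{$row}[ 0 .. 2 ] );
    }
    $dbh->commit;

Everything interesting (the transforms, the scrubbing) ends up in
plain perl in the middle of that loop, which is exactly where you
want the flexibility.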
I myself have never had to go outside perl to execute ETL.
Best Regards,
Jed
On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee
<ashish.mukher...@gmail.com> wrote:
Hello,
I have been searching for some perl modules related to
data warehousing needs and came across a few elementary
DataWarehouse modules on CPAN by Nelson Ferraz, which
were last updated in 2010. I tried reaching out to him
over the list previously, without success.
I was wondering the following -
1. Do others also have similar requirements in perl, i.e.
modules for ETL, data warehousing, etc.?
2. Do people on this list have certain other requirements
which are not so apparent, but are perhaps being met by
frameworks in other languages?
If I am unable to reach Nelson, I was thinking of
following the process to take custodianship of the modules
and work on them further.
Regards,
Ashish