Re: DataWarehouse modules

Jillian Rowe Wed, 20 May 2015 08:03:12 -0700

Hi Ashish,

With regard to "Big Data", ETL, and perl, I think perl does very well. Check out MCE for your big data needs. It's great, and has lots of utilities to handle large files and data structures.

The backend of nearly every application I've written has been in perl, especially using DBI and DBIx, and more recently the MongoDB driver.

If you're looking for an ETL tool that can handle large data, Clover is pretty good. There are free and paid versions, but for myself I'd much rather use perl, or maybe python for a few things. Personally, every piece of ETL software I've used has made me want to bash my head into a wall.

I haven't looked at things like this in a while, but if I wanted to do workflow/rules-based things in perl I'd use:


https://metacpan.org/pod/Class::Workflow::Cookbook

or maybe this:

https://metacpan.org/pod/FSA::Rules

But I like Moose, so I'd probably go with Class::Workflow. It's also the only one I have personal experience with. It has utilities for storing your workflow/states in databases, and has the ability to use a config file. What more could you ask? ;-) If you need to use multiple threads or processes you have great modules for that too (MCE, Parallel::ForkManager). If there is a custom (command line) program that runs stuff, you can run it and monitor it with one of the IPC modules. If you want to submit something to a SLURM scheduler there is a module for that too (shameless plug ;-) )

I checked out pandas/numpy (python packages) a few years ago, and they were pretty good, but had trouble with big data. I think that can be bypassed now with something like Blaze, but I haven't looked deeply into it. There is also snakemake in python, which is another rule-based workflow generator.

Generally, most things can be abstracted to a workflow. Better yet, a workflow that is made up of config files and/or REST APIs.
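
To make the "workflow made up of config files" idea concrete, here is a minimal sketch in python. It is only an illustration: it is not the API of Class::Workflow, FSA::Rules or snakemake, and the state names, action names and the Workflow class are all made up for the example. The config is written inline here, but it could just as well be loaded from a JSON or YAML file.

```python
# Hypothetical workflow definition. Every state maps the actions allowed
# from it to the state each action leads to. In practice this dictionary
# could be read from a config file instead of being written inline.
WORKFLOW = {
    "initial": "extracted",
    "transitions": {
        "extracted": {"transform": "transformed"},
        "transformed": {"load": "loaded", "reject": "failed"},
        "loaded": {},
        "failed": {"retry": "extracted"},
    },
}


class Workflow:
    """Tiny state machine driven entirely by a config dictionary."""

    def __init__(self, config):
        self.transitions = config["transitions"]
        self.state = config["initial"]
        self.history = [self.state]

    def available(self):
        """Return the action names allowed from the current state."""
        return sorted(self.transitions[self.state])

    def apply(self, action):
        """Follow one named transition, or raise if it is not allowed."""
        try:
            self.state = self.transitions[self.state][action]
        except KeyError:
            raise ValueError(
                f"{action!r} not allowed from {self.state!r}"
            ) from None
        self.history.append(self.state)
        return self.state


if __name__ == "__main__":
    wf = Workflow(WORKFLOW)
    for action in ("transform", "load"):
        wf.apply(action)
    print(wf.history)
```

The point of the design is that adding a step or a failure path means editing the config, not the code that walks it.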


Best,
Jillian

On 05/20/2015 06:52 AM, Ashish Mukherjee wrote:

Jed,

Thanks for your detailed reply.

While I quite agree that perl indeed puts a number of powerful tools and modules at the developer's disposal for most data crunching, I was wondering if there may not be a gain in extending Nelson's modules further with the aim of providing some standardization (at least to serve the typical use cases more easily). One may even provide certain methods to override for specific processing.

For even more specific cases, developers may still go the route of combining their own modules in a custom way.

Also, I see a lot of people turning to other languages to leverage "Big Data" platforms in ETL. This may be an area where perl can become more friendly.


I welcome your thoughts and those of others in the group.

Regards,
Ashish

On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:


    Ashish,

    From one perspective all of perl and CPAN is a very powerful ETL
    package.  I would argue that its ability to stitch practically
    any system to another system while mangling the data in transit to
    your own specification is not exceeded anywhere.  There are
    languages and tools outside perl where the process is more
    standardized for specific industry flows.  However, with that
    standardization comes some inflexibility. Many if not most of the
    pieces of those tools also exist in the CPAN-verse in a pretty
    standardized format also. They just aren't collated together under
    an ETL or data warehouse header.  You will be most successful in
    building ETL flows for your needs with perl by searching for each
    piece of your ETL flow separately rather than looking for a
    one-stop shop.  For instance Nelson built his data stuff on DBI
    <https://metacpan.org/pod/DBI> and DBIx
    <https://metacpan.org/search?q=DBIx&size=20>.  While Nelson's
    packages aren't doing so well in the CPAN testers universe you may
    find that DBI and DBIx already provide sufficient functionality
    for your needs without his wrapper.  Both of those packages are
    battle tested in many ways and link quite well to all of the other
    interesting things on CPAN.

    I myself have never had to go outside perl to execute ETL.

    Best Regards,

    Jed

    On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee
    <ashish.mukher...@gmail.com> wrote:

        Hello,

        I have been searching for some perl modules related to
        DataWarehouse needs and came across a few elementary
        DataWarehouse modules on CPAN by Nelson Ferraz, which were
        last updated in 2010. I tried reaching out to him over
        the list previously, without success.

        I was wondering the following -

        1. Do others also have similar requirements in perl, i.e.
        modules for ETL, DataWarehousing etc.?

        2. Do people on this list have certain other requirements
        which are not so apparent, but are perhaps being met by
        frameworks in other languages?

        If I am unable to reach Nelson, I was thinking of following
        the process to take custodianship of the modules and work on
        them further.

        Regards,
        Ashish




