Hi Ashish,

Regarding "Big Data", ETL, and perl, I think perl does very well. Check out MCE for your big data needs. It's great, and has lots of utilities for handling large files and data structures.
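
Roughly, a chunked parallel read with MCE::Loop looks something like this (untested sketch; the file name and column index are made up for illustration):

    use strict;
    use warnings;
    use MCE::Loop;

    # Sum the 3rd column of a large tab-separated file, one chunk of
    # lines per worker.
    MCE::Loop::init { max_workers => 4, chunk_size => '2m' };

    my @partials = mce_loop_f {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        my $sum = 0;
        for my $line (@{ $chunk_ref }) {
            my @fields = split /\t/, $line;
            $sum += $fields[2] // 0;
        }
        MCE->gather($sum);            # send this worker's partial sum back
    } 'big_input.tsv';                # hypothetical input file

    my $total = 0;
    $total += $_ for @partials;
    print "total: $total\n";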

The backend of nearly every application I've written has been in perl, especially using DBI and DBIx, and more recently the MongoDB driver.
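
With plain DBI, the extract/load loop is just prepared statements. A minimal sketch (assumes DBD::Pg; the DSN, table, and column names are invented for illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=warehouse',
                           $ENV{DB_USER}, $ENV{DB_PASS},
                           { RaiseError => 1, AutoCommit => 0 });

    # Extract rows from a staging table...
    my $sth = $dbh->prepare(
        'SELECT id, amount FROM staging_orders WHERE loaded = 0');
    $sth->execute;

    # ...and load them into a fact table (trivial "transform" step).
    my $ins = $dbh->prepare(
        'INSERT INTO fact_orders (id, amount) VALUES (?, ?)');
    while (my $row = $sth->fetchrow_hashref) {
        $ins->execute($row->{id}, $row->{amount});
    }

    $dbh->commit;
    $dbh->disconnect;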

If you're looking for an ETL tool that can handle large data, Clover is pretty good. There are free and paid versions, but for myself I'd much rather use perl, or maybe python for a few things. Personally, every piece of ETL software I've used has made me want to bash my head into a wall.

I haven't looked at things like this in a while, but if I wanted to do workflow/rules-based things in perl I'd use

https://metacpan.org/pod/Class::Workflow::Cookbook

or maybe this:

https://metacpan.org/pod/FSA::Rules

But I like Moose, so I'd probably go with Class::Workflow. It's also the only one I have personal experience with. It has utilities for storing your workflow/states in databases, and it can use a config file. What more could you ask? ;-) If you need to use multiple threads or processes there are great modules for that too (MCE, Parallel::ForkManager). If there is a custom (command-line) program that runs things, you can run it and monitor it with one of the IPC modules. If you want to submit something to a SLURM scheduler there is a module for that too (shameless plug ;-) )
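
To give a feel for the rules-based style, here's a minimal FSA::Rules sketch (toy state names and print statements, not a real pipeline; Class::Workflow is structured differently):

    use strict;
    use warnings;
    use FSA::Rules;

    # A toy extract -> transform -> load machine. Each rule of "1"
    # means "always move to that state on the next switch".
    my $fsa = FSA::Rules->new(
        extract => {
            do    => sub { print "extracting...\n" },
            rules => [ transform => 1 ],
        },
        transform => {
            do    => sub { print "transforming...\n" },
            rules => [ load => 1 ],
        },
        load => {
            do    => sub { print "loading...\n" },
            rules => [ done => 1 ],
        },
        done => {
            do => sub { print "all done\n" },
        },
    );

    $fsa->start;                          # enters the first state (extract)
    $fsa->switch until $fsa->at('done');  # step through until finished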

I checked out pandas/numpy (python packages) a few years ago, and they were pretty good, but had trouble with big data. I think that can be bypassed now with something like Blaze, but I haven't looked deeply into it. There is also snakemake in python, which is another rule-based workflow generator.

Generally, most things can be abstracted into a workflow, better yet a workflow that is driven by config files and/or REST APIs.

Best,
Jillian

On 05/20/2015 06:52 AM, Ashish Mukherjee wrote:
Jed,

Thanks for your detailed reply.

While I quite agree that perl puts a number of powerful tools and modules at the developer's disposal for most data crunching, I was wondering whether there might be a gain in extending Nelson's modules further, with the aim of providing some standardization (at least to serve the typical use cases more easily). One could even expose certain methods to override for specific processing.

For even more specific cases, developers may still go the route of combining their own modules in a custom way.

Also, I see a lot of people turning to other languages to leverage "Big Data" platforms in ETL. This may be an area where perl can become more friendly.

I welcome your thoughts, and those of others in the group.

Regards,
Ashish

On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:

    Ashish,

    From one perspective, all of perl and CPAN is a very powerful ETL
    package.  I would argue that its ability to stitch practically
    any system to another system, while mangling the data in transit
    to your own specification, is not exceeded anywhere.  There are
    languages and tools outside perl where the process is more
    standardized for specific industry flows.  However, with that
    standardization comes some inflexibility.  Many if not most of
    the pieces of those tools also exist in the CPAN-verse in a
    pretty standardized format.  They just aren't collated together
    under an ETL or data warehouse header.  You will be most
    successful in building ETL flows for your needs with perl by
    searching for each piece of your ETL flow separately rather than
    looking for a one-stop shop.  For instance, Nelson built his
    data stuff on DBI <https://metacpan.org/pod/DBI> and DBIx
    <https://metacpan.org/search?q=DBIx&size=20>.  While Nelson's
    packages aren't doing so well in the CPAN Testers universe, you
    may find that DBI and DBIx already provide sufficient
    functionality for your needs without his wrapper.  Both of those
    packages are battle-tested in many ways and link quite well to
    all of the other interesting things on CPAN.

    I myself have never had to go outside perl to execute ETL.

    Best Regards,

    Jed

    On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee
    <ashish.mukher...@gmail.com> wrote:

        Hello,

        I have been searching for some perl modules related to
        DataWarehouse needs and came across a few elementary
        DataWarehouse modules on CPAN by Nelson Ferraz, which were
        last updated in 2010. I tried reaching out to him over the
        list previously, without success.

        I was wondering the following -

        1. Do others also have similar requirements in perl, i.e.
        modules for ETL, DataWarehousing, etc.?

        2. Do people on this list have other requirements which are
        not so apparent, but are perhaps being met by frameworks in
        other languages?

        If I am unable to reach Nelson, I was thinking of following
        the process to take custodianship of the modules and work on
        them further.

        Regards,
        Ashish