Hi Ashish,
With regard to "Big Data", ETL, and perl, I think perl does very well.
Check out MCE for your big data needs. It's great, and has lots of
utilities to handle large files and data structures.
The backend of nearly every application I've written has been in perl,
especially using DBI and DBIx, and more recently the MongoDB driver.
If you're looking for an ETL tool that can handle large data, Clover is
pretty good. There are free and paid versions, but for myself I'd much
rather use perl, or maybe python for a few things. Personally, any ETL
software has made me want to bash my head into a wall.
I haven't looked at things like this in a while, but if I wanted to do
workflow- or rule-based things in perl I'd use
https://metacpan.org/pod/Class::Workflow::Cookbook
or maybe this:
https://metacpan.org/pod/FSA::Rules
But I like Moose, so I'd probably go with Class::Workflow. It's also the
only one I have personal experience with. It has utilities for storing
your workflow/states in databases, and has the ability to use a config
file. What more could you ask? ;-) If you need to use multiple threads
or processes you have great modules for that too (MCE,
Parallel::ForkManager). If there is a custom (command line) program that
runs stuff, you can run it and monitor it with one of the IPC modules.
If you want to submit something to a SLURM scheduler there is a module
for that too (shameless plug ;-) )
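To make the "run it and monitor it" part concrete, here is a minimal sketch. It is written in Python purely for illustration (in perl you would reach for one of the IPC modules instead), and the function name run_and_monitor is made up for this example, not taken from any library.

```python
import subprocess
import sys


def run_and_monitor(cmd):
    """Run a command, read its output line by line, return (exit_code, lines).

    Hypothetical helper for illustration only. stderr is merged into
    stdout so that a single stream can be watched.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = []
    # Reading the pipe as it fills is the "monitoring" step: each line
    # could be logged, parsed for progress, or checked for errors here.
    for line in proc.stdout:
        lines.append(line.rstrip("\n"))
    exit_code = proc.wait()
    return exit_code, lines


if __name__ == "__main__":
    # Stand-in for a custom command line program.
    code, output = run_and_monitor([sys.executable, "-c", "print('step 1 done')"])
    print(code, output)
```

A non-zero exit code is the usual signal to stop the workflow or retry the step.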
I checked out pandas/numpy (python packages) a few years ago, and they
were pretty good, but had trouble with big data. I think that can be
bypassed now with something like Blaze, but I haven't looked deeply into
it. There is also snakemake in python, which is another rule-based
workflow generator.
Generally, most things can be abstracted to a workflow, or better yet, a
workflow that is made up of config files and/or REST APIs.
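As a rough sketch of what I mean by a workflow made up of config files: the states and allowed transitions live in a config (JSON here), and the code only walks it. This is in Python for illustration and is not how Class::Workflow or FSA::Rules are implemented; the names Workflow, load_workflow, and the state/action names are all made up for the example.

```python
import json

# Hypothetical config: in practice this would live in its own file.
CONFIG = """
{
  "initial": "extracted",
  "transitions": {
    "extracted":   {"transform": "transformed"},
    "transformed": {"load": "loaded"},
    "loaded":      {}
  }
}
"""


class Workflow:
    """Tiny state machine driven entirely by a config dictionary."""

    def __init__(self, config):
        self.transitions = config["transitions"]
        self.state = config["initial"]

    def apply(self, action):
        """Move to the next state, or raise if the action is not allowed."""
        allowed = self.transitions[self.state]
        if action not in allowed:
            raise ValueError(
                f"{action!r} is not allowed from state {self.state!r}"
            )
        self.state = allowed[action]
        return self.state


def load_workflow(text):
    """Build a Workflow from a JSON string."""
    return Workflow(json.loads(text))


if __name__ == "__main__":
    wf = load_workflow(CONFIG)
    for action in ("transform", "load"):
        print(action, "->", wf.apply(action))
```

Changing the workflow then means editing the config, not the code.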
Best,
Jillian
On 05/20/2015 06:52 AM, Ashish Mukherjee wrote:
Jed,
Thanks for your detailed reply.
While I quite agree that perl indeed throws a number of powerful tools
and modules at the developer's disposal for most data crunching, I was
wondering if there may not be a gain in extending Nelson's modules
further with the aim of providing some standardization (at least to
serve the typical use cases more easily). One may even provide
certain methods to override for specific processing.
For even more specific cases, developers may still go the route of
combining their own modules in a custom way.
Also, I see a lot of people turning to other languages to leverage "Big
Data" platforms in ETL. This may be an area where perl can become more
friendly.
I welcome your thoughts and those of others in the group.
Regards,
Ashish
On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:
Ashish,
In one perspective all of perl and CPAN is a very powerful ETL
package. I would argue that its ability to stitch practically
any system to another system while mangling the data in transit to
your own specification is not exceeded anywhere. There are
languages and tools outside perl where the process is more
standardized for specific industry flows. However, with that
standardization comes some inflexibility. Many if not most of the
pieces of those tools also exist in the CPAN-verse in a pretty
standardized format also. They just aren't collated together under
an ETL or data warehouse header. You will be most successful in
building ETL flows for your needs with perl by searching for each
piece of your ETL flow separately rather than looking for a one
stop shop. For instance Nelson built his data stuff on DBI
<https://metacpan.org/pod/DBI> and DBIx
<https://metacpan.org/search?q=DBIx&size=20>. While Nelson's
packages aren't doing so well in the CPAN testers universe you may
find that DBI and DBIx already provide sufficient functionality
for your needs without his wrapper. Both of those packages are
battle tested in many ways and link quite well to all of the other
interesting things on CPAN.
I myself have never had to go outside perl to execute ETL.
Best Regards,
Jed
On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee
<ashish.mukher...@gmail.com> wrote:
Hello,
I have been searching for some perl modules related to
DataWarehouse needs and came across a few elementary
DataWarehouse modules on CPAN by Nelson Ferraz, which have
been last updated in 2010. I tried reaching out to him over
the list previously, without success.
I was wondering the following -
1. Do others also have similar requirements in perl? i.e.
modules for ETL, DataWarehousing etc.
2. Do people on this list have certain other requirements
which are not so apparent, but are perhaps being met by
frameworks in other languages?
If I am unable to reach Nelson, I was thinking of following
the process to take custodianship of the modules and work on
them further.
Regards,
Ashish