Hi Ashish,
In addition, have you seen this https://metacpan.org/pod/ETL::Yertl ?
Your email made me wonder if there were ETL-specific tools in perl, and
there you go. It doesn't seem to be written in Moose, which I find
unfortunate, but it does quite a bit of munging. It also takes config
files. ;)
Best,
Jillian
On 05/20/2015 09:01 PM, Jed Lund wrote:
Ashish,
I am required to use a non-perl workflow/ETL package on occasion at
$work, and my experience has been that they all try to think for you,
which is ultimately limiting. There is some business thinking out
there that a "roll-your-own" perl ETL costs more to support than an
off-the-shelf one. So far, whenever I use these packages I run into
some data source or transformation requirement that isn't supported,
and I go back to building in perl. Additionally, most of those
packages don't link to perl very well, which means a hybrid approach
is hard. (perl talks to most everything; the other way around, not so
much.)
I think the argument against "roll your own" is true for poorly
written solutions. It may also be true for simple ETL cases and a
newbie skill set. But I also think that perl unfairly gets lumped in
with all scripting languages when this assessment is made at a
business level for overall business needs. I suspect what would be
most welcome in the perl community is a Cookbook-style repository
demonstrating existing ETL implementations, so that ETL newbies could
emulate some best practices or even cut and paste baby steps into
their own ETL development. This would allow the community to coalesce
around some best practices as well as share some new ones. I know
there are different versions of this list, but for me best ETL
practices include: built-in end-to-end logging that can be turned on
and off selectively; (to second Jillian's recommendation) config-file
management of specific flows rather than writing individual scripts;
and some extraction platform where you manage the transformative and
data-scrubbing elements so that they can be added to the codebase and
called via config file without rewriting the whole ETL scheme you are
using. There are a lot of other best practices dealing with database
connections and architecture that are mostly baked into DBI and DBIx
that I don't cover in this list.
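To make the config-file idea a bit more concrete, here is a rough
sketch of the kind of dispatch I have in mind (the YAML layout and the
My::Transform::* package names are invented for illustration, not
taken from any published module):

    # sketch only - config layout and transform package names are made up
    use strict;
    use warnings;
    use YAML::XS 'LoadFile';
    use Module::Runtime 'require_module';

    my $config = LoadFile( $ARGV[0] );      # e.g. flows/customers.yml

    for my $step ( @{ $config->{steps} } ) {
        my $class = $step->{transform};     # e.g. My::Transform::TrimWhitespace
        require_module($class);
        my $t = $class->new( %{ $step->{args} // {} } );
        warn "running $class\n" if $config->{logging};   # selective logging switch
        $t->run();
    }

The point is that adding a new scrubbing step becomes a new package
plus a line in the config file, not a rewrite of the flow script.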
With all that said, my own code is substantially lacking in
documentation and doesn't have a full test suite, so I haven't
published all the pieces of what I do at work. So I think your desire
to bring something out should be supported. I look forward to your
contributions.
Best Regards,
Jed
On Tue, May 19, 2015 at 8:52 PM, Ashish Mukherjee
<ashish.mukher...@gmail.com> wrote:
Jed,
Thanks for your detailed reply.
While I quite agree that perl indeed puts a number of powerful
tools and modules at the developer's disposal for most data
crunching, I was wondering if there may not be a gain in extending
Nelson's modules further with the aim of providing some
standardization (at least to serve the typical use cases more
easily). One could even provide certain methods to override for
specific processing.
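For example, the kind of override hook I have in mind might look
roughly like this (ETL::Generic and its method names are purely
hypothetical here, just to illustrate the shape, not Nelson's actual
API):

    # hypothetical sketch - the base class and hook name are invented
    package My::ETL::Customers;
    use strict;
    use warnings;
    use parent -norequire, 'ETL::Generic';   # -norequire: imaginary base class

    # the framework would call transform_row() for every record,
    # so a subclass only writes the part that is specific to its data
    sub transform_row {
        my ( $self, $row ) = @_;
        $row->{email} = lc $row->{email};
        $row->{name}  =~ s/^\s+|\s+$//g;
        return $row;
    }

    1;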
For even more specific cases, developers may still go the route of
combining their own modules in a custom way.
Also, I see a lot of people turning to other languages to leverage
"Big Data" platforms in ETL. This may be an area where perl can
become more friendly.
I welcome your thoughts and those of others in the group.
Regards,
Ashish
On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:
Ashish,
From one perspective, all of perl and CPAN is a very powerful ETL
package. I would argue that its ability to stitch
practically any system to another system, while munging the
data in transit to your own specification, is not exceeded
anywhere. There are languages and tools outside perl where
the process is more standardized for specific industry flows;
however, with that standardization comes some inflexibility.
Many if not most of the pieces of those tools also exist in
the CPAN-verse in a pretty standardized format. They
just aren't collated together under an ETL or data warehouse
header. You will be most successful in building ETL flows for
your needs with perl by searching for each piece of your ETL
flow separately rather than looking for a one-stop shop. For
instance, Nelson built his data stuff on DBI
<https://metacpan.org/pod/DBI> and DBIx
<https://metacpan.org/search?q=DBIx&size=20>. While Nelson's
packages aren't doing so well in the CPAN Testers universe, you
may find that DBI and DBIx already provide sufficient
functionality for your needs without his wrapper. Both of
those packages are battle-tested in many ways and link quite
well to all of the other interesting things on CPAN.
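As a trivial example of what I mean by stitching pieces together
yourself, a CSV-to-database load really only needs DBI plus a CSV
parser; the file, table, and column names below are just placeholders:

    # minimal extract/load sketch - names are placeholders, error
    # handling is left to RaiseError for brevity
    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
    open my $fh, '<:encoding(UTF-8)', 'customers.csv' or die $!;
    my $header = $csv->getline($fh);            # skip the header row

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=etl.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    my $sth = $dbh->prepare(
        'INSERT INTO customers (id, name, email) VALUES (?, ?, ?)'
    );

    while ( my $row = $csv->getline($fh) ) {
        # the "transform" step lives here - scrub, reshape, validate
        $sth->execute( @{$row}[ 0 .. 2 ] );
    }
    $dbh->commit;

Everything interesting (the transforms, the scrubbing) ends up in
plain perl in the middle of that loop, which is exactly where you
want the flexibility.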
I myself have never had to go outside perl to execute ETL.
Best Regards,
Jed
On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee
<ashish.mukher...@gmail.com> wrote:
Hello,
I have been searching for some perl modules related to
data warehousing needs and came across a few elementary
DataWarehouse modules on CPAN by Nelson Ferraz, which
were last updated in 2010. I tried reaching out to him
over the list previously, without success.
I was wondering the following -
1. Do others also have similar requirements in perl, i.e.
modules for ETL, data warehousing, etc.?
2. Do people on this list have certain other requirements
which are not so apparent, but are perhaps being met by
frameworks in other languages?
If I am unable to reach Nelson, I was thinking of
following the process to take custodianship of the modules
and work on them further.
Regards,
Ashish