Ashish,

I am required to use a non-perl workflow/ETL package on occasion at $work, and my experience has been that they all try to think for you, which is ultimately limiting. There is some business thinking out there that a "roll-your-own" perl ETL costs more to support than off-the-shelf software. So far, whenever I use these packages I find some data source or some transformation requirement that is not supported, and I go back to building in perl. Additionally, most of those packages don't link to perl very well, which makes a hybrid approach difficult. (perl talks with most everything; the other way around, not so much.)
I think the argument against "roll your own" is true for poorly written solutions. It may also be true for simple ETL cases and a newbie skill set. But I also think that perl unfairly gets lumped in with all scripting languages when this assessment is made at a business level for overall business needs.

I suspect the contribution that would be most welcome in the perl community would be a Cookbook-style repository demonstrating existing ETL implementations, so that ETL newbies could emulate best practices or even cut and paste baby steps into their own ETL development. This would allow the community to coalesce around some best practices as well as share new ones. I know there are different versions of this list, but for me the best ETL practices include: built-in end-to-end logging that can be turned on and off selectively (to second Jillian's recommendation); config-file management of specific flows rather than writing individual scripts; and an extraction platform where you manage the transformation and data-scrubbing elements so that they can be added to the codebase and called via config file without rewriting the whole ETL scheme you are using. There are a lot of other best practices dealing with database connections and architecture that are mostly baked into DBI and DBIx, which I don't cover in this list.

With all that said, my own code is substantially lacking in documentation and doesn't have a full test suite, so I haven't published all the pieces of what I do at work. I think your desire to bring something out should be supported. I look forward to your contributions.

Best Regards,

Jed

On Tue, May 19, 2015 at 8:52 PM, Ashish Mukherjee <ashish.mukher...@gmail.com> wrote:

> Jed,
>
> Thanks for your detailed reply.
> While I quite agree that perl indeed puts a number of powerful tools and modules at the developer's disposal for most data crunching, I was wondering if there may not be a gain in extending Nelson's modules further with the aim of providing some standardization (at least to serve the typical use cases more easily). One may even provide certain methods to override for specific processing.
>
> For even more specific cases, developers may still go the route of combining their own modules in a custom way.
>
> Also, I see a lot of people turning to other languages to leverage "Big Data" platforms in ETL. This may be an area where perl can become more friendly.
>
> Welcome your thoughts and those of others in the group.
>
> Regards,
> Ashish
>
> On Tue, May 19, 2015 at 11:44 PM, Jed Lund <jandrewl...@gmail.com> wrote:
>
>> Ashish,
>>
>> In one perspective, all of perl and CPAN is a very powerful ETL package. I would argue that its ability to stitch practically any system to another system, while mangling the data in transit to your own specification, is not exceeded anywhere. There are languages and tools outside perl where the process is more standardized for specific industry flows. However, with that standardization comes some inflexibility. Many if not most of the pieces of those tools also exist in the CPAN-verse in a pretty standardized format; they just aren't collated together under an ETL or data-warehouse header. You will be most successful in building ETL flows for your needs with perl by searching for each piece of your ETL flow separately rather than looking for a one-stop shop. For instance, Nelson built his data stuff on DBI <https://metacpan.org/pod/DBI> and DBIx <https://metacpan.org/search?q=DBIx&size=20>. While Nelson's packages aren't doing so well in the CPAN testers universe, you may find that DBI and DBIx already provide sufficient functionality for your needs without his wrapper.
>> Both of those packages are battle-tested in many ways and link quite well to all of the other interesting things on CPAN.
>>
>> I myself have never had to go outside perl to execute ETL.
>>
>> Best Regards,
>>
>> Jed
>>
>> On Tue, May 19, 2015 at 8:57 AM, Ashish Mukherjee <ashish.mukher...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have been searching for perl modules related to data-warehouse needs and came across a few elementary DataWarehouse modules on CPAN by Nelson Ferraz, which were last updated in 2010. I tried reaching out to him over the list previously, without success.
>>>
>>> I was wondering the following:
>>>
>>> 1. Do others also have similar requirements in perl, i.e. modules for ETL, data warehousing, etc.?
>>>
>>> 2. Do people on this list have certain other requirements which are not so apparent, but are perhaps being met by frameworks in other languages?
>>>
>>> If I am unable to reach Nelson, I was thinking of following the process to take custodianship of the modules and work on them further.
>>>
>>> Regards,
>>> Ashish
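[Editor's note: to make the config-driven approach Jed describes concrete, here is a minimal sketch. The in-memory %config hash, the transform names, and the log_step helper are all illustrative inventions, not from any published module; a real flow would load the config from a file, e.g. with Config::Tiny or JSON::PP.]

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical "config"; in practice this would come from a config file,
# so a new flow means a new config, not a new script.
my %config = (
    log_enabled => 1,
    transforms  => [ 'trim', 'upcase' ],
);

# Registry of reusable transforms; new scrubbing steps are added here
# and enabled per-flow via config, without rewriting the ETL driver.
my %transform = (
    trim   => sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s },
    upcase => sub { uc shift },
);

# End-to-end logging that can be switched on and off selectively.
sub log_step {
    my ($msg) = @_;
    print STDERR "[etl] $msg\n" if $config{log_enabled};
}

# Drive the configured transforms over each record.
my @rows = ( '  alice  ', '  bob  ' );
for my $name ( @{ $config{transforms} } ) {
    log_step("applying transform: $name");
    @rows = map { $transform{$name}->($_) } @rows;
}

print "$_\n" for @rows;    # ALICE / BOB
```

The dispatch-table pattern is what lets the transform library grow in the codebase while individual flows stay declarative.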
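[Editor's note: since the thread points at DBI as the battle-tested foundation, here is a minimal extract-transform-load pass using plain DBI against an in-memory SQLite database. The table names and trim step are illustrative; running it requires DBD::SQLite from CPAN.]

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# In-memory database; src and dst stand in for a real extract source
# and a warehouse target.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE src (name TEXT)');
$dbh->do('CREATE TABLE dst (name TEXT)');
$dbh->do( 'INSERT INTO src (name) VALUES (?)', undef, $_ )
    for '  alice  ', '  bob  ';

# Extract ...
my $select = $dbh->prepare('SELECT name FROM src');
my $insert = $dbh->prepare('INSERT INTO dst (name) VALUES (?)');
$select->execute;
while ( my ($name) = $select->fetchrow_array ) {
    # ... transform (trim whitespace) ...
    $name =~ s/^\s+|\s+$//g;
    # ... and load.
    $insert->execute($name);
}

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM dst');
print "loaded $count rows\n";    # loaded 2 rows
```

Swapping the DSN is all it takes to point the same code at Postgres, MySQL, Oracle, etc., which is the portability the thread credits DBI with.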