> One of the ideas over DIH discussed earlier is making it standalone.

Yeah; my beef with the DIH is that it’s tied to Solr. But I’d rather see something other than the DIH outside Solr; it’s not worthy IMO. Why have something Solr-specific at all? A great pipeline shouldn’t tie itself to any end-point.

There are a variety of solutions out there that I tried. There are the big 3 open-source ETLs (Kettle, Clover, Talend), and they aren’t quite ideal in one way or another. And Spring Integration. And some half-baked data pipelines like OpenPipe & Open Pipeline. I never got around to taking a good look at Findwise’s open-sourced Hydra, but I learned enough to know that, to my surprise, it is configured in code rather than in a config file (like all the others), and that’s a big turn-off to me.

Today I read through most of the Morphlines docs and a few choice source files and I’m super-impressed. But as you note, it’s missing a lot of other stuff. I think something great could be built using it as a core piece.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, Jun 8, 2014 at 5:51 PM, Mikhail Khludnev <[email protected]> wrote:

> Jack,
> I found your considerations quite reasonable.
> One of the ideas over DIH discussed earlier is making it standalone. So,
> if we start from a simple Morphline UI, we can do this extraction. Then such
> an externalized ETL will work better with Solr Cloud than DIH works now.
> Presumably we can reuse DIH Jdbc Datasources as a source for Morphline
> records.
> Still open questions in this approach are:
> - joins/caching - seem possible with Morphlines, but there is still no such
>   command
> - delta import - a scenario we must not forget to handle
> - threads (completely outside Morphlines' concerns)
> - distributed processing - it would be great if we could partition the
>   datasource, e.g. something like what Sqoop does
> ... what else?
>
>
> On Sun, Jun 8, 2014 at 6:54 PM, Jack Krupansky <[email protected]>
> wrote:
>
>> I've avoided DIH like the plague since it really doesn't fit well in
>> Solr, so I'm still baffled as to why you think we need to use DIH as the
>> foundation for a Solr Morphlines project. That shouldn't stop you, but
>> what's the big impediment to taking a clean slate approach to Morphlines -
>> learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
>> that is not burdened from the get-go with all of DIH's baggage?
>>
>> Configuring DIH is one of its main problems, so blending Morphlines
>> config into DIH config would seem to just make Morphlines less attractive
>> than it actually is when viewed by itself.
>>
>> You might also consider how ManifoldCF (another Apache project) would
>> integrate with DIH and Morphlines as well. I mean, the core use case is ETL
>> from external data sources. And how all of this relates to Apache Flume as
>> well.
>>
>> But back to the original, still unanswered, question: Why use DIH as the
>> starting point for integrating Morphlines with Solr - unless the goal is to
>> make Morphlines unpalatable and less approachable than even DIH itself?!
>>
>> Another question: What does Elasticsearch have in this area (besides
>> "rivers")? Are they headed in the Morphlines direction as well?
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Alexandre Rafalovitch
>> Sent: Sunday, June 8, 2014 10:16 AM
>> To: [email protected]
>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>
>> I see DIH as something that offers a quick way to get things done, as
>> long as they fit into DIH's couple of basic scenarios. Going even a
>> little beyond hits bugs, bad documentation, inconsistencies and lack
>> of ongoing support (e.g. SOLR-4383).
>>
>> So, if it works for you - great. If it does not - too bad, use SolrJ.
>> And given what I observe, I believe the next round of improvements
>> might be easier to achieve by moving to a different open-source pipe
>> project than trying to keep reinventing and bandaging one of our own.
>> Go where the strongest community is, etc.
>>
>> Morphline can be seen as a replacement for DIH's EntityProcessors and
>> Transformers (Flume adds other bits). The reasons I think it is worth
>> looking at are as follows:
>> 1) DIH is not really being maintained or further improved. So, the
>> list of EPs and Transformers is the same and does not account for new
>> requests (which we see periodically on the mailing list); even the new
>> implementations get stuck in JIRA (see the JIRA in original email)
>> 2) It's not terribly well documented either, so people are always
>> struggling to understand how the entity is actually generated and what
>> happens when things go wrong
>> 3) We are already bundling Morphline jars with Solr. But we are NOT
>> using them in any way useful to a non-Hadoop Solr user.
>> Which begs the question of why we added them (one answer, I guess:
>> because we don't have a module system).
>> 4) Morphlines have more primitives than DIH and the available list keeps
>> growing
>> 5) What separate module for Solr? We have no discovery method for
>> modules. Writing one for general consumption is like trying to sing in
>> a vacuum - the problem is a lot bigger than with an individual offering.
>>
>> In terms of implementation, I think it would take defining a custom
>> MorphlineEntityProcessor which basically plugs into DIH's current
>> DataSources. So, one could use, for example, DIH SqlDataSource to get a
>> list of files and then hand off to Morphline's black box to parse
>> those files into records (e.g. multiline records), augment them, etc.
>> Then, at the end, this gets handed back to DIH to finish it up. I
>> think this would work even with nested entities and transformers. The
>> Admin UI should also work.
>>
>> Eventually, I think we need a harder discussion about DIH, so this
>> partial handover could be a way to test the waters.
>>
>> Does this make more sense?
>>
>> Regards,
>> Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
>> wrote:
>>
>>> It sounds more like an alternative to DIH rather than an incremental
>>> add-on to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
>>>
>>> So, back to Shalin's question, which specific (please detail!) use cases
>>> of DIH are enhanced by Morphline?
>>>
>>> Maybe it would help if you simply elaborate what benefits would accrue
>>> from adding Morphline to DIH - as opposed to creating a separate module
>>> for Solr. I suppose it depends on whether you consider DIH a solid
>>> foundation or a weak link in Solr that desperately needs firming up.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message-----
>>> From: Alexandre Rafalovitch
>>> Sent: Sunday, June 8, 2014 1:40 AM
>>> To: [email protected]
>>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>>
>>>
>>> Well, it's the same core scenario as DIH supports (apart from actual
>>> data sources), but actively supported and developed by a company with
>>> a lot more investment in it. For the primitives supported, see
>>> http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
>>>
>>> We don't bundle ALL of these with Solr, but I think we do bundle core,
>>> solr-core and solr-cell packages, which is a good number and range of
>>> functionality (e.g. readMultiLine).
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
>>> <[email protected]> wrote:
>>>
>>>>
>>>> I do not know much about morphlines but I'd like to know what use-cases
>>>> would be possible/easier/faster with such an integration?
>>>>
>>>>
>>>> On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
>>>> <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> I had a preliminary look around and it might be possible to plug
>>>>> Morphline (already shipped with Solr) into DIH by creating a bridging
>>>>> EntityProcessor.
>>>>>
>>>>> Two questions:
>>>>> 1) Do people see value in it?
>>>>> 2) DIH is not very well supported, so any addition seems to be a bit
>>>>> stuck in a "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I
>>>>> don't want to suddenly be responsible for fixing the bridge before
>>>>> adding a standalone piece of code. So, if I write the code, how many
>>>>> general DIH externalities would I also have to address (e.g. lack of
>>>>> tests, etc)?
>>>>>
>>>>> Regards,
>>>>> Alex.
>>>>> P.s.
>>>>> Morphline could also be integrated in update request processor
>>>>> chain. So, that could be an alternative project.
>>>>>
>>>>> Personal website: http://www.outerthoughts.com/
>>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>>>> proficiency
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Shalin Shekhar Mangar.
>>>>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <[email protected]>
