Re: Adding Morphline support to DIH - worth the effort?

Alexandre Rafalovitch Sun, 08 Jun 2014 20:29:22 -0700

Not to toot my own horn, but I already had a go at the larger-scale
discussion on the heliosearch-dev list:
https://groups.google.com/forum/#!searchin/heliosearch-dev/dih/heliosearch-dev/XVyDsELkOAU/ntM2HgK5p6YJ
(sorry, no fully public link, have to join). But the initial question
was:


```
For the next iteration, what are the thoughts on DIH? My understanding
was that it is no longer actively maintained and is kind of awkward
and limited compared to the current requirements.

Is there a plan to replace it with other open-source libraries that
have more users? The ones I can think of are:
*) Morphline opensourced by Cloudera:
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html
*) Apache Camel (http://camel.apache.org/)
*) Spring XD ( http://projects.spring.io/spring-xd/ )

All of the above have some sort of Solr integration already.

Another option is to kill DIH and instead put the effort into
maintaining strong Solr endpoints for one/more popular products like
the ones above.
```

Many interesting points, but then a dead end. Somebody has to bite the
bullet and do the first full implementation. But there is no consensus
on which way to go, and DIH refactoring ties into bigger problems
(again, components/modules, etc.). And it does not feel like a small
project that one person can implement as a part-time effort.

So, this thread was an attempt to move the conversation forward with
a smaller, more achievable goal. But I am more than happy for the big
discussion to happen, as long as there is some sort of Next Action, as
opposed to just ripples in the pond.

My personal opinion on the higher-level discussion is that, to make a
separate DIH or third-party client a success, we would ideally need to
provide more recognition and support to the third-party Solr
clients/libraries (of which there are many:
https://twitter.com/SolrStart/status/475814979636314113 ).

A cool/smart thing would be to create a sort of externally facing test
suite (possibly descriptive), combined with a possibly separate
mailing list for Solr client implementers. Then, new features and Solr
changes would be propagated to those external implementers so they can
consider how the changes affect their own libraries.
For example, at least one popular Solr client library will now
absolutely NOT work with Solr 4.8 because it tries to read the schema
and fails with the new flat fields/types. The fix is two lines, but
implementers need to be notified, think it through and maybe even have
a forum to discuss the impact.
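
To illustrate the kind of breakage meant here, below is a minimal
sketch of a layout-tolerant schema reader. It assumes that "flat" means
<field> and <fieldType> elements sitting directly under <schema>
instead of inside <fields>/<types> wrappers. The schema snippets and
the read_schema helper are hypothetical and are not taken from any
actual client library; a reader that insists on the wrapper elements
would fail on the flat form, while a descendant search handles both.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema in the older, nested layout (illustration only).
NESTED_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
  </fields>
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text_general" class="solr.TextField"/>
  </types>
</schema>
"""

# The same hypothetical schema in the flat layout: no wrapper elements.
FLAT_SCHEMA = """
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
</schema>
"""


def read_schema(xml_text):
    """Return ({field name: type name}, {type name: class}) for either layout."""
    root = ET.fromstring(xml_text.strip())
    # iter() walks all descendants, so it does not matter whether the
    # elements are wrapped in <fields>/<types> or sit directly under <schema>.
    fields = {el.get("name"): el.get("type") for el in root.iter("field")}
    types = {el.get("name"): el.get("class") for el in root.iter("fieldType")}
    return fields, types


if __name__ == "__main__":
    for label, text in (("nested", NESTED_SCHEMA), ("flat", FLAT_SCHEMA)):
        fields, types = read_schema(text)
        print(label, sorted(fields), sorted(types))
```

A shared, externally facing test suite could carry pairs of fixtures
like these, so that client implementers notice a layout change before
their users do.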

A big ask, I know. But, to me, the current situation, where even sister
Apache projects' Solr connections are broken - often in trivial ways -
is sad. It also hinders adoption, with preferences often going
to the shinier, more coherent _promise_ of Elasticsearch.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Jun 9, 2014 at 4:51 AM, Mikhail Khludnev
<[email protected]> wrote:
> Jack,
> I found your considerations quite reasonable.
> One of the ideas over DIH discussed earlier is making it standalone. So, if
> we start from simple Morphline UI, we can do this extraction. Then, such
> externalized ETL, will work better with Solr Cloud than DIH works now.
> Presumably we can reuse DIH Jdbc Datasources as a source for Morphline
> records.
> Still open questions in this approach are:
> - joins/caching - seem possible with Morphlines, but there is still no
> such command
> - delta import - a scenario we must not forget to handle
> - threads (completely outside Morphline's concerns)
> - distributed processing - it would be great if we could partition the
> datasource, e.g. something like what Sqoop does
> ... what else?
>
>
> On Sun, Jun 8, 2014 at 6:54 PM, Jack Krupansky <[email protected]>
> wrote:
>>
>> I've avoided DIH like the plague since it really doesn't fit well in Solr,
>> so I'm still baffled as to why you think we need to use DIH as the
>> foundation for a Solr Morphlines project. That shouldn't stop you, but
>> what's the big impediment to taking a clean slate approach to Morphlines -
>> learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
>> that is not burdened from the get-go with all of DIH's baggage?
>>
>> Configuring DIH is one of its main problems, so blending Morphlines config
>> into DIH config would seem to just make Morphlines less attractive than it
>> actually is when viewed by itself.
>>
>> You might also consider how ManifoldCF (another Apache project) would
>> integrate with DIH and Morphlines as well. I mean, the core use case is ETL
>> from external data sources. And how all of this relates to Apache Flume as
>> well.
>>
>> But back to the original, still unanswered, question: Why use DIH as the
>> starting point for integrating Morphlines with Solr - unless the goal is to
>> make Morphlines unpalatable and less approachable than even DIH itself?!
>>
>> Another question: What does Elasticsearch have in this area (besides
>> "rivers")? Are they headed in the Morphlines direction as well?
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Alexandre Rafalovitch
>> Sent: Sunday, June 8, 2014 10:16 AM
>>
>> To: [email protected]
>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>
>> I see DIH as something that offers a quick way to get things done, as
>> long as they fit into DIH's couple of basic scenarios. Going even a
>> little beyond hits bugs, bad documentation, inconsistencies and lack
>> of ongoing support (e.g. SOLR-4383).
>>
>> So, if it works for you - great. If it does not - too bad, use SolrJ.
>> And given what I observe, I believe the next round of improvements
>> might be easier to achieve by moving to a different open-source pipe
>> project than trying to keep reinventing and bandaging one of our own.
>> Go where strongest community is, etc.
>>
>> Morphline can be seen as a replacement for DIH's EntityProcessors and
>> Transformers (Flume adds other bits). The reasons I think it is worth
>> looking at are as follows:
>> 1) DIH is not really being maintained or further improved. So, the
>> list of EP and Transformers is the same and does not account for new
>> requests (which we see periodically on the mailing list); even the new
>> implementations get stuck in JIRA (see the JIRA in original email)
>> 2) It's not terribly well documented either, so people are always
>> struggling to understand how the entity is actually generated and what
>> happens when things go wrong
>> 3) We are already bundling Morphline jars with Solr. But we are NOT
>> using them in any way useful to a non-Hadoop Solr user. Which begs the
>> question why did we add them (one answer I guess: because we don't
>> have module system).
>> 4) Morphlines have more primitives than DIH and the available list keeps
>> growing
>> 5) What separate module for Solr? We have no discovery method for
>> modules. Writing one for general consumption is like trying to sing in
>> a vacuum - the problem is a lot bigger than with an individual offering.
>>
>> In terms of implementation, I think it would take defining a custom
>> MorphlineEntityProcessor which basically plugs into DIH's current
>> DataSources. So, one could use, for example, DIH SqlDataSource to get
>> a list of files and then hand off to Morphline's black box to parse
>> those files into records (e.g. multiline records), augment them, etc.
>> Then, at the end, this gets handed back to DIH to finish it up. I
>> think this would work even with nested entities and transformers. The
>> Admin UI should also work.
>>
>> Eventually, I think we need a harder discussion about DIH, so this
>> partial handover could be a way to test the waters.
>>
>> Does this make more sense?
>>
>> Regards,
>>   Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
>> wrote:
>>>
>>> It sounds more like an alternative to DIH rather than an incremental
>>> add-on
>>> to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
>>>
>>> So, back to Shalin's question, which specific (please detail!) use cases
>>> of
>>> DIH are enhanced by Morphline?
>>>
>>> Maybe it would help if you simply elaborate what benefits would accrue to
>>> adding Morphline to DIH - as opposed to creating a separate module for
>>> Solr.
>>> I suppose it depends on whether you consider DIH a solid foundation or a
>>> weak link in Solr that desperately needs firming up.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Alexandre Rafalovitch
>>> Sent: Sunday, June 8, 2014 1:40 AM
>>> To: [email protected]
>>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>>
>>>
>>> Well, it's the same core scenario as DIH supports (apart from actual
>>> data sources), but actively supported and developed by a company with
>>> a lot more investment in it. For the primitives supported, see
>>>
>>> http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
>>>
>>> We don't bundle ALL of these with Solr, but I think we do bundle core,
>>> solr-core and solr-cell packages, which is a good number and range of
>>> functionality (e.g. readMultiLine).
>>>
>>> Regards,
>>>   Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> I do not know much about morphlines but I'd like to know what use-cases
>>>> would be possible/easier/faster with such an integration?
>>>>
>>>>
>>>> On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
>>>> <[email protected]>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> I had a preliminary look around and it might be possible to plug
>>>>> Morphline (already shipped with Solr) into DIH by creating a bridging
>>>>> EntityProcessor.
>>>>>
>>>>> Two questions:
>>>>> 1) Do people see value in it?
>>>>> 2) DIH is not very supported, so any addition seems to be a bit stuck
>>>>> in "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I don't
>>>>> want to suddenly be responsible for fixing the bridge before adding a
>>>>> standalone piece of code. So, if I write the code, how many general
>>>>> DIH externalities would I also have to address (e.g. lack of tests,
>>>>> etc)?
>>>>>
>>>>> Regards,
>>>>>    Alex.
>>>>> P.s. Morphline could also be integrated in update request processor
>>>>> chain. So, that could be an alternative project.
>>>>>
>>>>> Personal website: http://www.outerthoughts.com/
>>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>>>> proficiency
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Shalin Shekhar Mangar.
>>>
>>>
>>>
>>>
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>

