Re: [hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Sanne Grinovero Tue, 05 Mar 2013 11:05:58 -0800

Nice!
n+1 is something Hibernate Search has to deal with too; that's why I
was interested in the fetch profiles and graph loading in JPA 2.1.
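
For readers less familiar with the n+1 pattern mentioned here, the following is a minimal sketch of the cost difference. It assumes a purely hypothetical in-memory store that counts round trips; CountingStore, AssociationLoader and all method names are invented for illustration and are not Hibernate OGM or Hibernate Search APIs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names are hypothetical; this is not Hibernate OGM or Hibernate Search code.
class NPlusOneSketch {
    public static void main(String[] args) {
        CountingStore store = CountingStore.withOwners(5);

        AssociationLoader.loadOneByOne(store);
        System.out.println("one by one: " + store.roundTrips + " round trips");

        store.roundTrips = 0;
        AssociationLoader.loadBatched(store);
        System.out.println("batched: " + store.roundTrips + " round trips");
    }
}

/** Stand-in datastore that counts every simulated round trip. */
final class CountingStore {
    private final Map<Integer, String> addresses = new HashMap<>();
    private final List<Integer> ownerAddressIds = new ArrayList<>();
    int roundTrips = 0;

    /** Builds n owners, each associated with exactly one address. */
    static CountingStore withOwners(int n) {
        CountingStore store = new CountingStore();
        for (int i = 0; i < n; i++) {
            store.addresses.put(i, "address-" + i);
            store.ownerAddressIds.add(i);
        }
        return store;
    }

    /** One round trip returning, for every owner, the id of its address. */
    List<Integer> scanOwners() {
        roundTrips++;
        return new ArrayList<>(ownerAddressIds);
    }

    /** One round trip per address. */
    String getAddress(int id) {
        roundTrips++;
        return addresses.get(id);
    }

    /** One round trip for the whole batch of addresses. */
    Map<Integer, String> getAddresses(Collection<Integer> ids) {
        roundTrips++;
        Map<Integer, String> result = new HashMap<>();
        for (Integer id : ids) {
            result.put(id, addresses.get(id));
        }
        return result;
    }
}

final class AssociationLoader {
    /** The n+1 shape: one scan, then one lookup per owner. */
    static List<String> loadOneByOne(CountingStore store) {
        List<String> out = new ArrayList<>();
        for (Integer id : store.scanOwners()) {
            out.add(store.getAddress(id));
        }
        return out;
    }

    /** Two round trips regardless of n: one scan, one batched lookup. */
    static List<String> loadBatched(CountingStore store) {
        List<Integer> ids = store.scanOwners();
        Map<Integer, String> byId = store.getAddresses(ids);
        List<String> out = new ArrayList<>();
        for (Integer id : ids) {
            out.add(byId.get(id));
        }
        return out;
    }
}
```

With five owners the first strategy costs six round trips (one scan plus five lookups) and the second costs two, which is the kind of saving that fetch profiles or graph loading aim for.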


On 5 March 2013 17:44, Emmanuel Bernard <emman...@hibernate.org> wrote:
> I have implemented a solution that gives an entity based on a tuple.
> https://hibernate.onjira.com/browse/OGM-273#comment-50082
>
> Note that it does not currently work for MongoDB, but that's waiting
> for the dedicated GridDialect method as well as OGM-151.
> Also note that I have no idea how that will work for associations. I
> suspect some nasty n+1 is happening at best. Worst case, an exception :)
>
> Emmanuel
>
> On Tue 2013-03-05 10:30, Emmanuel Bernard wrote:
>> We might hope for a stable enough contract on Hibernate Search and
>> hope that we won't break serializability between micro or minor
>> versions. That will need to be taken into account in the test suite and
>> design.
>> On the OGM side though, we are not at that level of maturity and we will
>> force a homogeneous Hibernate OGM version across the whole cluster. The grid
>> will have to go down for upgrades, or enforce that no map reduce job
>> using OGM runs while the version rollout is in progress.
>>
>> Emmanuel
>>
>> On Mon 2013-03-04 18:30, Sanne Grinovero wrote:
>> > Found an example: this is all the code needed to have a MassIndexer
>> > working on top of Infinispan's Map/Reduce:
>> >
>> > https://github.com/infinispan/infinispan/blob/master/query/src/main/java/org/infinispan/query/impl/massindex/IndexingMapper.java#L40
>> >
>> > Note its initialize method, which injects the needed components; the
>> > implementation is serialized across nodes.
>> >
>> > Sanne
>> >
>> > On 4 March 2013 18:26, Sanne Grinovero <sa...@hibernate.org> wrote:
>> > > We finished this discussion on IRC, in case someone else was interested:
>> > >
>> > > <sanne> hum I forgot the first step.. transformation from entry into entity
>> > > <sanne> updated
>> > > <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
>> > > head against, but let's assume he finds a workaround and we focus on
>> > > the pattern as first step?
>> > > <emmanuel> https://gist.github.com/emmanuelbernard/5084039
>> > > <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
>> > > <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
>> > > mechanism you planned to use on consumes
>> > > <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
>> > > to be executed concurrently
>> > > <emmanuel> is that the GridDialect responsibility?
>> > > <emmanuel> That looks like a lot of work on the dialect's side
>> > > <sanne> emmanuel, imagine the backend is Infinispan and has some large
>> > > amount of data per node, plus that each node has its own backend
>> > > IndexManager (like an ideal sharding)
>> > > <emmanuel> ie pool mgt and cap +  queuing
>> > > <sanne> then with your approach the iterator needs to fetch data from
>> > > all remote nodes, and then enqueue in a local blocking queue which is
>> > > returning the data to the original owners
>> > > <sanne> but if you skip that step, you can just forward the stateless
>> > > consumer to each node and have it run with data locality
>> > > <emmanuel> I was thinking that if you had the Lucene index locally on
>> > > each node you would have a different impl of the MassIndexer anyways
>> > > <emmanuel> that would simply send a command to each local node
>> > > <sanne> To answer your question: that would be an optional GridDialect
>> > > responsibility. I would endorse a trivial first draft doing a
>> > > single-threaded loop.
>> > > <emmanuel> and have GridDialect.getDataFor() return local data
>> > > <sanne> The "consumes" implementation can be either implemented with a
>> > > simple iterator - as in your design - so I don't think it pushes much
>> > > complexity to the GridDialect implementor?
>> > > <sanne> The benefit of the consumer is that *optionally* it can be
>> > > mapped on the Map phase, and that's trivial if your backend supports
>> > > Map/Reduce
>> > > <emmanuel> sanne: I don't follow that sorry
>> > > <emmanuel> how does that make it mappable to the Map phase?
>> > > <sanne> "public void consume(Entry e) " is a degenerate (simplified)
>> > > form of map.
>> > > <sanne> mm infinispan IDE crashes at the right moment.
>> > > <emmanuel> I thought Map was about *filtering*
>> > > <emmanuel> not processing
>> > > <sanne> you can decide to accept 100% of values (without filtering),
>> > > but actually you might want to filter on the specified tables only.
>> > > <sanne> also, the return type doesn't have to match the input type:
>> > > hence you define a transformation function, which is inherently
>> > > applied in parallel on all matching entries.
>> > > <emmanuel> sanne: but then you require the OGM code to be everywhere
>> > > (ie on each node of the target NoSQL)
>> > > <emmanuel> to be able to do tuple -> entity
>> > > <emmanuel> that's not realistic
>> > > <emmanuel> assuming your transform phase is about tuple -> entity and
>> > > some HSearch ops
>> > > <sanne> yes right
>> > > <sanne> but isn't it worth it? it's optional and much more efficient,
>> > > as you avoid transferring any data.
>> > > <sanne> btw we often assume all nodes in the grid are equally
>> > > configured, so having same apps & libraries deployed.
>> > > <emmanuel> sanne: let me try and summarize what I understand
>> > > <emmanuel> it's more efficient if you store the Lucene index locally
>> > > with the data, and if the grid is written in Java or at least can run
>> > > code in Java including libraries and if you distribute the OGM
>> > > configuration across the whole grid
>> > > <emmanuel> Otherwise, it does not make any difference
>> > > <emmanuel> Also the GridDialect implementation needs to know if you are
>> > > doing this trick to only return local data
>> > > <sanne> no there are other drawbacks which get defeated, but minor so
>> > > I didn't mention them
>> > > <emmanuel> am I right?
>> > > <sanne> mainly, you skip the need for the contention point as there
>> > > is no push to a shared blocking queue
>> > > <sanne> no the GridDialect doesn't need to know.
>> > > <emmanuel> sanne: sure if you can process the code on each node you
>> > > avoid the shared blocking queue, at least until you reach the
>> > > IndexManager
>> > > <sanne> you'll just forward a simple (standard) M/R task, and it will
>> > > need to execute it as always.
>> > > <sanne> the IndexManager is parallel ;)
>> > > <emmanuel> sanne: parallel on a single node
>> > > <sanne> yes, but no contention points other than the internal
>> > > structure of the IW
>> > > <emmanuel> I mean updating the index for a given table is better done
>> > > on a single node
>> > > <sanne> IndexWriter
>> > > <emmanuel> sorry I meant IndexWriter
>> > > <emmanuel> ah but you mention perfect sharding
>> > > <emmanuel> you need cosmological alignment for this shit to happen
>> > > <sanne> not if we plan for it :)
>> > > <sanne> you might remember the changes to Segments in the ISPN code,
>> > > to accommodate index storage consistent with the data locality
>> > > <sanne> that's expected in 6.0
>> > > <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
>> > > <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
>> > > <emmanuel> as you need to send the Consumer impl
>> > > <emmanuel> not simply use it
>> > > <sanne> hu, it needs a reference to the current SearchFactory at the very least
>> > > <emmanuel> sanne: but you're telling me you send the M/R task
>> > > <emmanuel> so you need to send the M/R code as well
>> > > <sanne> yes but here we enter Infinispan-specific implementation
>> > > <sanne> I would register the needed components in Infinispan and use
>> > > the ServiceRegistry to look them up remotely
>> > > <sanne> not to mention Infinispan could accommodate a custom command for it
>> > > <emmanuel> What I am saying is that you don't pass the Consumer
>> > > *instance* to the grid dialect but rather the impl, no?
>> > > <sanne> the impl class definition?
>> > > <emmanuel> sanne: you tell me. How do I send M/R code today?
>> > > <emmanuel> certainly not an impl instance
>> > > <sanne> yes you do
>> > > <sanne> JBMar will take care of it, including state.
>> > > <sanne> but in this case that would be wrong of course as I don't want
>> > > to serialize the whole SearchFactory so I'd use injection and lookup,
>> > > but that's a detail of Infinispan.
>> > > <sanne> But this shouldn't be MassIndexer specific right? it's good to
>> > > expose a general "execute on all" method, and I think accepting
>> > > instances would make life easier for most - even though we might need
>> > > to document some limitations.
>> > > <emmanuel> alright, I guess I'll have to live with a visitor pattern
>> > > for a feature that has 5% chance of happening :)
>> > > <sanne> I'm going to punch Davide
>> > > <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
>> > > to write it down :)
>> > > <emmanuel> sanne: DavideD would have nothing to do with it; that
>> > > requires a lot of config and Infinispan machinery I'm not sure is here
>> > > today
>> > > <DavideD> :)
>> > > <emmanuel> ah
>> > > <emmanuel> I don't care how it's called, it's one of those patterns
>> > > that make the code harder to follow
>> > > <DavideD> I was actually trying to remember the name of the pattern
>> > > <sanne> ok now we agree :)
>> > > <emmanuel> Obfuscator pattern family
>> > > <sanne> very popular among consultants, I don't understand why you complain :P
>> > > <sanne> Anyway, let's wrap up and broaden the horizon:
>> > > <emmanuel> ok so we are left with finding how to load an entity from a tuple
>> > > <sanne> you don't think it's useful as a general purpose method?
>> > > <emmanuel> sanne: will be for queries
>> > > <emmanuel> It's just that it's non obvious
>> > > <sanne> Exactly. Also I think lambda methods are getting widely better known.
>> > > <emmanuel> syntactically yes
>> > > <emmanuel> VM wise, perf improvements will come later
>> > > <sanne> what I mean is that by defining the SPI this way, I don't
>> > > expect it to be more complex for the GridDialect implementors, while
>> > > we can reuse it for a wider scope of needs.
>> > >
>> > >  --Sanne
>> > >
>> > > On 4 March 2013 17:02, Emmanuel Bernard <emman...@hibernate.org> wrote:
>> > >>
>> > >>
>> > >> On 4 mars 2013, at 17:39, Sanne Grinovero <sa...@hibernate.org> wrote:
>> > >>
>> > >>> On 4 March 2013 16:20, Emmanuel Bernard <emman...@hibernate.org> wrote:
>> > >>>> I already gave what I knew on how to load an entity from a tuple (which
>> > >>>> isn't much) but we can try and dig together. Something I thought about
>> > >>>> is that ORM probably has a mechanism to load an entity from a resultset
>> > >>>> via the query parser. And that probably also looks like the second half
>> > >>>> of OgmLoader.load. We could look at this part and see if we can make an
>> > >>>> OGM version of it. We never had the need before as we never had query
>> > >>>> support (the way SQL does it).
>> > >>>
>> > >>> I would also need to study the ORM code, but to add a high-level
>> > >>> observation: the methods currently defined by the GridDialect focus on
>> > >>> loading from well-known key instances;
>> > >>> there is nothing that lets us scan/inspect all values.
>> > >>>
>> > >>> In other words: even if we wanted to load keys first, we don't have
>> > >>> definitions of functions from raw->primary key instances either.
>> > >>
>> > >> I understand that. I'm not denying the need for the method.
>> > >>
>> > >>>
>> > >>>
>> > >>>> On the visitor vs Iterator approach, I still don't see how implementing
>> > >>>> an Iterator on a map / reduce backend would be harder than the visitor
>> > >>>> but maybe I'm missing something.
>> > >>>>
>> > >>>>    class IteratorAsStream {
>> > >>>>        final Query someMapReduceQuery = ...;
>> > >>>>
>> > >>>>        public Object next() {
>> > >>>>            if (!someMapReduceQuery.started()) {
>> > >>>>                // execute and collect results in parallel
>> > >>>>                someMapReduceQuery.execute();
>> > >>>>            }
>> > >>>>            Object result = someMapReduceQuery.getNextOrBlock();
>> > >>>>            return result;
>> > >>>>        }
>> > >>>>    }
>> > >>>
>> > >>> That could work to *load* all entities in parallel, but I'd like to
>> > >>> process the entities in parallel as well.
>> > >>> And I'd rather not force the GridDialect implementors to write some
>> > >>> Hibernate Search specific code,
>> > >>> so to break out we need some form of "Execute X on each": a closure or
>> > >>> a lambda.
>> > >>>
>> > >>
>> > >> I can't see how the visitor model helps in your processing of entities
>> > >> in parallel. To me both approaches are strictly equivalent. Care to
>> > >> show some pseudo-code?
>> _______________________________________________
>> hibernate-dev mailing list
>> hibernate-dev@lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
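
The "execute X on each" idea debated in the quoted IRC log can be sketched roughly as below. This is only an illustration under assumptions, not the Hibernate OGM SPI: SketchTuple, TupleConsumer, SketchGridDialect and forEachTuple are hypothetical names, and the "grid" is simulated with one in-memory list per node. The point is that the same consumer instance works with a trivial single-threaded loop and with an implementation that runs it next to each node's local data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout: this is not the real Hibernate OGM GridDialect SPI.
class ConsumerSpiSketch {
    public static void main(String[] args) {
        List<SketchTuple> nodeA = Arrays.asList(new SketchTuple("Person", 1), new SketchTuple("Order", 2));
        List<SketchTuple> nodeB = Arrays.asList(new SketchTuple("Person", 3));

        CollectingConsumer consumer = new CollectingConsumer();
        new PartitionedDialect(Arrays.asList(nodeA, nodeB)).forEachTuple(consumer, "Person");
        System.out.println(consumer.seen.size() + " Person tuples consumed");
    }
}

/** Stand-in for a datastore tuple: just a table name and an id. */
final class SketchTuple {
    final String table;
    final int id;

    SketchTuple(String table, int id) {
        this.table = table;
        this.id = id;
    }
}

/** "Execute X on each": knows nothing about the dialect that calls it. */
interface TupleConsumer {
    void consume(SketchTuple tuple);
}

/** Hypothetical dialect contract: feed every tuple of the given tables to the consumer. */
interface SketchGridDialect {
    void forEachTuple(TupleConsumer consumer, String... tables);
}

/** The trivial first draft: a single-threaded loop over all data. */
final class SingleThreadedDialect implements SketchGridDialect {
    private final List<SketchTuple> data;

    SingleThreadedDialect(List<SketchTuple> data) {
        this.data = data;
    }

    @Override
    public void forEachTuple(TupleConsumer consumer, String... tables) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tables));
        for (SketchTuple tuple : data) {
            if (wanted.contains(tuple.table)) {
                consumer.consume(tuple);
            }
        }
    }
}

/** Simulated grid: each inner list is one node's local data, visited in parallel. */
final class PartitionedDialect implements SketchGridDialect {
    private final List<List<SketchTuple>> nodes;

    PartitionedDialect(List<List<SketchTuple>> nodes) {
        this.nodes = nodes;
    }

    @Override
    public void forEachTuple(TupleConsumer consumer, String... tables) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tables));
        nodes.parallelStream().forEach(local -> {
            for (SketchTuple tuple : local) {
                if (wanted.contains(tuple.table)) {
                    consumer.consume(tuple); // the consumer must be thread-safe here
                }
            }
        });
    }
}

/** Thread-safe example consumer; a real one would turn tuples into entities and index them. */
final class CollectingConsumer implements TupleConsumer {
    final Set<String> seen = ConcurrentHashMap.newKeySet();

    @Override
    public void consume(SketchTuple tuple) {
        seen.add(tuple.table + "#" + tuple.id);
    }
}
```

An Iterator-based contract would instead pull every tuple back to the caller before any processing could start, which is the data transfer the consumer form tries to avoid. The sketch leaves out how a consumer would be shipped to remote nodes or obtain its dependencies there, which the thread treats as backend-specific.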
