Found an example; this is all the code needed to get a MassIndexer working on top of Infinispan's Map/Reduce:
https://github.com/infinispan/infinispan/blob/master/query/src/main/java/org/infinispan/query/impl/massindex/IndexingMapper.java#L40

Note its initialize method, which injects the needed components; the implementation is serialized across nodes.

Sanne

On 4 March 2013 18:26, Sanne Grinovero <sa...@hibernate.org> wrote:
> We finished this discussion on IRC, in case someone else was interested:
>
> <sanne> hum I forgot the first step.. transformation from entry into entity
> <sanne> updated
> <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his head against, but let's assume he finds a workaround and we focus on the pattern as first step?
> <emmanuel> https://gist.github.com/emmanuelbernard/5084039
> <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
> <emmanuel> assuming pushToExecutor pushes to whatever concurrent work mechanism you planned to use on consumes
> <emmanuel> Plus I am not following exactly how you plan consumes(Entry) to be executed concurrently
> <emmanuel> is that the GridDialect responsibility?
> <emmanuel> That looks like a lot of work on the dialect's side
> <sanne> emmanuel, imagine the backend is Infinispan and has some large amount of data per node, plus that each node has its own backend IndexManager (like an ideal sharding)
> <emmanuel> ie pool mgt and cap + queuing
> <sanne> then with your approach the iterator needs to fetch data from all remote nodes, and then enqueue in a local blocking queue which is returning the data to the original owners
> <sanne> but if you skip that step, you can just forward the stateless consumer to each node and have it run on data locality
> <emmanuel> I was thinking that if you had the Lucene index locally on each node you would have a different impl of the MassIndexer anyways
> <emmanuel> that would simply send a command to each local node
> <sanne> To answer your question: that would be an optional GridDialect responsibility.
> I would endorse a trivial first draft doing a single-threaded loop.
> <emmanuel> and have GridDialect.getDataFor() return local data
> <sanne> The "consumes" implementation can be implemented with a simple iterator - as in your design - so I don't think it pushes much complexity to the GridDialect implementor?
> <sanne> The benefit of the consumer is that *optionally* it can be mapped on the Map phase, and that's trivial if your backend supports Map/Reduce
> <emmanuel> sanne: I don't follow that, sorry
> <emmanuel> how does that make it mappable to the Map phase?
> <sanne> "public void consume(Entry e)" is a degenerate (simplified) form of map.
> <sanne> mm infinispan IDE crashes at the right moment.
> <emmanuel> I thought Map was about *filtering*
> <emmanuel> not processing
> <sanne> you can decide to accept 100% of values (without filtering), but actually you might want to filter on the specified tables only.
> <sanne> also, the return type doesn't have to match the input type: hence you define a transformation function, which is inherently applied in parallel on all matching entries.
> <emmanuel> sanne: but then you require the OGM code to be everywhere (ie on each node of the target NoSQL)
> <emmanuel> to be able to do tuple -> entity
> <emmanuel> that's not realistic
> <emmanuel> assuming your transform phase is about tuple -> entity and some HSearch ops
> <sanne> yes right
> <sanne> but isn't it worth it? it's optional and much more efficient, as you avoid transferring any data.
> <sanne> btw we often assume all nodes in the grid are equally configured, so having same apps & libraries deployed.
> <emmanuel> sanne: let me try and summarize what I understand
> <emmanuel> it's more efficient if you store the Lucene index locally with the data, and if the grid is written in Java or at least can run code in Java including libraries and if you distribute the OGM configuration across the whole grid
> <emmanuel> Otherwise, it does not make any difference
> <emmanuel> Also the GridDialect implementation needs to know if you are doing this trick to only return local data
> <sanne> no there are other drawbacks which get defeated, but minor so I didn't mention them
> <emmanuel> am I right?
> <sanne> mainly, you skip the need for the contention point as there is no push to a shared blocking queue
> <sanne> no the GridDialect doesn't need to know.
> <emmanuel> sanne: sure if you can process the code on each node you avoid the shared blocking queue, at least until you reach the IndexManager
> <sanne> you'll just forward a simple (standard) M/R task, and it will need to execute it as always.
> <sanne> the IndexManager is parallel ;)
> <emmanuel> sanne: parallel on a single node
> <sanne> yes, but no contention points other than the internal structure of the IW
> <emmanuel> I mean updating the index for a given table is better done on a single node
> <sanne> IndexWriter
> <emmanuel> sorry I meant IndexWriter
> <emmanuel> ah but you mention perfect sharding
> <emmanuel> you need cosmological alignment for this shit to happen
> <sanne> not if we plan for it :)
> <sanne> you might remember the changes to Segments in the ISPN code, to accommodate index storage consistent with the data locality
> <sanne> that's expected in 6.0
> <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
> <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
> <emmanuel> as you need to send the Consumer impl
> <emmanuel> not simply use it
> <sanne> hu, it needs a reference to the current SearchFactory at very least
> <emmanuel> sanne: but you're telling me you send the M/R task
> <emmanuel> so you need to send the M/R code as well
> <sanne> yes but here we enter Infinispan specific implementation
> <sanne> I would register the needed components in Infinispan and use the ServiceRegistry to look them up remotely
> <sanne> not to mention Infinispan could accommodate a custom command for it
> <emmanuel> What I am saying is that you don't pass the Consumer *instance* to the grid dialect but rather the impl, no?
> <sanne> the impl class definition?
> <emmanuel> sanne: you tell me. How do I send M/R code today?
> <emmanuel> certainly not an impl instance
> <sanne> yes you do
> <sanne> JBMar will take care of it, including state.
> <sanne> but in this case that would be wrong of course as I don't want to serialize the whole SearchFactory so I'd use injection and lookup, but that's a detail of Infinispan.
> <sanne> But this shouldn't be MassIndexer specific right? it's good to expose a general "execute on all" method, and I think accepting instances would make life easier for most - even though we might need to document some limitations.
> <emmanuel> alright, I guess I'll have to live with a visitor pattern for a feature that has 5% chance of happening :)
> <sanne> I'm going to punch Davide
> <sanne> as he's yelling "it's not a visitor" but doesn't have the guts to write it down :)
> <emmanuel> sanne: DavideD would have nothing to do with it, that requires a lot of config and Infinispan machinery I'm not sure is here today
> <DavideD> :)
> <emmanuel> ah
> <emmanuel> I don't care how it's called, it's one of those patterns that make the code harder to follow
> <DavideD> I was actually trying to remember the name of the pattern
> <sanne> ok now we agree :)
> <emmanuel> Obfuscator pattern family
> <sanne> very popular among consultants, I don't understand why you complain :P
> <sanne> Anyway, let's wrap up and broaden the horizon:
> <emmanuel> ok so we are left with finding how to load an entity from a tuple
> <sanne> you don't think it's useful as a general purpose method?
> <emmanuel> sanne: will be for queries
> <emmanuel> It's just that it's non obvious
> <sanne> Exactly. Also I think lambda methods are getting more widely known.
> <emmanuel> syntactically yes
> <emmanuel> VM wise, perf improvements will come later
> <sanne> what I mean is that by defining the SPI this way, I don't expect it to be more complex for the GridDialect implementors, while we can reuse it for a wider scope of needs.
>
> --Sanne
>
> On 4 March 2013 17:02, Emmanuel Bernard <emman...@hibernate.org> wrote:
>>
>>
>> On 4 March 2013, at 17:39, Sanne Grinovero <sa...@hibernate.org> wrote:
>>
>>> On 4 March 2013 16:20, Emmanuel Bernard <emman...@hibernate.org> wrote:
>>>> I already gave what I knew on how to load an entity from a tuple (which isn't much) but we can try and dig together. Something I thought about is that ORM probably has a mechanism to load an entity from a resultset via the query parser. And that probably looks also like the second half of OgmLoader.load.
>>>> We could look at this part and see if we can make an OGM version of it. We never had the need before as we never had query support (the way SQL does it).
>>>
>>> I would also need to study the ORM code, but to add a high level observation, the methods currently defined by the GridDialect are focusing on loading from well known key instances, there is nothing that makes us able to scan/inspect for all values.
>>>
>>> In other words: even if we wanted to load keys first, we don't have definitions of functions from raw->primary key instances either.
>>
>> I understand that. I'm not denying the need for the method.
>>
>>>
>>>
>>>> On the visitor vs Iterator approach, I still don't see how implementing an Iterator on a map / reduce backend would be harder than the visitor but maybe I'm missing something.
>>>>
>>>> class IteratorAsStream {
>>>>     final Query someMapReduceQuery = ...;
>>>>
>>>>     public Object next() {
>>>>         if (!someMapReduceQuery.started()) {
>>>>             // execute and collect results in parallel
>>>>             someMapReduceQuery.execute();
>>>>         }
>>>>         Object result = someMapReduceQuery.getNextOrBlock();
>>>>         return result;
>>>>     }
>>>> }
>>>
>>> That could work to *load* all entities in parallel, but I'd like to process the entities in parallel as well.
>>> And I'd rather not force the GridDialect implementors to write some Hibernate Search specific code, so to break out we need some form of "Execute X on each": a closure or a lambda.
>>>
>>
>> I can't see how the visitor model helps in your processing of entities in parallel. To me both approaches are strictly equivalent. Care to show some pseudo-code?
_______________________________________________
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev
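
For reference, the "Execute X on each" idea from the thread can be sketched in plain Java roughly as follows. This is only a minimal sketch under assumptions: `TupleSource`, `forEachTuple` and `Tuple` are hypothetical names made up for illustration and are not the actual GridDialect SPI, and the parallel variant uses a local parallel stream as a stand-in for a Map/Reduce task running next to the data on each node.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

class ConsumerSketch {

    /** Hypothetical stand-in for a datastore tuple (not an OGM type). */
    static final class Tuple {
        final String table;
        final int id;

        Tuple(String table, int id) {
            this.table = table;
            this.id = id;
        }
    }

    /** Hypothetical "execute X on each" contract (not the real GridDialect SPI). */
    interface TupleSource {
        void forEachTuple(Consumer<Tuple> consumer, String... tables);
    }

    /** The "trivial first draft": a single-threaded loop over the data. */
    static final class LoopingSource implements TupleSource {
        private final List<Tuple> tuples;

        LoopingSource(List<Tuple> tuples) {
            this.tuples = tuples;
        }

        @Override
        public void forEachTuple(Consumer<Tuple> consumer, String... tables) {
            Set<String> wanted = new HashSet<>(Arrays.asList(tables));
            for (Tuple t : tuples) {
                if (wanted.contains(t.table)) {
                    consumer.accept(t);
                }
            }
        }
    }

    /**
     * Same contract, but the store decides to run the consumer concurrently.
     * The consumer must therefore be thread-safe. A parallel stream stands in
     * here for a distributed Map phase.
     */
    static final class ParallelSource implements TupleSource {
        private final List<Tuple> tuples;

        ParallelSource(List<Tuple> tuples) {
            this.tuples = tuples;
        }

        @Override
        public void forEachTuple(Consumer<Tuple> consumer, String... tables) {
            Set<String> wanted = new HashSet<>(Arrays.asList(tables));
            tuples.parallelStream()
                  .filter(t -> wanted.contains(t.table))
                  .forEach(consumer);
        }
    }

    public static void main(String[] args) {
        List<Tuple> data = Arrays.asList(
                new Tuple("Book", 1), new Tuple("Author", 2), new Tuple("Book", 3));
        List<TupleSource> sources =
                Arrays.asList(new LoopingSource(data), new ParallelSource(data));
        for (TupleSource source : sources) {
            // Thread-safe sink standing in for "index this entity".
            Queue<Integer> indexed = new ConcurrentLinkedQueue<>();
            source.forEachTuple(t -> indexed.add(t.id), "Book");
            System.out.println(source.getClass().getSimpleName()
                    + " consumed " + indexed.size() + " tuples");
        }
    }
}
```

The caller's code is identical in both cases; only the source decides whether the consumer runs in a simple loop or concurrently, which is the point made in the thread about not pushing extra complexity onto the dialect implementor.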