> Anyway, nice brainstorming but I'm not even sure how feasible it would be to do pre-processing without the IndexWriter :)
Where/when is that pre-processing happening today?
IMO we must start and consider non-Lucene backends in all our plans.

2015-07-02 18:24 GMT+02:00 Sanne Grinovero <sa...@hibernate.org>:

> On 2 July 2015 at 12:50, Hardy Ferentschik <ha...@hibernate.org> wrote:
> > Hi,
> >
> > On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote:
> >> Ideally we should provide something similar to the Dynamic Analyzer
> >> feature but which also multiplexes an entity property into multiple
> >> fieldnames;
> >> for example
> >> property "title"
> >>   -> title_en & analyzer en
> >>   -> title_de & analyzer de
> >>
> >> The selection would work based on the Discriminator field, much like
> >> the current Dynamic Analyzer.
> >
> > That might be a possibility, even though I am not quite sure how exactly
> > this would look like. I would first need to dig in more into the existing code.
> > Do you have a more concrete idea on how this would look like?
>
> I did not sketch an implementation.
>
> >> Still, even if we were to find the bandwidth to make that, we'd need a
> >> deprecation path for the existing feature
> >
> > Well, in the above case we are not talking about deprecation right? It
> > would be more of a change in behavior and use!?
>
> Right, the above example would be quite different, for example queries
> would need to target the right field - and using the right analyzer -
> and that would need explicit user input.
> So to provide a deprecation path, we'd need a version which supports
> both approaches so that people can move from one to the other.. which
> implies keeping the existing model around for a little longer, which
> is problematic.
> In other words, discussing a better solution is good but doesn't avoid
> the need to keep the existing functionality around.
>
> >> > What's about the alternative to close the IndexWriter and re-open it?
> >> > Obviously this could be optimised, by storing the field to analyzer map
> >> > together with the open IndexWriter and only re-open if the mapping changes.
> >> > As long as the mapping is the same the same IndexWriter can be used.
> >> > This way we could keep the feature with a potential performance hit
> >> > for the people who are using it.
> >> > Still better than removing it, right? That said, what are the exact
> >> > performance impacts? Did you run a test?
> >>
> >> The performance impact is huge as it would prevent you from using both
> >> NRT and the new backend strategy to pack multiple blocks in commit
> >> cycles;
> >> that means the impact is in the 3 to 4 orders of magnitude in throughput.
> >
> > Might be still worth testing and prototyping.
> >
> >> I could apply your suggestion in practice if we go for setting a flag
> >> in the backend to change strategy, depending if any entity is using
> >> the Discriminator feature,
> >
> > That would work for me.
> >
> >> but beyond that we also have the problem of
> >> different entities sharing the same index but potentially using a
> >> different analyzer for the same named field... I'd agree with the
> >> Lucene developers that people should really not do it, but we support
> >> that today.
> >
> > Ok. In this case I am more inclined to enforce the same analyzer.
>
> Right, especially as we can detect the inconsistency at boot time and
> raise an appropriate warning.
> In this case I'd not expect a nice deprecation path as the existing
> usage (if any user did this) would have been problematic already.
>
> >> > How would that look like and did we not once discuss exactly the
> >> > opposite (aka letting even the Document be built on the master)?
> >>
> >> We discussed to not create a Document instance on the slave, to only
> >> serialize a custom serializable-friendly container, but that doesn't
> >> prevent you to pre-tokenize the text on the slaves.
> >> AFAIR we discussed to create a "master node" which doesn't need the
> >> user classes so that would be an easy to start service w/o need for
> >> much more than some configuration properties.. if you don't
> >> pre-tokenize this configuration would still need the classes to read
> >> our analyzer definitions from annotations.
> >
> > Ok, that is possible as well. I think we discussed both and I was indeed
> > referring to the approach where the master node would do the index using
> > user classes and the corresponding Search metadata.
> >
> > I like the solution you are referring to much better, since it also works
> > better with the ideas I have regarding the clustering of the index (eg with
> > RAFT). As you suggest, it would be beneficial if only the slaves would need
> > to know about the user classes.
>
> +1
>
> >> >> master/slave clustering approach, that would have several other
> >> >> benefits:
> >> >>  - move the analyzer work to the slaves
> >> >
> >> > Why is that a benefit?
> >>
> >> - removes the need to have the analyzer definitions on the master (see above).
> >
> > Ok, in the light of the above discussed solution, you would not need the
> > analyzers on the master node. Not sure whether this is such an important
> > thing though.
>
> Above you said you like it much better to not need the user classes on
> the master.
> We build the analyzers from the annotations on the user classes - not
> least we allow the user to provide custom analyzer implementations.
> So avoiding the need to have the analyzers on the master node is a
> pre-requisite to get rid of the user classes.
>
> >> - spreads out the CPU and memory allocations cost to each slave node:
> >> better scalability than have it all done on the master
> >
> > Well, one could also take the point of view that the slaves should do as
> > little as possible and let the master do the heavy lifting. It depends
> > really for what you are optimizing imo.
>
> Good point, one might prefer the opposite. But by decoupling the chain:
>    entity -> [tokenizing && indexwriting]
> into
>    entity -> tokenizing -> indexwriting
> Then you can easily provide an option to let the user make this choice
> about where you want the tokenizing to happen.
>
> I'd wager though that most will want to favour scalability, so I'd
> implement that first.
>
> >> >> - reduce the network payloads
> >> >
> >> > Really, is it actually not increasing payloads?
> >>
> >> I would expect so: a pre-filtered token sequence is usually smaller
> >> than the source text, often by a good margin.
> >
> > True, in the usual cases that is probably the case.
>
> I can only think of the opposite to happen in context of information
> enrichment, such as Apache UIMA or Stanbol, but in these cases the
> high level of computation would even more want you to choose for
> scalability, i.e. pre-process each case on the slave rather than
> killing your master nodes.
>
> Anyway, nice brainstorming but I'm not even sure how feasible it would
> be to do pre-processing without the IndexWriter :)
>
> Sanne
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev@lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev
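
To make the "entity -> tokenizing -> indexwriting" split from the thread a bit more concrete, here is a rough sketch of what the two decoupled stages could look like. It is an illustration only, under loud assumptions: it uses no Lucene or Hibernate Search API at all, every name in it (PreTokenizeSketch, TokenizedField, ToyIndex, STOP_WORDS) is hypothetical, and the per-language "analyzer" is reduced to lowercasing plus a stop word list. The slave-side step also multiplexes the property name with the discriminator value (title -> title_en / title_de), as suggested earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: none of these types exist in Hibernate Search or Lucene.
final class PreTokenizeSketch {

    // Serialization-friendly container a slave could send instead of the raw text.
    public static final class TokenizedField {
        public final String fieldName;
        public final List<String> tokens;

        public TokenizedField(String fieldName, List<String> tokens) {
            this.fieldName = fieldName;
            this.tokens = tokens;
        }
    }

    // Stand-in for per-language analyzers: one stop word list per discriminator value.
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
            "en", Set.of("the", "a", "of"),
            "de", Set.of("der", "die", "das"));

    // Slave side: pick the "analyzer" by discriminator and derive the multiplexed field name.
    public static TokenizedField tokenize(String property, String language, String text) {
        Set<String> stopWords = STOP_WORDS.getOrDefault(language, Set.of());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return new TokenizedField(property + "_" + language, tokens);
    }

    // Master side: a toy inverted index that only consumes tokens.
    // It needs neither analyzer definitions nor the user classes.
    public static final class ToyIndex {
        private final Map<String, Set<Integer>> postings = new TreeMap<>();

        public void write(int docId, TokenizedField field) {
            for (String token : field.tokens) {
                postings.computeIfAbsent(field.fieldName + ":" + token, k -> new TreeSet<>())
                        .add(docId);
            }
        }

        public Set<Integer> lookup(String fieldName, String token) {
            return postings.getOrDefault(fieldName + ":" + token, Set.of());
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.write(1, tokenize("title", "en", "The Art of Indexing"));
        index.write(2, tokenize("title", "de", "Die Kunst der Indexierung"));
        System.out.println(index.lookup("title_en", "indexing"));
        System.out.println(index.lookup("title_de", "kunst"));
    }
}
```

Whether a real IndexWriter can accept such pre-tokenized input is exactly the open question raised at the end of the quoted message, so the sketch says nothing about feasibility on the Lucene side. It only shows that once the stages are separated, the writing side depends on field names and tokens alone, and a query has to target the right multiplexed field.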