> I had written the "why" on HSEARCH-2616, but to clarify here: [...]
Thanks. So the problem is that we may not be able to update the batch state upon failure, in which case we would use the less-safe AddLuceneWork upon restart. If we had some way to store the information "this partition has started" *before* we even write to the index, this wouldn't be a problem, but as you might have guessed JSR-352 doesn't allow that.

So you're right, deleting everything before we even start working is our best solution. And thus a hidden field will be necessary.

I'll continue the discussion on JIRA.

Yoann Rodière
Hibernate NoORM Team
yo...@hibernate.org

On 27 April 2017 at 18:19, Sanne Grinovero <sa...@hibernate.org> wrote:

> On 27 April 2017 at 15:11, Yoann Rodiere <yo...@hibernate.org> wrote:
> > I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
> > field so that we can just use AddLuceneWorks everywhere, and run targeted
> > delete operations when we start a partition? If so, is it as a fallback
> > solution, if what I proposed cannot be implemented, or as a better
> > alternative? Note I don't have strong arguments against that solution, I'm
> > just trying to understand the "why".
>
> I had written the "why" on HSEARCH-2616, but to clarify here:
>
> I liked your idea of trying to figure out if the current block of work
> is being repeated, vs it being a re-try. However while I initially
> thought to add such a field as a fallback solution, I believe it's
> ultimately the more robust solution as otherwise you have to trust
> such state, which could be lost / wrong / corrupted independently for
> a number of reasons.
> Since the problem being solved is about resuming the process after a
> problem happened we can't make many safe assumptions about what kind
> of problem we're dealing with; for example if you run out of disk
> space you'll have a half-written index but no way to store such
> batch-state.
> Other problems might involve indexes being backed up /
> restored / replicated over other technologies (rsync, Infinispan, ...)
> so a mismatch between the index and other state is yet another problem
> which might need caution, logs and possibly tooling.
> Say an IO operation fails during an index write flush: some admin
> intervenes fixing hardware and then triggers resume of indexing.
> In such conditions I wouldn't trust some additional persistent state
> not even if it were cryptographically signed to be correct: corruption
> or signature mismatches could be detected but in this case there's the
> risk of it being trustworthy but out of date: with IO unavailable when
> this should have been written you're probably reading the previous
> version which had been written. Having an out of date batch state
> would likely have the opposite effect of what we need.
>
> On the other hand, inspecting what's in the index is coupled with the
> index state so while indexes could be corrupted, the progress tracking
> state and the index being one thing you're not easily fooled.
>
> Since I agree that having additional fields is not something everyone
> will like, as I suggested on HSEARCH-2616 we could offer the
> alternatives as fallback.
>
> > On adding a hidden field, I wonder what this will mean for Elasticsearch; if
> > we start doing such things, we should clearly and explicitly state in the
> > documentation that targeting existing ES schemas without adapting them to
> > Hibernate Search is not supported.
> > On top of that, it may hurt users upgrading Hibernate Search: Lucene may
> > simply ignore queries against a field that doesn't exist in the index, but
> > I'm not sure Elasticsearch behaves that way when the field isn't even
> > defined in the mapping. So users may have to upgrade their schema just for
> > that.
> > I know Elasticsearch integration is experimental anyway, but what I
> > mean is if we do that, it must be *before* we drop the
> > "experimental" mention on Elasticsearch integration.
>
> Good point. Such proposals to change some internal field don't happen
> very often though.
>
> We strive to have a stable encoding, but since the index is not the
> database, well-documented changes might be worth it.
> Especially "private internal" fields should not be too hard to manage
> as we can deal with them explicitly in some lenient way, and if they
> don't contain end user state like in this case we don't even have to
> require an index rebuild.
>
> For people not wanting this they can have a slower mass indexer, or
> not support recovery.
>
> Thanks,
> Sanne
>
> > Yoann Rodière
> > Hibernate NoORM Team
> > yo...@hibernate.org
> >
> > On 27 April 2017 at 15:59, Yoann Rodiere <yrodi...@redhat.com> wrote:
> >>
> >> I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
> >> field so that we can just use AddLuceneWorks everywhere, and run targeted
> >> delete operations when we start a partition? If so, is it as a fallback
> >> solution, if what I proposed cannot be implemented, or as a better
> >> alternative? Note I don't have strong arguments against that solution, I'm
> >> just trying to understand the "why".
> >>
> >> On adding a hidden field, I wonder what this will mean for Elasticsearch;
> >> if we start doing such things, we should clearly and explicitly state in the
> >> documentation that targeting existing ES schemas without adapting them to
> >> Hibernate Search is not supported.
> >> On top of that, it may hurt users upgrading Hibernate Search: Lucene may
> >> simply ignore queries against a field that doesn't exist in the index, but
> >> I'm not sure Elasticsearch behaves that way when the field isn't even
> >> defined in the mapping. So users may have to upgrade their schema just for
> >> that.
> >> I know Elasticsearch integration is experimental anyway, but what I
> >> mean is if we do that, it must be *before* we drop the
> >> "experimental" mention on Elasticsearch integration.
> >>
> >>
> >> Yoann Rodière
> >> Software Engineer, Hibernate NoORM Team
> >> Red Hat
> >> yrodi...@redhat.com
> >>
> >> On 27 April 2017 at 15:23, Sanne Grinovero <sa...@hibernate.org> wrote:
> >>>
> >>> To better implement recovery operations during MassIndexer
> >>> [HSEARCH-2616] - specifically in the context of the upcoming JBatch
> >>> based implementation - I'm considering the benefits of adding one more
> >>> field to the Lucene index for our internal purposes.
> >>>
> >>> This new field is only useful for Hibernate Search internals so we
> >>> shouldn't allow it to be targeted by queries, etc.
> >>>
> >>> There is a single precedent: we already encode the entity name, so
> >>> "hiding fields" is not a new problem that we have to deal with. It
> >>> might be a reason to polish the existing concept and improve the
> >>> encapsulation.
> >>>
> >>> Would anyone have a strong case against this?
> >>>
> >>> Thanks,
> >>> Sanne
> >>> _______________________________________________
> >>> hibernate-dev mailing list
> >>> hibernate-dev@lists.jboss.org
> >>> https://lists.jboss.org/mailman/listinfo/hibernate-dev