Hi, I'm doing research on indexing documents that are generated from templates. I don't have exact statistics yet, but I'm estimating that in the typical case 90% of a document is the same across all instances and the other 10% is dynamic (although it could certainly be closer to 10/90). Because these documents can be rather large and there could be millions of instances of a single template, I don't want to put every full instance into the index.
What (I think) I would like to do is index the template and the dynamic content separately, and then merge the search results afterwards. That doesn't seem too difficult, except when a query spans both the template and the dynamic content; things like proximity queries would also be hard to support. In theory, it seems plausible to split the terms of the original query, search both indexes in parallel, rebuild each candidate document, build an in-memory index over the merged document, and run the original query against that (roughly as sketched below). Most of the searches will be "online", so a Hadoop job probably isn't appropriate. I'm still compiling better statistics on how much the static-to-dynamic ratio varies, but in the meantime I was curious whether anyone else has dealt with a similar scenario or could point me at additional research. Thanks.
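For the per-candidate re-check step, this is roughly what I'm picturing, assuming Lucene (its MemoryIndex class in particular); the rebuildDocument helper and the field name are just placeholders for however the template slots actually get stitched back together:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class MergedDocRecheck {

        // Re-run the original query against a single merged document.
        // templateText comes from the template index, dynamicText from the
        // per-instance index.
        static boolean matchesOriginalQuery(String templateText,
                                            String dynamicText,
                                            String originalQuery) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();

            // Rebuild the full document text so phrase/proximity queries
            // can see template and dynamic terms next to each other.
            String merged = rebuildDocument(templateText, dynamicText);

            // MemoryIndex holds exactly one document entirely in RAM,
            // which fits an "online" per-candidate re-check.
            MemoryIndex index = new MemoryIndex();
            index.addField("content", merged, analyzer);

            Query query = new QueryParser("content", analyzer).parse(originalQuery);

            // search() returns a relevance score; > 0 means the merged
            // document satisfies the original query.
            return index.search(query) > 0.0f;
        }

        // Placeholder: in reality this would splice the dynamic values back
        // into the template's slots at their original positions.
        static String rebuildDocument(String templateText, String dynamicText) {
            return templateText + " " + dynamicText;
        }
    }

The idea would be that the coarse searches against the two indexes only generate candidates, and this per-document check then enforces the exact phrase/proximity semantics of the original query.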