Hello,

We need to search our data both in a faceted and in a 'normal' way. For this we 
currently use Lucene. Our data is sharded into equal-sized 'blocks', and our 
Lucene indexes follow these shards. Each block is between 256MB and 4GB, so 
every index covers 256MB to 4GB of data. If the data grows, it is split (and 
the index along with it).


Currently this setup works: when new blocks of data arrive, our Lucene indexes 
split and fill up again until the maximum size is reached.


But now we have an extra requirement: we need to search in a versioned way (our 
data is versioned). For us, a version is a change in the total dataset, with a 
transaction precision of microseconds.


We see a few possibilities:

1. Use Lucene commit points and keep the data forever with NoDeletionPolicy. We 
do not think we can scale this to millions of different commit points: if I 
read the documentation correctly, each commit adds an extra file, and that will 
not really scale. (A rough sketch of this approach follows the list.)


2. Save an extra version of each document on every update, using extra fields 
in the index and in the facet index: 'from' and 'till' (long) fields for normal 
search, plus extra facets for the 'from' and 'till' timestamps (each date split 
into 5 facets with 300-1000 unique values each) to speed up the faceted search. 
(See the second sketch below.)
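
For option 1, this is roughly what we have in mind -- a minimal sketch, 
assuming a recent Lucene version (the index path is of course a placeholder):

    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CommitPointSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/path/to/index"));

            // Keep every commit point forever instead of deleting old ones.
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                // ... index documents; one writer.commit() per dataset version ...
                writer.commit();
            }

            // Later: list all surviving commits and open a reader on the one
            // that corresponds to the wanted snapshot.
            List<IndexCommit> commits = DirectoryReader.listCommits(dir);
            IndexCommit snapshot = commits.get(commits.size() - 1); // pick by version
            try (DirectoryReader reader = DirectoryReader.open(snapshot)) {
                // ... run normal and faceted searches against 'reader' ...
            }
        }
    }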
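
For option 2, a sketch of the versioned filtering we have in mind. The 
'from'/'till' field names are just our convention, and LongPoint assumes 
Lucene 6+ (older versions would use NumericRangeQuery instead):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class VersionedQuerySketch {

        // Each document version carries a [from, till) validity interval in
        // microseconds; 'till' is Long.MAX_VALUE while the version is current.
        static void addValidity(Document doc, long fromMicros, long tillMicros) {
            doc.add(new LongPoint("from", fromMicros));
            doc.add(new LongPoint("till", tillMicros));
        }

        // Wrap any user query so it only matches the versions that were
        // valid at snapshot time t: from <= t AND till > t.
        static Query asOf(Query userQuery, long tMicros) {
            return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(LongPoint.newRangeQuery("from", Long.MIN_VALUE, tMicros),
                     BooleanClause.Occur.FILTER)
                // till > t, i.e. the inclusive range [t+1, MAX]
                .add(LongPoint.newRangeQuery("till", tMicros + 1, Long.MAX_VALUE),
                     BooleanClause.Occur.FILTER)
                .build();
        }
    }

On each update the old version would be re-indexed with its 'till' closed and 
the new version added with an open 'till', which is why the index ends up 
holding many near-identical documents (hence question 3 below).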


I hope there is some other possible solution that I am not aware of.


Our requirements:

- Search through 4GB of data on documents with 5-100 fields and 3-15 facets each.

- Have a response time < 100ms.

- Be able to do 20 queries per second.

- Be able to search through each 'snapshot', where a snapshot is defined as a 
change in the total dataset. Snapshots have a time precision of 1 microsecond.


The questions I have:

1. What would be the most appropriate way to implement such a search: option 1, 
option 2, or another solution?

2. Is option 1 a solution worth looking into at all?

3. With option 2, will Lucene be efficient with documents that look largely the 
same (different versions of the same document)?

4. Will option 2 meet our requirements?


Kind regards,



Mark Bakker
