Re: Stored fields and OS file caching

2014-04-04 Thread Vitaly Funstein
Thanks for the explanation, Adrien. I do have a couple of follow-up questions. Isn't this block size used for file caching OS-dependent? And if 4K happens to be the most commonly used size, wouldn't it make more sense for the default stored fields format to have a chunk size equal to or smaller tha

Re: Stored fields and OS file caching

2014-04-04 Thread Adrien Grand
Hi Vitaly, Doc values are indeed well-suited for grouping and sorting. However, stored fields remain better at returning field values to users since they guarantee a worst case of one disk seek per document. The filesystem cache typically caches data by blocks of 4KB. This plays more nicely with d
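
To make the 4KB block point concrete, here is a minimal sketch of the arithmetic involved: how many cache blocks a contiguous byte range overlaps. It is plain Java and does not use any Lucene API; the class name `PageMath`, the method `pagesTouched`, and the 16 KB chunk in the usage note are hypothetical values chosen for illustration, not figures taken from this thread.

```java
// Illustrative only: counts how many fixed-size cache blocks a byte range overlaps.
// PageMath and pagesTouched are hypothetical names, not part of Lucene.
final class PageMath {

    // Block size mentioned in the thread as typical for the filesystem cache.
    static final int PAGE_SIZE = 4096;

    // Returns the number of pageSize-byte blocks overlapped by the
    // range [offset, offset + length). An empty range overlaps no blocks.
    static long pagesTouched(long offset, long length, int pageSize) {
        if (offset < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and pageSize > 0");
        }
        if (length <= 0) {
            return 0;
        }
        long firstPage = offset / pageSize;
        long lastPage = (offset + length - 1) / pageSize;
        return lastPage - firstPage + 1;
    }
}
```

Under these assumptions, `PageMath.pagesTouched(0, 16384, PageMath.PAGE_SIZE)` gives 4 for an aligned 16 KB range, while the same range starting at offset 100 gives 5, because an unaligned range spills into one extra block.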

Re: Stored fields and OS file caching

2014-04-04 Thread Vitaly Funstein
I use stored fields to load values for the following use cases: - to return per-document values as is, requested by the user - similar to listing DB columns you are interested in, in a "select ..." clause. - to perform aggregate function calculations while forming the result set (if requested). - f

RE: Stored fields and OS file caching

2014-04-04 Thread Uwe Schindler
Hi, What are you doing with the stored fields? They are not deprecated and also not really slow, unless you scan over millions of documents in random access order. To display search results, DocValues are of no use. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetap

RE: Avoid memory issues when indexing terms with multiplicity

2014-04-04 Thread Uwe Schindler
Hi, > The use-case is that some of the fields in the document are made up of term:frequency pairs. What I am doing right now is to expand these with a TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat cat", and index that. However, the problem is that when these field
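
The expansion quoted above can be sketched in plain Java, outside of any analysis chain. This is not the poster's TokenFilter and uses no Lucene API; the class name `TermFrequencyExpander` and its `expand` method are hypothetical, and the input format (whitespace-separated `term:frequency` pairs) is assumed from the "dog:3 cat:2" example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: expands "term:frequency" pairs into repeated terms.
// TermFrequencyExpander is a hypothetical name, not a Lucene class.
final class TermFrequencyExpander {

    // Turns e.g. "dog:3 cat:2" into "dog dog dog cat cat".
    static String expand(String field) {
        List<String> out = new ArrayList<>();
        for (String pair : field.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // blank input yields an empty result
            }
            // Split on the last ':' so that terms containing ':' still parse.
            int sep = pair.lastIndexOf(':');
            if (sep <= 0) {
                throw new IllegalArgumentException("expected term:frequency, got: " + pair);
            }
            String term = pair.substring(0, sep);
            int freq = Integer.parseInt(pair.substring(sep + 1));
            // Emit the term once per occurrence; output size grows with the sum of frequencies.
            for (int i = 0; i < freq; i++) {
                out.add(term);
            }
        }
        return String.join(" ", out);
    }
}
```

Because every occurrence is materialized, the number of emitted tokens equals the sum of all frequencies in the field, so a field with large counts produces a correspondingly large token stream.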

Re: Avoid memory issues when indexing terms with multiplicity

2014-04-04 Thread Gregory Dearing
Hi David, I'm not an expert, but I've climbed through the consumers myself in the past. The big limit is that the full postings for a document or document block must fit into memory. There may be other hidden processing limits (i.e., memory used per field). I think it would be possible to create

Stored fields and OS file caching

2014-04-04 Thread Vitaly Funstein
I have heard here that stored fields don't work well with OS file caching. Could someone elaborate on why that is? I am using Lucene 4.6 and we do use stored fields but not doc values; it appears most of the benefit from the latter comes as improvement in sorting performance, and I don't actually u

Avoid memory issues when indexing terms with multiplicity

2014-04-04 Thread Dávid Nemeskey
Hi guys, I have just recently (re-)joined the list. I have an issue with indexing; I hope someone can help me with it. The use-case is that some of the fields in the document are made up of term:frequency pairs. What I am doing right now is to expand these with a TokenFilter, so that for e.g. "do