Thanks for the explanation, Adrien. I do have a couple of follow-up questions. Isn't the block size used for file caching OS-dependent? And if 4K happens to be the most commonly used size, wouldn't it make more sense for the default stored fields format to have a chunk size equal to or smaller than that? It's a bit of a guess on my part, but I did get better write and search performance with a chunk size <= 2K, as opposed to the default 16K.
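To make the arithmetic behind that guess concrete, here is a rough sketch of how many filesystem cache pages a single chunk read could touch. It is only a model under stated assumptions: the 4096-byte page size is assumed (it is OS-dependent), the class and method names are hypothetical, and it treats the chunk as read in full at its uncompressed size, ignoring compression and any early exit during decompression.

```java
/** Hypothetical helper: estimates cache pages touched by one chunk read. */
final class ChunkPageEstimate {

    /** Number of fixed-size pages overlapped by the byte range [offset, offset + length). */
    static long pagesTouched(long offset, long length, long pageSize) {
        if (length <= 0) {
            return 0;
        }
        long firstPage = offset / pageSize;
        long lastPage = (offset + length - 1) / pageSize;
        return lastPage - firstPage + 1;
    }

    public static void main(String[] args) {
        long pageSize = 4096; // assumption: typical page size, not guaranteed on every OS
        long[] chunkSizes = {2048, 4096, 16384};
        for (long chunk : chunkSizes) {
            // Best case: the chunk starts exactly on a page boundary.
            long aligned = pagesTouched(0, chunk, pageSize);
            // Worst case: the chunk starts on the last byte of a page.
            long worst = pagesTouched(pageSize - 1, chunk, pageSize);
            System.out.println("chunk=" + chunk + " aligned=" + aligned + " worst=" + worst);
        }
    }
}
```

Under these assumptions a 2K or 4K chunk touches 1 page when aligned and 2 in the worst case, while a 16K chunk touches 4 and 5. That is consistent with smaller chunks pulling less unrelated data into the cache per document fetch, though it says nothing about the compression ratio lost by using smaller chunks.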
On Fri, Apr 4, 2014 at 3:50 PM, Adrien Grand <jpou...@gmail.com> wrote:

> Hi Vitaly,
>
> Doc values are indeed well-suited for grouping and sorting. However
> stored fields remain better at returning field values to users since
> they guarantee a worst-case of one disk seek per document.
>
> The filesystem cache typically caches data by blocks of 4KB. This
> plays more nicely with doc values: given that they are stored in a
> column-stride fashion, you load only those field values into the
> filesystem cache. On the other hand with stored fields, data is stored
> sequentially in a very large file, so whenever you read a single field
> value, the filesystem cache would load a 4KB block of data into the
> filesystem cache that likely contains other fields' values that you
> are not interested in.
>
>
>
> On Sat, Apr 5, 2014 at 12:23 AM, Vitaly Funstein <vfunst...@gmail.com>
> wrote:
> > I use stored fields to load values for the following use cases:
> > - to return per-document values as is, requested by the user - similar to
> > listing DB columns you are interested in, in a "select ..." clause.
> > - to perform aggregate function calculations while forming the result set
> > (if requested).
> > - for group-by type queries (would like to switch to the native grouping
> > API, but don't think it supports grouping on multiple fields, or aggregate
> > functions).
> > - and finally, as I mentioned - to sort search results, also when requested.
> >
> > Evidently, even for simple queries that don't require any of the
> > post-processing above but ask for a set of values from each document,
> > there's still a non-trivial amount of disk activity... hence, I started
> > second-guessing the implementation.
> >
> >
> > On Fri, Apr 4, 2014 at 3:00 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> >> Hi,
> >>
> >> What are you doing with the stored fields? They are not deprecated and
> >> also not really slow, unless you scan over millions of documents in
> >> random access order. To display search results, DocValues are of no use.
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >>
> >> > -----Original Message-----
> >> > From: Vitaly Funstein [mailto:vfunst...@gmail.com]
> >> > Sent: Friday, April 04, 2014 9:44 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Stored fields and OS file caching
> >> >
> >> > I have heard here that stored fields don't work well with OS file
> >> > caching. Could someone elaborate on why that is? I am using Lucene 4.6
> >> > and we do use stored fields but not doc values; it appears most of the
> >> > benefit from the latter comes as improvement in sorting performance,
> >> > and I don't actually use Lucene for sorting at all; rather, it's done
> >> > on a post-processing basis, based on stored field values (in a
> >> > nutshell, the reason for this is Lucene's inability to tell apart
> >> > terms that are empty strings vs. a missing value, resulting in
> >> > unstable sort order on such fields).
> >> >
> >> > I am not sure if switching to using doc values fields from stored
> >> > fields entirely would help leverage OS file cache better... what
> >> > worries me is that when processing queries requesting multiple values
> >> > from the document, doc value fields could cause multiple disk seeks to
> >> > fetch values for each field, as opposed to just one with stored fields.
> >> >
> >> > Am I way off in my understanding of how this works? Any guidelines, as
> >> > general as they may be, are appreciated.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >
>
>
> --
> Adrien