Thanks for your input Rob…

On Thu, Apr 2, 2015 at 3:21 PM, Robert Muir <[email protected]> wrote:
> Vectors are totally per-document. Its hard to do anything smarter with
> them. Basically by this i mean, IMO vectors aren't going to get better
> until the semantics around them improves. From the original
> fileformats, i get the impression they were modelled after stored
> fields a lot, and I think thats why they will be as slow as stored
> fields until things are fixed.

They are fundamentally per-document, yes, like stored fields. But I don’t see how that constraint prevents the term vector format from returning a lightweight “Fields” instance that loads per-field data on demand when asked for. I understand most of your ideas for a better term vector format below, to varying degrees, but again I don’t see them as blocking factors for storing each field’s term data together so it can be accessed lazily (don’t fetch fields you don’t need); there’s a rough sketch of what I mean at the bottom of this mail, below the quoted text. Maybe you didn’t mean to imply they are blockers? Although I think you did, by saying “vectors aren't going to get better until the semantics around them improves”.

p.s. my term-vector feature wish-list includes an FST-based term dictionary, to help the Terms instance support more features like automaton intersection and easy O(log N) lookup (also sketched at the bottom).

~ David

> * removing the embedded per-document schema of vectors. I can't
> imagine a use case for this. I think in general you either have
> vectors for docs in a given field X or you do not.
> * removing the ability to store broken offsets (going backward, etc)
> into vectors.
> * removing the ability to store offsets without positions. Why?
>
> As far as the current impl, its fallen behind the stored fields, which
> got a lot of improvements for 5.0. We at least gave it a little love,
> it has a super-fast bulk merge when no deletions are present
> (dirtyChunks, totalChunks, etc). But in all other cases its very
> expensive.
>
> Compression block sizes, etc should be tuned. It should implement
> getMergeInstance() and keep state to avoid shittons of decompressions
> on merge. Maybe a high compression option should be looked at, though
> getMergeInstance() should be a prerequisite for that or it will be too
> slow. When the format matches between readers (typically the case,
> except when upgrading from older versions etc), it should avoid
> deserialization overhead if that is costly (still the case for stored
> fields).
>
> Fixing some of the big problems (lots of metadata/complexity needed
> for embedded schema info, negative numbers when there should not be)
> with vectors would also enable better compression, maybe even
> underneath LZ4, like stored fields got in 5.0 too.
>
> On Thu, Apr 2, 2015 at 2:51 PM, [email protected]
> <[email protected]> wrote:
> > I was looking at a JIRA issue someone posted pertaining to optimizing
> > highlighting for when there are term vectors (SOLR-5855). I dug into the
> > details a bit and learned something unexpected:
> > CompressingTermVectorsReader.get(docId) fully loads all term vectors for
> > the document. The client/user consuming code in question might just want
> > the term vectors for a subset of all fields that have term vectors. Was
> > this overlooked or are there benefits to the current approach? I can't
> > think of any except that perhaps there's better compression over all the
> > data versus in smaller per-field chunks; although I'd trade that any day
> > over being able to just get a subset of fields. I could imagine it being
> > useful to ask for some fields or all, in much the same way we handle
> > stored field data.
> >
> > ~ David Smiley
> > Freelance Apache Lucene/Solr Search Consultant/Developer
> > http://www.linkedin.com/in/davidwsmiley
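(Sketches promised above.)

To make the lazy "Fields" idea a bit more concrete, here is roughly the shape of wrapper I have in mind. This is only a sketch, not a patch against CompressingTermVectorsReader; Fields and Terms are the real Lucene APIs, but LazyTermVectorFields and its TermVectorSource.loadTerms hook are made-up names for illustration.

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;

/** Sketch only: a Fields impl for one document's term vectors that defers
 *  per-field decoding until terms(field) is actually called. */
public class LazyTermVectorFields extends Fields {

  /** Hypothetical hook into the codec reader; not an existing interface. */
  public interface TermVectorSource {
    Terms loadTerms(String field) throws IOException;
  }

  private final List<String> fieldNames;             // known cheaply from the vector index
  private final TermVectorSource source;
  private final TreeMap<String, Terms> loaded = new TreeMap<>();

  public LazyTermVectorFields(List<String> fieldNames, TermVectorSource source) {
    this.fieldNames = fieldNames;
    this.source = source;
  }

  @Override
  public Iterator<String> iterator() {
    return fieldNames.iterator();                     // listing fields needs no decompression
  }

  @Override
  public Terms terms(String field) throws IOException {
    Terms terms = loaded.get(field);
    if (terms == null && fieldNames.contains(field)) {
      terms = source.loadTerms(field);                // decode only this field's data
      loaded.put(field, terms);
    }
    return terms;                                     // null if the field has no vectors
  }

  @Override
  public int size() {
    return fieldNames.size();
  }
}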

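And for the FST wish-list item, roughly what I mean, using the existing org.apache.lucene.util.fst APIs as of the 5.x line (the sample terms are placeholders; inputs must be added in BytesRef sort order):

import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstTermDictSketch {
  public static void main(String[] args) throws IOException {
    // Map each term to its ordinal; an FST gives cheap exact lookup and
    // automaton intersection would walk the same arcs.
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();

    String[] sortedTerms = {"apache", "lucene", "solr"};   // placeholder data, pre-sorted
    for (int ord = 0; ord < sortedTerms.length; ord++) {
      // store ord+1 so no output is ever 0 (PositiveIntOutputs' NO_OUTPUT)
      builder.add(Util.toIntsRef(new BytesRef(sortedTerms[ord]), scratch), ord + 1L);
    }
    FST<Long> fst = builder.finish();

    Long ordPlusOne = Util.get(fst, new BytesRef("lucene"));  // 2 here, or null if absent
    System.out.println("ord(lucene) = " + (ordPlusOne - 1));
  }
}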