Disclaimer: I realize you wouldn't want to do this for anything other than a 
toy collection.

Perk: however, this overall discussion might also be useful to people wanting to 
use other codecs by default, for example the faster BlockPostingsFormat.

Old info online: Instructions for enabling SimpleText were written back 
in its early days.  But in more recent versions of Solr these instructions are 
largely obsolete; you DON'T need to do most of that.  You can just add 
postingsFormat="SimpleText" to a <fieldType> tag and get the new behavior.  I 
believe it's similar for using the BlockPostingsFormat.
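
For concreteness, here is roughly the change I mean.  This is only a sketch: 
the analyzer chain is abbreviated for illustration, and I'm assuming that the 
<codecFactory> line in solrconfig.xml is what makes Solr honor the per-field 
setting from the schema.

```xml
<!-- schema.xml: per-field-type postings format (analyzer abbreviated for illustration) -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" postingsFormat="SimpleText">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- solrconfig.xml: assumed to be needed so the schema's postingsFormat is honored -->
<codecFactory class="solr.SchemaCodecFactory"/>
```

As far as I can tell, this only changes the postings for fields of that type 
and needs a full reindex to take effect.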

But when you do this (add it to text_general for example), although your text 
fields reside in the new format, the other files in the index directory are 
still binary. By the time your debugging gets to your text field values, some 
"magic" has already happened via the other files (the system already knows 
about offsets into the file, for example).

Question: Can SimpleText even be used for the other binary files in an index?  
Or is it somehow specific in scope to field tokens?

Question: If it can be used for all the other files, what's the setting for 
that?  I had seen a switch -Dtests.codec=SimpleText in the old instructions, 
but clearly that's for unit tests, and I wasn't sure of its scope or 
applicability.

Question: Has anybody tried using BlockPostingsFormat as a default codec?  (for 
all files)  Did it work?  Was it faster than just applying it to your text fields?

Other questions...

Or maybe there's some other aspect to all of this that I'm missing, some other 
question I should really be asking?  The old posts online seem to assume fairly 
deep understanding of Lucene & Solr's overall codec framework, which was 
appropriate at that time.  But now it's included by default, so it's sort of 
"mainstream", and although I generally understand codecs, there are still 
aspects of it in Solr that I'm a bit hazy on; wondering if others have the same 
feeling?

Examples of things I'm a bit hazy on:

Are there rules about which codecs can be used where?

Can you mix and match codecs?  Can you chain them?

I also saw the FilterCodec javadoc.  Would I only use that if I want to reuse 
most of an existing codec, but alter just one part of it?  I'm a bit fuzzy on 
combining that with other codecs.  If there's a java command line -D switch that 
tells the system to use a different (but already existing) codec, then I don't 
think I'd need this at all?

--
Mark Bennett / LucidWorks: Search & Big Data / [email protected]
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513






