Re: ways to minimize index size?

Steve Liles Wed, 20 Jun 2007 19:00:07 -0700

Compression aside you could index the "contents" as terms in separatefields instead of tokenized text, and disable storing of norms:


String outgoingNumber="9198408365809";
String incomingNumber="9840861114";

_doc.add(new Field("outgoingNumber", outgoingNumber, Store.NO,Index.NO_NORMS));_doc.add(new Field("incomingNumber", incomingNumber, Store.NO,Index.NO_NORMS));

According to the docs "Index.NO_NORMS" will save you one byte perdocument in the index.

Or you could index all of the data as separate terms in the same"contents" field if you wanted (make the first param "contents" for allof the terms), which is more comparable to what you are currently doing.(Another advantage is that the Analyzer will not be used for fieldswhich are untokenized, and indexing should be faster.)

...

One way to compress numerical data (possibly not the best - i'm noexpert) is to change the base of the number that is indexed / stored inthe index.

java.lang.Long and java.math.BigInteger have methods for converting fromone radix to another. Taking your "outgoingNumber" as an example:


//compression
BigInteger  _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

> 39douufap

//decompression
BigInteger _bi = new java.math.BigInteger("39douufap", 36);
System.out.println(_bi.toString(10));

>9198408365809

Converting to a higher radix will give you better compression but you'll
have to do it yourself as the jdk classes only work up to base 36
<http://en.wikipedia.org/wiki/Base_36>.

It's worth compressing your unstored "contents" field as well as yourstored "records" field, as the unique terms in the "contents" field willeffectively be stored.

Also don't forget to convert the terms when you search too, otherwiseyou won't find anything ;)


Steve.


Sebastin wrote:

When i use the standardAnalyzer storage size increases.how can i minimize
index store

Sebastin wrote:

String outgoingNumber="9198408365809";

String incomingNumber="9840861114";
String datesc="070601";
String imsiNumber="444021365987";
String callType="1";

//Search Fields
 String contents=(outgoingNumber+" "+incomingNumber+" "+dateSc+"
"+imsiNumber+" "+callType );

//Display Fields

String records=(callingPartyNumber+"

"+calledPartyNumber+" "+dateSc+" "+chargDur+" "+incomingRoute+"
"+outgoingRoute+" "+timeSc);

IndexWriter indexWriter = newIndexWriter(indexDir,new StandardAnalyzer(),true);Document document = new Document();document.add(new

Field("contents",contents,Field.Store.NO,Field.Index.TOKENIZED));

document.add(new

Field("records",records,Field.Store.YES,Field.Index.NO));

indexWriter.setUseCompoundFile(true);

                             indexWriter.addDocument(document);
                          }

please help me to acheive the minimum size





Erick Erickson wrote:

Show us the code you use to index. Are you storing the fields?
omitting norms? Throwing out stop words?

Best
Erick

On 6/19/07, Sebastin <[EMAIL PROTECTED]> wrote:

Hi Does anyone give me an idea to reduce the Index size to down.now i am
getting 42% compression in my index store.i want to reduce upto 70%.i
use
standardanalyzer to write the document.when i use SimpleAnalyzer it
reduce
upto 58% but i couldnt search the document.please help me to acheive.

Thanks in advance

Jeff-188 wrote:

I found that reducing my index from 8G to 4G (through not stemming)

gave
me

about a 10% performance improvement.

How did you do this? I don't see this as an option.

Jeff

--
View this message in context:
http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11195406
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: ways to minimize index size?

Reply via email to