Searching documents that contain a field (text of field is irrelevant)

2009-06-01 Thread mattspitz
Hey! Consider a bunch of documents that represent, say, students. These students have the following attributes: 1) Student IDs 2) Name 3) Self-description (optional) So, all documents have id: and name:, but only some of the documents have an added desc: field. Assuming all of the fields are indexed,
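For the question in the subject — matching only documents that actually have a desc: field, regardless of its text — one standard trick in classic Lucene query syntax is an open-ended range query on that field (the desc field name is taken from the post):

```
desc:[* TO *]
```

This matches every document with at least one indexed term in desc, so documents lacking the field are excluded. Note that it only works for indexed fields, and on very large indexes a dedicated marker field (e.g. a hypothetical has_desc:true) can be cheaper to query.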

Spellchecker Evaluation Criteria

2008-10-14 Thread mattspitz
So, it appears to me that the criteria for a "good suggestion" is the n-gram overlap of a given term, not the edit distance. Thus, if we're looking for "britney", but we mess up and type "birtney", "kortney" will come up before "birtney." Is there a way to force the SpellChecker to use the edit
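To make the distinction concrete: under plain Levenshtein edit distance (sketched here as a standalone method, independent of Lucene's SpellChecker internals the post is asking about), "birtney" really is closer to "britney" than "kortney" is, which is the ordering the post wants:

```java
public class EditDistanceDemo {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + cost);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "birtney" is one transposition away, which costs two single-char edits.
        System.out.println(levenshtein("britney", "birtney")); // 2
        // "kortney" needs three substitutions.
        System.out.println(levenshtein("britney", "kortney")); // 3
    }
}
```

Bigram overlap, by contrast, cannot separate the two candidates here: both "birtney" and "kortney" share exactly the same three bigrams (tn, ne, ey) with "britney", which is consistent with the surprising ranking the post describes.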

ThreadSafe SpellChecker?

2008-10-14 Thread mattspitz
I was wondering if the Lucene SpellChecker class is threadsafe, specifically indexDictionary(), such that the following would work: for (int i = 0; i < numReaders; i++) { // spawn new thread to run: spellchecker.indexDictionary(new LuceneDictionary(readers[i], myField)); } Thanks, Matt
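Absent a documented answer on thread safety, a defensive pattern is to keep the thread fan-out but serialize the suspect call behind a single lock. The sketch below uses a hypothetical FakeSpellChecker stand-in (not Lucene's class) purely to show the locking shape:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerializedIndexingDemo {
    // Hypothetical stand-in for SpellChecker: we pretend indexDictionary()
    // must never be entered by two threads at once.
    static class FakeSpellChecker {
        int dictionariesIndexed = 0;        // deliberately unsynchronized state
        void indexDictionary() { dictionariesIndexed++; }
    }

    static int run(int numReaders) {
        final FakeSpellChecker spellchecker = new FakeSpellChecker();
        final Object lock = new Object();
        ExecutorService pool = Executors.newFixedThreadPool(numReaders);
        for (int i = 0; i < numReaders; i++) {
            pool.submit(() -> {
                synchronized (lock) {       // serialize the non-thread-safe call
                    spellchecker.indexDictionary();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (lock) {               // lock again for a safe read
            return spellchecker.dictionariesIndexed;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(8));         // all 8 calls complete, one at a time
    }
}
```

The lock removes the parallelism inside the call itself, of course; if indexDictionary() does turn out to be thread-safe, the synchronized block can simply be dropped.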

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
Mmmkay. I think I'll wait, then. Thank you so much for your help. I really appreciate it. Also, I really dig Lucene, so thanks for your hard work! -Matt Michael McCandless-2 wrote: > mattspitz wrote: >> Is there no way to ensure consistency

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
I'm using an "unfinished" version of Lucene. Is there a rough date for 2.4's release? I poked around the website and couldn't find one. Thanks, Matt Michael McCandless-2 wrote: > mattspitz wrote: >> Are the index files synced on writer.close()?

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
merging? I don't really have a sense for which of the segments are kept in memory during a merge. It doesn't make sense to me that Lucene would pull all of the segments into memory to merge them, but I don't really know how. Thank you so much, Matt Michael McCandless-2 wrote:

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
s what your maxBufferedSize setting is. If it's too low you will see lots of IO. Increasing it means less IO, but more JVM heap need. Is your disk IO caused by searches or indexing only? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Appropriate disk optimization for large index?

2008-08-16 Thread mattspitz
Hi! I'm using Lucene 2.3.2 to store a relatively large index of HTML documents. I'm storing ~150 million documents, taking up 150 GB of space. I index the HTML text, but I only store primary key information that allows me to retrieve it later. Thus, my document size is small, but obviously, I
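As a quick sanity check on the numbers in the post, 150 GB over ~150 million documents works out to roughly 1 KB of index per document:

```java
public class IndexSizeMath {
    public static void main(String[] args) {
        long docs = 150_000_000L;                  // ~150 million documents
        long bytes = 150L * 1024 * 1024 * 1024;    // 150 GB, binary units assumed
        System.out.println(bytes / docs);          // bytes of index per document
    }
}
```

About a kilobyte per document is plausible for index-only entries (postings plus a stored primary key, no stored body), which matches the post's description of small documents.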