how to reuse a tokenStream?

2010-05-27 Thread Li Li
I want to analyze a text twice so that I can get some statistic information from this text. TokenStream tokenStream = null; Analyzer wa = new WhitespaceAnalyzer(); try { tokenStream = wa.reusableTokenStream(fieldNam
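The reuse pattern the question is reaching for can be sketched as below, against the Lucene 3.0-era API. This is a minimal illustration, not the poster's actual code; the field name and sample text are made up, and the second call to `reusableTokenStream` on the same analyzer (and thread) hands back the same underlying stream, reset onto the new Reader.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ReuseDemo {
    public static void main(String[] args) throws IOException {
        Analyzer wa = new WhitespaceAnalyzer();
        String field = "body"; // hypothetical field name
        String text = "some text to analyze twice";

        // First pass: consume the stream to gather statistics.
        TokenStream ts = wa.reusableTokenStream(field, new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        int count = 0;
        while (ts.incrementToken()) {
            count++;
        }
        ts.end();

        // Second pass: asking the same analyzer again reuses the same
        // underlying stream, reset onto the fresh Reader.
        ts = wa.reusableTokenStream(field, new StringReader(text));
        while (ts.incrementToken()) {
            // consume tokens again, e.g. while feeding an IndexWriter
        }
        ts.end();
        System.out.println("tokens: " + count);
    }
}
```

Note the stream must be fully consumed (or `end()` called) before it is requested again; a reusable stream is per-analyzer, per-thread state.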

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread kannan chandrasekaran
I am just trying out a few experiments to calculate similarity between terms based on their co-occurrences in the dataset... Basically I am trying to build contextual vectors and calculate similarity using a similarity measure (say cosine similarity). I don't think this is an XY problem.
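The similarity measure mentioned here is plain vector math, independent of Lucene. A stdlib-only sketch (the co-occurrence counts are invented for illustration):

```java
// Cosine similarity between two co-occurrence vectors: dot product divided
// by the product of the Euclidean norms.
public class Cosine {
    public static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // no co-occurrences at all
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] term1 = {3, 0, 1}; // hypothetical co-occurrence counts
        double[] term2 = {1, 0, 1};
        System.out.println(Cosine.similarity(term1, term2));
    }
}
```

The result is 1.0 for parallel vectors and 0.0 for terms that never co-occur in the same contexts, which is why the index-wide vocabulary size matters for sizing these vectors.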

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread Erick Erickson
OK, let's back up a level. WHY are you building these vectors? Where I'm going with this is I wonder if this is an XY problem, see: http://people.apache.org/~hossman/#xyproblem Best Erick On Thu, May 27, 2010 at 7:49 PM, kannan chandrasekaran wrote: > Uwe, > > I now see the problem with overlapp

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread kannan chandrasekaran
Uwe, I now see the problem with overlapping terms across segments... Thanks... Erick, Good point... My use case for this is: I am trying to build vectors for individual terms and documents, and I need to know the size to handle memory constraints. Thanks Kannan

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread Erick Erickson
I suspect it's not supported because it hasn't been seen as valuable enough to put the effort into. You simply asked if it was supported without any use-case, and I'm having a hard time coming up with one on my own. If it's important to your particular situation, you could have a special docum

Re: Customer TokenFilter

2010-05-27 Thread tsuraan
> Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the > new Term Attribute will implement CharSequence; then it's even simpler. You > may also look at 3.1's ICU contrib that has support even for Normalizer2. OK, I've only been looking at 3.0.1 so far; I'll check out the 3.

RE: How to get the number of unique terms in the inverted index

2010-05-27 Thread Uwe Schindler
It's not efficient, because it cannot be made efficient, due to the overlapping terms across segments (as noted before). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread kannan chandrasekaran
Hi Yonik, Thanks for the quick response. I am curious as to why this is not supported, whereas numDocs() is supported? Even in the upcoming version it's only supported per segment and not across the index; why? Is it difficult to implement efficiently? Pardon my ignorance if I am missing

RE: Customer TokenFilter

2010-05-27 Thread Uwe Schindler
Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the new Term Attribute will implement CharSequence; then it's even simpler. You may also look at 3.1's ICU contrib that has support even for Normalizer2. Overriding StandardAnalyzer is the wrong way, as in 3.1 it's final (its
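The CharBuffer trick works because `java.text.Normalizer` accepts any `CharSequence`, and `CharBuffer.wrap(...)` gives a CharSequence view over a `char[]` term buffer without copying it into a String first. A stdlib-only sketch of that step (no Lucene classes; the buffer contents are invented):

```java
import java.nio.CharBuffer;
import java.text.Normalizer;

public class NormalizeDemo {
    // Normalize the first `length` chars of a term buffer in a TokenFilter-like
    // setting: CharBuffer.wrap is a zero-copy CharSequence view over the array.
    public static String normalize(char[] termBuffer, int length) {
        CharSequence view = CharBuffer.wrap(termBuffer, 0, length);
        return Normalizer.normalize(view, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" + combining acute accent, followed by junk past the term length
        char[] buf = {'e', '\u0301', 'x', 'x'};
        String composed = normalize(buf, 2);
        System.out.println((int) composed.charAt(0)); // the composed U+00E9
    }
}
```

Inside a real TokenFilter the result would be copied back into the term attribute's buffer; this sketch only shows the wrap-and-normalize step the thread is praising.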

Re: Customer TokenFilter

2010-05-27 Thread tsuraan
> I'd like to have all my queries and terms run through Unicode > Normalization prior to being executed/indexed. I've been using the > StandardAnalyzer with pretty good luck for the past few years, so I > think I'd like to write an analyzer that wraps that, and tacks a > custom TokenFilter onto th

RE: How to get the number of unique terms in the inverted index

2010-05-27 Thread Uwe Schindler
Also in 2.9.2 and 3.0.1: http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/index/IndexReader.html#getUniqueTermCount() http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/index/IndexReader.html#getUniqueTermCount() Please note, this works only with SegmentReaders, so you ha
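Putting the thread's pieces together, a hedged sketch of what is actually obtainable with that API: sum `getUniqueTermCount()` over the segment readers. Because a term occurring in several segments is counted once per segment, the sum is only an upper bound on the index-wide unique term count, which is the overlap problem discussed above.

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

public class UniqueTermCountDemo {
    // Upper bound on the index-wide unique term count: the per-segment
    // counts over-count any term that appears in more than one segment.
    public static long upperBound(IndexReader topReader) throws IOException {
        IndexReader[] subReaders = topReader.getSequentialSubReaders();
        if (subReaders == null) {
            // The reader is itself atomic (a SegmentReader), so the
            // count is exact for this reader.
            return topReader.getUniqueTermCount();
        }
        long sum = 0;
        for (IndexReader sub : subReaders) {
            sum += sub.getUniqueTermCount();
        }
        return sum;
    }
}
```

An exact index-wide count would require merging the per-segment term enumerations, which is the expensive iteration the original question hoped to avoid. An optimized (single-segment) index makes the bound exact.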

Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread Yonik Seeley
On Thu, May 27, 2010 at 2:32 PM, kannan chandrasekaran wrote: > I was wondering if there is a way to retrieve the number of unique terms in > Lucene (version 2.4.0)... I am aware of the terms() && terms(Term) > methods that return an enumeration (TermEnum), but that involves iterating > t

Re: Core dumped

2010-05-27 Thread Saurabh Agarwal
I will check it out!! Saurabh Agarwal On Thu, May 27, 2010 at 11:13 PM, Erick Erickson wrote: > The larger your RAMbufferSize, the more memory you consume FWIW. > > OK, then, does it always OOM on the same document? Are you trying to index > any particularly large documents? > > Erick > > On Thu

Re: Core dumped

2010-05-27 Thread Erick Erickson
The larger your RAMbufferSize, the more memory you consume FWIW. OK, then, does it always OOM on the same document? Are you trying to index any particularly large documents? Erick On Thu, May 27, 2010 at 1:28 PM, Saurabh Agarwal wrote: > RAMBufferSize id 50 Mb, i tried with 200 too > the index

Re: Core dumped

2010-05-27 Thread Saurabh Agarwal
RAMBufferSize is 50 MB; I tried with 200 too. The index is unoptimized. MergeFactor is the default 10 and I have not changed it; MaxBufferedDocs is also default. Saurabh Agarwal On Thu, May 27, 2010 at 10:31 PM, Erick Erickson wrote: > What have you set various indexwriter properties to? Particularly

Re: Core dumped

2010-05-27 Thread Erick Erickson
What have you set various indexwriter properties to? Particularly things like merge factor, max buffered docs and ram buffer size. The first thing I'd look at is MergeFactor. From the JavaDocs: Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used
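The knobs discussed in this thread map onto the Lucene 3.0-era IndexWriter setters as sketched below. The directory path and the specific values are illustrative assumptions for a memory-constrained box, not recommendations from the thread:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class WriterTuning {
    public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/index")),  // hypothetical path
            new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);

        // Keep the in-memory buffer small on a 512 MB machine: buffered
        // documents are flushed to a new segment once they use this much RAM.
        writer.setRAMBufferSizeMB(16.0);

        // A lower merge factor merges fewer segments at once, so merges need
        // less transient RAM and fewer open files, at some indexing-speed cost.
        writer.setMergeFactor(5);

        writer.close();
    }
}
```

Note the trade-off Erick describes: a larger RAM buffer speeds indexing but consumes heap, which is exactly what a 512 MB system cannot spare.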

Re: Test File locks

2010-05-27 Thread Spencer Tickner
Hi Ian and Chris, Thanks for the responses. I seem to have jarred the lock loose. Not exactly sure what step did it, as I tried tweaking the OS, using Ian's suggestion of the SimpleFSLockFactory, and tried various UNC paths to specify the directory. I've tried to work backwards to re-create the pro

Core dumped

2010-05-27 Thread Saurabh Agarwal
Hi, when I run Lucene on a 512 MB system I get the following error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.index.DocumentsWriter$ByteBlockAllocator.getByteBlock(DocumentsWriter.java:1206) and sometimes An unexpected error h

Re: IndexSearcher - open file handles by deleted files

2010-05-27 Thread Michael McCandless
Just closing IndexSearcher should be enough. Are you really sure you're closing all IndexSearchers you've opened? Hmm the code looks somewhat dangerous. Why sleep for 10 seconds before closing? Is this to ensure any in-flight queries finish? It's better to explicitly track this (eg w/ IndexRea
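The "explicitly track in-flight queries" idea can be sketched with a plain reference counter, instead of sleeping a fixed 10 seconds and hoping queries have drained. This is a stdlib-only illustration of the pattern (Lucene's own IndexReader exposes the same idea as incRef()/decRef()); the class name is made up:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Close the underlying searcher/reader only after the last in-flight query
// has released it. The owner holds one reference from construction; each
// query takes a reference before searching and releases it when done.
public class RefCounted {
    private final AtomicInteger refCount = new AtomicInteger(1); // owner's ref
    private volatile boolean closed = false;

    public void incRef() {
        if (closed) throw new IllegalStateException("already closed");
        refCount.incrementAndGet();
    }

    public void decRef() {
        if (refCount.decrementAndGet() == 0) {
            closed = true;
            // here: actually close the IndexSearcher / IndexReader
        }
    }

    public boolean isClosed() { return closed; }
}
```

Usage: a query calls `incRef()` before searching and `decRef()` in a `finally` block; when swapping in a new searcher, the owner calls `decRef()` once to drop its initial reference, and the close happens as soon as the last running query finishes, with no sleep. (This sketch leaves out the check-then-act race in `incRef`, which production code would close with a compare-and-set loop.)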

Re: IndexReader.getSequentialSubReaders() usage in Lucene 2.9+

2010-05-27 Thread Simon Willnauer
Hey Nikolay, On Thu, May 27, 2010 at 11:00 AM, Nikolay Zamosenchuk wrote: > Hi, Dear colleagues! > I have one question concerning IndexReader.getSequentialSubReaders() > and its usage. getSequentialSubReaders() was introduced to support per-segment search in Lucene 2.9. It is used to access the
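The contract in play here: a null return from getSequentialSubReaders() means "this reader is atomic". Per-segment code in 2.9+ typically walks the reader tree roughly as sketched below (Lucene ships a similar helper; the class name here is made up), which shows why a composite reader returning null would be wrongly treated as a single leaf:

```java
import java.util.List;

import org.apache.lucene.index.IndexReader;

public class GatherLeaves {
    // Recursively collect the atomic (segment-level) readers. A null from
    // getSequentialSubReaders() is taken to mean "I am atomic", so a
    // DirectoryReader/MultiReader subclass returning null opts out of
    // per-segment search and is handled as one big leaf.
    public static void gather(IndexReader reader, List<IndexReader> leaves) {
        IndexReader[] subs = reader.getSequentialSubReaders();
        if (subs == null) {
            leaves.add(reader);
        } else {
            for (IndexReader sub : subs) {
                gather(sub, leaves);
            }
        }
    }
}
```

So returning null from a composite reader is only "safe" if every per-segment consumer can fall back to whole-reader APIs, which defeats the purpose of the 2.9 per-segment design.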

IndexReader.getSequentialSubReaders() usage in Lucene 2.9+

2010-05-27 Thread Nikolay Zamosenchuk
Hi, dear colleagues! I have one question concerning IndexReader.getSequentialSubReaders() and its usage. Imagine there is a class extending DirectoryReader or MultiReader. Usually a directory- or multi-reader consists of sub-readers (i.e. segment readers). Is it safe enough to return always null in

Re: About loading lazily

2010-05-27 Thread Shaun Senecal
Haha, sorry, ignore my response completely. I used the same term for something completely different, and not related to Lucene at all :) On Mon, May 24, 2010 at 8:06 PM, Grant Ingersoll wrote: > I'd also add that the Document keeps a pointer to the spot in storage where > that value can be loa