I want to analyze a text twice so that I can get some statistical
information from it.
TokenStream tokenStream = null;
Analyzer wa = new WhitespaceAnalyzer();
try {
    tokenStream = wa.reusableTokenStream(fieldName, new StringReader(text));
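The snippet above is cut off by the archive; for reference, a minimal runnable sketch of the two-pass idea against the Lucene 3.0 APIs. The field name, the sample text, and the statistic being gathered (term frequencies) are placeholders for the poster's actual use case:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TwoPassStats {
  public static void main(String[] args) throws IOException {
    Analyzer wa = new WhitespaceAnalyzer();
    String text = "some sample text to analyze twice";

    // Pass 1: collect term frequencies.
    Map<String, Integer> freqs = new HashMap<String, Integer>();
    TokenStream ts = wa.reusableTokenStream("content", new StringReader(text));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      Integer n = freqs.get(termAtt.term());
      freqs.put(termAtt.term(), n == null ? 1 : n + 1);
    }
    ts.end();
    ts.close();

    // Pass 2: re-tokenize the same text; reusableTokenStream hands back
    // the same underlying stream, reset against the new Reader.
    ts = wa.reusableTokenStream("content", new StringReader(text));
    termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(termAtt.term() + " -> " + freqs.get(termAtt.term()));
    }
    ts.end();
    ts.close();
  }
}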
I am just trying out a few experiments to calculate similarity between terms
based on their co-occurrences in the dataset... Basically I am trying to build
contextual vectors and calculate similarity using a similarity measure (say
cosine similarity).
I don't think this is an XY problem.
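For completeness, a tiny sketch of the cosine measure itself, assuming the contextual vectors are kept as sparse term-to-weight maps (that representation is an assumption, nothing Lucene-specific):

import java.util.Map;

public final class Cosine {
  /** cos(a, b) = dot(a, b) / (|a| * |b|) over sparse term->weight maps. */
  public static double similarity(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Double w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w; // only shared terms contribute
    }
    for (double v : b.values()) normB += v * v;
    return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
  }
}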
OK, let's back up a level. WHY are you building these
vectors? Where I'm going with this is I wonder if this
is an XY problem, see:
http://people.apache.org/~hossman/#xyproblem
Best
Erick
On Thu, May 27, 2010 at 7:49 PM, kannan chandrasekaran
wrote:
> Uwe,
>
> I now see the problem with overlapping terms across segments...
Uwe,
I now see the problem with overlapping terms across segments...Thanks...
Erick,
Good point... My use case for this is: I am trying to build vectors for
individual terms and documents, and I need to know the size to handle memory
constraints.
Thanks
Kannan
I suspect it's not supported because it hasn't been seen
as valuable enough to put the effort into. You simply asked
if it was supported without any use-case, and I'm having a
hard time coming up with one on my own.
If it's important to your particular situation, you could
have a special document...
> Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the
> new Term Attribute will implement CharSequence, then it's even simpler. You
> may also look at 3.1's ICU contrib that has support even for Normalizer2.
Ok, I've only been looking at 3.0.1 so far; I'll check out the 3.1 ICU contrib.
It's not efficient, because you cannot make it efficient given the overlapping
terms across segments (as noted before).
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -----Original Message-----
> From: kannan chandrasekaran [mailto:ckanna...@yahoo.com]
> Se
Hi Yonik,
Thanks for the quick response. I am curious as to why this is not supported
whereas numDocs() is supported? Even in the upcoming version it's only
supported per segment and not across the index; why? Is it difficult to
implement efficiently?
Pardon my ignorance if I am missing something obvious.
Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the
new Term Attribute will implement CharSequence, then it's even simpler. You
may also look at 3.1's ICU contrib that has support even for Normalizer2.
Overriding StandardAnalyzer is the wrong way, as in 3.1 it's final (it can no
longer be subclassed).
> I'd like to have all my queries and terms run through Unicode
> Normalization prior to being executed/indexed. I've been using the
> StandardAnalyzer with pretty good luck for the past few years, so I
> think I'd like to write an analyzer that wraps that, and tacks a
> custom TokenFilter onto the end of it.
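A rough sketch of that wrapping approach against the 3.0 API, using java.text.Normalizer; the CharBuffer.wrap trick mentioned above avoids copying the term buffer to a String just to test whether it is already normalized (the choice of NFC here is an assumption):

import java.io.IOException;
import java.nio.CharBuffer;
import java.text.Normalizer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Applies Unicode normalization (NFC) to every token. */
public final class UnicodeNormalizationFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public UnicodeNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // View the term buffer as a CharSequence without copying it.
    CharSequence term =
        CharBuffer.wrap(termAtt.termBuffer(), 0, termAtt.termLength());
    if (!Normalizer.isNormalized(term, Normalizer.Form.NFC)) {
      termAtt.setTermBuffer(Normalizer.normalize(term, Normalizer.Form.NFC));
    }
    return true;
  }
}

Rather than subclassing StandardAnalyzer (final in 3.1), the filter would be tacked on in a small Analyzer of your own whose tokenStream() returns new UnicodeNormalizationFilter(...) wrapped around StandardAnalyzer's stream.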
Also in 2.9.2 and 3.0.1:
http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/index/IndexReader.html#getUniqueTermCount()
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/index/IndexReader.html#getUniqueTermCount()
Please note, this works only with SegmentReaders, so you have to call it per
segment; the per-segment counts cannot simply be added up because of
overlapping terms.
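To make that caveat concrete, a sketch against the 3.0 API that sums the per-segment counts; the sum is only an upper bound, since a term occurring in several segments is counted once per segment:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;

public final class TermCounts {
  /**
   * Sum of getUniqueTermCount() over all segments. This is an UPPER BOUND
   * on the index-wide unique term count, not the exact value, because the
   * segments' term dictionaries overlap.
   */
  public static long uniqueTermUpperBound(IndexReader reader) throws IOException {
    IndexReader[] subs = reader.getSequentialSubReaders();
    if (subs == null) {
      return reader.getUniqueTermCount(); // the reader is itself one segment
    }
    long sum = 0;
    for (IndexReader sub : subs) {
      sum += uniqueTermUpperBound(sub); // sub-readers may be composite too
    }
    return sum;
  }
}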
On Thu, May 27, 2010 at 2:32 PM, kannan chandrasekaran
wrote:
> I was wondering if there is a way to retrieve the number of unique terms in
> the Lucene index (version 2.4.0)... I am aware of the terms() and terms(Term)
> methods that return an enumeration (TermEnum), but that involves iterating
> through all the terms.
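On 2.4, where getUniqueTermCount() is not available, walking the TermEnum is the straightforward way to get an exact count; it is linear in the number of terms. A sketch:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public final class ExactTermCount {
  /** Exact unique-term count by iterating the whole term dictionary. */
  public static long count(IndexReader reader) throws IOException {
    TermEnum terms = reader.terms();
    try {
      long n = 0;
      while (terms.next()) {
        n++;
      }
      return n;
    } finally {
      terms.close();
    }
  }
}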
I will check it out!!
Saurabh Agarwal
On Thu, May 27, 2010 at 11:13 PM, Erick Erickson wrote:
> The larger your RAMbufferSize, the more memory you consume FWIW.
>
> OK, then, does it always OOM on the same document? Are you trying to index
> any particularly large documents?
>
> Erick
>
> On Thu
The larger your RAMbufferSize, the more memory you consume FWIW.
OK, then, does it always OOM on the same document? Are you trying to index
any particularly large documents?
Erick
On Thu, May 27, 2010 at 1:28 PM, Saurabh Agarwal wrote:
> RAMBufferSize is 50 MB, I tried with 200 too
> the index is unoptimized
RAMBufferSize is 50 MB; I tried with 200 too.
The index is unoptimized.
MergeFactor is the default 10 and I have not changed it.
MaxBufferedDocs is also the default.
Saurabh Agarwal
On Thu, May 27, 2010 at 10:31 PM, Erick Erickson wrote:
> What have you set the various IndexWriter properties to? Particularly
What have you set the various IndexWriter properties to? Particularly
things like merge factor, max buffered docs and RAM buffer size.
The first thing I'd look at is MergeFactor. From the Javadocs:
Determines how often segment indices are merged by addDocument(). With
smaller values, less RAM is used while indexing, and searches on unoptimized
indices are faster, but indexing speed is slower.
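For reference, a sketch of where those knobs live on the 3.0 IndexWriter; the path and the values are placeholders, not recommendations:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterTuning {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // Flush buffered documents once they use ~32 MB of heap (default 16).
    // The larger this is, the more memory indexing consumes.
    writer.setRAMBufferSizeMB(32.0);

    // How many segments get merged at once; smaller values use less RAM
    // during merges but merge more often.
    writer.setMergeFactor(10);

    writer.close();
  }
}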
Hi Ian and Chris,
Thanks for the responses.
I seem to have jarred the lock loose. Not exactly sure which step did
it, as I tried tweaking the OS, using Ian's suggestion of the
SimpleFSLockFactory, and trying various UNC paths to specify the
directory. I've tried to work backwards to re-create the problem.
Hi,
when I am running Lucene on a 512 MB system,
I am getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at
org.apache.lucene.index.DocumentsWriter$ByteBlockAllocator.getByteBlock(DocumentsWriter.java:1206)
and sometimes
An unexpected error h
Just closing IndexSearcher should be enough.
Are you really sure you're closing all IndexSearchers you've opened?
Hmm, the code looks somewhat dangerous. Why sleep for 10 seconds
before closing? Is this to ensure any in-flight queries finish? It's
better to explicitly track this (eg w/ IndexReader.incRef/decRef).
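That explicit tracking can be done with IndexReader's reference counting; a sketch, assuming a single shared reader behind the searchers:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public final class TrackedSearch {
  /**
   * Pins the reader for the duration of one query: a close() elsewhere
   * only really releases the reader after the last in-flight decRef().
   */
  public static TopDocs search(IndexReader reader, Query query, int n)
      throws IOException {
    reader.incRef();
    try {
      return new IndexSearcher(reader).search(query, n);
    } finally {
      reader.decRef();
    }
  }
}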
Hey Nikolay,
On Thu, May 27, 2010 at 11:00 AM, Nikolay Zamosenchuk
wrote:
> Hi, Dear colleagues!
> I have one question concerning IndexReader.getSequentialSubReaders()
> and its usage.
getSequentialSubReaders() was introduced to support Per-Segment Search
in Lucene 2.9. It is used to access the individual segment readers directly.
Hi, Dear colleagues!
I have one question concerning IndexReader.getSequentialSubReaders()
and its usage.
Imagine there is a class extending DirectoryReader or MultiReader.
Usually a directory- or multi-reader consists of sub-readers (i.e.
segment readers). Is it safe enough to always return null from
getSequentialSubReaders() in such a class?
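Returning null is what atomic (segment-level) readers already do; callers are expected to recurse and treat null as "this reader is itself a leaf", roughly like this sketch (it mirrors what ReaderUtil.gatherSubReaders does in 2.9/3.0):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;

public final class Leaves {
  /** Collects the leaf (segment-level) readers under any IndexReader. */
  public static List<IndexReader> gather(IndexReader top) {
    List<IndexReader> out = new ArrayList<IndexReader>();
    collect(out, top);
    return out;
  }

  private static void collect(List<IndexReader> out, IndexReader r) {
    IndexReader[] subs = r.getSequentialSubReaders();
    if (subs == null) {
      out.add(r); // null means: treat this reader as a single segment
    } else {
      for (IndexReader sub : subs) {
        collect(out, sub);
      }
    }
  }
}

So returning null from a custom multi-reader should be safe, but per-segment search will then treat it as one big atomic reader and you lose the per-segment benefits.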
Haha, sorry, ignore my response completely. I used the same term for
something completely different, and not related to Lucene at all :)
On Mon, May 24, 2010 at 8:06 PM, Grant Ingersoll wrote:
> I'd also add that the Document keeps a pointer to the spot in storage where
> that value can be loaded.