applying cosine similarity directly

2009-09-11 Thread Alexy Khrabrov
Given that I have a field for which a term vector was computed and stored, and that field is the text of a document, I'd like to rank a subset of such documents by similarity to a given held-out document, or query, directly using the cosine measure. How can that be done without going through creating…
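A minimal sketch of that direct approach, assuming the field was indexed with Field.TermVector.YES and using the Lucene 2.x TermFreqVector API; the raw-frequency weighting (no idf or length normalization) and the helper names are illustrative only:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    import java.util.HashMap;
    import java.util.Map;

    // Pull a stored term vector into a sparse term -> frequency map.
    static Map<String,Integer> vectorOf(IndexReader reader, int docId, String field)
        throws Exception {
      TermFreqVector tfv = reader.getTermFreqVector(docId, field); // null if no vector stored
      Map<String,Integer> v = new HashMap<String,Integer>();
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) v.put(terms[i], freqs[i]);
      return v;
    }

    // Cosine similarity between two sparse frequency vectors.
    static double cosine(Map<String,Integer> a, Map<String,Integer> b) {
      double dot = 0, na = 0, nb = 0;
      for (Map.Entry<String,Integer> e : a.entrySet()) {
        na += (double) e.getValue() * e.getValue();
        Integer f = b.get(e.getKey());
        if (f != null) dot += (double) e.getValue() * f;
      }
      for (int f : b.values()) nb += (double) f * f;
      return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

Ranking the subset is then just a loop over the candidate docIds, scoring each one against the held-out document's vector and sorting by the result.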

Re: Stopping a runaway search, any ideas?

2009-09-11 Thread Chris Hostetter

Enumerating NumericField using TermEnum?

2009-09-11 Thread Phil Whelan
Hi, I've used NumericField to store my "hour" field. Example: doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12"))); Before, I was using a plain string Field and enumerating the values with TermEnum, which worked fine. Now that I'm using NumericFields, I'm not sure how to port this enumeration…
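A sketch of one way to port it, assuming Lucene 2.9's NumericUtils: NumericField writes trie-encoded terms, so a plain TermEnum also sees lower-precision helper terms, which carry a shift > 0 in their first character and must be skipped (or avoided entirely by indexing with precisionStep = Integer.MAX_VALUE):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.util.NumericUtils;

    // Enumerate the "hour" field, decoding only full-precision (shift == 0) terms.
    TermEnum te = reader.terms(new Term("hour", ""));
    try {
      do {
        Term t = te.term();
        if (t == null || !"hour".equals(t.field())) break; // walked past the field
        if (t.text().charAt(0) == NumericUtils.SHIFT_START_INT) { // shift == 0
          int hour = NumericUtils.prefixCodedToInt(t.text());
          System.out.println("hour=" + hour);
        }
      } while (te.next());
    } finally {
      te.close();
    }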

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Michael McCandless
On Fri, Sep 11, 2009 at 1:15 PM, wrote: > I've been testing out "paging" the document this past week. I'm still working on getting a successful test and think I'm close. The downside was a drastic slowdown in indexing speed, and lots of open files, but that was expected. You mean a slowdown…

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Paul_Murdoch
Thanks Mike! I've been testing out "paging" the document this past week. I'm still working on getting a successful test and think I'm close. The downside was a drastic slowdown in indexing speed, and lots of open files, but that was expected. I tried with small mergeFactors, maxBufferedDocs…
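For reference, a minimal sketch of the paging idea under discussion: split the big file into fixed-size character chunks and index each chunk as its own Document sharing a file id. The field names, page size, and 2.4-style Field flags are assumptions, and a naive chunker like this can split a term across a page boundary:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Index one large file as many small "page" documents so that no single
    // document has to fit in the indexing buffer at once.
    static void indexPaged(IndexWriter writer, String path, int pageChars)
        throws Exception {
      BufferedReader in = new BufferedReader(new FileReader(path));
      try {
        char[] buf = new char[pageChars];
        int page = 0, n;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
          Document doc = new Document();
          doc.add(new Field("file", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
          doc.add(new Field("page", Integer.toString(page++), Field.Store.YES, Field.Index.NOT_ANALYZED));
          doc.add(new Field("text", new String(buf, 0, n), Field.Store.NO, Field.Index.ANALYZED));
          writer.addDocument(doc);
        }
      } finally {
        in.close();
      }
    }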

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Michael McCandless
To minimize Lucene's RAM usage during indexing, you should flush after every document, e.g. by setting ramBufferSizeMB to something tiny (or maxBufferedDocs to 1). But, unfortunately, Lucene cannot flush partway through indexing one document. I.e., the full document must be indexed into RAM before…
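A sketch of those two knobs, with the caveat that recent IndexWriter versions reject maxBufferedDocs below 2, so the tiny RAM buffer is the safer setting:

    // Flush as often as possible by keeping the indexing buffer small.
    writer.setRAMBufferSizeMB(1.0);      // flush once ~1 MB of buffered state accumulates
    // or: writer.setMaxBufferedDocs(2); // flush every couple of documents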

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Brian Pinkerton
Quite possibly, but shouldn't one expect Lucene's resource usage to track the size of the problem in question? Paul's two examples below use input files of 5 and 62 MB, hardly the size of input I'd expect to have to handle in a memory-compromised environment. bri On Sep 11, 2009, at 7:43 AM, Glen Newton…

How to delete documents from an index, and how to reset the remote multisearcher so the deleted docs are not shown in the search results?

2009-09-11 Thread Ariel
Hi everybody: I am using Lucene version 2.3.2 to index and search my documents. The problem is that I have a remote search server implemented this way: [code] Searcher parallelSearcher; try { parallelSearcher = new ParallelMultiSearcher(search…
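A rough sketch of the usual pattern (the id value and index paths are hypothetical): delete by a unique-id term, publish the deletes, then throw away and rebuild the searcher, since an open (Parallel)MultiSearcher keeps serving the point-in-time snapshot it was opened on:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;

    // 1) Delete by a unique-id term and make the deletes visible on disk.
    writer.deleteDocuments(new Term("id", "doc-42")); // hypothetical id field/value
    writer.close(); // 2.3.x has no commit(); closing flushes the deletes

    // 2) Rebuild the searcher over fresh readers; the old searcher keeps
    //    returning the deleted docs until it is closed and re-created.
    oldSearcher.close();
    Searcher searcher = new ParallelMultiSearcher(new Searchable[] {
      new IndexSearcher("/path/to/index1"), // hypothetical index paths
      new IndexSearcher("/path/to/index2"),
    });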

Re: Stopping a runaway search, any ideas?

2009-09-11 Thread Daniel Shane
Wow, that's exactly what I was looking for! In the meantime I'll use the time-based collector. Thanks Uwe and Mark for your help! Daniel Shane mark harwood wrote: Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout testing at all index access stages prior to calls to…

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Paul_Murdoch
Glen, Absolutely. I think an RMFC Lucene would be great, especially for reduced-memory or low-bandwidth client/server scenarios. I just looked at your LuSql tool and it's just what I needed about 9 months ago :-). I wrote a simple re-indexer that interfaces to an SQL Server 2005 database and Lucene…

Re: Stopping a runaway search, any ideas?

2009-09-11 Thread mark harwood
Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout testing at all index access stages prior to calls to Collector, e.g. it will catch a runaway fuzzy query during its expensive term-expansion phase.

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
Paul, I saw your last post and now understand the issues you face. I don't think there has been any effort to produce a reduced-memory-footprint configurable (RMFC) Lucene. With the many mobile, embedded, and other reduced-memory devices, should this perhaps be one of the areas the Lucene…

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Paul_Murdoch
Thanks Glen! I will take a look at your project. Unfortunately I will only have 512 MB to 1024 MB to work with, as Lucene is only one component in a larger software system running on one machine. I agree with you on the C/C++ comment. That is what I would normally use for memory-intensive software. I…

RE: Stopping a runaway search, any ideas?

2009-09-11 Thread Uwe Schindler
Yes: TimeLimitedCollector in 2.4.1 (and the new non-deprecated ones in 2.9). Just wrap your own collector (like TopDocsCollector) with this class. - Uwe Schindler, http://www.thetaphi.de
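A sketch against the 2.9 API (the time budget and hit count are arbitrary): TimeLimitingCollector throws TimeExceededException when the budget expires, and whatever was gathered so far is still in the wrapped collector:

    import org.apache.lucene.search.TimeLimitingCollector;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    // Wrap a normal collector so hit collection aborts after 1000 ms.
    TopScoreDocCollector top = TopScoreDocCollector.create(10, true);
    try {
      searcher.search(query, new TimeLimitingCollector(top, 1000));
    } catch (TimeLimitingCollector.TimeExceededException e) {
      // Timed out: 'top' holds the partial results collected so far.
    }
    TopDocs hits = top.topDocs();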

Stopping a runaway search, any ideas?

2009-09-11 Thread Daniel Shane
I don't think it's possible, but is there something in Lucene to cap a search to a predefined time length, or is there a way to stop a search when it's running for too long? Daniel Shane

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Paul_Murdoch
Thanks Dan! I upgraded my JVM from .12 to .16. I'll test with that. I've been testing by setting many IndexWriter parameters manually to see where the best performance is. The net result was just delaying the OOM. The scenario is a test with an empty index. I have a 5 MB file with 800,000 un…

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
In this project: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html I concatenate the text of all the articles of a single journal into a single text file. This can create a text file that is 500 MB in size. Lucene is OK indexing files this size (in parallel, even),…

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Dan OConnor
Paul: My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage-collection-related issues resolved in versions .10-.13 (especially dealing with large heaps). Next, your IndexWriter parameters would help figure out why you are using so mu…

RE: Indexing large files? - No answers yet...

2009-09-11 Thread Paul_Murdoch
This issue is still open. Any suggestions/help with this would be greatly appreciated. Thanks, Paul

Re: Index docstore flush problem

2009-09-11 Thread Michael McCandless
Phew :) Mike On Thu, Sep 10, 2009 at 8:14 PM, Jason Rutherglen wrote: > Indexing locking was off, there was a bug higher up clobbering the index. Sorry and thanks! > On Thu, Sep 10, 2009 at 4:49 PM, Michael McCandless wrote: >> That's an odd exception. It means IndexWriter thinks 468 do…