Re: updating index

2007-03-01 Thread Doron Cohen
Daniel Noll <[EMAIL PROTECTED]> wrote on 01/03/2007 22:10:15: > > API IndexWriter.updateDocument() may be useful. > > Whoa, nice convenience method. > > I don't suppose the new document happens to be given the same ID as the > old one. That would make many people's lives much easier. :-) Oh no,

Re: updating index

2007-03-01 Thread Daniel Noll
Doron Cohen wrote: Once indexing the database_id field this way, also the newly added API IndexWriter.updateDocument() may be useful. Whoa, nice convenience method. I don't suppose the new document happens to be given the same ID as the old one. That would make many people's lives much easie
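As a rough illustration of the question above: IndexWriter.updateDocument() behaves like a delete of the documents matching a term followed by an add, so the replacement document normally ends up with a new internal document number while its application-level database_id stays the same. The toy model below is not Lucene code. ToyIndex and all of its methods are hypothetical names, and it assumes internal numbers are simply handed out sequentially on add.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index; NOT Lucene's API. All names are hypothetical.
final class ToyIndex {
    // Slot i holds the database_id of internal document i, or null once deleted.
    private final List<String> slots = new ArrayList<>();

    // Appends a document and returns its internal number.
    int add(String databaseId) {
        slots.add(databaseId);
        return slots.size() - 1;
    }

    // Marks every live document with this database_id as deleted; returns how many.
    int delete(String databaseId) {
        int deleted = 0;
        for (int i = 0; i < slots.size(); i++) {
            if (databaseId.equals(slots.get(i))) {
                slots.set(i, null);
                deleted++;
            }
        }
        return deleted;
    }

    // Delete-then-add: the new copy gets a fresh internal number.
    int update(String databaseId) {
        delete(databaseId);
        return add(databaseId);
    }

    // Returns the internal number of the live document, or -1 if none exists.
    int find(String databaseId) {
        for (int i = 0; i < slots.size(); i++) {
            if (databaseId.equals(slots.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

In this model, anything that must survive an update has to be keyed on the stored database_id field rather than on the internal number.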

Re: Field Selector in Searcher interface

2007-03-01 Thread Grant Ingersoll
The odds increase significantly in correlation to patches submitted! :-) The odds increase slightly by at least filing an "enhancement" issue in JIRA. They increase a tiny bit by bringing it up here! I may have some time in the not too distant future for this, but we always appreciate t

Field Selector in Searcher interface

2007-03-01 Thread Mark Miller
What are the odds of (or reasons against) bubbling up doc(int, fieldSelector) to Searcher? I would love to take advantage of the selective field loading but I am working with MultiSearchers and Searchers so I cannot count on getReader (in IndexSearcher) for access. - Mark --

RE: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread Steven Parkes
If all you want to do is find docs containing dates within a range, it probably doesn't make much difference whether you give dates their own field or put them into your content field. It'll probably be easier to just add them into the token stream since that's the way the analyzer architecture wan

More long running queries

2007-03-01 Thread Tim Johnson
I'm still having issues with long-running queries. I'm using a custom HitCollector to bring back ALL docs that match a search, as suggested in a previous post/reply (e.g. Nutch LuceneQueryOptimizer). This solution works most of the time; however, in testing a very complex query using several ran

Re: document field updates

2007-03-01 Thread Erik Hatcher
On Mar 1, 2007, at 1:35 PM, Neal Richter wrote: Collex is quite open source, it's just ugly source :) We're the 'patacriticism' project at SourceForge, under the "collex" directory in Subversion. Collex implements tagging by implementing JOIN cross-references between user/tag documents and regu

Re: [Fwd: Re: indexing performance]

2007-03-01 Thread Mike Klaas
On 3/1/07, Saravana <[EMAIL PROTECTED]> wrote: Is this still hold good now ? Thanks for your reply. Probably most of that still applies to some extent. However, it is unclear whether it will speed up your application. First thing is to find out what your bottleneck is. Looking at the stats

Re: document field updates

2007-03-01 Thread Neal Richter
Collex is quite open source, it's just ugly source :) We're the 'patacriticism' project at SourceForge, under the "collex" directory in Subversion. Collex implements tagging by implementing JOIN cross-references between user/tag documents and regular object documents. Its scalability is not goi

Re: document field updates

2007-03-01 Thread Andrzej Bialecki
Erik Hatcher wrote: I'm pretty sure this has been done, I'm just not 100% sure where. Does Nutch index link text? Nutch does do this sort of thing, but I'm not quite sure how. It isn't doing any operations to the Lucene index beyond what plain ol' Lucene does. Nutch maintains a set of s

Re: document field updates

2007-03-01 Thread Erik Hatcher
On Feb 28, 2007, at 8:59 AM, Steven Parkes wrote: Are unindexed fields stored separately from the main inverted index? If so, could one implement the field value change as a delete and re-add of just that value? The short answer is that won't work. Field values are

Re: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread Walt Stoneburner
Thank you all for the suggestions steering me down the right path. As an aside, the easy part, at least for me, is extracting the dates -- Peter was dead on about how to do that: heuristics, multiple regular expressions, and data structures. As Steve pointed out, this isn't as trivial as it soun

RE: TextMining.org Word extractor

2007-03-01 Thread Bruce Ritchie
I can't speak to where you can get a copy of the original code, but the modified code I have is not GPL licenced - the license header in at least one file is as follows: /* Copyright 2004 Ryan Ackley * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this f

Re: TextMining.org Word extractor

2007-03-01 Thread Bill Taylor
On Feb 23, 2007, at 2:00 PM, [EMAIL PROTECTED] wrote: Re: TextMining.org Word extractor Someone noted that textmining.org gets hacked. There is test-mining.org which appears to be a commercial site. Can someone tell me where to get the download of the original GPL textmining.org so

Re: [Fwd: Re: indexing performance]

2007-03-01 Thread Saravana
Hi, You need just the counts? And you want to do just whole-field matching, not word matching? In that case, Lucene might be an overkill for you. Or, if you do use Lucene, make sure to use "keyword" (untokenized) fields, not "tokenized" fields. Sorry for not elaborating my requirement more. Actu

question about ScoreDocComparator

2007-03-01 Thread Ulf Dittmer
Hello- One of the fields in my index is an ID, which maps to a full text description behind the scenes. Now I want to sort the search results alphabetically according to the description, not the ID. This can be done via SortComparatorSource and a ScoreDocComparator without problems. But t

RE: Performance in having Multiple Index files

2007-03-01 Thread Mordo, Aviran (EXP N-NANNATEK)
Yes, it will affect the search performance because you need to merge the results from the different indexes. The best performance is from a single index. The more indexes you have the more time it takes to search. Aviran http://www.aviransplace.com -Original Message- From: Raaj [mailto:[

retrieve term positions in query

2007-03-01 Thread matpil
Hi! My problem is to retrieve the term positions in a "general" query with more than one term. It seems that with a phrase query it's possible (with SpanQuery), but with "AND" and "OR" queries I can't get the positions for each document I search. I'm looking for a high level implementation because

Re: Spanned indexes

2007-03-01 Thread Otis Gospodnetic
Sachin, A lot of the questions you are asking are covered either in the FAQ or on the Lucene site somewhere, or in various Lucene articles or in LIA. You should check those places first (the traffic on java-user is already high!), you'll save yourself a lot of time. For this particular questio

Re: Sorting by Score

2007-03-01 Thread Peter Keegan
Erick, I think you're right because you wouldn't know the max score before the comparisons. I'm just thinking about a rounding algorithm that involves comparing the raw scores to the theoretical maximum score, which I think could be computed from the Similarity class and knowing the max boost v

Re: [ANN] ParallelSearcher in multi-node environment

2007-03-01 Thread Sharad Agarwal
Yeah, I too am looking forward to this feature: using a thread pool and minimizing the remote calls in ParallelSearcher. [EMAIL PROTECTED] wrote: e.g. I've changed original ParallelSearcher to use thread pool (java.util.concurrent.ThreadPoolExecutor from jdk 1.5). But implementing multi-host insta

Re: Sorting by Score

2007-03-01 Thread Erick Erickson
Peter: About a custom ScoreComparator. The problem I couldn't get past was that I needed to know the max score of all the docs in order to divide the raw scores into quintiles since I was dealing with raw scores. I didn't see how to make that work with ScoreComparator, but I confess that I didn't
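To make the max-score dependency concrete, here is a minimal sketch of bucketing raw scores into quintiles. It is an assumption-laden illustration, not code from this thread: ScoreQuintiles, quintile and max are hypothetical names, and it assumes all raw scores are non-negative. The point it shows is that no bucket can be assigned until the maximum over all hits is known, which is why a comparator that only sees two documents at a time is awkward here.

```java
// Hypothetical helper; not part of Lucene.
final class ScoreQuintiles {
    // Returns the largest value in scores, or 0 if the array is empty.
    static float max(float[] scores) {
        float best = 0f;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        return best;
    }

    // Maps a raw score to a bucket 0..4 relative to maxScore (4 is the top fifth).
    static int quintile(float score, float maxScore) {
        if (maxScore <= 0f) {
            return 0; // nothing to scale against
        }
        int bucket = (int) (5 * score / maxScore);
        // The top score itself computes to 5, so clamp into 0..4.
        return Math.max(0, Math.min(4, bucket));
    }
}
```

Documents that land in the same bucket could then be ordered by a secondary key such as date. Replacing the observed maximum with a theoretical maximum would remove the need for a first pass over the hits.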

Spanned indexes

2007-03-01 Thread Kainth, Sachin
Hi all, Is it possible in Lucene for an index to span multiple files? If so what is the recommendation in this case? Is it better to span after the index reaches a particular size? Furthermore, does Lucene ever span a single record between two or more index files in this case or does it ensure

Re: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread mark harwood
GATE is the other entity extraction framework ( http://gate.ac.uk) and comes out of the box with a lot of this stuff. Even once you've parsed the dates your next problem is representing and querying time - you referred to the fact that documents could represent single dates, multiple dates or t

RE: Update - IOException

2007-03-01 Thread Michael McCandless
"DECAFFMEYER MATHIEU" <[EMAIL PROTECTED]> wrote: > I deleted the lock file, now it seems to work ... > > When can such an error happen ? See my response I just sent to java-user on this same error. Even though you are running Lucene 2.0, the same causes can lead to that "Lock obtain timed out"

Re: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread Otis Gospodnetic
Ah, I once worked in a place where we did exactly that - recognition and extraction of useful nuggets from emails - dates, emails, URLs, attachments, people, places...see divmod.com for the next generation of that. I believe Zoe subsequently did something very similar. I think Zoe is still fre

Re: Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@

2007-03-01 Thread Michael McCandless
"Jerome Chauvin" <[EMAIL PROTECTED]> wrote: > We encounter issues while updating the lucene index, here is the stack > trace: > > Caused by: java.io.IOException: Lock obtain timed out: > SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock > at org.apache.lucene.store.Lock.obtain(Lock.java:6

Re: Best way to returning hits after search?

2007-03-01 Thread Antony Bowesman
If you decide to cache stored field value in memory, FieldCache may be useful for this - so you don't have to implement your own cache - you can access the field values with something like: FieldCache fieldCache = FieldCache.DEFAULT; String db_id_field[] = fieldCache.getStrings(indexReader,"

RE: Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
I deleted the lock file, now it seems to work ... When can such an error happen ? __ Matt From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED] Sent: Thursday, March 01, 2007 9:56 AM To: java-user@lucene.apache.org Subjec

Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
Hi, While updating my index I have the following error : [3/1/07 9:44:19:214 CET] 76414c82 SystemErr R java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\TEMP\lucene-b56f455aea0a705baecaa4411d590aa2-write.lock [3/1/07 9:44:19:214 CET] 76414c82 SystemErr R at org.apache.l

Re: indexing performance

2007-03-01 Thread Nadav Har'El
On Tue, Feb 27, 2007, Saravana wrote about "indexing performance": > Hi, > > Is it possible to scale lucene indexing like 2000/3000 documents per > second? I don't know about the actual numbers, but one trick I've used in the past to get really fast indexing was to create several independent inde

Re: [ANN] ParallelSearcher in multi-node environment

2007-03-01 Thread dmitri
e.g. I've changed original ParallelSearcher to use thread pool (java.util.concurrent.ThreadPoolExecutor from jdk 1.5). But implementing multi-host installation still requires a lot of changes since ParallelSearcher calls underlying Searchables too many times (e.g. for separate network call for ev
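A minimal sketch of the thread-pool idea, under stated assumptions: this is not the modified ParallelSearcher described above, and PooledFanOut, ToySearchable and countHits are hypothetical names. It assumes each searcher can answer a query in a single call, so every node is contacted exactly once per query, and it uses lambdas, so it needs Java 8 or later rather than JDK 1.5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical fan-out helper; not Lucene's ParallelSearcher.
final class PooledFanOut {
    // Stand-in for a remote searcher that answers a query in one call.
    interface ToySearchable {
        int countHits(String query);
    }

    // Sends the query to every searcher through a fixed-size pool and sums the hit counts.
    static int totalHits(List<ToySearchable> searchables, String query, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (ToySearchable s : searchables) {
                tasks.add(() -> s.countHits(query)); // one call per searcher
            }
            int total = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a searcher failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A long-lived searcher would more likely keep one shared pool instead of creating a pool per query. The per-call pool here only keeps the example self-contained.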

Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@

2007-03-01 Thread Jerome Chauvin
All, We encounter issues while updating the lucene index, here is the stack trace: Caused by: java.io.IOException: Lock obtain timed out: SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:69) at org.apache.lucene.index.IndexReader.aquir