Re: document field updates

2007-02-27 Thread Neal Richter
On 2/27/07, Steven Parkes <[EMAIL PROTECTED]> wrote: It is true that you can store more data and that will make it possible to get it back. Storing fields (w/ or w/o indexing) allows you to pull them back. Storing term vectors gives you something in-between nothing and everything. I will look i

Re: Sorting by Score

2007-02-27 Thread Chris Hostetter
: The first part was just to iterate through the TopDocs that's available to : me and normalize the scores right in the ScoreDocs. Like this... Won't that be done after Lucene does the hit collecting/sorting? ... he wants the "bucketing" to happen as part of the scoring so that the secondary s

Re: Sorting by Score

2007-02-27 Thread Chris Hostetter
: The constructor doesn't complain, but FieldSortedHitQueue expects a field : name when it tries to locate the comparator from the cache: can't you pick any arbitrary "marker" field name (that's not a real field name) and use that? -Hoss ---

Re: all records within distance -- small index

2007-02-27 Thread no spam
I just dug my book out. Chapter six shows a custom sort that implements a SortComparatorSource combined with a TermQuery. I like the way that works but I guess what I really need to do is a RangeQuery as well. I have another large index that has 1.2 million docs. I use a query along with a hit

Re: Best way to returning hits after search?

2007-02-27 Thread Antony Bowesman
Doron Cohen wrote: The collect() method is going to be invoked once for each document that matches the query (having a nonzero score). If the index is very large, that may turn out to be a very large number of calls. Often, search applications fetch additional data (doc fields) for only a small su

Re: optimizing single document searches

2007-02-27 Thread Russ
I will definitely check it out tomorrow. I also forgot to mention that I am not interested in the hits themselves, only whether or not there was a hit. Is there something I can use that's optimized for this scenario, or should I look into rewriting the search method of the IndexSearcher? C

Re: optimizing single document searches

2007-02-27 Thread karl wettin
28 feb 2007 kl. 00.49 skrev Russ: Thanks, I will try it tomorrow... Is it significantly different from using a standard index on a ramdir? A bit different. You can also try LUCENE-550. It has about the same speed as contrib/memory but can handle multiple documents and use reader, writer

Re: optimizing single document searches

2007-02-27 Thread Russ
Thanks, I will try it tomorrow... Is it significantly different from using a standard index on a ramdir? Russ Sent wirelessly via BlackBerry from T-Mobile. -Original Message- From: karl wettin <[EMAIL PROTECTED]> Date: Wed, 28 Feb 2007 00:37:55 To: java-user@lucene.apache.org Subject:

Re: optimizing single document searches

2007-02-27 Thread Erick Erickson
Which is very, very cool. I wound up using it for hit counting and it works like a charm. On 2/27/07, karl wettin <[EMAIL PROTECTED]> wrote: 28 feb 2007 kl. 00.25 skrev Ruslan Sivak: > On a single document of 10k characters, doing about 40k searches > takes about 5 seconds. This is not

Re: Sorting by Score

2007-02-27 Thread Erick Erickson
This may be off base, but I've recently been doing something similar. The first part was just to iterate through the TopDocs that's available to me and normalize the scores right in the ScoreDocs. Like this... for (ScoreDoc scd : this.topDocs.scoreDocs) { scd.score = this.getBucke
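The loop above is cut off in the archive; a minimal sketch of the same idea follows, assuming the bucketing rule is simply rounding the score to one decimal place (getBucketedScore below is a hypothetical helper, not part of Lucene, and the rounding rule is an assumption):

    // Hedged sketch: collapse raw scores into coarse buckets directly in the
    // ScoreDoc array, so a secondary sort (e.g. by date) can break ties.
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ScoreBucketer {
        // hypothetical helper: round to one decimal so 0.91 and 0.94 share a bucket
        static float getBucketedScore(float rawScore) {
            return Math.round(rawScore * 10f) / 10f;
        }

        static void bucketScores(TopDocs topDocs) {
            for (ScoreDoc scd : topDocs.scoreDocs) {
                scd.score = getBucketedScore(scd.score);  // ScoreDoc.score is a public field
            }
        }
    }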

Re: optimizing single document searches

2007-02-27 Thread karl wettin
28 feb 2007 kl. 00.25 skrev Ruslan Sivak: On a single document of 10k characters, doing about 40k searches takes about 5 seconds. This is not bad, but I was wondering if I can somehow speed this up. Your corpus contains only one document? Try contrib/memory, an index optimized for tha
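A minimal sketch of the contrib/memory route karl is suggesting, assuming the MemoryIndex class from contrib (org.apache.lucene.index.memory) and an assumed field name and analyzer:

    // Hedged sketch: index one document in memory, then run many queries against it.
    // MemoryIndex.search() returns a score; 0.0f means the query did not match.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class SingleDocMatcher {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            MemoryIndex index = new MemoryIndex();
            index.addField("content", "the quick brown fox jumps over the lazy dog", analyzer);

            Query query = new QueryParser("content", analyzer).parse("quick AND fox");
            float score = index.search(query);
            System.out.println("matched=" + (score > 0.0f) + " score=" + score);
        }
    }

This also fits Russ's case of only needing to know whether there was a hit: test the returned score against zero instead of collecting hits.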

optimizing single document searches

2007-02-27 Thread Ruslan Sivak
I am using Lucene in a little bit weird way, instead of searching all the documents for a specific query, I am searching a single document for many specific queries. On a single document of 10k characters, doing about 40k searches takes about 5 seconds. This is not bad, but I was wondering if

Re: Sorting by Score

2007-02-27 Thread Peter Keegan
I'm building up the Sort object for the search with 2 SortFields - first is for the custom rounded scoring, second is for date. This Sort object is used to construct a FieldSortedHitQueue which is used with a custom HitCollector. And yes, this comparator ignores the field name. hmmm, actually i
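A hedged sketch of the setup Peter describes: a Sort whose first key is a custom comparator that rounds the relevance score into buckets, and whose second key is a date field. The field names ("dummy", "date") and the rounding rule are assumptions, not from the thread:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.ScoreDocComparator;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortComparatorSource;
    import org.apache.lucene.search.SortField;

    public class RoundedScoreSort {
        static final SortComparatorSource ROUNDED_SCORE = new SortComparatorSource() {
            public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
                    throws IOException {
                // the field name is ignored; only the raw score is used
                return new ScoreDocComparator() {
                    public int compare(ScoreDoc i, ScoreDoc j) {
                        int bi = Math.round(i.score * 10f);   // bucket to one decimal place
                        int bj = Math.round(j.score * 10f);
                        return bj - bi;                       // higher bucket sorts first
                    }
                    public Comparable sortValue(ScoreDoc doc) {
                        return new Integer(Math.round(doc.score * 10f));
                    }
                    public int sortType() {
                        return SortField.CUSTOM;
                    }
                };
            }
        };

        public static Sort makeSort() {
            return new Sort(new SortField[] {
                new SortField("dummy", ROUNDED_SCORE),           // bucketed score first
                new SortField("date", SortField.STRING, true)    // then newest date
            });
        }
    }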

Re: Best way to returning hits after search?

2007-02-27 Thread Doron Cohen
The collect() method is going to be invoked once for each document that matches the query (having a nonzero score). If the index is very large, that may turn out to be a very large number of calls. Often, search applications fetch additional data (doc fields) for only a small subset of the entire se
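A minimal HitCollector along these lines, under the assumption that only doc ids and scores are needed at collect time and stored fields are read later for the handful of documents actually displayed:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.HitCollector;

    // Hedged sketch: collect() fires once per matching document, so keep it cheap.
    public class IdScoreCollector extends HitCollector {
        public final List docIds = new ArrayList();
        public final List scores = new ArrayList();

        public void collect(int doc, float score) {
            docIds.add(new Integer(doc));
            scores.add(new Float(score));
        }
    }
    // usage: searcher.search(query, new IdScoreCollector());
    // then fetch searcher.doc(id) only for the page of results you actually show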

RE: ConstantScoreQuery and MatchAllDocsQuery

2007-02-27 Thread Jean-Francois Beaulac
Hi, The existing code retrieves a TermPositionVector with IndexReader.getTermFreqVector(docId, field). It then extracts the terms from the query and stores them in two different arrays: one containing single-word terms, the other containing the phrases. For single-word terms it loops on the array

Re: indexing and searching the document title question

2007-02-27 Thread Daniel Naber
On Tuesday 27 February 2007 23:07, Phillip Rhodes wrote: > NAME:"color me mine"^2.0 (CONTENTS:color CONTENTS:me CONTENTS:mine) Try a (much) higher boost like 20 or 50; does that help? Regards Daniel -- http://www.danielnaber.de -

Re: Sorting by Score

2007-02-27 Thread Chris Hostetter
: Suppose one wanted to use this custom rounding score comparator on all : fields and all queries. How would you get it plugged in most efficiently, : given that SortField requires a non-null field name? i'm not sure i understand the first part of the question .. this custom SortComparatorSource woul

Re: Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+, numbers attached

2007-02-27 Thread Grant Ingersoll
Cool, Erick. Thanks for sharing. Actually, I would like to start a use case section on the wiki for just these types of contributions... -Grant On Feb 27, 2007, at 9:30 AM, Erick Erickson wrote: I thought I'd put up some numbers that may be useful for people who find themselves doing perfo

Re: indexing and searching the document title question

2007-02-27 Thread Phillip Rhodes
I am confused. I am following the FAQ that says indexing/searching the title of a document will cause it to be ranked higher. When I do a search on the title of my document (name in my case), the document is being returned. But it does not get ranked higher; in fact, it gets buried in the results.

Re: indexing norms???

2007-02-27 Thread Doron Cohen
* Indexing size cost - 1 byte per field per doc * Search time memory cost - 1 byte per field per doc * Usage - document score normalization by the doc field length, and a place to hold doc/field indexing time boosts, also applied during scoring. (more info in the Scoring documentation.) *
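A sketch of turning norms off per field with the Lucene 1.9/2.x API (Field.Index.NO_NORMS); this per-field switch does not exist in 1.4.3, and the field names here are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NoNormsExample {
        public static Document makeDoc(String category, String body) {
            Document doc = new Document();
            // NO_NORMS: indexed untokenized, no norm byte, so no length
            // normalization and no index-time field boost for this field
            doc.add(new Field("category", category, Field.Store.YES, Field.Index.NO_NORMS));
            // a regular tokenized field keeps its norm (1 byte per doc held in memory at search time)
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }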

Best way to returning hits after search?

2007-02-27 Thread Antony Bowesman
I am doing what I should not, i.e. iterating the Hits after a search to collect two ID fields from each document in Hits to pass back to the searcher along with the score. The index is approx 10-15 fields per doc, and indexes mail data, which is not stored, as it exists elsewhere. Each mail h

Re: Sorting by Score

2007-02-27 Thread Peter Keegan
Suppose one wanted to use this custom rounding score comparator on all fields and all queries. How would you get it plugged in most efficiently, given that SortField requires a non-null field name? Peter On 2/1/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I've not used the sorting code ye

RE: document field updates

2007-02-27 Thread Steven Parkes
It is true that you can store more data and that will make it possible to get it back. Storing fields (w/ or w/o indexing) allows you to pull them back. Storing term vectors gives you something in-between nothing and everything. However, you're still gonna get stuck on the "update" part. Lucene do

Re: all records within distance -- small index

2007-02-27 Thread Phillip Rhodes
I am doing this, but for 16000+ records. I indexed each document with the lat/long values as keywords. I added 1000 to each value to get it into the positive range. I do a range query for the lat/long, calculating the min/max for the long/lat from the origination point. Don't forget to add
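A hedged sketch of that approach: shift the coordinates into the positive range, zero-pad them so lexicographic order matches numeric order, and intersect two RangeQuery clauses. The field names, precision, and formatting are assumptions; the same padded form must be indexed as keywords:

    import java.text.DecimalFormat;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.RangeQuery;

    public class LatLonBox {
        static final DecimalFormat FMT = new DecimalFormat("0000.0000");

        static String pad(double value) {
            // shift into positive range, then zero-pad (beware locale-dependent decimal separators)
            return FMT.format(value + 1000.0);
        }

        public static BooleanQuery box(double minLat, double maxLat,
                                       double minLon, double maxLon) {
            BooleanQuery q = new BooleanQuery();
            q.add(new RangeQuery(new Term("lat", pad(minLat)),
                                 new Term("lat", pad(maxLat)), true),
                  BooleanClause.Occur.MUST);
            q.add(new RangeQuery(new Term("lon", pad(minLon)),
                                 new Term("lon", pad(maxLon)), true),
                  BooleanClause.Occur.MUST);
            return q;
        }
    }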

Re: all records within distance -- small index

2007-02-27 Thread no spam
I do have that book (at home) ... now that I think about it I believe I marked that page. I've definitely learned a lot more and I need to re-parse the book :) On 2/27/07, Erick Erickson <[EMAIL PROTECTED]> wrote: See Lucene In Action. There's an example in the book that is almost exactly what

RE: document field updates

2007-02-27 Thread Neal Richter
Steven Parkes wrote: There are no plans to do this. It's essentially impossible, given (1) the reverse nature of text indexes and (2) Lucene's write-once segment architecture. What if the field is stored and unindexed? It should be possible to update the contents of that in isolation. How wo

Re: Fwd: Unable to retrieve 2/13 field values

2007-02-27 Thread Daniel Naber
On Tuesday 27 February 2007 19:21, Michael Barbarelli wrote: > GB821628930  (+VAT_reg:GB* doesn't work) What about VAT_reg:gb*? Also see QueryParser.setLowercaseExpandedTerms() Regards Daniel -- http://www.danielnaber.de - T

Re: all records within distance -- small index

2007-02-27 Thread Erick Erickson
See Lucene In Action. There's an example in the book that is almost exactly what you want, see section 6.1 Erick On 2/27/07, no spam <[EMAIL PROTECTED]> wrote: I have a very small index of 500 docs with an index size of < 100k on disk so far. I want to whip through the docs and get only the o

Re: indexing performance

2007-02-27 Thread Chris Hostetter
: : > I am trying to index the syslogs generated from one of my busy ftp servers so that I can get counts specific to a user within a given time frame. Since : My immediate thought when reading this is whether it really is a text search engine you want to use for this? ditto ... if you a

Re: indexing and searching the document title question

2007-02-27 Thread Chris Hostetter
: 7> I think your underlying problem is that the syntax of the search isn't correct. You're really searching on NAME:color defaultfield:me defaultfield:mine : You want something like +NAME:color +NAME:me +NAME:mine or... NAME:"color me mine" -Hoss -

all records within distance -- small index

2007-02-27 Thread no spam
I have a very small index of 500 docs with an index size of < 100k on disk so far. I want to whip through the docs and get only the ones within a lat/lon within radius. I realize this isn't how lucene wants to do things (normally query search first) but how can I do this in an efficient manner?

Re: recovering an index from RAM disk.

2007-02-27 Thread Chris Hostetter
IndexWriter has an addIndexes method which takes in a directory ... so open a new IndexWriter pointed at the FSDirectory you want to write to and add your RAMDirectory to it. : Date: Tue, 27 Feb 2007 11:25:32 + : From: Martin Spamer <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org
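A short sketch of that suggestion, assuming a StandardAnalyzer and a target path; IndexWriter.addIndexes(Directory[]) copies and merges the in-memory segments into the on-disk index:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamFlush {
        // Hedged sketch: flush a RAMDirectory back to the file system.
        public static void flush(RAMDirectory ramDir, String diskPath) throws Exception {
            Directory fsDir = FSDirectory.getDirectory(diskPath, true);  // create/overwrite target
            IndexWriter writer = new IndexWriter(fsDir, new StandardAnalyzer(), true);
            writer.addIndexes(new Directory[] { ramDir });
            writer.close();
        }
    }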

Re: Highlighting issues

2007-02-27 Thread mark harwood
This snippet from the Highlighter JUnit test should reveal the solution: public void testFieldSpecificHighlighting() throws IOException, ParseException { String docMainText="fred is one of the people"; QueryParser parser=new QueryParser(FIELD_NAME,analyzer); Query
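The rest of that test is truncated here; a hedged sketch of the same field-specific highlighting, assuming the contrib Highlighter's QueryScorer(Query, fieldName) constructor and a "contents" field, so terms from clauses on other fields (e.g. Metadata:FIRST) are not highlighted:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class FieldHighlight {
        public static String highlight(String queryText, String contents) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("contents", analyzer).parse(queryText);
            // restrict term extraction to the "contents" field
            Highlighter highlighter = new Highlighter(new QueryScorer(query, "contents"));
            return highlighter.getBestFragment(analyzer, "contents", contents);
        }
    }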

Re: updating index

2007-02-27 Thread no spam
Yes correct, I'll be using the new updateDocument() api call! Erick, thanks for correcting my poor use of TermDocs :) On 2/27/07, Doron Cohen <[EMAIL PROTECTED]> wrote: "Erick Erickson" <[EMAIL PROTECTED]> wrote on 25/02/2007 07:05:21: > Yes, I'm pretty sure you have to index the field (UN_TOK
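For reference, a minimal sketch of the IndexWriter.updateDocument() call (new in Lucene 2.1), which deletes any documents matching the term and adds the new document in one operation. The "id" field name is an assumption; the id must be indexed (UN_TOKENIZED is typical) for the delete-by-term to find it:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class Updater {
        public static void update(IndexWriter writer, String id, String body) throws Exception {
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            writer.updateDocument(new Term("id", id), doc);  // delete-then-add, atomically
        }
    }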

RE: document field updates

2007-02-27 Thread Steven Parkes
There are no plans to do this. It's essentially impossible, given (1) the reverse nature of text indexes and (2) Lucene's write-once segment architecture. -Original Message- From: Arnone, Anthony [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 27, 2007 10:18 AM To: java-user@lucene.apac

Fwd: Unable to retrieve 2/13 field values

2007-02-27 Thread Michael Barbarelli
Hello. I'm using Lucene.NET, but would like to pose the question here in the Java group since I think the collective expertise here is still valid. Hope you don't mind. After indexing data from an Oracle DB using the standard analyzer, I am using Luke (standardanalyzer) to query at the moment.

document field updates

2007-02-27 Thread Arnone, Anthony
I know this has been asked before, but I'd like to ask once more for peace of mind. Is there a way to do single field inserts/updates without deleting and reinserting a document? If the answer is no, what exactly would be entailed in adding this functionality, or, better yet, is this plan

Re: indexing and searching the document title question

2007-02-27 Thread Erick Erickson
You've probably got it right. But I'd add a couple of things 1> by using the correct analyzer at index and query time, the casing will be taken care of for you. 2> you don't want UN_TOKENIZED for fields you search on in general because there's no parsing. So if you indexed "This is a String"

Re: Soliciting Design Thoughts on Date Searching

2007-02-27 Thread Erick Erickson
If you search the mailing list archive for 'date', you'll find a wealth of discussion on this topic. Also, try DateTools, DateRange, etc. http://www.gossamer-threads.com/lists/lucene/java-user/ Erick On 2/27/07, Walt Stoneburner <[EMAIL PROTECTED]> wrote: I've been asked if it's possible to
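A small sketch of the DateTools route, assuming a "date" field indexed with DAY resolution using the same DateTools encoding at index time:

    import java.util.Date;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    public class DateRangeQuery {
        public static RangeQuery between(Date from, Date to) {
            String lower = DateTools.dateToString(from, DateTools.Resolution.DAY);
            String upper = DateTools.dateToString(to, DateTools.Resolution.DAY);
            return new RangeQuery(new Term("date", lower), new Term("date", upper), true);
        }
    }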

Soliciting Design Thoughts on Date Searching

2007-02-27 Thread Walt Stoneburner
I've been asked if it's possible to search on dates within a document. The high level goal is to index a number of documents which mention specific dates, and then perform a broad query for documents that mention dates within a certain time period. In thinking about how to go about solving this p

Re: Benchmarker

2007-02-27 Thread Doron Cohen
karl wettin <[EMAIL PROTECTED]> wrote on 27/02/2007 01:04:57: > > But first I need to make sure I get it running without my changes to > the code :) > Ok, this should be easy: - check out current trunk; - cd to contrib/benchmark; - run "ant run-task"; This should run the benchmark-by-task and at

indexing and searching the document title question

2007-02-27 Thread Phillip Rhodes
Hi, According to the FAQ, indexing the title of the document and searching against that shorter field will automatically give matches there a higher weight than matches against the document content. That is what I am trying to accomplish with a "NAME" field. If someone enters a close match of

Re: Storing extra data in index

2007-02-27 Thread Erick Erickson
Keep in mind that you'll have to store the length as you index. If you tried to store the length with each document as a post-step, you'd have to delete and re-add each document to the index... That said, it's really up to you. It's very quick to use TermEnum/TermDocs to enumerate all the lengths. Even t

RE: Storing extra data in index

2007-02-27 Thread Mike O'Leary
So if I wanted to record the length of each individual document, would it be better to store that information with each document, perhaps as an unindexed field? Or are there ways to refer to the indexed documents that don't change through delete and optimize steps? Thanks. Mike O'Leary _

Highlighting issues

2007-02-27 Thread moraleslos
In my search query I have two fields to search, a metadata field and the actual contents. The metadata field is just an enum containing FIRST and LAST. Here is an example search query: Content:"Barry Bonds" and Metadata:FIRST I have Lucene highlight the hits like this: ... getBestFragment(stan

Re: Storing extra data in index

2007-02-27 Thread Erick Erickson
You can just add a document. I used this technique in an application, and it hinges upon realizing that not all documents in an index need to have the same fields. So, say your regular documents have fields f1, f2, f3...fn. Create a special document with fields s1, s2, s3, s4 that contain your met
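A hedged sketch of that idiom for the BM25 case: one extra document holds index-level statistics in fields that don't collide with regular documents. The field names are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class StatsDoc {
        public static void writeStats(IndexWriter writer, long totalDocs, double avgDocLength)
                throws Exception {
            Document stats = new Document();
            // marker field lets you find (and later replace) the stats document
            stats.add(new Field("stats_marker", "index_stats",
                                Field.Store.YES, Field.Index.UN_TOKENIZED));
            stats.add(new Field("stats_total_docs", Long.toString(totalDocs),
                                Field.Store.YES, Field.Index.NO));
            stats.add(new Field("stats_avg_doc_length", Double.toString(avgDocLength),
                                Field.Store.YES, Field.Index.NO));
            writer.addDocument(stats);
        }
    }
    // retrieve later with a TermQuery on stats_marker:index_stats and read the stored fields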

Storing extra data in index

2007-02-27 Thread Mike O'Leary
Is there a standard programming idiom for adding extra data to an index that has been created? I am trying to write code to index and search a set of documents using the BM25 algorithm, so (as I understand it) I need to store the length of each document somewhere and the average document length for

Re: searchs based on text

2007-02-27 Thread Ricardo Pereira da Silva
Hi, I'm very happy that I can make myself understandable, so, thanks very much for your opinion! I really appreciate your hints, and now I have some direction to take in my study. Thank you very much! Best Regards, Ricardo Pereira - Brazil On 2/27/07, mark harwood <[EMAIL PROTE

Re: indexing performance

2007-02-27 Thread karl wettin
27 feb 2007 kl. 16.49 skrev Saravana: I am trying to index the syslogs generated from one of my busy ftp servers so that I can get counts specific to a user within a given time frame. Since my ftp server is very busy it can generate so many syslogs per second. And the important point her

Re: indexing performance

2007-02-27 Thread Saravana
Hi, I thought of getting the maximum indexing rate from Lucene. However I did the test with sample strings and I am getting close to 600 documents/sec on a 1.9 GHz Linux machine with 512 MB RAM. Searching is pretty fast and I can create new index files based on user or based on time etc so that I w

Re: searchs based on text

2007-02-27 Thread mark harwood
Take a look at the "MoreLikeThis" class in the "contrib" section will reduce large amounts of text like your example paragraphs to only the "important" words which are useful for searching and provide you with a query object you can run. The disadvantage of trying to feed your example content t

Index Write Access Strategies for Distributed Systems to Shared Index

2007-02-27 Thread Andreas Guther
Hi, I am seeking for a best practice recommendation regarding distributed write access to a Lucene index. We have the following scenario: * Our Lucene index is on a shared drive. * The Lucene lock folder is on the same shared drive * Our web application writing to the index will run on multiple

Re: searchs based on text

2007-02-27 Thread Erick Erickson
Ricardo: Well, your English is so much better than my Portuguese that I can only congratulate you. You're perfectly understandable, so your efforts are paying off! About your question. It's quite easy to search on multiple terms. Just submit the entire text to QueryParser.parse. Be sure that the
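A minimal sketch of that advice, assuming a "contents" default field and the same StandardAnalyzer at index and query time; QueryParser turns the whole string into a multi-term query:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class MultiTermSearch {
        public static Hits search(IndexSearcher searcher, String userText) throws Exception {
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse(userText);   // e.g. "apache lucene full text search"
            return searcher.search(query);
        }
    }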

searchs based on text

2007-02-27 Thread Ricardo Pereira da Silva
I've started to study Lucene just today and the demos give me much information on how to begin to use it. But I have one doubt that I couldn't resolve reading the docs and the demo sources: All the examples just search in the index by just a single word, but I need to know if it's possible to sea

Re: indexing performance

2007-02-27 Thread Erick Erickson
How do you expect anyone to be able to answer such an open-ended question? What I'd do is create a test harness that generates a random set of strings and try it. Off the top of my head, this seems like a pretty steep requirement. And at 2,000 docs a second you're going to have a huge index prett

Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+, numbers attached

2007-02-27 Thread Erick Erickson
I thought I'd put up some numbers that may be useful for people who find themselves doing performance tuning and/or are just curious. See the end of this e-mail for design notes. DISCLAIMER: Your results may vary. Once I figured out the speed-up I got by using FieldSelector, I stopped looking fo
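For readers who haven't used it, a small sketch of the FieldSelector mechanism (new in Lucene 2.1) that loads only the named stored fields instead of the whole document; the field names are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.index.IndexReader;

    public class LazyLoad {
        public static String[] idAndTitle(IndexReader reader, int docId) throws Exception {
            FieldSelector selector = new MapFieldSelector(new String[] { "id", "title" });
            Document doc = reader.document(docId, selector);  // other stored fields are skipped
            return new String[] { doc.get("id"), doc.get("title") };
        }
    }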

indexing performance

2007-02-27 Thread Saravana
Hi, Is it possible to scale lucene indexing to 2000/3000 documents per second? I need to index 10 fields, each 20 bytes long. I should be able to search by just giving any of the field values as criteria. I need to get the count of documents that have the same field values. Will it be possible? with rega

ParallelSearcher in multi-node environment

2007-02-27 Thread dmitri
Hi, I want to execute a parallel search over several machines. But ParallelSearcher doesn't look perfect. It creates threads and spawns many requests to the underlying Searchables (over a network) for a single search. Is there a decent implementation of parallel search over remote indexes s

Re: recovering an index from RAM disk.

2007-02-27 Thread Michael McCandless
"Martin Spamer" wrote: > I generate my index to the file system and load that index into a > RAMDirectory for speed. If my indexer fails the directory based index > can be left in an inadequate state for my needs. I therefore wish to > flush the current index from the RAMDirectory back to the Fil

recovering an index from RAM disk.

2007-02-27 Thread Martin Spamer
I generate my index to the file system and load that index into a RAMDirectory for speed. If my indexer fails the directory based index can be left in an inadequate state for my needs. I therefore wish to flush the current index from the RAMDirectory back to the File system. The RAMDirectory cla

indexing norms???

2007-02-27 Thread zzzzz shalev
Can someone explain to me the norms that are stored for each field at index time for scoring: how they impact the index size, whether they are active by default in Lucene 1.4.3, and what the penalty of disabling them is? Much thanks in advance

Re: Indexing-Error: Cannot delete

2007-02-27 Thread Michael McCandless
"robisbob" <[EMAIL PROTECTED]> wrote: > Thanx for your answer. I will use the latest version to check this. > Unfortunately I have only access to the computer, where the > application will be run, once a week. And I can't reproduce the error > at my local machine or any other computer I have acces

Re: Can I use Lucene to retrieve a list of duplicates

2007-02-27 Thread Paul Taylor
Thanks Erick/Chris for all your help. Yes the missing magic was: TermDocs termDocs = ir.termDocs(terms.term()); instead of TermDocs termDocs = ir.termDocs(); I will try and use FieldCache as well. This is the first time I've used Lucene, but it certainly looks like it will help me with a

Re: Benchmarker

2007-02-27 Thread karl wettin
27 feb 2007 kl. 08.04 skrev Doron Cohen: Hi Karl, Seems I missed this email... What is the status of this, have you solved it? I didn't do anything since I wrote this. If you have 10 minutes to spare some day for guiding me in the output and code via voice (skype?), I'd very much appreciat