Re: search timeout

2007-03-18 Thread Chris Hostetter
: > immediately? ... in the totally generic case, this isn't a safe
: This was implemented as an easy way to control the maximum search time
: for typical queries. I'm open for suggestions how to improve it. One

The only thing i can think of that would truly timeout *any* query is a separate Ti

Storing whole documents in the index

2007-03-18 Thread jafarim
Hello. I have been using Lucene for a while and, as most people seemingly do, I used to save only some important fields of a document in the index. But recently I thought: why not store the whole document bytes as an untokenized field in the index, in order to ease the retrieval process? For example
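A minimal sketch of the idea jafarim describes, assuming the Lucene 2.x-era Field API (field names invented): the whole text goes into a field that is stored but never indexed, so it adds nothing to the inverted index. For raw bytes there is also a Field(String, byte[], Field.Store) constructor for binary stored fields.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class StoreWholeDoc {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Searchable field: tokenized and indexed as usual.
        doc.add(new Field("title", "some title", Field.Store.YES, Field.Index.TOKENIZED));
        // The whole document: stored for retrieval, never indexed.
        doc.add(new Field("raw", "the complete original document text ...",
                          Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();
    }
}

At search time the original comes back with hits.doc(i).get("raw"), at the cost of a larger stored-fields file and slower document loads.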

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
I will try to explain what I am trying to do, like an algorithm: 1. There are 70 dump files which have 10,000 record tags each, which I have pasted in my earlier mails. I split every dump file and create 10,000 XML files, each with a single record tag and its child tags. This is because there are some parsing is
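For the control-character part of the problem, a common workaround is to scrub the text before it ever reaches the XML parser, since XML 1.0 forbids most characters below U+0020. A sketch (the class and method names are made up):

public class XmlCleaner {
    // Removes characters that are illegal in XML 1.0: everything below
    // U+0020 except tab (0x09), LF (0x0A) and CR (0x0D).
    public static String stripControlChars(String s) {
        StringBuffer sb = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}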

Re: Eliminate duplicates

2007-03-18 Thread Erick Erickson
But I think you have a problem here with searching the Lucene index and deleting duplicate titles. Say you have the following titles: "title one", "title one is a nice file", "title one is a really nice file". Further assume you're about to add a duplicate "title one". Searching on "title one" will give y

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Erick Erickson
I'm not sure what the lock issue is. What version of Lucene are you using? And what is your filesystem like? There are some known locking issues with some versions of Lucene and some filesystems, particularly NFS mounts, as I remember... It would help if you told us the entire stack trace rather t

Re: Eliminate duplicates

2007-03-18 Thread Otis Gospodnetic
Markus, what you were thinking is fine - search and, if found, delete first, then add. Lucene allows duplicates and offers no automated way to avoid them. Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share - Origi
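Putting Erick's and Otis's points together, the title can be indexed twice: tokenized for normal searching, and untokenized for exact-match lookup, so that "title one" does not also match the longer titles. A sketch of the delete-then-add flow, assuming a Lucene 2.x-era API (class and field names invented):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class DedupAdd {
    // Deletes any existing document with exactly this title, then adds a new one.
    public static void addOrReplace(Directory dir, String title, String body)
            throws Exception {
        // "title_exact" is untokenized, so the Term matches the whole title
        // and not documents whose titles merely start with it.
        IndexReader reader = IndexReader.open(dir);
        reader.deleteDocuments(new Term("title_exact", title));
        reader.close();

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("title_exact", title, Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}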

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
Yep, I did that, and now my code looks as follows. The time taken for indexing one file is now => Elapsed Time in Minutes :: 0.3531, which is really great, but after processing 4 dump files (which means 40,000 small XMLs), I get: caught a class java.io.IOException 40114 with message: Lock obta

Re: categorizing results

2007-03-18 Thread Dima May
Nicolas, thank you for the reply! The problem is that my categories are not static: they are generated at runtime, and they are added and removed all the time. For that reason I have no way to pre-generate the filters. Dima

On 3/18/07, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:
> On Sunday 18 March 20
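Even with categories that come and go at runtime, a filter can be built lazily the first time a category is searched and discarded when the category disappears; nothing has to be pre-generated. A sketch using a TermQuery-backed QueryFilter, assuming Lucene 2.x (class and field names invented):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import java.util.HashMap;
import java.util.Map;

public class RuntimeCategoryFilters {
    // One filter per live category, built on first use.
    private final Map filters = new HashMap();

    public Hits search(IndexSearcher searcher, Query query, String category)
            throws Exception {
        QueryFilter filter = (QueryFilter) filters.get(category);
        if (filter == null) {
            filter = new QueryFilter(new TermQuery(new Term("type", category)));
            filters.put(category, filter);
        }
        return searcher.search(query, filter);
    }

    // Call when a category is removed so its filter can be collected.
    public void categoryRemoved(String category) {
        filters.remove(category);
    }
}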

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Grant Ingersoll
Move IndexWriter creation, optimization and closing outside of your loop. I would also use a SAX parser. Take a look at the demo code to see an example of indexing. Cheers, Grant

On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
> Erick Erickson wrote:
> Grant: I think that "Parsing 70 file
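Roughly what Grant is suggesting, as a sketch (parseFile is an invented stand-in for the SAX-based text extraction):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import java.io.File;

public class IndexAllFiles {
    public static void indexAll(File indexDir, File[] xmlFiles) throws Exception {
        // Create the writer ONCE, outside the loop; opening and closing it
        // per file is slow and a classic cause of "Lock obtain timed out".
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        for (int i = 0; i < xmlFiles.length; i++) {
            Document doc = new Document();
            doc.add(new Field("contents", parseFile(xmlFiles[i]),
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        // Optimize and close once, at the very end.
        writer.optimize();
        writer.close();
    }

    private static String parseFile(File f) {
        return ""; // placeholder: extract the text with a SAX parser here
    }
}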

contrib/benchmark questions

2007-03-18 Thread Grant Ingersoll
I'm using contrib/benchmark to do some tests for my ApacheCon talk and have some questions.
1. In looking at micro-standard.alg, it seems like not all braces are closed. Is a line ending a separator too?
2. Is there any way to dump out what params are supported by the various tasks? I am e

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
Erick Erickson wrote:
> Grant:
> I think that "Parsing 70 files totally takes 80 minutes" really
> means parsing 70 metadata files containing 10,000 XML files each.

One metadata file is split into 10,000 XML files, which look as below:

oai:CiteSee

Re: search timeout

2007-03-18 Thread Erick Erickson
On 3/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> Ack! ... this is what happens when i only skim a patch and then write
> with my odd mix of authority and childlike speling

I'm telling ya, man, ya gotta get Firefox, use Gmail (or at least a web-interfaced e-mail client) and turn on th

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Erick Erickson
Grant: I think that "Parsing 70 files totally takes 80 minutes" really means parsing 70 metadata files containing 10,000 XML files each. Lokeya: can you confirm my supposition? And I'd still post the code Grant requested if you can. So, you're talking about indexing 10,000 XML files in

Re: categorizing results

2007-03-18 Thread Erick Erickson
You might also want to search the mail archive for "faceted search" and/or Categories. This topic has been discussed under that heading, I believe... Erick On 3/18/07, Dima May <[EMAIL PROTECTED]> wrote: I have a Lucene related questions/problem. My search results can potentially get very l

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Grant Ingersoll
Can you post the relevant indexing code? Are you doing things like optimizing after every file? Both the parsing and the indexing sound really long. How big are these files? Also, I assume your machine is at least somewhat current, right?

On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
> Thank

Re: search timeout

2007-03-18 Thread Andrzej Bialecki
Chris Hostetter wrote:
> Ack! ... this is what happens when i only skim a patch and then write
> with my odd mix of authority and childlike speling
> : * it creates a single (static) timer thread, which counts the "ticks",
> : every couple hundred ms (configurable). It uses a volatile int counter,
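Reduced to a sketch, the mechanism being described looks something like the following (this is the idea, not the patch's actual code): a single static daemon thread increments a volatile counter, and a HitCollector checks that counter against a deadline on every hit.

import org.apache.lucene.search.HitCollector;

public class TimeLimitedCollector extends HitCollector {
    // One static timer thread shared by all searches; ticks every 100 ms.
    private static volatile int ticks = 0;
    static {
        Thread timer = new Thread() {
            public void run() {
                while (true) {
                    ticks++;  // single writer, so a volatile int is enough
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        };
        timer.setDaemon(true);
        timer.start();
    }

    private final HitCollector delegate;
    private final int deadline;

    public TimeLimitedCollector(HitCollector delegate, int maxTicks) {
        this.delegate = delegate;
        this.deadline = ticks + maxTicks;
    }

    public void collect(int doc, float score) {
        if (ticks > deadline) {
            // Unchecked exception aborts the search from inside the collector.
            throw new RuntimeException("Search timed out");
        }
        delegate.collect(doc, score);
    }
}

Chris's caveat above still applies: the check only fires when collect() is called, so this does not truly time out *any* query, only ones that keep delivering hits.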

Re: categorizing results

2007-03-18 Thread Nicolas Lalevée
On Sunday 18 March 2007 at 06:55, Dima May wrote:
> I have a Lucene related question/problem.
> My search results can potentially get very large, 200,000+. I want to
> categorize my results. So for example if I have an indexed field "type"
> that has such things as CDs, books, videos, powe
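One common way to attack this without pre-built filters is to tally the category value of every hit in a HitCollector, using FieldCache to map document numbers to field values. A sketch assuming a single-valued, indexed "type" field and the Lucene 2.x-era API (class and method names invented):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import java.util.HashMap;
import java.util.Map;

public class CategoryCounter {
    // Returns a map from "type" value to the number of hits with that value.
    public static Map countByType(IndexSearcher searcher, IndexReader reader,
                                  Query query) throws Exception {
        // One String per document; loaded on first use, then cached per reader.
        final String[] types = FieldCache.DEFAULT.getStrings(reader, "type");
        final Map counts = new HashMap();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                String type = types[doc];
                if (type == null) return;   // document has no "type" field
                int[] c = (int[]) counts.get(type);
                if (c == null) {
                    c = new int[1];
                    counts.put(type, c);
                }
                c[0]++;
            }
        });
        return counts;
    }
}

This visits all 200,000+ hits, but only increments a counter per hit, which is usually far cheaper than pulling stored documents for each one.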