Some problems with lucene in searching

2005-06-12 Thread sriram Thota
Hi, I am working on Lucene. I saw your suggestion about Lucene in a Google search. I am facing some problems in searching. Please go through my sample code and suggest where I have gone wrong. I will be thankful to you. This is my sample code: private static Document createDocument(File f

Mobile Lucene

2005-06-12 Thread christopher may
Hey all, I am working on a project that requires a search engine on an embedded Linux that is also Bluetooth capable. Is there a mobile Lucene, or can I recompile the code in the J2ME Wireless Toolkit? Any help would be appreciated. Thanks --

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Hostetter
: Yes, when I say "duplicate" sentences, they are exact copies of the same : string. You still haven't explained how you indexed these sentences. What do you mean by "each lucene document actually contains exactly one sentence"? Did you tokenize the sentence into one field? do you a field for

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Lamprecht
I'd have to see your indexing code to see if there are any obvious performance gotchas there. If you can run your indexer under a profiler (OptimizeIt, JProbe, or just the free one with java using -Xprof), it will tell you in which methods most of your CPU time is spent. If you're using StandardA

AW: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Karsten Konrad
Hi David, >> I would like to poll the community's opinion on good strategies for identifying duplicate documents in a lucene index. >> Do you mean 100% duplicates or some kind of similarity? >> Obviously the brute force method of pairwise compares would take forever. I have tried grouping sen

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Dave Kor
Thanks for the quick reply, Chris. Yes, when I say "duplicate" sentences, they are exact copies of the same string. The MD5 hash is a good idea, I wish I had thought of it earlier as it would have saved me a lot of trouble. Right now it is not feasible to reindex again because indexing is a very

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Lamprecht
Dave, Can you define exactly what you consider "duplicate sentences"? Is it the same exact string, or the same words in the same order, or the same words in any order, etc? If you can normalize each sentence first, so two "duplicate" sentences are always the exact same string, then you should be
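
The normalize-then-compare idea in this thread, combined with the MD5 hash mentioned in Dave Kor's reply, can be sketched roughly as follows. This is only an illustration, not code from the thread: the normalization rule (lowercasing and collapsing whitespace) and the names `normalize` and `find_duplicates` are assumptions, and a real setup would presumably store the digest in an untokenized index field rather than in memory.

```python
import hashlib
from collections import defaultdict


def normalize(sentence: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace,
    # so that "duplicate" sentences become the exact same string.
    return " ".join(sentence.lower().split())


def find_duplicates(sentences):
    # Group document ids by the MD5 digest of their normalized text.
    groups = defaultdict(list)
    for doc_id, sentence in enumerate(sentences):
        digest = hashlib.md5(normalize(sentence).encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Keep only the groups that contain more than one document.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Grouping by digest takes a single pass over the documents, which avoids the pairwise comparison described in the original question.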

Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Dave Kor
Hi, I would like to poll the community's opinion on good strategies for identifying duplicate documents in a lucene index. You see, I have an index containing roughly 25 million lucene documents. My task requires me to work at sentence level so each lucene document actually contains exactly one s

Webstart problem with Lucene

2005-06-12 Thread Ahmet Aksoy
Hi, I prepared a dictionary application which uses Lucene. I made my application downloadable via Web Start. Everything is OK, but I can't access the Lucene index files. When I searched the internet about the subject, I found some clues saying that it is impossible to put Lucene indexes in

Re: DBSight, search on database by Lucene

2005-06-12 Thread Paul Querna
Joshua Slive wrote: On Sat, 11 Jun 2005, Erik Hatcher wrote: On Jun 11, 2005, at 1:08 PM, Chris Lu wrote: Thanks. Somehow I found the "Powered By" Lucene page is "Immutable Page", even if I logged in. http://wiki.apache.org/jakarta-lucene/PoweredBy Wow, it sure is. I'm CC'ing infra

Re: DBSight, search on database by Lucene

2005-06-12 Thread Chris Lu
Thanks, guys! I have made the changes to the wiki, following Joshua's advice. It's the cookie/refreshing problem. Chris Lu Joshua Slive wrote: On Sat, 11 Jun 2005, Erik Hatcher wrote: On Jun 11, 2005, at 1:08 PM, Chris Lu wrote: Thanks. Somehow I found the "Powered By" Lucene page is