Re: Confused about non-tokenized fields

2005-05-27 Thread Gusenbauer Stefan
Erik Hatcher wrote: On May 27, 2005, at 12:14 PM, Gusenbauer Stefan wrote: Max Pfingsthorn wrote: Hi! Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates...

Re: Lucene Arabic Internationalization Question

2005-05-27 Thread Nader Henein
Dear Rasha, Sorry for the delay. I've indexed Arabic and English seamlessly on Lucene; the only thing you have to watch out for is stemming. As for indexing PDFs, I have not used that part of the API, but from experience, this comes down to using or in some cases forcing the correct encoding,

Re: Confused about non-tokenized fields

2005-05-27 Thread Erik Hatcher
On May 27, 2005, at 12:14 PM, Gusenbauer Stefan wrote: Max Pfingsthorn wrote: Hi! Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates... About case-insensitivity: Okay, I can make my qu

Re: SourceForge.net Lucene based search announcement

2005-05-27 Thread Chris Conrad
Hello, The "results by YAHOO! search" is a marketing thing that we have no real control over. I promise that the actual project search engine is using Lucene. As for defaulting to OR, it was decided that the new search should function as similarly as possible to the old search system by

Re: Confused about non-tokenized fields

2005-05-27 Thread Erik Hatcher
On May 27, 2005, at 11:22 AM, Max Pfingsthorn wrote: Hi! In my application, I index some strings (like filenames) untokenized, meaning via doc.add(new Field(FIELD,VALUE,false,true,false)); When I later take a look at it with Luke, I still get tokens of the filenames (like "news" instead

java-user@lucene.apache.org

2005-05-27 Thread Grant Ingersoll
Also, see if http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages helps at all. >>> [EMAIL PROTECTED] 5/27/2005 12:09:32 PM >>> Probably your Unix system has a different default encoding than your Windows machine. You have to make sure you give the IndexWriter a string that has the corre

Re: Confused about non-tokenized fields

2005-05-27 Thread Gusenbauer Stefan
Max Pfingsthorn wrote: Hi! Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates... About case-insensitivity: Okay, I can make my query lower case, but my strings in the field are not...

java-user@lucene.apache.org

2005-05-27 Thread Angelov, Rossen
Probably your Unix system has a different default encoding than your Windows machine. You have to make sure you give the IndexWriter a string that has the correct encoding. Do you specifically set the encoding in your code before you index it with Lucene? Ross -Original Message- From: gau
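
A minimal sketch of what "setting the encoding" could look like in plain Java, assuming the source text is UTF-8 and a modern JDK (java.nio.charset.StandardCharsets did not exist in 2005). The class and method names are hypothetical and nothing here is Lucene API; the point is only that the bytes are decoded with a named charset instead of the platform default, so the same String results on Solaris and on Windows.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode a byte stream with an explicitly named charset.
class ExplicitEncoding {

    // Reads the whole stream using the given charset, never the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is written with an escape so the source file encoding does not matter.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("caf\u00e9")); // decoded with the matching charset
        System.out.println(wrong.length());            // mismatched charset yields extra characters
    }
}
```

The String returned by readAll is what would then be handed to the indexing code; whether that resolves the Solaris problem described in this thread is not confirmed by the archive.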

RE: Confused about non-tokenized fields

2005-05-27 Thread Max Pfingsthorn
Hi! Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates... About case-insensitivity: Okay, I can make my query lower case, but my strings in the field are not... I guess I have to do that manual
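
One possible reading of the case-insensitivity problem, sketched in plain Java rather than Lucene: since an untokenized value is not passed through an analyzer, it can be lower-cased by hand before it is indexed, and the query is lower-cased the same way. The class below is a hypothetical stand-in for the index (a map from normalized key to original value); it is an assumption for illustration, not the approach the thread settled on.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an untokenized field: exact match on a normalized key.
class CaseInsensitiveKeys {

    private final Map<String, String> originalByKey = new HashMap<>();

    // Same normalization on both the indexing side and the query side.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // "Index" the value under its lower-cased form, keeping the original for display.
    void add(String original) {
        originalByKey.put(normalize(original), original);
    }

    // Returns the original value, or null when nothing matches.
    String lookup(String query) {
        return originalByKey.get(normalize(query));
    }

    public static void main(String[] args) {
        CaseInsensitiveKeys index = new CaseInsensitiveKeys();
        index.add("News-Item.xml");
        System.out.println(index.lookup("news-item.XML"));
    }
}
```

Locale.ROOT is used so the lower-casing does not change with the machine's default locale.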

Re: Confused about non-tokenized fields

2005-05-27 Thread Gusenbauer Stefan
Max Pfingsthorn wrote: Hi! In my application, I index some strings (like filenames) untokenized, meaning via doc.add(new Field(FIELD,VALUE,false,true,false)); When I later take a look at it with Luke, I still get tokens of the filenames (like "news" instead of "news-item.xml") in the

java-user@lucene.apache.org

2005-05-27 Thread gaudinat
Hi, I don't get a UTF-8 index when I use Lucene on Solaris, while my characters are OK under Windows. My indexing program is the same and it uses Lucene 1.4.3. Does anyone have an idea how to help me? Regards, Arnaud.

how long should optimizing take

2005-05-27 Thread Angelov, Rossen
Hi, I'm having problems with the Lucene optimization. Two of the indexes are about 2 GB in size, and every day about 30 documents are added to each of these indexes. At the end of the indexing the IndexWriter optimize() method is executed and it takes about 30 minutes to finish the optimization for each

RE: Deleting duplicates from a Lucene index

2005-05-27 Thread Omar Didi
What you can do is open the index and loop through all the documents in descending order. The code below will explain more. Directory dir = FSDirectory.getDirectory( args[ 0 ], false ); IndexReader reader = IndexReader.open( dir ); int numDocs = reader.numDocs(); HashSet items = new HashSet( size
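
The snippet above is cut off in the archive, so the following is only a sketch of the general idea, not the poster's code. It assumes each document has a single string key that identifies duplicates, and it uses a plain List as a hypothetical stand-in for the index reader. Walking from the highest document number down means the most recently added copy of each key is kept and the earlier ones are reported for deletion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: keys.get(i) plays the role of the unique-key field of document i.
class DuplicateScan {

    // Returns the document numbers to delete, scanning in descending order.
    static List<Integer> findDuplicates(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int doc = keys.size() - 1; doc >= 0; doc--) {
            // Set.add returns false when the key was already seen at a higher document number.
            if (!seen.add(keys.get(doc))) {
                toDelete.add(doc);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a.xml", "b.xml", "a.xml", "c.xml", "b.xml");
        System.out.println(findDuplicates(keys));
    }
}
```

With a real index, the returned document numbers would be passed to whatever delete call the Lucene version in use provides.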

Confused about non-tokenized fields

2005-05-27 Thread Max Pfingsthorn
Hi! In my application, I index some strings (like filenames) untokenized, meaning via doc.add(new Field(FIELD,VALUE,false,true,false)); When I later take a look at it with Luke, I still get tokens of the filenames (like "news" instead of "news-item.xml") in the list of most frequent terms. Sh

Re: SourceForge.net Lucene based search announcement

2005-05-27 Thread Chris Lu
Hi, I found that SourceForge's search still says "results by YAHOO! search". What does that mean? Also, it seems the search condition for the keywords is currently still OR, not AND. This makes a search for "lucene java" return a long list, yet without the one I wanted in the first several rows. Chris L

Re: Code search

2005-05-27 Thread 田春峰
Hi, Lucene is a great project to serve as a source code search engine. I have built a source code search engine based on Lucene, and it performs very well. Unfortunately, my version is in Chinese. The URL is: http://www.domolo.com/domolo/ctrlc/index.aspx It searches 101732 j