Re: StopWords problem

2007-12-27 Thread Doron Cohen
Try printing all of these after you close the writer:
- ((FSDirectory) dir).getFile().getAbsolutePath()
- dir.list().length (n)
- dir.list()[0], ..., dir.list()[n-1]
This should at least help you verify that an index was created, and where. Regards, Doron On Dec 27, 2007 12:26 PM, Liaqat Ali <[EMAIL PR

Re: StopWords problem

2007-12-27 Thread Liaqat Ali
Doron Cohen wrote: On Dec 27, 2007 11:49 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote: I got your point. The program given does not give any error during compilation and it is interpreted well. But it does not create any index. When the StandardAnalyzer() is called without Stopwords list

Re: StopWords problem

2007-12-27 Thread Doron Cohen
On Dec 27, 2007 11:49 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote: > I got your point. The program given does not give any error during > compilation and it is interpreted well. But it does not create any > index. When the StandardAnalyzer() is called without Stopwords list it > works well, b

Re: StopWords problem

2007-12-27 Thread Liaqat Ali
Doron Cohen wrote: This is not a self-contained program - it is incomplete, and it depends on files on *your* disk... Still, can you show why you're saying it indexes stopwords? Can you print here a few samples of IndexReader.terms().term()? BR, Doron On Dec 27, 2007 10:22 AM, Liaqat Ali <[EMAIL

Re: StopWords problem

2007-12-27 Thread Doron Cohen
This is not a self-contained program - it is incomplete, and it depends on files on *your* disk... Still, can you show why you're saying it indexes stopwords? Can you print here a few samples of IndexReader.terms().term()? BR, Doron On Dec 27, 2007 10:22 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote:

Re: StopWords problem

2007-12-27 Thread N. Hira
Hi Liaqat, Are you sure that the Urdu characters are being correctly interpreted by the JVM even during the file I/O operation? I would expect Unicode characters to be encoded as multi-byte sequences and so, the string-matching operations would fail (if the literals are different from the

Re: StopWords problem

2007-12-27 Thread Liaqat Ali
Doron Cohen wrote: Hi Liaqat, This part of the code seems correct and should work, so the problem must be elsewhere. Can you post a short program that demonstrates the problem? You can start with something like this: Document doc = new Document(); doc.add(new Field("text",URDU_STOP_WOR

Re: StopWords problem

2007-12-27 Thread Doron Cohen
Hi Liaqat, This part of the code seems correct and should work, so the problem must be elsewhere. Can you post a short program that demonstrates the problem? You can start with something like this: Document doc = new Document(); doc.add(new Field("text",URDU_STOP_WORDS[0] +

Re: StopWords problem

2007-12-26 Thread Liaqat Ali
Grant Ingersoll wrote: On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote: No, at this level I am not using any stemming technique. I

Re: StopWords problem

2007-12-26 Thread Grant Ingersoll
On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote: No, at this level I am not using any stemming technique. I am just trying to elim

Re: StopWords problem

2007-12-26 Thread Liaqat Ali
Grant Ingersoll wrote: Are you altering (stemming) the token before it gets to the StopFilter? On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote: Doron Cohen wrote: On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote: Using javac -encoding UTF-8 still raises the following error. ur

Re: StopWords problem

2007-12-26 Thread Grant Ingersoll
Are you altering (stemming) the token before it gets to the StopFilter? On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote: Doron Cohen wrote: On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote: Using javac -encoding UTF-8 still raises the following error. urduIndexer.java : illegal

Re: StopWords problem

2007-12-26 Thread Liaqat Ali
Doron Cohen wrote: On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote: Using javac -encoding UTF-8 still raises the following error. urduIndexer.java : illegal character: \65279 ? ^ 1 error What am I doing wrong? If you have the stop-words in a file, say one word in a l

Re: StopWords problem

2007-12-26 Thread Doron Cohen
On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote: > Using javac -encoding UTF-8 still raises the following error. > > urduIndexer.java : illegal character: \65279 > ? > ^ > 1 error > > What am I doing wrong? > If you have the stop-words in a file, say one word in a line, they can be
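The suggestion above is cut off in the archive, so the following is only a sketch of one way to keep the stop words out of the Java source: read a one-word-per-line file as UTF-8 and drop a leading BOM if the editor wrote one. The class name StopWordLoader and its methods are made up for illustration, it uses only the JDK (Java 7 or later, not the JDK of 2007), and it is not necessarily what was proposed in the original message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: loads stop words from a text file.
final class StopWordLoader {

    /** Splits file content into stop words: one per line, blank lines skipped. */
    static String[] parse(String content) {
        // Notepad may put a byte-order mark (U+FEFF) at the very start; drop it.
        if (content.startsWith("\uFEFF")) {
            content = content.substring(1);
        }
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\r?\\n")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words.toArray(new String[0]);
    }

    /** Reads the whole file as UTF-8, regardless of the platform default encoding. */
    static String[] loadFile(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return parse(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The resulting array could then be handed to the analyzer in place of a hard-coded URDU_STOP_WORDS literal, which removes the need to compile non-ASCII source at all.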

Re: StopWords problem

2007-12-26 Thread 李晓峰
Or you can save it as "Unicode" and compile with javac -encoding Unicode; this way you can still use Notepad. Liaqat Ali wrote: 李晓峰 wrote: "javac" has an option "-encoding", which tells the compiler the encoding the input source file is using, this will probably solve the problem. or you can try the unicode e

Re: StopWords problem

2007-12-26 Thread 李晓峰
It's Notepad. It adds a byte-order mark (BOM, in this case 65279, or 0xFEFF) at the front of your file, which javac does not recognize for reasons not quite clear to me. Here is the bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 It won't be fixed, so try to eliminate the BOM before co
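As an illustration of eliminating the BOM, here is a minimal sketch that rewrites a UTF-8 file without its leading U+FEFF. BomStripper is a hypothetical name, the code assumes the file really is UTF-8, and it relies on java.nio.file (Java 7 or later), so it is a modern equivalent rather than what was available in 2007.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical utility: removes a leading byte-order mark from UTF-8 files.
final class BomStripper {
    private static final char BOM = '\uFEFF'; // decimal 65279, as in the javac error

    /** Returns the text without a leading BOM; other text is returned unchanged. */
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Rewrites the file in place, assuming its content is UTF-8. */
    static void stripBomFromFile(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, stripBom(content).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as a path to a source file to clean.
        for (String arg : args) {
            stripBomFromFile(Paths.get(arg));
        }
    }
}
```

After running it on the source file, compiling with javac -encoding UTF-8 should no longer hit the illegal character at the first position, provided the BOM was the only problem.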

Re: StopWords problem

2007-12-26 Thread Liaqat Ali
李晓峰 wrote: "javac" has an option "-encoding", which tells the compiler the encoding the input source file is using; this will probably solve the problem. Or you can try the unicode escape: \u, then you can save it in ANSI, hard for humans to read though. Or use an IDE, Eclipse is a good choic
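To make the unicode-escape option concrete, the sketch below turns each non-ASCII character of a string into its escaped form, so the stop words can be pasted into a source file saved as plain ASCII/ANSI. UnicodeEscaper is a hypothetical name introduced here; the JDKs of that period also shipped a native2ascii tool for the same purpose.

```java
// Hypothetical helper: prints the escaped form of its arguments.
final class UnicodeEscaper {

    /** Replaces every non-ASCII UTF-16 unit with a backslash, 'u', and four hex digits. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c); // plain ASCII is copied through unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(escape(arg));
        }
    }
}
```

The escaped strings compile to exactly the same characters as the originals, so the index is unaffected; only the readability of the source suffers, as noted above.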

Re: StopWords problem

2007-12-26 Thread 李晓峰
"javac" has an option "-encoding", which tells the compiler the encoding the input source file is using; this will probably solve the problem. Or you can try the unicode escape: \u, then you can save it in ANSI, hard for humans to read though. Or use an IDE, Eclipse is a good choice, you can se

StopWords problem

2007-12-26 Thread Liaqat Ali
Hi Doron Cohen, Thanks for your reply, but I am facing a small problem here. As I am using Notepad for coding, in which format should the file be saved? public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" ,"کا" ,"کو" ,"ہے" }; Analyzer analyzer = new StandardAnalyzer(