Re: Problem using Lucene on Ubuntu

2008-02-18 Thread Grant Ingersoll
Good point Jan!
On Feb 18, 2008, at 9:00 AM, Jan Peter Stotz wrote:
> Grant Ingersoll wrote:
>> Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if that is what your files are in.
> I think there is a misunderstanding, the WordExtractor extracts text from MS Word (.doc) files.

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread Jan Peter Stotz
Grant Ingersoll wrote:
> Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if that is what your files are in.
I think there is a misunderstanding: the WordExtractor extracts text from MS Word (.doc) files. Those files are binary and therefore do not have an encoding. I wou
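To illustrate Jan's point that binary files have no text encoding, here is a minimal sketch using only the Java standard library. It pushes the first eight bytes of an OLE2 compound file (the container format that .doc files use) through a UTF-8 decode and re-encode, which is roughly what wrapping the stream in a Reader would do. The class and method names are made up for this example and are not part of POI or Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryThroughReader {
    // First eight bytes of an OLE2 compound file, the container used by .doc files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Decode the bytes as UTF-8 text and encode them back again,
    // mimicking what a Reader-based pipeline does to its input.
    static byte[] roundTripAsText(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] after = roundTripAsText(OLE2_MAGIC);
        // Invalid UTF-8 sequences are replaced during decoding, so the bytes change.
        System.out.println("unchanged: " + Arrays.equals(OLE2_MAGIC, after));
    }
}
```

Because the header bytes are not valid UTF-8, the round trip alters them, which is why a binary document is normally handed to the extractor as a plain InputStream.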

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread Grant Ingersoll
Not sure about WordExtractor; does it also take a Reader? I would try:
    Reader input = new InputStreamReader(new FileInputStream(file), "ENCODING");
    WordExtractor extractor = new WordExtractor(input);
    content = extractor.getText();
Note: ENCODING is whatever encoding the file is in, as in "UT

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread kratoras
No problem about the misunderstanding. I am using:
    InputStream input = new URL("file:///" + file.getAbsolutePath()).openStream();
    WordExtractor extractor = new WordExtractor(input);
    content = extractor.getText();
where the WordExtractor is org.apache.poi.hwpf.extractor.WordExtractor. The word

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread Grant Ingersoll
How are you loading the document into the content variable below? My guess is still that you have different locales on Windows and Ubuntu. (Btw, sorry about the java-user comment. I should wake up before sending responses. For some reason I thought the email was sent to java-dev) -Gran
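One way to check the locale guess is to print the defaults the JVM picks up on each machine and compare them. The sketch below uses only standard library calls; the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class PlatformDefaults {
    // Summarize the settings Java falls back to when no locale or
    // encoding is passed explicitly.
    static String describe() {
        return "locale=" + Locale.getDefault()
             + ", charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run this on both Windows and Ubuntu and compare the two lines.
        System.out.println(describe());
    }
}
```

If the two machines report different charsets, any code that reads text without naming an encoding may behave differently on each.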

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread kratoras
Actually, what I figured out just now is that the problem is in the indexing part. A 15MB document is turned into a 23MB index, which is not normal, since on Windows the index for the same document is 3MB. For the indexing I use:
    writer = new IndexWriter(index, new GreekAnalyzer(), !in

Re: Problem using Lucene on Ubuntu

2008-02-18 Thread Grant Ingersoll
This question is best asked on java-user. However, my guess is that it is related to your Locale and that you need to set the character encoding to Greek on Ubuntu when reading in your files. Something like:
    Reader reader = new InputStreamReader(new FileInputStream(file), "GREEK Char Enco
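A small sketch of why naming the encoding matters for text files: the same bytes decode to different strings depending on the charset given to InputStreamReader. The sample assumes the Greek text is stored as UTF-8, which may not match the files in this thread, and the class and helper names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Read the whole stream as text using the charset the caller names.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A five-letter Greek word written with Unicode escapes, so the
        // source file itself does not depend on any encoding.
        String greek = "\u03bb\u03cc\u03b3\u03bf\u03c2";
        byte[] onDisk = greek.getBytes(StandardCharsets.UTF_8); // assumed file encoding

        String right = readAll(new ByteArrayInputStream(onDisk), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(onDisk), StandardCharsets.ISO_8859_1);

        System.out.println("correct charset round-trips: " + right.equals(greek));
        System.out.println("wrong charset length: " + wrong.length()
                + " (original " + greek.length() + ")");
    }
}
```

With the wrong charset, each two-byte Greek letter turns into two unrelated characters, so the text an analyzer would see no longer contains Greek words at all.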