Good point, Jan!
On Feb 18, 2008, at 9:00 AM, Jan Peter Stotz wrote:
Grant Ingersoll wrote:
Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if
that is what your files are in.
I think there is a misunderstanding: the WordExtractor extracts text
from MS Word (.doc) files. Those files are binary and therefore do not
have an encoding.
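
To illustrate the point about binary files: a .doc file is an OLE2 compound document, which begins with a fixed 8-byte signature rather than text in any character encoding. Below is a minimal sketch using only the JDK; the class and method names (DocSignature, looksLikeOle2) are made up for illustration and are not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class DocSignature {
    // OLE2 compound-document signature found at the start of binary .doc files.
    private static final int[] OLE2_MAGIC =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Returns true if the stream starts with the OLE2 signature.
    // Works on raw bytes, so no character encoding is involved.
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        for (int expected : OLE2_MAGIC) {
            if (in.read() != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] docHeader = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        byte[] text = "plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeOle2(new ByteArrayInputStream(docHeader)));
        System.out.println(looksLikeOle2(new ByteArrayInputStream(text)));
    }
}
```

This is why such a file should be handed to the extractor as an InputStream: wrapping it in a Reader would decode the bytes as characters and could corrupt them.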
I wou
Not sure about WordExtractor; does it also take a Reader? I would try:
Reader input = new InputStreamReader(new FileInputStream(file), "ENCODING");
WordExtractor extractor = new WordExtractor(input);
content = extractor.getText();
Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if that is what your files are in.
No problem about the misunderstanding.
I am using:
InputStream input = new URL("file:///" + file.getAbsolutePath()).openStream();
WordExtractor extractor = new WordExtractor(input);
content = extractor.getText();
where WordExtractor is org.apache.poi.hwpf.extractor.WordExtractor.
The word
How are you loading the document into the content variable below? My
guess is still that you have different locales on Windows and Ubuntu.
(Btw, sorry about the java-user comment. I should wake up before
sending responses. For some reason I thought the email was sent to
java-dev)
-Grant
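
One way to check the locale guess is to print the JVM defaults on both the Windows and the Ubuntu machine and compare them. This is only a sketch; the class name ShowDefaults is hypothetical, and the calls are plain JDK APIs.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class ShowDefaults {
    // Builds a one-line summary of the settings the JVM falls back to
    // when no explicit locale or charset is given.
    public static String describe() {
        return "locale=" + Locale.getDefault()
             + " charset=" + Charset.defaultCharset().name()
             + " file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the two machines report different values, any code that reads text without naming a charset may behave differently on each.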
Actually, what I figured out just now is that the problem is in the indexing
part. A 15MB document is turned into a 23MB index, which is not normal,
since on Windows the index for the same document is 3MB. For the
indexing I use:
writer = new IndexWriter(index, new GreekAnalyzer(), !in
This question is best asked on java-user. However, my guess is that
it is related to your Locale and that you need to set the character
encoding to Greek on Ubuntu when reading in your files.
Something like: Reader reader = new InputStreamReader(new
FileInputStream(file), "GREEK Char Enco
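
For plain-text files, the idea of naming the charset explicitly can be sketched as below. The charset name "ISO-8859-7" is an assumption made for this example (the actual files might just as well be windows-1253 or UTF-8), and the class GreekReader and its readAll helper are hypothetical. The example uses an in-memory byte array in place of a FileInputStream so that it is self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class GreekReader {
    // Assumption: the input is ISO-8859-7; substitute the real encoding of your files.
    public static final Charset GREEK = Charset.forName("ISO-8859-7");

    // Reads the whole stream as text using the given charset
    // instead of the platform default.
    public static String readAll(InputStream in, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, cs))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String word = "\u03bb\u03cc\u03b3\u03bf\u03c2"; // Greek word "logos"
        byte[] bytes = word.getBytes(GREEK);
        // Decoding with the matching charset round-trips the text.
        String ok = readAll(new ByteArrayInputStream(bytes), GREEK);
        // Decoding the same bytes as ISO-8859-1 yields different characters.
        String wrong = readAll(new ByteArrayInputStream(bytes), Charset.forName("ISO-8859-1"));
        System.out.println(ok.equals(word));
        System.out.println(wrong.equals(word));
    }
}
```

With a real file, the ByteArrayInputStream would be replaced by new FileInputStream(file).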