Re: encoding question.

2007-07-19 Thread Peter Keegan
The source data for my index is already in standard UTF-8 and available as a simple byte array. I need to do some simple tokenization of the data (check for whitespace and special characters that control position increment). What is the most efficient way to index this data and avoid unnecessary c
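One possible way to approach the question above, sketched under assumptions: because every byte of a multi-byte UTF-8 sequence has its high bit set, ASCII whitespace bytes can be located directly in the byte array without decoding, and only the token slices need to be converted to Strings. The class name `Utf8WhitespaceSplit` and its helpers are hypothetical, and the sketch does not touch any Lucene API or handle position-increment control characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a UTF-8 byte array on ASCII whitespace.
public class Utf8WhitespaceSplit {

    // ASCII whitespace bytes never appear inside a multi-byte UTF-8 sequence.
    static boolean isAsciiWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Scans the raw bytes once and decodes only the token slices.
    static List<String> tokens(byte[] utf8) {
        List<String> out = new ArrayList<>();
        int start = -1; // start offset of the current token, or -1 if none
        for (int i = 0; i <= utf8.length; i++) {
            boolean boundary = i == utf8.length || isAsciiWhitespace(utf8[i]);
            if (!boundary) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                out.add(new String(utf8, start, i - start, StandardCharsets.UTF_8));
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo  w\u00f6rld\tx".getBytes(StandardCharsets.UTF_8);
        System.out.println(tokens(data));
    }
}
```

Each token is still decoded to a String here, since that is the form the reply below says Lucene works with internally; the sketch only avoids decoding the whole buffer up front.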

RE: encoding question.

2007-02-14 Thread Benson Margulies
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. typc

Re: encoding question.

2007-02-14 Thread Chris Hostetter
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. Typically when people have encoding problems in their
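To illustrate what "modified UTF-8" means in the JDK, here is a minimal sketch comparing standard UTF-8 with the output of `DataOutputStream.writeUTF`. It is an illustration of the JDK format only; it is an assumption that this matches Lucene's on-disk bytes, and the class name `ModifiedUtf8Demo` is hypothetical. The two formats differ for U+0000 and for characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
public class ModifiedUtf8Demo {

    // Standard UTF-8 encoding of a Java String.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // JDK "modified UTF-8" as produced by DataOutputStream.writeUTF,
    // with the leading 2-byte length prefix stripped.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);                    // U+0000
        String clef = new String(Character.toChars(0x1D11E));     // supplementary character
        // U+0000: 1 byte in standard UTF-8, 2 bytes (0xC0 0x80) in modified UTF-8.
        System.out.println(standardUtf8(nul).length + " vs " + modifiedUtf8(nul).length);
        // U+1D11E: 4 bytes in standard UTF-8, 6 bytes (two encoded surrogates) in modified UTF-8.
        System.out.println(standardUtf8(clef).length + " vs " + modifiedUtf8(clef).length);
    }
}
```

For plain ASCII and other Basic Multilingual Plane text without U+0000, the two encodings produce identical bytes.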