The source data for my index is already in standard UTF-8 and available as a
simple byte array. I need to do some simple tokenization of the data (check
for whitespace and special characters that control position increment). What
is the most efficient way to index this data and avoid unnecessary
conversions?
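For concreteness, here is a minimal stdlib-only sketch of the decode-then-tokenize step in question (the class and method names are illustrative, not a real Lucene API; an actual indexer would feed these tokens into Lucene's Analyzer/TokenStream machinery):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteTokenizer {
    // Decode the UTF-8 byte array once, then scan for whitespace
    // boundaries. This is the "simple tokenization" described above;
    // special position-increment characters would be handled in the
    // same boundary check.
    static List<String> tokenize(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        List<String> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary =
                i == text.length() || Character.isWhitespace(text.charAt(i));
            if (boundary) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // token begins here
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] data = "hello  world".getBytes(StandardCharsets.UTF_8);
        System.out.println(ByteTokenizer.tokenize(data)); // [hello, world]
    }
}
```

The single decode up front is hard to avoid: as the reply below notes, Lucene works on Java Strings internally, so the byte array has to be converted to chars at some point regardless.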
Subject: Re: encoding question.
Internally Lucene deals with pure Java Strings; when writing those strings
to and reading those strings back from disk, Lucene always uses the stock
Java "modified UTF-8" format, regardless of what your file.encoding
system property may be.
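The practical difference is visible with the standard library alone: Java's "modified UTF-8" (what DataOutputStream.writeUTF emits) encodes the NUL character as the two bytes 0xC0 0x80 instead of a single 0x00, and prefixes the data with a two-byte length. A small demonstration, assuming nothing beyond java.io:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "A\u0000"; // 'A' followed by a NUL character

        // Standard UTF-8: NUL encodes as a single 0x00 byte -> 2 bytes total.
        byte[] standard = s.getBytes("UTF-8");

        // Modified UTF-8 via writeUTF: 2-byte length prefix, then 'A',
        // then NUL as the overlong pair 0xC0 0x80 -> 5 bytes total.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();

        System.out.println(standard.length); // 2
        System.out.println(modified.length); // 5
    }
}
```

So bytes produced by an external UTF-8 writer are not guaranteed to be byte-identical to what Java serializes, even for the "same" string; only the String contents are equivalent.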
Typically when people have encoding problems in their