Hi KK,

> right? and remove this conversion that I'm doing later,
>
> byte[] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> This will make sure I'm not depending on the platform encoding, right?

In principle, yes. This is because you first decode the raw bytes into a wrongly decoded String using the platform encoding, then you encode that String back to bytes and decode them again as UTF-8. This works only as long as you do not lose chars through this conversion!

> This seems to fix my indexing issue. Now regarding searching, I don't need to
> mention any charset thing there, I'm using the standard analyzer? As far as I
> know, Lucene stores the chars as raw Unicode, so when I present my query in the
> same Unicode format Lucene will give me proper results. Currently I'm not
> using the encoding for HTTP parameters, I'll use that and let you know.
> Thank you very much.
>
> KK,
>
> On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > I forgot:
> >
> > > byte[] utfEncodeByteArray = textOnly.getBytes();
> > > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
> > >
> > > here textonly is the text extracted from the downloaded page
> >
> > What is textonly here? A String? If yes, why decode and then encode it
> > again? The important thing is:
> > Strings in Java are always invariant to charsets (internally they are
> > UTF-16). So if you convert a byte array to a String, you have to specify a
> > charset (as you have done in the new String code). If you convert a String
> > to a byte array, you must do the same.
> >
> > As mentioned in the mail before, the same is true when converting
> > InputStreams to Readers and Writers to OutputStreams (this can be done
> > using the converter).
> >
> > And: if you get a String from somewhere that looks bad, you cannot convert
> > the String to another encoding; it was corrupted during conversion to
> > String before.
> >
> > E.g. in a web application, use ServletRequest.setCharacterEncoding() to
> > specify the input encoding of the HTTP parameters and so on.
> >
> > Uwe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
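
To illustrate the point about lossy round trips, here is a small sketch. It is not code from the thread: the class name CharsetRoundTrip and the helpers roundTrip and readUtf8 are made up for this example, and the "platform" charset is passed in explicitly so the effect is reproducible on any machine. It uses StandardCharsets (Java 7 and later), which is equivalent to Charset.forName("UTF-8") as used above. A one-byte-per-char charset such as ISO-8859-1 happens to survive the round trip, while US-ASCII does not, which is why it is safer to decode the bytes once with the right charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Hypothetical helper mimicking the pattern discussed in the thread:
    // the raw UTF-8 bytes are first decoded with the (wrong) platform charset,
    // then encoded back with that same charset and finally decoded as UTF-8.
    static String roundTrip(byte[] utf8Bytes, Charset platform) {
        String wronglyDecoded = new String(utf8Bytes, platform);
        byte[] reEncoded = wronglyDecoded.getBytes(platform);
        return new String(reEncoded, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: decode the stream once, naming the charset explicitly,
    // so the result never depends on the platform default.
    static String readUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "Gr\u00fc\u00dfe"; // "Grüße"
        byte[] page = original.getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        System.out.println(original.equals(roundTrip(page, StandardCharsets.ISO_8859_1))); // true

        // US-ASCII cannot represent bytes >= 0x80, so the text is damaged for good.
        System.out.println(original.equals(roundTrip(page, StandardCharsets.US_ASCII))); // false

        // Decoding once with the correct charset avoids the problem entirely.
        System.out.println(original.equals(readUtf8(new ByteArrayInputStream(page)))); // true
    }
}
```

Whether the round trip works on a given machine therefore depends on which default charset it has, which is exactly the platform dependency to avoid.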