RE: Question about solr config files encoding.

Uwe Schindler Thu, 05 Jul 2012 08:41:28 -0700

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.


   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

:-)
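
For illustration, the null-pattern table quoted above could be sketched in Java roughly as follows. The class name JsonEncodingSniffer and the method detect are hypothetical and not part of Solr or Lucene. The sketch ignores byte order marks and falls back to UTF-8 for inputs shorter than four octets, which is an assumption and not something the RFC text above specifies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr/Lucene: maps the pattern of nulls
// in the first four octets of a JSON text to an encoding name.
class JsonEncodingSniffer {

    /** Returns the encoding name suggested by the null pattern of b[0..3]. */
    public static String detect(byte[] b) {
        if (b.length < 4) {
            return "UTF-8"; // assumption: too short to sniff, use the default
        }
        boolean z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
        if (z0 && z1 && z2 && !z3) return "UTF-32BE";   // 00 00 00 xx
        if (z0 && !z1 && z2 && !z3) return "UTF-16BE";  // 00 xx 00 xx
        if (!z0 && z1 && z2 && z3) return "UTF-32LE";   // xx 00 00 00
        if (!z0 && z1 && !z2 && z3) return "UTF-16LE";  // xx 00 xx 00
        return "UTF-8";                                 // xx xx xx xx and anything else
    }

    public static void main(String[] args) {
        String json = "{\"a\":1}";
        System.out.println(detect(json.getBytes(StandardCharsets.UTF_8)));
        System.out.println(detect(json.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(detect(json.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

This only works because the first two characters of a JSON text are ASCII, as the quoted section points out; it is not a general-purpose charset detector.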

I think we can safely assume it is UTF-8; otherwise we would have to do the 
same shit as XML parsers, with mark() on a BufferedInputStream... Most 
libraries out there can only read UTF-8, and Solr itself produces only UTF-8 
JSON, right? Those tests only check the response from Solr.
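
To make the trade-off concrete, here is a rough sketch of what the mark()/reset() approach involves compared with simply decoding as UTF-8. The class Utf8JsonReader and its methods peek4 and readAllUtf8 are hypothetical names for illustration only. The sketch uses the JDK's StandardCharsets.UTF_8 (Java 7+) in place of the IOUtils.UTF8_CHARSET constant mentioned in the thread.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Solr/Lucene.
class Utf8JsonReader {

    /** Peeks at up to four octets and rewinds, so nothing is consumed. */
    public static byte[] peek4(BufferedInputStream in) throws IOException {
        in.mark(4);                       // remember the current position
        byte[] head = new byte[4];
        int n = 0;
        while (n < 4) {
            int r = in.read(head, n, 4 - n);
            if (r < 0) break;             // stream shorter than four octets
            n += r;
        }
        in.reset();                       // rewind to the marked position
        return Arrays.copyOf(head, n);
    }

    /** The simple path: decode the whole stream with an explicit UTF-8 charset. */
    public static String readAllUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "{\"q\":\"solr\"}".getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(bytes));
        byte[] head = peek4(in);          // sniffing step, only needed if not assuming UTF-8
        System.out.println(head.length);
        System.out.println(readAllUtf8(in));
    }
}
```

If UTF-8 is assumed, the peek4 step disappears entirely and only the explicit charset argument remains, which is the point being argued here.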

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Dawid Weiss
> Sent: Thursday, July 05, 2012 5:35 PM
> To: [email protected]
> Subject: Re: Question about solr config files encoding.
> 
> > But JSON is defined to be UTF-8, so we must supply the encoding
> > (IOUtils.UTF8_CHARSET).
> 
> That RFC says it can be any Unicode encoding... That said, I agree with you
> that we can probably assume it's UTF-8 and not worry about anything else.
> 
> Dawid
> 

