Config files are XML, and I changed them to be passed to the XML parser as 
InputStreams, so the XML parser reads the encoding from the XML header.
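
To illustrate the point about XML: a minimal sketch, assuming the stock JDK DOM parser (the class name, method name and sample document are made up for illustration, not taken from Solr). The parser gets raw bytes and picks the charset from the XML declaration, so file.encoding never comes into play.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Solr.
class XmlEncodingDemo {
    // Parse from the raw byte stream; the parser decodes it according to
    // the encoding named in the XML declaration, not the platform default.
    static String rootNameAttribute(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getAttribute("name");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The document declares ISO-8859-1 and is encoded as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<config name=\"caf\u00e9\"/>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootNameAttribute(new ByteArrayInputStream(bytes)));
    }
}
```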

But JSON has no header that declares its encoding and is defined to be UTF-8 
by default, so we must supply the encoding explicitly (IOUtils.UTF8_CHARSET).
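
A minimal sketch of what "supply the encoding" means for a Reader. StandardCharsets.UTF_8 (Java 7+) stands in here for IOUtils.UTF8_CHARSET so the example runs without Lucene on the classpath; the class and helper names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr.
class Utf8ReaderDemo {
    // Always decode as UTF-8, independent of -Dfile.encoding.
    // new InputStreamReader(in) without a charset uses the platform default,
    // which is what breaks when file.encoding is UTF-16.
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Small helper that drains a Reader into a String.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] json = "{\"disclaimer\": \"caf\u00e9\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(utf8Reader(new ByteArrayInputStream(json))));
    }
}
```

In the constructor quoted below, this would presumably amount to passing the charset as the second argument to the InputStreamReader that is handed to JSONParser.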

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Dawid Weiss [mailto:[email protected]]
> Sent: Thursday, July 05, 2012 5:00 PM
> To: [email protected]
> Subject: Question about solr config files encoding.
> 
> Guys, should the encoding of config files really be platform-dependent?
> Currently Solr tests fail massively on setup because of things like
> this:
> 
>     public OpenExchangeRates(InputStream ratesStream) throws IOException {
>       parser = new JSONParser(new InputStreamReader(ratesStream));
> 
> This reader, when confronted with UTF-16 as file.encoding, results in funky
> exceptions like:
> 
>    > Caused by: org.apache.noggit.JSONParser$ParseException: JSON Parse
> Error: char=笊,position=0 BEFORE='笊'
> AFTER='†≤楳捬慩浥爢㨠≔桩猠摡瑡⁩猠捯汬散瑥搠晲潭⁶慲楯畳⁰牯癩摥牳⁡
> 湤⁰牯癩摥搠晲'
>    >  at org.apache.noggit.JSONParser.err(JSONParser.java:221)
>    >  at org.apache.noggit.JSONParser.next(JSONParser.java:620)
>    >  at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661)
>    >  at org.apache.solr.schema.OpenExchangeRatesOrgProvider$OpenExchangeRates.<init>(OpenExchangeRatesOrgProvider.java:189)
>    >  at org.apache.solr.schema.OpenExchangeRatesOrgProvider.reload(OpenExchangeRatesOrgProvider.java:129)
> 
> Can we fix the encoding of these input files to UTF-8 or something?
> According to JSON RFC:
> 
> http://tools.ietf.org/html/rfc4627#section-3
> 
> JSON text SHALL be encoded in Unicode.  The default encoding is
>    UTF-8.
> 
>    Since the first two characters of a JSON text will always be ASCII
>    characters [RFC0020], it is possible to determine whether an octet
>    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>    at the pattern of nulls in the first four octets.
> 
>            00 00 00 xx  UTF-32BE
>            00 xx 00 xx  UTF-16BE
>            xx 00 00 00  UTF-32LE
>            xx 00 xx 00  UTF-16LE
>            xx xx xx xx  UTF-8
> 
> We could just enforce/require UTF-8? Alternatively, auto-detect this from a
> binary stream as a custom Reader class.
> 
> Dawid
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected] For additional
> commands, e-mail: [email protected]
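
For reference, the auto-detection alternative from the quoted RFC 4627 table could look roughly like the sketch below. The class and method names are made up for illustration, and UTF-32 charsets are looked up by name because they are not among the charsets every Java platform is required to support. The sketch only inspects the null pattern of the first four octets; it does not handle a byte order mark, and inputs shorter than four bytes fall back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or noggit.
class JsonCharsetSniffer {
    // Guess the encoding of a JSON text from the first four octets,
    // following the null-pattern table in RFC 4627, section 3.
    static Charset detect(byte[] head) {
        if (head.length >= 4) {
            boolean z0 = head[0] == 0;
            boolean z1 = head[1] == 0;
            boolean z2 = head[2] == 0;
            boolean z3 = head[3] == 0;
            if (z0 && z1 && z2 && !z3) {
                return Charset.forName("UTF-32BE"); // 00 00 00 xx
            }
            if (z0 && !z1 && z2 && !z3) {
                return StandardCharsets.UTF_16BE;   // 00 xx 00 xx
            }
            if (!z0 && z1 && z2 && z3) {
                return Charset.forName("UTF-32LE"); // xx 00 00 00
            }
            if (!z0 && z1 && !z2 && z3) {
                return StandardCharsets.UTF_16LE;   // xx 00 xx 00
            }
        }
        return StandardCharsets.UTF_8;              // xx xx xx xx, or too short
    }

    public static void main(String[] args) {
        byte[] utf16 = "{\"rates\":{}}".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detect(utf16));
    }
}
```

A custom Reader would peek at these four bytes (for example via a PushbackInputStream) and then wrap the stream in an InputStreamReader with the detected charset. Simply requiring UTF-8 is the smaller change.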

