[ https://issues.apache.org/jira/browse/HIVE-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated HIVE-2859:
-----------------------------------

    Affects Version/s: 0.9.0
                       0.8.0
                       0.8.1
        Fix Version/s:     (was: 0.9.0)
                           (was: 0.7.1)

Unlinking from 0.9

> STRING data corruption in internationalized data -- based on LANG env variable
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-2859
>                 URL: https://issues.apache.org/jira/browse/HIVE-2859
>             Project: Hive
>          Issue Type: Bug
>          Components: Configuration, Import/Export, Serializers/Deserializers, Types
>    Affects Versions: 0.7.1, 0.8.0, 0.8.1, 0.9.0
>         Environment: Windows / RHEL5 with LANG = en_US.CP1252
>            Reporter: John Gordon
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> This is a bug in Hive that is exacerbated by replatforming it to Windows
> without CYGWIN. Basically, it assumes that the default file.encoding is
> UTF8. There are something like 6-7 getBytes() calls and write() calls that
> don't specify the encoding. The rest specify UTF-8 explicitly, which blocks
> auto-detection of UTF-16 data in files with a BOM present. The mix of
> explicit encodings and default encoding assumptions means that Hive must be
> run in a JVM whose default encoding is UTF-8 and only UTF-8.
>
> When the JVM starts up, it derives the default encoding from the C runtime
> setlocale() call. On Linux/Unix, this would use the LANG env variable (which
> is almost always <locale>.UTF8 for machines handling internationalized data,
> but not guaranteed to be so). On Windows, this is derived from the user's
> language settings, and cannot return a UTF-8 encoding, right now. So there
> isn't an environment setting for Windows that would reliably provide the JVM
> with a set of inputs to cause it to set the default encoding to UTF-8 on
> startup without additional options.
>
> However, there are 2 feasible options:
> 1.) The JVM has a startup option -Dfile.encoding=UTF-8 which should
> explicitly override the default encoding detection behavior in the JVM to
> make it always UTF-8 regardless of the environmental configuration. This
> would make all deployments on all OS/environment configs behave consistently.
> I don't know where Hive sets the JVM options we use when it starts the
> service.
> 2.) We could add "UTF8" explicitly to all the remaining getBytes() calls that
> need it, and make all the string I/O explicitly UTF-8 encoded. This is
> probably being changed right now as part of Hive-1505, so we would duplicate
> effort and maybe make that change harder. Seems easier to trick the JVM into
> behaving like it is on a well-configured machine WRT default encoding instead
> of setting explicit encodings everywhere.
>
> So:
>   - Pretty much any globalized strings other than Western European are going
> to be corrupted in the current Hive service on Windows with this bug present
> because there really isn't a way to have the JVM read the environment and
> determine by default that UTF8 should be the default encoding.
>   - Anyone can repro this on Linux fairly easily -- Add "export
> LANG=en_US.CP1252" to /etc/profile to modify the global LANG default encoding
> to CP1252 explicitly, then restart the service and do a query over
> internationalized UTF-8 data.
>   - We shouldn't rely on JVM default codepage selection if we want to
> support UTF-8 consistently and reliably as the default encoding.
>   - The estimate can range wildly, but adding an explicit default
> encoding on startup should only take a little while if you know where to do
> it, theoretically.
>   - I don't know where to update the start arguments of the JVM when the
> service is started, just getting into the code for the first time with this
> bug investigation.

--
This message is automatically generated by JIRA.
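
For readers who want to see the failure mode from the report in isolation, here is a minimal Java sketch. It is not Hive code: the class EncodingDemo and the method roundTrip are hypothetical names made up for illustration, and ISO-8859-1 is used only as a stand-in for a non-UTF-8 default encoding such as CP1252, because every JVM is guaranteed to ship it. The sketch passes charsets explicitly so the result does not depend on the machine it runs on; a no-argument getBytes() call would instead use whatever Charset.defaultCharset() reports.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hive.
class EncodingDemo {

    // Encodes with one charset and decodes with another, mimicking a writer
    // and a reader that disagree about the encoding of the bytes.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent, then three CJK characters, written as
        // escapes so the source file's own encoding cannot matter.
        String text = "caf\u00e9 \u65e5\u672c\u8a9e";

        // The charset a no-argument getBytes() would use in this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Explicit UTF-8 on both sides: the text survives the round trip.
        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + text.equals(same));

        // Writer falls back to a single-byte charset (stand-in for a CP1252
        // default), reader expects UTF-8: the text is corrupted.
        String broken = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 -> UTF-8 intact: " + text.equals(broken));
    }
}
```

The first comparison prints true and the second prints false. Running the same program with -Dfile.encoding=UTF-8, the startup option named in the report, is a way to check what the first println reports under option 1.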