We have HBase tables whose column qualifiers end in characters that Java treats
as whitespace. The reason was to keep qualifiers short, ideally a single byte,
and counting started at \u0001.
 
Now I need to hook the HBase table into Hive, so I define a column mapping, e.g.
 
CREATE EXTERNAL TABLE abc (key string, column string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ":key,A:\u0001")
TBLPROPERTIES ('hbase.table.name' = 'hbase_abc');
 
The problem is that with 'hbase.columns.mapping' = ":key,A:\u0001", the second column,
A:\u0001, ends with a character that String.trim() ([3]) treats as whitespace (any
character <= \u0020), and because of [1] and [2] it gets trimmed away.
Even the Hive documentation on HBase integration is wrong on this point ([4]):
 
    "whitespace should not be used in between entries since these will be 
interperted as part of the column name, which is almost certainly not what you want"
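
To illustrate the trimming, here is a minimal Java sketch. The class and the two
helper methods are hypothetical and only mimic a split-then-trim of the mapping
string; they are not the actual HBaseSerDe code from [1].

```java
import java.util.ArrayList;
import java.util.List;

public class TrimDemo {
    // Hypothetical helper: splits the mapping on commas and trims each entry,
    // mimicking the String.trim() behaviour described above (not Hive's code).
    public static List<String> splitAndTrim(String mapping) {
        List<String> out = new ArrayList<>();
        for (String entry : mapping.split(",")) {
            out.add(entry.trim());
        }
        return out;
    }

    // Hypothetical alternative: splits without trimming, so a trailing
    // control character in the qualifier survives.
    public static List<String> splitRaw(String mapping) {
        List<String> out = new ArrayList<>();
        for (String entry : mapping.split(",")) {
            out.add(entry);
        }
        return out;
    }

    public static void main(String[] args) {
        String mapping = ":key,A:\u0001";
        String trimmed = splitAndTrim(mapping).get(1);
        String raw = splitRaw(mapping).get(1);
        // The trimmed entry has lost its one-character qualifier.
        System.out.println("trimmed length: " + trimmed.length());
        System.out.println("raw length:     " + raw.length());
    }
}
```

With trimming, the second entry is reduced to "A:" and the qualifier is gone;
without it, the entry keeps all three characters.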
 
HIVE-3243 was meant to make the mapping "less confusing". However, one could argue that it added an
implicit "auto-correct" on top of the column mapping syntax, which is even worse, because it
restricts which characters you can use in HBase column qualifiers.
 
I see the backwards-compatibility issue: if we change it, the current behaviour
changes for people relying on the whitespace trimming.

What are your opinions? 

Regards,
Marcel
 
[1] https://github.com/apache/hive/blob/32e854ef1c25f21d53f7932723cfc76bf75a71cd/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java#L178
[2] https://issues.apache.org/jira/browse/HIVE-3243
[3] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim()
[4] https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-ColumnMapping
