Hi All, I would like to ask for your advice with below use case:
While working with Hive we came across the situation when Hive produces inconsistent results (row counts are different) *Reproduce steps:* 1) create schema and populate with data; CREATE TABLE `a`( `key` string COMMENT 'from deserializer', `a` bigint COMMENT 'from deserializer', `b` bigint COMMENT 'from deserializer') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key, CF1:A, CF2:B', 'serialization.format'='1') TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'hbase.table.default.storage.type'='binary', 'hbase.table.name'='A'); CREATE EXTERNAL TABLE `b`( `key` string COMMENT 'from deserializer', `b` bigint COMMENT 'from deserializer') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key, CF2:B', 'serialization.format'='1') TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'hbase.table.default.storage.type'='binary', 'hbase.table.name'='A'); INSERT INTO A VALUES (1,null,1),(2,2,null),(3,3,3); 2) issue the following queries: 0: jdbc:hive2://localhost:10000> SELECT key from b;[...] +------+--+ | key | +------+--+ | 1 | | 2 | | 3 | +------+--+ 3 rows selected (0.337 seconds) 0: jdbc:hive2://localhost:10000> SELECT * from b;[..] +--------+------+--+ | b.key | b.b | +--------+------+--+ | 1 | 1 | | 3 | 3 | +--------+------+--+ 2 rows selected (0.401 seconds) Rows with *NULL* values will be ignored by HBASE: In case of `*select*` by `*rowkey*`, *HiveHBaseInputFormatUtil.java* constructs following SCAN object, that doesn't consider column families: When extending SCAN object with column families results become consistent: *Questions:* 1. Do you think it's a real issue or just an invalid use case? 2. If it is and issue, would it be enough to extend SCAN object with column families? Thank you! Best Regards, Denys