Hi All,

I would like to ask for your advice with below use case:

While working with Hive we came across the situation when Hive produces
inconsistent results (row counts are different)

*Reproduce steps:*

1) create schema and populate with data;

CREATE TABLE `a`(
  `key` string COMMENT 'from deserializer',
  `a` bigint COMMENT 'from deserializer',
  `b` bigint COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key, CF1:A, CF2:B',
  'serialization.format'='1')
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'hbase.table.default.storage.type'='binary',
  'hbase.table.name'='A');

CREATE EXTERNAL TABLE `b`(
  `key` string COMMENT 'from deserializer',
  `b` bigint COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key, CF2:B',
  'serialization.format'='1')
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'hbase.table.default.storage.type'='binary',
  'hbase.table.name'='A');

INSERT INTO A VALUES (1,null,1),(2,2,null),(3,3,3);


2) issue the following queries:

0: jdbc:hive2://localhost:10000> SELECT key from b;[...]
+------+--+
| key  |
+------+--+
| 1    |
| 2    |
| 3    |
+------+--+
3 rows selected (0.337 seconds)
0: jdbc:hive2://localhost:10000> SELECT * from b;[..]
+--------+------+--+
| b.key  | b.b  |
+--------+------+--+
| 1      | 1    |
| 3      | 3    |
+--------+------+--+
2 rows selected (0.401 seconds)


Rows with *NULL* values will be ignored by HBASE:



In case of `*select*` by `*rowkey*`, *HiveHBaseInputFormatUtil.java*
constructs following SCAN object, that doesn't consider column families:



When extending SCAN object with column families results become consistent:



*Questions:*

1. Do you think it's a real issue or just an invalid use case?
2. If it is and issue, would it be enough to extend SCAN object with column
families?

Thank you!

Best Regards,
Denys

Reply via email to