First time posting to this list, so please forgive me if I break etiquette.
I'm looking for some help getting data from Hive into HBase.

I'm using HDP 2.2.8.

I have a zlib-compressed, ORC-based Hive table with 12 columns and billions
of rows.

In order to get the data into HBase, I first have to create a copy of the
table as an "external" table backed by CSV files (unless someone knows a
better way).

Then I use the CsvBulkLoad MapReduce job to create HFiles from the CSV files
backing the external table.
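
In case the exact invocation matters, it's roughly this (assuming the
Phoenix CsvBulkLoadTool that ships with HDP; the jar name, ZooKeeper quorum,
and paths here are placeholders):

    HADOOP_CLASSPATH=$(hbase classpath) hadoop jar phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table MY_TABLE \
        --input /tmp/my_table_csv \
        --zookeeper zk1:2181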

I've been doing this for almost a year, and MOST of the data ends up correct,
but when I export a large amount of data I end up with NULLs where there
shouldn't be any.

If I run the exact same query against the source table (compressed ORC) and
the destination table (external text), I get NULL values in the results from
the latter but not the former.
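
The kind of check I've been running is along these lines (the column name is
a placeholder):

    -- Against the ORC source: returns 0
    SELECT COUNT(*) FROM my_table_orc WHERE col1 IS NULL;

    -- Same query against the text copy: returns a nonzero count
    SELECT COUNT(*) FROM my_table_csv WHERE col1 IS NULL;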

However, if I only copy a small subset of the data to the text-based table, all 
the data is correct.

I also noticed that if I use an uncompressed source table and then copy to an
external text-based table, the problem occurs much more often.
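
By "uncompressed" I mean a source ORC table created along these lines
(placeholder names again):

    CREATE TABLE my_table_orc_nozip (
      id   BIGINT,
      col1 STRING,
      col2 STRING
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="NONE");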

So, my (not-very-educated) guess is that this has to do with ORC files.

I know that there are alternatives to ORC, but Hortonworks strongly encourages 
us to use ORC for everything. I'm not even sure whether Parquet works with HDP.

Anyway, is this a known bug?

Any ideas on how I can get around it without chopping up my data into multiple 
tables?
