First time posting to this list, so please forgive me if I break etiquette. I'm looking for some help getting data from Hive into HBase.
I'm using HDP 2.2.8. I have a zlib-compressed, ORC-based Hive table with 12 columns and billions of rows. To get the data into HBase, I first create a copy of the table as an external table backed by CSV files (unless someone knows a better way), and then run the CsvBulkLoad MapReduce job to create HFiles from the CSV files backing the external table.

I've been doing this for almost a year, and most of the data ends up correct. But when I export a large amount of data, I end up with NULLs where there shouldn't be any: if I run the exact same query against the source table (compressed ORC) and the destination table (external text), I get NULL values in the results from the latter but not the former. However, if I copy only a small subset of the data to the text-based table, all of the data is correct. I've also noticed that if I start from an uncompressed source table and copy it to an external text-based table, it happens much more often. So my (not very educated) guess is that this has something to do with ORC files. I know there are alternatives to ORC, but Hortonworks strongly encourages us to use ORC for everything, and I'm not even sure whether Parquet works with HDP.

Anyway, is this a known bug? Any ideas on how I can work around it without chopping my data up into multiple tables?
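For reference, here is roughly what my pipeline looks like. Table names, column names, paths, and the ZooKeeper quorum below are simplified placeholders, and I'm assuming the Phoenix CsvBulkLoadTool here since that's the bulk-load job we run:

    -- External text-backed copy of the ORC table (simplified; the real
    -- table has 12 columns). my_table_ext and the LOCATION path are
    -- placeholders.
    CREATE EXTERNAL TABLE my_table_ext (
      id   BIGINT,
      col1 STRING,
      col2 STRING
      -- ... remaining columns ...
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/apps/hive/warehouse/staging/my_table_ext';

    -- Copy everything from the compressed ORC source into the text table
    INSERT OVERWRITE TABLE my_table_ext
    SELECT * FROM my_table_orc;

Then the bulk load over the CSV files backing that external table:

    # Build and load HFiles from the staged CSV files
    # (MY_TABLE and the ZooKeeper hosts are placeholders)
    hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table MY_TABLE \
      --input /apps/hive/warehouse/staging/my_table_ext \
      --zookeeper zk1,zk2,zk3:2181

The NULLs show up when I compare query results between my_table_orc and my_table_ext, i.e. before the bulk load even runs.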