Do you have any fields with embedded newline characters? If so, certain hive output formats will parse the newline character as the end of row, and when importing, chances are the missing fields (now part of the next row) will be padded with nulls. This happens in Hive as well if you are using a TextFile intermediate format (SequenceFile does not have this problem).
-Nick Nicholas Szandor Hakobian Data Scientist Rally Health nicholas.hakob...@rallyhealth.com On Thu, Jan 28, 2016 at 12:05 PM, Riesland, Zack <zack.riesl...@sensus.com> wrote: > First time posting to this list. Please forgive me if I break etiquette. I’m > looking for some help with getting data from hive to hbase. > > > > I’m using HDP 2.2.8. > > > > I have a compressed (zlib), orc-based hive table with 12 columns and > billions of rows. > > In order to get the data into hbase, I have to create a copy of the table as > an "external" table, backed by CSV files (unless someone knows a better > way). > > > > Then, I use the CsvBulkLoad mapreduce job to create hfiles from the csv > files backing the external table. > > I’ve been doing this for almost a year, and MOST of the data ends up > correct, but if I export a large amount of data, I end up with nulls where I > shouldn't. > > > > If I run the exact same query on the source table (compressed orc) and > destination table (external text) I get null values in the results of the > latter, but not the former. > > > > However, if I only copy a small subset of the data to the text-based table, > all the data is correct. > > > > I also noticed that if I use an uncompressed source table, and then copy to > an external text-based table, it happens much more often. > > > > So, my (not-very-educated) guess is that this has to do with ORC files. > > > > I know that there are alternatives to ORC, but Hortonworks strongly > encourages us to use ORC for everything. I’m not even sure whether Parquet > works with HDP. > > > > Anyways, Is this a known bug? > > Any ideas on how I can get around it without chopping up my data into > multiple tables?