> And again: the same row is correct if I export a small set of data, and
> incorrect if I export a large set - so I think that file/data size has
> something to do with this.
My Phoenix vs LLAP benchmark hit size-related issues in ETL. In my case, the
tipping point was >1 HDFS block per CSV file. Generating CSV files compressed
with SNAPPY was how I prevented the old-style MapReduce splitters from
arbitrarily chopping up those files on block boundaries while loading.

> I just tested and if I take the orc table, copy it to a sequence file,
> and then copy to a csv "file", everything looks good.
...
> So, my (not-very-educated) guess is that this has to do with ORC files.

Yes, though somewhat indirectly. Check the output file sizes between those
two: ORC -> SequenceFile -> Text will produce smaller text files (more of
them) than ORC -> Text.

Cheers,
Gopal
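For anyone following along, the two export paths being compared can be sketched in HiveQL roughly like this (table names, output paths, and the compression settings are illustrative, not from the original thread):

```sql
-- Optional: emit Snappy-compressed output so old-style MapReduce splitters
-- cannot chop files on block boundaries (Snappy text files are not splittable)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Path A: ORC -> Text directly (tends to produce fewer, larger text files)
INSERT OVERWRITE DIRECTORY '/tmp/export_direct'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM orc_table;

-- Path B: ORC -> SequenceFile -> Text (tends to produce more, smaller files)
CREATE TABLE seq_copy STORED AS SEQUENCEFILE AS
SELECT * FROM orc_table;

INSERT OVERWRITE DIRECTORY '/tmp/export_via_seq'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM seq_copy;
```

Comparing the file counts and sizes under the two output directories should show the difference described above.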