> And again: the same row is correct if I export a small set of data, and
>incorrect if I export a large set - so I think that file/data size has
>something to do with this.

My Phoenix vs LLAP benchmark hit size-related issues in ETL.


In my case, the tipping point was more than one HDFS block per CSV file.

Generating CSV files compressed with Snappy was how I prevented the
old-style MapReduce splitters from arbitrarily chopping those files up on
block boundaries while loading.
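A minimal sketch of that workaround in HiveQL, assuming a hypothetical
source table `src` and output path (the session settings are standard
Hive/Hadoop compression knobs):

```sql
-- Enable Snappy compression on the job output; because Snappy text files
-- are not splittable, each file is read as a single split instead of
-- being cut on HDFS block boundaries.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical export: write the table out as Snappy-compressed text.
INSERT OVERWRITE DIRECTORY '/tmp/export_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM src;
```

The trade-off is that each compressed file is processed by one task, so
you give up split-level parallelism per file in exchange for files that
are never chopped mid-row.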

>I just tested and if I take the orc table, copy it to a sequence file,
>and then copy to a csv "file", everything looks good.
...
> So, my (not-very-educated) guess is that this has to do with ORC files.

Yes, though somewhat indirectly. Compare the output file sizes between
those two paths.

ORC -> SequenceFile -> Text

will produce smaller text files (more of them) than

ORC -> Text.
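For concreteness, the two paths could be sketched as CTAS statements
(table names here are hypothetical, not from the original thread):

```sql
-- Two-hop path: ORC -> SequenceFile -> Text.
-- The SequenceFile intermediate is split differently from ORC, which
-- tends to yield more (and therefore smaller) output text files.
CREATE TABLE tmp_seq STORED AS SEQUENCEFILE AS SELECT * FROM orc_table;
CREATE TABLE out_text_a STORED AS TEXTFILE AS SELECT * FROM tmp_seq;

-- Direct path: ORC -> Text, fewer but larger text files.
CREATE TABLE out_text_b STORED AS TEXTFILE AS SELECT * FROM orc_table;
```

Comparing the file counts and sizes under the two output tables'
directories should show the difference described above.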

Cheers,
Gopal
