I am running some data that isn't huge per se, but I'm performing processing on it to get it into my final table (RCFile).
One of the challenges is that it comes in large blocks of data. For example, I may have a 70 MB chunk of binary data that I want to put in. My process that generates this data hex-encodes it, so that 70 MB becomes a 140 MB string, and when I insert into the binary field I use unhex. My nodes are not huge: I have 8 nodes with 6 GB of RAM each. A typical load reads the hex-encoded data from an external load table and then inserts it (no joins, etc.). Most data loads fine, but when I get chunks above 32 MB in raw size I'll get failures. I am working on getting some adjustments to my source data to minimize those large chunks.

That being said, what are some things I can do at the Hive/insert level to reduce the heap space issues? I've tried playing with split size, reusing JVMs, and heap space, but it's all trial and error, and I'd like to see more real-world examples of conditions where one setting makes sense and another does not. I am not looking for a "go Google it" answer here, just some examples (even links to examples) showing that with this type of data or setup, you can get less memory usage, faster performance, etc. by tweaking these settings. I think my issue is that there are so many settings saying do this or that, and they don't really provide real-world examples, which makes it tough to know where to start.
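For reference, the load pattern and the knobs I've been poking at look roughly like this (table and column names are placeholders, and the values are just examples of what I've tried, not settings I'm recommending):

    -- Settings I've experimented with (values are illustrative only):
    SET mapred.max.split.size=33554432;        -- smaller input splits
    SET mapred.job.reuse.jvm.num.tasks=1;      -- toggling JVM reuse
    SET mapred.child.java.opts=-Xmx2048m;      -- task heap size

    -- External load table holding the hex-encoded payload (~2x raw size)
    CREATE EXTERNAL TABLE load_table (
      id          STRING,
      payload_hex STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/load_table';

    -- Final RCFile table with the binary column
    CREATE TABLE final_table (
      id      STRING,
      payload BINARY
    )
    STORED AS RCFILE;

    -- Straight insert, no joins: unhex the payload on the way in
    INSERT INTO TABLE final_table
    SELECT id, unhex(payload_hex)
    FROM load_table;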