The transform scripts (or executables) are run as separate processes, so it sounds like Hive itself is blowing up. That would be consistent with your script working fine outside Hive. The Hive or Hadoop logs might have clues.
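One quick way to look for those clues is to scan the task-attempt logs for the heap error. This is just a sketch; the log location is an assumption and varies by Hadoop version and configuration (e.g. `$HADOOP_LOG_DIR/userlogs` on many 1.x installs):

```shell
# Sketch: find task-attempt logs that contain a heap error.
# LOGDIR is an assumption -- adjust for your Hadoop install/config.
LOGDIR="${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs"
grep -rl "java.lang.OutOfMemoryError" "$LOGDIR" 2>/dev/null | head -n 5
```

The matching files are the stderr/syslog of the failed task attempts, which usually name the exact operator or script invocation that died.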
So, it happens consistently with this one file? I would check that there isn't a subtle error in the file or in your script's output, say an extra tab, other stray whitespace, or a malformed data value. If you can find the line where it blows up, that would be a big help. You could have your script dump debug data, such as an index for each input line and the corresponding key-value pair, or modify the script's output and the query results to return information like this to Hive.

It seems more likely that the problem is downstream of where the data passes through the query, so you could try changing the Hive query to just dump the script results and do nothing else afterwards, etc. However, I wouldn't expect those problems to cause heap exhaustion, unless one somehow triggers an infinite loop.

Can you share your Python script, Hive query, table schema(s), and a sample of the file?

dean

On Wed, Jan 16, 2013 at 9:32 PM, John Omernik <j...@omernik.com> wrote:

> I am perplexed. If I run a transform script on a file by itself, it runs
> fine, outputs to standard out, life is good. If I run the transform script
> on that same file (with the path and filename being passed into the script
> via transform, so that the python script is doing the exact same thing) I
> get a java heap space error. This process works on 99% of files, and I just
> can't figure out why this file is different. How does, say, a python
> transform script run "in" the java process (if that is even what it is
> doing) so that it causes a heap error in a transform script but not run
> without java around?
>
> I am curious what steps I can take to troubleshoot or eliminate this
> problem.

--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
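P.S. Here is a minimal sketch of the kind of debug pass I mean. The column count and field checks are assumptions; adapt them to your schema. It echoes each input line to stdout unchanged (so the Hive query still works) and writes a diagnostic for every suspicious line to stderr, which ends up in the task logs:

```python
#!/usr/bin/env python
# Sketch of a debug pass for a Hive TRANSFORM script.
# EXPECTED_COLS and the checks below are assumptions -- adapt to your schema.
import sys

EXPECTED_COLS = 3  # assumption: set this to your table's column count

def diagnose(line):
    """Return a list of problems found in one tab-separated input line."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLS:
        problems.append("expected %d fields, got %d" % (EXPECTED_COLS, len(fields)))
    for i, f in enumerate(fields):
        if f != f.strip():
            problems.append("field %d has stray whitespace: %r" % (i, f))
    return problems

def main():
    for lineno, line in enumerate(sys.stdin, 1):
        for p in diagnose(line):
            sys.stderr.write("line %d: %s\n" % (lineno, p))
        sys.stdout.write(line)  # pass the data through unchanged

if __name__ == "__main__":
    main()
```

Running the problem file through this both standalone and via TRANSFORM, then comparing the stderr in the task logs against the standalone run, should tell you whether Hive is feeding your script something different, and on which line.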