Hi -

I have a Pig script that hangs when it streams data through multiple Python
scripts. It only hangs in multiquery mode; the script runs successfully if I
add the -no_multiquery flag, but that run is extremely slow. I am using Pig
0.7.

It seems that when I have a simple process (load, stream, save) the Pig
script works as expected. When I have a slightly more complex process (load
A, stream A -> A', generate B from A', process B to B', stream A' to output,
stream B' to output), the Pig script stalls and never completes execution.
The failure occurs in both local and mapred mode. I have been debugging in
local mode.

Here is an example of a working Pig script:

arrivals = load '$arrivals_path' using PigStorage('\t');
processed_arrivals = stream arrivals through `python -u -m ProcessArrivals`
as ( foo, bar, baz );

-- Replace empty strings with \N in preparation for import into MySQL
final_arrivals = stream processed_arrivals through `python -u -m Nullify`;
store final_arrivals into '$final_arrivals_path' using PigStorage('\t');

This works and everything is as expected.

Here is an example of a failing Pig script:

arrivals = load '$arrivals_path' using PigStorage('\t');
processed_arrivals = stream arrivals through `python -u -m ProcessArrivals`
as ( foo, bar, baz );

-- Derive a new data set from processed_arrivals

derived = foreach processed_arrivals generate foo, bar;
processed_derived = stream derived through `python -u -m ProcessDerived`;

-- Replace empty strings with \N in preparation for import into MySQL
final_arrivals = stream processed_arrivals through `python -u -m Nullify`;
final_derived = stream processed_derived through `python -u -m Nullify`;

store final_arrivals into '$final_arrivals_path' using PigStorage('\t');
store final_derived into '$final_derived_path' using PigStorage('\t');

When I run the failing script, I observe the following:

1. Pig output like the following:

2011-01-05 09:17:27,350 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:30,351 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:33,351 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:36,352 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:39,352 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 93036
Input bytes: 14089001 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 200064
Output bytes: 26794153 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
=====          * * *          =====
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 200064
Input bytes: 26794153 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 200064
Output bytes: 26794153 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
=====          * * *          =====
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 93036
Input bytes: 98024714 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 93036
Output bytes: 102466442 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
=====          * * *          =====
2011-01-05 09:17:42,353 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:45,353 [communication thread] INFO
 org.apache.hadoop.mapred.LocalJobRunner -

The tasks end, then there are always exactly two more LocalJobRunner lines,
and then nothing: it hangs indefinitely.

2. Before I kill the hanging process, htop shows that "python -u -m
ProcessArrivals" is still running, but at 0% CPU and 1% memory.

3. Before I kill the hanging process, I can inspect the temporary files in
the output directory for "final_arrivals" and "final_derived". Both files end
abruptly mid-record, as if the output was never flushed. For example, the
user agent column might end mid-value:

(iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; en-us) AppleWebKit/533.17.9
(KHTML, like Ge

and then the file ends.

4. When I kill the hanging process (Ctrl-C), the output is a Python
traceback:

^CTraceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/svn/repository/ProcessArrivals.py", line 285, in <module>
    print io.serialize_output(ordered)
KeyboardInterrupt

This indicates that ProcessArrivals is blocked while writing an output
row. The code is nothing special: "ordered" is a Python list, and
io.serialize_output is defined as:

def serialize_output(row):
  "Returns a string representation of data to output."
  return "\t".join([str(x) for x in row])

So nothing out of the ordinary there.
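To be concrete, on a made-up sample row (not real data) the function behaves as
expected, so the serialization itself can't be the holdup:

```python
def serialize_output(row):
    "Returns a string representation of data to output."
    return "\t".join([str(x) for x in row])

# Sanity check on an invented row: serialization is instant, which points
# at the print (the blocked write to stdout) as where the hang occurs.
row = ["iPhone", 4.2, None, 93036]
print(serialize_output(row))
```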

I have verified that piping the data through the Python scripts on the
command line completes successfully. Because it looked like a buffering
issue, I tried running Python in both buffered (the default) and unbuffered
(-u flag) modes, to no avail.

Any thoughts on what I could try to make this run in multiquery mode, or on
what the problem might be, would be very much appreciated.

Thanks,
- Charles
