I'm having a weird issue.

When I run my MapReduce job with a secondary sort using the
KeyFieldBasedPartitioner, lines containing backslashes come out altered.
Or I've made some foolish conceptual error and my script is doing it, but
only when there's a partitioner.  Any advice welcome.  I've attached the
script and a bowdlerized copy of the output, since I fear the worst for the
formatting of the text below.

With no partitioner, among a few million other lines, my script
produces this one correctly:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526
0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=========


( was called using: )


hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper     /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer    /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input      rawd/keyed/_20081205'/user-keyed.tsv' \
    -output     out/"parsed-$output_id"


Note that the website field contained
  http://http:\\www.MyWebsitee.com
(this person clearly either fails at internet or wins at windows)

When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on
these few lines with backslashes, where the double backslash comes out as a
single backslash followed by a tab:


=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526
0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=========


( was called using: )

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf    map.output.key.field.separator='\t' \
    -jobconf    num.key.fields.for.partition=1 \
    -jobconf    stream.map.output.field.separator='\t' \
    -jobconf    stream.num.map.output.key.fields=2 \
    -mapper     /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer    /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input      rawd/keyed/_20081205'/user-keyed.tsv' \
    -output     out/"parsed-$output_id"
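
One thing that may be worth ruling out (an assumption, not a confirmed
diagnosis): inside single quotes bash does not interpret \t, so the two
separator values in the invocation above reach hadoop as a backslash
followed by the letter t rather than as a tab character.  The sketch below
only shows what the shell passes along; how streaming or the partitioner
treats that two-character string is not established here.  The variable
names are made up for illustration.

```shell
#!/usr/bin/env bash
# Compare what the shell hands over for each quoting style.
literal='\t'      # single quotes: backslash + 't', two characters
real_tab=$'\t'    # ANSI-C quoting (bash): one real tab character

# Dump the raw bytes of each value so the difference is visible.
printf '%s' "$literal"  | od -c | head -n 1
printf '%s' "$real_tab" | od -c | head -n 1

literal_len=${#literal}
real_tab_len=${#real_tab}
echo "literal length: $literal_len, tab length: $real_tab_len"
```

If the separator is the culprit, rerunning the job with the separator
options dropped (tab is the streaming default), or written with $'\t',
should make the difference show up in the output.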


When I run the scripts on the command line
  cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb
everything works as I'd like.

I've hunted through the JIRA and found nothing.
If this sounds like a problem with hadoop I'll try to isolate a proper test
case.
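
A minimal sketch of what such a test case could look like, assuming a
two-record tab-separated input is enough to trigger the problem.  The file
name, the record contents and the count_intact helper are placeholders
invented for illustration, not taken from the real data.

```shell
#!/usr/bin/env bash
# Build a tiny tab-separated input: one record with a double backslash,
# one without.  The contents are placeholders, not real data.
workdir=$(mktemp -d)
printf 'profile\tid-0001\thttp://http:\\\\www.example.com\tNJ\n' >  "$workdir/tiny.tsv"
printf 'profile\tid-0002\thttp://www.example.com\tIL\n'          >> "$workdir/tiny.tsv"

# Count the records in a file that still contain an intact double backslash
# (fixed-string match, so the backslashes are taken literally).
count_intact() { grep -cF '\\' "$1"; }

intact_before=$(count_intact "$workdir/tiny.tsv")
echo "records with an intact double backslash: $intact_before"
```

Feeding tiny.tsv through both hadoop invocations and running count_intact
on each job's output would show whether only the partitioner run loses the
double backslash.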

Thanks for any advice,
flip
The output of my script with no secondary sort produces, among a few million 
others, this line correctly:

=========
twitter_user_profile    twitter_user_profile-0000018421-20081205-184526 
0000018421      M...e   http://http:\\www.MyWebsitee.com        S, NJ   I... 
notice.    Eastern Time (US & Canada)      -18000  20081205-184526
=========

When I use a KeyFieldBasedPartitioner, it reaches in and diddles lines with 
backslashes:

=========
twitter_user_profile    twitter_user_profile-0000018421-20081205-184526 
0000018421      M...e   http://http:\   www.MyWebsitee.com      S, NJ   I... 
notice.    Eastern Time (US & Canada)      -18000  20081205-184526
=========

===========================================================================
==
== Script, with partitioner
==

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf    map.output.key.field.separator='\t' \
    -jobconf    num.key.fields.for.partition=1 \
    -jobconf    stream.map.output.field.separator='\t' \
    -jobconf    stream.num.map.output.key.fields=2 \
    -mapper     /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer    /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input      rawd/keyed/_20081205'/user-keyed.tsv' \
    -output     out/"parsed-$output_id" \
    -file       hadoop_utils.rb \
    -file       twitter_flat_model.rb \
    -file       twitter_autourl.rb

== Excerpt of output.  Everything is correct except the url field

twitter_user_profile    twitter_user_profile-0000018441-20081205-024904 
0000018441      G..er   http://www.l... D...    O fun...:-)                     
20081205-024904
twitter_user_profile    twitter_user_profile-0000018441-20081205-084448 
0000018441      S...e                           Eastern Time (US & Canada)      
-18000  20081205-084448 
twitter_user_profile    twitter_user_profile-0000018421-20081205-184526 
0000018421      M...e   http://http:\   www.MyWebsitee.com      S, NJ   I... 
notice.    Eastern Time (US & Canada)      -18000  20081205-184526
twitter_user_profile    twitter_user_profile-0000018481-20081205-030907 
0000018481      J       http://i....com D...    T....   A       43200   
20081205-030907 
twitter_user_profile    twitter_user_profile-0000018401-20081205-010944 
0000018401      O                               London  0       20081205-010944 


== Removing the partitioner...

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper     /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer    /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input      rawd/keyed/_20081205'/user-keyed.tsv' \
    -output     out/"parsed-$output_id" \
    -file       hadoop_utils.rb \
    -file       twitter_flat_model.rb \
    -file       twitter_autourl.rb


== ... leaves output fields unmolested.

twitter_user_profile    twitter_user_profile-0000059832-20081205-184727 
0000059832      m...o                           S       28800   20081205-184727
twitter_user_profile    twitter_user_profile-0000146069-20081205-184637 
0000146069      M...d                                           20081205-184637
twitter_user_profile    twitter_user_profile-0000000069-20081205-184525 
0000000069      M....   http://www.m.....vox.com        S       C.....  Eastern 
Time (US & Canada)      -18000  20081205-184525
twitter_user_profile    twitter_user_profile-0000167822-20081205-184710 
0000167822      M...y                                           20081205-184710
twitter_user_profile    twitter_user_profile-0000117502-20081205-184637 
0000117502      M...g           ""      ""      B       3600    20081205-184637
twitter_user_profile    twitter_user_profile-0000018421-20081205-184526 
0000018421      M...e   http://http:\\www.MyWebsitee.com        S, NJ   I... 
notice.    Eastern Time (US & Canada)      -18000  20081205-184526
twitter_user_profile    twitter_user_profile-0000147671-20081205-184455 
0000147671      M....k  http://www.C...U.com    E, IL   C....,. Central Time 
(US & Canada)      -21600  20081205-184455
twitter_user_profile    twitter_user_profile-0000161375-20081205-184637 
0000161375      M....y                          Q       -18000  20081205-184637
twitter_user_profile    twitter_user_profile-0000142698-20081205-184527 
0000142698      M....r                                          20081205-184527
