[jira] [Created] (HIVE-7853) OrcNewInputFormat

john (JIRA) Fri, 22 Aug 2014 12:20:27 -0700

john created HIVE-7853:
--------------------------

             Summary: OrcNewInputFormat
                 Key: HIVE-7853
                 URL: https://issues.apache.org/jira/browse/HIVE-7853
             Project: Hive
          Issue Type: Bug
          Components: File Formats
    Affects Versions: 0.13.1
         Environment: all
            Reporter: john



The key passed to map() is null when OrcNewInputFormat is used as the input format class

When I use OrcNewInputFormat as the input format class for my MapReduce job, 
the key passed to my map method is always null, so I have no way to get the 
row number there. By comparison, RCFileInputFormat (for RC files) returns the 
row number as the map key, so I know which row I am processing. 

Is there a workaround for getting the row number in my map method? I could, of 
course, count the rows myself, but that has two problems: #1 I have to assume 
the rows arrive in order; #2 I will get duplicated (and wrong) row numbers if 
a large input file is divided into multiple file splits (each split is 
processed by its own map task, possibly on a different data node). At this 
point I am looking for a better way to get the row number of each processed 
row in the map method.
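
To make problem #2 concrete, here is a small illustration of what a 
mapper-local counter produces when the input is divided into splits. It is 
plain Java with no Hadoop dependency; the class name SplitCounterDemo and the 
method localCounts are made up for this example, and the split sizes are 
arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: simulates one counter per map task, one task per split.
class SplitCounterDemo {
    // Returns the "row numbers" a mapper-local counter would assign,
    // in split order. rowsPerSplit[i] is the number of rows in split i.
    static List<Long> localCounts(int[] rowsPerSplit) {
        List<Long> assigned = new ArrayList<>();
        for (int rows : rowsPerSplit) {
            long counter = 0; // every map task starts counting from zero
            for (int i = 0; i < rows; i++) {
                assigned.add(counter++);
            }
        }
        return assigned;
    }
}
```

With two splits of 3 and 2 rows, the counters yield 0, 1, 2, 0, 1, so the 
numbers 0 and 1 are each assigned to two different rows of the file.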

Here is what I have in my map logs:

        [2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Key: (null)
        [2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Value: {Q81510000, T99760000, 699760000, 81567560000, 9667981610000, 978989898980000, Laura, [email protected]}

My map method is:

        protected void map(Object key, Writable value, Context context)
                        throws IOException, InterruptedException {
                logger.debug("Mapper Input Key: " + key);
                logger.debug("Mapper Input Value: " + value.toString());
                .....
        }

The fix should be to add the following statement to the nextKeyValue() method 
and pass its result all the way up to the map() method as the key:

          reader.getRowNumber(); 
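
A rough sketch of that idea, written so it runs without Hadoop or Hive on the 
classpath. RowSource, ListRowSource and RowNumberKeyReader are hypothetical 
stand-ins, not the actual Hive classes; the sketch also assumes that 
getRowNumber() reports the number of the row that the next call to next() 
will return, which is why the key is captured before the reader advances.

```java
import java.util.List;

// Hypothetical stand-in for the underlying ORC row reader.
interface RowSource {
    boolean hasNext();
    String next();
    long getRowNumber(); // assumed: number of the row next() will return
}

// In-memory implementation so the sketch is runnable on its own.
class ListRowSource implements RowSource {
    private final List<String> rows;
    private final long firstRow; // file-level row number of rows.get(0)
    private int pos = 0;

    ListRowSource(List<String> rows, long firstRow) {
        this.rows = rows;
        this.firstRow = firstRow;
    }

    public boolean hasNext() { return pos < rows.size(); }
    public String next() { return rows.get(pos++); }
    public long getRowNumber() { return firstRow + pos; }
}

// Shaped like a new-API record reader, but the key is the row number
// instead of null.
class RowNumberKeyReader {
    private final RowSource reader;
    private long key = -1;
    private String value;

    RowNumberKeyReader(RowSource reader) { this.reader = reader; }

    boolean nextKeyValue() {
        if (!reader.hasNext()) {
            return false;
        }
        key = reader.getRowNumber(); // capture before next() advances
        value = reader.next();
        return true;
    }

    long getCurrentKey() { return key; }
    String getCurrentValue() { return value; }
}
```

Because the number comes from the reader rather than from a counter in the 
mapper, a split that starts at row 100 of the file would hand map() the keys 
100, 101, ... instead of restarting at zero.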



--
This message was sent by Atlassian JIRA
(v6.2#6252)
