Hey folks,

I'm a brand new user and I'm working on my first 'real' script. The idea is
to count web traffic hits by day, user, and URL. At the end I want to join in
some account information for each user. I'm running into an issue and I'm not
sure how to go about debugging it.

The sso_to_account.csv file is basically user,account_number\n, and web_data
is a TSV file with 255 columns. I built a .pig_schema file for that file
and placed it alongside the data.
I have one compressed file for each day of data. Picking the first day of
data and running it alone produces the correct output. But running the
script against all of the files:

pig -f pig_scripts/web_users_by_day.pig

results in failed jobs with the following stack trace:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:116)
        at org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:280)
        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:244)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
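
The trace shows PigStorage.applySchema asking for field 0 of a tuple that has
no fields, so one thing worth ruling out is rows that do not have the expected
255 columns (an empty line, for example). That is a guess about the cause, not
something confirmed. A minimal shell sketch of such a check, assuming the data
is available as plain tab-separated text on stdin; count_bad_rows is a
hypothetical helper name and the sample rows are made up:

```shell
# count_bad_rows is a hypothetical helper, not part of Pig or Hadoop.
# It reads tab-separated text on stdin and prints how many rows have a
# field count different from the expected number given as $1.
count_bad_rows() {
    expected="$1"
    awk -F'\t' -v n="$expected" 'NF != n { bad++ } END { print bad + 0 }'
}

# Made-up sample with 3 expected columns:
# one good row, one empty line, one row that is a column short.
sample_bad=$(printf 'a\tb\tc\n\nx\ty\n' | count_bad_rows 3)
echo "$sample_bad"   # prints 2
```

On the cluster, something like piping the output of hadoop fs -text for one
day's file into count_bad_rows 255 could feed it, assuming hadoop fs -text can
decode the LZO files with the configured codec.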

I found the following bug in JIRA, but it doesn't seem related:
https://issues.apache.org/jira/browse/PIG-3051

This issue looks much more relevant, but the comments say it is resolved:
https://issues.apache.org/jira/browse/PIG-2127

I removed any extra Windows-style carriage returns with the following job
before running the script:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.1.2.24.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
    -D mapred.reduce.tasks=0 \
    -mapper "tr -d '\r'" \
    -reducer NONE \
    -input /user/jjaggars/web_data/in \
    -output /user/jjaggars/web_data/clean
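
To make clear what the mapper in that job does to a single record, here is a
small local sketch. strip_cr is a hypothetical wrapper around the same tr
invocation, and the sample row is made up:

```shell
# strip_cr is a hypothetical wrapper around the tr command used as the
# streaming mapper: it deletes every carriage return from stdin.
strip_cr() {
    tr -d '\r'
}

# A made-up two-field row with a Windows-style CRLF ending is 5 bytes:
# a, tab, b, CR, LF. After cleaning, only the CR is gone.
cleaned_bytes=$(printf 'a\tb\r\n' | strip_cr | wc -c | tr -d ' ')
echo "$cleaned_bytes"   # prints 4
```

Note that this only removes carriage returns. An empty line in the input is
still an empty line in the output.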

Here is my (slightly sanitized) script:

accounts = LOAD '/user/jjaggars/sso_to_account.csv' USING PigStorage(',')
    AS (user:chararray, account:chararray);
web_data = LOAD '/user/jjaggars/web_data/clean/*.lzo'
    USING PigStorage('\t', 'schema');
logged_in = FILTER web_data BY evar37 is not null AND date_time is not null
    AND evar23 is not null;
working_set = FOREACH logged_in GENERATE SUBSTRING(date_time, 0, 11) AS date,
    REPLACE(evar37, '"', '') AS user, evar23 AS url;
by_day = GROUP working_set BY (date, user, url);
hits_by_day = FOREACH by_day GENERATE FLATTEN(group) as (date, user, url),
    COUNT(working_set) AS hits;
hits_with_account = JOIN hits_by_day BY user, accounts BY user;
final = FOREACH hits_with_account GENERATE hits_by_day::date,
    hits_by_day::user, hits_by_day::url, accounts::account, hits_by_day::hits;
STORE final INTO 'hits_by_day' USING PigStorage();
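
As a cross-check for the GROUP and COUNT steps on a small sample, the same
counting can be mirrored outside Pig. This is only a sketch: it assumes the
input has already been reduced to tab-separated date, user, url columns, and
both count_hits and the sample rows are hypothetical:

```shell
# count_hits is a hypothetical helper, not part of Pig.
# stdin:  tab-separated date, user, url
# stdout: date, user, url, hits (one line per distinct triple, sorted)
count_hits() {
    awk -F'\t' -v OFS='\t' '
        { hits[$1 FS $2 FS $3]++ }
        END { for (k in hits) print k, hits[k] }
    ' | sort
}

# Made-up sample: two hits for one triple, one hit for another.
sample_hits=$(printf '2013-08-01\talice\t/a\n2013-08-01\talice\t/a\n2013-08-01\tbob\t/b\n' | count_hits)
echo "$sample_hits"
```

Comparing its output with the first four columns that Pig produces for the
same sample (before the account join) would show whether the counts agree.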

Here's some version info:

$ pig --version
Apache Pig version 0.11.2-SNAPSHOT (r: unknown)
compiled Aug 02 2013, 11:28:54

$ hadoop version
Hadoop 1.1.2.24
Subversion  -r
Compiled by jenkins on Fri May 17 21:33:29 EDT 2013
From source with checksum c531493fc3ba97aab8691c08c2ddaed1
