It looks like one of your files is not being parsed correctly. By default, PigStorage assumes your file is tab-delimited.

On 03.08.2013 at 2:49, "Jesse Jaggars" <[email protected]> wrote:
> Hey folks,
>
> I'm a brand new user and I'm working on my first 'real' script. The idea is
> to count web traffic hits by day, user, and url. At the end I want to join
> some account information for each user. I'm running into an issue and I'm
> not sure how to go about debugging my work.
>
> The sso_to_account.csv is basically user,account_number\n and the web_data
> is a TSV file with 255 columns. I built a .pig_schema file for that file
> and placed it alongside the data. I have one compressed file for each day
> of data. Picking the first day of data and running it alone produces the
> correct output. But running the following:
>
> pig -f pig_scripts/web_users_by_day.pig
>
> results in failed jobs with the following stack trace:
>
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:116)
>     at org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:280)
>     at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:244)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I found the following bug on jira, but it doesn't seem related:
> https://issues.apache.org/jira/browse/PIG-3051
>
> This issue looks much more relevant:
> https://issues.apache.org/jira/browse/PIG-2127, but the comments say it is
> resolved.
>
> I removed any extra Windows-style carriage returns with the following job
> before running the script:
>
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.1.2.24.jar \
>     -D mapred.output.compress=true \
>     -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
>     -D mapred.reduce.tasks=0 \
>     -mapper "tr -d '\r'" -reducer NONE \
>     -input /user/jjaggars/web_data/in \
>     -output /user/jjaggars/web_data/clean
>
> Here is my (slightly sanitized) script:
>
> accounts = LOAD '/user/jjaggars/sso_to_account.csv' USING PigStorage(',')
>     AS (user:chararray, account:chararray);
> web_data = LOAD '/user/jjaggars/web_data/clean/*.lzo' USING
>     PigStorage('\t', 'schema');
> logged_in = FILTER web_data BY evar37 is not null AND date_time is not null
>     AND evar23 is not null;
> working_set = FOREACH logged_in GENERATE SUBSTRING(date_time, 0, 11) AS date,
>     REPLACE(evar37, '"', '') AS user, evar23 AS url;
> by_day = GROUP working_set BY (date, user, url);
> hits_by_day = FOREACH by_day GENERATE FLATTEN(group) AS (date, user, url),
>     COUNT(working_set) AS hits;
> hits_with_account = JOIN hits_by_day BY user, accounts BY user;
> final = FOREACH hits_with_account GENERATE hits_by_day::date,
>     hits_by_day::user, hits_by_day::url, accounts::account, hits_by_day::hits;
> STORE final INTO 'hits_by_day' USING PigStorage();
>
> Here's some version info:
>
> $ pig --version
> Apache Pig version 0.11.2-SNAPSHOT (r: unknown)
> compiled Aug 02 2013, 11:28:54
>
> $ hadoop version
> Hadoop 1.1.2.24
> Subversion -r
> Compiled by jenkins on Fri May 17 21:33:29 EDT 2013
> From source with checksum c531493fc3ba97aab8691c08c2ddaed1
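For what it's worth, one way to find the file that isn't parsing as expected is to count the tab-separated fields on each line and flag rows that don't match the schema's column count. A sketch (the sample file and the 3-column count below are made up for the demo; for your data you'd decompress each day's file, e.g. with lzop, and check against 255):

```shell
# Flag rows whose tab-separated field count doesn't match the schema.
# Demo data: 3 expected columns; line 2 has no tabs at all, so it
# splits into a single field and gets reported.
printf 'a\tb\tc\nno tabs here\nx\ty\tz\n' > sample.tsv
awk -F'\t' 'NF != 3 { printf "line %d: %d fields\n", NR, NF }' sample.tsv
# prints: line 2: 1 fields
```

Running that per input file should point at the file (or the specific rows) PigStorage is choking on.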
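It could also be worth ruling out blank lines in the cleaned output: a blank line becomes a tuple with zero fields, and "Index: 0, Size: 0" is what you'd see when a schema is applied to such a tuple. While you're at it, you can confirm the tr pass really removed every carriage return. A hedged sketch (the file below is a throwaway stand-in for your decompressed data):

```shell
# Count empty lines and lines still carrying a carriage return; either
# can yield tuples with fewer fields than the schema expects.
printf 'a\tb\n\nx\ty\r\n' > clean_sample.tsv
grep -c '^$' clean_sample.tsv             # empty lines -> 1
CR=$(printf '\r')
grep -c "$CR" clean_sample.tsv            # lines with \r -> 1
```

If either count is nonzero for any day's file, that file is the likely culprit.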
