[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331309#comment-16331309 ]
Will Lauer commented on PIG-4608:
---------------------------------

Ok, just to close the loop, here are several examples given the new proposed syntax. I want to make sure I understand which are correct and what the behavior is in each case.

```
/* simple projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray); -- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y; -- flattening bags into multiple rows

/* complex projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1,$2);
    GENERATE q as x:long, r as y;
}

/* simple update */
a = FOREACH b UPDATE q with r+s;

/* complex update */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1, $2);
    UPDATE qprime WITH q, rprime WITH r;
}

/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;

/* simple renaming of a column */
a = FOREACH b UPDATE q as r;

/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
a = FOREACH b UPDATE q AS q:int; -- This should be illegal, right? If the type is changed, an explicit modification of the value should occur

/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;

/* simple column drop */
a = FOREACH b DROP q,r,$5; -- drops columns q, r, and whatever column is at position $5
a = FOREACH b DROP q:int; -- This should be illegal, right? No types should be present in a DROP statement

/* updating an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s;

/* renaming an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN; -- has the result of renaming the field within q.$1, not renaming q or $1

/* flattening a tuple into existing fields - does this make sense? */
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t); -- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long); -- re-typing arguments as part of flattening

/* flattening a bag into existing fields, exploding rows in the process -- does this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int; -- rename field and possibly retype as part of the flatten
```

While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as a Pig script writer. I'd love to have [~kpriceyahoo] weigh in on the proposal to ensure it still makes sense to heavy Pig script writers.

> FOREACH ... UPDATE
> ------------------
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
> Issue Type: New Feature
> Reporter: Haley Thrapp
> Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias.
> Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we only need to make
> modifications to a few fields. The FOREACH ... UPDATE statement allows the
> developer to focus on the actual logical changes instead of having to list
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)