[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

Will Lauer (JIRA) Thu, 18 Jan 2018 14:01:35 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331309#comment-16331309
 ]


Will Lauer commented on PIG-4608:
---------------------------------

Ok, just to close the loop, here are several examples given the new proposed 
syntax. I want to make sure I understand which are correct and what the 
behavior is in each case.

```
/* simple projection, specifying resulting schema, using both explicit column 
names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
-- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray);
-- flattening bags into multiple rows
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y;

/* complex projection, specifying resulting schema, using both explicit column 
names and positions */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1,$2);
    GENERATE q as x:long, r as y;
}

/* simple update */
a = FOREACH b UPDATE q with r+s;

/* complex update */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1, $2);
    UPDATE qprime WITH q, rprime WITH r;
}

/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;

/* simple renaming of a column */
a = FOREACH b UPDATE q as r;

/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
-- This should be illegal, right? If the type is changed, an explicit
-- modification of the value should occur.
a = FOREACH b UPDATE q AS q:int;

/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;

/* simple column drop */
-- drops columns q, r, and whatever column is at position $5 (the sixth
-- column, since positional references are zero-based)
a = FOREACH b DROP q,r,$5;
-- This should be illegal, right? No types should be present in a DROP statement
a = FOREACH b DROP q:int;
 
/* updating an individual field within a tuple - not implemented in the initial 
version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s; 

/* renaming an individual field within a tuple - not implemented in the initial 
version */
-- has the result of renaming the field within q.$1, not renaming q or $1
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN;

/* flattening a tuple into existing fields - does this make sense?*/
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
-- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t);
-- re-typing arguments as part of flattening
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long);

/* flattening a bag into existing fields, exploding rows in the process -- does 
this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
-- rename field and possibly retype as part of the flatten
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int;
```

While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as 
a Pig script writer. I'd love to have [~kpriceyahoo] weigh in on the proposal 
to confirm that it still makes sense to heavy Pig script writers.

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>            Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.
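
For comparison, the example in the quoted description can be written with the 
existing FOREACH ... GENERATE by listing every pass-through field by hand. The 
following is a minimal sketch, not part of the proposal; it reuses the relation 
and field names from the description and assumes the same 'input_data' file. 
It is intended to yield the same rows as the dump shown in the description, 
since f1+f2 is evaluated against the input row rather than the new constant.

```
-- Load the data, as in the description
three_numbers = LOAD 'input_data'
    USING PigStorage()
    AS (f1:int, f2:int, f3:int);

-- Existing syntax: the pass-through fields f2 and f3 must be listed explicitly.
-- f1 + f2 reads the input row's f1, not the constant 5 assigned to the output.
updated = FOREACH three_numbers GENERATE
    5 AS f1,
    f2,
    f3,
    f1 + f2 AS new_sum;

DUMP updated;
```

With three fields the extra listing is small; with the 20-200 fields mentioned 
in the description, the pass-through list is what the proposed UPDATE form 
would remove.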



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
