Great, thanks! It helped.
2013/7/23 Pradeep Gollakota <[email protected]>

> You can do the SPLIT outside the nested FOREACH. I'm assuming you have a UDF
> defined for VALID.
>
> So, your script can be written as:
>
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped {
>     ordered = ORDER rawRecords BY ts;
>     GENERATE group as group_key, ordered as data;
> };
> SPLIT validAndNotValidRecords INTO validRecords IF VALID(data),
>     invalidRecords OTHERWISE;
>
>
> On Tue, Jul 23, 2013 at 8:58 AM, Serega Sheypak <[email protected]> wrote:
>
> > Omg, thanks, it's exactly the thing I need.
> >
> > I can't do it before GROUP. I need to group by key, then sort by the
> > timestamp field inside each group.
> > After the sort is done I can determine the non-valid records.
> > I've provided a simplified case.
> >
> > The only problem is that SPLIT is not allowed in a nested FOREACH
> > statement.
> >
> >
> > 2013/7/23 Pradeep Gollakota <[email protected]>
> >
> > > You can use the SPLIT operator to split a relation into two (or more)
> > > relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> > >
> > > Also, you should probably do this before GROUP. As a best practice (and
> > > general Pig optimization strategy), you should filter (and project)
> > > early and often.
> > >
> > >
> > > On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[email protected]> wrote:
> > >
> > > > Hi, I have a rather simple problem and I can't come up with a nice
> > > > solution.
> > > > Here is my input:
> > > > msisdn  longitude  latitude  ts
> > > > 1       20.30      40.50     123
> > > > 1       0.0        null      456
> > > > 2       60.70      34.67     678
> > > > 2       null       null      978
> > > >
> > > > I need to:
> > > > group by msisdn
> > > > order by ts inside each group
> > > > filter records in each group:
> > > > 1. put all records where longitude, latitude are valid on one side
> > > > 2. put all records where longitude/latitude = 0.0/null on the other
> > > > side
> > > >
> > > > Here is the Pig pseudo-code:
> > > > rawRecords = LOAD '/data' as ...;
> > > > grouped = GROUP rawRecords BY msisdn;
> > > > validAndNotValidRecords = FOREACH grouped {
> > > >     ordered = ORDER rawRecords BY ts;
> > > >     -- do something here to filter valid and not valid records...
> > > > }
> > > > STORE notValidRecords INTO '/not_valid_data';
> > > >
> > > > someOtherProjection = GROUP validRecords BY msisdn;
> > > > -- continue to work with filtered valid records...
> > > >
> > > > Can I do it in a single Pig script, or do I need to create two scripts:
> > > > the first one would filter the non-valid records and store them
> > > > the second one would continue to process the filtered set of records?
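
For readers following the thread, the intended result (group by msisdn, order by ts inside each group, then separate valid from non-valid records) can be modelled outside Pig. The following is only a minimal Python sketch, not the VALID UDF mentioned above, which was never shown. It assumes a record is valid only when both longitude and latitude are present and non-zero, and that validity is decided per record rather than per group. The names RAW_RECORDS, is_valid and split_by_validity are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the original question: (msisdn, longitude, latitude, ts).
# None stands in for Pig's null.
RAW_RECORDS = [
    (1, 20.30, 40.50, 123),
    (1, 0.0, None, 456),
    (2, 60.70, 34.67, 678),
    (2, None, None, 978),
]


def is_valid(record):
    """Assumed rule: both coordinates must be present and non-zero."""
    _, longitude, latitude, _ = record
    return all(value is not None and value != 0.0
               for value in (longitude, latitude))


def split_by_validity(records):
    """Group by msisdn, order by ts within each group, then split per record."""
    valid, invalid = {}, {}
    # Sorting by (msisdn, ts) makes groupby see each key once, already ordered.
    ordered = sorted(records, key=itemgetter(0, 3))
    for msisdn, group in groupby(ordered, key=itemgetter(0)):
        for record in group:
            target = valid if is_valid(record) else invalid
            target.setdefault(msisdn, []).append(record)
    return valid, invalid


valid_records, invalid_records = split_by_validity(RAW_RECORDS)
```

Both outputs come from a single pass over the grouped data, which mirrors the single-script approach suggested in the thread: one side can be stored while the other continues through further processing.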
