Hi Daniel,

Thanks for pointing out PIG-2346. However, what happens if the user decides to rename some of the fields using the 'as' clause? We have the same problem, i.e., a 'foreach' is generated. As a heuristic, perhaps synthesized operators should be marked as such. This way, pig can skip synthesized operators when trying to match the sequence 'load; filter'. Another alternative is to create a new keyword, say 'where', to be used for specifying partitions. E.g.,

A = load 'daily_activity' from HiveLoader where date_partition >= 20110101 and date_partition <= 20110201;

stan

On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <[email protected]> wrote:
> Hi, Stan,
> Foreach is inserted only if you have "as" in "load" statement. This is to
> assure the data loaded conforms with "as" clause. At some point there is a
> bug in implementation, this should be fixed in PIG-2346 and will be
> included in all subsequent releases.
>
> Thanks,
> Daniel
>
> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
> [email protected]> wrote:
>
>> Howdy All,
>>
>> I am resurrecting my previous message sent to the list on Dec. 7. Let
>> me first summarize. In a nutshell, as far as I can tell,
>> partition-aware loading is broken
>> in pig, and the culprit is PIG-1188 wherein the final decision was to
>> introduce project & cast, i.e, foreach, after load. There are two
>> problems with that approach.
>> First, as indicated in my original message, 'getPartitionKeys' is
>> never invoked because instead of the expected instruction sequence
>> 'load; filter', PIG-1188
>> changed it to 'load; foreach; filter'. Second, if a loader already
>> happens to project & cast in order to adhere the data to the schema,
>> then the foreach synthesized
>> by pig is a waste of time.
>>
>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
>> partition filters to work; this enabled us to implement a HiveLoader
>> very much like
>> HCatLoader which incidentally is also broken for the very same reason.
>> This is obviously a hack and a real solution is needed.
>> If the decision made in PIG-1188 cannot be re-considered, then I
>> suggest that we revisit the logic which is used to pass partition
>> filters to partition-aware loaders.
>>
>> Many thanks!
>>
>> stan
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Stan Rosenberg <[email protected]>
>> Date: Wed, Dec 7, 2011 at 12:24 PM
>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>> To: [email protected]
>>
>>
>> Hi,
>>
>> I am trying to implement a loader which is partition-aware. As
>> prescribed, my loader implements LoadMetadata, however,
>> getPartitionKeys is never invoked.
>> The script is of this form:
>>
>> X = LOAD 'input' USING MyLoader();
>> X = FILTER X BY partition_col == 'some_string';
>>
>> and the schema returned by MyLoader.getSchema includes the column
>> 'partition_col' which is of type 'chararray'.
>>
>>
>> After debugging pig, I have found what appears to be a bug in the new
>> code (version 0.10 snapshot and also in 0.9.1). The reason
>> MyLoader.getPartitionKeys is never invoked is due to the wrongfully
>> inserted
>> 'foreach' after the 'load' and before the 'filter'. The code in
>> TypeCastInserterTransformer.check used to return 'false' if the
>> schemas matched or all fields were of type 'bytearray'; cf. pig
>> version 0.8.1.
>> Effectively, the above script gets transformed into:
>>
>> X = LOAD 'input' USING MyLoader();
>> X = FOREACH X GENERATE ...;
>> X = FILTER X BY partition_col == 'some_string';
>>
>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>> the immediate successor of 'load' is _not_ 'filter', whence
>> getPartitionKeys is never invoked.
>>
>> Any suggestions?
>>
>> Thanks,
>>
>> stan
>>
>> P.S. While in the above case the 'foreach' can be avoided, in general
>> typecasting may need to be performed if the user-provided schema does
>> not match the one returned by the loader.
>> I think the general case needs to be handled correctly, perhaps by
>> ignoring all synthetic operators after the 'load'. (This is just a
>> wild guess.)
>>
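
For illustration only, the heuristic discussed in this thread (mark synthesized operators and skip them when matching 'load; filter') could be sketched roughly as below. This is a toy model under stated assumptions, not Pig's code: the class PartitionFilterMatchSketch, the Op type, the 'synthesized' flag, and filterFollowsLoad are all hypothetical names, and the plan is simplified to a linear list of operators.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration; these are NOT Pig's logical-plan classes.
final class PartitionFilterMatchSketch {

    enum Kind { LOAD, FOREACH, FILTER }

    // One operator in a linear plan. 'synthesized' is an assumed flag marking
    // operators the compiler inserted itself (e.g. the project-and-cast
    // foreach placed after a load).
    static final class Op {
        final Kind kind;
        final boolean synthesized;

        Op(Kind kind, boolean synthesized) {
            this.kind = kind;
            this.synthesized = synthesized;
        }
    }

    // Returns true if the plan starts with a load and the first
    // non-synthesized operator after it is a filter.
    static boolean filterFollowsLoad(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return false;
        }
        for (Op op : plan.subList(1, plan.size())) {
            if (op.synthesized) {
                continue; // skip compiler-inserted operators
            }
            return op.kind == Kind.FILTER;
        }
        return false; // only synthesized operators follow the load
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(
            new Op(Kind.LOAD, false),
            new Op(Kind.FOREACH, true),   // inserted by the compiler
            new Op(Kind.FILTER, false));
        System.out.println(filterFollowsLoad(plan));
    }
}
```

In this sketch a foreach written by the user (synthesized = false) still blocks the match, so only compiler-inserted operators are ignored. Whether a real optimizer rule could safely push a filter past a cast in this way would need to be checked against the actual semantics of the inserted foreach.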
