Hi, I am trying to implement a partition-aware loader. As prescribed, my loader implements LoadMetadata; however, getPartitionKeys is never invoked. The script is of this form:

    X = LOAD 'input' USING MyLoader();
    X = FILTER X BY partition_col == 'some_string';

The schema returned by MyLoader.getSchema includes the column 'partition_col', which is of type 'chararray'.

After debugging Pig, I have found what appears to be a bug in the new code (version 0.10 snapshot and also 0.9.1). MyLoader.getPartitionKeys is never invoked because a 'foreach' is wrongly inserted after the 'load' and before the 'filter'. The code in TypeCastInserterTransformer.check used to return 'false' if the schemas matched or all fields were of type 'bytearray' (cf. Pig version 0.8.1). Effectively, the above script gets transformed into:

    X = LOAD 'input' USING MyLoader();
    X = FOREACH X GENERATE ...;
    X = FILTER X BY partition_col == 'some_string';

Subsequently, PartitionFilterPushDownTransformer.check observes that the immediate successor of 'load' is _not_ 'filter', so getPartitionKeys is never invoked.

Any suggestions?

Thanks,
stan

P.S. While the 'foreach' can be avoided in the above case, in general typecasting may need to be performed if the user-provided schema does not match the one returned by the loader. I think the general case needs to be handled correctly, perhaps by ignoring all synthetic operators after the 'load'. (This is just a wild guess.)
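
To make the guess in the P.S. a bit more concrete, here is a minimal Java sketch that contrasts the two checks on a toy, linear plan. Everything in it is hypothetical: the class PushDownSketch, the Op type, and the 'synthetic' flag are stand-ins invented for illustration, not Pig's actual logical plan classes or API. It also assumes that skipping a synthetic 'foreach' is safe, which would only hold if the partition column passes through it unchanged.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a linear logical plan. These are NOT Pig's real classes.
final class PushDownSketch {
    enum Kind { LOAD, FOREACH, FILTER }

    // 'synthetic' is a hypothetical flag marking operators that the optimizer
    // inserted itself (e.g. a typecasting foreach), as opposed to user-written ones.
    static final class Op {
        final Kind kind;
        final boolean synthetic;
        Op(Kind kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    // Behaviour described in the post: only the immediate successor
    // of the load is inspected.
    static boolean strictCheck(List<Op> plan) {
        return plan.size() > 1
                && plan.get(0).kind == Kind.LOAD
                && plan.get(1).kind == Kind.FILTER;
    }

    // Guess from the P.S.: skip synthetic foreach operators after the
    // load, then look for the filter.
    static boolean lenientCheck(List<Op> plan) {
        if (plan.isEmpty() || plan.get(0).kind != Kind.LOAD) {
            return false;
        }
        int i = 1;
        while (i < plan.size()
                && plan.get(i).kind == Kind.FOREACH
                && plan.get(i).synthetic) {
            i++;
        }
        return i < plan.size() && plan.get(i).kind == Kind.FILTER;
    }

    public static void main(String[] args) {
        // load -> synthetic foreach -> filter, as in the transformed script.
        List<Op> plan = Arrays.asList(
                new Op(Kind.LOAD, false),
                new Op(Kind.FOREACH, true),
                new Op(Kind.FILTER, false));
        System.out.println("strict:  " + strictCheck(plan));
        System.out.println("lenient: " + lenientCheck(plan));
    }
}
```

In this toy model the strict check rejects the transformed plan while the lenient one accepts it; a user-written 'foreach' (synthetic == false) still blocks both checks, since it may change the column the filter refers to.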
