[
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401828#comment-13401828
]
Jie Li commented on PIG-2661:
-----------------------------
Some benchmark results using the 1GB TPC-H lineitem table:
||query||trunk||this patch||
||load-orderby-store| 1m41s (load) + 53s (sample) + 3m11s (orderby) | 38s (sample) + 3m27s (orderby) |
||load-orderby-filter-store| 41s (load) + 32s (sample) + 35s (orderby) | 38s (sample) + 50s (orderby) |
Note that the filter is very selective, yet we didn't see any slowdown of the
sample job. The slight slowdown of the orderby job might result from different
serialization. In both queries, we save one entire load job.
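
To compare end-to-end times, the per-job timings in the table can be summed. The short Python sketch below does only that arithmetic; the helper names (to_seconds, total) are made up for illustration and are not part of Pig or the patch.

```python
import re


def to_seconds(duration):
    """Convert a duration such as '1m41s' or '53s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", duration)
    if not duration or match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


def total(stages):
    """Sum a list of per-job durations, in seconds."""
    return sum(to_seconds(d) for d in stages)


# Per-job timings copied from the table above.
timings = {
    "load-orderby-store": {
        "trunk": ["1m41s", "53s", "3m11s"],  # load + sample + orderby
        "patch": ["38s", "3m27s"],           # sample + orderby
    },
    "load-orderby-filter-store": {
        "trunk": ["41s", "32s", "35s"],
        "patch": ["38s", "50s"],
    },
}

for query, runs in timings.items():
    trunk, patch = total(runs["trunk"]), total(runs["patch"])
    print(f"{query}: trunk={trunk}s patch={patch}s saved={trunk - patch}s")
```

The totals come to 345s vs 245s for load-orderby-store and 108s vs 88s for load-orderby-filter-store.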
But another issue just came to mind: though the distribution won't change,
the number of samples might change after the pipeline. If the pipeline
decreases #records (e.g. filter/limit/sample), then we'll have fewer samples at
the end, but we'll also have a smaller order-by which doesn't need as many
samples. If the pipeline increases #records (e.g. flatten/stream), then we may
end up with many samples at the end, which is likely to hurt performance.
Therefore let's just disable the sample optimization if we find these
"exploding" pipeline operators. (What else besides flatten/stream?)
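
A rough sketch of the proposed check, in Python for brevity. It is not the patch's code: the operator names are plain strings rather than Pig's plan classes, and the function name and the operator set are assumptions that cover only the two operators mentioned so far.

```python
# Hypothetical set of operators that can emit more records than they consume.
# Only flatten and stream are listed because those are the ones named above.
EXPLODING_OPERATORS = {"flatten", "stream"}


def can_use_sample_optimization(pipeline):
    """Return True if no operator in the pipeline can increase #records.

    `pipeline` is a list of operator names between the load and the order-by.
    """
    return not any(op.lower() in EXPLODING_OPERATORS for op in pipeline)


# A filter only shrinks the input, so the optimization stays enabled.
print(can_use_sample_optimization(["filter", "limit"]))
# A flatten may explode the input, so the optimization is disabled.
print(can_use_sample_optimization(["filter", "flatten"]))
```

Keeping the operators in a single set makes it cheap to add more once the open question about other exploding operators is settled.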
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
> Key: PIG-2661
> URL: https://issues.apache.org/jira/browse/PIG-2661
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Jie Li
> Assignee: Jie Li
> Attachments: PIG-2661.0.patch, PIG-2661.1.patch
>
>
> See
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155