[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967670#comment-15967670
 ] 

Adam Szita commented on PIG-5135:
---------------------------------

[~kellyzly] I've checked this. {{assertEquals(30,
inputStats.get(0).getBytes());}} is fine, but {{assertEquals(18,
inputStats.get(1).getBytes());}} fails because Spark returns -1 here. The plan
generated for Spark consists of 4 jobs, the last one being responsible for the
replicated join. That job does 3 loads, so SparkPigStats reports its bytes
read as -1. (Even after adding together the bytes from all load ops in this
job, I got a different result than 18.) I guess compression is also at work
here when the tmp file parts are generated, which further alters the number of
bytes being read.
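
To illustrate the behaviour described above, here is a minimal sketch. It is
not the actual SparkPigStats code: the class name, method name and the
per-load byte values are hypothetical, and the rule "report -1 unless the job
has exactly one load" is an assumption inferred from what the test shows.

```java
import java.util.List;

public class InputBytesSketch {
    /** Sentinel for "bytes read unknown", matching the -1 seen in the test. */
    static final long UNKNOWN = -1L;

    /**
     * Hypothetical helper: returns the bytes read for a job only when the job
     * has exactly one load. Any other load count (assumed to include zero)
     * yields UNKNOWN, because a per-job total cannot be attributed to a
     * single input.
     */
    static long bytesReadForJob(List<Long> bytesPerLoad) {
        if (bytesPerLoad.size() != 1) {
            return UNKNOWN;
        }
        return bytesPerLoad.get(0);
    }

    public static void main(String[] args) {
        // Single-load job: the byte count can be reported as-is.
        System.out.println(bytesReadForJob(List.of(30L)));
        // Replicated-join job with 3 loads (made-up values): reported as -1.
        System.out.println(bytesReadForJob(List.of(30L, 18L, 12L)));
    }
}
```

Under these assumptions the first call prints 30 and the second prints -1,
which mirrors why the second assertion in the test cannot hold in Spark mode.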

I would say we should leave the exclusion for Spark as is, but update the
comment section, since the reason we don't get the expected numbers is now a
different one. What do you think?

> HDFS bytes read stats are always 0 in Spark mode
> ------------------------------------------------
>
>                 Key: PIG-5135
>                 URL: https://issues.apache.org/jira/browse/PIG-5135
>             Project: Pig
>          Issue Type: Bug
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5135.0.patch, PIG-5135.1.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
