Hi,

Spark just prints the #bytes it accumulates from InputSplit#getLength in the web UI (which is simply the length of the files). Therefore, I'm afraid this metric does not reflect the actual #bytes read for Parquet. If you want to get that metric, you need to use other tools such as iostat or something similar.
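If it helps, here is a rough sketch of how you could log the same value the UI shows from a SparkListener (the listener name is just an example, and I'm assuming the Spark 1.6.x API where inputMetrics is an Option); note that for a Parquet scan the bytesRead it reports still comes from the split length, so real disk I/O needs OS-level tools:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs the per-task input metrics backing the "Input Size / Records"
// column in the web UI. For a Parquet scan, bytesRead is derived from
// the split length (the file size), not the bytes actually read from disk.
class InputSizeListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // In Spark 1.6.x, inputMetrics is an Option; it is empty for tasks
    // that read no external input.
    for (m <- Option(taskEnd.taskMetrics); in <- m.inputMetrics) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"bytesRead=${in.bytesRead} recordsRead=${in.recordsRead}")
    }
  }
}

// Register it before running the query:
// sc.addSparkListener(new InputSizeListener)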
// maropu

On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <dennis.hunzi...@gmail.com> wrote:
> Hi all
>
> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
> find out about the improvements made in filtering/scanning parquet files
> when querying for tables using SparkSQL and how these changes relate to
> the new filter API introduced in Parquet 1.7.0.
>
> After checking the usual sources, I still can’t make sense of some of
> the numbers shown on the Spark UI. As an example, I’m looking at the
> collect stage for a query that’s selecting a single row from a table
> containing 1 million numbers using a simple where clause (i.e. col1 =
> 500000) and this is what I see on the UI:
>
> 0 SUCCESS ... 2.4 MB (hadoop) / 0
> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
> 2 SUCCESS ... 2.4 MB (hadoop) / 0
> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>
> Based on the min/max statistics of each of the parquet parts, it makes
> sense not to expect any records for 3 out of the 4, because the record
> I’m looking for can only be in a single file. But why is the input size
> above shown as 2.4 MB, totaling up to an overall input size of 9.7 MB
> for the whole stage? Isn't it just meant to read the metadata and ignore
> the content of the file?
>
> Regards,
>
> Dennis

--
---
Takeshi Yamamuro
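For reference, a rough way to reproduce the kind of setup in the quoted question with Spark 1.6's SQLContext (the path and column name are just examples): range(..., 4) writes four part files with disjoint min/max statistics, and explain() should show the predicate pushed into the Parquet scan even though the UI still reports the full file sizes as input.

// 1 million longs in 4 contiguous ranges, so each Parquet part file gets
// disjoint min/max column statistics (similar to the 4 parts above).
val df = sqlContext.range(0L, 1000000L, 1L, 4).toDF("col1")
df.write.mode("overwrite").parquet("/tmp/numbers_parquet")

// Row-group filtering relies on this flag (true by default since 1.5.0).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val q = sqlContext.read.parquet("/tmp/numbers_parquet").filter("col1 = 500000")

// The physical plan should list the pushed-down predicate on the scan,
// e.g. something like PushedFilters: [EqualTo(col1,500000)].
q.explain()
q.show()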