Thanks, that makes sense. What I wonder, though, is this: if we use Parquet metadata caching, Spark should be able to execute queries much faster over a large number of small .parquet files than over a small number of large ones, at least as long as the min/max indexing is effective (i.e. the data is grouped/ordered). However, I'm not seeing this in my tests because of the overhead of creating tasks for all these small files, most of which end up doing nothing at all. Is it possible to prevent that? I assume only if the driver were able to inspect the cached metadata and avoid creating tasks for files that aren't needed in the first place.
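For context, here is a minimal sketch (Scala, Spark 1.6) of the kind of query I mean; the path /data/numbers is just a placeholder, and col1 = 500000 matches the example further down in the thread:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc and Parquet files under the
// hypothetical path /data/numbers.
val sqlContext = new SQLContext(sc)

// Both settings exist in 1.6: filterPushdown applies the Parquet min/max
// statistics inside each task, and cacheMetadata (on by default) caches the
// Parquet schema metadata across queries.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "true")

val df = sqlContext.read.parquet("/data/numbers")

// The pushed-down filter lets most files/row groups be skipped at execution
// time, but the driver still schedules one task per file split, so a
// directory of many small files produces many tasks that return nothing.
df.filter(df("col1") === 500000).collect()

One workaround I've seen suggested is calling df.coalesce(n) right after the read so that many file splits are merged into fewer tasks, but as far as I can tell that still doesn't let the driver skip whole files based on the footer statistics.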
On 27 May 2016 at 04:25, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> Spark just prints #bytes in the web UI that is accumulated from
> InputSplit#getLength (it is just a length of files).
> Therefore, I'm afraid this metric does not reflect actual read #bytes for
> parquet.
> If you get the metric, you need to use other tools such as iostat or
> something.
>
> // maropu
>
> On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <
> dennis.hunzi...@gmail.com> wrote:
>
>> Hi all
>>
>> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
>> find out about the improvements made in filtering/scanning parquet files
>> when querying for tables using SparkSQL and how these changes relate to
>> the new filter API introduced in Parquet 1.7.0.
>>
>> After checking the usual sources, I still can’t make sense of some of
>> the numbers shown on the Spark UI. As an example, I’m looking at the
>> collect stage for a query that’s selecting a single row from a table
>> containing 1 million numbers using a simple where clause (i.e. col1 =
>> 500000) and this is what I see on the UI:
>>
>> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>>
>> Based on the min/max statistics of each of the parquet parts, it makes
>> sense not to expect any records for 3 out of the 4, because the record
>> I’m looking for can only be in a single file. But why is the input size
>> above shown as 2.4 MB, totaling up to an overall input size of 9.7 MB
>> for the whole stage? Isn't it just meant to read the metadata and ignore
>> the content of the file?
>>
>> Regards,
>>
>> Dennis
>
>
> --
> ---
> Takeshi Yamamuro