Technically, yes. I'm not sure there is a Parquet API for easily fetching file statistics (min, max, ...) though; if one exists, it seems we could skip some file splits in `ParquetFileFormat`. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L273
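
For example, something like the following (just an untested sketch against the parquet-mr 1.7.x metadata API; the file path is a placeholder) dumps the per-row-group min/max recorded in a footer:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val conf = new Configuration()
// reads only the footer (metadata), not the row data
val footer = ParquetFileReader.readFooter(conf, new Path("/tmp/numbers/part-r-00000.gz.parquet"))
for {
  block <- footer.getBlocks.asScala  // one entry per row group
  col   <- block.getColumns.asScala  // one entry per column chunk
} {
  val stats = col.getStatistics      // can be empty if the writer recorded no stats
  println(s"${col.getPath}: min=${stats.genericGetMin} max=${stats.genericGetMax} nulls=${stats.getNumNulls}")
}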
// maropu

On Thu, Jun 2, 2016 at 5:51 AM, Dennis Hunziker <dennis.hunzi...@gmail.com> wrote:
> Thanks, that makes sense. What I wonder though is that, if we use parquet
> metadata caching, spark should be able to execute queries much faster when
> using a large number of smaller .parquet files compared to a smaller number
> of large ones. At least as long as the min/max indexing is efficient (i.e.
> the data is grouped/ordered). However, I'm not seeing this in my tests
> because of the overhead of creating many tasks for all these small files
> that mostly end up doing nothing at all. Is it possible to prevent that? I
> assume only if the driver was able to inspect the cached metadata and
> avoid creating tasks for files that aren't used in the first place.
>
>
> On 27 May 2016 at 04:25, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Hi,
>>
>> Spark just prints #bytes in the web UI that is accumulated from
>> InputSplit#getLength (it is just the length of the files).
>> Therefore, I'm afraid this metric does not reflect the actual #bytes read for
>> parquet.
>> If you want to get that metric, you need to use other tools such as iostat or
>> something.
>>
>> // maropu
>>
>>
>> On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <
>> dennis.hunzi...@gmail.com> wrote:
>>
>>> Hi all
>>>
>>> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
>>> find out about the improvements made in filtering/scanning parquet files
>>> when querying for tables using SparkSQL and how these changes relate to the
>>> new filter API introduced in Parquet 1.7.0.
>>>
>>> After checking the usual sources, I still can’t make sense of some of
>>> the numbers shown on the Spark UI. As an example, I’m looking at the
>>> collect stage for a query that’s selecting a single row from a table
>>> containing 1 million numbers using a simple where clause (i.e. col1 =
>>> 500000) and this is what I see on the UI:
>>>
>>> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>>> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>>> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>>> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>>>
>>> Based on the min/max statistics of each of the parquet parts, it makes
>>> sense not to expect any records for 3 out of the 4, because the record I’m
>>> looking for can only be in a single file. But why is the input size above
>>> shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
>>> whole stage? Isn't it just meant to read the metadata and ignore the
>>> content of the file?
>>>
>>> Regards,
>>>
>>> Dennis
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>
>

--
---
Takeshi Yamamuro
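
P.S. For reference, a rough way to set up the scenario discussed above from spark-shell (Spark 1.6.x API; the path and column name are just placeholders), with Parquet filter pushdown turned on:

import sqlContext.implicits._

sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// 1M rows in 4 partitions; each partition (and so each parquet file)
// holds a contiguous range, so the footers get disjoint min/max stats
sqlContext.range(0, 1000000, 1, 4).toDF("col1")
  .write.mode("overwrite").parquet("/tmp/numbers")

// point lookup; only the file whose [min, max] range contains 500000 can match
sqlContext.read.parquet("/tmp/numbers").filter($"col1" === 500000L).collect()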