Technically, yes. I'm not sure there is a Parquet API for easily fetching file statistics (min, max, ...) though; if one exists, it seems we could skip some file splits in `ParquetFileFormat`. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L273
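
For example, something like the following (just an untested sketch against the parquet-mr 1.7.x metadata API; the file path is a placeholder) dumps the per-row-group min/max recorded in a footer:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val conf = new Configuration()
// reads only the footer (metadata), not the row data
val footer = ParquetFileReader.readFooter(conf, new Path("/tmp/numbers/part-r-00000.gz.parquet"))
for {
  block <- footer.getBlocks.asScala  // one entry per row group
  col   <- block.getColumns.asScala  // one entry per column chunk
} {
  val stats = col.getStatistics      // can be empty if the writer recorded no stats
  println(s"${col.getPath}: min=${stats.genericGetMin} max=${stats.genericGetMax} nulls=${stats.getNumNulls}")
}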
// maropu

On Thu, Jun 2, 2016 at 5:51 AM, Dennis Hunziker <dennis.hunzi...@gmail.com> wrote:
> Thanks, that makes sense. What I wonder though is that, if we use parquet
> metadata caching, spark should be able to execute queries much faster when
> using a large number of smaller .parquet files compared to a smaller number
> of large ones. At least as long as the min/max indexing is efficient (i.e.
> the data is grouped/ordered). However, I'm not seeing this in my tests
> because of the overhead of creating many tasks for all these small files
> that mostly end up doing nothing at all. Is it possible to prevent that? I
> assume only if the driver was able to inspect the cached metadata and
> avoid creating tasks for files that aren't used in the first place.
>
>
> On 27 May 2016 at 04:25, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Hi,
>>
>> Spark just prints #bytes in the web UI that is accumulated from
>> InputSplit#getLength (it is just the length of the files).
>> Therefore, I'm afraid this metric does not reflect the actual #bytes read for
>> parquet.
>> If you want to get that metric, you need to use other tools such as iostat or
>> something.
>>
>> // maropu
>>
>>
>> On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <
>> dennis.hunzi...@gmail.com> wrote:
>>
>>> Hi all
>>>
>>> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
>>> find out about the improvements made in filtering/scanning parquet files
>>> when querying for tables using SparkSQL and how these changes relate to the
>>> new filter API introduced in Parquet 1.7.0.
>>>
>>> After checking the usual sources, I still can’t make sense of some of
>>> the numbers shown on the Spark UI. As an example, I’m looking at the
>>> collect stage for a query that’s selecting a single row from a table
>>> containing 1 million numbers using a simple where clause (i.e. col1 =
>>> 500000) and this is what I see on the UI:
>>>
>>> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>>> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>>> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>>> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>>>
>>> Based on the min/max statistics of each of the parquet parts, it makes
>>> sense not to expect any records for 3 out of the 4, because the record I’m
>>> looking for can only be in a single file. But why is the input size above
>>> shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
>>> whole stage? Isn't it just meant to read the metadata and ignore the
>>> content of the file?
>>>
>>> Regards,
>>>
>>> Dennis
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>
>

--
---
Takeshi Yamamuro
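
P.S. For reference, a rough way to set up the scenario discussed above from spark-shell (Spark 1.6.x API; the path and column name are just placeholders), with Parquet filter pushdown turned on:

import sqlContext.implicits._

sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// 1M rows in 4 partitions; each partition (and so each parquet file)
// holds a contiguous range, so the footers get disjoint min/max stats
sqlContext.range(0, 1000000, 1, 4).toDF("col1")
  .write.mode("overwrite").parquet("/tmp/numbers")

// point lookup; only the file whose [min, max] range contains 500000 can match
sqlContext.read.parquet("/tmp/numbers").filter($"col1" === 500000L).collect()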