Well, even if Spark has to apply the filter again, with pushdown activated it costs much less than having Spark verify whether the filter has already been applied. Applying the filter itself is negligible; what pushdown really avoids, if the file format implements it, is the I/O cost of reading as well as the cost of converting from the file format's internal data types to Spark's. Those two things are very expensive, but not the filter check. There can also be data-source-internal reasons not to apply a filter (many, depending on your scenario, the format, etc.). Instead of a "negotiation" between Spark and the data source, it is much cheaper for Spark to simply re-check that the filters are consistently applied.
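To make the contract concrete, here is a minimal sketch against the DataSourceV2 reader API (the ExampleReader class and its choice of supported filters are hypothetical; SupportsPushDownFilters, pushFilters and pushedFilters are the real interface): pushFilters hands back the filters the source cannot handle, and Spark re-applies exactly those after the scan, while pushedFilters() is what explain() later reports.

import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

// Hypothetical reader that evaluates only GreaterThan and IsNotNull during
// the scan; everything else is handed back to Spark for post-scan filtering.
abstract class ExampleReader extends SupportsPushDownFilters {
  private var accepted: Array[Filter] = Array.empty

  // Called once during planning. The returned array contains the filters
  // the source could NOT handle; Spark re-applies these after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: GreaterThan | _: IsNotNull => true
      case _                             => false
    }
    accepted = supported
    unsupported
  }

  // What explain() reports as PushedFilters -- the source's own claim,
  // which Spark does not independently verify.
  override def pushedFilters(): Array[Filter] = accepted
}

Note that Spark only re-checks what pushFilters returns, so whether the pushed filters were actually enforced remains the source's claim.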
> On 09.12.2018 at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>
> Hello,
> that's an interesting question, but after Frank's reply I am a bit puzzled.
>
> If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query?
>
> Consider a filter pushed down to the data source: either Spark has to know whether it has been applied, or it has to re-apply the filter anyway (and pay the price for doing so).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> On Sat, 8 Dec 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (which would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilters.
>>
>> > On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm a support engineer, interested in DataSourceV2.
>> >
>> > Recently I had some pain troubleshooting whether a pushdown was actually applied or not.
>> > I noticed that DataFrame's explain() method shows pushed filters even for JSON.
>> > It totally depends on the DataSource side, I believe. However, I would like
>> > Spark to have some way to confirm whether a specific pushdown was actually
>> > applied in the DataSource or not.
>> >
>> > # Example
>> > val df = spark.read.json("s3://sample_bucket/people.json")
>> > df.printSchema()
>> > df.filter($"age" > 20).explain()
>> >
>> > root
>> >  |-- age: long (nullable = true)
>> >  |-- name: string (nullable = true)
>> >
>> > == Physical Plan ==
>> > *Project [age#47L, name#48]
>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>> >       Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>> >       PartitionFilters: [], PushedFilters: [IsNotNull(age),
>> >       GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>> >
>> > # Comments
>> > As you can see, PushedFilters is shown even though the input data is JSON.
>> > Actually this pushdown is not used.
>> >
>> > I'm wondering if this has already been discussed.
>> > If not, this is a chance to add such a feature to DataSourceV2, because it
>> > would require some API-level changes.
>> >
>> >
>> > Warm regards,
>> >
>> > Noritaka Sekiyama
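For comparison, the V1 API already exposes such a confirmation hook: BaseRelation.unhandledFilters lets a relation declare which pushed filters Spark must still re-evaluate, and its default implementation returns all of them, i.e. Spark re-checks everything unless the source explicitly claims otherwise. A minimal sketch (the ExampleRelation class and its filter choice are hypothetical):

import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan}

// Hypothetical V1 relation. Spark re-evaluates on top of the scan every
// filter returned from unhandledFilters; the BaseRelation default returns
// all filters, so by default nothing is trusted to the source.
abstract class ExampleRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[GreaterThan]) // only GreaterThan is claimed as fully handled
}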