Well, even if Spark has to apply the filter again, with pushdown activated it costs much less than having Spark verify whether the filter has already been applied. Applying the filter itself is negligible; what pushdown really avoids, if the file format implements it, is the I/O cost of reading as well as the cost of converting from the file format's internal data types to Spark's. Those two things are very expensive, but not the filter check. There can also be data-source-internal reasons not to apply a filter (many, depending on your scenario, the format, etc.). Instead of a "negotiation" between Spark and the data source, it is much cheaper for Spark to simply re-check that the filters are consistently applied.
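To make the contract concrete, here is a minimal sketch against the DataSourceV2 reader API (the ExampleReader class and its choice of supported filters are hypothetical; SupportsPushDownFilters, pushFilters and pushedFilters are the real interface): pushFilters hands back the filters the source cannot handle, and Spark re-applies exactly those after the scan, while pushedFilters() is what explain() later reports.

import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

// Hypothetical reader that evaluates only GreaterThan and IsNotNull during
// the scan; everything else is handed back to Spark for post-scan filtering.
abstract class ExampleReader extends SupportsPushDownFilters {
  private var accepted: Array[Filter] = Array.empty

  // Called once during planning. The returned array contains the filters
  // the source could NOT handle; Spark re-applies these after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: GreaterThan | _: IsNotNull => true
      case _                             => false
    }
    accepted = supported
    unsupported
  }

  // What explain() reports as PushedFilters -- the source's own claim,
  // which Spark does not independently verify.
  override def pushedFilters(): Array[Filter] = accepted
}

Note that Spark only re-checks what pushFilters returns, so whether the pushed filters were actually enforced remains the source's claim.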
> On 09.12.2018 at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>
> Hello,
> that's an interesting question, but after Frank's reply I am a bit puzzled.
>
> If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query?
>
> Consider a filter pushed down to the data source: either Spark has to know whether it has been applied, or it has to re-apply the filter anyway (and pay the price for doing so).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> On Sat, 8 Dec 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (which would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilters.
>>
>> > On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm a support engineer, interested in DataSourceV2.
>> >
>> > Recently I had some pain troubleshooting whether a pushdown was actually applied or not.
>> > I noticed that DataFrame's explain() method shows pushed filters even for JSON.
>> > It totally depends on the DataSource side, I believe. However, I would like
>> > Spark to have some way to confirm whether a specific pushdown was actually
>> > applied in the DataSource or not.
>> >
>> > # Example
>> > val df = spark.read.json("s3://sample_bucket/people.json")
>> > df.printSchema()
>> > df.filter($"age" > 20).explain()
>> >
>> > root
>> >  |-- age: long (nullable = true)
>> >  |-- name: string (nullable = true)
>> >
>> > == Physical Plan ==
>> > *Project [age#47L, name#48]
>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>> >       Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>> >       PartitionFilters: [], PushedFilters: [IsNotNull(age),
>> >       GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>> >
>> > # Comments
>> > As you can see, PushedFilters is shown even though the input data is JSON.
>> > Actually this pushdown is not used.
>> >
>> > I'm wondering if this has already been discussed.
>> > If not, this is a chance to add such a feature to DataSourceV2, because it
>> > would require some API-level changes.
>> >
>> >
>> > Warm regards,
>> >
>> > Noritaka Sekiyama
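For comparison, the V1 API already exposes such a confirmation hook: BaseRelation.unhandledFilters lets a relation declare which pushed filters Spark must still re-evaluate, and its default implementation returns all of them, i.e. Spark re-checks everything unless the source explicitly claims otherwise. A minimal sketch (the ExampleRelation class and its filter choice are hypothetical):

import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan}

// Hypothetical V1 relation. Spark re-evaluates on top of the scan every
// filter returned from unhandledFilters; the BaseRelation default returns
// all filters, so by default nothing is trusted to the source.
abstract class ExampleRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[GreaterThan]) // only GreaterThan is claimed as fully handled
}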