You mean the top example, right? You're right, and I'm aware of that, as I
stated :)

But for the 2nd example (the initial JDBC load), I debugged it, and
apparently JDBCRDD runs the select before the filter is pushed down.

I mean that the where clause of the initial partition load does not take
the filter('account === "acct1") into account (no pun intended). The where
clause seems to be defined solely by the partitioning, i.e. either the
predicates you pass as an array to .jdbc, or the ranges auto-generated from
a column plus an upper / lower bound.
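
Concretely, here's a minimal sketch of the two partitioning modes I mean
(the connection URL and table / column names are just placeholders):

    import java.util.Properties

    val props = new Properties()

    // Mode 1: explicit predicates -- each array element becomes the where
    // clause of one partition's query
    val byPredicates = sqlContext.read.jdbc(
      url = "jdbc:postgresql://host/db",
      table = "accounts",
      predicates = Array("region = 'us'", "region = 'eu'"),
      connectionProperties = props)

    // Mode 2: auto-generated ranges -- Spark splits [lowerBound, upperBound)
    // on the given column into numPartitions where clauses
    val byRange = sqlContext.read.jdbc(
      url = "jdbc:postgresql://host/db",
      table = "accounts",
      columnName = "id",
      lowerBound = 0L,
      upperBound = 1000000L,
      numPartitions = 10,
      connectionProperties = props)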

In other words, everything except the initial load runs in Spark; nothing
but the partition where clauses is pushed down to the DB query (not even a
.select to limit the columns)...
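
A workaround that seems to do the trick is to pass a subquery in place of
the table name, since the JDBC source just wraps whatever string you give
it (a sketch; table / column names are placeholders, and some databases
want a different alias syntax):

    import java.util.Properties

    val props = new Properties()

    // The subquery runs in the database, so only the matching rows and
    // columns cross the wire, regardless of what Spark pushes down
    val pushed = sqlContext.read.jdbc(
      "jdbc:postgresql://host/db",
      "(select account, balance from accounts where account = 'acct1') t",
      props)

    // The physical plan shows what actually reaches the database
    pushed.explain(true)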

On Wed, Nov 18, 2015 at 4:50 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> When you have the following query, 'account === "acct1" will be pushed
> down to generate a new query with "where account = 'acct1'"
>
> Thanks.
>
> Zhan Zhang
>
> On Nov 18, 2015, at 11:36 AM, Eran Medan <eran.me...@gmail.com> wrote:
>
> I understand that the following are equivalent
>
>     df.filter('account === "acct1")
>
>     sql("select * from tempTableName where account = 'acct1'")
>
>
> But is Spark SQL "smart" enough to also push filter predicates down for
> the initial load?
>
> e.g.
>         sqlContext.read.jdbc(…).filter('account === "acct1")
>
> Is Spark "smart enough" to do this for each partition?
>
>        select … where account = 'acct1' AND (partition where clause here)?
>
> Or do I have to add it to each partition's where clause, or else it will
> load the entire set and only then filter it in memory?
>
