Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.g...@gmail.com Thu, 19 Oct 2017 15:30:33 -0700

If the underlying table(s) have indexes on them.  Does spark use those
indexes to optimize the query?


IE if I had a table in my JDBC data source (mysql in this case) had several
indexes and my query was filtering on one of the fields with an index.
Would spark know to push that predicate to the database or is the predicate
push-down ignorant of the underlying storage layer details.

Apologies if that still doesn't adequately explain my question.

Gary Lucas

On 19 October 2017 at 15:19, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> sorry what do you mean my JDBC table has an index on it? Where are you
> reading the data from the table?
>
> I assume you are referring to "id" column on the table that you are
> reading through JDBC connection.
>
> Then you are creating a temp Table called "df". That temp table is created
> in temporary work space and does not have any index. That index "id" is
> used when doing parallel reads into RDDs not when querying the temp Table.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 19 October 2017 at 23:10, lucas.g...@gmail.com <lucas.g...@gmail.com>
> wrote:
>
>> IE:  If my JDBC table has an index on it, will the optimizer consider
>> that when pushing predicates down?
>>
>> I noticed in a query like this:
>>
>> df = spark.hiveContext.read.jdbc(
>>   url=jdbc_url,
>>   table="schema.table",
>>   column="id",
>>   lowerBound=lower_bound_id,
>>   upperBound=upper_bound_id,
>>   numPartitions=numberPartitions
>> )
>> df.registerTempTable("df")
>>
>> filtered_df = spark.hiveContext.sql("""
>>     SELECT
>>         *
>>     FROM
>>         df
>>     WHERE
>>         type = 'type'
>>         AND action = 'action'
>>         AND audited_changes LIKE '---\ncompany_id:\n- %'
>> """)
>> filtered_audits.registerTempTable("filtered_df")
>>
>>
>> The queries sent to the DB look like this:
>> "Select fields from schema.table where type='type' and action='action'
>> and id > lower_bound and id <= upper_bound"
>>
>> And then it does the like ( LIKE '---\ncompany_id:\n- %') in memory,
>> which is great!
>>
>> However I'm wondering why it chooses that optimization.  In this case
>> there aren't any indexes on any of these except ID.
>>
>> So, does spark take into account JDBC indexes in it's query plan where it
>> can?
>>
>> Thanks!
>>
>> Gary Lucas
>>
>
>

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

Reply via email to