IE:  If my JDBC table has an index on it, will the optimizer consider that
when pushing predicates down?

I noticed in a query like this:

df = spark.hiveContext.read.jdbc(
  url=jdbc_url,
  table="schema.table",
  column="id",
  lowerBound=lower_bound_id,
  upperBound=upper_bound_id,
  numPartitions=numberPartitions
)
df.registerTempTable("df")

filtered_df = spark.hiveContext.sql("""
    SELECT
        *
    FROM
        df
    WHERE
        type = 'type'
        AND action = 'action'
        AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_df.registerTempTable("filtered_df")


The queries sent to the DB look like this:
"Select fields from schema.table where type='type' and action='action' and
id > lower_bound and id <= upper_bound"

And then it applies the LIKE filter (LIKE '---\ncompany_id:\n- %') in memory,
which is great!
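For intuition, here's a toy sketch of that split (this is illustrative only, not Spark's actual implementation): predicates the JDBC source can compile to SQL get appended to the WHERE clause, and anything it can't translate is kept back as a residual filter evaluated in memory. The predicate list below just mirrors the query above.

```python
# Toy model of a JDBC-style source splitting predicates into a pushed-down
# WHERE clause and residual in-memory filters. Not Spark's real code.

PUSHABLE_OPS = {"=", "<", "<=", ">", ">="}  # simple comparisons compile to SQL

def split_predicates(predicates):
    """Split (column, op, value) triples into pushed SQL fragments and residuals."""
    pushed, residual = [], []
    for col, op, value in predicates:
        if op in PUSHABLE_OPS:
            pushed.append(f"{col} {op} '{value}'")
        else:
            # e.g. a LIKE with embedded newlines: keep it, filter rows in memory
            residual.append((col, op, value))
    return pushed, residual

preds = [
    ("type", "=", "type"),
    ("action", "=", "action"),
    ("audited_changes", "LIKE", "---\ncompany_id:\n- %"),
]
pushed, residual = split_predicates(preds)
where_clause = " AND ".join(pushed)
# pushed  -> the equality filters, compiled into the SQL sent to the DB
# residual -> the LIKE, applied by Spark after rows come back
```

You can also check what actually got pushed for a given query by calling explain() on the filtered DataFrame and looking for the PushedFilters line in the physical plan.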

However, I'm wondering why it chooses that optimization. In this case there
aren't indexes on any of these columns except id.

So, does Spark take JDBC indexes into account in its query plan where it
can?

Thanks!

Gary Lucas
