Andreas Franz created SPARK-55830:
-------------------------------------

             Summary: JDBC predicate pushdown drops driver properties causing 
connection failure
                 Key: SPARK-55830
                 URL: https://issues.apache.org/jira/browse/SPARK-55830
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.2
            Reporter: Andreas Franz


When reading data via JDBC using *predicate pushdown (predicates)*, Spark 
appears to lose custom *JDBC driver properties* during connection creation.

 

This becomes problematic when using a *custom SocketFactory*, such as the 
*Google Cloud SQL JDBC SocketFactory*, which relies on additional driver 
properties for establishing the connection.

 

As a result, Spark fails to connect to the database when predicates are used.

 

Without predicates, the connection works correctly.

 

*Examples to reproduce:*
{code:java}
def get_jdbc_parameters():
    return {
        "url": f"jdbc:postgresql:///<DB_NAME>",
        "socketFactory": "com.google.cloud.sql.postgres.SocketFactory",
        "user": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS CLOUDSQL DATABASE>",
        "cloudSqlInstance": "<INSTANCE_CONNECTION_NAME>",
        "cloudSqlTargetPrincipal": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS 
CLOUDSQL DATABASE>",
        "enableIamAuth": "true",
        "password": "disabled",
        "sslmode": "disable",
    }

def get_table_name():
    return "<TABLE>"

# CASE 1: No predicates -> returns up to 10 rows as expected
jdbc_parameters = get_jdbc_parameters()

spark.read.jdbc(
    url=jdbc_parameters.pop("url"),
    table=get_table_name(),
    properties=jdbc_parameters,
).limit(10).display()

# CASE 2:  Empty predicated -> fails with exception below
jdbc_parameters = get_jdbc_parameters()
predicates = []

spark.read.jdbc(
    url=jdbc_parameters.pop("url"),
    table=get_table_name(),
    properties=jdbc_parameters,
    predicates=predicates,
).limit(10).display()

'''
Py4JJavaError: An error occurred while calling o488.jdbc.
: org.postgresql.util.PSQLException: Connection to :5432 refused. Check that 
the hostname and port are correct and that the postmaster is accepting TCP/IP 
connections.
at 
org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
at 
org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
'''

# CASE 3: Non-empty predicates -> fails with exception below
jdbc_parameters = get_jdbc_parameters()
predicates = ["id BETWEEN 1000 AND 1010 "]

spark.read.jdbc(
    url=jdbc_parameters.pop("url"),
    table=get_table_name(),
    properties=jdbc_parameters,
    predicates=predicates,
).limit(10).display()

'''
Py4JJavaError: An error occurred while calling o488.jdbc.
: org.postgresql.util.PSQLException: Connection to :5432 refused. Check that 
the hostname and port are correct and that the postmaster is accepting TCP/IP 
connections.
at 
org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
at 
org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
'''

# CASE 4: Predicates with JDBC parameters encoded in URL as workaround -> only 
returns entries of the given predicate range
jdbc_parameters = get_jdbc_parameters()

predicates = ["id BETWEEN 1000 AND 1010 "]

url = jdbc_parameters.pop("url")
params = "&".join(f"{k}={v}" for k, v in jdbc_parameters.items())

spark.read.jdbc(
    url=f"{url}?{params}",
    table=get_table_name(),
    predicates=predicates,
).limit(10).display()
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to