This question is about using Spark with dplyr (via sparklyr).
We load a lot of data from Oracle into data frames through a JDBC connection:
dfX <- spark_read_jdbc(spConn, "myconnection",
                       options = list(
                         url      = urlDEVdb,
                         driver   = "oracle.jdbc.OracleDriver",
                         user     = dbt_schema,
                         password = dbt_password,
                         dbtable  = pQuery,
                         memory   = FALSE   # don't cache the whole (big) table
                       ))
Then we run a lot of SQL statements and use sdf_register to register the
results. Eventually we want to write the final result to a database.
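To make the pattern concrete, here is a minimal sketch of the kind of pipeline
meant here (the SQL, the intermediate names "step1"/"final_result", and the
target table "TARGET_TABLE" are placeholders, not the real code):

library(sparklyr)

# run SQL against the registered JDBC table and register the result
sdf1 <- sdf_sql(spConn, "SELECT colA, colB FROM myconnection WHERE colB > 0")
sdf_register(sdf1, "step1")

# further SQL on the registered intermediate result
sdf2 <- sdf_sql(spConn, "SELECT colA, COUNT(*) AS n FROM step1 GROUP BY colA")
result <- sdf_register(sdf2, "final_result")

# write the final result back to the database over JDBC
spark_write_jdbc(result, "TARGET_TABLE",
                 mode = "overwrite",
                 options = list(
                   url      = urlDEVdb,
                   driver   = "oracle.jdbc.OracleDriver",
                   user     = dbt_schema,
                   password = dbt_password
                 ))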
Although we have set memory = FALSE, we see that all these tables get cached.
I notice that counts are triggered (I think this happens right before a table
is cached) and a collect is triggered. We also seem to see that registering the
tables with sdf_register triggers a collect action (it almost looks like these
are cached as well). This leads to a lot of actions (often on data frames
resulting from the same pipeline), which takes a long time.
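For anyone who wants to check the same thing on their side, a minimal sketch of
how the cache status of a registered table can be inspected (assuming Spark 2.x
so that spark_session() and the catalog API are available; "step1" is a
placeholder table name):

library(sparklyr)

# ask the Spark catalog whether a registered table is currently cached
catalog <- invoke(spark_session(spConn), "catalog")
invoke(catalog, "isCached", "step1")   # returns TRUE/FALSE

# drop a table from the cache if it turns out to be cached unexpectedly
tbl_uncache(spConn, "step1")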
Questions for people using dplyr + Spark:
1) Is it possible that memory = FALSE is ignored when reading through JDBC?
2) Can someone confirm that a lot of automatic caching is happening (and hence
a lot of counts and a lot of actions)?
Thanks for input!