Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3225#issuecomment-67899967
I agree that the current mutable nature of `sc.hadoopConfiguration` is
confusing and this seems like it's worth documenting. It would be nicer if we
didn't have this messy mutable configuration, though. I think that the
combination of a mutable conf + lazy evaluation is what makes this confusing,
since @zsxwing's example of reading from two tables would work correctly under
eager evaluation:
```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Under eager evaluation, the first RDD would read its input table
// before the conf is mutated below...
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// ...so this mutation would only affect the second RDD:
conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
```
I suppose one approach would be to have the configuration stay mutable but
to make a defensive copy of it when constructing RDDs that accept
configurations. This would break programs that were relying on being able to
mutate credentials _after_ having defined a bunch of RDDs (e.g. define some
RDDs, fail due to missing S3 credentials, supply new credentials, and re-run),
but I think it makes things easier to reason about.
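To make the defensive-copy idea concrete, here is a minimal, self-contained sketch (not Spark's actual API; `Conf` and `FakeRDD` are hypothetical stand-ins) showing how snapshotting the configuration at RDD-construction time makes the two-table example behave correctly even with later mutation:

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable Hadoop-style configuration.
class Conf {
  private val entries = mutable.Map[String, String]()
  def set(k: String, v: String): Unit = entries(k) = v
  def get(k: String): Option[String] = entries.get(k)
  def copy(): Conf = { val c = new Conf; c.entries ++= entries; c }
}

// A lazily-evaluated "RDD" that takes a defensive copy when constructed,
// so mutations to the original conf after this point cannot leak in.
class FakeRDD(conf: Conf) {
  private val snapshot = conf.copy() // defensive copy happens here
  def inputTable: Option[String] = snapshot.get("input.table")
}

val conf = new Conf
conf.set("input.table", "table_name")
val rdd = new FakeRDD(conf)

conf.set("input.table", "another_table_name") // later mutation
val rdd2 = new FakeRDD(conf)

// Each RDD sees the conf as it was at construction time:
assert(rdd.inputTable.contains("table_name"))
assert(rdd2.inputTable.contains("another_table_name"))
```

With the real Hadoop `Configuration`, the copy could use its copy constructor, `new Configuration(conf)`; the trade-off, as noted above, is that programs relying on mutating the conf after defining RDDs would break.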
If we're not going to introduce any change in behavior, though, then I
think we should document the current behavior more explicitly, as this patch
has done.