Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3225#issuecomment-67899967
I agree that the current mutable nature of `sc.hadoopConfiguration` is
confusing and this seems like it's worth documenting. It would be nicer if we
didn't have this messy mutable configuration, though. I think that the
combination of a mutable conf + lazy evaluation is what makes this confusing,
since @zsxwing's example of reading from two tables would work correctly under
eager evaluation:
```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Under eager evaluation, the first RDD would read its input table
// before the conf is mutated below...
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// ...so this mutation would only affect the second RDD:
conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
```
I suppose one approach would be to have the configuration stay mutable but
to make a defensive copy of it when constructing RDDs that accept
configurations. This would break programs that were relying on being able to
mutate credentials _after_ having defined a bunch of RDDs (e.g. define some
RDDs, fail due to missing S3 credentials, supply new credentials, and re-run),
but I think it makes things easier to reason about.
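To make the defensive-copy idea concrete, here is a minimal, self-contained sketch (not Spark's actual API; `Conf` and `FakeRDD` are hypothetical stand-ins) showing how snapshotting the configuration at RDD-construction time makes the two-table example behave correctly even with later mutation:

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable Hadoop-style configuration.
class Conf {
  private val entries = mutable.Map[String, String]()
  def set(k: String, v: String): Unit = entries(k) = v
  def get(k: String): Option[String] = entries.get(k)
  def copy(): Conf = { val c = new Conf; c.entries ++= entries; c }
}

// A lazily-evaluated "RDD" that takes a defensive copy when constructed,
// so mutations to the original conf after this point cannot leak in.
class FakeRDD(conf: Conf) {
  private val snapshot = conf.copy() // defensive copy happens here
  def inputTable: Option[String] = snapshot.get("input.table")
}

val conf = new Conf
conf.set("input.table", "table_name")
val rdd = new FakeRDD(conf)

conf.set("input.table", "another_table_name") // later mutation
val rdd2 = new FakeRDD(conf)

// Each RDD sees the conf as it was at construction time:
assert(rdd.inputTable.contains("table_name"))
assert(rdd2.inputTable.contains("another_table_name"))
```

With the real Hadoop `Configuration`, the copy could use its copy constructor, `new Configuration(conf)`; the trade-off, as noted above, is that programs relying on mutating the conf after defining RDDs would break.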
If we're not going to introduce any change in behavior, though, then I
think we should document the current behavior more explicitly, as this patch
has done.