Kevin Ushey created SPARK-17752:
-----------------------------------
Summary: Spark returns incorrect result when 'collect()'ing a
cached Dataset with many columns
Key: SPARK-17752
URL: https://issues.apache.org/jira/browse/SPARK-17752
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 2.0.0
Reporter: Kevin Ushey
Priority: Critical
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0
installation as necessary):
---
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"))

# Build a one-row data.frame with n integer columns X1 .. Xn.
n <- 1E3
df <- as.data.frame(replicate(n, 1L, simplify = FALSE))
names(df) <- paste("X", 1:n, sep = "")

# (The CSV written here is not read back; it is incidental to the repro.)
path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl)          # works fine without this
cl <- collect(tbl)
identical(df, cl)   # FALSE
---
Although this is reproducible from SparkR, the error more likely lies in the
Java / Scala Spark sources than in the R bindings themselves.
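As a first step toward localizing the fault, one could check, in the same
SparkR session as the repro above, which of the collected columns actually
disagree with the originals (the variable name `mismatch` below is
illustrative, not part of any SparkR API):

```r
# Hypothetical follow-up in the same session: compare the original and
# collected data frames column by column. mapply() pairs up corresponding
# columns of df and cl and tests each pair with identical().
mismatch <- names(df)[!mapply(identical, df, cl)]
length(mismatch)   # how many of the n columns came back wrong
head(mismatch)     # which columns, by name
```

If the mismatches cluster past a particular column index, that would point
at a width-dependent code path (e.g. cached columnar storage) rather than a
serialization problem in SparkR.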
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)