Kevin Ushey created SPARK-17752:
-----------------------------------
Summary: Spark returns incorrect result when 'collect()'ing a
cached Dataset with many columns
Key: SPARK-17752
URL: https://issues.apache.org/jira/browse/SPARK-17752
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 2.0.0
Reporter: Kevin Ushey
Priority: Critical
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0
installation as necessary):
---
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"))

# Build a one-row data.frame with n integer columns X1 .. Xn.
n <- 1E3
df <- as.data.frame(replicate(n, 1L, simplify = FALSE))
names(df) <- paste("X", 1:n, sep = "")

# (The CSV written here is not read back; it is incidental to the repro.)
path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE,
            sep = ",", quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl)          # works fine without this
cl <- collect(tbl)
identical(df, cl)   # FALSE
---
Although this is reproducible from SparkR, the error more likely lies in the
Java / Scala Spark sources than in the R bindings themselves.
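As a first step toward localizing the fault, one could check, in the same
SparkR session as the repro above, which of the collected columns actually
disagree with the originals (the variable name `mismatch` below is
illustrative, not part of any SparkR API):

```r
# Hypothetical follow-up in the same session: compare the original and
# collected data frames column by column. mapply() pairs up corresponding
# columns of df and cl and tests each pair with identical().
mismatch <- names(df)[!mapply(identical, df, cl)]
length(mismatch)   # how many of the n columns came back wrong
head(mismatch)     # which columns, by name
```

If the mismatches cluster past a particular column index, that would point
at a width-dependent code path (e.g. cached columnar storage) rather than a
serialization problem in SparkR.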
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)