Hi,

We currently have some workloads in Spark 1.6.2 with queries operating on a data frame with 1500+ columns (17000 rows). This has never been entirely stable: some queries, such as "select *", would yield empty result sets, while queries restricted to specific columns have mostly worked. Needless to say, 1500+ columns isn't "desirable", but that is what the client's data looks like, and our preference has been to load it and normalize it through Spark.
We have been waiting to see how this would work with Spark 2.0, and unfortunately the problem has gotten worse: almost all queries on this large data frame that worked before now return data frames containing only null values. Is this a known issue with Spark? If so, does anyone know why it has been left untouched, or made worse, in Spark 2.0? If data frames with many columns are a limitation that goes deep into Spark, I would prefer hard errors over queries that run but return meaningless results.

The problem is easy to reproduce, but I am not familiar enough with the Spark source code to debug it and find the root cause. Hope some of you can enlighten me :-)
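For anyone who wants to try this locally, below is a minimal sketch of the kind of reproduction I have in mind, written against the Spark 2.0 SparkSession API. The object name, the column names (c0 .. c1599) and the integer fill values are synthetic stand-ins rather than the client's actual schema; the idea is simply to generate a frame of roughly the same width and row count and run the two query forms described above.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object WideFrameRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-frame-repro")
      .master("local[*]")
      .getOrCreate()

    // Synthetic stand-in for the client data: ~1500 columns, 17000 rows.
    val numCols = 1600
    val numRows = 17000

    // Schema c0 .. c1599, all nullable integers.
    val schema = StructType(
      (0 until numCols).map(i => StructField(s"c$i", IntegerType, nullable = true)))

    // Every cell holds the row index, so no value should ever be null.
    val rows = spark.sparkContext
      .parallelize(0 until numRows)
      .map(r => Row.fromSeq(Seq.fill(numCols)(r)))

    val df = spark.createDataFrame(rows, schema)
    df.createOrReplaceTempView("wide")

    // On our real data, this form of query now returns only nulls under 2.0:
    spark.sql("SELECT * FROM wide").show(5)

    // Queries restricted to specific columns mostly worked on 1.6.2:
    spark.sql("SELECT c0, c1 FROM wide").show(5)

    spark.stop()
  }
}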