Hi

We currently have some workloads in Spark 1.6.2 with queries operating on a
data frame with 1500+ columns (17000 rows). This has never been entirely
stable: some queries, such as "select *", would yield empty result sets,
but queries restricting to specific columns have mostly worked. Needless to
say, 1500+ columns isn't desirable, but that's what the client's data looks
like, and our preference has been to load it and normalize it through
Spark.
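
For context, the 1.6.2 workload looks roughly like the sketch below, run
from spark-shell. The file path, the spark-csv package and the column names
are placeholders standing in for the client's data:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // ~17000 rows, 1500+ columns, loaded via the spark-csv package
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/data/client/wide_table.csv")

  df.registerTempTable("wide")

  // Yields an empty result set for us on 1.6.2:
  sqlContext.sql("SELECT * FROM wide").show()

  // Restricting to specific columns has mostly worked:
  sqlContext.sql("SELECT col_0001, col_0002 FROM wide").show()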

We have been waiting to see how this would work with Spark 2.0, and
unfortunately the problem has gotten worse: almost all of the queries on
this wide data frame that worked before now return data frames containing
only null values.

Is this a known issue with Spark? If so, does anyone know why it has been
left unaddressed / made worse in Spark 2.0? If data frames with many columns
are a limitation that runs deep into Spark, I would prefer hard errors over
queries that run and return meaningless results. The problem is easy to
reproduce (see the sketch below), but I am not familiar enough with the
Spark source code to debug it and find the root cause.
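
To illustrate the shape of the problem, here is roughly what I run against
2.0 from spark-shell. The synthetic data below is only a stand-in with the
same shape (1500 columns, 17000 rows); our real data frame is loaded from
the client's files, but the queries are of this form, and for us they come
back with nothing but null values:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructType, StructField, StringType}

  val numCols = 1500
  val numRows = 17000

  // Build a wide schema and dummy string data of the same shape,
  // generating the rows on the executors rather than the driver
  val schema = StructType(
    (0 until numCols).map(i => StructField(s"col_$i", StringType)))
  val rows = spark.sparkContext.parallelize(0 until numRows, 8)
    .map(r => Row.fromSeq((0 until numCols).map(c => s"v_${r}_$c")))

  val wide = spark.createDataFrame(rows, schema)
  wide.createOrReplaceTempView("wide")

  // On our cluster, queries of this form return only null values:
  spark.sql("SELECT * FROM wide").show(5)
  spark.sql("SELECT col_0, col_1 FROM wide").show(5)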

Hope some of you can enlighten me :-)



