I think for sure SPARK-28547 <https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>. At the moment there are some flaws in Spark's architecture: it performs miserably or even freezes wherever the number of columns exceeds 10-15K (even a simple describe call takes ages, while the same function in pandas, without Spark, takes seconds). In many fields (like bioinformatics), wide datasets with large numbers of both rows and columns are very common (gene expression data is a good example), and Spark is totally useless there.