Steve, thank you for your response. We have tested spark.read with various options, and the difference in performance between them is very small. In particular, schema inference has virtually no effect in the tested case (the test files have only a few rows). Moreover, the complexity of spark.read remains polynomial in the number of columns in all the cases we considered. In contrast, spark.createDataFrame(data, schema) is linear in the number of columns and faster by a large factor. *What could be the reason for such a dramatic difference in performance?*
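For context, a minimal sketch of how one can time the DataFrame construction itself (this is an illustrative harness, not our exact benchmark code; it assumes PySpark and that the cost being measured is driver-side plan construction and analysis):

    import time

    def time_construction(build_df, n_trials=3):
        # Time only the creation of the DataFrame. Touching df.schema
        # forces analysis of the logical plan without launching a job.
        best = float("inf")
        for _ in range(n_trials):
            start = time.perf_counter()
            df = build_df()
            _ = df.schema
            best = min(best, time.perf_counter() - start)
        return best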
Please find the plot with our measurements below. The code is exactly the same as in the initial post; the only thing that changed is the additional configuration of spark.read. We compared:

- read.format("csv").option("inferSchema", "false")
- read.format("csv").option("inferSchema", "true")
- read.format("csv").schema(schema), where the schema is provided from a prepared JSON file
- read.parquet, which reads a Parquet file (including its schema) prepared from the same CSVs
- createDataFrame(data, schema), where data is parsed into rows from the CSV and the schema is constructed from its header

Plot (spark.read complexity vs. number of columns):
http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3091/spark_read_complexity_on_columns.jpg
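For completeness, a minimal PySpark sketch of the five variants above (path, schema_path, parquet_path, and rows are placeholders for our actual inputs; the Scala API is analogous):

    import json
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType

    spark = SparkSession.builder.getOrCreate()

    # 1. CSV, inference disabled (all columns read as strings)
    df1 = (spark.read.format("csv")
           .option("header", "true")
           .option("inferSchema", "false")
           .load(path))

    # 2. CSV, inference enabled (costs an extra pass over the data)
    df2 = (spark.read.format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load(path))

    # 3. CSV with an explicit schema loaded from a prepared JSON file
    with open(schema_path) as f:
        schema = StructType.fromJson(json.load(f))
    df3 = (spark.read.format("csv")
           .option("header", "true")
           .schema(schema)
           .load(path))

    # 4. Parquet prepared from the same CSVs (schema embedded in the file)
    df4 = spark.read.parquet(parquet_path)

    # 5. Rows parsed from the CSV by hand, schema built from its header
    df5 = spark.createDataFrame(rows, schema)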