Steve, thank you for your response.
We have tested spark.read with various options. The difference in
performance between them is very small. In particular, schema inference
has virtually no effect in the tested case (the test files have just a
few rows). Moreover, the complexity of spark.read remains polynomial in
the number of columns in all the considered cases. In contrast,
spark.createDataFrame(data, schema) is linear and faster by a large
factor. *What could be the reason for such a dramatic difference in
performance?*
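
For concreteness, here is a stripped-down sketch of the two calls being
compared (this is not the exact benchmark from the initial post; the file
path and the driver-side CSV parsing are illustrative):

import csv
import time
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
path = "/tmp/wide.csv"  # a few rows, many columns (illustrative path)

# Path A: the built-in CSV reader
t0 = time.time()
df_read = spark.read.format("csv").option("header", "true").load(path)
print("spark.read: %.2fs" % (time.time() - t0))

# Path B: parse the file on the driver, build the schema from the
# header, and hand both directly to createDataFrame
with open(path) as f:
    rows = list(csv.reader(f))
header, data = rows[0], rows[1:]
schema = StructType([StructField(c, StringType()) for c in header])
t0 = time.time()
df_created = spark.createDataFrame(data, schema)
print("createDataFrame: %.2fs" % (time.time() - t0))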

Please find the plot with our measurements below. The code is exactly
the same as in the initial post; the only thing that changed between
runs was the configuration of spark.read. The tested configurations
were (a sketch of each variant follows the list):
- spark.read.format("csv").option("inferSchema", "false")
- spark.read.format("csv").option("inferSchema", "true")
- spark.read.format("csv").schema(schema), where schema is loaded from a
  prepared JSON file
- spark.read.parquet, which reads a Parquet file (schema included)
  prepared from the same CSVs
- spark.createDataFrame(data, schema), where data is parsed into rows
  from the CSV and schema is constructed from its header
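
In PySpark terms, the variants look roughly like this (csv_path,
schema_json_path, parquet_path, data and schema are placeholders for the
values described above):

import json
from pyspark.sql.types import StructType

# 1) CSV without inference: every column is read as string
df_a = spark.read.format("csv").option("header", "true") \
    .option("inferSchema", "false").load(csv_path)

# 2) CSV with inference: an extra pass over the data to guess types
df_b = spark.read.format("csv").option("header", "true") \
    .option("inferSchema", "true").load(csv_path)

# 3) CSV with an explicit schema loaded from a prepared JSON file
with open(schema_json_path) as f:
    schema = StructType.fromJson(json.load(f))
df_c = spark.read.format("csv").option("header", "true") \
    .schema(schema).load(csv_path)

# 4) Parquet written from the same data; the schema travels with the file
df_d = spark.read.parquet(parquet_path)

# 5) Bypassing the reader entirely
df_e = spark.createDataFrame(data, schema)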

<http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3091/spark_read_complexity_on_columns.jpg>