I am using spark-csv to load about 50 GB of data spread across roughly 10,000 CSV files into a couple of unified DataFrames. Since this process is slow, I wrote this snippet:
targetList.foreach { target =>
  // getTrace uses sqlContext.load: it gets the list of files for this target
  // and loads each one with the StructType built from the schema files read earlier
  getTrace(target, sqlContext)
    .reduce(_ unionAll _)
    .registerTempTable(target.toUpperCase())
  sqlContext.sql("SELECT * FROM " + target.toUpperCase())
    .saveAsParquetFile(processedTraces + target)
}

The idea is to load the CSV files, union all the files that share the same schema, and write each union out as a single Parquet file (with its part files). The problem is that both the CPU (not all cores are busy) and the disk (an SSD, at 1 MB/s at most) are barely utilized. What am I doing wrong?

Here is the snippet for getTrace:

def getTrace(target: String, sqlContext: SQLContext): Seq[DataFrame] = {
  logFiles(mainLogFolder + target).map { file =>
    sqlContext.load(
      driver,
      // schemaSelect builds the StructType once
      schemaSelect(schemaFile, target, sqlContext),
      Map("path" -> file, "header" -> "false", "delimiter" -> ","))
  }
}

Thanks for any help.

-----
regards,
mohamad
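P.S. For reference, this is roughly the setup around the snippets above. The driver string is the standard spark-csv source name; the folder paths, schema file, and target names below are placeholders for this example, not my real values:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder setup used by the snippets above; real paths and targets differ.
val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
val sqlContext = new SQLContext(sc)

val driver = "com.databricks.spark.csv"       // spark-csv data source name
val mainLogFolder = "/data/raw/"              // placeholder: one subfolder of CSV files per target
val processedTraces = "/data/parquet/"        // placeholder: output location for the Parquet files
val schemaFile = "/data/schemas.csv"          // placeholder: file describing each target's columns
val targetList = Seq("targetA", "targetB")    // placeholder: the real list has more targets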