I am using Spark-CSV to load a 50GB of around 10,000 CSV files into couple
of unified DataFrames. Since this process is slow I have wrote this snippet:
targetList.foreach { target =>
// this is using sqlContext.load by getting list of files then
loading them according to schema files that
// read before and built their StructType
getTrace(target, sqlContext)
.reduce(_ unionAll _)
.registerTempTable(target.toUpperCase())
sqlContext.sql("SELECT * FROM " + target.toUpperCase())
.saveAsParquetFile(processedTraces + target)
to load the csv files and then union all the cvs files with the same schema
and write them into a single parquet file with their parts. The problems is
my cpu (not all cpus are being busy) and disk (ssd, with 1MB/s at most) are
barely utilized. I wonder what am I doing wrong?!
snippet for getTrace:
def getTrace(target: String, sqlContext: SQLContext): Seq[DataFrame] = {
logFiles(mainLogFolder + target).map {
file =>
sqlContext.load(
driver,
// schemaSelect builds the StructType once
schemaSelect(schemaFile, target, sqlContext),
Map("path" -> file, "header" -> "false", "delimiter" -> ","))
}
}
thanks for any help
-----
regards,
mohamad
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-CSV-to-DataFrame-and-saving-it-into-Parquet-for-speedup-tp23071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]