Hello,
We are running into issues while trying to process fixed-length files using
Spark.
The approach we took is as follows:
1. Read the .bz2 file from HDFS into a dataset using the
spark.read().textFile() API, and create a temporary view:
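// Each line of the compressed file becomes one row in a single
// string column named "value", which the query below references.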
Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
rawDataset.createOrReplaceTempView(tempView);
2. Run a SQL query on the view to slice the data into the fields we need
(using SUBSTRING):
(SELECT
    TRIM(SUBSTRING(value, 1, 16)) AS record1,
    TRIM(SUBSTRING(value, 17, 8)) AS record2,
    TRIM(SUBSTRING(value, 25, 5)) AS record3,
    TRIM(SUBSTRING(value, 30, 16)) AS record4,
    CAST(SUBSTRING(value, 46, 8) AS BIGINT) AS record5,
    CAST(SUBSTRING(value, 54, 6) AS BIGINT) AS record6,
    CAST(SUBSTRING(value, 60, 3) AS BIGINT) AS record7,
    CAST(SUBSTRING(value, 63, 6) AS BIGINT) AS record8,
    TRIM(SUBSTRING(value, 69, 20)) AS record9,
    TRIM(SUBSTRING(value, 89, 40)) AS record10,
    TRIM(SUBSTRING(value, 129, 32)) AS record11,
    TRIM(SUBSTRING(value, 161, 19)) AS record12,
    TRIM(SUBSTRING(value, 180, 1)) AS record13,
    TRIM(SUBSTRING(value, 181, 9)) AS record14,
    TRIM(SUBSTRING(value, 190, 3)) AS record15,
    CAST(SUBSTRING(value, 193, 8) AS BIGINT) AS record16,
    CAST(SUBSTRING(value, 201, 8) AS BIGINT) AS record17
FROM tempView)
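(The loadDataset referenced in step 3 is obtained by executing this query;
a minimal sketch, assuming the SELECT above is held in a String named query:)
// Execute the slicing query against the temporary view.
Dataset<Row> loadDataset = sparkSession.sql(query);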
3. Write the output of the SQL query to a Parquet file:
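// SaveMode.Append adds new Parquet files to outputDirectory on each run.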
loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);
Problem:
Step 2 takes considerably longer as the line length grows. If each line in
the file is 1000 characters, it takes only 4 minutes to process 20 million
lines; if we increase the line length to 2000 characters, the same 20
million lines take 20 minutes. In other words, doubling the data volume
increases the runtime five-fold (throughput drops from roughly 80 MB/s to
about 33 MB/s, assuming one byte per character).
Is there a better way in Spark to parse fixed-length lines?
Note: We are using Spark 2.2.0 with the Java API.