I am not 100% sure whether Spark is smart enough to achieve this in a single pass over the data. If not, you could write a Java UDF (or a typed map) that parses all of the columns at once.
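For illustration, here is a minimal sketch of that single-pass parse in Java, using a typed map() with a bean encoder rather than a registered UDF. The class and field names are mine, and only three of your seventeen columns are shown; note that Java substring offsets are 0-based, unlike SQL's 1-based SUBSTRING:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class FixedWidthParse {

    // Hypothetical bean; the real one would carry all 17 fields.
    public static class ParsedRecord implements java.io.Serializable {
        private String record1;
        private String record2;
        private long record5;
        public String getRecord1() { return record1; }
        public void setRecord1(String v) { record1 = v; }
        public String getRecord2() { return record2; }
        public void setRecord2(String v) { record2 = v; }
        public long getRecord5() { return record5; }
        public void setRecord5(long v) { record5 = v; }
    }

    public static Dataset<ParsedRecord> parse(SparkSession spark, String filePath) {
        Dataset<String> raw = spark.read().textFile(filePath);
        // Each line is sliced exactly once, directly in the typed API.
        return raw.map((MapFunction<String, ParsedRecord>) line -> {
            ParsedRecord r = new ParsedRecord();
            r.setRecord1(line.substring(0, 16).trim());                    // SUBSTRING(value, 1, 16)
            r.setRecord2(line.substring(16, 24).trim());                   // SUBSTRING(value, 17, 8)
            r.setRecord5(Long.parseLong(line.substring(45, 53).trim()));   // SUBSTRING(value, 46, 8) AS BIGINT
            return r;
        }, Encoders.bean(ParsedRecord.class));
    }
}

The resulting Dataset<ParsedRecord> can then be written to parquet exactly as in your step 3.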
Otherwise you could enable Tungsten off-heap memory, which might speed things up (a minimal config sketch follows after the quoted message below).

lsn24 <lekshmi.s...@gmail.com> wrote on Fri, Apr 13, 2018 at 19:02:

> Hello,
>
> We are running into issues while trying to process fixed-length files
> using Spark.
>
> The approach we took is as follows:
>
> 1. Read the .bz2 file into a dataset from HDFS using the
> spark.read().textFile() API and create a temporary view.
>
> Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
> rawDataset.createOrReplaceTempView(tempView);
>
> 2. Run a SQL query on the view to slice and dice the data the way we need
> it (using substring).
>
> (SELECT
>   TRIM(SUBSTRING(value, 1, 16)) AS record1,
>   TRIM(SUBSTRING(value, 17, 8)) AS record2,
>   TRIM(SUBSTRING(value, 25, 5)) AS record3,
>   TRIM(SUBSTRING(value, 30, 16)) AS record4,
>   CAST(SUBSTRING(value, 46, 8) AS BIGINT) AS record5,
>   CAST(SUBSTRING(value, 54, 6) AS BIGINT) AS record6,
>   CAST(SUBSTRING(value, 60, 3) AS BIGINT) AS record7,
>   CAST(SUBSTRING(value, 63, 6) AS BIGINT) AS record8,
>   TRIM(SUBSTRING(value, 69, 20)) AS record9,
>   TRIM(SUBSTRING(value, 89, 40)) AS record10,
>   TRIM(SUBSTRING(value, 129, 32)) AS record11,
>   TRIM(SUBSTRING(value, 161, 19)) AS record12,
>   TRIM(SUBSTRING(value, 180, 1)) AS record13,
>   TRIM(SUBSTRING(value, 181, 9)) AS record14,
>   TRIM(SUBSTRING(value, 190, 3)) AS record15,
>   CAST(SUBSTRING(value, 193, 8) AS BIGINT) AS record16,
>   CAST(SUBSTRING(value, 201, 8) AS BIGINT) AS record17
> FROM tempView)
>
> 3. Write the output of the SQL query to a parquet file.
>
> loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);
>
> Problem:
>
> Step #2 takes much longer when lines are ~2000 characters long. If each
> line in the file is only 1000 characters, it takes only 4 minutes to
> process 20 million lines. If we increase the line length to 2000
> characters, it takes 20 minutes to process 20 million lines.
>
> Is there a better way in Spark to parse fixed-length lines?
>
> *Note:* The Spark version we use is 2.2.0 and we are using Spark with Java.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
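As promised above, a minimal sketch of enabling off-heap memory when building the session; the 2g size is an arbitrary placeholder to be tuned for your executors, not a recommendation:

import org.apache.spark.sql.SparkSession;

public class OffHeapSession {
    public static void main(String[] args) {
        // Enable Tungsten off-heap memory for the whole job.
        SparkSession spark = SparkSession.builder()
                .appName("fixed-width-parse")
                .config("spark.memory.offHeap.enabled", "true")
                .config("spark.memory.offHeap.size", "2g")  // must be non-zero when off-heap is enabled
                .getOrCreate();

        // ... read, parse and write as in the original job ...

        spark.stop();
    }
}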