Re: Reading Large File in Pyspark

2021-06-03 Thread Gourav Sengupta
Hi,

Could not agree more with Molotch :)

Regards,
Gourav Sengupta

On Thu, May 27, 2021 at 7:08 PM Molotch wrote:
> You can specify the line separator to make spark split your records into
> separate rows.
>
> df = spark.read.option("lineSep","^^^").text("path")

Re: Reading Large File in Pyspark

2021-05-27 Thread Molotch
You can specify the line separator to make Spark split your records into separate rows.

df = spark.read.option("lineSep","^^^").text("path")

Then you need to df.select(split("value", "***").as("arrayColumn")) the column into an array and map over it with getItem to create a column for each property.
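Put together, a minimal PySpark sketch of this approach might look like the one below. The column names (col0, col1, ...) and the field count are placeholders, not from the original file; note that split() takes a regular expression, so the literal asterisks have to be escaped, and PySpark uses .alias() where Scala would use .as().

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("custom-delimited-read").getOrCreate()

# Each "^^^"-terminated record becomes one row in a single "value" column.
df = spark.read.option("lineSep", "^^^").text("path")

# Split each record on the literal "***" separator (escaped, since split() takes a regex).
arr = df.select(split(col("value"), r"\*\*\*").alias("arrayColumn"))

# Turn each array element into its own column; the field count here is hypothetical.
num_fields = 3
result = arr.select(
    *[col("arrayColumn").getItem(i).alias(f"col{i}") for i in range(num_fields)]
)
result.show(truncate=False)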

Reading Large File in Pyspark

2021-05-26 Thread Sukanya Sarma
Hi,

What is the best way to read a large text file (> 1 TB) in PySpark? The file is generated by a source system on which we can't make any changes, and it has a custom column separator ('***') and record delimiter ('^^^'). Reading this into a PySpark DataFrame directly is not possible (as reading t...