Hi, what is the best way to read a large text file (> 1 TB) in PySpark? The file is generated by a source system we can't make any changes to, and it uses a custom column separator ('***') and record delimiter ('^^^'). Reading it into a PySpark DataFrame directly is not possible: reading it as text separates the rows but leaves all the columns lumped into a single column, and reading it as CSV does not allow a line separator longer than one character. I want to load this file and do some processing on my EMR cluster.
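For reference, the direction I am considering is reading the file as plain text with '^^^' as the record delimiter and then splitting the single value column on '***' myself. A rough sketch is below; the S3 path, column count, and column names are placeholders, and it assumes the text source on my Spark/EMR version accepts a multi-character lineSep:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("custom-delimited-load").getOrCreate()

# Read each '^^^'-delimited record as one row with a single string column named "value".
# Assumes the text source accepts a multi-character lineSep on this Spark version.
raw = spark.read.option("lineSep", "^^^").text("s3://my-bucket/path/to/bigfile.txt")

# Split each record on the literal '***' (escaped, since split() expects a regex).
parts = F.split(F.col("value"), r"\*\*\*")

# Placeholder: assumes three columns; the real file has more.
df = raw.select(
    parts.getItem(0).alias("col1"),
    parts.getItem(1).alias("col2"),
    parts.getItem(2).alias("col3"),
)
df.show(5, truncate=False)
```

I am also aware of the older RDD route (sc.newAPIHadoopFile with textinputformat.record.delimiter set to '^^^' and then splitting each line), but I am not sure which of these scales better for a file this size.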
What is the best way to load this file? I could preprocess it with a regexp and then load it into a PySpark DataFrame, but I would prefer to do this entirely in PySpark, distributed over the cluster, if there is an effective way. Regards, Sanchit