Hi,

What is the best way to read a large text file (> 1 TB) in PySpark? The
file is generated by a source system on which we cannot make any changes,
and it uses a custom column separator ('***') and record delimiter ('^^^').
Reading this directly into a PySpark DataFrame is not possible: reading it
as text puts all the columns into a single column (although the rows can
still be separated), and reading it as CSV does not allow a line separator
longer than one character. I want to load this file and do some
processing on my EMR cluster.

What is the best way to load this file? I could do some preprocessing with
a regexp and then load the result into a PySpark DataFrame, but I would
prefer to see if there is a way to do this effectively in PySpark across
the cluster.
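
One idea I have come across is pushing the record delimiter down to the
Hadoop TextInputFormat and then splitting the columns myself, roughly like
the sketch below (the path and column names are placeholders, and I have
not verified this behaves well at the 1 TB scale). Is this the right
direction, or is there a better way?

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("custom-delimiter-load").getOrCreate()
    sc = spark.sparkContext

    # Placeholder path; the real file sits on S3/HDFS accessible from EMR.
    path = "s3://my-bucket/large_input_file.txt"

    # Tell the Hadoop TextInputFormat to split records on '^^^' instead of '\n'.
    # The resulting RDD contains (byte offset, record text) pairs.
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": "^^^"},
    )

    # Drop the byte offsets and split each record on the '***' column separator.
    # This assumes every record has the same number of columns.
    rows = records.map(lambda kv: tuple(kv[1].split("***")))

    # Placeholder column names; the real ones would come from the source spec.
    df = spark.createDataFrame(rows, ["col1", "col2", "col3"])
    df.show(5, truncate=False)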

Regards,
Sanchit
