Hi, I am trying to process a very large comma-delimited CSV file and I am running into problems. The main problem is that some fields contain quoted strings with embedded commas. PySpark does not seem to parse lines containing such fields correctly, whereas Pandas handles them without a problem.
Here is the code I am using to read the file in PySpark:

    df_raw = spark.read.option("header", "true").csv(csv_path)

Here is an example of a good and 'bad' line in such a file:

    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
    80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
    61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE

Line 0 is the header, line 1 is the 'problematic' line, and line 2 is a good line.

Pandas can handle this easily:

    In [1]: import pandas as pd

    In [2]: pdf = pd.read_csv('malformed_data.csv')

    In [4]: pdf[['col12','col13','col14']]
    Out[4]:
                       col12                                              col13  \
    0   32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
    1                    NaN                      OUTKAST#THROOTS~WUTANG#RUNDMC

       col14
    0   23.0
    1    0.0

while PySpark seems to parse this erroneously:

    In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)

    In [6]: sdf.select("col12","col13",'col14').show()
    +------------------+--------------------+--------------------+
    |             col12|               col13|               col14|
    +------------------+--------------------+--------------------+
    |  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
    |              null|OUTKAST#THROOTS~W...|                 0.0|
    +------------------+--------------------+--------------------+

Is this a bug or am I doing something wrong? I am working with Spark 2.0.

Any help is appreciated.

Thanks,
-- Femi
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
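P.S. For completeness, here is a minimal self-contained sketch of how I reproduce this locally. The file name malformed_data.csv and the plain SparkSession.builder.getOrCreate() setup are just my local assumptions; it writes the three sample lines shown above and reads them back with both Pandas and PySpark:

    # Minimal repro: write the header plus the 'bad' and 'good' rows to a file,
    # then read it with pandas and PySpark and compare col12-col14.
    import pandas as pd
    from pyspark.sql import SparkSession

    csv_path = "malformed_data.csv"  # example path, adjust as needed

    lines = [
        ",".join("col{}".format(i) for i in range(1, 20)),
        ('80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,'
         '"32 XIY ""W"" JK, RE LK",'
         'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,'
         '23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED'),
        ('61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,'
         'OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,'
         '2019-11-23,CODEBLUE'),
    ]
    with open(csv_path, "w") as f:
        f.write("\n".join(lines) + "\n")

    # pandas keeps the quoted field "32 XIY ""W"" JK, RE LK" intact as col12
    pdf = pd.read_csv(csv_path)
    print(pdf[["col12", "col13", "col14"]])

    # PySpark with only the header option splits that same field at the comma
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.option("header", "true").csv(csv_path)
    sdf.select("col12", "col13", "col14").show()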