Hi, I am trying to process a very large comma-delimited CSV file and I am running into problems. The main problem is that some fields contain quoted strings with embedded commas. PySpark does not seem to parse lines containing such fields correctly, whereas Pandas handles them without a problem.
Here is the code I am using to read the file in PySpark:

    df_raw = spark.read.option("header", "true").csv(csv_path)

Here is an example of a good and 'bad' line in such a file:

    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
    80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
    61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE

Line 0 is the header, line 1 is the 'problematic' line, and line 2 is a good line.

Pandas can handle this easily:

    In [1]: import pandas as pd

    In [2]: pdf = pd.read_csv('malformed_data.csv')

    In [4]: pdf[['col12','col13','col14']]
    Out[4]:
                       col12                                              col13  \
    0   32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
    1                    NaN                      OUTKAST#THROOTS~WUTANG#RUNDMC

       col14
    0   23.0
    1    0.0

while PySpark seems to parse this erroneously:

    In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)

    In [6]: sdf.select("col12","col13",'col14').show()
    +------------------+--------------------+--------------------+
    |             col12|               col13|               col14|
    +------------------+--------------------+--------------------+
    |  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
    |              null|OUTKAST#THROOTS~W...|                 0.0|
    +------------------+--------------------+--------------------+

Is this a bug or am I doing something wrong? I am working with Spark 2.0.

Any help is appreciated.

Thanks,
-- Femi
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
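P.S. For completeness, here is a minimal self-contained sketch of how I reproduce this locally. The file name malformed_data.csv and the plain SparkSession.builder.getOrCreate() setup are just my local assumptions; it writes the three sample lines shown above and reads them back with both Pandas and PySpark:

    # Minimal repro: write the header plus the 'bad' and 'good' rows to a file,
    # then read it with pandas and PySpark and compare col12-col14.
    import pandas as pd
    from pyspark.sql import SparkSession

    csv_path = "malformed_data.csv"  # example path, adjust as needed

    lines = [
        ",".join("col{}".format(i) for i in range(1, 20)),
        ('80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,'
         '"32 XIY ""W"" JK, RE LK",'
         'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,'
         '23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED'),
        ('61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,'
         'OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,'
         '2019-11-23,CODEBLUE'),
    ]
    with open(csv_path, "w") as f:
        f.write("\n".join(lines) + "\n")

    # pandas keeps the quoted field "32 XIY ""W"" JK, RE LK" intact as col12
    pdf = pd.read_csv(csv_path)
    print(pdf[["col12", "col13", "col14"]])

    # PySpark with only the header option splits that same field at the comma
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.option("header", "true").csv(csv_path)
    sdf.select("col12", "col13", "col14").show()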