Hi Spark Users Group,

I have a csv file to analyze with Spark, but I’m having trouble importing it as a DataFrame.
Here’s a minimal reproducible example. Suppose I have a *10 (rows) x 2 (cols)* *space-delimited csv* file, as shown below:

    1446566430 2015-11-04<SP>00:00:30
    1446566430 2015-11-04<SP>00:00:30
    1446566430 2015-11-04<SP>00:00:30
    1446566430 2015-11-04<SP>00:00:30
    1446566430 2015-11-04<SP>00:00:30
    1446566431 2015-11-04<SP>00:00:31
    1446566431 2015-11-04<SP>00:00:31
    1446566431 2015-11-04<SP>00:00:31
    1446566431 2015-11-04<SP>00:00:31
    1446566431 2015-11-04<SP>00:00:31

The <SP> in column 2 represents a sub-delimiter within that column. The file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv

I’m using *spark-csv* to import this file as a Spark *DataFrame*:

    sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "false")      // the file has no header row
      .option("inferSchema", "false") // do not infer data types; read every column as a string
      .option("delimiter", " ")
      .load("hdfs:///tmp/1.csv")
      .show

Oddly, the output shows only part of each column:

[image: Screenshot from 2016-02-07 15-27-51.png]

and even the border of the table isn’t drawn correctly. I also tried reading the csv file another way, with sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame, and the result is the same.

Can someone point out what I did wrong?

BR,
Todd Leo