Sorry, I still get the same expected results with trunk and the Kryo serializer.
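For anyone reproducing this, a quick sanity check that the Kryo setting actually took effect inside spark-shell (a minimal sketch; sc is the SparkContext the shell provides, and the fallback value shown is just Spark's default Java serializer):

// In the REPL this echoes the configured spark.serializer, or the given
// default if the key was never set.
sc.getConf.get("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
// expected: org.apache.spark.serializer.KryoSerializer when the shell was
// started with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer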
On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:

> I’ve found the trigger of my issue: if I start spark-shell or submit with
> spark-submit using --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame
> content goes wrong, as I described earlier.
>
>
> On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Thanks Luciano, now it looks like I’m the only one who has this issue.
>> My options are narrowed down to upgrading my Spark to 1.6.0, to see if
>> the issue goes away.
>>
>> —
>> Cheers,
>> Todd Leo
>>
>>
>> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1...@gmail.com>
>> wrote:
>>
>>> I tried 1.5.0, 1.6.0 and 2.0.0 trunk with
>>> com.databricks:spark-csv_2.10:1.3.0 and got the expected results, where
>>> the columns are read properly:
>>>
>>> +----------+----------------------+
>>> |C0        |C1                    |
>>> +----------+----------------------+
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> +----------+----------------------+
>>>
>>>
>>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com>
>>> wrote:
>>>
>>>> Hi Spark Users Group,
>>>>
>>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>>> importing it as a DataFrame.
>>>>
>>>> Here’s a minimal reproducible example. Suppose I have a *10 (rows) x
>>>> 2 (cols)* *space-delimited csv* file, shown below:
>>>>
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>
>>>> The <SP> in column 2 represents a sub-delimiter within that column,
>>>> and the file is stored on HDFS, say at hdfs:///tmp/1.csv.
>>>>
>>>> I’m using *spark-csv* to import this file as a Spark *DataFrame*:
>>>>
>>>> sqlContext.read.format("com.databricks.spark.csv")
>>>>   .option("header", "false")      // no header row in the file
>>>>   .option("inferSchema", "false") // keep all columns as strings
>>>>   .option("delimiter", " ")
>>>>   .load("hdfs:///tmp/1.csv")
>>>>   .show
>>>>
>>>> Oddly, the output shows only part of each column:
>>>>
>>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>>
>>>> and even the boundary of the table isn’t drawn correctly. I also tried
>>>> the other way of reading the CSV file, via
>>>> sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame, and
>>>> the result is the same [a sketch of that alternative is appended after
>>>> this thread]. Can someone point out where I went wrong?
>>>>
>>>> —
>>>> BR,
>>>> Todd Leo
>>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
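Appended for reference: a minimal sketch of the RDD-based alternative Todd mentions in his first message. The column names (C0/C1), the string-only schema, and the split limit of 2 are assumptions made for illustration; the thread itself only names sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Split on the first space only, so whatever <SP> stands for stays inside
// column 2 (the thread's version used split(" ") with no limit).
val rows = sc.textFile("hdfs:///tmp/1.csv")
  .map(_.split(" ", 2))
  .map(a => Row(a(0), a(1)))

// Keep both columns as plain strings, mirroring inferSchema=false above.
val schema = StructType(Seq(
  StructField("C0", StringType),
  StructField("C1", StringType)))

sqlContext.createDataFrame(rows, schema).show()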