I have CSV data stored in gzip-compressed files on HDFS.
*With Pig*
a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00003.gz' using PigStorage();
b = limit a 10;
(2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,,,,,203,4810370.0,1.4090459061723766,1.017458,-0.03,-0.11,0.05,0.468666,)
(2015-07-27,12459,,31241,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,0,isGeo,,,203,7937613.0,1.1624841995932425,1.11562,-0.06,-0.15,0.03,0.233283,)
However, with Spark:
val rowStructText = sc.parallelize("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz")
val x = rowStructText.map(s => {
  println(s)
  s
})
x.count
Questions:
1) x.count always returns 67, regardless of the path I pass to sc.parallelize.
2) x is shown as RDD[Char] instead of RDD[String].
3) println() never prints the rows.
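A possible explanation for 1) and 2), sketched in plain Scala so it runs without a cluster: sc.parallelize expects a local collection, so a String argument is probably being treated as a sequence of characters rather than as a path to open. The object name PathAsChars is made up for this illustration; the path is the one from the Spark snippet above.

```scala
// Hypothetical illustration: a String viewed as a Seq[Char], which is
// what sc.parallelize would distribute if handed a path string.
object PathAsChars {
  val path = "/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-00000.gz"

  // The same kind of view that parallelize's Seq[T] parameter would receive.
  val chars: Seq[Char] = path.toSeq

  def main(args: Array[String]): Unit = {
    println(chars.length)            // 67, the number of characters in the path
    println(chars.take(5).mkString)  // /user
  }
}
```

If that is what is happening, sc.textFile(path) would be the call that actually reads the file: it returns an RDD[String] with one element per line and decompresses .gz input transparently. For 3), the println inside map runs on the executors when an action such as count is triggered, so its output may be going to the executor stdout logs rather than to the notebook.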
Any suggestions?
-Deepak