Realized I was running this in spark-shell, where the schemeless path was being resolved against the local filesystem (note the file:/ prefix in the exception below). When I submitted the same code as a Spark job, it worked fine.
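For the archives: a workaround that should also let spark-shell read the gzipped output is to give sc.textFile a fully qualified hdfs:// URI, so the path cannot fall back to the local filesystem. A minimal sketch; namenode:8020 is a placeholder for the actual NameNode host and port:

    // Placeholder NameNode address; replace with the cluster's actual host:port.
    val t1 = sc.textFile("hdfs://namenode:8020/lz/streaming/am/1441734600000")
    // The .gz part files are decompressed transparently, based on the file extension.
    t1.take(1).head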
On Tue, Sep 8, 2015 at 3:13 PM, shenyan zhen <shenya...@gmail.com> wrote:
> Hi,
>
> For HDFS files written with the code below:
>
>   rdd.saveAsTextFile(getHdfsPath(...),
>     classOf[org.apache.hadoop.io.compress.GzipCodec])
>
> I can see the HDFS files being generated:
>
>   0      /lz/streaming/am/1441734600000/_SUCCESS
>   1.6 M  /lz/streaming/am/1441734600000/part-00000.gz
>   1.6 M  /lz/streaming/am/1441734600000/part-00001.gz
>   1.6 M  /lz/streaming/am/1441734600000/part-00002.gz
>   ...
>
> How do I read them using SparkContext?
>
> My naive attempt:
>
>   val t1 = sc.textFile("/lz/streaming/am/1441734600000")
>   t1.take(1).head
>
> did not work:
>
>   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>   file:/lz/streaming/am/1441734600000
>     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
>
> Thanks,
> Shenyan
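P.S. Another option, if you would rather keep writing bare paths like /lz/... in spark-shell, is to point the session's Hadoop configuration at the NameNode before creating the RDD. Again a sketch, with namenode:8020 as a placeholder:

    // Assumed NameNode address; with fs.defaultFS set, schemeless paths resolve to HDFS
    // for RDDs created afterwards in this session.
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:8020")
    val t1 = sc.textFile("/lz/streaming/am/1441734600000")
    t1.take(1).head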