If you’re still seeing gibberish, Spark is most likely handing you the raw
compressed bytes instead of decoding them with the LZO libraries. In your
case, I believe you should be calling newAPIHadoopFile() with an LZO-aware
input format instead of textFile().

For example:

// Yields an RDD[(LongWritable, Text)]: the byte offset and the text of each line.
val ngrams = sc.newAPIHadoopFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
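
One thing worth checking first: the take(10) output quoted below starts with
SEQ followed by org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text
and com.hadoop.compression.lzo.LzoCodec, which looks like a Hadoop
SequenceFile header rather than an lzop-style .lzo text file. If that is the
case, LzoTextInputFormat may not be the right reader. Here is a rough sketch
for telling the two apart from the first few bytes of a local copy of the
file. HeaderSniff, detectFormat and readHeader are names made up for this
example, and the magic bytes are the ones I know for SequenceFiles ("SEQ")
and lzop (0x89 "LZO"); treat it as a sanity check, not a definitive test.

```scala
import java.io.FileInputStream

// Hypothetical helper (not part of Spark or Hadoop): guesses the container
// format of a file from its leading magic bytes.
object HeaderSniff {
  // Hadoop SequenceFiles start with the ASCII bytes "SEQ" plus a version byte.
  private val SeqMagic: Array[Byte] = "SEQ".getBytes("US-ASCII")
  // lzop files start with 0x89 followed by the ASCII bytes "LZO".
  private val LzopMagic: Array[Byte] =
    Array[Int](0x89, 0x4c, 0x5a, 0x4f).map(_.toByte)

  def detectFormat(header: Array[Byte]): String =
    if (header.take(SeqMagic.length).sameElements(SeqMagic)) "sequencefile"
    else if (header.take(LzopMagic.length).sameElements(LzopMagic)) "lzop"
    else "unknown"

  // Reads up to n leading bytes of a local file (assumes a local copy exists).
  def readHeader(path: String, n: Int = 4): Array[Byte] = {
    val in = new FileInputStream(path)
    try {
      val buf = new Array[Byte](n)
      val read = in.read(buf)
      if (read <= 0) Array.empty[Byte] else buf.take(read)
    } finally in.close()
  }
}
```

If detectFormat reports "sequencefile", something like
sc.sequenceFile(path, classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text]) is probably the call to try, with the LZO
codec still on the classpath. If it reports "lzop", the newAPIHadoopFile()
call above should apply.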

On a side note, here’s a related JIRA issue: SPARK-2394: Make it easier to
read LZO-compressed files from EC2 clusters
<https://issues.apache.org/jira/browse/SPARK-2394>

Nick

On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:

> Hello,
>
> I have been trying to play with the Google ngram dataset provided by
> Amazon in the form of LZO-compressed files.
>
> I am having trouble understanding what is going on ;). I have added the
> compression jar and native library to the underlying Hadoop/HDFS
> installation and restarted the name node and the datanodes. Spark can
> obviously see the file, but I get gibberish on a read. Any ideas?
>
> See output below:
>
> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
> Spark context available as sc.
>
> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
> curMem=0, maxMem=311387750
> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 160.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
> <console>:12
>
> scala> f.take(10)
> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15,
> took 0.419708348 s
> res0: Array[String] = Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
> ????????????????????????????????????????????????????????????
> ????????????????????????????????????????????????????????????
> ?????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?
> 4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?
> H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?
> \�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?
> p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>
> Thanks!
> Ognen
>
