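If you’re still seeing gibberish, Spark is most likely reading the raw compressed bytes instead of decoding them through the LZO libraries. In your case, I believe you should be calling newAPIHadoopFile() instead of textFile(), so that you can name the LZO-aware input format explicitly.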
For example:

    sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])

On a side note, here’s a related JIRA issue:

SPARK-2394: Make it easier to read LZO-compressed files from EC2 clusters
<https://issues.apache.org/jira/browse/SPARK-2394>

Nick

On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:

> Hello,
>
> I have been trying to play with the Google ngram dataset provided by
> Amazon in form of LZO compressed files.
>
> I am having trouble understanding what is going on ;). I have added the
> compression jar and native library to the underlying Hadoop/HDFS
> installation, restarted the name node and the datanodes, Spark can
> obviously see the file but I get gibberish on a read. Any ideas?
>
> See output below:
>
> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
> Spark context available as sc.
>
> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=311387750
> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
>
> scala> f.take(10)
> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15, took 0.419708348 s
> res0: Array[String] = Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
> ????????????????????????????????????????????????????????????
> ????????????????????????????????????????????????????????????
> ?????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?
> 4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?
> H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?
> \�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?
> p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>
> Thanks!
> Ognen
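
A slightly fuller sketch of the newAPIHadoopFile() suggestion, written for a spark-shell session where sc already exists. It assumes the hadoop-lzo classes (com.hadoop.mapreduce.LzoTextInputFormat) and the native LZO library are visible to both the driver and the executors. The value names (path, raw, lines, seq) are only illustrative. One more thing worth checking: the output quoted above begins with "SEQ" followed by the LongWritable and Text class names and com.hadoop.compression.lzo.LzoCodec, which looks like a Hadoop SequenceFile header. If that is what the file really is, sc.sequenceFile() may be the better fit, so the second half tries that as an alternative. I have not run either variant against this dataset.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat

// Path taken from the original message; adjust to your own cluster.
val path = "hdfs://10.10.0.98:54310/data/1gram.lzo"

// Variant 1: treat the file as LZO-compressed text.
// Keys are byte offsets (LongWritable), values are the lines (Text).
val raw = sc.newAPIHadoopFile(
  path,
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Hadoop reuses the same Writable instances for every record, so copy each
// value into an immutable String before calling take(), collect() or cache().
val lines = raw.map { case (_, text) => text.toString }
lines.take(10).foreach(println)

// Variant 2 (assumption: the file is a SequenceFile whose blocks are
// compressed with LzoCodec, as the "SEQ" header in the output suggests).
// The codec is read from the file header, so only key/value classes are given.
val seq = sc.sequenceFile(path, classOf[LongWritable], classOf[Text])
seq.map { case (_, text) => text.toString }.take(10).foreach(println)
```

If variant 1 still returns binary junk while variant 2 prints readable n-gram records, the file is a SequenceFile rather than LZO-compressed text. If both fail with a codec or native-library error, the LZO jar and native library are probably not reaching the Spark executors.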