I actually never got this to work, which is part of the reason I filed that JIRA. Apart from using --jars when starting the shell, I don't have any more pointers for you. :(
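For completeness, the kind of invocation I had in mind looks roughly like the sketch below: shipping the jar to the executors with --jars and also putting it on the driver classpath, since the input format is instantiated driver-side as well. This is unverified on my end, and it reuses the path from your message below:

    bin/spark-shell \
      --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar \
      --driver-class-path /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar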
On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:
> Nicholas,
>
> Thanks!
>
> How do I make Spark assemble against a local version of Hadoop?
>
> I have 2.4.1 running on a test cluster, and I did
> "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly", but all it did was pull in
> hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
> HDFS). I am guessing my local version of the Hadoop libraries/jars is not
> used. Alternatively, how do I add hadoop-gpl-compression-0.1.0.jar
> (responsible for the LZO stuff) to this hand-assembled Spark?
>
> I am running the spark-shell like this:
>
> bin/spark-shell --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>
> and getting this:
>
> scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
> 14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with curMem=0, maxMem=311387750
> 14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 211.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:12
>
> scala> f.take(1)
> 14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
>         at com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
>
> which makes me think something is not linked properly (I am not a Java expert, unfortunately).
>
> Thanks!
> Ognen
>
>
> On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>
> If you're still seeing gibberish, it's because Spark is not using the
> LZO libraries properly. In your case, I believe you should be calling
> newAPIHadoopFile() instead of textFile().
>
> For example:
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
>
> On a side note, here's a related JIRA issue: SPARK-2394: Make it easier
> to read LZO-compressed files from EC2 clusters
> <https://issues.apache.org/jira/browse/SPARK-2394>
>
> Nick
>
>
> On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:
>> Hello,
>>
>> I have been trying to play with the Google ngram dataset provided by
>> Amazon in the form of LZO-compressed files.
>>
>> I am having trouble understanding what is going on ;). I have added the
>> compression jar and native library to the underlying Hadoop/HDFS
>> installation and restarted the name node and the datanodes. Spark can
>> obviously see the file, but I get gibberish on a read. Any ideas?
>>
>> See output below:
>>
>> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
>> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>>
>> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
>> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=311387750
>> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 296.8 MB)
>> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
>>
>> scala> f.take(10)
>> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15, took 0.419708348 s
>> res0: Array[String] = Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec... [remainder is unprintable binary output, truncated])
>>
>> Thanks!
>> Ognen
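A note on the IncompatibleClassChangeError above: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic symptom of code compiled against the Hadoop 1 APIs (where JobContext was a class) running on Hadoop 2 (where it became an interface). hadoop-gpl-compression-0.1.0 predates Hadoop 2, so getting it onto the classpath correctly will not be enough; the codec needs to be rebuilt against the Hadoop 2 APIs. One unverified route is the twitter/hadoop-lzo fork, which as far as I know ships the same com.hadoop.mapreduce.LzoTextInputFormat class but targets the newer APIs. A rough sketch, assuming a working C toolchain and the LZO development headers (e.g. lzo-devel) for the native part:

    git clone https://github.com/twitter/hadoop-lzo.git
    cd hadoop-lzo
    mvn clean package -DskipTests

The resulting jar under target/ would then be passed to --jars (and the driver classpath) in place of hadoop-gpl-compression-0.1.0.jar.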
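Separately, the gibberish from sc.textFile() is itself informative: it begins with a SEQ header naming org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, and com.hadoop.compression.lzo.LzoCodec, which suggests the file is a block-compressed Hadoop SequenceFile rather than a raw LZO text file. textFile() reads it with a plain TextInputFormat, hence the raw bytes. If the SequenceFile guess is right, a minimal, untested sketch of a read that may work once a Hadoop-2-compatible codec jar and the native LZO library are visible to both the driver and the executors:

    scala> import org.apache.hadoop.io.{LongWritable, Text}

    scala> val f = sc.sequenceFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
      classOf[LongWritable], classOf[Text])

    // Hadoop reuses Writable instances, so copy the values out
    // as Strings before collecting them to the driver.
    scala> f.map { case (_, v) => v.toString }.take(10).foreach(println)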