I actually never got this to work, which is part of the reason I filed that JIRA. Apart from using --jars when starting the shell, I don't have any more pointers for you. :(
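For completeness, the kind of invocation I had in mind looks roughly like the sketch below: shipping the jar to the executors with --jars and also putting it on the driver classpath, since the input format is instantiated driver-side as well. This is unverified on my end, and it reuses the path from your message below:

    bin/spark-shell \
      --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar \
      --driver-class-path /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar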
On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:
> Nicholas,
>
> Thanks!
>
> How do I make Spark assemble against a local version of Hadoop?
>
> I have 2.4.1 running on a test cluster, and I did
> "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly", but all it did was pull in
> hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
> HDFS). I am guessing my local version of the Hadoop libraries/jars is not
> used. Alternatively, how do I add hadoop-gpl-compression-0.1.0.jar
> (responsible for the LZO stuff) to this hand-assembled Spark?
>
> I am running the spark-shell like this:
>
> bin/spark-shell --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>
> and getting this:
>
> scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
> 14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with curMem=0, maxMem=311387750
> 14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 211.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:12
>
> scala> f.take(1)
> 14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
>         at com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
>
> which makes me think something is not linked properly (I am not a Java expert, unfortunately).
>
> Thanks!
> Ognen
>
>
> On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>
> If you're still seeing gibberish, it's because Spark is not using the
> LZO libraries properly. In your case, I believe you should be calling
> newAPIHadoopFile() instead of textFile().
>
> For example:
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
>
> On a side note, here's a related JIRA issue: SPARK-2394: Make it easier
> to read LZO-compressed files from EC2 clusters
> <https://issues.apache.org/jira/browse/SPARK-2394>
>
> Nick
>
>
> On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:
>> Hello,
>>
>> I have been trying to play with the Google ngram dataset provided by
>> Amazon in the form of LZO-compressed files.
>>
>> I am having trouble understanding what is going on ;). I have added the
>> compression jar and native library to the underlying Hadoop/HDFS
>> installation and restarted the name node and the datanodes. Spark can
>> obviously see the file, but I get gibberish on a read. Any ideas?
>>
>> See output below:
>>
>> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
>> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>>
>> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
>> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=311387750
>> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 296.8 MB)
>> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
>>
>> scala> f.take(10)
>> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15, took 0.419708348 s
>> res0: Array[String] = Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec... [remainder is unprintable binary output, truncated])
>>
>> Thanks!
>> Ognen
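A note on the IncompatibleClassChangeError above: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic symptom of code compiled against the Hadoop 1 APIs (where JobContext was a class) running on Hadoop 2 (where it became an interface). hadoop-gpl-compression-0.1.0 predates Hadoop 2, so getting it onto the classpath correctly will not be enough; the codec needs to be rebuilt against the Hadoop 2 APIs. One unverified route is the twitter/hadoop-lzo fork, which as far as I know ships the same com.hadoop.mapreduce.LzoTextInputFormat class but targets the newer APIs. A rough sketch, assuming a working C toolchain and the LZO development headers (e.g. lzo-devel) for the native part:

    git clone https://github.com/twitter/hadoop-lzo.git
    cd hadoop-lzo
    mvn clean package -DskipTests

The resulting jar under target/ would then be passed to --jars (and the driver classpath) in place of hadoop-gpl-compression-0.1.0.jar.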
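Separately, the gibberish from sc.textFile() is itself informative: it begins with a SEQ header naming org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, and com.hadoop.compression.lzo.LzoCodec, which suggests the file is a block-compressed Hadoop SequenceFile rather than a raw LZO text file. textFile() reads it with a plain TextInputFormat, hence the raw bytes. If the SequenceFile guess is right, a minimal, untested sketch of a read that may work once a Hadoop-2-compatible codec jar and the native LZO library are visible to both the driver and the executors:

    scala> import org.apache.hadoop.io.{LongWritable, Text}

    scala> val f = sc.sequenceFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
      classOf[LongWritable], classOf[Text])

    // Hadoop reuses Writable instances, so copy the values out
    // as Strings before collecting them to the driver.
    scala> f.map { case (_, v) => v.toString }.take(10).foreach(println)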