Nicholas, thanks nevertheless! I am going to spend some time to try and figure this out and report back :-)
Ognen

On 7/13/14, 7:05 PM, Nicholas Chammas wrote:

I actually never got this to work, which is part of the reason why I filed that JIRA. Apart from using |--jars| when starting the shell, I don’t have any more pointers for you. :(
On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski <ognen.duzlev...@gmail.com <mailto:ognen.duzlev...@gmail.com>> wrote:

    Nicholas,

    Thanks!

    How do I make Spark assemble against a local version of Hadoop?

    I have 2.4.1 running on a test cluster and I did
    "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly", but all that did was
    pull in the hadoop-2.4.1 dependencies via sbt (which is sufficient
    for using a 2.4.1 HDFS). I am guessing my local Hadoop
    libraries/jars are not used. Alternatively, how do I add
    hadoop-gpl-compression-0.1.0.jar (responsible for the LZO stuff)
    to this hand-assembled Spark?
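
    For reference, the kind of invocation I have in mind (untested; I am
    assuming spark-shell accepts --driver-class-path the same way
    spark-submit does, and that the native LZO library must still be on
    java.library.path on every worker) is roughly:

    SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly
    bin/spark-shell \
      --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar \
      --driver-class-path /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar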

    I am running the spark-shell like this:
    bin/spark-shell --jars
    /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar

    and getting this:

    scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])
    14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called
    with curMem=0, maxMem=311387750
    14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as
    values to memory (estimated size 211.0 KB, free 296.8 MB)
    f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable,
    org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile
    at <console>:12

    scala> f.take(1)
    14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
    java.lang.IncompatibleClassChangeError: Found interface
    org.apache.hadoop.mapreduce.JobContext, but class was expected
        at com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)

    which makes me think something is not linked properly (I am not a
    Java expert, unfortunately).
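
    From searching around, this error apparently means the input format
    class was compiled against the Hadoop 1.x API (where JobContext was
    a class) but is now running against Hadoop 2.x (where it is an
    interface), so I probably need a jar rebuilt against 2.4.1. As a
    fallback I may try the old "mapred" API, roughly as sketched below,
    assuming hadoop-gpl-compression-0.1.0.jar also ships a
    com.hadoop.mapred.DeprecatedLzoTextInputFormat class (I have not
    checked that it does):

    // Old-API variant: the mapred interfaces did not change between
    // Hadoop 1.x and 2.x, so this should sidestep the JobContext mismatch.
    val f2 = sc.hadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
      classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])
    f2.map(_._2.toString).take(1).foreach(println)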

    Thanks!
    Ognen



    On 7/13/14, 10:35 AM, Nicholas Chammas wrote:

    If you’re still seeing gibberish, it’s because Spark is not using
    the LZO libraries properly. In your case, I believe you should be
    calling |newAPIHadoopFile()| instead of |textFile()|.

    For example:

    
    sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])
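
    Note that this gives you an RDD of (LongWritable, Text) pairs rather
    than strings, so to eyeball the data you would then do something
    along these lines (untested, but the conversion is just toString on
    the Text values):

    val ngrams = sc.newAPIHadoopFile(
      "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])
    // The values are Hadoop Text objects; convert them to String before printing.
    ngrams.map(_._2.toString).take(10).foreach(println)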

    On a side note, here’s a related JIRA issue: SPARK-2394: Make it
    easier to read LZO-compressed files from EC2 clusters
    <https://issues.apache.org/jira/browse/SPARK-2394>

    Nick



    On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski
    <ognen.duzlev...@gmail.com <mailto:ognen.duzlev...@gmail.com>> wrote:

        Hello,

        I have been trying to play with the Google ngram dataset
        provided by Amazon in form of LZO compressed files.

        I am having trouble understanding what is going on ;). I have
        added the compression jar and native library to the underlying
        Hadoop/HDFS installation and restarted the name node and the
        datanodes. Spark can obviously see the file, but I get gibberish
        when I read it. Any ideas?

        See output below:

        14/07/13 14:39:19 INFO SparkContext: Added JAR
        file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
        at
        http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with
        timestamp 1405262359777
        14/07/13 14:39:20 INFO SparkILoop: Created spark context..
        Spark context available as sc.

        scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
        14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793)
        called with curMem=0, maxMem=311387750
        14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored
        as values to memory (estimated size 160.0 KB, free 296.8 MB)
        f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at
        textFile at <console>:12

        scala> f.take(10)
        14/07/13 14:39:43 INFO SparkContext: Job finished: take at
        <console>:15, took 0.419708348 s
        res0: Array[String] =
        
Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
        
?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
        �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �?
        �? �? �?
        
�?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
        �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...

        Thanks!
        Ognen




