Nicholas,
Thanks!
How do I make Spark assemble against a local version of Hadoop?
I have 2.4.1 running on a test cluster and I ran
"SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly", but all it did was pull in
the hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
HDFS). I am guessing my local Hadoop libraries/jars are not
used. Alternatively, how do I add hadoop-gpl-compression-0.1.0.jar
(responsible for the LZO support) to this hand-assembled Spark?
I am running the spark-shell like this:
bin/spark-shell --jars /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
and getting this:
scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
     |   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
     |   classOf[org.apache.hadoop.io.LongWritable],
     |   classOf[org.apache.hadoop.io.Text])
14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with
curMem=0, maxMem=311387750
14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values
to memory (estimated size 211.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable,
org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at
<console>:12
scala> f.take(1)
14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
        at com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
which makes me think something is not linked to something properly (not
a Java expert unfortunately).
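One way to check the guess about linking: as far as I know, org.apache.hadoop.mapreduce.JobContext is a class in Hadoop 1.x and an interface in Hadoop 2.x, so this error is what a jar compiled against the former would typically produce when run on the latter. The sketch below can be pasted into spark-shell to see which form is actually loaded. describeType is a hypothetical helper name, not part of Spark or Hadoop.

```scala
import scala.util.Try

// Hypothetical helper: reports whether a fully qualified type name resolves
// to an interface, to a class, or to nothing on the current classpath.
def describeType(name: String): String =
  Try(Class.forName(name)).toOption match {
    case Some(c) if c.isInterface => s"$name is an interface"
    case Some(_)                  => s"$name is a class"
    case None                     => s"$name is not on the classpath"
  }

// The type named in the IncompatibleClassChangeError above.
println(describeType("org.apache.hadoop.mapreduce.JobContext"))
```

If this reports an interface, the LZO jar would presumably need to be one compiled against Hadoop 2.x rather than the 0.1.0 build used here; that is an inference from the error message, not something confirmed in this thread.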
Thanks!
Ognen
On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
If you’re still seeing gibberish, it’s because Spark is not using the
LZO libraries properly. In your case, I believe you should be calling
newAPIHadoopFile() instead of textFile().
For example:
sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
On a side note, here’s a related JIRA issue: SPARK-2394: Make it
easier to read LZO-compressed files from EC2 clusters
<https://issues.apache.org/jira/browse/SPARK-2394>
Nick
On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski
<ognen.duzlev...@gmail.com> wrote:
Hello,
I have been trying to play with the Google ngram dataset provided
by Amazon in the form of LZO-compressed files.
I am having trouble understanding what is going on ;). I have
added the compression jar and native library to the underlying
Hadoop/HDFS installation and restarted the name node and the
datanodes. Spark can obviously see the file, but I get gibberish on
a read. Any ideas?
See output below:
14/07/13 14:39:19 INFO SparkContext: Added JAR
file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar
with timestamp 1405262359777
14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.
scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called
with curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12
scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at
<console>:15, took 0.419708348 s
res0: Array[String] =
Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????...
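A note on the output above: it begins with the bytes SEQ followed by the key class, the value class and the codec name, which is what the header of a Hadoop SequenceFile looks like. That is an observation about the printed bytes, not something confirmed elsewhere in this thread. A minimal sketch for checking the same thing on a local copy of the file (or just its first few bytes) follows. looksLikeSequenceFile is a hypothetical helper name, and the demo writes its own throwaway file so that it runs anywhere.

```scala
import java.io.{DataInputStream, EOFException, File, FileInputStream}
import java.nio.file.Files

// Hypothetical helper: true when the file starts with the three ASCII bytes
// "SEQ", the magic that Hadoop SequenceFiles begin with.
def looksLikeSequenceFile(path: String): Boolean = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val magic = new Array[Byte](3)
    in.readFully(magic) // throws EOFException when the file has fewer than 3 bytes
    new String(magic, "US-ASCII") == "SEQ"
  } catch {
    case _: EOFException => false
  } finally {
    in.close()
  }
}

// Demo on a throwaway file; for the real check, pass the path of a local
// copy of the data file instead.
val tmp = File.createTempFile("magic-demo", ".bin")
Files.write(tmp.toPath, "SEQ".getBytes("US-ASCII"))
println(looksLikeSequenceFile(tmp.getPath))
tmp.delete()
```

If the real file passes this check, a SequenceFile reader rather than a text input format may be the thing to try, with the LZO codec still needed for decompression.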
Thanks!
Ognen