Kryo serialization does not compress

2014-02-25 Thread pradeeps8
Hi All,

We are currently trying to benchmark the various cache options on RDDs with
respect to speed and efficiency.
The data that we are using consists mostly of floating-point numbers.

We have noticed that the memory consumption of the RDD for MEMORY_ONLY (519.1 MB)
and for MEMORY_ONLY_SER with Kryo serialization (511.5 MB) is almost identical.

Is this behavior expected?
We were under the impression that Kryo serialization is efficient and expected
the serialized RDD to be noticeably smaller.

Also, we have noticed that when we enable compression (LZ4) on RDDs, the memory
consumption for MEMORY_ONLY is the same with and without compression, i.e.
519.1 MB, whereas MEMORY_ONLY_SER (Kryo serialization) with compression
consumes only 386.5 MB.

Why doesn't enabling compression without serialization work for MEMORY_ONLY?
Is there anything else we need to do for MEMORY_ONLY to get it compressed?
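
For context, a minimal sketch of how these options are typically toggled
(assuming the Spark 0.9-era property names; benchmarkRdd is a placeholder for
the RDD whose storage footprint is being measured):

import org.apache.spark.storage.StorageLevel

// Minimal sketch, assuming Spark 0.9-era configuration properties.
// These must be set before the SparkContext is created (e.g. via SPARK_JAVA_OPTS
// or before instantiating the context in a standalone app).
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.rdd.compress", "true")   // compress persisted RDD partitions
// spark.io.compression.codec selects the codec used for those compressed blocks

// benchmarkRdd is a hypothetical name for the RDD being benchmarked.
val cached = benchmarkRdd.persist(StorageLevel.MEMORY_ONLY_SER)
cached.count()   // materialize the cache so the size shows up in the storage UI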

Thanks,
Pradeep





Re: Kryo serialization does not compress

2014-03-06 Thread pradeeps8
We are trying to use Kryo serialization, but with Kryo serialization turned on
the memory consumption does not change. We have tried this on multiple sets of
data.
We have also checked the Kryo logs and confirmed that Kryo is being used.

Can somebody please help us with this?

The script used is given below.

SCRIPT

import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.JavaConversions.mapAsScalaMap
import scala.collection.JavaConverters.asScalaBufferConverter
import scala.collection.mutable.Buffer
import scala.Array
import scala.math.Ordering.Implicits._

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.RangePartitioner
import org.apache.spark.HashPartitioner

// For Kryo logging
import com.esotericsoftware.minlog.Log
import com.esotericsoftware.minlog.Log._
Log.set(LEVEL_TRACE)

val query = "select array(level_1, level_2, level_3, level_4, level_5, level_6, level_7, " +
  "level_8, level_9, level_10, level_11, level_12, level_13, level_14, level_15, level_16, " +
  "level_17, level_18, level_19, level_20, level_21, level_22, level_23, level_24, level_25) " +
  "as unitids, class, cuts, type, data from table1 p join table2 b on (p.UnitId = b.unit_id) " +
  "where runid = 912 and b.snapshotid = 220 and p.UnitId = b.unit_id"

// Pull the query result out of Shark and convert the Java lists to Scala buffers.
val rows: RDD[((Buffer[Any], String, Buffer[Any]), (String, Buffer[Any]))] =
  sc.sql2rdd(query).map(row =>
    ((row.getList("unitids").asInstanceOf[java.util.List[Any]].asScala,
      row.getString("class"),
      row.getList("cuts").asInstanceOf[java.util.List[Any]].asScala),
     (row.getString("type"),
      row.getList("data").asInstanceOf[java.util.List[Any]].asScala)))

// Unwrap the "data" column from Hadoop/Hive float wrappers into a plain Array[Float].
var rows2Array: RDD[((Buffer[Any], String, Buffer[Any]), (String, Array[Float]))] =
  rows.map(row => (row._1, (row._2._1, row._2._2.map {
    case floatWritable: org.apache.hadoop.io.FloatWritable => floatWritable.get
    case lazyFloat: org.apache.hadoop.hive.serde2.`lazy`.LazyFloat => lazyFloat.getWritableObject().get
    case y => println("unknown data type " + y + " : "); 0f
  }.toArray)))

// Unwrap the unit ids from Hadoop/Hive long wrappers into a plain Array[Long].
var allArrays: RDD[((Array[Long], String, Buffer[Any]), (String, Array[Float]))] =
  rows2Array.map(row =>
    ((row._1._1.map {
        case longWritable: org.apache.hadoop.io.LongWritable => longWritable.get
        case lazyLong: org.apache.hadoop.hive.serde2.`lazy`.LazyLong => lazyLong.getWritableObject().get
        case x => println("unknown data type " + x + " : "); 0L
      }.toArray,
      row._1._2,
      row._1._3),
     row._2))

// Convert the "cuts" column to Array[String].
var dataRdd: RDD[((Array[Long], String, Array[String]), (String, Array[Float]))] =
  allArrays.map(row =>
    ((row._1._1,
      row._1._2,
      row._1._3.map {
        case str: String => str
        case x => println("unknown data type " + x + " : "); ""
      }.toArray),
     row._2))

// Repartition and cache the final RDD in serialized form, then force materialization.
dataRdd = dataRdd.partitionBy(new HashPartitioner(64)).persist(StorageLevel.MEMORY_ONLY_SER)

dataRdd.count()
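
In case it matters, this is roughly how a custom Kryo registrator would be wired
up for the element types used above (a minimal sketch; MyRegistrator is a
hypothetical name, and the properties must be set before the SparkContext starts):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator for the array types that make up dataRdd's records.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Array[Long]])
    kryo.register(classOf[Array[Float]])
    kryo.register(classOf[Array[String]])
  }
}

// Set before the SparkContext is created (e.g. before launching the shell):
// System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// System.setProperty("spark.kryo.registrator", "MyRegistrator")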







Re: Kryo serialization does not compress

2014-03-07 Thread pradeeps8
Hi Patrick,

Thanks for your reply.

I am guessing that even array types will be registered automatically. Is this
correct?

Thanks,
Pradeep





java.lang.ClassNotFoundException in spark 0.9.0, shark 0.9.0 (pre-release) and hadoop 2.2.0

2014-03-07 Thread pradeeps8
Hi,

We are currently trying to migrate to Hadoop 2.2.0 and have therefore installed
Spark 0.9.0 and the pre-release version of Shark 0.9.0.
When we execute the script (script.txt) we get the following error.
org.apache.spark.SparkException: Job aborted: Task 1.0:3 failed 4 times
(most recent failure: Exception failure: java.lang.ClassNotFoundException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Has anyone seen this error?
If so, could you please help me get it corrected?

Thanks,
Pradeep






Re: java.lang.ClassNotFoundException in spark 0.9.0, shark 0.9.0 (pre-release) and hadoop 2.2.0

2014-03-13 Thread pradeeps8
Hi All,

We have found the actual problem: it is with the getList method of the Row class.
Earlier, the Row class returned a java.util.List from getList, but the new
source code (Shark 0.9.0) returns a String.
This is the commit:
https://github.com/amplab/shark/commit/e80a2a158a888e9caa0ae053ed7f6f86fe01d3d3

We were wondering why this change was made.
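
For anyone hitting the same thing, a rough sketch of the kind of defensive
handling this change seems to require (the comma delimiter is an assumption;
the actual string format should be checked against the commit above):

import scala.collection.JavaConverters._

// Rough sketch: accept either return type of getList, since older Shark builds
// returned a java.util.List while the 0.9.0 source returns a String.
def asBuffer(raw: Any): Seq[Any] = raw match {
  case l: java.util.List[_] => l.asInstanceOf[java.util.List[Any]].asScala
  case s: String            => s.split(",").toSeq   // assumed delimiter
  case other                => sys.error("unexpected getList result: " + other)
}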

Thanks,
Pradeep





Re: SequenceFileRDDFunctions cannot be used outside of spark package

2014-03-28 Thread pradeeps8
Hi Aureliano,

I followed this thread to create a custom saveAsObjectFile. 
The following is the code.
new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
  saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray))
         .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
).saveAsSequenceFile("objFiles")

But, I get the following error when executed.
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any idea about this error?
Or is there anything wrong with that line of code?
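
For comparison, the same write path can also be sketched through the implicit
conversion that SparkContext provides, instead of constructing
SequenceFileRDDFunctions directly (the serialize helper shown here is a
hypothetical Java-serialization stand-in for whatever serialize does in the
snippet above):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._   // brings rddToSequenceFileRDDFunctions into scope

// Hypothetical stand-in for the serialize helper used above (plain Java serialization).
def serialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray
}

saveRDD
  .mapPartitions(iter => iter.grouped(10).map(_.toArray))
  .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
  .saveAsSequenceFile("objFiles")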

Thanks,
Pradeep






Re: SequenceFileRDDFunctions cannot be used outside of spark package

2014-03-31 Thread pradeeps8
Hi Sonal,

There are no custom objects in saveRDD; it is of type RDD[(String, String)].
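
For reference, a minimal sketch of writing such an RDD[(String, String)]
straight to a sequence file via the String-to-Text conversions imported from
SparkContext (the output path is made up):

import org.apache.spark.SparkContext._   // implicit Writable conversions for String keys/values

// saveRDD: RDD[(String, String)] as described above; "pairs-out" is a hypothetical path.
saveRDD.saveAsSequenceFile("pairs-out")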

Thanks,
Pradeep 


