Re: org/I0Itec/zkclient/serialize/ZkSerializer ClassNotFound

2014-10-21 Thread Akhil Das
You can add this jar
to the classpath to get rid of this. If you are hitting further exceptions
like ClassNotFound for metrics* etc., then make sure you have all these jars
in the classpath:

SPARK_CLASSPATH=$SPARK_CLASSPATH:/root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar:/root/akhld/spark/lib/kafka_2.10-0.8.0.jar:/root/akhld/spark/lib/zkclient-0.3.jar:/root/akhld/spark/lib/metrics-core-2.2.0.jar
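If you are launching through spark-submit instead, the same jars can be shipped
with --jars (a sketch; the application class and jar names below are placeholders,
only the library paths come from the line above):

./bin/spark-submit \
  --jars /root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar,/root/akhld/spark/lib/kafka_2.10-0.8.0.jar,/root/akhld/spark/lib/zkclient-0.3.jar,/root/akhld/spark/lib/metrics-core-2.2.0.jar \
  --class com.example.YourStreamingApp \
  your-streaming-app.jar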

Thanks
Best Regards

On Tue, Oct 21, 2014 at 10:45 AM, skane  wrote:

> I'm having the same problem with Spark 1.0.0. I got the "
> JavaKafkaWordCount.java" example working on my workstation running Spark
> locally after doing a build, but when I tried to get the example running on
> YARN, I got the same error. I used the "uber jar" that was created during
> the build process for the examples, and I confirmed that
> "org/I0Itec/zkclient/serialize/ZkSerializer" is in the uber jar.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/org-I0Itec-zkclient-serialize-ZkSerializer-ClassNotFound-tp15919p16897.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode

2014-10-21 Thread Akhil Das
What about start-all.sh or start-slaves.sh?
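(For reference, a rough sketch of that route; the hostnames below are placeholders.
List the worker hosts in conf/slaves on the master node and launch them all from there.)

# conf/slaves -- one worker hostname per line
worker1.example.com
worker2.example.com

# then, from the master node:
./sbin/start-slaves.sh    # workers only
./sbin/start-all.sh       # or master plus workers in one go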

Thanks
Best Regards

On Tue, Oct 21, 2014 at 10:25 AM, Soumya Simanta 
wrote:

> I'm working on a cluster where I need to start the workers separately and
> connect them to a master.
>
> I'm following the instructions here and using branch-1.1
>
> http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
>
> and I can start the master using
> ./sbin/start-master.sh
>
> When I try to start the slave/worker using
> ./sbin/start-slave.sh it doesn't work. The logs say that it needs the
> master.
> When I provide
> ./sbin/start-slave.sh spark://:7077 it still doesn't work.
>
> I can start the worker using the following command (as described in the
> documentation).
>
> ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
>
> Was wondering why start-slave.sh is not working?
>
> Thanks
> -Soumya
>
>


Re: default parallelism bug?

2014-10-21 Thread Olivier Girardot
Hi,
What do you mean by pretty small? How big is your file?

Regards,

Olivier.

2014-10-21 6:01 GMT+02:00 Kevin Jung :

> I use Spark 1.1.0 and set these options to spark-defaults.conf
> spark.scheduler.mode FAIR
> spark.cores.max 48
> spark.default.parallelism 72
>
> Thanks,
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/default-parallelism-bug-tp16787p16894.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Olivier Girardot
Could you please provide some of your code, and the sample json files you
use?

Regards,

Olivier.

2014-10-21 5:45 GMT+02:00 tridib :

> Hello Experts,
> I have two tables built using jsonFile(). I can successfully run join queries
> on these tables. But once I cacheTable(), all join queries fail?
>
> Here is the stack trace:
> java.lang.NullPointerException
> at
>
> org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryColumnarTableScan.scala:43)
> at
>
> org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:42)
> at
>
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$statistics$1.apply(LogicalPlan.scala:50)
> at
>
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$statistics$1.apply(LogicalPlan.scala:50)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
>
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.statistics$lzycompute(LogicalPlan.scala:50)
> at
>
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.statistics(LogicalPlan.scala:44)
> at
>
> org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:83)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
>
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:268)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
>
> org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
>
> org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
>
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
> at
>
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
> at
>
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
> at
>
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
> at $iwC$$iwC$$iwC.<init>(<console>:20)
> at $iwC$$iwC.<init>(<console>:22)
> at $iwC.<init>(<console>:24)
> at <init>(<console>:26)
> at .<init>(<console>:30)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
> at
> org.apache.spark.repl.SparkIMain$Re

Re: Convert Iterable to RDD

2014-10-21 Thread Olivier Girardot
I don't think this is provided out of the box, but you can use toSeq on
your Iterable and if the Iterable is lazy, it should stay that way for the
Seq.
And then you can use sc.parallelize(my-iterable.toSeq) so you'll have your
RDD.

For the Iterable[Iterable[T]] you can flatten it and then create your RDD
from the corresponding Iterable.
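A minimal sketch of both cases (Scala; assumes sc is an existing SparkContext):

import org.apache.spark.rdd.RDD

val it: Iterable[Int] = Seq(1, 2, 3)
val rdd: RDD[Int] = sc.parallelize(it.toSeq)                    // Iterable[T] -> RDD[T]

val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3, 4))
val flat: RDD[Int] = sc.parallelize(nested.flatten.toSeq)       // Iterable[Iterable[T]] -> RDD[T]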

Regards,

Olivier.


2014-10-21 5:07 GMT+02:00 Dai, Kevin :

>  In addition, how to convert Iterable[Iterable[T]] to RDD[T]
>
>
>
> Thanks,
>
> Kevin.
>
>
>
> *From:* Dai, Kevin [mailto:yun...@ebay.com]
> *Sent:* October 21, 2014 10:58
> *To:* user@spark.apache.org
> *Subject:* Convert Iterable to RDD
>
>
>
> Hi, All
>
>
>
> Is there any way to convert iterable to RDD?
>
>
>
> Thanks,
>
> Kevin.
>


Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
If you already know your keys, the best way would be to "extract"
one RDD per key (it would not bring the content back to the master and you
can take advantage of the caching features) and then call
registerTempTable per key.
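A rough sketch of that first approach (assuming the keys are known up front and
an RDD of the KV case class from the quoted code below, so the createSchemaRDD
implicit applies; the key values are made up):

val knownKeys = Seq("keyA", "keyB")
val kvs = context.textFile("logs", 10).map { line =>
  val Array(key, id, value) = line split ' '
  KV(key, id, value)
}
knownKeys.foreach { k =>
  // each filter stays distributed; nothing is pulled back to the master
  kvs.filter(_.key == k).registerTempTable("KVS_" + k)
}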

But I'm guessing you don't know the keys in advance, and in that case I
think putting everything in different tables becomes very confusing.
First of all, how would you query it afterwards?

Regards,

Olivier.

2014-10-20 13:02 GMT+02:00 critikaled :

> Hi, I have an RDD which I want to register as multiple tables based on key.
>
> 
> val context = new SparkContext(conf)
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(context)
> import sqlContext.createSchemaRDD
>
> case class KV(key:String,id:String,value:String)
> val logsRDD = context.textFile("logs", 10).map{line=>
>   val Array(key,id,value) = line split ' '
>   (key,id,value)
> }.registerTempTable("KVS")
>
> I want to store the above information in multiple tables based on key
> without bringing the entire data back to the master.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread lokeshkumar
Hi All, 

I am trying to run the spark example JavaDecisionTree code using some
external data set. 
It works for certain datasets only, with specific maxBins and maxDepth
settings. Even for a working dataset, if I add a new data item I get an
ArrayIndexOutOfBoundsException; I get the same exception for the first case
as well (after changing maxBins and maxDepth). I am not sure what is wrong
here; can anyone please explain this?

Exception stacktrace: 

14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage
7.0 (TID 13) 
java.lang.ArrayIndexOutOfBoundsException: 6301 
at
org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
 
at
org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
 
at
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
 
at
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 
at
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 
at
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
at
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) 
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) 
at
org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) 
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) 
at
org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
 
at
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) 
at
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) 
at
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
at
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) 
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
at org.apache.spark.scheduler.Task.run(Task.scala:54) 
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0
(TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301 
   
org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
 
   
org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
 
   
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
 
   
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 
   
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 
   
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
   
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
scala.collection.Iterator$class.foreach(Iterator.scala:727) 
   
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) 
   
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) 
   
org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) 
   
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) 
   
org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
 
   
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) 
   
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) 
   
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
   
org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
   
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
org.

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-21 Thread Fengyun RAO
Thanks, Guillaume,

Below is when the exception happens, nothing has spilled to disk yet.

And there isn't a join, but a partitionBy and groupBy action.

Actually if numPartitions is small, it succeeds, while if it's large, it
fails.

Partitioning was simply done by:

override def getPartition(key: Any): Int = {
  (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
}

Task table from the web UI (Index / ID / Attempt / Status / Locality Level / Executor /
Launch Time / Duration / GC Time / Shuffle Read / Shuffle Spill (Memory) / Shuffle Spill (Disk)):
99 / 173 / 0 / FAILED / PROCESS_LOCAL / gs-server-1000 / 2014/10/21 17:29:56 / 1.6 min / 30 s / 43.6 MB / 0.0 B / 0.0 B

com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.SpecificMutableRow)
otherElements (org.apache.spark.util.collection.CompactBuffer)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)

com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)

com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)

org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)

org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:90)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:89)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

99 / 175 / 1 / FAILED / PROCESS_LOCAL / gs-server-1000 / 2014/10/21 17:31:29 / 2.6 min / 39 s / 42.7 MB / 0.0 B / 0.0 B

com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.SpecificMutableRow)
otherElements (org.apache.spark.util.collection.CompactBuffer)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)

com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)

com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
com.eso

[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Hi!

The RANK function is available in hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception (full
stacktrace below):
java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
cast to
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer

Is this function supposed to be available?

Thanks

P.

---


java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
cast to
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
at org.apache.spark.sql.hive.HiveUdafFunction.<init>(hiveUdfs.scala:334)
at
org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233)
at
org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207)
at
org.apache.spark.sql.execution.Aggregate.org$apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Hi,
Is there a simple way to run spark sql queries against Sql Server databases? Or 
are we limited to running sql and doing sc.Parallelize()? Being able to query 
small amounts of lookup info directly from spark can save a bunch of annoying 
etl, and I'd expect Spark Sql to have some way of doing this.

Cheers,
Ashic.
  

Custom s3 endpoint

2014-10-21 Thread bobrik
I have an s3-compatible service and I'd like to have access to it in Spark.

From what I have gathered, I need to add
"s3service.s3-endpoint=" to a jets3t.properties file on the
classpath. I'm not a Java programmer and I'm not sure where to put it in a
hello-world example.

I managed to make it work with "local" master with this hack:

Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME).setProperty("s3service.s3-endpoint",
"");

But this property fails to propagate when I run Spark on a Mesos cluster.
Putting a correct jets3t.properties in SPARK_HOME/conf also helps only in
local master mode.

Can anyone help with this issue? Where should I put my jets3t.properties in a
Java project? It would be super-awesome if Spark could pick up the s3 endpoint
from env variables like it does with s3 credentials.

Thanks in advance!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Custom-s3-endpoint-tp16911.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Cheng Lian
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL 
server. Currently Spark SQL can't run queries against SQL server. The 
foreign data source API planned in Spark 1.2 can make this possible.
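For reference, a rough JdbcRDD sketch (Scala; the URL, table and query are
placeholders, and a SQL Server JDBC driver such as jTDS or Microsoft's must be
on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=u;password=p"   // placeholder
val lookup = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  "SELECT id, name FROM lookup_table WHERE id >= ? AND id <= ?",   // the two ?s are required bounds
  1, 10000, 4,                                                     // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt(1), rs.getString(2)))
lookup.take(10).foreach(println)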


On 10/21/14 6:26 PM, Ashic Mahtab wrote:

Hi,
Is there a simple way to run spark sql queries against Sql Server 
databases? Or are we limited to running sql and doing 
sc.Parallelize()? Being able to query small amounts of lookup info 
directly from spark can save a bunch of annoying etl, and I'd expect 
Spark Sql to have some way of doing this.


Cheers,
Ashic.




create a Row Matrix

2014-10-21 Thread viola
Hi, 

I am VERY new to Spark and MLlib and ran into a couple of problems while
trying to reproduce some examples. I am aware that this is a very simple
question, but could somebody please give me an example of how to create a
RowMatrix in Scala with the following entries:
[1 2
 3 4]
I would like to apply an SVD to it.

Thank you very much! 
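For reference, a minimal sketch (Spark 1.1 MLlib, assuming sc is an existing
SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// one dense vector per matrix row, i.e. [1 2; 3 4]
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

val svd = mat.computeSVD(2, computeU = true)   // k = 2 singular values, also compute U
println(svd.s)                                 // the singular values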





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address
a where p.id = a.id limit 10").collect.foreach(println)

person.json
{"id":"1","name":"Mr. X"}

address.json
{"city":"Earth","id":"1"}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why fetch failed

2014-10-21 Thread marylucy
Thank you,
it works! The Akka timeout may be the bottleneck in my system.

> On Oct 20, 2014, at 17:07, "Akhil Das"  wrote:
> 
> I used to hit this issue when my data size was too large and the number of 
> partitions was too large ( > 1200 ). I got rid of it by:
> 
> - Reducing the number of partitions
> - Setting the following while creating the sparkContext:
>   .set("spark.rdd.compress","true")
>   .set("spark.storage.memoryFraction","1")
>   .set("spark.core.connection.ack.wait.timeout","600")
>   .set("spark.akka.frameSize","50")
> 
> 
> Thanks
> Best Regards
> 
>> On Sun, Oct 19, 2014 at 6:52 AM, marylucy  wrote:
>> When doing groupBy for big data, maybe 500 GB, some partition tasks
>> succeed and some partition tasks get a fetch failed error. The Spark system
>> retries the previous stage, but it always fails.
>> 6 computers : 384g
>> Worker:40g*7 for one computer
>> 
>> Can anyone tell me why fetch failed???
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: why fetch failed

2014-10-21 Thread marylucy
Thanks,
I need to check whether Spark 1.1.0 contains it.

> On Oct 21, 2014, at 0:01, "DB Tsai"  wrote:
> 
>  I ran into the same issue when the dataset is very big. 
> 
> Marcelo from Cloudera found that it may be caused by SPARK-2711, so their 
> Spark 1.1 release reverted SPARK-2711, and the issue is gone. See  
> https://issues.apache.org/jira/browse/SPARK-3633 for detail. 
> 
> You can checkout Cloudera's version here 
> https://github.com/cloudera/spark/tree/cdh5-1.1.0_5.2.0
> 
> PS, I don't test it yet, but will test it in the following couple days, and 
> report back.
> 
> 
> Sincerely,
> 
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
> 
>> On Sat, Oct 18, 2014 at 6:22 PM, marylucy  wrote:
>> When doing groupBy for big data, maybe 500 GB, some partition tasks
>> succeed and some partition tasks get a fetch failed error. The Spark system
>> retries the previous stage, but it always fails.
>> 6 computers : 384g
>> Worker:40g*7 for one computer
>> 
>> Can anyone tell me why fetch failed???
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


RE: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Thanks. Didn't know about JdbcRDD... should do nicely for now. The foreign data
source API looks interesting...


Date: Tue, 21 Oct 2014 20:33:03 +0800
From: lian.cs@gmail.com
To: as...@live.com; user@spark.apache.org
Subject: Re: Getting Spark SQL talking to Sql Server


  

  
  
Instead of using Spark SQL, you can use JdbcRDD to extract data from
SQL server. Currently Spark SQL can't run queries against SQL
server. The foreign data source API planned in Spark 1.2 can make
this possible.

On 10/21/14 6:26 PM, Ashic Mahtab wrote:

Hi,
Is there a simple way to run spark sql queries against Sql
Server databases? Or are we limited to running sql and doing
sc.Parallelize()? Being able to query small amounts of lookup
info directly from spark can save a bunch of annoying etl, and
I'd expect Spark Sql to have some way of doing this.

Cheers,
Ashic.

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-21 Thread Arian Pasquali
That's true Guillaume.
I'm currently aggregating documents considering a week as time range.
I will have to make it daily and aggregate the results later.

thanks for your hints anyway




Arian Pasquali
http://about.me/arianpasquali

2014-10-20 13:53 GMT+01:00 Guillaume Pitel :

>  Hi,
>
> The array size you (or the serializer) tries to allocate is just too big
> for the JVM. No configuration can help :
>
> https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit
>
> The only option is to split your problem further by increasing parallelism.
>
> Guillaume
>
> Hi,
> I’m using Spark 1.1.0 and I’m having some issues setting up memory options.
> I get “Requested array size exceeds VM limit” and I’m probably missing
> something regarding memory configuration.
>
>  My server has 30G of memory and these are my current settings.
>
>  ## this one seems to be deprecated
>  export SPARK_MEM='25g'
>
>  ## the worker memory option seems to be the memory for each worker (by
> default we have a worker for each core)
> export SPARK_WORKER_MEMORY='5g'
>
>  I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS,
> but I’m not quite sure how.
> I have tried some different options like the following, but I still
> couldn’t make it right:
>
>  export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'
> export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'
>
>  Does anyone have any idea how I can approach this?
>
>
>
>
>  14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> Getting 1566 non-empty blocks out of 1566 blocks
> 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> Started 0 remote fetches in 4 ms
> 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory
> map of 3925 MB to disk (1 time so far)
> 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory
> map of 3925 MB to disk (2 times so far)
> 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 1566)
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught
> exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140
>
>
>  Arian
>
>
>
> --
>[image: eXenSa]
>  *Guillaume PITEL, Président*
> +33(0)626 222 431
>
> eXenSa S.A.S. 
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>


Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark!  I found out why my RDDs weren't coming through in my spark
stream.

It turns out that onStart() needs to return, it seems; i.e. you
need to launch the worker part of your
start process in a thread. For example:

def onStartMock(): Unit = {
  val future = new Thread(new Runnable() {
    def run() {
      for (x <- 1 until 10) {
        val newMem = Runtime.getRuntime.freeMemory() / 12188091
        if (newMem != lastMem) {
          System.out.println("in thread : " + newMem)
        }
        lastMem = newMem
        store(mockStatus)
      }
    }
  })
  future.start()  // onStart() returns right away; the thread keeps calling store()
}

Hope that helps somebody in the same situation. FYI, it's in the docs :)

 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends
NetworkReceiver[String](storageLevel) {
 *  def onStart() {
 *  // Setup stuff (start threads, open sockets, etc.) to start
receiving data.
 *  // Must start new thread to receive data, as onStart() must be
non-blocking.
 *
 *  // Call store(...) in those threads to store received data into
Spark's memory.
 *
 *  // Call stop(...), restart(...) or reportError(...) on any
thread based on how
 *  // different errors needs to be handled.
 *
 *  // See corresponding method documentation for more details
 *  }
 *
 *  def onStop() {
 *  // Cleanup stuff (stop threads, close sockets, etc.) to stop
receiving data.
 *  }
 *  }
 * }}}


Re: How do you write a JavaRDD into a single file

2014-10-21 Thread Steve Lewis
Collect will store the entire output in a List in memory. This solution is
acceptable for "Little Data" problems although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.

Most problems which benefit from Spark are large enough that even the data
assigned to a single partition will not fit into memory.

In my special case the output now is in the 0.5 - 4 GB range but in the
future might get to 4 times that size - something a single machine could
write but not hold at one time. I find that for most problems a file like
Part-0001 is not what the next step wants to use; the minute a step is
required to further process that file (even to move and rename it), there is
little reason not to let the Spark code write what is wanted in the first
place.

I like the solution of using toLocalIterator and writing my own file
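A minimal sketch of that approach (Scala; the output path is a placeholder and
rdd is assumed to be an RDD[String]):

import java.io.PrintWriter

val out = new PrintWriter("/data/output/result.txt")   // hypothetical local path
try {
  // toLocalIterator streams one partition at a time to the driver,
  // so only a single partition has to fit in driver memory
  rdd.toLocalIterator.foreach(line => out.println(line))
} finally {
  out.close()
}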


Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib,

I changed SQLContext to HiveContext and it started working. These are steps
I used.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address = sqlContext.jsonFile("json/address.json")
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("select p.id,p.name,a.city from person p join
address a on (p.id = a.id)").collect.foreach(println)


Rishi@InfoObjects

*Pure-play Big Data Consulting*


On Tue, Oct 21, 2014 at 5:47 AM, tridib  wrote:

> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val personPath = "/hdd/spark/person.json"
> val person = sqlContext.jsonFile(personPath)
> person.printSchema()
> person.registerTempTable("person")
> val addressPath = "/hdd/spark/address.json"
> val address = sqlContext.jsonFile(addressPath)
> address.printSchema()
> address.registerTempTable("address")
> sqlContext.cacheTable("person")
> sqlContext.cacheTable("address")
> val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p,
> address
> a where p.id = a.id limit 10").collect.foreach(println)
>
> person.json
> {"id":"1","name":"Mr. X"}
>
> address.json
> {"city":"Earth","id":"1"}
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Hmm... I thought HiveContext would only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.

Thanks & Regards
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16924.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Cassandra connector issue

2014-10-21 Thread Ankur Srivastava
Hi,

I am creating a Cassandra Java RDD and transforming it using the where
clause.

It works fine when I run it outside of mapValues, but when I put the code
inside mapValues I get an error while creating the transformation.

Below is my sample code:

CassandraJavaRDD cassandraRefTable = javaFunctions(sc)
    .cassandraTable("reference_data", "dept_reference_data", ReferenceData.class);

JavaPairRDD joinedRdd = rdd.mapValues(new Function() {

  public Employee call(Employee employee) throws Exception {
    ReferenceData data = null;
    if (employee.getDepartment() != null) {
      data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first();
      System.out.println(data.toCSV());
    }
    if (data != null) {
      // call setters on employee
    }
    return employee;
  }
});

I get this error:

java.lang.NullPointerException

at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)

at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47)

at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70)

at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77
)

 at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(
CassandraJavaRDD.java:54)


Thanks for help!!



Regards

Ankur


Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Michael Armbrust
No, analytic and window functions do not work yet.

On Tue, Oct 21, 2014 at 3:00 AM, Pierre B <
pierre.borckm...@realimpactanalytics.com> wrote:

> Hi!
>
> The RANK function is available in hive since version 0.11.
> When trying to use it in SparkSQL, I'm getting the following exception
> (full
> stacktrace below):
> java.lang.ClassCastException:
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
> cast to
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
>
> Is this function supposed to be available?
>
> Thanks
>
> P.
>
> ---
>
>
> java.lang.ClassCastException:
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
> cast to
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
> at
> org.apache.spark.sql.hive.HiveUdafFunction.<init>(hiveUdfs.scala:334)
> at
> org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233)
> at
> org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207)
> at
> org.apache.spark.sql.execution.Aggregate.org
> $apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97)
> at
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
> at
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


disk-backing pyspark rdds?

2014-10-21 Thread Eric Jonas
Hi All!
I'm getting my feet wet with pySpark for the fairly boring case of
doing parameter sweeps for Monte Carlo runs. Each of my functions runs for
a very long time (2h+) and returns numpy arrays on the order of ~100 MB.
That is, my spark applications look like:

def foo(x):
    np.random.seed(x)
    eat_2GB_of_ram()
    take_2h()
    return my_100MB_array

sc.parallelize(np.arange(100)).map(foo).saveAsPickleFile("s3n://blah...")

The resulting rdds will most likely not fit in memory but for this use case
I don't really care. I know I can persist RDDs, but is there any way to
by-default disk-back them (something analogous to mmap?) so that they don't
create memory pressure in the system at all? With compute taking this long,
the added overhead of disk and network IO is quite minimal.

Thanks!

...Eric Jonas


stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
What could cause this type of 'stage failure'? Thanks!

This is a simple PySpark script to list data in HBase.
command line: ./spark-submit --driver-class-path
~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py 

14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on ip-***.ec2.internal:35201 (size: 1470.0 B, free: 265.4 MB)
14/10/21 17:53:50 INFO BlockManagerMaster: Updated info of block
broadcast_2_piece0
14/10/21 17:53:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0
(MappedRDD[1] at map at PythonHadoopUtil.scala:185)
14/10/21 17:53:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:34050/user/Executor#681287499]
with ID 0
14/10/21 17:53:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:47483/user/Executor#-936252397]
with ID 1
14/10/21 17:53:53 INFO BlockManagerMasterActor: Registering block manager
ip-2.internal:49236 with 3.1 GB RAM
14/10/21 17:53:54 INFO BlockManagerMasterActor: Registering block manager
ip-.ec2.internal:36699 with 3.1 GB RAM
14/10/21 17:53:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-.ec2.internal): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor ip-.internal: java.lang.IllegalStateException (unread block data)
[duplicate 1]
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
2, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on
executor ip-.internal: java.lang.IllegalStateException (unread block data)
[duplicate 2]
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
3, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on
executor ip-2.internal: java.lang.IllegalStateException (unread block data)
[duplicate 3]
14/10/21 17:53:54 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/10/21 17:53:54 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool 
14/10/21 17:53:54 INFO TaskSchedulerImpl: Cancelling stage 0
14/10/21 17:53:54 INFO DAGScheduler: Failed to run first at
SerDeUtil.scala:70
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 17, in 
conf=conf2)
  File "/root/spark/python/pyspark/context.py", line 471, in newAPIHadoopRDD
jconf, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, ip-internal): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputSt

Re: stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
maybe set up a hbase.jar in the conf?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-Task-0-in-stage-0-0-failed-4-times-tp16928p16929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-21 Thread Terry Siu
Just to follow up, the queries worked against master and I got my whole flow 
rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with 
the next release of CDH5  :P

-Terry

From: Terry Siu <terry@smartfocus.com>
Date: Monday, October 20, 2014 at 12:22 PM
To: Michael Armbrust <mich...@databricks.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Hi Michael,

Thanks again for the reply. Was hoping it was something I was doing wrong in 
1.1.0, but I’ll try master.

Thanks,
-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 20, 2014 at 12:11 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Have you tried this on master?  There were several problems with resolution of 
complex queries that were registered as tables in the 1.1.0 release.

On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu <terry@smartfocus.com> wrote:
Hi all,

I’m getting a TreeNodeException for unresolved attributes when I do a simple 
select from a schemaRDD generated by a join in Spark 1.1.0. A little background 
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join 
them, and then perform multiple INSERT-SELECT with GROUP BY to write back out 
to a Hive rollup table that has two partitions. This task is an effort to 
simulate the unsupported GROUPING SETS functionality in SparkSQL.

In my first attempt, I got really close using  SchemaRDD.groupBy until I 
realized that SchemaRDD.insertTo API does not support partitioned tables yet. 
This prompted my second attempt to pass in SQL to the HiveContext.sql API 
instead.

Here’s a rundown of the commands I executed on the spark-shell:


val hc = new HiveContext(sc)

hc.setConf("spark.sql.hive.convertMetastoreParquet", "true”)

hc.setConf("spark.sql.parquet.compression.codec", "snappy”)


// For implicit conversions to Expression

val sqlContext = new SQLContext(sc)

import sqlContext._


val segCusts = hc.hql("select …")

val segTxns = hc.hql("select …")


val sc = segCusts.as('sc)

val st = segTxns.as('st)


// Join the segCusts and segTxns tables

val rup = sc.join(st, Inner, 
Some("sc.segcustomerid".attr==="st.customerid".attr))

rup.registerAsTable("rupbrand")



If I do a printSchema on the rup, I get:

root

 |-- segcustomerid: string (nullable = true)

 |-- sales: double (nullable = false)

 |-- tx_count: long (nullable = false)

 |-- storeid: string (nullable = true)

 |-- transdate: long (nullable = true)

 |-- transdate_ts: string (nullable = true)

 |-- transdate_dt: string (nullable = true)

 |-- unitprice: double (nullable = true)

 |-- translineitem: string (nullable = true)

 |-- offerid: string (nullable = true)

 |-- customerid: string (nullable = true)

 |-- customerkey: string (nullable = true)

 |-- sku: string (nullable = true)

 |-- quantity: double (nullable = true)

 |-- returnquantity: double (nullable = true)

 |-- channel: string (nullable = true)

 |-- unitcost: double (nullable = true)

 |-- transid: string (nullable = true)

 |-- productid: string (nullable = true)

 |-- id: string (nullable = true)

 |-- campaign_campaigncost: double (nullable = true)

 |-- campaign_begindate: long (nullable = true)

 |-- campaign_begindate_ts: string (nullable = true)

 |-- campaign_begindate_dt: string (nullable = true)

 |-- campaign_enddate: long (nullable = true)

 |-- campaign_enddate_ts: string (nullable = true)

 |-- campaign_enddate_dt: string (nullable = true)

 |-- campaign_campaigntitle: string (nullable = true)

 |-- campaign_campaignname: string (nullable = true)

 |-- campaign_id: string (nullable = true)

 |-- product_categoryid: string (nullable = true)

 |-- product_company: string (nullable = true)

 |-- product_brandname: string (nullable = true)

 |-- product_vendorid: string (nullable = true)

 |-- product_color: string (nullable = true)

 |-- product_brandid: string (nullable = true)

 |-- product_description: string (nullable = true)

 |-- product_size: string (nullable = true)

 |-- product_subcategoryid: string (nullable = true)

 |-- product_departmentid: string (nullable = true)

 |-- product_productname: string (nullable = true)

 |-- product_categoryname: string (nullable = true)

 |-- product_vendorname: string (nullable = true)

 |-- product_sku: string (nullable = true)

 |-- product_subcategoryname: string (nullable = true)

 |-- product_status: string (nullable = true)

 |-- product_departmentname: string (nullable = true)

 |-- product_style: string (nullable = true)

 |-- product_id: string (nullable = true)

 |-- customer_lastname: string (nullable = true)

 |-- customer_familystatus: string (nullable = true)

 |-- customer_customertype: string (nullable = true)

 |-- customer_c

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Michael Armbrust
> Hmm... I thought HiveContext will only worki if Hive is present. I am
> curious
> to know when to use HiveContext and when to use SqlContext.
>

http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started

TLDR; Always use HiveContext if your application does not have a dependency
conflict with Hive jars. :)


How to set hadoop native library path in spark-1.1

2014-10-21 Thread Pradeep Ch
Hi all,

Can anyone tell me how to set the native library path in Spark?

Right now I am setting it using the "SPARK_LIBRARY_PATH" environment variable
in spark-env.sh, but still no success.

I am still seeing this in spark-shell:

NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable


Thanks,
Pradeep
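(For what it's worth, not from this thread: in Spark 1.x the settings that
usually cover this are the extra library path options. The native-library
directory below is a placeholder for wherever your Hadoop native libs actually
live.)

# spark-defaults.conf
spark.driver.extraLibraryPath    /path/to/hadoop/lib/native
spark.executor.extraLibraryPath  /path/to/hadoop/lib/native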


Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Thank for pointing that out.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Oh - and one other note on this, which appears to be the case.

If, in your stream foreachRDD implementation, you do something stupid
(like call rdd.count()):

tweetStream.foreachRDD((rdd, lent) => {
  tweetStream.repartition(1)
  numTweetsCollected += 1
  // val count = rdd.count()  // DON'T DO THIS!
})

You can also get stuck in a situation where your RDD processor blocks
infinitely.

And for twitter specific stuff, make sure to look at modifying the
TwitterInputDStream class
so that it implements the stuff from SPARK-2464, which can lead to infinite
stream reopening as well.



On Tue, Oct 21, 2014 at 11:02 AM, jay vyas 
wrote:

> Hi Spark !  I found out why my RDD's werent coming through in my spark
> stream.
>
> It turns out you need the onStart()  needs to return , it seems - i.e. you
> need to launch the worker part of your
> start process in a thread.  For example
>
> def onStartMock():Unit ={
>   val future = new Thread(new Runnable() {
> def run() {
>   for(x <- 1 until 10) {
> val newMem = Runtime.getRuntime.freeMemory()/12188091;
> if(newMem != lastMem){
>   System.out.println("in thread : " + newMem);
> }
> lastMem=newMem;
> store(mockStatus);
>   }
> }});
>
> Hope that helps somebody in the same situation.  FYI Its in the docs :)
>
>  * {{{
>  *  class MyReceiver(storageLevel: StorageLevel) extends
> NetworkReceiver[String](storageLevel) {
>  *  def onStart() {
>  *  // Setup stuff (start threads, open sockets, etc.) to start
> receiving data.
>  *  // Must start new thread to receive data, as onStart() must be
> non-blocking.
>  *
>  *  // Call store(...) in those threads to store received data
> into Spark's memory.
>  *
>  *  // Call stop(...), restart(...) or reportError(...) on any
> thread based on how
>  *  // different errors needs to be handled.
>  *
>  *  // See corresponding method documentation for more details
>  *  }
>  *
>  *  def onStop() {
>  *  // Cleanup stuff (stop threads, close sockets, etc.) to stop
> receiving data.
>  *  }
>  *  }
>  * }}}
>
>


-- 
jay vyas
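
For anyone hitting the same problem, a minimal sketch of a receiver that follows the pattern above (assuming the Receiver API in org.apache.spark.streaming.receiver; the stored record and the sleep interval are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

      def onStart() {
        // onStart() must return quickly, so receive on a separate thread.
        new Thread("mock-receiver") {
          override def run() { receive() }
        }.start()
      }

      def onStop() {
        // The receive loop checks isStopped(), so nothing extra to tear down here.
      }

      private def receive() {
        while (!isStopped()) {
          store("mock status")   // hand the record to Spark's memory
          Thread.sleep(100)      // placeholder for waiting on a real source
        }
      }
    }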


Class not found

2014-10-21 Thread Pat Ferrel
Not sure if this has been clearly explained here but since I took a day to 
track it down…

Several people have experienced a class not found error on Spark when the class 
referenced is supposed to be in the Spark jars.

One thing that can cause this is if you are building Spark for your cluster 
environment. The instructions say to do a “mvn package …” Instead some of these 
errors can be fixed using the following procedure:

1) delete ~/.m2/repository/org/spark and your-project
2) build Spark for your version of Hadoop *but do not use "mvn package ...”* 
use “mvn install …” This will put a copy of the exact bits you need into the 
maven cache for building your-project against. In my case using hadoop 1.2.1 it 
was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If you run tests on 
Spark some failures can safely be ignored so check before giving up. 
3) build your-project with “mvn clean install"



How to calculate percentiles with Spark?

2014-10-21 Thread sparkuser
Hi,

What would be the best way to get percentiles from a Spark RDD? I can see
JavaDoubleRDD or MLlib's MultivariateStatisticalSummary provide the
mean() but not percentiles.

Thank you!

Horace



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
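
One way to do it by hand, sketched in Scala (assuming an RDD[Double] named values and the sortBy/zipWithIndex/lookup methods of Spark 1.x RDDs; each lookup launches a small job, so cache the sorted RDD if several percentiles are needed):

    // Sort, attach a global 0-based rank to every element, and key by rank.
    val sorted = values.sortBy(identity).zipWithIndex().map { case (v, rank) => (rank, v) }
    sorted.cache()
    val n = sorted.count()

    // Nearest-rank percentile, p in (0, 100].
    def percentile(p: Double): Double = {
      val rank = math.max(math.ceil(p / 100.0 * n).toLong - 1, 0L)
      sorted.lookup(rank).head
    }

    val median = percentile(50)
    val p95    = percentile(95)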



Spark-Submit Python along with JAR

2014-10-21 Thread TJ Klein
Hi,

I'd like to run my python script using "spark-submit" together with a JAR
file containing Java specifications for a Hadoop file system. How can I do
that? It seems I can either provide a JAR file or a Python file to
spark-submit.

So far I have been running my code in ipython with IPYTHON_OPTS="notebook
--pylab inline" /usr/local/spark/bin/pyspark --jars
/usr/local/spark/HadoopFileFormat.jar

Best,
 Tassilo



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Submit-Python-along-with-JAR-tp16938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread tridib
Any help? or comments?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Primitive arrays in Spark

2014-10-21 Thread Akshat Aranya
This is as much of a Scala question as a Spark question

I have an RDD:

val rdd1: RDD[(Long, Array[Long])]

This RDD has duplicate keys that I can collapse like so:

val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b)

If I start with an Array of primitive longs in rdd1, will rdd2 also have
Arrays of primitive longs?  I suspect, based on my memory usage, that this
is not the case.

Also, would it be more efficient to do this:

val rdd1: RDD[(Long, ArrayBuffer[Long])]

and then

val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) =>
a++b).map(_.toArray)


MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Hi All,

I have a question regarding the ordering of indices. The documentation says
that the indices are one-based and in ascending order. However, do the
indices within a row need to be sorted in ascending order?

Sparse data
It is very common in practice to have sparse training data. MLlib supports
reading training examples stored in LIBSVM format, which is the default
format used by LIBSVM and LIBLINEAR. It is a text format in which each line
represents a labeled sparse feature vector using the following format:

label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. After loading, the
feature indices are converted to zero-based.

For example, I have indices ranging from 1 to 1000. Is this OK as a libsvm
data file?

1  110:1.0   80:0.5   310:0.0
0  890:0.5   20:0.0   200:0.5   400:1.0   82:0.0

and so on. OR do I need to sort them as:

1  80:0.5   110:1.0   310:0.0
0  20:0.0   82:0.0   200:0.5   400:1.0   890:0.5

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Ok thanks Michael.

In general, what's the easy way to figure out what's already implemented?

The exception I was getting was not really helpful here?

Also, is there a roadmap document somewhere ?

Thanks!

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread freedafeng
Thanks for the help!

Hadoop version: 2.3.0
Hbase version: 0.98.1

Use python to read/write data from/to hbase. 

Only change over the official spark 1.1.0 is the pom file under examples. 
Compilation: 
spark:mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean
package
spark/examples:mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2
-Dhadoop.version=2.3.0 -DskipTests clean package

I am wondering how I can deploy this version of spark to a new ec2 cluster.
I tried 
./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 -v 1.1.0
--hadoop-major-version=2.3.0 --worker-instances=2  -z us-east-1d launch
sparktest1

but this version got a type mismatch error when I read hbase data.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Class not found

2014-10-21 Thread Pat Ferrel
Doesn’t this seem like a dangerous error prone hack? It will build different 
bits on different machines. It doesn’t even work on my linux box because the 
mvn install doesn’t cache the same as on the mac.

If Spark is going to be supported on the maven repos shouldn’t it be addressed 
by different artifacts to support any option that changes the linkage 
info/class naming?

On Oct 21, 2014, at 12:16 PM, Pat Ferrel  wrote:

Not sure if this has been clearly explained here but since I took a day to 
track it down…

Several people have experienced a class not found error on Spark when the class 
referenced is supposed to be in the Spark jars.

One thing that can cause this is if you are building Spark for your cluster 
environment. The instructions say to do a “mvn package …” Instead some of these 
errors can be fixed using the following procedure:

1) delete ~/.m2/repository/org/spark and your-project
2) build Spark for your version of Hadoop *but do not use "mvn package ...”* 
use “mvn install …” This will put a copy of the exact bits you need into the 
maven cache for building your-project against. In my case using hadoop 1.2.1 it 
was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If you run tests on 
Spark some failures can safely be ignored so check before giving up. 
3) build your-project with “mvn clean install"




Re: How to calculate percentiles with Spark?

2014-10-21 Thread lordjoe
A rather more general question is - assume I have a JavaRDD<K> which is
sorted -
How can I convert this into a JavaPairRDD<Integer, K> where the Integer is
the index -> 0...N - 1?
Easy to do on one machine
 JavaRDD<K> values = ... // create here

   JavaPairRDD<Integer, K> positions = values.mapToPair(new PairFunction<K, Integer, K>() {
        private int index = 0;
        @Override
        public Tuple2<Integer, K> call(final K t) throws Exception {
            return new Tuple2<Integer, K>(index++, t);
        }
    });
but will this code do the right thing on a cluster?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937p16945.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Class not found

2014-10-21 Thread Pat Ferrel
maven cache is laid out differently but it does work on Linux and BSD/mac.

Still looks like a hack to me.

On Oct 21, 2014, at 1:28 PM, Pat Ferrel  wrote:

Doesn’t this seem like a dangerous error prone hack? It will build different 
bits on different machines. It doesn’t even work on my linux box because the 
mvn install doesn’t cache the same as on the mac.

If Spark is going to be supported on the maven repos shouldn’t it be addressed 
by different artifacts to support any option that changes the linkage 
info/class naming?

On Oct 21, 2014, at 12:16 PM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote:

Not sure if this has been clearly explained here but since I took a day to 
track it down…

Several people have experienced a class not found error on Spark when the class 
referenced is supposed to be in the Spark jars.

One thing that can cause this is if you are building Spark for your cluster 
environment. The instructions say to do a “mvn package …” Instead some of these 
errors can be fixed using the following procedure:

1) delete ~/.m2/repository/org/spark and your-project
2) build Spark for your version of Hadoop *but do not use "mvn package ...”* 
use “mvn install …” This will put a copy of the exact bits you need into the 
maven cache for building your-project against. In my case using hadoop 1.2.1 it 
was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If you run tests on 
Spark some failures can safely be ignored so check before giving up. 
3) build your-project with “mvn clean install"





com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple rdd filter command. What does it mean? 
Here is the full stack trace(and code below it): 

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 133
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

*Here is the code of the main function:*

/String comparisonFieldIndexes = "16,18";
String segmentFieldIndexes = "14,15";
String comparisonFieldWeights = "50, 50";
String delimiter = ""+'\001';

PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70,
comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes,
delimiter);

JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new
FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER) );

parOnCol.printRDD(filtered_rdd);/


*Here is the FilterEmptyFields class:*

/public class FilterEmptyFields implements Function<String, Boolean>
{

final int[] nonEmptyFields;
final String DELIMITER;

public FilterEmptyFields(int[] nonEmptyFields, String 
delimiter){
this.nonEmptyFields = nonEmptyFields;
this.DELIMITER = delimiter;
}

@Override
public Boolean call(String s){

String[] fields = s.split(DELIMITER);

for(int i=0; i

Any suggestions guys?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SchemaRDD.where clause error

2014-10-21 Thread Kevin Paul
Hi all, I tried to use the function SchemaRDD.where() but got some error:

  val people = sqlCtx.sql("select * from people")
  people.where('age === 10)

:27: error: value === is not a member of Symbol

where did I go wrong?

Thanks,
Kevin Paul


buffer overflow when running Kmeans

2014-10-21 Thread Yang
this is the stack trace I got with yarn logs -applicationId

really no idea where to dig further.

thanks!
yang

14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [
phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21]
14/10/21 14:36:47 ERROR Executor: Exception in task ID 98
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3,
required: 8
Serialization trace:
data$mcD$sp (breeze.linalg.DenseVector$mcD$sp)
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
at
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
at
com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Re: SchemaRDD.where clause error

2014-10-21 Thread Michael Armbrust
You need to "import sqlCtx._" to get access to the implicit conversion.

On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul 
wrote:

> Hi all, I tried to use the function SchemaRDD.where() but got some error:
>
>   val people = sqlCtx.sql("select * from people")
>   people.where('age === 10)
>
> :27: error: value === is not a member of Symbol
>
> where did I go wrong?
>
> Thanks,
> Kevin Paul
>
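
In other words, something like this (a sketch, assuming the "people" table is already registered as in the original post):

    val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
    import sqlCtx._   // brings in the implicit that turns 'age into a column expression

    val people = sqlCtx.sql("select * from people")
    people.where('age === 10).collect()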


Spark - HiveContext - Unstructured Json

2014-10-21 Thread Harivardan Jayaraman
Hi,
I have unstructured JSON as my input which may have extra columns row to
row. I want to store these json rows using HiveContext so that it can be
accessed from the JDBC Thrift Server.
I notice there are primarily only two methods available on the SchemaRDD
for data - saveAsTable and insertInto. One defines the schema while the
other can be used to insert in to the table, but there is no way to Alter
the table and add columns to it.
How do I do this?

One option that I thought of is to write native "CREATE TABLE..." and
"ALTER TABLE.." statements but just does not seem feasible because at every
step, I will need to query Hive to determine what is the current schema and
make a decision whether I should add columns to it or not.

Any thoughts? Has anyone been able to do this?


Re: buffer overflow when running Kmeans

2014-10-21 Thread Ted Yu
Just posted below for a similar question.

Have you seen this thread ?

http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflow&subj=RE+spark+nbsp+kryo+serilizable+nbsp+exception

On Tue, Oct 21, 2014 at 2:44 PM, Yang  wrote:

> this is the stack trace I got with yarn logs -applicationId
>
> really no idea where to dig further.
>
> thanks!
> yang
>
> 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [
> phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21]
> 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3,
> required: 8
> Serialization trace:
> data$mcD$sp (breeze.linalg.DenseVector$mcD$sp)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
> at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
> at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
> at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212)
> at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
> at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
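
A common mitigation for this class of error is to raise Kryo's output buffer limits; a sketch (the exact property names and defaults vary across releases, so check the configuration page for your Spark version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "8")        // initial per-task buffer, assumed too small here
      .set("spark.kryoserializer.buffer.max.mb", "256")  // ceiling the buffer may grow to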


How to read BZ2 XML file in Spark?

2014-10-21 Thread John Roberts
Hi,

I want to ingest Open Street Map. It's 43GB (compressed) XML in BZIP2
format. What's your advice for reading it in to an RDD?

BTW, the Spark Training at UMD is awesome! I'm having a blast learning
Spark. I wish I could go to the MeetUp tonight, but I have kid activities...

http://wiki.openstreetmap.org/wiki/Planet.osm

http://wiki.openstreetmap.org/wiki/OSM_XML

John
--
John S. Roberts
SigInt Technologies LLC, a Novetta Solutions Company
8830 Stanford Blvd, Suite 306; Columbia, MD 21045 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread Yin Huai
Is there any specific issues you are facing?

Thanks,

Yin

On Tue, Oct 21, 2014 at 4:00 PM, tridib  wrote:

> Any help? or comments?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MLLib libsvm format

2014-10-21 Thread Xiangrui Meng
Yes. "where the indices are one-based and **in ascending order**". -Xiangrui

On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak  wrote:
> Hi All,
>
> I have a question regarding the ordering of indices. The document says that
> the indices indices are one-based and in ascending order. However, do the
> indices within a row need to be sorted in ascending order?
>
>
>
>
> Sparse data
>
> It is very common in practice to have sparse training data. MLlib supports
> reading training examples stored in LIBSVM format, which is the default
> format used by LIBSVM and LIBLINEAR. It is a text format in which each line
> represents a labeled sparse feature vector using the following format:
>
> label index1:value1 index2:value2 ...
>
> where the indices are one-based and in ascending order. After loading, the
> feature indices are converted to zero-based.
>
>
>
> For example, I have have indices ranging rom 1 to 1000 is this as a libsvm
> data file OK?
>
>
> 1  110:1.0   80:0.5   310:0.0
>
> 0 890:0.5  20:0.0   200:0.5   400:1.0  82:0.0
>
> and so on:
>
>
> OR do I need to sort them as:
>
>
> 1  80:0.5   110:1.0   310:0.0
>
> 0  20:0.082:0.0200:0.5   400:1.0  890:0.5

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
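
Once the rows are sorted, the file can be loaded directly; a sketch (the path is a placeholder):

    import org.apache.spark.mllib.util.MLUtils

    // Parses "label index1:value1 index2:value2 ..." lines (one-based, ascending
    // indices) into an RDD[LabeledPoint] with zero-based sparse vectors.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")
    data.take(1).foreach(println)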



spark ui redirecting to port 8100

2014-10-21 Thread sadhan
Set up the spark port to a different one and the connection seems successful,
but I get a 302 to /proxy on port 8100? Nothing is listening on that port
either.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: create a Row Matrix

2014-10-21 Thread Xiangrui Meng
Please check out the example code:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala
-Xiangrui

On Tue, Oct 21, 2014 at 5:34 AM, viola  wrote:
> Hi,
>
> I am VERY new to spark and mllib and ran into a couple of problems while
> trying to reproduce some examples. I am aware that this is a very simple
> question but could somebody please give me an example
> - how to create a RowMatrix in scala with the following entries:
> [1 2
> 3 4]?
> I would like to apply an SVD on it.
>
> Thank you very much!
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
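
For the concrete 2x2 matrix [1 2; 3 4] in the question, a minimal sketch: each row becomes a local Vector, the RDD of rows backs a RowMatrix, and computeSVD returns U, s and V.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),    // row [1 2]
      Vectors.dense(3.0, 4.0)))   // row [3 4]
    val mat = new RowMatrix(rows)

    val svd = mat.computeSVD(2, computeU = true)
    println(svd.s)   // singular values
    println(svd.V)   // right singular vectors as a local Matrix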



RE: MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Great, I will sort them.


Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone

 Original message From: Xiangrui Meng 
 Date:10/21/2014  3:29 PM  (GMT-08:00) 
To: Sameer Tilak  Cc: 
user@spark.apache.org Subject: Re: MLLib libsvm format 

Yes. "where the indices are one-based and **in ascending order**". -Xiangrui

On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak  wrote:
> Hi All,
>
> I have a question regarding the ordering of indices. The document says that
> the indices indices are one-based and in ascending order. However, do the
> indices within a row need to be sorted in ascending order?
>
>
>
>
> Sparse data
>
> It is very common in practice to have sparse training data. MLlib supports
> reading training examples stored in LIBSVM format, which is the default
> format used by LIBSVM and LIBLINEAR. It is a text format in which each line
> represents a labeled sparse feature vector using the following format:
>
> label index1:value1 index2:value2 ...
>
> where the indices are one-based and in ascending order. After loading, the
> feature indices are converted to zero-based.
>
>
>
> For example, I have have indices ranging rom 1 to 1000 is this as a libsvm
> data file OK?
>
>
> 1  110:1.0   80:0.5   310:0.0
>
> 0 890:0.5  20:0.0   200:0.5   400:1.0  82:0.0
>
> and so on:
>
>
> OR do I need to sort them as:
>
>
> 1  80:0.5   110:1.0   310:0.0
>
> 0  20:0.082:0.0200:0.5   400:1.0  890:0.5


Re: How to read BZ2 XML file in Spark?

2014-10-21 Thread sameerf
Hi John,

Glad you're enjoying the Spark training at UMD.

Is the 43 GB XML data in a single file or split across multiple BZIP2 files?
Is the file in a HDFS cluster or on a single linux machine?

If you're using BZIP2 with splittable compression (in HDFS), you'll need at
least Hadoop 1.1:
https://issues.apache.org/jira/browse/HADOOP-7823

Or if you've got the file on a single linux machine, perhaps consider
uncompressing it manually using cmd line tools before loading it into Spark.

You'll want to start with maybe 1 GB for each partition, so if the
uncompressed file is 100GB, maybe start with 100 partitions. Even if the
entire dataset is in one file (which might get read into just 1 or 2
partitions initially with Spark), you can use the repartition(numPartitions)
transformation to make 100 partitions.

Then you'll have to make sense of the XML schema. You have a few options to
do this.

You can take advantage of Scala’s XML functionality provided by the
scala.xml package to parse the data. Here is a blog post with some code
example for this:
http://stevenskelton.ca/real-time-data-mining-spark/

Or try sc.wholeTextFiles(). It reads the entire file into a string record.
You'll want to make sure that you have enough memory to read the single
string into memory.

This Cloudera blog post about half-way down has some Regex examples of how
to use Scala to parse an XML file into a collection of tuples:
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

You can also search for XMLInputFormat on Google. There are some 
implementations that allow you to specify the XML tag to split on, e.g.:
https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java

Good luck!

Sameer F.
Client Services @ Databricks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-BZ2-XML-file-in-Spark-tp16954p16960.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
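
A sketch of the two loading routes described above (the paths are placeholders; whether the .bz2 file is split on read depends on the Hadoop version, as noted):

    // Route 1: let Hadoop's codec decompress the file, then spread the data out.
    val lines = sc.textFile("hdfs:///data/planet-osm.xml.bz2").repartition(100)

    // Route 2: a directory of smaller files, each kept as one (path, contents) record.
    val whole = sc.wholeTextFiles("hdfs:///data/osm-chunks/")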



Re: spark ui redirecting to port 8100

2014-10-21 Thread Sameer Farooqui
Hi Sadhan,

Which port are you specifically trying to redirect? The driver program has
a web UI, typically on port 4040... or the Spark Standalone Cluster Master
has a web UI exposed on port 8080 (the master itself listens on 7077).

Which setting did you update in which file to make this change?

And finally, which version of Spark are you on?

Sameer F.
Client Services @ Databricks

On Tue, Oct 21, 2014 at 3:29 PM, sadhan  wrote:

> Set up the spark port to a different one and the connection seems
> successful
> but get a 302 to /proxy on port 8100 ? Nothing is listening on that port as
> well.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Shailesh Birari
Hello,

Spark 1.1.0, Hadoop 2.4.1

I have written a Spark streaming application. And I am getting
FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath).
Here is briefly what I am trying to do.
My application is creating a text file stream using the Java streaming context. The
input file is on HDFS.

JavaDStream<String> textStream = ssc.textFileStream(InputFile);

Then it is comparing each line of the input stream with some data and filtering
it. The filtered data I am storing in a JavaDStream.

 JavaDStream<String> suspectedStream = textStream.flatMap(new
FlatMapFunction<String, String>(){
                @Override
                public Iterable<String> call(String line) throws
Exception {

                    List<String> filteredList = new ArrayList<String>();

                    // doing filter job

                    return filteredList;
                }
        });

And this filteredList I am storing in HDFS as:

 suspectedStream.foreach(new
Function<JavaRDD<String>, Void>(){
                @Override
                public Void call(JavaRDD<String> rdd) throws
Exception {
                    rdd.saveAsTextFile(outputFolderPath);
                    return null;
                }});


But with this I am receiving 
org.apache.hadoop.mapred.FileAlreadyExistsException.

I tried with appending random number with outputFolderPath and its working. 
But my requirement is to collect all output in one directory. 

Can you please suggest if there is any way to get rid of this exception ?

Thanks,
  Shailesh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread sameerf
Hi,

Can you post what the error looks like?


Sameer F.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943p16963.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread Joseph Bradley
Hi, this sounds like a bug which has been fixed in the current master.
What version of Spark are you using?  Would it be possible to update to the
current master?
If not, it would be helpful to know some more of the problem dimensions
(num examples, num features, feature types, label type).
Thanks,
Joseph


On Tue, Oct 21, 2014 at 2:42 AM, lokeshkumar  wrote:

> Hi All,
>
> I am trying to run the spark example JavaDecisionTree code using some
> external data set.
> It works for certain datasets only with specific maxBins and maxDepth
> settings. Even for a working dataset, if I add a new data item I get an
> ArrayIndexOutOfBoundsException; I get the same exception for the first
> case
> as well (changing maxBins and maxDepth). I am not sure what is wrong here,
> can anyone please explain this.
>
> Exception stacktrace:
>
> 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage
> 7.0 (TID 13)
> java.lang.ArrayIndexOutOfBoundsException: 6301
> at
>
> org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
> at
>
> org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
> at
>
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
> at
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
> at
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
> at
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> at
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
>
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
> at
>
> org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
> at
>
> org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
> at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
> at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
> at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
> at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0
> (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301
>
>
> org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>
>
> org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>
>
> org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFu

Re: Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Sameer Farooqui
Hi Shailesh,

Spark just leverages the Hadoop File Output Format to write out the RDD you
are saving.

This is really a Hadoop OutputFormat limitation which requires the
directory it is writing into to not exist. The idea is that a Hadoop job
should not be able to overwrite the results from a previous job, so it
enforces that the dir should not exist.

Easiest way to get around this may be to just write the results from each
Spark app to a newly named directory, then on an interval run a simple
script to merge data from multiple HDFS directories into one directory.

This HDFS command will let you do something like a directory merge:
hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal -
/newfolderpath/file

See this StackOverflow discussion for a way to do it using Pig and Bash
scripting also:
https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder


Sameer F.
Client Services @ Databricks

On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari 
wrote:

> Hello,
>
> Spark 1.1.0, Hadoop 2.4.1
>
> I have written a Spark streaming application. And I am getting
> FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath).
> Here is brief what I am is trying to do.
> My application is creating text file stream using Java Stream context. The
> input file is on HDFS.
>
> JavaDStream textStream = ssc.textFileStream(InputFile);
>
> Then it is comparing each line of input stream with some data and filtering
> it. The filtered data I am storing in JavaDStream.
>
>  JavaDStream suspectedStream=
> textStream.flatMap(new
> FlatMapFunction(){
> @Override
> public Iterable call(String line) throws
> Exception {
>
> List filteredList = new
> ArrayList();
>
> // doing filter job
>
> return filteredList;
> }
>
> And this filteredList I am storing in HDFS as:
>
>  suspectedStream.foreach(new
> Function,Void>(){
> @Override
> public Void call(JavaRDD rdd) throws
> Exception {
> rdd.saveAsTextFile(outputFolderPath);
> return null;
> }});
>
>
> But with this I am receiving
> org.apache.hadoop.mapred.FileAlreadyExistsException.
>
> I tried with appending random number with outputFolderPath and its working.
> But my requirement is to collect all output in one directory.
>
> Can you please suggest if there is any way to get rid of this exception ?
>
> Thanks,
>   Shailesh
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
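
In the Scala API, foreachRDD also hands you the batch Time, which makes a convenient unique suffix for the per-batch directories suggested above (a sketch; the Java API has an equivalent overload taking the RDD and the batch Time):

    suspectedStream.foreachRDD { (rdd, time) =>
      // One fresh directory per batch keeps Hadoop's "output dir must not exist" check happy.
      rdd.saveAsTextFile(outputFolderPath + "/batch-" + time.milliseconds)
    }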


Using the DataStax Cassandra Connector from PySpark

2014-10-21 Thread Mike Sukmanowsky
Hi there,

I'm using Spark 1.1.0 and experimenting with trying to use the DataStax
Cassandra Connector (https://github.com/datastax/spark-cassandra-connector)
from within PySpark.

As a baby step, I'm simply trying to validate that I have access to classes
that I'd need via Py4J. Sample python program:


from py4j.java_gateway import java_import

from pyspark.conf import SparkConf
from pyspark import SparkContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(appName="Spark + Cassandra Example", conf=conf)
java_import(sc._gateway.jvm, "com.datastax.spark.connector.*")
print sc._jvm.CassandraRow()




CassandraRow corresponds to
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala
which is included in the JAR I submit. Feel free to download the JAR here
https://dl.dropboxusercontent.com/u/4385786/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar

I'm currently running this Python example with:



spark-submit
--driver-class-path="/path/to/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar"
--verbose src/python/cassandara_example.py



But continually get the following error indicating that the classes aren't
in fact on the classpath of the GatewayServer:



Traceback (most recent call last):
  File
"/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py",
line 37, in 
main()
  File
"/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py",
line 25, in main
print sc._jvm.CassandraRow()
  File
"/Users/mikesukmanowsky/.opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.



The correct response from the GatewayServer should be:


In [22]: gateway.jvm.CassandraRow()
Out[22]: JavaObject id=o0



Also tried using --jars option instead and that doesn't seem to work
either. Is there something I'm missing as to why the classes aren't
available?


-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*p*: +1 (416) 953-4248
*e*: mike.sukmanow...@gmail.com

facebook  | twitter
 | LinkedIn
 | github



Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Vipul Pandey
any word on this one? I would like to get this done as well. 
Although, my real use case is to do something on each executor right up in the 
beginning - and I was trying to hack it using broadcasts by broadcasting an 
object of my own and do whatever I want in the readObject method.

Any other way out?


On Oct 4, 2014, at 7:36 PM, Peng Cheng  wrote:

> While Spark already offers support for asynchronous reduce (collect data from
> workers, while not interrupting execution of a parallel transformation)
> through accumulator, I have made little progress on making this process
> reciprocal, namely, to broadcast data from driver to workers to be used by
> all executors in the middle of a transformation. This primarily intended to
> be used in downpour SGD/adagrad, a non-blocking concurrent machine learning
> optimizer that performs better than existing synchronous GD in MLlib, and
> have vast application in training of many models.
> 
> My attempt so far is to stick to out-of-the-box, immutable broadcast, open a
> new thread on driver, in which I broadcast a thin data wrapper that when
> deserialized, will insert into a mutable singleton that is already
> replicated to all workers in the fat jar, this customized deserialization is
> not hard, just overwrite readObject like this:
> 
> class AutoInsert(var value: Int) extends Serializable{
> 
>  WorkerReplica.last = value
> 
>  private def readObject(in: ObjectInputStream): Unit = {
>in.defaultReadObject()
>WorkerContainer.last = this.value
>  }
> }
> 
> Unfortunately it looks like the deserializtion is called lazily and won't do
> anything before a worker use it (Broadcast[AutoInsert]), this is impossible
> without waiting for workers' stage to be finished and broadcast again. I'm
> wondering if I can 'hack' this thing into working? Or I'll have to write a
> serious extension to broadcast component to enable changing the value.
> 
> Hope I can find like-minded on this forum because ML is a selling point of
> Spark.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another 
one:

scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)

scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)

scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)

scala> res0.getClass
res1: Class[_ <: Array[Long]] = class [J

The problem might be that lots of intermediate space is allocated as you merge 
values two by two. In particular, if a key has N arrays mapping to it, your 
code will allocate O(N^2) space because it builds first an array of size 1, 
then 2, then 3, etc. You can make this faster by using aggregateByKey instead, 
and using an intermediate data structure other than an Array to do the merging 
(ideally you'd find a growable ArrayBuffer-like class specialized for Longs, 
but you can also just try ArrayBuffer).

Matei



> On Oct 21, 2014, at 1:08 PM, Akshat Aranya  wrote:
> 
> This is as much of a Scala question as a Spark question
> 
> I have an RDD:
> 
> val rdd1: RDD[(Long, Array[Long])]
> 
> This RDD has duplicate keys that I can collapse such
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b)
> 
> If I start with an Array of primitive longs in rdd1, will rdd2 also have 
> Arrays of primitive longs?  I suspect, based on my memory usage, that this is 
> not the case.
> 
> Also, would it be more efficient to do this:
> 
> val rdd1: RDD[(Long, ArrayBuffer[Long])]
> 
> and then
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => 
> a++b).map(_.toArray)
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
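
A sketch of the aggregateByKey variant described above: one growable buffer per key, converted back to a primitive array once at the end.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.rdd.RDD

    val rdd2: RDD[(Long, Array[Long])] =
      rdd1.aggregateByKey(new ArrayBuffer[Long]())(
        (buf, arr) => { buf ++= arr; buf },   // fold each Array[Long] into the per-key buffer
        (b1, b2)   => { b1 ++= b2; b1 }       // merge buffers from different partitions
      ).mapValues(_.toArray)                  // one Array[Long] (class [J) per key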



Re: Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Shailesh Birari
Thanks Sameer for quick reply.

I will try to implement it.

  Shailesh




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962p16970.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql not able to find classes with --jars option

2014-10-21 Thread sadhan
It was mainly because spark was setting the jar classes in a thread local
context classloader. The quick fix was to make our serde use the context
classloader first.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-not-able-to-find-classes-with-jars-option-tp16839p16972.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Strategies for reading large numbers of files

2014-10-21 Thread Landon Kuhn
Thanks to folks here for the suggestions. I ended up settling on what seems
to be a simple and scalable approach. I am no longer using
sparkContext.textFiles with wildcards (it is too slow when working with a
large number of files). Instead, I have implemented directory traversal as
a Spark job, which enables it to parallelize across the cluster.

First, a couple of functions. One to traverse directories, and another to
get the lines in a file:

  def list_file_names(path: String): Seq[String] = {
val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration)
def f(path: Path): Seq[String] = {
  Option(fs.listStatus(path)).getOrElse(Array[FileStatus]()).
  flatMap {
case fileStatus if fileStatus.isDir ⇒ f(fileStatus.getPath)
case fileStatus ⇒ Seq(fileStatus.getPath.toString)
  }
}
f(new Path(path))
  }

  def read_log_file(path: String): Seq[String] = {
val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration)
val file = fs.open(new Path(path))
val source = Source.fromInputStream(file)
source.getLines.toList
  }

Next, I generate a list of "root" paths to scan:

  val paths =
for {
  record_type ← record_types
  year ← years
  month ← months
  day ← days
  hour ← hours
} yield s"s3n://s3-bucket-name/$record_type/$year/$month/$day/$hour/"

(In this case, I generate one path per hour per record type.)

Finally, using Spark, I can build an RDD with the contents of every file in
the path list:

val rdd: RDD[String] =
sparkContext.
parallelize(paths, paths.size).
flatMap(list_file_names).
flatMap(read_log_file)

I am posting this info here with the hope that it will be useful to
somebody in the future.

L


On Tue, Oct 7, 2014 at 12:58 AM, deenar.toraskar 
wrote:

> Hi Landon
>
> I had a problem very similar to your, where we have to process around 5
> million relatively small files on NFS. After trying various options, we did
> something similar to what Matei suggested.
>
> 1) take the original path and find the subdirectories under that path and
> then parallelize the resulting list. you can configure the depth you want
> to
> go down to before sending the paths across the cluster.
>
>   def getFileList(srcDir:File, depth:Int) : List[File] = {
> var list : ListBuffer[File] = new ListBuffer[File]()
> if (srcDir.isDirectory()) {
> srcDir.listFiles() .foreach((file: File) =>
>if (file.isFile()) {
>   list +=(file)
>} else {
>   if (depth > 0 ) {
>  list ++= getFileList(file, (depth- 1 ))
>   }
>else if (depth < 0) {
> list ++= getFileList(file, (depth))
>   }
>else {
>   list += file
>}
> })
> }
> else {
>list += srcDir
> }
> list .toList
>   }
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
*Landon Kuhn*, *Software Architect*, Janrain, Inc. 
E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025
Follow Janrain: Facebook  | Twitter
 | YouTube  | LinkedIn
 | Blog 
Follow Me: LinkedIn 
-
*Acquire, understand, and engage your users. Watch our video
 or sign up for a live demo
 to see what it's all about.*


Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread tridib
Yes, I am unable to use jsonFile() so that it can detect date type
automatically from json data.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming Applications

2014-10-21 Thread Saiph Kappa
Hi,

I have been trying to find a fairly complex application that makes use of
the Spark Streaming framework. I checked public github repos but the
examples I found were too simple, only comprising simple operations like
counters and sums. On the Spark summit website, I could find very
interesting projects, however no source code was available.

Where can I find non-trivial spark streaming application code? Is it that
difficult?

Thanks.


spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2

2014-10-21 Thread Tian Zhang
Hi, I am using the latest calliope library from tuplejump.com to create RDD
for cassandra table.
I am on a 3 nodes spark 1.1.0 with yarn.

My cassandra table is defined as below and I have about 2000 rows of data
inserted.
CREATE TABLE top_shows (
  program_id varchar,
  view_minute timestamp,
  view_count counter,
  PRIMARY KEY (view_minute, program_id)   //note that view_minute is the
partition key
);

Here are the simple steps I ran from spark-shell on master node

spark-shell --master yarn-client --jars
rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar --driver-memory 512m 
--executor-memory 512m --num-executors 3 --executor-cores 1

// Import the necessary 
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import com.tuplejump.calliope.utils.RichByteBuffer._
import com.tuplejump.calliope.Implicits._
import com.tuplejump.calliope.CasBuilder
import com.tuplejump.calliope.Types.{CQLRowKeyMap, CQLRowMap}

// Define my class and the implicit cast 
case class ProgramViewCount(viewMinute:Long, program:String, viewCount:Long)
implicit def keyValtoProgramViewCount(key:CQLRowKeyMap,
values:CQLRowMap):ProgramViewCount =
   ProgramViewCount(key.get("view_minute").get.getLong,
key.get("program_id").toString, values.get("view_count").get.getLong)

// Use the cql3 interface to read from table with WHERE predicate.
val cas = CasBuilder.cql3.withColumnFamily("streaming_qa",
"top_shows").onHost("23.22.120.96")
.where("view_minute = 141386178")
val allPrograms = sc.cql3Cassandra[ProgramViewCount](cas)

// Lazy  evaluation till this point
val rowCount = allPrograms.count

I hit the following exception. It seems that it does not like my where
clause. If I do not have the 
WHERE CLAUSE, it works fine. But with the WHERE CLAUSE, no matter the
predicate is on 
partition key or not, it will fail with the following exception.

Anyone else using calliope package can share some lights? Thanks a lot.

Tian

scala> val rowCount = allPrograms.count

14/10/21 23:26:07 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0
(TID 2, ip-10-187-51-136.ec2.internal): java.lang.RuntimeException: 
   
com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665)
   
com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
   
com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
   
com.tuplejump.calliope.cql3.Cql3CassandraRDD$$anon$1.<init>(Cql3CassandraRDD.scala:75)
   
com.tuplejump.calliope.cql3.Cql3CassandraRDD.compute(Cql3CassandraRDD.scala:64)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-RDD-and-Calliope-1-1-0-CTP-U2-H2-tp16975.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-ec2 script with VPC

2014-10-21 Thread Mike Jennings
You can give this patch a try. Let me know if you find any problems.

https://github.com/apache/spark/pull/2872

Thanks,

Mike



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-script-with-VPC-tp11482p16978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Cassandra connector issue

2014-10-21 Thread Ankur Srivastava
Is this because I am calling a transformation function on an rdd from
inside another transformation function?

Is it not allowed?

Thanks
Ankut
On Oct 21, 2014 1:59 PM, "Ankur Srivastava" 
wrote:

> Hi Gerard,
>
> this is the code that may be helpful.
>
> public class ReferenceDataJoin implements Serializable {
>
>
>  private static final long serialVersionUID = 1039084794799135747L;
>
> JavaPairRDD rdd;
>
> CassandraJavaRDD referenceTable;
>
>
>  public PostalReferenceDataJoin(List employees) {
>
>  JavaSparkContext sc =
> SparkContextFactory.getSparkContextFactory().getSparkContext();
>
>  this.rdd = sc.parallelizePairs(employees);
>
>  this. referenceTable = javaFunctions(sc).cassandraTable("reference_data",
>
>  “dept_reference_data", ReferenceData.class);
>
> }
>
>
>  public JavaPairRDD execute() {
>
> JavaPairRDD joinedRdd = rdd
>
> .mapValues(new Function() {
>
> private static final long serialVersionUID = -226016490083377260L;
>
>
> @Override
>
> public Employee call(Employee employee)
>
> throws Exception {
>
> ReferenceData data = null;
>
> if (employee.getDepartment() != null) {
>
> data = referenceTable.where(“dept=?",
>
> employee.getDepartment()).first();;
>
> System.out.println(employee.getDepartment() + ">" + data);
>
> }
>
> if (data != null) {
>
> //setters on employee
>
> }
>
> return employee;
>
> }
>
>   });
>
>  return joinedRdd;
>
> }
>
>
> }
>
>
> Thanks
> Ankur
>
> On Tue, Oct 21, 2014 at 11:11 AM, Gerard Maas 
> wrote:
>
>> Looks like that code does not correspond to the problem you're facing. I
>> doubt it would even compile.
>> Could you post the actual code?
>>
>> -kr, Gerard
>> On Oct 21, 2014 7:27 PM, "Ankur Srivastava" 
>> wrote:
>>
>>> Hi,
>>>
>>> I am creating a cassandra java rdd and transforming it using the where
>>> clause.
>>>
>>> It works fine when I run it outside the mapValues, but when I put the
>>> code in mapValues I get an error while creating the transformation.
>>>
>>> Below is my sample code:
>>>
>>>   CassandraJavaRDD cassandraRefTable = javaFunctions(sc
>>> ).cassandraTable("reference_data",
>>>
>>>  "dept_reference_data", ReferenceData.class);
>>>
>>> JavaPairRDD joinedRdd = rdd.mapValues(new
>>> Function() {
>>>
>>>  public Employee call(Employee employee) throws Exception {
>>>
>>>  ReferenceData data = null;
>>>
>>>  if(employee.getDepartment() != null) {
>>>
>>>data = referenceTable.where("postal_plus=?", location
>>> .getPostalPlus()).first();
>>>
>>>System.out.println(data.toCSV());
>>>
>>>  }
>>>
>>> if(data != null) {
>>>
>>>   //call setters on employee
>>>
>>> }
>>>
>>> return employee;
>>>
>>>  }
>>>
>>> }
>>>
>>> I get this error:
>>>
>>> java.lang.NullPointerException
>>>
>>> at org.apache.spark.rdd.RDD.(RDD.scala:125)
>>>
>>> at com.datastax.spark.connector.rdd.CassandraRDD.(
>>> CassandraRDD.scala:47)
>>>
>>> at com.datastax.spark.connector.rdd.CassandraRDD.copy(
>>> CassandraRDD.scala:70)
>>>
>>> at com.datastax.spark.connector.rdd.CassandraRDD.where(
>>> CassandraRDD.scala:77)
>>>
>>>  at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(
>>> CassandraJavaRDD.java:54)
>>>
>>>
>>> Thanks for help!!
>>>
>>>
>>>
>>> Regards
>>>
>>> Ankur
>>>
>>
>
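
For anyone hitting the same trace: RDD operations such as where() cannot be invoked
from inside another transformation's closure, because the closure runs on the
executors, where the driver-side RDD handle is not usable. When the reference table
is small, a common workaround is to collect it once on the driver and broadcast it.
A minimal Scala sketch, assuming the connector's Scala API, a case class
ReferenceData with a dept field, an Employee bean with getDepartment, and an
existing pair RDD of employees (all hypothetical names):

import com.datastax.spark.connector._
import org.apache.spark.SparkContext._

// Pull the small reference table to the driver and index it by department.
val refByDept: Map[String, ReferenceData] =
  sc.cassandraTable[ReferenceData]("reference_data", "dept_reference_data")
    .collect()
    .map(r => r.dept -> r)
    .toMap

// Ship it once to every executor instead of querying Cassandra per record.
val refBroadcast = sc.broadcast(refByDept)

val joined = employeePairRdd.mapValues { employee =>
  refBroadcast.value.get(employee.getDepartment).foreach { ref =>
    // copy the needed reference fields onto the employee here
  }
  employee
}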


Re: Spark SQL : sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread Yin Huai
Add one more thing about question 1. Once you get the SchemaRDD from
jsonFile/jsonRDD, you can use CAST(columnName as DATE) in your query to
cast the column type from StringType to DateType (the string format
should be "yyyy-[m]m-[d]d" and you need to use hiveContext). Here is the
code snippet that may help.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val schemaRDD = hiveContext.jsonFile(...)
schemaRDD.registerTempTable("jsonTable")
hiveContext.sql("SELECT CAST(columnName as DATE) FROM jsonTable")

Thanks,

Yin

On Tue, Oct 21, 2014 at 8:00 PM, Yin Huai  wrote:

> Hello Tridib,
>
> I just saw this one.
>
> 1. Right now, jsonFile and jsonRDD do not detect date type. Right now,
> IntegerType, LongType, DoubleType, DecimalType, StringType, BooleanType,
> StructType and ArrayType will be automatically detected.
> 2. The process of inferring schema will pass the entire dataset once to
> determine the schema. So, you will see a join is launched. Applying a
> specific schema to a dataset does not have this cost.
> 3. It is hard to comment on it without seeing your implementation. For our
> built-in JSON support, jsonFile and jsonRDD provides a very convenient way
> to work with JSON datasets with SQL. You do not need to define the schema
> in advance and Spark SQL will automatically create the SchemaRDD for your
> dataset. You can start to query it with SQL by simply registering the
> returned SchemaRDD as a temp table. Regarding the implementation, we use a
> high performance JSON lib (Jackson, https://github.com/FasterXML/jackson)
> to parse JSON records.
>
> Thanks,
>
> Yin
>
> On Mon, Oct 20, 2014 at 10:56 PM, tridib  wrote:
>
>> Hi Spark SQL team,
>> I trying to explore automatic schema detection for json document. I have
>> few
>> questions:
>> 1. What should be the date format to detect the fields as date type?
>> 2. Is automatic schema infer slower than applying specific schema?
>> 3. At this moment I am parsing json myself using map Function and creating
>> schema RDD from the parsed JavaRDD. Is there any performance impact not
>> using inbuilt jsonFile()?
>>
>> Thanks
>> Tridib
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark - HiveContext - Unstructured Json

2014-10-21 Thread Cheng Lian
You can resort to SQLContext.jsonFile(path: String, samplingRate: Double)
and set samplingRate to 1.0, so that all the columns can be inferred.

You can also use SQLContext.applySchema to specify your own schema
(which is a StructType).
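
A minimal sketch of the first suggestion (the path and table name are illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// samplingRate = 1.0 scans every record, so columns that appear in only a few
// rows are still picked up during schema inference.
val events = sqlContext.jsonFile("hdfs:///data/events.json", 1.0)
events.registerTempTable("events")
sqlContext.sql("SELECT * FROM events").collect().foreach(println)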


On 10/22/14 5:56 AM, Harivardan Jayaraman wrote:


Hi,
I have unstructured JSON as my input which may have extra columns row 
to row. I want to store these json rows using HiveContext so that it 
can be accessed from the JDBC Thrift Server.
I notice there are primarily only two methods available on the 
SchemaRDD for data - saveAsTable and insertInto. One defines the 
schema while the other can be used to insert into the table, but 
there is no way to alter the table and add columns to it.

How do I do this?

One option that I thought of is to write native "CREATE TABLE..." and 
"ALTER TABLE.." statements but just does not seem feasible because at 
every step, I will need to query Hive to determine what is the current 
schema and make a decision whether I should add columns to it or not.


Any thoughts? Has anyone been able to do this?


Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Peng Cheng
Looks like the only way is to implement that feature. There is no way of
hacking it into working.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread Koert Kuipers
You ran out of Kryo buffer. Are you using Spark 1.1 (which supports buffer
resizing) or Spark 1.0 (which has a fixed-size buffer)?
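
If it is 1.1, the usual fix is to raise the Kryo buffer settings. A minimal sketch
(property names as documented for Spark 1.1; the sizes below are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")        // initial per-task buffer
  .set("spark.kryoserializer.buffer.max.mb", "128")  // ceiling the buffer may grow to (1.1+)
val sc = new SparkContext(conf)
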
On Oct 21, 2014 5:30 PM, "nitinkak001"  wrote:

> I am running a simple rdd filter command. What does it mean?
> Here is the full stack trace(and code below it):
>
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
> required: 133
> at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
> at
> com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420)
> at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326)
> at
>
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274)
> at
>
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
>
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> *Here is the code of the main function:*
>
> /String comparisonFieldIndexes = "16,18";
> String segmentFieldIndexes = "14,15";
> String comparisonFieldWeights = "50, 50";
> String delimiter = ""+'\001';
>
> PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70,
> comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes,
> delimiter);
>
> JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new
> FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER) );
>
> parOnCol.printRDD(filtered_rdd);/
>
>
> *Here is the FilterEmptyFields class:*
>
> public class FilterEmptyFields implements Function<String, Boolean>
> {
>
> final int[] nonEmptyFields;
> final String DELIMITER;
>
> public FilterEmptyFields(int[] nonEmptyFields, String
> delimiter){
> this.nonEmptyFields = nonEmptyFields;
> this.DELIMITER = delimiter;
> }
>
> @Override
> public Boolean call(String s){
>
> String[] fields = s.split(DELIMITER);
>
> for(int i=0; i < nonEmptyFields.length; i++){
> if(fields[nonEmptyFields[i]] == null  ||
> fields[nonEmptyFields[i]].isEmpty()){
> return false;
> }
> }
>
> return true;
> }
>
> }
>
> Any suggestions guys?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread lokeshkumar
Hi Joseph

I am using Spark 1.1.0, the latest version; I will try to update to the
current master and check.

The example I am running is JavaDecisionTree, and the dataset is in libsvm
format containing

1. 45 training instances.
2. 5 features.
3. I am not sure what the feature types are, but no categorical features are
being passed in the example.
4. Three labels; I am not sure what the label type is.

The example runs fine with maxBins set to 100, but when I change this to,
say, 50 or 30 I get the exception.
Also, could you please let me know what the default value for maxBins should
be (the API says 100 is the default, but it did not work in this case)?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p16988.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Num-executors and executor-cores overwritten by defaults

2014-10-21 Thread Ilya Ganelin
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1), but now I can
no longer set the number of executors or executor cores. No matter what
values I pass on the command line to Spark, they are overwritten by the
defaults. Does anyone have any idea what could have happened here? Running
on Spark 1.0.2 before, I had no trouble.

Also, I am able to launch the spark shell without these parameters being
overwritten.


spark sql query optimization , and decision tree building

2014-10-21 Thread sanath kumar
Hi all,

I have a large dataset in text files (1,000,000 lines). Each line has 128
columns; each line is a feature vector and each column is a dimension.

I have converted the txt files to JSON format and am able to run SQL queries
on the JSON files using Spark.

Now I am trying to build a k-dimensional decision tree (kd-tree) with this
large dataset.

My steps:
1) Calculate the variance of each column, pick the column with maximum
variance, make it the key of the first node, and use the mean of that column
as the node's value.
2) Based on the first node's value, split the data into 2 parts and repeat the
process until a stopping point is reached.

My sample code :

import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")

My Questions :

1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column? Is there an efficient way
(see the sketch below)?
3) I will be running multiple queries on the same data. Does Spark have any
way to optimize this?
4) How do I save the output as key-value pairs in a text file?

5) Is there any way I can build a kd decision tree using Spark's machine
learning libraries?
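
Regarding question 2 and step 1 above, a minimal Scala sketch (assuming the 128
columns parse as doubles; output is the SchemaRDD from the sample code, and the
column access is illustrative only):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = output.map(row => Vectors.dense((0 until 128).map(row.getDouble).toArray)).cache()

// One pass over the data gives per-column count, mean and variance.
val summary   = Statistics.colStats(vectors)
val variances = summary.variance.toArray
val splitCol  = variances.indexOf(variances.max)   // column with maximum variance
val splitVal  = summary.mean(splitCol)             // mean of that column

// Split for the two children of the kd-tree node (step 2).
val left  = vectors.filter(_(splitCol) <  splitVal)
val right = vectors.filter(_(splitCol) >= splitVal)

For question 1, collect() on a SchemaRDD returns the rows as a local array, and
for question 3, sqlContext.cacheTable("people") keeps the table in memory across
queries.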

please help

Thanks,

Sanath


Re: create a Row Matrix

2014-10-21 Thread viola
Thanks for the quick response. However, I still only get error messages. I am
able to load a .txt file with entries in it and use it in Spark, but I am
not able to create a simple matrix, for instance a 2x2 row matrix
[1 2
3 4]
I tried variations such as
val RowMatrix = Matrix(2,2,array(1,3,2,4))
but it doesn't work.
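
For what it's worth, a minimal sketch using the 2x2 values above (MLlib's local
matrices and the distributed RowMatrix are built through different entry points;
Matrices.dense takes its values in column-major order):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Local 2x2 dense matrix [1 2; 3 4], supplied column by column.
val local = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0))

// Distributed RowMatrix: one Vector per row, wrapped in an RDD.
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat  = new RowMatrix(rows)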





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913p16993.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Subscription request

2014-10-21 Thread Sathya
Hi,

Kindly subscribe me to the user group.

Regards,
Sathyanarayanan


Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-21 Thread Akhil Das
Hi Buntu,

You could something similar to the following:

> val receiver_stream = new ReceiverInputDStream(ssc) {
>   override def getReceiver(): Receiver[Nothing] = ??? //Whatever
> }.map((x : String) => (null, x))
> val config = new Configuration()
> config.set("mongo.output.uri", "mongodb://akhld:27017/sigmoid.output")
> receiver_stream.foreachRDD(rdd => {
>   val pair_rdd = new PairRDDFunctions[Null, String](rdd) // make sure your rdd contains a key, value
>   pair_rdd.saveAsNewAPIHadoopFile("/home/akhld/sigmoid/beta/",
>     classOf[Any], classOf[Any],
>     classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)
> })
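
Alternatively, inside foreachRDD each batch is a plain RDD of pairs, so with
import org.apache.spark.SparkContext._ in scope the implicit conversion to
PairRDDFunctions makes saveAsNewAPIHadoopFile available directly. A minimal sketch
(the output path, key/value classes and output format are illustrative, not
specific to the Mongo example above):

import org.apache.spark.SparkContext._   // RDD[(K, V)] -> PairRDDFunctions implicits
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// `stream` is assumed to be the DStream[String] built from
// KafkaUtils.createStream(...).map(_._2) as in the snippet below.
stream.foreachRDD { rdd =>
  val conf = new Configuration()
  rdd.map(line => (NullWritable.get(), new Text(line)))
     .saveAsNewAPIHadoopFile("/tmp/streaming-output", classOf[NullWritable], classOf[Text],
       classOf[TextOutputFormat[NullWritable, Text]], conf)
}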


Thanks
Best Regards

On Tue, Oct 21, 2014 at 11:59 PM, Buntu Dev  wrote:

> Thanks Akhil,
>
> I tried this but running into similar error:
>
> ~~
>  val stream = KafkaUtils.createStream(ssc, zkQuorum, group,
> topicpMap).map(_._2)
>  stream.map(message => {
>  (null, message)
>
>}).saveAsNewAPIHadoopFile (destination, classOf[Void],
> classOf[Group], classOf[ExampleOutputFormat], conf)
> 
>
> Error:
> value saveAsNewAPIHadoopFile is not a member of
> org.apache.spark.rdd.RDD[(Null, String)]
>
>
> How do I go about converting to PairRDDFunctions?
>
>
> On Fri, Oct 10, 2014 at 12:01 AM, Akhil Das 
> wrote:
>
>> You can convert this ReceiverInputDStream into PairRDDFunctions
>> and call the saveAsNewAPIHadoopFile.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 10, 2014 at 11:28 AM, Buntu Dev  wrote:
>>
>>> Basically I'm attempting to convert a JSON stream to Parquet and I get
>>> this error without the .values or .map(_._2) :
>>>
>>>  value saveAsNewAPIHadoopFile is not a member of
>>> org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, String)]
>>>
>>>
>>>
>>>
>>> On Thu, Oct 9, 2014 at 10:15 PM, Sean Owen  wrote:
>>>
 Your RDD does not contain pairs, since you ".map(_._2)" (BTW that can
 just be ".values"). "Hadoop files" means "SequenceFiles" and those
 store key-value pairs. That's why the method only appears for
 RDD[(K,V)].

 On Fri, Oct 10, 2014 at 3:50 AM, Buntu Dev  wrote:
 > Thanks Sean, but I'm importing
 org.apache.spark.streaming.StreamingContext._
 >
 > Here are the spark imports:
 >
 > import org.apache.spark.streaming._
 >
 > import org.apache.spark.streaming.StreamingContext._
 >
 > import org.apache.spark.streaming.kafka._
 >
 > import org.apache.spark.SparkConf
 >
 > 
 >
 > val stream = KafkaUtils.createStream(ssc, zkQuorum, group,
 > topicpMap).map(_._2) stream.saveAsNewAPIHadoopFile
 (destination,
 > classOf[Void], classOf[Group], classOf[ExampleOutputFormat], conf)
 >
 > 
 >
 > Anything else I might be missing?

>>>
>>>
>>
>