Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
In general, I don't think that means you should exclude something;
it's still needed.

The problem is that commons config depends *only* on *beanutils-core
1.8.0* so it ends up managing up that artifact version only, and not
the main beanutils one.

In this particular instance, which I've seen before, it seems clear
that beanutils-core *can* be excluded since it's a subset (see: the
warning message!) and beanutils is an uber-artifact. But then we'd
have to manage up beanutils to 1.8.0.
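
A rough sketch of what that would look like in Maven, for illustration only
(which pom it lands in, and whether it goes through dependencyManagement or a
direct version bump, are assumptions rather than a tested patch):

<!-- Pin the uber-artifact to the newer version... -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>commons-beanutils</groupId>
      <artifactId>commons-beanutils</artifactId>
      <version>1.8.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- ...and exclude the subset artifact where it comes in transitively. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <exclusions>
    <exclusion>
      <groupId>commons-beanutils</groupId>
      <artifactId>commons-beanutils-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

The shaded jar should then end up with a single copy of those classes.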

On Thu, Apr 9, 2015 at 5:24 PM, Ted Yu  wrote:
> commons-beanutils is brought in transitively:
>
> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.4.0:compile
> [INFO] |  |  +- commons-cli:commons-cli:jar:1.2:compile
> [INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
> [INFO] |  |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  |  +- commons-io:commons-io:jar:2.4:compile
> [INFO] |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
> [INFO] |  |  +- commons-lang:commons-lang:jar:2.6:compile
> [INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  \-
> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>
> Looks like we can exclude commons-beanutils in pom.xml for hadoop-client
> dependency.
> e.g. in core/pom.xml :
>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client</artifactId>
>   <exclusions>
>     <exclusion>
>       <groupId>javax.servlet</groupId>
>       <artifactId>servlet-api</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
>
> On Thu, Apr 9, 2015 at 2:14 PM, Ritesh Kumar Singh
>  wrote:
>>
>> Though the warnings can be ignored, they add up in the log files while
>> compiling other projects too. And there are a lot of those warnings. Any
>> workaround? How do we modify the pom.xml file to exclude these unnecessary
>> dependencies?
>>
>> On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:
>>>
>>> Generally, you can ignore these things. They mean some artifacts
>>> packaged other artifacts, and so two copies show up when all the JAR
>>> contents are merged.
>>>
>>> But here you do show a small dependency convergence problem; beanutils
>>> 1.7 is present but beanutils-core 1.8 is too even though these should
>>> be harmonized. I imagine one could be excluded; I imagine we could
>>> harmonize the version manually. In practice, I also imagine it doesn't
>>> cause any problem but feel free to propose a fix along those lines.
>>>
>>> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
>>>  wrote:
>>> > Hi,
>>> >
>>> > During compilation I get a lot of these:
>>> >
>>> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
>>> >  23 overlappping classes:
>>> >
>>> > [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
>>> > define
>>> >  82 overlappping classes:
>>> >
>>> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
>>> > commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
>>> >
>>> >
>>> > And a lot of others. How do I fix these?
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Pairwise computations within partition

2015-04-09 Thread abellet
Hello everyone,

I am a Spark novice facing a nontrivial problem to solve with Spark.

I have an RDD consisting of many elements (say, 60K), where each element is
a d-dimensional vector.

I want to implement an iterative algorithm which does the following. At each
iteration, I want to apply an operation on *pairs* of elements (say, compute
their dot product). Of course the number of pairs is huge, but I only need
to consider a small random subset of the possible pairs at each iteration.

To minimize communication between nodes, I am willing to partition my RDD by
key (where each element gets a random key) and to only consider pairs of
elements that belong to the same partition (i.e., that share the same key).
But I am not sure how to sample and apply the operation on pairs, and to
make sure that the computation for each pair is indeed done by the node
holding the corresponding elements.

Any help would be greatly appreciated. Thanks a lot!
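
For concreteness, here is a rough, untested sketch of what I have in mind (the
name "vectors", the number of partitions, the number of sampled pairs and the
dot product are all placeholders for my real setup):

import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions (not needed in 1.3+ / the shell)

// vectors: RDD[Array[Double]] -- the ~60K d-dimensional vectors
val numPartitions = 100
val keyed = vectors
  .map(v => (Random.nextInt(numPartitions), v))        // random key per element
  .partitionBy(new HashPartitioner(numPartitions))     // co-locate equal keys on one node

// One iteration: sample pairs inside every partition and apply the pairwise
// operation locally, so no vector leaves the node holding it.
val pairResults = keyed.mapPartitions { iter =>
  val vecs = iter.map(_._2).toArray
  if (vecs.length < 2) Iterator.empty
  else {
    val rng = new Random()
    Iterator.fill(1000) {                              // 1000 random pairs per partition
      val a = vecs(rng.nextInt(vecs.length))
      val b = vecs(rng.nextInt(vecs.length))
      a.zip(b).map { case (x, y) => x * y }.sum        // dot product of the pair
    }
  }
}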



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pairwise-computations-within-partition-tp22436.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default, Spark 1.3 is bound to Hive 0.13.1, though you can bind it to
Hive 0.12 by specifying that profile when building Spark, as per
https://spark.apache.org/docs/1.3.0/building-spark.html.

If you are downloading a pre-built version of Spark 1.3, then by default
it is bound to Hive 0.13.1.

HTH!
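
For reference, the build invocation looks roughly like this (profile names as
listed on that page for 1.3.0; please double-check them against the docs):

# Default build, bound to Hive 0.13.1:
mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

# To bind to Hive 0.12 instead, add the 0.12 profile:
mvn -Pyarn -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package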

On Thu, Apr 9, 2015 at 10:03 AM ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Most likely you have an existing Hive installation with data in it. In
> this case I was not able to get Spark 1.3 to communicate with the existing Hive
> metastore. Hence, when I read any table created in Hive, Spark SQL used to
> complain "Data table not found"
>
> If you get it working, please share the steps.
>
> On Thu, Apr 9, 2015 at 9:25 PM, Arthur Chan 
> wrote:
>
>> Hi,
>>
>> I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark
>> 1.3.x
>>
>> Could anyone advise which Hive version should be used to match Spark
>> 1.3.x?
>> Can I use Hive 1.1.0 for Spark 1.3? or can I use Hive 0.14 for Spark 1.3?
>>
>> Regards
>> Arthur
>>
>
>
>
> --
> Deepak
>
>


RDD union

2015-04-09 Thread Debasish Das
Hi,

I have some code that creates ~80 RDDs, and then sc.union is applied to
combine all 80 into one for the next step (to run topByKey, for example)...

While creating the 80 RDDs takes 3 minutes per RDD, the union over them takes 3
hours (I am validating these numbers)...

Is there any checkpoint based option to further speed up the union ?

Thanks.
Deb


Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Hi there,

You should be selective about which RDDs you cache and which you don't. A
good candidate RDD for caching is one that you reuse multiple times.
Commonly the reuse is for iterative machine learning algorithms that need
to take multiple passes over the data.

If you try to cache a really large RDD, Spark may evict older RDDs out of
memory to make room for the new one. So, that's another reason to be
careful about which RDDs you cache.

>> Is it a correct conclusion that it doesn't matter if ".cache" is used
anywhere in the program if I only have one action that is called only once?

Maybe. Sometimes you may call an action that triggers a large DAG with 200
RDDs to materialize. Inside that DAG there may be ML algorithms or reuse of
an RDD multiple times for joins with other datasets. In these cases, even
though you're calling just one action, it would make sense to cache
certain, strategic RDDs.


But with regards to your specific question...

>>Related to this question, consider this situation:
val d1 = data.map((x,y,z) => (x,y))
val d2 = data.map((x,y,z) => (y,x))

>> I'm wondering if Spark is optimizing the execution in a way that the
mappers
for d1 and d2 are running in parallel and the data RDD is traversed only
once.

Here caching doesn't really help. Spark would be smart enough to realize
that both maps can be pipelined together in one thread/task. So, if the
'data' RDD has 5 partitions, you would just need 5 threads to apply both
maps (not 10 threads).

When you call an action, the DAG gets broken down into Stages. Sometimes a
prior stage has to completely finish before running the next stage. Inside
a Stage there are multiple tasks/threads, one for each partition. Usually a
wide dependency relationship between two RDDs defines the stage boundary. A
wide dependency means that a network shuffle operation has to take place
between the two stages.

As Bojan said, you can call the .toDebugString() method on an RDD to begin
understanding how the DAG that generates that specific RDD breaks down into
different stages of execution.
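
For example, a tiny illustration (not your code) of where a stage boundary
shows up:

import org.apache.spark.SparkContext._   // pair-RDD functions (implicit in 1.3+ / the shell)

val data = sc.parallelize(1 to 1000000, 5).map(i => (i % 100, i))
val joined = data
  .reduceByKey(_ + _)                    // wide dependency: shuffle, new stage
  .map { case (k, sum) => (sum, k) }     // narrow dependency: pipelined in the same stage

println(joined.toDebugString)            // indentation in the output marks shuffle boundaries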



On Thu, Apr 9, 2015 at 1:58 AM, Bojan Kostic  wrote:

> You can use toDebugString to see all the steps in job.
>
> Best
> Bojan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Hi,

During compilation I get a lot of these:

[WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
 23 overlappping classes:

[WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar define
 82 overlappping classes:

[WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
commons-beanutils-core-1.8.0.jar define 10 overlappping classes:


And a lot of others. How do I fix these?


Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread mehdisinger
Hi,

I'm facing an issue when I try to run my Spark application. I keep getting
the following exception:

15/04/09 15:14:07 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID 5)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID:
13994
Serialization trace:
ord (org.apache.spark.util.BoundedPriorityQueue)
at
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:599)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
at
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I'm not sure where this exception occurs exactly... 
Does anyone know about this issue?

I'm running Spark version 1.1.0. 
My Master and workers are running on different machines (cluster mode), all
with the exact same architecture/configuration

Can anyone help?

Regards



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-exception-Encountered-unregistered-class-ID-13994-tp22437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
OK, what do I need to do in order to migrate the patch?

Thanks
Alex

On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma 
wrote:

> This is the jira I referred to
> https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
> working on it is evaluating priority between upgrading to scala 2.11.5(it
> is non trivial I suppose because repl has changed a bit) or migrating that
> patch is much simpler.
>
> Prashant Sharma
>
>
>
> On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos  wrote:
>
>> Hi-
>>
>> Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988
>>
>> Any help in getting this working would be much appreciated!
>>
>> Thanks
>> Alex
>>
>> On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma 
>> wrote:
>>
>>> You are right this needs to be done. I can work on it soon, I was not
>>> sure if there is any one even using scala 2.11 spark repl. Actually there
>>> is a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID),
>>> which has to be ported for scala 2.11 too. If however, you(or anyone else)
>>> are planning to work, I can help you ?
>>>
>>> Prashant Sharma
>>>
>>>
>>>
>>> On Thu, Apr 9, 2015 at 3:08 PM, anakos  wrote:
>>>
 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the
 REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
 with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not
 available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11 is
 not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching
 around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-04-09 Thread Saiph Kappa
Sorry, I was getting those errors because my workload was not sustainable.

However, I noticed that, by just running the spark-streaming-benchmark (
https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala
), I get no difference in execution time, number of processed records,
or delay whether I'm using 1 machine or 2 machines with the setup
described before (using Spark standalone). Is that normal?



On Fri, Mar 27, 2015 at 5:32 PM, Tathagata Das  wrote:

> If it is deterministically reproducible, could you generate full DEBUG
> level logs, from the driver and the workers and give it to me? Basically I
> want to trace through what is happening to the block that is not being
> found.
> And can you tell what Cluster manager are you using? Spark Standalone,
> Mesos or YARN?
>
>
> On Fri, Mar 27, 2015 at 10:09 AM, Saiph Kappa 
> wrote:
>
>> Hi,
>>
>> I am just running this simple example with
>> machineA: 1 master + 1 worker
>> machineB: 1 worker
>> «
>> val ssc = new StreamingContext(sparkConf, Duration(1000))
>>
>> val rawStreams = (1 to numStreams).map(_
>> =>ssc.rawSocketStream[String](host, port,
>> StorageLevel.MEMORY_ONLY_SER)).toArray
>> val union = ssc.union(rawStreams)
>>
>> union.filter(line => Random.nextInt(1) == 0).map(line => {
>>   var sum = BigInt(0)
>>   line.toCharArray.foreach(chr => sum += chr.toInt)
>>   fib2(sum)
>>   sum
>> }).reduceByWindow(_+_, Seconds(1),Seconds(1)).map(s => s"### result:
>> $s").print()
>> »
>>
>> And I'm getting the following exceptions:
>>
>> Log from machineB
>> «
>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 132
>> 15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 134
>> 15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
>> 15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast
>> variable 24
>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 136
>> 15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 138
>> 15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 140
>> 15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
>> 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
>> curMem=47117, maxMem=280248975
>> 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
>> bytes in memory (estimated size 1886.0 B, free 267.2 MB)
>> 15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
>> broadcast_24_piece0
>> 15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24
>> took 19 ms
>> 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
>> curMem=49003, maxMem=280248975
>> 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values
>> in memory (estimated size 3.0 KB, free 267.2 MB)
>> 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0
>> (TID 140)
>> java.lang.Exception: Could not compute split, block input-0-1427473262420
>> not found
>> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:701)
>> 15/03/27 16:21:35 ERROR Executor: Exception in task 6.0 in stage 27.0
>> (TID 138)
>> java.lang.Exception: Could not compute split, block input-0-1427473262418
>> not found
>> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>>

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Ted Yu
Is there a custom class involved in your application?

I assume you have called sparkConf.registerKryoClasses() for such class(es).

Cheers
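
For reference, registration would look roughly like this (MyRecord / MyKey are
placeholders for your own classes; note that SparkConf.registerKryoClasses only
exists in Spark 1.2+, so on 1.1.0 you would register classes through a custom
spark.kryo.registrator instead):

import org.apache.spark.SparkConf

// MyRecord and MyKey stand in for the application's own classes.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))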

On Thu, Apr 9, 2015 at 7:15 AM, mehdisinger 
wrote:

> Hi,
>
> I'm facing an issue when I try to run my Spark application. I keep getting
> the following exception:
>
> 15/04/09 15:14:07 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID
> 5)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID:
> 13994
> Serialization trace:
> ord (org.apache.spark.util.BoundedPriorityQueue)
> at
>
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
> at
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:599)
> at
>
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at
>
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at
>
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> I'm not sure where this exception occurs exactly...
> Does anyone know about this issue?
>
> I'm running Spark version 1.1.0.
> My Master and workers are running on different machines (cluster mode), all
> with the exact same architecture/configuration
>
> Can anyone help?
>
> Regards
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-exception-Encountered-unregistered-class-ID-13994-tp22437.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Typo in previous email, pardon me.

Set "spark.driver.maxResultSize" to 1068 or higher.

On Thu, Apr 9, 2015 at 8:57 AM, Ted Yu  wrote:

> Please set "spark.kryoserializer.buffer.max.mb" to 1068 (or higher).
>
> Cheers
>
> On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> Pressed send early.
>>
>> I had tried that with these settings
>>
>>  buffersize=128 maxbuffersize=1024
>>
>> val conf = new SparkConf()
>>
>> .setAppName(detail)
>>
>> .set("spark.serializer",
>> "org.apache.spark.serializer.KryoSerializer")
>>
>>
>> .set("spark.kryoserializer.buffer.mb",arguments.get("buffersize").get)
>>
>>
>> .set("spark.kryoserializer.buffer.max.mb",arguments.get("maxbuffersize").get)
>>
>>
>> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>>
>>
>> On Thu, Apr 9, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> Yes i had tried that.
>>>
>>> Now i see this
>>>
>>> 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
>>> VISummaryDataProvider.scala:38, took 275.334991 s
>>> 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw
>>> exception: Job aborted due to stage failure: Total size of serialized
>>> results of 4 tasks (1067.3 MB) is bigger than spark.driver.maxResultSize
>>> (1024.0 MB)
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Total
>>> size of serialized results of 4 tasks (1067.3 MB) is bigger than
>>> spark.driver.maxResultSize (1024.0 MB)
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at scala.Option.foreach(Option.scala:236)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> 15/04/09 07:58:08 INFO storage.BlockManagerInfo: Removed taskresult_4 on
>>> phxaishdc9dn0579.phx.ebay.com:42771 in memory (size: 273.5 MB, free:
>>> 6.2 GB)
>>> 15/04/09 07:58:08 INFO yarn.ApplicationMaster: Final app status: FAILED,
>>> exitCode: 15, (reason: User
>>>
>>> On Thu, Apr 9, 2015 at 8:18 PM, Ted Yu  wrote:
>>>
 Please take a look at
 https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
 , starting line 27.

 In Spark, you can control the maxBufferSize
 with "spark.kryoserializer.buffer.max.mb"

 Cheers

>>>
>>>
>>>
>>> --
>>> Deepak
>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>


How to submit job in a different user?

2015-04-09 Thread SecondDatke
Well, maybe a Linux configuration problem...

I have a cluster that is about to be exposed to the public, and I want everyone
who uses my cluster to own a user account without sudo permissions etc. (e.g.
'guest') and to be able to submit tasks to Spark, which runs on Mesos under a
different, private user ('sparkuser' for example).

But now let's say I launched the Mesos slave on Node 1 as 'sparkuser' and on
Node 2 as 'guest', and submit a job as 'guest'. Then Node 1 will fail, saying:

Failed to change user to 'guest': Failed to set gid: Operation not permitted.

Any solution? Or does this just not make any sense?
Thanks.

Re: Spark Job #of attempts ?

2015-04-09 Thread Deepak Jain
Can I see the current values of all configs, similar to the configuration page in
the Hadoop world, from the UI?

Sent from my iPhone

> On 09-Apr-2015, at 11:07 pm, Marcelo Vanzin  wrote:
> 
> Set spark.yarn.maxAppAttempts=1 if you don't want retries.
> 
>> On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>> Hello,
>> I have a spark job with 5 stages. After it runs 3rd stage, the console shows
>> 
>> 
>> 15/04/09 10:25:57 INFO yarn.Client: Application report for
>> application_1427705526386_127168 (state: RUNNING)
>> 15/04/09 10:25:58 INFO yarn.Client: Application report for
>> application_1427705526386_127168 (state: RUNNING)
>> 15/04/09 10:25:59 INFO yarn.Client: Application report for
>> application_1427705526386_127168 (state: ACCEPTED)
>> 15/04/09 10:25:59 INFO yarn.Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: N/A
>> ApplicationMaster RPC port: -1
>> queue: hdmi-express
>> start time: 1428598679223
>> final status: UNDEFINED
>> tracking URL:
>> https://apollo-phx-rm-1.vip.ebay.com:50030/proxy/application_1427705526386_127168/
>> user: dvasthimal
>> 15/04/09 10:26:00 INFO yarn.Client: Application report for
>> application_1427705526386_127168 (state: ACCEPTED)
>> 15/04/09 10:26:01 INFO yarn.Client: Application report for
>> application_1427705526386_127168 (state: ACCEPTED)
>> 
>> and then running again. This looks as if the stage failed and Spark
>> restarted the job from beginning. If thats not the case, when i click the
>> spark UI web page, it does not show already completed stages and instead
>> goes back to running stage #1. Is there some setting to turn this behavior
>> off ?
>> 
>> --
>> Deepak
> 
> 
> 
> -- 
> Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Please set "spark.kryoserializer.buffer.max.mb" to 1068 (or higher).

Cheers

On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Pressed send early.
>
> I had tried that with these settings
>
>  buffersize=128 maxbuffersize=1024
>
> val conf = new SparkConf()
>
> .setAppName(detail)
>
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>
> .set("spark.kryoserializer.buffer.mb",arguments.get("buffersize").get)
>
>
> .set("spark.kryoserializer.buffer.max.mb",arguments.get("maxbuffersize").get)
>
>
> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>
>
> On Thu, Apr 9, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> Yes i had tried that.
>>
>> Now i see this
>>
>> 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
>> VISummaryDataProvider.scala:38, took 275.334991 s
>> 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw
>> exception: Job aborted due to stage failure: Total size of serialized
>> results of 4 tasks (1067.3 MB) is bigger than spark.driver.maxResultSize
>> (1024.0 MB)
>> org.apache.spark.SparkException: Job aborted due to stage failure: Total
>> size of serialized results of 4 tasks (1067.3 MB) is bigger than
>> spark.driver.maxResultSize (1024.0 MB)
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/04/09 07:58:08 INFO storage.BlockManagerInfo: Removed taskresult_4 on
>> phxaishdc9dn0579.phx.ebay.com:42771 in memory (size: 273.5 MB, free: 6.2
>> GB)
>> 15/04/09 07:58:08 INFO yarn.ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User
>>
>> On Thu, Apr 9, 2015 at 8:18 PM, Ted Yu  wrote:
>>
>>> Please take a look at
>>> https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
>>> , starting line 27.
>>>
>>> In Spark, you can control the maxBufferSize
>>> with "spark.kryoserializer.buffer.max.mb"
>>>
>>> Cheers
>>>
>>
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>


Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
This is the JIRA I referred to:
https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
working on it yet is weighing the priority between upgrading to Scala 2.11.5
(non-trivial, I suppose, because the repl has changed a bit) and migrating that
patch, which is much simpler.

Prashant Sharma



On Thu, Apr 9, 2015 at 4:16 PM, Alex Nakos  wrote:

> Hi-
>
> Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988
>
> Any help in getting this working would be much appreciated!
>
> Thanks
> Alex
>
> On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma 
> wrote:
>
>> You are right this needs to be done. I can work on it soon, I was not
>> sure if there is any one even using scala 2.11 spark repl. Actually there
>> is a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID),
>> which has to be ported for scala 2.11 too. If however, you(or anyone else)
>> are planning to work, I can help you ?
>>
>> Prashant Sharma
>>
>>
>>
>> On Thu, Apr 9, 2015 at 3:08 PM, anakos  wrote:
>>
>>> Hi-
>>>
>>> I am having difficulty getting the 1.3.0 Spark shell to find an external
>>> jar.  I have build Spark locally for Scala 2.11 and I am starting the
>>> REPL
>>> as follows:
>>>
>>> bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar
>>>
>>> I see the following line in the console output:
>>>
>>> 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR
>>>
>>> file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
>>> at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
>>> with
>>> timestamp 1428569535904
>>>
>>> but when i try to import anything from this jar, it's simply not
>>> available.
>>> When I try to add the jar manually using the command
>>>
>>> :cp /path/to/jar
>>>
>>> the classes in the jar are still unavailable. I understand that 2.11 is
>>> not
>>> officially supported, but has anyone been able to get an external jar
>>> loaded
>>> in the 1.3.0 release?  Is this a known issue? I have tried searching
>>> around
>>> for answers but the only thing I've found that may be related is this:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-3257
>>>
>>> Any/all help is much appreciated.
>>> Thanks
>>> Alex
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
Finally got it working by bumping the Spark version in Maven to 1.2.1.
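
In other words, aligning the client dependencies with the 1.2.1 build running
on the cluster, roughly:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.2.1</version>
</dependency>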

On Thu, Apr 9, 2015 at 12:30 PM, Mohit Anchlia 
wrote:

> I changed the JDK to Oracle but I still get this error. Not sure what it
> means by "Stream class is incompatible with local class". I am using the
> following build on the server "spark-1.2.1-bin-hadoop2.4"
>
> 15/04/09 15:26:24 ERROR JobScheduler: Error running job streaming job
> 1428607584000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 15.0 (TID 66, ip-10-241-251-232.us-west-2.compute.internal):
> java.io.InvalidClassException: org.apache.spark.Aggregator; local class
> incompatible: stream classdesc serialVersionUID = 5032037208639381169,
> local class serialVersionUID = -9085606473104903453
> at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
>
> On Wed, Apr 8, 2015 at 3:40 PM, Ted Yu  wrote:
>
>> bq. one is Oracle and the other is OpenJDK
>>
>> I don't have experience with mixed JDK's.
>> Can you try with using single JDK ?
>>
>> Cheers
>>
>> On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia 
>> wrote:
>>
>>> For the build I am using java version "1.7.0_65" which seems to be the
>>> same as the one on the spark host. However one is Oracle and the other is
>>> OpenJDK. Does that make any difference?
>>>
>>> On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu  wrote:
>>>
 What version of Java do you use to build ?

 Cheers

 On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia 
 wrote:

> I am seeing the following, is this because of my maven version?
>
> 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
> 0, ip-10-241-251-232.us-west-2.compute.internal):
> java.io.InvalidClassException: org.apache.spark.Aggregator; local class
> incompatible: stream classdesc serialVersionUID = 5032037208639381169,
> local class serialVersionUID = -9085606473104903453
> at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.2.0</version>
> </dependency>
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.2.0</version>
> </dependency>
>
>

>>>
>>
>


External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread anakos
Hi-

I am having difficulty getting the 1.3.0 Spark shell to find an external
jar. I have built Spark locally for Scala 2.11 and I am starting the REPL
as follows:

bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

I see the following line in the console output:

15/04/09 09:52:15 INFO spark.SparkContext: Added JAR
file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar with
timestamp 1428569535904

but when I try to import anything from this jar, it's simply not available.
When I try to add the jar manually using the command

:cp /path/to/jar

the classes in the jar are still unavailable. I understand that 2.11 is not
officially supported, but has anyone been able to get an external jar loaded
in the 1.3.0 release?  Is this a known issue? I have tried searching around
for answers but the only thing I've found that may be related is this:

https://issues.apache.org/jira/browse/SPARK-3257

Any/all help is much appreciated.
Thanks
Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Guillaume Pitel

Hi,

From my experience, those errors happen under very high memory
pressure and/or on machines with bad hardware (memory, network
card, ...). I have had a few of them, as well as Snappy decompression errors,
on a machine with a slightly failing memory stick. Given the large
amount of data transiting on the wire during shuffles, the probability
of an error occurring during deserialization or decompression is
relatively high. In general, reducing the memory pressure also helps a lot.


Guillaume

Hi,

I'm facing an issue when I try to run my Spark application. I keep getting
the following exception:

15/04/09 15:14:07 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID 5)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID:
13994
Serialization trace:
ord (org.apache.spark.util.BoundedPriorityQueue)
at
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:599)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
at
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I'm not sure where this exception occurs exactly...
Does anyone know about this issue?

I'm running Spark version 1.1.0.
My Master and workers are running on different machines (cluster mode), all
with the exact same architecture/configuration

Can anyone help?

Regards



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-exception-Encountered-unregistered-class-ID-13994-tp22437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: Caching and Actions

2015-04-09 Thread spark_user_2015
That was helpful!

The conclusion:
(1) The mappers are not executed in parallel when processing independently
the same RDD.
(2) The best way seems to be (if enough memory is available and an action is
applied to d1 and d2 later on)
   val d1 = data.map((x,y,z) => (x,y)).cache
   val d2 = d1.map((x,y) => (y,x)) 
 -  This avoids pipelining the "d1" mapper and "d2" mapper when
computing d2

This is important to write efficient code, toDebugString helps a lot.
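
A runnable form of that pattern, assuming an RDD of triples ("data" below is
just a small stand-in):

// data stands in for the RDD of triples in the example above.
val data = sc.parallelize(Seq((1, 2, 3), (4, 5, 6), (7, 8, 9)))

val d1 = data.map { case (x, y, z) => (x, y) }.cache()
val d2 = d1.map { case (x, y) => (y, x) }

d1.count()   // first action: materializes d1 and caches its partitions
d2.count()   // second action: reads the cached d1 instead of re-mapping data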



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SQL can't not create Hive database

2015-04-09 Thread Hao Ren
Hi,

I am working on the local mode.

The following code 

hiveContext.setConf("hive.metastore.warehouse.dir",
"/home/spark/hive/warehouse")
hiveContext.sql("create database if not exists db1")

 throws

15/04/09 13:53:16 ERROR RetryingHMSHandler: MetaException(message:Unable to
create database path file:/user/hive/warehouse/db1.db, failed to create
database db1)

It seems that it uses the default warehouse path (file:/user/hive/warehouse), not
the local one specified on the hiveContext.

Any idea ? Thank you.

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-tp22435.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
Thanks Sandy, appreciate it.

On Thu, Apr 9, 2015 at 10:32 PM, Sandy Ryza  wrote:

> Hi Deepak,
>
> I'm going to shamelessly plug my blog post on tuning Spark:
>
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
>
> It talks about tuning executor size as well as how the number of tasks for
> a stage is calculated.
>
> -Sandy
>
> On Thu, Apr 9, 2015 at 9:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> I have a spark job that has multiple stages. For now i star it with 100
>> executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN
>> 2.4.x.
>>
>> For now i start the Spark Job with a very limited input (1 file of size
>> 2G), overall there are 200 files. My first run is yet to complete as its
>> taking too much of time / throwing OOM exceptions / buffer exceptions (keep
>> that aside).
>>
>> How will i know how much resources are required to run this job. (# of
>> cores, executors, mem, serialization buffers, and i do not yet what else).
>>
>> IN M/R world, all i do is set split size and rest is taken care
>> automatically (yes i need to worry about mem, in case of OOM).
>>
>>
>> 1) Can someone explain how they do resource estimation before running the
>> job or is there no way and one needs to only try it out ?
>> 2) Even if i give 100 executors, the first stage takes only 5, how did
>> spark decide this ?
>>
>> Please point me to any resources that also talks about similar things or
>> please explain here.
>>
>> --
>> Deepak
>>
>>
>


-- 
Deepak


Re: Spark Job #of attempts ?

2015-04-09 Thread Marcelo Vanzin
Set spark.yarn.maxAppAttempts=1 if you don't want retries.
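
For example, on the spark-submit command line (class and jar names are
placeholders):

spark-submit --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=1 \
  --class com.example.MyApp my-app.jar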

On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
> Hello,
> I have a spark job with 5 stages. After it runs 3rd stage, the console shows
>
>
> 15/04/09 10:25:57 INFO yarn.Client: Application report for
> application_1427705526386_127168 (state: RUNNING)
> 15/04/09 10:25:58 INFO yarn.Client: Application report for
> application_1427705526386_127168 (state: RUNNING)
> 15/04/09 10:25:59 INFO yarn.Client: Application report for
> application_1427705526386_127168 (state: ACCEPTED)
> 15/04/09 10:25:59 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: hdmi-express
> start time: 1428598679223
> final status: UNDEFINED
> tracking URL:
> https://apollo-phx-rm-1.vip.ebay.com:50030/proxy/application_1427705526386_127168/
> user: dvasthimal
> 15/04/09 10:26:00 INFO yarn.Client: Application report for
> application_1427705526386_127168 (state: ACCEPTED)
> 15/04/09 10:26:01 INFO yarn.Client: Application report for
> application_1427705526386_127168 (state: ACCEPTED)
>
> and then running again. This looks as if the stage failed and Spark
> restarted the job from the beginning. If that's not the case, when I click the
> Spark UI web page, it does not show the already completed stages and instead
> goes back to running stage #1. Is there some setting to turn this behavior
> off?
>
> --
> Deepak
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Join on Spark too slow.

2015-04-09 Thread Guillaume Pitel
Maybe I'm wrong, but what you are doing here is basically a bunch of
cartesian products, one for each key. So if "hello" appears 100 times in your
corpus, it will produce 100*100 elements in the join output.


I don't understand what you're trying to do here, but it's no surprise your join
takes forever; it makes no sense as is, IMO.


Guillaume

Hello guys,

I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with >10GB RAM
each, but the join seems to be taking too long (> 2hrs).

I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.

Do you have any idea why this is happening?

Thanks a lot,
Kostas
val sc = new SparkContext(
  args(0),
  "DummyJoin",
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val file = sc.textFile(args(1))

val wordTuples = file
  .flatMap(line => line.split(args(2)))
  .map(word => (word, 1))

val big = wordTuples.filter {
  case ((k, v)) => k != "a"
}.cache()

val small = wordTuples.filter {
  case ((k, v)) => k != "a" && k != "to" && k != "and"
}.cache()

val res = big.leftOuterJoin(small)
res.saveAsTextFile(args(3))



--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: Pairwise computations within partition

2015-04-09 Thread Guillaume Pitel

I would try something like that :

val a = rdd.sample(false, 0.1, 1).zipWithIndex.map { case (vector, index) =>
  (index, vector) }
val b = rdd.sample(false, 0.1, 2).zipWithIndex.map { case (vector, index) =>
  (index, vector) }

a.join(b).map { case (_, (vectora, vectorb)) => yourOperation }

Grouping by blocks is probably not what you want, since it would 
restrict the scope of a vector to the vectors in the same block.


Guillaume

Hello everyone,

I am a Spark novice facing a nontrivial problem to solve with Spark.

I have an RDD consisting of many elements (say, 60K), where each element is
a d-dimensional vector.

I want to implement an iterative algorithm which does the following. At each
iteration, I want to apply an operation on *pairs* of elements (say, compute
their dot product). Of course the number of pairs is huge, but I only need
to consider a small random subset of the possible pairs at each iteration.

To minimize communication between nodes, I am willing to partition my RDD by
key (where each element gets a random key) and to only consider pairs of
elements that belong to the same partition (i.e., that share the same key).
But I am not sure how to sample and apply the operation on pairs, and to
make sure that the computation for each pair is indeed done by the node
holding the corresponding elements.

Any help would be greatly appreciated. Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pairwise-computations-within-partition-tp22436.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Any success on embedding local Spark in OSGi?

2015-04-09 Thread Deniz Acay
Hi,

I have been trying to use Spark in an OSGi bundle, but I have had no luck so far.

I have seen similar mails in the past, so I am wondering: has anyone
successfully run Spark inside an OSGi bundle?

I am running Spark in a bundle created with the Maven shade plugin and have even
tried adding the Akka JARs to the bundle classpath manually. I am not trying to
expose it via services, but the Akka configuration is not working. I even managed
to read reference.conf directly from the bundle context and set it with SparkConf.
But it gave me an error regarding the parsing of the value '20s'.

I had been struggling with this for some time and this is almost a matter of 
life and death (in business terms).

Do you have any advice?

Thanks in advance,
Deniz

Re: Overlapping classes warnings

2015-04-09 Thread Ted Yu
commons-beanutils is brought in transitively:

[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.4.0:compile
[INFO] |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  |  +- commons-io:commons-io:jar:2.4:compile
[INFO] |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  |  +- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  |  \-
commons-beanutils:commons-beanutils-core:jar:1.8.0:compile

Looks like we can exclude commons-beanutils in pom.xml for hadoop-client
dependency.
e.g. in core/pom.xml :


  org.apache.hadoop
  hadoop-client
  

  javax.servlet
  servlet-api

  


On Thu, Apr 9, 2015 at 2:14 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> Though the warnings can be ignored, they add up in the log files while
> compiling other projects too. And there are a lot of those warnings. Any
> workaround? How do we modify the pom.xml file to exclude these unnecessary
> dependencies?
>
> On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:
>
>> Generally, you can ignore these things. They mean some artifacts
>> packaged other artifacts, and so two copies show up when all the JAR
>> contents are merged.
>>
>> But here you do show a small dependency convergence problem; beanutils
>> 1.7 is present but beanutils-core 1.8 is too even though these should
>> be harmonized. I imagine one could be excluded; I imagine we could
>> harmonize the version manually. In practice, I also imagine it doesn't
>> cause any problem but feel free to propose a fix along those lines.
>>
>> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
>>  wrote:
>> > Hi,
>> >
>> > During compilation I get a lot of these:
>> >
>> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
>> >  23 overlappping classes:
>> >
>> > [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
>> > define
>> >  82 overlappping classes:
>> >
>> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
>> > commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
>> >
>> >
>> > And a lot of others. How do I fix these?
>>
>
>


override log4j.properties

2015-04-09 Thread patcharee

Hello,

How to override log4j.properties for a specific spark job?

BR,
Patcharee
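
One approach I am considering is to ship a custom log4j.properties with the job
and point the driver and executor JVMs at it, roughly like this (untested; the
class and jar names are placeholders):

# Assumes ./log4j.properties is the custom config for this job.
spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --class com.example.MyApp my-app.jar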


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Join on Spark too slow.

2015-04-09 Thread Kostas Kloudas
Hello guys,

I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with >10GB RAM
each, but the join seems to be taking too long (> 2hrs).

I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.

Do you have any idea why this is happening?

Thanks a lot,
Kostas

val sc = new SparkContext(
  args(0),
  "DummyJoin",
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val file = sc.textFile(args(1))

val wordTuples = file
  .flatMap(line => line.split(args(2)))
  .map(word => (word, 1))

val big = wordTuples.filter {
  case ((k, v)) => k != "a"
}.cache()

val small = wordTuples.filter {
  case ((k, v)) => k != "a" && k != "to" && k != "and"
}.cache()

val res = big.leftOuterJoin(small)
res.saveAsTextFile(args(3))


Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Nandan Tammineedi
Hi,

I'm running a spark streaming job in local mode (--master local[4]), and
I'm seeing tons of these messages, roughly once every second -

WARN BlockManager: Block input-0-1428527584600 replicated to only 0 peer(s)
instead of 1 peers

We're using spark 1.2.1. Even with TRACE logging enabled, we're not seeing
any log messages indicating failure to replicate the blocks.

Should we be concerned about this warning (and if so, how should we debug
this), or is this a corner case in local mode where replication is not
attempted, but the warning is emitted anyway? If so, what is the workaround?

thanks

Nandan


Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Yes i had tried that.

Now i see this

15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
VISummaryDataProvider.scala:38, took 275.334991 s
15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Total size of serialized results of 4
tasks (1067.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
org.apache.spark.SparkException: Job aborted due to stage failure: Total
size of serialized results of 4 tasks (1067.3 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/09 07:58:08 INFO storage.BlockManagerInfo: Removed taskresult_4 on
phxaishdc9dn0579.phx.ebay.com:42771 in memory (size: 273.5 MB, free: 6.2 GB)
15/04/09 07:58:08 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User

On Thu, Apr 9, 2015 at 8:18 PM, Ted Yu  wrote:

> Please take a look at
> https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
> , starting line 27.
>
> In Spark, you can control the maxBufferSize
> with "spark.kryoserializer.buffer.max.mb"
>
> Cheers
>



-- 
Deepak


Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Hi Friends,

I am trying to solve a use case in Spark Streaming, and I need help getting to the
right approach for looking up / updating the master data.

Use case ( simplified )
I have a dataset of entities, each with three attributes and an identifier/row key, in a
persistent store.

Each attribute, along with the row key, comes from a different stream - effectively
3 source streams.

Now whenever any attribute comes in, I want to update/sync the persistent store
and do some processing, but the processing requires the latest state of the
entity, with the latest values of all three attributes.

I wish I had all the entities cached in some sort of centralized cache
(like we have data in HDFS) within Spark Streaming, which could be used for
data-local processing. But I assume there is no such thing.

Potential approaches I am thinking of (I suspect the first two are not feasible, but
I want to confirm):
  1.  Are broadcast variables mutable? If yes, can I use one as a cache for
all entities, sizing around 100s of GBs, provided I have a cluster with enough
RAM?
  2.  Is there any kind of sticky partitioning possible, so that I can route my stream
data through the same node where I have the corresponding entities (a subset
of the entire store) cached in memory within the JVM / off heap on the node? This
would avoid lookups from the store.
  3.  If I stream the entities from the persistent store into the engine, this becomes
a 4th stream - the entity stream. How do I use join / merge to enable streams
1, 2 and 3 to look up and update the data from stream 4? Would DStream.join work for
a few seconds' worth of data in the attribute streams against all the data in the
entity stream? Or do I use transform and an RDD join within it? I have doubts whether
I am leaning towards a core Spark approach within Spark Streaming.
  4.  The last approach, which I think will surely work but want to avoid, is to
keep the entities in an IMDB and do lookup/update calls from streams 1, 2 and 3.

Any help is deeply appreciated as this would help me design my system 
efficiently and the solution approach may become a beacon for lookup master 
data sort of problems.

Regards,
Amit










Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
Generally, you can ignore these things. They mean some artifacts
packaged other artifacts, and so two copies show up when all the JAR
contents are merged.

But here you do show a small dependency convergence problem; beanutils
1.7 is present but beanutils-core 1.8 is too, even though these should
be harmonized. I imagine one could be excluded; I imagine we could
harmonize the version manually. In practice, I also imagine it doesn't
cause any problem but feel free to propose a fix along those lines.

On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
 wrote:
> Hi,
>
> During compilation I get a lot of these:
>
> [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
>  23 overlappping classes:
>
> [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
> define
>  82 overlappping classes:
>
> [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
> commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
>
>
> And a lot of others. How do I fix these?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
I changed the JDK to Oracle but I still get this error. Not sure what it
means by "Stream class is incompatible with local class". I am using the
following build on the server "spark-1.2.1-bin-hadoop2.4"

15/04/09 15:26:24 ERROR JobScheduler: Error running job streaming job
1428607584000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage
15.0 (TID 66, ip-10-241-251-232.us-west-2.compute.internal):
java.io.InvalidClassException: org.apache.spark.Aggregator; local class
incompatible: stream classdesc serialVersionUID = 5032037208639381169,
local class serialVersionUID = -9085606473104903453
at
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)

On Wed, Apr 8, 2015 at 3:40 PM, Ted Yu  wrote:

> bq. one is Oracle and the other is OpenJDK
>
> I don't have experience with mixed JDK's.
> Can you try with using single JDK ?
>
> Cheers
>
> On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia 
> wrote:
>
>> For the build I am using java version "1.7.0_65" which seems to be the
>> same as the one on the spark host. However one is Oracle and the other is
>> OpenJDK. Does that make any difference?
>>
>> On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu  wrote:
>>
>>> What version of Java do you use to build ?
>>>
>>> Cheers
>>>
>>> On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia 
>>> wrote:
>>>
 I am seeing the following, is this because of my maven version?

 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
 0, ip-10-241-251-232.us-west-2.compute.internal):
 java.io.InvalidClassException: org.apache.spark.Aggregator; local class
 incompatible: stream classdesc serialVersionUID = 5032037208639381169,
 local class serialVersionUID = -9085606473104903453
 at
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
 at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
 at
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)


 

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.10</artifactId>
   <version>1.2.0</version>
 </dependency>

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.2.0</version>
 </dependency>


>>>
>>
>


Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
In aggregateMessagesWithActiveSet, Spark still has to read all edges. This
means that a fixed cost which scales with graph size is unavoidable in a
Pregel-like iteration.

But what if I have to run nearly 100 iterations, and in the last 50
iterations only < 0.1% of the nodes need to be updated? The fixed cost
makes the program finish with an unacceptable time consumption.

Alcaid

2015-04-08 1:41 GMT+08:00 Ankur Dave :

> We thought it would be better to simplify the interface, since the
> active set is a performance optimization but the result is identical
> to calling subgraph before aggregateMessages.
>
> The active set option is still there in the package-private method
> aggregateMessagesWithActiveSet. You can actually access it publicly
> via GraphImpl, though the API isn't guaranteed to be stable:
> graph.asInstanceOf[GraphImpl[VD,ED]].aggregateMessagesWithActiveSet(...)
> Ankur
>
>
> On Tue, Apr 7, 2015 at 2:56 AM, James  wrote:
> > Hello,
> >
> > The old api of GraphX "mapReduceTriplets" has an optional parameter
> > "activeSetOpt: Option[(VertexRDD[_]" that limit the input of sendMessage.
> >
> > However, to the new api "aggregateMessages" I could not find this option,
> > why it does not offer any more?
> >
> > Alcaid
>


Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Though the warnings can be ignored, they add up in the log files while
compiling other projects too. And there are a lot of those warnings. Any
workaround? How do we modify the pom.xml file to exclude these unnecessary
dependencies?

On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:

> Generally, you can ignore these things. They mean some artifacts
> packaged other artifacts, and so two copies show up when all the JAR
> contents are merged.
>
> But here you do show a small dependency convergence problem; beanutils
> 1.7 is present but beanutills-core 1.8 is too even though these should
> be harmonized. I imagine one could be excluded; I imagine we could
> harmonize the version manually. In practice, I also imagine it doesn't
> cause any problem but feel free to propose a fix along those lines.
>
> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
>  wrote:
> > Hi,
> >
> > During compilation I get a lot of these:
> >
> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
> >  23 overlappping classes:
> >
> > [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
> > define
> >  82 overlappping classes:
> >
> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
> > commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
> >
> >
> > And a lot of others. How do I fix these?
>


Re: make two rdd co-partitioned in python

2015-04-09 Thread Davies Liu
In Spark 1.3+, PySpark also support this kind of narrow dependencies,
for example,

N = 10
a1 = a.partitionBy(N)
b1 = b.partitionBy(N)

then a1.union(b1) will only have N partitions.

So, a1.join(b1) do not need shuffle anymore.

On Thu, Apr 9, 2015 at 11:57 AM, pop  wrote:
> In scala, we can make two Rdd using the same partitioner so that they are
> co-partitioned
>val partitioner = new HashPartitioner(5)
>val a1 = a.partitionBy(partitioner).cache()
>val b1 = b.partiitonBy(partitioner).cache()
>
> How can we achieve the same in python? It would be great if somebody can
> share some examples.
>
>
> Thanks,
> Xiang
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/make-two-rdd-co-partitioned-in-python-tp22445.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: override log4j.properties

2015-04-09 Thread Emre Sevinc
One method: By putting your custom log4j.properties file in your /resources
directory.

As an example, please see: http://stackoverflow.com/a/2736/236007
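
Another way to do it per job, without rebuilding the jar, is to ship the file at
submit time. This is a sketch using the standard spark-submit options; the path,
class name and jar name are placeholders, not taken from this thread:

  spark-submit \
    --files /path/to/custom-log4j.properties \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/custom-log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
    --class com.example.MyJob my-job.jar

The --files option copies the properties file into each executor's working
directory, so the executor-side option can refer to it by its bare file name,
while the driver (in client mode) reads it from the local path.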

Kind regards,

Emre Sevinç
http://www.bigindustries.be/



On Thu, Apr 9, 2015 at 2:17 PM, patcharee  wrote:

> Hello,
>
> How to override log4j.properties for a specific spark job?
>
> BR,
> Patcharee
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Emre Sevinc


Re: Add row IDs column to data frame

2015-04-09 Thread Bojan Kostic
Hi,

I just checked and I can see that there is a method called withColumn:

  def withColumn(colName: String, col: Column): DataFrame

It returns a new DataFrame by adding a column.

I can't test it now... but I think it should work.

As I see it, the whole idea of data frames is to make them like data frames in
R, and in R you can do that easily.

It was late last night and I was tired, but my idea was that you can iterate
over the first set and add an index to every row using accumulators, then
iterate over the other set and add an index from another accumulator, then
create tuples keyed by those indexes and join. It is ugly and not efficient,
and you should avoid it. :]
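
For illustration, a sketch of a simpler route on the RDD side using zipWithIndex
(this assumes oldRDD is an RDD[Array[String]], as in the quoted snippet below; it
is not code from this thread):

  // zipWithIndex assigns a stable Long index per element (it runs one extra job
  // to count the elements of each partition when there is more than one partition).
  val withIds = oldRDD.zipWithIndex().map { case (row, id) => id.toString +: row }
  withIds.take(3).foreach(r => println(r.mkString(",")))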

Best

Bojan

On Thu, Apr 9, 2015 at 1:35 AM, barmaley [via Apache Spark User List] <
ml-node+s1001560n22430...@n3.nabble.com> wrote:

> Hi Bojan,
>
> Could you please expand your idea on how to append to RDD? I can think of
> how to append a constant value to each row on RDD:
>
> //oldRDD - RDD[Array[String]]
> val c = "const"
> val newRDD = oldRDD.map(r=>c+:r)
>
> But how to append a custom column to RDD? Something like:
>
> val colToAppend = sc.makeRDD(1 to oldRDD.count().toInt)
> //or sc.parallelize(1 to oldRDD.count().toInt)
> //or (1 to 1 to oldRDD.count().toInt).toArray
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385p22430.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD-tp22385p22432.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark job progress-style report on console ?

2015-04-09 Thread roy
Hi,

  How do i get spark job progress-style report on console ?

I tried to set --conf spark.ui.showConsoleProgress=true but it 

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-job-progress-style-report-on-console-tp22440.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive?  If you're getting the
same error within Hive, it sounds like a permissions issue as per Bojan.
More info can be found at:
http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error


On Thu, Apr 9, 2015 at 7:31 AM Bojan Kostic  wrote:

> I think it uses local dir, hdfs dir path starts with hdfs://
>
> Check permissions on folders, and also check logs. There should be more
> info
> about exception.
>
> Best
> Bojan
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-
> tp22435p22439.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Tathagata Das
Responses inline. Hope they help.

On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani  wrote:

>  Hi Friends,
>
>  I am trying to solve a use case in spark streaming, I need help on
> getting to right approach on lookup / update the master data.
>
>  Use case ( simplified )
>  I’ve a dataset of entity with three attributes and identifier/row key in
> a persistent store.
>
>  Each attribute along with row key come from a different stream let’s
> say, effectively 3 source streams.
>
>  Now whenever any attribute comes up, I want to update/sync the
> persistent store and do some processing, but the processing would require
> the latest state of entity with latest values of three attributes.
>
>  I wish if I have the all the entities cached in some sort of centralized
> cache ( like we have data in hdfs ) within spark streaming which may be
> used for data local processing. But I assume there is no such thing.
>
>  potential approaches I m thinking of, I suspect first two are not
> feasible, but I want to confirm,
>1.  Is Broadcast Variables mutable ? If yes, can I use it as cache
> for all entities sizing  around 100s of GBs provided i have a cluster with
> enough RAM.
>

Broadcast variables are not mutable. But you can always create a new
broadcast variable when you want and use the "latest" broadcast variable in
your computation.

dstream.transform { rdd =>
   val latestBroadcast = getLatestBroadcastVariable()  // fetch existing or update+create new and return
   val transformedRDD = rdd. ...  // use latestBroadcast in RDD transformations
   transformedRDD
}

Since the transform RDD-to-RDD function runs on the driver every batch
interval, it will always use the latest broadcast variable that you want.
Though note that whenever you create a new broadcast, the next batch may
take a little longer, as the data needs to be actually broadcast out.
That can also be made asynchronous by running a simple task (to force the
broadcasting out) on any new broadcast variable in a different thread from the
Spark Streaming batch schedule, but using the same underlying SparkContext.



>
>1. Is there any kind of sticky partition possible, so that I route my
>stream data to go through the same node where I've the corresponding
>entities, subset of entire store, cached in memory within JVM / off heap on
>the node, this would avoid lookups from store.
>
> You could use updateStateByKey. That is quite sticky, but does not
eliminate the possibility that it can run on a different node. In fact this
is necessary for fault-tolerance - what if the node it was supposed to run
goes down? The task will be run on a different node, and you have to
 design your application such that it can handle that.


>1. If I stream the entities from persistent store into engine, this
>becomes 4th stream - the entity stream, how do i use join / merge to enable
>stream 1,2,3 to lookup and update the data from stream 4. Would
>DStream.join work for few seconds worth of data in attribute streams with
>all data in entity stream ? Or do I use transform and within that use rdd
>join, I’ve doubts if I am leaning towards core spark approach in spark
>streaming ?
>
>
Depends on what kind of join! If you want to join every batch of the stream
with a static data set (or a rarely updated dataset), then transform+join is
the way to go. If you want to join one stream with a window of data from
another stream, then DStream.join is the way to go.

>
>1.
>
>
>1. The last approach, which i think will surely work but i want to
>avoid, is i keep the entities in IMDB and do lookup/update calls on from
>stream 1,2 and 3.
>
>
>   Any help is deeply appreciated as this would help me design my system
> efficiently and the solution approach may become a beacon for lookup master
> data sort of problems.
>
>  Regards,
>  Amit
>


Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
Hi Mohammed,

Sorry, I guess I was not really clear in my response.  Yes, sbt fails; the
-DskipTests is for mvn, as I showed in the example of how I built it.

I do not believe that -DskipTests has any impact in sbt, but I could be
wrong.  sbt package should skip tests.  I did not try to track down where
the dependency was coming from.  Based on Patrick's comments it sounds like
this is now resolved.

Sorry for the confusion.

-Todd

On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist  wrote:

> Hi Mohammed,
>
> I think you just need to add -DskipTests to you build.  Here is how I
> built it:
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> -DskipTests clean package install
>
> build/sbt does however fail even if only doing package which should skip
> tests.
>
> I am able to build the "MyThriftServer" above now.
>
> Thanks Michael for the assistance.
>
> -Todd
>
> On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller 
> wrote:
>
>>  Michael,
>>
>> Thank you!
>>
>>
>>
>> Looks like the sbt build is broken for 1.3. I downloaded the source code
>> for 1.3, but I get the following error a few minutes after I run “sbt/sbt
>> publishLocal”
>>
>>
>>
>> [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
>> dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
>> not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
>> was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
>>
>> [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>> *Sent:* Wednesday, April 8, 2015 11:54 AM
>> *To:* Mohammed Guller
>> *Cc:* Todd Nist; James Aley; user; Patrick Wendell
>>
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>>
>>
>> Sorry guys.  I didn't realize that
>> https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
>>
>>
>>
>> You can publish locally in the mean time (sbt/sbt publishLocal).
>>
>>
>>
>> On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller 
>> wrote:
>>
>> +1
>>
>>
>>
>> Interestingly, I ran into the exactly the same issue yesterday.  I
>> couldn’t find any documentation about which project to include as a
>> dependency in build.sbt to use HiveThriftServer2. Would appreciate help.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Todd Nist [mailto:tsind...@gmail.com]
>> *Sent:* Wednesday, April 8, 2015 5:49 AM
>> *To:* James Aley
>> *Cc:* Michael Armbrust; user
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>>
>>
>> To use the HiveThriftServer2.startWithContext, I thought one would use
>> the  following artifact in the build:
>>
>>
>>
>> "org.apache.spark"%% "spark-hive-thriftserver"   % "1.3.0"
>>
>>
>>
>> But I am unable to resolve the artifact.  I do not see it in maven
>> central or any other repo.  Do I need to build Spark and publish locally or
>> just missing something obvious here?
>>
>>
>>
>> Basic class is like this:
>>
>>
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>>
>>
>>
>> import  org.apache.spark.sql.hive.HiveContext
>>
>> import org.apache.spark.sql.hive.HiveMetastoreTypes._
>>
>> import org.apache.spark.sql.types._
>>
>> import  org.apache.spark.sql.hive.thriftserver._
>>
>>
>>
>> object MyThriftServer {
>>
>>
>>
>>   val sparkConf = new SparkConf()
>>
>> // master is passed to spark-submit, but could also be specified 
>> explicitely
>>
>> // .setMaster(sparkMaster)
>>
>> .setAppName("My ThriftServer")
>>
>> .set("spark.cores.max", "2")
>>
>>   val sc = new SparkContext(sparkConf)
>>
>>   val  sparkContext  =  sc
>>
>>   import  sparkContext._
>>
>>   val  sqlContext  =  new  HiveContext(sparkContext)
>>
>>   import  sqlContext._
>>
>>   import sqlContext.implicits._
>>
>>
>>
>> // register temp tables here   HiveThriftServer2.startWithContext(sqlContext)
>>
>> }
>>
>>  Build has the following:
>>
>>
>>
>> scalaVersion := "2.10.4"
>>
>>
>>
>> val SPARK_VERSION = "1.3.0"
>>
>>
>>
>>
>>
>> libraryDependencies ++= Seq(
>>
>> "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
>>
>>   exclude("org.apache.spark", "spark-core_2.10")
>>
>>   exclude("org.apache.spark", "spark-streaming_2.10")
>>
>>   exclude("org.apache.spark", "spark-sql_2.10")
>>
>>   exclude("javax.jms", "jms"),
>>
>> "org.apache.spark" %% "spark-core"  % SPARK_VERSION %  "provided",
>>
>> "org.apache.spark" %% "spark-streaming" % SPARK_VERSION %  "provided",
>>
>> "org.apache.spark"  %% "spark-sql"  % SPARK_VERSION % "provided",
>>
>> "org.apache.spark"  %% "spark-hive" % SPARK_VERSION % "provided",
>>
>> "org.apache.spark" %% "spark-hive-thriftserver"  % SPARK_VERSION   %
>> "provided",
>>
>> "org.apache.kafka" %% "kafka" % "0.8.1.1"
>>
>>   exclude("javax.jms", "jms")
>>
>>   exclude("com.sun.jdmk", "jmxtools")
>>
>>   exclude("com.sun.jmx", "jmxri"),
>>
>> "joda-time" % "joda-time" % "2.7",
>>
>> "log

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak,

I'm going to shamelessly plug my blog post on tuning Spark:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

It talks about tuning executor size as well as how the number of tasks for
a stage is calculated.
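
On the second question (why the first stage uses only a handful of tasks): as an
assumption about HDFS-backed input rather than something from this thread, the
task count of that stage defaults to the number of input splits, not the number
of executors, and it can be raised with the minPartitions argument. A sketch with
a hypothetical path:

  // A 2 GB file with e.g. 512 MB HDFS blocks yields only ~4-5 splits, hence a handful
  // of tasks in the first stage regardless of how many executors were requested.
  val input = sc.textFile("hdfs:///path/to/input", 100)  // ask for at least 100 partitions
  println(input.partitions.length)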

-Sandy

On Thu, Apr 9, 2015 at 9:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I have a spark job that has multiple stages. For now i star it with 100
> executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN
> 2.4.x.
>
> For now i start the Spark Job with a very limited input (1 file of size
> 2G), overall there are 200 files. My first run is yet to complete as its
> taking too much of time / throwing OOM exceptions / buffer exceptions (keep
> that aside).
>
> How will i know how much resources are required to run this job. (# of
> cores, executors, mem, serialization buffers, and i do not yet what else).
>
> IN M/R world, all i do is set split size and rest is taken care
> automatically (yes i need to worry about mem, in case of OOM).
>
>
> 1) Can someone explain how they do resource estimation before running the
> job or is there no way and one needs to only try it out ?
> 2) Even if i give 100 executors, the first stage takes only 5, how did
> spark decide this ?
>
> Please point me to any resources that also talks about similar things or
> please explain here.
>
> --
> Deepak
>
>


Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Pressed send early.

I had tried that with these settings

 buffersize=128 maxbuffersize=1024

val conf = new SparkConf()
  .setAppName(detail)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
  .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))


On Thu, Apr 9, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Yes i had tried that.
>
> Now i see this
>
> 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
> VISummaryDataProvider.scala:38, took 275.334991 s
> 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw
> exception: Job aborted due to stage failure: Total size of serialized
> results of 4 tasks (1067.3 MB) is bigger than spark.driver.maxResultSize
> (1024.0 MB)
> org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results of 4 tasks (1067.3 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/04/09 07:58:08 INFO storage.BlockManagerInfo: Removed taskresult_4 on
> phxaishdc9dn0579.phx.ebay.com:42771 in memory (size: 273.5 MB, free: 6.2
> GB)
> 15/04/09 07:58:08 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User
>
> On Thu, Apr 9, 2015 at 8:18 PM, Ted Yu  wrote:
>
>> Please take a look at
>> https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
>> , starting line 27.
>>
>> In Spark, you can control the maxBufferSize
>> with "spark.kryoserializer.buffer.max.mb"
>>
>> Cheers
>>
>
>
>
> --
> Deepak
>
>


-- 
Deepak


Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
My Spark (1.3.0) job is failing with

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available:
0, required: 1
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeByte(Output.java:194)
at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:599)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:566)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


-- 
Deepak






This is how I am creating the SparkContext (only once):

val conf = new SparkConf()
  .setAppName(detail)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
val sc = new SparkContext(conf)


Command:

./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar
--jars
/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/spark_reporting_dep_only-1.0-SNAPSHOT.jar
 --num-executors 100 --driver-memory 12g --driver-java-options
"-XX:MaxPermSize=6G" --executor-memory 12g --executor-cores 1 --queue
hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
/home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-04-6 endDate=2015-04-7
input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem
output=/user/dvasthimal/epdatasets/viewItem buffersize=200
Spark assembly has been built with Hive, including Datanucleus jars on
classpath

buffer size is 200.

1. What is this buffer ?
2. What should be the value of this buffer ?
3. My Spark Job has many stages, does the above value need to be different
for each stage ?


Please clarify

Regards.
Deepak


Re: SQL can't not create Hive database

2015-04-09 Thread Bojan Kostic
I think it uses local dir, hdfs dir path starts with hdfs://

Check permissions on folders, and also check logs. There should be more info
about exception.

Best
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-tp22435p22439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



save as text file throwing null pointer error.

2015-04-09 Thread Somnath Pandeya

JavaRDD<String> lineswithoutStopWords = nonEmptylines
    .map(new Function<String, String>() {

        private static final long serialVersionUID = 1L;

        @Override
        public String call(String line) throws Exception {
            return removeStopWords(line, stopwords);
        }

    });

lineswithoutStopWords.saveAsTextFile("output/testop.txt");



Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
15/04/09 18:44:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
localhost): java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

15/04/09 18:44:36 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
aborting job
15/04/09 18:44:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have 
all completed, from pool
15/04/09 18:44:36 INFO TaskSchedulerImpl: Cancelling stage 1
15/04/09 18:44:36 INFO DAGScheduler: Job 1 failed: saveAsTextFile at 
TextPreProcessing.java:49, took 0.172959 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
   at java.lang.ProcessBuil

Which Hive version should be used for Spark 1.3

2015-04-09 Thread Arthur Chan
Hi,

I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark
1.3.x

Could anyone advise which Hive version should be used to match Spark 1.3.x?
Can I use Hive 1.1.0 for Spark 1.3? or can I use Hive 0.14 for Spark 1.3?

Regards
Arthur


Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading.

>> (1) The mappers are not executed in parallel when processing
independently the same RDD.

To clarify, I'd say: In one stage of execution, when pipelining occurs,
mappers are not executed in parallel when processing independently the same
RDD partition.
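
To make that concrete, a small sketch (assuming data is an RDD of (x, y, z)
triples, as in the conclusion quoted below):

  val d1 = data.map { case (x, y, z) => (x, y) }.cache()
  val d2 = d1.map { case (x, y) => (y, x) }
  d1.count()                 // materializes d1 and stores its partitions in memory
  d2.count()                 // reuses the cached d1 partitions instead of re-running the first map
  println(d2.toDebugString)  // the lineage now shows d1's partitions as cached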

On Thu, Apr 9, 2015 at 11:19 AM, spark_user_2015  wrote:

> That was helpful!
>
> The conclusion:
> (1) The mappers are not executed in parallel when processing independently
> the same RDD.
> (2) The best way seems to be (if enough memory is available and an action
> is
> applied to d1 and d2 later on)
>val d1 = data.map((x,y,z) => (x,y)).cache
>val d2 = d1.map((x,y) => (y,x))
>  -  This avoids pipelining the "d1" mapper and "d2" mapper when
> computing d2
>
> This is important to write efficient code, toDebugString helps a lot.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22444.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Hi-

Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988

Any help in getting this working would be much appreciated!

Thanks
Alex

On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma 
wrote:

> You are right this needs to be done. I can work on it soon, I was not sure
> if there is any one even using scala 2.11 spark repl. Actually there is a
> patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), which
> has to be ported for scala 2.11 too. If however, you(or anyone else) are
> planning to work, I can help you ?
>
> Prashant Sharma
>
>
>
> On Thu, Apr 9, 2015 at 3:08 PM, anakos  wrote:
>
>> Hi-
>>
>> I am having difficulty getting the 1.3.0 Spark shell to find an external
>> jar.  I have build Spark locally for Scala 2.11 and I am starting the REPL
>> as follows:
>>
>> bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar
>>
>> I see the following line in the console output:
>>
>> 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR
>>
>> file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
>> at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar
>> with
>> timestamp 1428569535904
>>
>> but when i try to import anything from this jar, it's simply not
>> available.
>> When I try to add the jar manually using the command
>>
>> :cp /path/to/jar
>>
>> the classes in the jar are still unavailable. I understand that 2.11 is
>> not
>> officially supported, but has anyone been able to get an external jar
>> loaded
>> in the 1.3.0 release?  Is this a known issue? I have tried searching
>> around
>> for answers but the only thing I've found that may be related is this:
>>
>> https://issues.apache.org/jira/browse/SPARK-3257
>>
>> Any/all help is much appreciated.
>> Thanks
>> Alex
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
I have a Spark job that has multiple stages. For now I start it with 100
executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN
2.4.x.

For now I start the Spark job with a very limited input (1 file of size
2G); overall there are 200 files. My first run is yet to complete, as it's
taking too much time / throwing OOM exceptions / buffer exceptions (keep
that aside).

How will I know how many resources are required to run this job (# of
cores, executors, mem, serialization buffers, and I do not yet know what else)?

In the M/R world, all I do is set the split size and the rest is taken care of
automatically (yes, I need to worry about mem in case of OOM).


1) Can someone explain how they do resource estimation before running the
job, or is there no way and one needs to just try it out?
2) Even if I give 100 executors, the first stage takes only 5 - how did
Spark decide this?

Please point me to any resources that also talk about similar things, or
please explain here.

-- 
Deepak


Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Thanks a lot, TD, for the detailed answers. The answers lead to a few more questions:


  1.  "the transform RDD-to-RDD function runs on the driver" - I didn't
understand this. Does it mean that when I use the transform function on a DStream,
it is not parallelized? Surely I'm missing something here.
  2.  updateStateByKey I think won't work in this use case: I have three
separate attribute streams (with different frequencies) that make up the combined
state (i.e. the entity) at a point in time, on which I want to do some processing.
Do you think otherwise?
  3.  transform+join seems the only option so far, but any guesstimate of how this
would perform / react on the cluster? Assume master data in 100s of GBs, and the
join is based on some row key. We are talking about a slice of stream data being
joined with 100s of GBs of master data continuously. Is it something that can be
done but should not be done?

Regards,
Amit

From: Tathagata Das mailto:t...@databricks.com>>
Date: Thursday, April 9, 2015 at 3:13 PM
To: amit assudani mailto:aassud...@impetus.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: Lookup / Access of master data in spark streaming

Responses inline. Hope they help.

On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani 
mailto:aassud...@impetus.com>> wrote:
Hi Friends,

I am trying to solve a use case in spark streaming, I need help on getting to 
right approach on lookup / update the master data.

Use case ( simplified )
I’ve a dataset of entity with three attributes and identifier/row key in a 
persistent store.

Each attribute along with row key come from a different stream let’s say, 
effectively 3 source streams.

Now whenever any attribute comes up, I want to update/sync the persistent store 
and do some processing, but the processing would require the latest state of 
entity with latest values of three attributes.

I wish if I have the all the entities cached in some sort of centralized cache 
( like we have data in hdfs ) within spark streaming which may be used for data 
local processing. But I assume there is no such thing.

potential approaches I m thinking of, I suspect first two are not feasible, but 
I want to confirm,
  1.  Is Broadcast Variables mutable ? If yes, can I use it as cache for 
all entities sizing  around 100s of GBs provided i have a cluster with enough 
RAM.

Broadcast variables are not mutable. But you can always create a new broadcast 
variable when you want and use the "latest" broadcast variable in your 
computation.

dstream.transform { rdd =>

   val latestBroacast = getLatestBroadcastVariable()  // fetch existing or 
update+create new and return
   val transformedRDD = rdd. ..  // use  latestBroacast in RDD 
tranformations
   transformedRDD
}

Since the transform RDD-to-RDD function runs on the driver every batch 
interval, it will always use the latest broadcast variable that you want. 
Though note that whenever you create a new broadcast, the next batch may take a 
little longer to as the data needs to be actually broadcasted out. That can 
also be made asynchronous by running a simple task (to force the broadcasting 
out) on any new broadcast variable in a different thread as Spark Streaming 
batch schedule, but using the same underlying Spark Context.



  1.  Is there any kind of sticky partition possible, so that I route my stream 
data to go through the same node where I've the corresponding entities, subset 
of entire store, cached in memory within JVM / off heap on the node, this would 
avoid lookups from store.

You could use updateStateByKey. That is quite sticky, but does not eliminate 
the possibility that it can run on a different node. In fact this is necessary 
for fault-tolerance - what if the node it was supposed to run goes down? The 
task will be run on a different node, and you have to  design your application 
such that it can handle that.


  1.  If I stream the entities from persistent store into engine, this becomes 
4th stream - the entity stream, how do i use join / merge to enable stream 
1,2,3 to lookup and update the data from stream 4. Would DStream.join work for 
few seconds worth of data in attribute streams with all data in entity stream ? 
Or do I use transform and within that use rdd join, I’ve doubts if I am leaning 
towards core spark approach in spark streaming ?

Depends on what kind of join! If you want the join every batch in stream with a 
static data set (or rarely updated dataset), the transform+join is the way to 
go. If you want to join one stream with a window of data from another stream, 
then DStream.join is the way to go.

  1.

  1.  The last approach, which i think will surely work but i want to avoid, is 
i keep the entities in IMDB and do lookup/update calls on from stream 1,2 and 3.

Any help is deeply appreciated as this would help me design my system 
efficiently and the solution approach may become a beacon for lookup master 
data sort of problems.

Regards,
Amit


Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread ๏̯͡๏
Most likely you have an existing Hive installation with data in it. In this
case I was not able to get Spark 1.3 to communicate with the existing Hive
metastore. Hence when I read any table created in Hive, Spark SQL used to
complain "Data table not found".

If you get it working, please share the steps.

On Thu, Apr 9, 2015 at 9:25 PM, Arthur Chan 
wrote:

> Hi,
>
> I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark
> 1.3.x
>
> Could anyone advise which Hive version should be used to match Spark
> 1.3.x?
> Can I use Hive 1.1.0 for Spark 1.3? or can I use Hive 0.14 for Spark 1.3?
>
> Regards
> Arthur
>



-- 
Deepak


Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
Are you running # of receivers = # machines?

TD

On Thu, Apr 9, 2015 at 9:56 AM, Saiph Kappa  wrote:

> Sorry, I was getting those errors because my workload was not sustainable.
>
> However, I noticed that, by just running the spark-streaming-benchmark (
> https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala
> ), I get no difference on the execution time, number of processed records,
> and delay whether I'm using 1 machine or 2 machines with the setup
> described before (using spark standalone). Is it normal?
>
>
>
> On Fri, Mar 27, 2015 at 5:32 PM, Tathagata Das 
> wrote:
>
>> If it is deterministically reproducible, could you generate full DEBUG
>> level logs, from the driver and the workers and give it to me? Basically I
>> want to trace through what is happening to the block that is not being
>> found.
>> And can you tell what Cluster manager are you using? Spark Standalone,
>> Mesos or YARN?
>>
>>
>> On Fri, Mar 27, 2015 at 10:09 AM, Saiph Kappa 
>> wrote:
>>
>>> Hi,
>>>
>>> I am just running this simple example with
>>> machineA: 1 master + 1 worker
>>> machineB: 1 worker
>>> «
>>> val ssc = new StreamingContext(sparkConf, Duration(1000))
>>>
>>> val rawStreams = (1 to numStreams).map(_
>>> =>ssc.rawSocketStream[String](host, port,
>>> StorageLevel.MEMORY_ONLY_SER)).toArray
>>> val union = ssc.union(rawStreams)
>>>
>>> union.filter(line => Random.nextInt(1) == 0).map(line => {
>>>   var sum = BigInt(0)
>>>   line.toCharArray.foreach(chr => sum += chr.toInt)
>>>   fib2(sum)
>>>   sum
>>> }).reduceByWindow(_+_, Seconds(1),Seconds(1)).map(s => s"### result:
>>> $s").print()
>>> »
>>>
>>> And I'm getting the following exceptions:
>>>
>>> Log from machineB
>>> «
>>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 132
>>> 15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
>>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 134
>>> 15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
>>> 15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast
>>> variable 24
>>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 136
>>> 15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
>>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 138
>>> 15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
>>> 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 140
>>> 15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
>>> 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
>>> curMem=47117, maxMem=280248975
>>> 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
>>> bytes in memory (estimated size 1886.0 B, free 267.2 MB)
>>> 15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
>>> broadcast_24_piece0
>>> 15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24
>>> took 19 ms
>>> 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
>>> curMem=49003, maxMem=280248975
>>> 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values
>>> in memory (estimated size 3.0 KB, free 267.2 MB)
>>> 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0
>>> (TID 140)
>>> java.lang.Exception: Could not compute split, block
>>> input-0-1427473262420 not found
>>> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:701)
>>> 15

RE: SQL can't not create Hive database

2015-04-09 Thread java8964
Can you try a local-file URI format, something like this:

hiveContext.hql("SET hive.metastore.warehouse.dir=file:///home/spark/hive/warehouse")
Yong
> Date: Thu, 9 Apr 2015 04:59:00 -0700
> From: inv...@gmail.com
> To: user@spark.apache.org
> Subject: SQL can't not create Hive database
> 
> Hi,
> 
> I am working on the local mode.
> 
> The following code 
> 
> hiveContext.setConf("hive.metastore.warehouse.dir",
> /home/spark/hive/warehouse)
> hiveContext.sql("create database if not exists db1")
> 
>  throws
> 
> 15/04/09 13:53:16 ERROR RetryingHMSHandler: MetaException(message:Unable to
> create database path file:/user/hive/warehouse/db1.db, failed to create
> database db1)
> 
> It seems that it uses hdfs path, not the local one specified in hiveContext.
> 
> Any idea ? Thank you.
> 
> Hao
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-tp22435.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
I agree, but as I say, most are out of the control of Spark. They
aren't because of unnecessary dependencies.

On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh
 wrote:
> Though the warnings can be ignored, they add up in the log files while
> compiling other projects too. And there are a lot of those warnings. Any
> workaround? How do we modify the pom.xml file to exclude these unnecessary
> dependencies?
>
> On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:
>>
>> Generally, you can ignore these things. They mean some artifacts
>> packaged other artifacts, and so two copies show up when all the JAR
>> contents are merged.
>>
>> But here you do show a small dependency convergence problem; beanutils
>> 1.7 is present but beanutills-core 1.8 is too even though these should
>> be harmonized. I imagine one could be excluded; I imagine we could
>> harmonize the version manually. In practice, I also imagine it doesn't
>> cause any problem but feel free to propose a fix along those lines.
>>
>> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
>>  wrote:
>> > Hi,
>> >
>> > During compilation I get a lot of these:
>> >
>> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
>> >  23 overlapping classes:
>> >
>> > [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
>> > define
>> >  82 overlapping classes:
>> >
>> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
>> > commons-beanutils-core-1.8.0.jar define 10 overlapping classes:
>> >
>> >
>> > And a lot of others. How do I fix these?
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
I found this jira  when
googling for fixes. Wonder if it can fix anything here.
But anyways, thanks for the help :)

On Fri, Apr 10, 2015 at 2:46 AM, Sean Owen  wrote:

> I agree, but as I say, most are out of the control of Spark. They
> aren't because of unnecessary dependencies.
>
> On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh
>  wrote:
> > Though the warnings can be ignored, they add up in the log files while
> > compiling other projects too. And there are a lot of those warnings. Any
> > workaround? How do we modify the pom.xml file to exclude these
> unnecessary
> > dependencies?
> >
> > On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:
> >>
> >> Generally, you can ignore these things. They mean some artifacts
> >> packaged other artifacts, and so two copies show up when all the JAR
> >> contents are merged.
> >>
> >> But here you do show a small dependency convergence problem; beanutils
> >> 1.7 is present but beanutils-core 1.8 is too even though these should
> >> be harmonized. I imagine one could be excluded; I imagine we could
> >> harmonize the version manually. In practice, I also imagine it doesn't
> >> cause any problem but feel free to propose a fix along those lines.
> >>
> >> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
> >>  wrote:
> >> > Hi,
> >> >
> >> > During compilation I get a lot of these:
> >> >
> >> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
> >> >  23 overlapping classes:
> >> >
> >> > [WARNING] commons-beanutils-1.7.0.jar,
> commons-beanutils-core-1.8.0.jar
> >> > define
> >> >  82 overlapping classes:
> >> >
> >> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
> >> > commons-beanutils-core-1.8.0.jar define 10 overlapping classes:
> >> >
> >> >
> >> > And a lot of others. How do I fix these?
> >
> >
>


Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Please take a look at
https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
, starting line 27.

In Spark, you can control the maxBufferSize
with "spark.kryoserializer.buffer.max.mb"

Cheers


Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Yeah, just ran with 2g for that setting and max.mb with 1068

I am trying to do a map-side join by using a broadcast variable. This first
collects all the data (key, value pairs) and then broadcasts it. It's causing
an error while running this stage.
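
For reference, a minimal sketch of setting both limits up front (the values are illustrative only and depend on the job; the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the Kryo buffer ceiling and the driver result-size limit together.
val conf = new SparkConf()
  .setAppName("KryoLimitsSketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max.mb", "1024")  // maximum Kryo serialization buffer, in MB
  .set("spark.driver.maxResultSize", "2g")            // total serialized result size allowed at the driver

val sc = new SparkContext(conf)
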

On Thu, Apr 9, 2015 at 9:29 PM, Ted Yu  wrote:

> Typo in previous email, pardon me.
>
> Set "spark.driver.maxResultSize" to 1068 or higher.
>
> On Thu, Apr 9, 2015 at 8:57 AM, Ted Yu  wrote:
>
>> Please set "spark.kryoserializer.buffer.max.mb" to 1068 (or higher).
>>
>> Cheers
>>
>> On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> Pressed send early.
>>>
>>> I had tried that with these settings
>>>
>>>  buffersize=128 maxbuffersize=1024
>>>
>>> val conf = new SparkConf()
>>>
>>> .setAppName(detail)
>>>
>>> .set("spark.serializer",
>>> "org.apache.spark.serializer.KryoSerializer")
>>>
>>>
>>> .set("spark.kryoserializer.buffer.mb",arguments.get("buffersize").get)
>>>
>>>
>>> .set("spark.kryoserializer.buffer.max.mb",arguments.get("maxbuffersize").get)
>>>
>>>
>>> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>>>
>>>
>>> On Thu, Apr 9, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>>> wrote:
>>>
 Yes i had tried that.

 Now i see this

 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
 VISummaryDataProvider.scala:38, took 275.334991 s
 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Total size of serialized
 results of 4 tasks (1067.3 MB) is bigger than spark.driver.maxResultSize
 (1024.0 MB)
 org.apache.spark.SparkException: Job aborted due to stage failure:
 Total size of serialized results of 4 tasks (1067.3 MB) is bigger than
 spark.driver.maxResultSize (1024.0 MB)
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/09 07:58:08 INFO storage.BlockManagerInfo: Removed taskresult_4
 on phxaishdc9dn0579.phx.ebay.com:42771 in memory (size: 273.5 MB,
 free: 6.2 GB)
 15/04/09 07:58:08 INFO yarn.ApplicationMaster: Final app status:
 FAILED, exitCode: 15, (reason: User

 On Thu, Apr 9, 2015 at 8:18 PM, Ted Yu  wrote:

> Please take a look at
> https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
> , starting line 27.
>
> In Spark, you can control the maxBufferSize
> with "spark.kryoserializer.buffer.max.mb"
>
> Cheers
>



 --
 Deepak


>>>
>>>
>>> --
>>> Deepak
>>>
>>>
>>
>


-- 
Deepak


Spark Streaming scenarios

2015-04-09 Thread Vinay Kesarwani
Hi,

I have the following scenarios and need some help ASAP.

1. Ad hoc queries on Spark Streaming.
   How can I run Spark queries on an ongoing streaming context?
   Scenario: a streaming job finds the min and max values over the last
5 minutes (which I am able to do). Now I want to run an interactive query
to find the min and max over the last 30 minutes on this stream.
   What I was thinking is to store the streaming RDDs as a SchemaRDD and
query that (see the sketch at the end of this message). Is there any better
approach? Where should I store the SchemaRDD for near-real-time performance?
2. Saving and loading intermediate RDDs to cache/disk.
   What is the best approach to do this? In case any worker fails, will a
new worker resume the task and load these saved RDDs?
3. Write-ahead log and checkpointing.
   What is the significance of the WAL and of checkpoints? In case of a
checkpoint, if any worker fails, will another worker load the checkpoint
data and resume its job? In which scenarios should I use the WAL and
checkpoints?
4. Spawning multiple processes within Spark Streaming.
   Doing multiple operations on the same stream.
5. Accessing cached data between Spark components.
   Is data cached in Spark Streaming accessible to Spark SQL? Can it be
shared between these components, or between two SparkContexts?
   If yes, how? If not, is there an alternative approach?
6. Dynamic lookup data in Spark Streaming.
   I have a scenario where I want to do some filtering on a stream using
dynamic lookup data. How can I achieve this?
   In case I get this lookup data as another stream and cache it, will it
be possible to update/merge this data in the cache 24/7?
   What is the best approach to do this? I referred to the Twitter streaming
example in Spark, where it reads a spam file, but that file is not dynamic
in nature.
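
Here is a minimal sketch of the windowed min/max idea from item 1 (Spark Streaming in Scala; the socket source, host/port, batch interval, and window sizes are just illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

// Hypothetical source: a socket stream with one numeric value per line.
val conf = new SparkConf().setAppName("WindowedMinMax")
val ssc = new StreamingContext(conf, Seconds(10))
val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

// Min and max over a sliding 30-minute window, recomputed every 5 minutes.
val minMax = values
  .map(v => (v, v))
  .reduceByWindow(
    (a, b) => (math.min(a._1, b._1), math.max(a._2, b._2)),
    Minutes(30), Minutes(5))

minMax.print()
ssc.start()
ssc.awaitTermination()
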


Re: Join on Spark too slow.

2015-04-09 Thread ๏̯͡๏
If your data has special characteristics, like one side being small and the
other large, then you can think of doing a map-side join in Spark using
broadcast values; this will speed things up.

Otherwise, as Pitel mentioned, if there is nothing special and it's just a
cartesian product, it might take forever, or you might increase the number
of executors.
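
For reference, a minimal sketch of that broadcast (map-side) join idea, assuming the small side fits comfortably in memory; all names and paths below are just illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[*]"))

// Hypothetical inputs: a large pair RDD and a small one that fits in memory.
val big = sc.textFile("big.txt").map(word => (word, 1))
val small = sc.textFile("small.txt").map(word => (word, 1)).collect().toMap

// Broadcast the small side once; the join then happens on the map side, with no shuffle.
val smallBc = sc.broadcast(small)
val joined = big.mapPartitions { iter =>
  val lookup = smallBc.value
  iter.map { case (k, v) => (k, (v, lookup.get(k))) }  // left-outer-join semantics
}

joined.saveAsTextFile("joined-output")

This avoids the shuffle entirely, at the cost of collecting the small side to the driver first.
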

On Thu, Apr 9, 2015 at 8:37 PM, Guillaume Pitel 
wrote:

>  Maybe I'm wrong, but what you are doing here is basically a bunch of
> cartesian products, one for each key. So if "hello" appears 100 times in your
> corpus, it will produce 100*100 elements in the join output.
>
> I don't understand what you're doing here, but it's normal that your join
> takes forever; it makes no sense as is, IMO.
>
> Guillaume
>
> Hello guys,
>
>  I am trying to run the following dummy example for Spark,
> on a dataset of 250MB, using 5 machines with >10GB RAM
> each, but the join seems to be taking too long (> 2hrs).
>
>  I am using Spark 0.8.0 but I have also tried the same example
> on more recent versions, with the same results.
>
>  Do you have any idea why this is happening?
>
>  Thanks a lot,
> Kostas
>
> val sc = new SparkContext(
>   args(0),
>   "DummyJoin",
>   System.getenv("SPARK_HOME"),
>   Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>
> val file = sc.textFile(args(1))
>
> val wordTuples = file
>   .flatMap(line => line.split(args(2)))
>   .map(word => (word, 1))
>
> val big = wordTuples.filter {
>   case ((k, v)) => k != "a"
> }.cache()
>
> val small = wordTuples.filter {
>   case ((k, v)) => k != "a" && k != "to" && k != "and"
> }.cache()
>
> val res = big.leftOuterJoin(small)
> res.saveAsTextFile(args(3))
>   }
>
>
>
> --
>  Guillaume PITEL, Président
> +33(0)626 222 431
>
> eXenSa S.A.S. 
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>



-- 
Deepak


Spark Job #of attempts ?

2015-04-09 Thread ๏̯͡๏
Hello,
I have a Spark job with 5 stages. After it runs the 3rd stage, the console
shows


15/04/09 10:25:57 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: RUNNING)
15/04/09 10:25:58 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: RUNNING)
15/04/09 10:25:59 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: ACCEPTED)
15/04/09 10:25:59 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: hdmi-express
 start time: 1428598679223
 final status: UNDEFINED
 tracking URL:
https://apollo-phx-rm-1.vip.ebay.com:50030/proxy/application_1427705526386_127168/
 user: dvasthimal
15/04/09 10:26:00 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: ACCEPTED)
15/04/09 10:26:01 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: ACCEPTED)

and then it starts running again. This looks as if the stage failed and Spark
restarted the job from the beginning. Even if that's not the case, when I
click the Spark UI web page it does not show the already completed stages and
instead goes back to running stage #1. Is there some setting to turn this
behavior off?

-- 
Deepak


Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can use toDebugString to see all the steps in the job.

Best
Bojan
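
For example (assuming a SparkContext `sc` as in the spark-shell; the file name is hypothetical):

val rdd = sc.textFile("input.txt")   // hypothetical input file
  .map(_.length)
  .cache()

// Prints the RDD lineage, one indented line per upstream transformation.
println(rdd.toDebugString)
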



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Tathagata Das
Well, you are running in local mode, so it cannot find another peer to
replicate the blocks received from the receivers. That's it. It's not a real
concern, and the warning will go away when you run it in a cluster.

On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi 
wrote:

> Hi,
>
> I'm running a spark streaming job in local mode (--master local[4]), and
> I'm seeing tons of these messages, roughly once every second -
>
> WARN BlockManager: Block input-0-1428527584600 replicated to only 0
> peer(s) instead of 1 peers
>
> We're using spark 1.2.1. Even with TRACE logging enabled, we're not seeing
> any log messages indicating failure to replicate the blocks.
>
> Should we be concerned about this warning (and if so, how should we debug
> this), or is this a corner case in local mode where replication is not
> attempted, but the warning is emitted anyway? If so, what is the workaround?
>
> thanks
>
> Nandan
>


Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it
maintains a clustered index on the source vertex id (that is, it sorts
the edges by source vertex id and stores the offsets in a hash table).
If the activeDirection is appropriately set, it can then jump only to
the clusters with active source vertices.

See the EdgePartition#index field [1], which stores the offsets, and
the logic in GraphImpl#aggregateMessagesWithActiveSet [2], which
decides whether to do a full scan or use the index.

[1] 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala#L60
[2]
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266

Ankur
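
For illustration, a minimal single-source-shortest-paths sketch that sets activeDirection (the graph data is made up, and `sc` is assumed to be an existing SparkContext):

import org.apache.spark.graphx._

// Tiny made-up graph; the source vertex starts at distance 0, everything else at infinity.
val sourceId: VertexId = 1L
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
val graph = Graph.fromEdges(edges, Double.PositiveInfinity)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

// activeDirection = EdgeDirection.Out lets GraphX consult the clustered index and
// scan only edge clusters whose source vertex received a message in the previous round.
val sssp = Pregel(graph, Double.PositiveInfinity,
  activeDirection = EdgeDirection.Out)(
  (id, dist, newDist) => math.min(dist, newDist),                 // vertex program
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))   // relax the edge
    else Iterator.empty,
  (a, b) => math.min(a, b))                                       // merge messages

sssp.vertices.collect().foreach(println)

Note that EdgeDirection.Out is only valid here because sendMsg sends messages along out-edges of changed vertices; the active set is an optimization hint, not a correctness guarantee.
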


On Thu, Apr 9, 2015 at 3:21 AM, James  wrote:
> In aggregateMessagesWithActiveSet, Spark still has to read all edges. It
> means that a fixed cost which scales with graph size is unavoidable in a
> pregel-like iteration.
>
> But what if I have to run nearly 100 iterations, and in the last 50
> iterations only < 0.1% of the nodes need to be updated? The fixed cost
> makes the program finish in an unacceptable amount of time.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



make two rdd co-partitioned in python

2015-04-09 Thread pop
In Scala, we can make two RDDs use the same partitioner so that they are
co-partitioned:
   val partitioner = new HashPartitioner(5)
   val a1 = a.partitionBy(partitioner).cache()
   val b1 = b.partitionBy(partitioner).cache()

How can we achieve the same in python? It would be great if somebody can
share some examples. 


Thanks, 
Xiang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/make-two-rdd-co-partitioned-in-python-tp22445.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
You are right, this needs to be done. I can work on it soon; I was not sure
if anyone was even using the Scala 2.11 Spark REPL. Actually, there is a
patch in the Scala 2.10 shell to support adding jars (I lost the JIRA ID),
which has to be ported to Scala 2.11 too. If, however, you (or anyone else)
are planning to work on it, I can help you.

Prashant Sharma



On Thu, Apr 9, 2015 at 3:08 PM, anakos  wrote:

> Hi-
>
> I am having difficulty getting the 1.3.0 Spark shell to find an external
> jar.  I have built Spark locally for Scala 2.11 and I am starting the REPL
> as follows:
>
> bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar
>
> I see the following line in the console output:
>
> 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR
>
> file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
> at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar with
> timestamp 1428569535904
>
> but when i try to import anything from this jar, it's simply not available.
> When I try to add the jar manually using the command
>
> :cp /path/to/jar
>
> the classes in the jar are still unavailable. I understand that 2.11 is not
> officially supported, but has anyone been able to get an external jar
> loaded
> in the 1.3.0 release?  Is this a known issue? I have tried searching around
> for answers but the only thing I've found that may be related is this:
>
> https://issues.apache.org/jira/browse/SPARK-3257
>
> Any/all help is much appreciated.
> Thanks
> Alex
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
Thanks Ted, using HiveTest as my context worked. It still left a metastore
directory and Derby log in my current working directory though; I manually
added a shutdown hook to delete them and all was well.
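
In case it helps anyone else, a minimal sketch of that shutdown hook (assuming the default metastore_db directory and derby.log file in the working directory, and commons-io on the test classpath):

import java.io.File
import org.apache.commons.io.FileUtils

// Clean up the local Derby metastore artifacts when the test JVM exits.
sys.addShutdownHook {
  FileUtils.deleteQuietly(new File("metastore_db"))  // assumed default metastore directory
  FileUtils.deleteQuietly(new File("derby.log"))     // assumed default Derby log file
}
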

On Wed, Apr 8, 2015 at 4:33 PM, Ted Yu  wrote:

> Please take a look at
> sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :
>
>   protected def configure(): Unit = {
> warehousePath.delete()
> metastorePath.delete()
> setConf("javax.jdo.option.ConnectionURL",
>   s"jdbc:derby:;databaseName=$metastorePath;create=true")
> setConf("hive.metastore.warehouse.dir", warehousePath.toString)
>   }
>
> Cheers
>
> On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
>> I am trying to unit test some code which takes an existing HiveContext
>> and uses it to execute a CREATE TABLE query (among other things).
>> Unfortunately I've run into some hurdles trying to unit test this, and I'm
>> wondering if anyone has a good approach.
>>
>> The metastore DB is automatically created in the local directory, but it
>> doesn't seem to be cleaned up afterward. Is there any way to get Spark to
>> clean this up when the context is stopped? Or can I point this to some
>> other location, such as a temp directory?
>>
>> Trying to create a table fails because it is using the default warehouse
>> directory (/user/hive/warehouse). Is there some way to change this without
>> hard-coding a directory in a hive-site.xml; again, I'd prefer to point it
>> to a temp directory so it will be automatically removed. I tried a couple
>> of things that didn't work:
>>
>>- hiveContext.sql("SET hive.metastore.warehouse.dir=/tmp/dir/xyz")
>>- hiveContext.setConf("hive.metastore.warehouse.dir", "/tmp/dir/xyz")
>>
>> Any advice from those who have been here before would be appreciated.
>>
>
>