Re: return probability \ confidence instead of actual class

2014-09-21 Thread Liquan Pei
Hi Adamantios, For your first question: after you train the SVM, you get a model with a vector of weights w and an intercept b. Points x such that w.dot(x) + b = 1 and w.dot(x) + b = -1 are points that lie on the decision boundary. The quantity w.dot(x) + b for a point x is a confidence measure of class
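A minimal sketch of getting that raw value out of MLlib (assuming an SVMModel trained with SVMWithSGD on RDDs named training and test; clearThreshold() switches predict() from returning the 0/1 label to returning the raw margin w.dot(x) + b):

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val model: SVMModel = SVMWithSGD.train(training, 100)

    // By default predict() returns 0.0 or 1.0; with the threshold cleared it
    // returns the raw margin, which can serve as a confidence score.
    model.clearThreshold()
    val margins: RDD[Double] = test.map(p => model.predict(p.features))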

Re: Spark streaming twitter exception

2014-09-21 Thread Akhil Das
Can you try adding these dependencies? libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.0.1" libraryDependencies += "org.twitter4j" % "twitter4j-core" % "4.0.0" libraryDependencies += "org.twitter4j" % "twitter4j" % "4.0.0" And make sure these 3 jars are downloaded

Worker state is 'killed'

2014-09-21 Thread Sarath Chandra
Hi All, I'm executing a simple job in Spark which reads a file on HDFS, processes the lines and saves the processed lines back to HDFS. All 3 stages complete correctly and I'm able to see the processed file on HDFS. But on the Spark UI, the worker state is shown as "killed". And I'm

Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-21 Thread innowireless TaeYun Kim
Hi, I'm confused by saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset. What's the difference between the two? What are the individual use cases of the two APIs? Could you briefly describe the internal flows of the two APIs? I've used Spark for several months, but I have no experience on M

Re: return probability \ confidence instead of actual class

2014-09-21 Thread Adamantios Corais
Nobody? If that's not supported already, can you please at least give me a few hints on how to implement it? Thanks! On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais < adamantios.cor...@gmail.com> wrote: > Hi, > > I am working with the SVMWithSGD classification algorithm on Spark. It > works f

Re: java.lang.ClassNotFoundException on driver class in executor

2014-09-21 Thread Andrew Or
Hi Barrington, Have you tried running it from the command line? (i.e. bin/spark-submit --master yarn-client --class YOUR_CLASS YOUR_JAR) Does it still fail? I am not super familiar with running Spark through IntelliJ, but AFAIK the classpaths are set up a little differently there. Also, Spark s

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-21 Thread Andrew Or
Hi Didata, An alternative to what Sandy proposed is to set the Spark properties in a special file `conf/spark-defaults.conf`. That way you don't have to specify all the configs through the command line every time. The `--conf` option is mostly intended to change one or two parameters, but it becom
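As a sketch, conf/spark-defaults.conf is just whitespace-separated property/value pairs, one per line; the property names below are standard Spark configs, and the values are only illustrative:

    spark.master              spark://master:7077
    spark.executor.memory     2g
    spark.serializer          org.apache.spark.serializer.KryoSerializer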

Re: Spark and disk usage.

2014-09-21 Thread Andrew Ash
Thanks for the info Burak! I filed a bug on myself at https://issues.apache.org/jira/browse/SPARK-3631 to turn this information into a new section on the programming guide. Thanks for the explanation it's very helpful. Andrew On Wed, Sep 17, 2014 at 12:08 PM, Burak Yavuz wrote: > Yes, writing

Re: Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-21 Thread Koert Kuipers
I have found no way around this. Basically this makes SPARK_CLASSPATH unusable, and the alternative for enabling LZO on a cluster is not reasonable. One has to set in spark-defaults.conf: spark.executor.extraClassPath /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar spark.executor.extraLib
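For reference, the pair of executor settings being described might look like this in spark-defaults.conf (the native library path is only an example; point it at wherever the LZO native libraries live on your nodes):

    spark.executor.extraClassPath    /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
    spark.executor.extraLibraryPath  /usr/lib/hadoop/lib/native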

java.lang.ClassNotFoundException on driver class in executor

2014-09-21 Thread Barrington Henry
Hi, I am running Spark from my IDE (IntelliJ) using YARN as my cluster manager. However, the executor node is not able to find my main driver class “LascoScript”. I keep getting java.lang.ClassNotFoundException. I tried adding the jar of the main class by running the snippet below val conf
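One common fix, sketched below, is to ship the application jar to the executors explicitly through SparkConf (the jar path is hypothetical; it should point at the packaged jar that contains LascoScript):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("LascoScript")
      // make the driver's classes available to the executors;
      // the path below is only an example
      .setJars(Seq("/path/to/lascoscript-assembly.jar"))
    val sc = new SparkContext(conf)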

Re: Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Michael Armbrust
Spark SQL always uses a custom configuration of Kryo under the hood to improve shuffle performance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala Michael On Sun, Sep 21, 2014 at 9:04 AM, Grega Kešpret wrote: > Hi, > >
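If the goal is to make the plain-RDD numbers comparable, Kryo can also be turned on for ordinary RDD shuffles; a minimal sketch using the standard configuration key:

    import org.apache.spark.SparkConf

    // Spark SQL already does this internally; for regular RDDs it has to be
    // enabled explicitly.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")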

Re: pyspark on yarn - lost executor

2014-09-21 Thread Sandy Ryza
Hi Oleg, Those parameters control the number and size of Spark's daemons on the cluster. If you're interested in how these daemons relate to each other and interact with YARN, I wrote a post on this a little while ago - http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-ya
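For reference, on YARN those knobs are usually passed straight to spark-submit (the values here are only illustrative):

    spark-submit --master yarn-client \
      --num-executors 4 \
      --executor-memory 2g \
      --executor-cores 2 \
      my_app.py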

How to initialize updateStateByKey operation

2014-09-21 Thread Soumitra Kumar
I started with StatefulNetworkWordCount to have a running count of words seen. I have a file 'stored.count' which contains the word counts. $ cat stored.count a 1 b 2 I want to initialize StatefulNetworkWordCount with the values in 'stored.count' file, how do I do that? I looked at the paper '
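Spark 1.1 does not take an initial-state RDD directly, but one workaround is to broadcast the stored counts and fall back to them the first time a key shows up. A sketch, assuming 'stored.count' holds whitespace-separated "word count" lines and wordDstream is the DStream[(String, Int)] from the StatefulNetworkWordCount example:

    import org.apache.spark.HashPartitioner

    // load the previously stored counts once on the driver and broadcast them
    val stored: Map[String, Int] = ssc.sparkContext.textFile("stored.count")
      .map { line => val Array(w, c) = line.split("\\s+"); (w, c.toInt) }
      .collect().toMap
    val storedBc = ssc.sparkContext.broadcast(stored)

    // the Iterator-based overload of updateStateByKey exposes the key, so the
    // stored count can seed the state the first time a word is seen
    val stateDstream = wordDstream.updateStateByKey[Int](
      (it: Iterator[(String, Seq[Int], Option[Int])]) => it.map { case (word, counts, state) =>
        val previous = state.getOrElse(storedBc.value.getOrElse(word, 0))
        (word, previous + counts.sum)
      },
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      rememberPartitioner = true
    )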

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-21 Thread Sandy Ryza
If using a client deploy mode, the driver memory can't go through --conf. spark-submit handles --driver-memory as a special case because it needs to know how much memory to give the JVM before starting it and interpreting the other properties. -Sandy On Tue, Sep 16, 2014 at 10:20 PM, Dimension D
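In practice that means passing it with the dedicated flag rather than via --conf, e.g. something like:

    spark-submit --master yarn-client \
      --driver-memory 4g \
      --conf spark.executor.memory=2g \
      ...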

Can SparkContext shared across nodes/drivers

2014-09-21 Thread 林武康
Hi all, As far as I know, a SparkContext instance takes charge of some resources of the cluster that the master assigned to it, and it can hardly be shared between different SparkContexts; meanwhile, scheduling between applications is also not easy. To address this without introducing extra resource schedule

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
zipWithUniqueId is also affected... I had to persist the dictionaries to make use of the indices lower down in the flow... On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen wrote: > Reference - https://issues.apache.org/jira/browse/SPARK-3098 > I imagine zipWithUniqueID is also affected, but may not h
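The pattern being described looks roughly like this (RDD names are illustrative): persist the zipped dictionary before the ids are reused, so a later recomputation cannot assign different ids:

    import org.apache.spark.storage.StorageLevel

    // assign an id to each distinct term, then pin the result so downstream
    // stages keep seeing the same ids instead of re-running the zip
    val dictionary = terms.distinct().zipWithUniqueId()
      .persist(StorageLevel.MEMORY_AND_DISK)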

Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Grega Kešpret
Hi, I am seeing different shuffle write sizes when using SchemaRDD (versus normal RDD). I'm doing the following: case class DomainObj(a: String, b: String, c: String, d: String) val logs: RDD[String] = sc.textFile(...) val filtered: RDD[String] = logs.filter(...) val myDomainObjects: RDD[DomainO

Re: Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja Mohideen
Setting JAVA_OPTS helped me fix the problem. Thanks, -Khaja On Sun, Sep 21, 2014 at 9:25 AM, Khaja Mohideen wrote: > I was able to move past this error by deleting the .ivy2/cache folder. > > However, I am running into an out of memory error > [error] java.util.concurrent.ExecutionException: >
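For the record, on Windows that amounts to giving sbt a bigger heap before re-running the build; the sizes below are only illustrative:

    set JAVA_OPTS=-Xmx2g -XX:MaxPermSize=512m
    sbt assembly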

Re: Avoid broacasting huge variables

2014-09-21 Thread octavian.ganea
Using mapPartitions and passing the big index object as a parameter to it was not the best option, given the size of the big object and my RAM. The workers died before starting the actual computation. Anyway, creating a singleton object worked for me: http://apache-spark-user-list.1001560.n3.na
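The singleton pattern being referred to looks roughly like this (a sketch; loadIndex, the file format, and the path are placeholders for however the index is really built). The object is initialized lazily, at most once per executor JVM, so the big structure is never shipped from the driver:

    object BigIndex {
      // built the first time a task on a given executor touches it
      lazy val index: Map[String, Int] = loadIndex("/local/path/on/each/worker/index.tsv")

      private def loadIndex(path: String): Map[String, Int] =
        scala.io.Source.fromFile(path).getLines()
          .map { line => val Array(k, v) = line.split("\t"); (k, v.toInt) }
          .toMap
    }

    // used from inside tasks, e.g. on an RDD[String] called records:
    val resolved = records.mapPartitions { it =>
      it.map(r => (r, BigIndex.index.getOrElse(r, -1)))
    }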

Re: Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja Mohideen
I was able to move past this error by deleting the .ivy2/cache folder. However, I am running into an out of memory error [error] java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space [error] Use 'last' for the full log. This is despite the fact that I have set m2_o

Re: Issues with partitionBy: FetchFailed

2014-09-21 Thread David Rowe
Hi, I've seen this problem before, and I'm not convinced it's GC. When Spark shuffles, it writes a lot of small files to store the data to be sent to other executors (AFAICT). According to what I've read around the place, the intention is that these files be stored in disk buffers, and since sync()

Re: Saving RDD with array of strings

2014-09-21 Thread Julien Carme
Just use flatMap, it does exactly what you need: newLines.flatMap { lines => lines }.saveAsTextFile(...) 2014-09-21 11:26 GMT+02:00 Sarath Chandra < sarathchandra.jos...@algofusiontech.com>: > Hi All, > > If my RDD is having array/sequence of strings, how can I save them as a > HDFS file with e

RE: Issues with partitionBy: FetchFailed

2014-09-21 Thread Shao, Saisai
Hi, I’ve also met this problem before. I think you can try to set “spark.core.connection.ack.wait.timeout” to a large value to avoid ack timeouts; the default is 60 seconds. Sometimes, because of a GC pause or some other reason, the acknowledgement message will time out, which will lead to this exception,
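For example, on the SparkConf (the 600-second value is only an illustration; the key takes seconds):

    import org.apache.spark.SparkConf

    // raise the ack timeout from the 60s default so long GC pauses don't
    // surface as FetchFailed errors
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "600")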

Issues with partitionBy: FetchFailed

2014-09-21 Thread Julien Carme
Hello, I am facing an issue with partitionBy, it is not clear whether it is a problem with my code or with my spark setup. I am using Spark 1.1, standalone, and my other spark projects work fine. So I have to repartition a relatively large file (about 70 million lines). Here is a minimal version

Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja M
Hi: I am trying to set up Spark 1.1 on a Windows 7 box and I am running the sbt assembly command and this is the error that I am seeing. [error] (streaming-flume-sink/*:update) sbt.ResolveException: unresolved dependency: commons-lang#commons-lang;2.6: configuration not found in commons-lang#com

Saving RDD with array of strings

2014-09-21 Thread Sarath Chandra
Hi All, If my RDD contains an array/sequence of strings, how can I save them as an HDFS file with each string on a separate line? For example if I write code as below, the output should get saved as an HDFS file having one string per line ... ... var newLines = lines.map(line => myfunc(line)); newLines.s

Re: Distributed dictionary building

2014-09-21 Thread Sean Owen
Reference - https://issues.apache.org/jira/browse/SPARK-3098 I imagine zipWithUniqueID is also affected, but may not happen to have exhibited in your test. On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das wrote: > Some more debug revealed that as Sean said I have to keep the dictionaries > persisted