Hi Adamantios,
For your first question: after you train the SVM, you get a model with a
vector of weights w and an intercept b. Points x such that w.dot(x) + b = 1
or w.dot(x) + b = -1 lie on the margins (the decision boundary itself is
where w.dot(x) + b = 0). The quantity w.dot(x) + b for a point x is a
confidence measure of its
class
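For example, a minimal sketch (trainingData and the feature values are made
up; clearThreshold() makes predict() return the raw margin w.dot(x) + b
instead of a 0/1 label):
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors

val model = SVMWithSGD.train(trainingData, 100)  // trainingData: RDD[LabeledPoint], assumed
model.clearThreshold()                           // predict() now returns the raw margin
val rawScore = model.predict(Vectors.dense(1.0, 2.0, 3.0))

// The same margin, computed by hand from the model's weights and intercept:
val x = Vectors.dense(1.0, 2.0, 3.0)
val margin = model.weights.toArray.zip(x.toArray).map { case (w, v) => w * v }.sum +
  model.intercept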
Can you try adding these dependencies?
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10"
% "1.0.1"
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "4.0.0"
libraryDependencies += "org.twitter4j" % "twitter4j" % "4.0.0"
And make sure these 3 jars are downloaded
Hi All,
I'm executing a simple job in Spark which reads a file from HDFS, processes
the lines, and saves the processed lines back to HDFS. All three stages
complete correctly and I'm able to see the processed file on HDFS.
But on the Spark UI, the worker state is shown as "killed". And I'm
Hi,
I'm confused about saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset.
What's the difference between the two?
What are the individual use cases of the two APIs?
Could you briefly describe the internal flow of each API?
I've used Spark for several months, but I have no experience on M
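To make the comparison concrete, here is a rough sketch of how I currently
understand the two calls (sc, the paths, and the types are made up for
illustration): saveAsNewAPIHadoopFile takes the output path and classes as
arguments, while saveAsNewAPIHadoopDataset reads everything, including the
output location, from the Hadoop Configuration, which is handy for non-file
sinks such as HBase.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))
  .map { case (k, v) => (new Text(k), new Text(v)) }

// saveAsNewAPIHadoopFile: the path and classes are passed directly.
pairs.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/out-file",
  classOf[Text], classOf[Text],
  classOf[TextOutputFormat[Text, Text]])

// saveAsNewAPIHadoopDataset: the output format, key/value classes and output
// location all come from the Configuration built through a Job (Hadoop 2 style).
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[Text])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, Text]])
FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/out-dataset"))
pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)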
Nobody?
If that's not supported already, can someone please at least give me a few
hints on how to implement it?
Thanks!
On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
adamantios.cor...@gmail.com> wrote:
> Hi,
>
> I am working with the SVMWithSGD classification algorithm on Spark. It
> works f
Hi Barrington,
Have you tried running it from the command line? (i.e. bin/spark-submit
--master yarn-client --class YOUR_CLASS YOUR_JAR) Does it still fail? I am
not super familiar with running Spark through IntelliJ, but AFAIK the
classpaths are set up a little differently there. Also, Spark s
Hi Didata,
An alternative to what Sandy proposed is to set the Spark properties in a
special file `conf/spark-defaults.conf`. That way you don't have to specify
all the configs through the command line every time. The `--conf` option is
mostly intended to change one or two parameters, but it becom
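For example, a minimal conf/spark-defaults.conf might look like this (the
values are only illustrations):
spark.master                     yarn-client
spark.executor.memory            2g
spark.executor.cores             2
spark.serializer                 org.apache.spark.serializer.KryoSerializer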
Thanks for the info Burak!
I filed a bug on myself at https://issues.apache.org/jira/browse/SPARK-3631
to turn this information into a new section on the programming guide.
Thanks for the explanation, it's very helpful.
Andrew
On Wed, Sep 17, 2014 at 12:08 PM, Burak Yavuz wrote:
> Yes, writing
I have found no way around this. Basically this makes SPARK_CLASSPATH
unusable, and the alternative for enabling LZO on a cluster is not
reasonable.
One has to set in spark-defaults.conf:
spark.executor.extraClassPath /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
spark.executor.extraLib
Hi,
I am running Spark from my IDE (IntelliJ) using YARN as my cluster manager.
However, the executor node is not able to find my main driver class
“LascoScript”. I keep getting java.lang.ClassNotFoundException.
I tried adding the jar of the main class by running the snippet below
val conf
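For reference, a minimal sketch of what that snippet might look like, pointing
the executors at the application jar via setJars (the jar path is a
placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("LascoScript")
  // Ship the assembled application jar so executors can load the driver's classes.
  .setJars(Seq("/path/to/lascoscript-assembly.jar"))

val sc = new SparkContext(conf)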
Spark SQL always uses a custom configuration of Kryo under the hood to
improve shuffle performance:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala
Michael
On Sun, Sep 21, 2014 at 9:04 AM, Grega Kešpret wrote:
> Hi,
>
>
Hi Oleg,
Those parameters control the number and size of Spark's daemons on the
cluster. If you're interested in how these daemons relate to each other
and interact with YARN, I wrote a post on this a little while ago -
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-ya
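Assuming the parameters in question are the usual YARN sizing flags, they are
typically passed like this (the values are only examples):
spark-submit \
  --master yarn-client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.MyApp myapp.jar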
I started with StatefulNetworkWordCount to have a running count of words seen.
I have a file 'stored.count' which contains the word counts.
$ cat stored.count
a 1
b 2
I want to initialize StatefulNetworkWordCount with the values in 'stored.count'
file, how do I do that?
I looked at the paper '
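One rough sketch of how that could work, assuming the ssc/lines setup from the
StatefulNetworkWordCount example: the key-aware updateStateByKey overload lets
you seed a word that has no state yet from the stored file.
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._

// Parse stored.count ("word count" per line) into an in-memory map.
val stored: Map[String, Int] = scala.io.Source.fromFile("stored.count")
  .getLines()
  .flatMap { line =>
    line.trim.split("\\s+") match {
      case Array(word, count) => Some(word -> count.toInt)
      case _                  => None
    }
  }.toMap

// When a word has no state yet, fall back to the count from the file.
val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) =>
  iter.map { case (word, newCounts, state) =>
    val previous = state.getOrElse(stored.getOrElse(word, 0))
    (word, previous + newCounts.sum)
  }

val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int](updateFunc, new HashPartitioner(2), true)  // remember partitioner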
If using a client deploy mode, the driver memory can't go through --conf.
spark-submit handles --driver-memory as a special case because it needs to
know how much memory to give the JVM before starting it and interpreting
the other properties.
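For example (the class, jar, and values are just placeholders), the driver
memory goes on the spark-submit command line itself, while other properties
can still be passed through --conf:
spark-submit \
  --master yarn-client \
  --driver-memory 4g \
  --conf spark.executor.memory=2g \
  --class com.example.MyApp myapp.jar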
-Sandy
On Tue, Sep 16, 2014 at 10:20 PM, Dimension D
Hi all,
As far as I know, a SparkContext instance takes charge of the cluster
resources that the master assigns to it, and those resources can hardly be
shared with other SparkContexts; meanwhile, scheduling between applications
is also not easy.
To address this without introducing extra resource schedule
zipWithUniqueId is also affected...
I had to persist the dictionaries to make use of the indices lower down in
the flow...
On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen wrote:
> Reference - https://issues.apache.org/jira/browse/SPARK-3098
> I imagine zipWithUniqueID is also affected, but may not h
Hi,
I am seeing different shuffle write sizes when using SchemaRDD (versus
normal RDD). I'm doing the following:
case class DomainObj(a: String, b: String, c: String, d: String)
val logs: RDD[String] = sc.textFile(...)
val filtered: RDD[String] = logs.filter(...)
val myDomainObjects: RDD[DomainO
Setting JAVA_OPTS helped me fix the problem.
Thanks,
-Khaja
On Sun, Sep 21, 2014 at 9:25 AM, Khaja Mohideen wrote:
> I was able to move past this error by deleting the .ivy2/cache folder.
>
> However, I am running into an out of memory error
> [error] java.util.concurrent.ExecutionException:
>
Using mapPartitions and passing the big index object as a parameter to it was
not the best option, given the size of the big object and my RAM. The
workers died before starting the actual computation.
Anyway, creating a singleton object worked for me:
http://apache-spark-user-list.1001560.n3.na
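The pattern looks roughly like this (the index type, file path, and rdd are
made up; the point is that the index is built lazily once per executor JVM
instead of being shipped inside the task closure):
// Loaded at most once per executor JVM, the first time a task touches it.
object BigIndexHolder {
  lazy val index: Map[String, String] =
    scala.io.Source.fromFile("/local/path/on/each/worker/index.tsv")
      .getLines()
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
      .toMap
}

val results = rdd.mapPartitions { iter =>
  val index = BigIndexHolder.index  // resolved locally, nothing shipped in the closure
  iter.map(record => index.getOrElse(record, record))
}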
I was able to move past this error by deleting the .ivy2/cache folder.
However, I am running into an out of memory error
[error] java.util.concurrent.ExecutionException:
java.lang.OutOfMemoryError: Java heap space
[error] Use 'last' for the full log.
This is despite the fact that I have set m2_o
Hi,
I've seen this problem before, and I'm not convinced it's GC.
When Spark shuffles, it writes a lot of small files to store the data to be
sent to other executors (AFAICT). According to what I've read around the
place, the intention is that these files be stored in disk buffers, and
since sync()
Just use flatMap, it does exactly what you need:
newLines.flatMap { lines => lines }.saveAsTextFile(...)
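Spelled out a little, with the names from your snippet (a minimal sketch; the
output path is a placeholder):
val newLines: org.apache.spark.rdd.RDD[Seq[String]] = lines.map(line => myfunc(line))
// Each element of every sequence becomes one line in the output files.
newLines.flatMap(seq => seq).saveAsTextFile("hdfs:///tmp/output")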
2014-09-21 11:26 GMT+02:00 Sarath Chandra <
sarathchandra.jos...@algofusiontech.com>:
> Hi All,
>
> If my RDD is having array/sequence of strings, how can I save them as a
> HDFS file with e
Hi,
I’ve run into this problem before. I think you can try setting
“spark.core.connection.ack.wait.timeout” to a larger value to avoid the ack
timeout; the default is 60 seconds.
Sometimes, because of a GC pause or some other reason, the acknowledgement
message times out, which leads to this exception,
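For example, set it on the SparkConf in the driver before creating the
SparkContext (600 seconds is only an illustration):
val conf = new org.apache.spark.SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")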
Hello,
I am facing an issue with partitionBy; it is not clear whether it is a
problem with my code or with my Spark setup. I am using Spark 1.1,
standalone, and my other Spark projects work fine.
So I have to repartition a relatively large file (about 70 million lines).
Here is a minimal version
Hi:
I am trying to set up Spark 1.1 on a Windows 7 box. When I run the sbt
assembly command, this is the error that I am seeing.
[error] (streaming-flume-sink/*:update) sbt.ResolveException: unresolved
dependency: commons-lang#commons-lang;2.6: configuration not found in
commons-lang#com
Hi All,
If my RDD contains arrays/sequences of strings, how can I save it as an
HDFS file with each string on a separate line?
For example, if I write code as below, the output should get saved as an
HDFS file with one string per line
...
...
var newLines = lines.map(line => myfunc(line));
newLines.s
Reference - https://issues.apache.org/jira/browse/SPARK-3098
I imagine zipWithUniqueId is also affected, but it may not have happened to
show up in your test.
On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das wrote:
> Some more debug revealed that as Sean said I have to keep the dictionaries
> persisted