Re: Spark speed performance

2014-10-19 Thread jan.zikes
Thank you very much. Lots of very small JSON files was exactly the performance problem; using coalesce, my Spark program running on a single node is now only twice as slow (even counting Spark startup) as the single-node Python program, which is acceptable. Jan
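
[Editor's note: for readers hitting the same issue, a minimal sketch of the approach, assuming hypothetical paths and a partition count of 16:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("compact-small-files"))
    // Each small input file becomes (at least) one partition; coalesce merges
    // them into fewer, larger partitions without a shuffle.
    val raw = sc.textFile("hdfs:///data/json/*.json")  // path is a placeholder
    val compacted = raw.coalesce(16)                   // 16 is an arbitrary example
    println(compacted.count())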

Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
In my program, the application's connection to the master fails for several iterations. The driver's log is as follows:

WARN AppClient$ClientActor: Connection to akka.tcp://sparkMaster@master1:7077 failed; waiting for master to reconnect...

Why does this warning happen, and how can I avoid it?

Re: Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
In addition, the driver receives several DisassociatedEvent messages.

Re: What's wrong with my Spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-19 Thread Ilya Ganelin
Check for any variables you've declared in your class. Even if you're not calling them from the function, they are passed to the worker nodes as part of the context. Consequently, if you have something without a default serializer (like an imported class), it will also get passed. To fix this you can...
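
[Editor's note: a minimal sketch of the usual fixes, with hypothetical names (NonSerializableHelper stands in for whatever class gets dragged into the closure):]

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class NonSerializableHelper  // stand-in for an imported, non-serializable class

    class Pipeline(@transient val sc: SparkContext) {
      // Option 1: mark the field @transient so it is not shipped with the closure.
      @transient val helper = new NonSerializableHelper

      def run(rdd: RDD[Int]): RDD[Int] = {
        // Option 2: copy what the closure needs into a local val, so Spark
        // serializes just that value instead of the whole enclosing class.
        val threshold = 10
        rdd.filter(_ > threshold)
      }
    }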

scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with a Timestamp field. Is anything wrong with my code below?

public static...

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
Can you provide the exception stack? Thanks, Daoyuan

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
    at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
    at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:...)

Error while running Streaming examples - no snappyjava in java.library.path

2014-10-19 Thread bdev
I built the latest Spark project and I'm running into these errors when attempting to run the streaming examples locally on the Mac. How do I fix them?

java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)

Using SVMWithSGD model to predict

2014-10-19 Thread npomfret
Hi, I'm new to Spark and just trying to make sense of the SVMWithSGD example. I ran my dataset through it and built a model. When I call predict() on the testing data (after clearThreshold()) I was expecting to get answers in the range of 0 to 1, but they aren't; all predictions seem to be negative...

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
The problem is that you called clearThreshold(). The result becomes the SVM margin, not a 0/1 class prediction. There is no probability output. There was a very similar question last week. Is there an example out there suggesting clearThreshold()? I also wonder if it is good to overload the meaning...
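
[Editor's note: a minimal sketch of the two behaviours (Spark 1.1 MLlib); the training RDD and the test point are placeholders:]

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    val model = SVMWithSGD.train(training, 100)  // training: RDD[LabeledPoint], placeholder
    model.predict(point.features)                // 0.0 or 1.0 with the default threshold
    model.clearThreshold()
    model.predict(point.features)                // raw SVM margin: any real number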

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks. The example I used is here: https://spark.apache.org/docs/latest/mllib-linear-methods.html (see SVMClassifier). So there's no way to get a probability-based output? What about from linear regression, or logistic regression?

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
Ah right. It is important to use clearThreshold() in that example in order to generate margins, because the AUC metric needs the classifications to be ranked by some relative strength, rather than just 0/1. These outputs are not probabilities, and that is not what SVMs give you in general. There are...
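
[Editor's note: a sketch of how margins feed the AUC computation, assuming a test RDD[LabeledPoint] named test and the model from above:]

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    model.clearThreshold()  // predict() now returns margins
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    // AUC only needs a relative ranking of scores, so uncalibrated margins suffice.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(metrics.areaUnderROC())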

Re: What executes on worker and what executes on driver side

2014-10-19 Thread Saurabh Wadhawan
Any response to this?
1. How do I know which statements from the Spark script will be executed on the worker side in a stage? E.g., if I have val x = 1 (or any other code) in my driver code, will the same statements be executed on the worker side in a stage? (See the sketch below.)
2. How can I do a map-side...
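
[Editor's note: on question 1, a minimal sketch of the usual mental model, not from the original thread: top-level statements run once on the driver; only closure bodies passed to RDD operations run on the workers, along with serialized copies of the values they capture.]

    val x = 1                          // evaluated once, on the driver
    val rdd = sc.parallelize(1 to 100)
    val shifted = rdd.map(i => i + x)  // this closure runs on executors;
                                       // x is serialized and shipped with it
    println(shifted.reduce(_ + _))     // action: the result comes back to the driver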

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei

Re: Oryx + Spark mllib

2014-10-19 Thread Jayant Shekhar
Hi Deb, Do check out https://github.com/OryxProject/oryx. It does integrate with Spark. Sean has put quite a bit of neat detail on the page about the architecture. It has all the things you are thinking about :) Thanks, Jayant

Is Spark the right tool?

2014-10-19 Thread kc66
I am very new to Spark. I am working on a project that involves reading stock transactions off a number of TCP connections and
1. periodically (once every few hours) uploads the transaction records to HBase
2. maintains the records that are not yet written into HBase and acts as an HTTP query server for...

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks for the info.

MLlib model build and low CPU usage

2014-10-19 Thread Nick Pomfret
I'm building a model on a standalone cluster with just a single worker, limited to 3 cores and 4GB RAM. The node starts up and prints the message:

Starting Spark worker 192.168.1.185:60203 with 3 cores, 4.0 GB RAM

During model training (SVMWithSGD) the CPU on the worker is very low. It...
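
[Editor's note: not from the thread, but one common cause worth checking: if the training RDD has fewer partitions than cores, most cores sit idle. A hedged sketch, where data is a placeholder RDD[LabeledPoint] and 3 matches the worker's cores:]

    import org.apache.spark.mllib.classification.SVMWithSGD

    val training = data.repartition(3).cache()   // data: RDD[LabeledPoint], placeholder
    val model = SVMWithSGD.train(training, 100)  // iteration count is arbitrary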

Spark Streaming scheduling control

2014-10-19 Thread davidkl
Hello, I have a cluster with 1 master and 2 slaves running 1.1.0. I am having problems getting both slaves working at the same time. When I launch the driver on the master, one of the slaves is assigned the receiver task, and initially both slaves start processing tasks. After a few tens of batches,...

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Pat Ferrel
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the problem but anyway... I get a NoClassDefFoundError for RandomGenerator when running a driver from the CLI, but only when using a named master, even a standalone master. If I run using master = local[4] the job executes...

RE: how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-19 Thread Henry Hung
@Sean Owen, thank you for the information. I changed the pom file to include math3, because I needed the math3 library from my previous use with 1.0.2. Best regards, Henry
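
[Editor's note: for anyone doing the same, a sketch of the kind of Maven dependency addition meant here; the version is an assumption, so match it to your build:]

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.3</version>
    </dependency>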

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst.

All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Hi all, I have a Spark 0.9 cluster which has 16 nodes. I wrote a Spark application to read data from an HBase table, which has 86 regions spread over 20 RegionServers. I submitted the Spark app in Spark standalone mode and found that there were 86 executors running on just 3 nodes and it took...

Re: All executors run on just a few nodes

2014-10-19 Thread raymond
My best guess is that the speed at which your executors get registered with the driver differs between runs. When you run it for the first time, the executors are not fully registered when the task set manager starts to assign tasks, and thus the tasks are assigned to the available executors which have already satisfied...
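
[Editor's note: not stated in the thread, but a related knob is the locality wait, which controls how long the scheduler holds a task for a better-located executor before scheduling it elsewhere. A hedged sketch:]

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // milliseconds the scheduler waits for an executor with better data
      // locality before falling back to a less local one
      .set("spark.locality.wait", "3000")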

default parallelism bug?

2014-10-19 Thread Kevin Jung
Hi, I usually use a file on HDFS to make a PairRDD and analyze it using combineByKey, reduceByKey, etc. But it sometimes hangs when I set the spark.default.parallelism configuration, even though the file is small. If I remove this configuration, everything works fine. Can anyone tell me why this occurs? Regards,...
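
[Editor's note: for reference, a minimal sketch of the two ways the partition count can be set in this situation; paths, delimiters, and counts are placeholders:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().set("spark.default.parallelism", "8"))
    val pairs = sc.textFile("hdfs:///path/to/file")   // path is a placeholder
      .map(line => (line.split("\t")(0), 1L))         // assumed tab-delimited keys
    // An explicit count on the operation overrides the global default:
    val counts = pairs.reduceByKey(_ + _, 8)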

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
I have created an issue for this: https://issues.apache.org/jira/browse/SPARK-4003

Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using "hdfs dfs -getmerge ...".

On Friday, October 17, 2014, Sean Owen wrote:
> You can save to a local file. What are you trying and what doesn't work?
> You can output one file by repartitioning to 1 partition, but this is probably not a good ide...
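
[Editor's note: a sketch of both options under discussion; rdd and the output paths are placeholders:]

    // Option 1: one partition means one part file, but a single task does all the writing.
    rdd.coalesce(1).saveAsTextFile("hdfs:///out/single")

    // Option 2: write in parallel, then merge outside Spark:
    rdd.saveAsTextFile("hdfs:///out/parts")
    //   hdfs dfs -getmerge /out/parts /local/path/merged.txt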

Re: All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Raymond, thank you. But I read in another thread that "PROCESS_LOCAL" means the data is in the same JVM as the code that is running. When data is in the same JVM...

Re: All executors run on just a few nodes

2014-10-19 Thread raymond
When the data’s source host is not one of the registered executors, it will also be marked as PROCESS_LOCAL, though it should have a different name for this case. I don’t know whether someone changed this name very recently, but for 0.9 that is the case. When I say satisfy, yes, if the executors hav...

Re: checkpoint and not running out of disk space

2014-10-19 Thread sivarani
I am new to Spark; I am using Spark Streaming with Kafka. My batch duration is 1s. Assume I get 100 records in second 1, 120 records in second 2, and 80 records in second 3:
--> {sec 1: 1, 2, ... 100}
--> {sec 2: 1, 2, ... 120}
--> {sec 3: 1, 2, ... 80}
I apply my logic in second 1 and have a result => result1. I want to...
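
[Editor's note: the question is cut off, but carrying a result from one batch into the next is typically done with updateStateByKey; a minimal sketch under that assumption, with the input DStream left as a placeholder:]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("stateful-example")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints")      // stateful operators require checkpointing

    val stream: DStream[String] = ???          // stand-in for the Kafka DStream
    val running = stream.map(record => (record, 1L))
      .updateStateByKey[Long] { (values, state) =>
        Some(state.getOrElse(0L) + values.sum) // fold this batch into the prior state
      }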

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Dmitriy Lyubimov
The Mahout context does not include _all_ possible transitive dependencies. It would not be lightning fast to take all legacy etc. dependencies. There's an "ignored" unit test that asserts context path correctness; you can "un-ignore" it and run it to verify it still works as expected. The reason it is set to...