Re: Worker is KILLED for no reason

2015-06-23 Thread Demi Ben-Ari
Hi, I've opened a bug on the Spark project's JIRA: https://issues.apache.org/jira/browse/SPARK-8557 Would really appreciate some insights on the issue, *It's strange that no one else has encountered the problem.* Have a great day! On Mon, Jun 15, 2015 at 12:03 PM, nizang wrote: > hi, > >

Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread Sonal Goyal
When you deploy Spark over Hadoop, it is typically because you want to leverage the replication of HDFS or because your data is already in Hadoop. Again, if your data is already in Cassandra, or if you want updateable atomic row operations and access to your data as well as the ability to run analytic jobs, that may be another ca

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread Akhil Das
Depending on the amount of memory you have, you could allocate 60-80% of the memory for the Spark worker process. The Datanode doesn't require too much memory. On 23 Jun 2015 21:26, "maxdml" wrote: > I'm wondering if there is a real benefit for splitting my memory in two for > the datanode/work
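On a standalone cluster that split is usually expressed in conf/spark-env.sh; a small sketch assuming a 32GB node (the node size and the 75% figure are assumptions, not from this thread):

    # conf/spark-env.sh on each worker node -- illustrative values for a 32GB node
    export SPARK_WORKER_MEMORY=24g   # roughly 75% for the Spark worker/executors
    # the remaining ~8GB is left for the DataNode daemon, the OS and its page cache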

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-23 Thread Akhil Das
I think this is how it works: RDDs have partitions which are made up of blocks, and the BlockManager knows where these blocks are available. Based on the locality (PROCESS_LOCAL, NODE_LOCAL etc.), Spark will launch the tasks on those nodes. This behaviour can be controlled with spark.

Re: when cached RDD will unpersist its data

2015-06-23 Thread eric wong
In the case that memory cannot hold all of the cached RDD, the BlockManager will evict some older blocks to make room for new RDD blocks. Hope that helps. 2015-06-24 13:22 GMT+08:00 bit1...@163.com : > I am kind of consused about when cached RDD will unpersist its data. I > know we can explicitl

when cached RDD will unpersist its data

2015-06-23 Thread bit1...@163.com
I am kind of confused about when a cached RDD will unpersist its data. I know we can explicitly unpersist it with RDD.unpersist, but can it be unpersisted automatically by the Spark framework? Thanks. bit1...@163.com

Re: Yarn application ID for Spark job on Yarn

2015-06-23 Thread canan chen
I don't think there is YARN-related stuff to access in Spark. Spark doesn't depend on YARN. BTW, why do you want the YARN application id? On Mon, Jun 22, 2015 at 11:45 PM, roy wrote: > Hi, > > Is there a way to get Yarn application ID inside spark application, when > running spark Job on YARN

Re: Spark launching without all of the requested YARN resources

2015-06-23 Thread canan chen
Why do you want it to wait until all the resources are ready? Making it start as early as possible should let it complete earlier and increase the utilization of resources. On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra wrote: > Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark

Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is the correct question. Spark can be deployed on different cluster manager frameworks like standalone, YARN & Mesos. Spark can't run without one of these cluster manager frameworks, which means Spark depends on a cluster manager framework. And the data management layer is the upstream

Re: map V mapPartitions

2015-06-23 Thread canan chen
One example is when you'd like to set up a JDBC connection for each partition and share this connection across its records. mapPartitions is much like the mapper paradigm in MapReduce: in a MapReduce mapper, you have a setup method to do any initialization before processing the spl
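A minimal Scala sketch of that pattern; createConnection() and lookup() are hypothetical helpers, not from the original message:

    // One JDBC connection per partition, shared by all records in that partition.
    val enriched = rdd.mapPartitions { iter =>
      val conn = createConnection()                 // setup, once per partition
      // materialize the results before closing the connection, because the
      // iterator returned by map() is evaluated lazily
      val result = iter.map(record => lookup(conn, record)).toList
      conn.close()                                  // teardown, once per partition
      result.iterator
    }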

Re: flume sinks supported by spark streaming

2015-06-23 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid wrote: > Hi! > > > I want to integrate flume with spark streaming. I want to know which sink > type of flume are suppo
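For reference, the push-based (Avro sink) receiver from that integration guide looks roughly like this on the Spark side; the hostname and port are placeholders that must match the Flume sink's configuration:

    import org.apache.spark.streaming.flume.FlumeUtils

    // Listens for events pushed by a Flume Avro sink on receiver-host:4545
    val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 4545)
    flumeStream.map(e => new String(e.event.getBody.array())).print()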

Re: Spark standalone cluster - resource management

2015-06-23 Thread canan chen
Check the available resources you have (cpu cores & memory ) on master web ui. The log you see means the job can't get any resources. On Wed, Jun 24, 2015 at 5:03 AM, Nizan Grauer wrote: > I'm having 30G per machine > > This is the first (and only) job I'm trying to submit. So it's weird that

flume sinks supported by spark streaming

2015-06-23 Thread Hafiz Mujadid
Hi! I want to integrate Flume with Spark Streaming. I want to know which sink types of Flume are supported by Spark Streaming? I saw one example using avroSink. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/flume-sinks-supported-by-spark-streami

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Tathagata Das
Please run it in your own application and not in the spark shell. I see that you are trying to stop the Spark context and create a new StreamingContext. That will lead to unexpected issues like the one you are seeing. Please make a standalone SBT/Maven app for Spark Streaming. On Tue, Jun 23, 2015 at 3:43

Re: mutable vs. pure functional implementation - StatCounter

2015-06-23 Thread Xiangrui Meng
Creating millions of temporary (immutable) objects is bad for performance. It should be simple to do a micro-benchmark locally. -Xiangrui On Mon, Jun 22, 2015 at 7:25 PM, mzeltser wrote: > Using StatCounter as an example, I'd like to understand if "pure" functional > implementation would be more

Re: which mllib algorithm for large multi-class classification?

2015-06-23 Thread Xiangrui Meng
We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle it efficiently; we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an "array larger than MaxInt" e

Re: How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-23 Thread Xiangrui Meng
Please check the input path to your test data, and call `.count()` and see whether there are records in it. -Xiangrui On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue wrote: > Hey, > > I am testing the StreamingLinearRegressionWithSGD following the tutorial. > > > It works, but I could not output the p

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-23 Thread Xiangrui Meng
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your dataset? 2. If you monitor the progress in the WebUI, how

Re: Issue running Spark 1.4 on Yarn

2015-06-23 Thread Matt Kapilevich
Hi Kevin I never did. I checked for free space in the root partition, don't think this was an issue. Now that 1.4 is officially out I'll probably give it another shot. On Jun 22, 2015 4:28 PM, "Kevin Markey" wrote: > Matt: Did you ever resolve this issue? When running on a cluster or > pseudoc

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-23 Thread Xiangrui Meng
This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct. We can still follow FP-Growth's steps: 1. find frequent items 2. filter transactions and keep only frequent items 3. do NOT order by frequency 4. use suffix to partition the transactions (whethe

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
A rough estimate of the worst case memory requirement for driver is about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put 2 at the beginning because the previous centers are still in memory while receiving new center updates. -Xiangrui On Fri, Jun 19, 2015 at 9:02 AM, Rogers Jeffrey w
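As a rough worked example with assumed numbers (not figures from this thread): with k = 1000, runs = 10, numFeatures = 10,000 and 200 partitions, the worst case is about

    2 * 1000 * 10 * 10000 * 200 * 8 bytes ≈ 3.2e11 bytes ≈ 320 GB

so reducing runs, k or the number of partitions is the natural way to bring the driver requirement down.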

[no subject]

2015-06-23 Thread ๏̯͡๏
I have a Spark job that has 7 stages. The first 3 stages complete and the fourth stage begins (joins two RDDs). This stage has multiple task failures, all with the below exception. Multiple tasks (100s of them) get the same exception on different hosts. How can all the hosts suddenly stop responding wh

Re: Un-persist RDD in a loop

2015-06-23 Thread Tom Hubregtsen
I believe that as you are not persisting anything into the memory space defined by spark.storage.memoryFraction you also have nothing to clear from this area using the unpersist. FYI: The data will be kept in the OS-buffer/on disk at the point of the reduce (as this involves a wide dependency ->

Re: map V mapPartitions

2015-06-23 Thread Holden Karau
I think one of the primary cases where mapPartitions is useful is when you are going to be doing setup work that can be re-used when processing each element; this way the setup work only needs to be done once per partition (for example creating an instance of jodatime). Both map and mapPartition

map V mapPartitions

2015-06-23 Thread ๏̯͡๏
I know when to use map(), but when should I use mapPartitions()? Which is faster? -- Deepak

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Michael Armbrust
You can also do this using a sequence of case classes (in the example stored in a tuple, though the outer container could also be a case class): case class MyRecord(name: String, location: String) val df = Seq((1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy", "Oakland".toDF("id", "peop
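A complete, runnable version of that snippet might look like the following sketch; the closing parentheses and the second column name were cut off in the archive, so those parts are assumptions:

    // assumes a SQLContext named sqlContext (Spark 1.3/1.4 style)
    import sqlContext.implicits._

    case class MyRecord(name: String, location: String)

    // "people" (the truncated column name is assumed) becomes an array-of-struct
    // column, which writes out as a nested Parquet schema.
    val df = Seq(
      (1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy", "Oakland")))
    ).toDF("id", "people")

    df.printSchema()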

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Roberto Congiu
I wrote a brief howto on building nested records in spark and storing them in parquet here: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/ 2015-06-23 16:12 GMT-07:00 Richard Catlin : > How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a > column? Is there an

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Thanks DB Tsai, it is very helpful. Cheers, Wei 2015-06-23 16:00 GMT-07:00 DB Tsai : > Please see the current version of code for better documentation. > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala > > Sincerely, > > DB

RE: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Richard Catlin
How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a column? Is there an example? Will this store as a nested parquet file? Thanks. Richard Catlin

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Ke

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
The regularization is handled after the objective function of data is computed. See https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala line 348 for L2. For L1, it's handled by OWLQN, so you don'

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
If you change to ```val numbers2 = numbers```, then it has the same problem. On Tue, Jun 23, 2015 at 2:54 PM, Ignacio Blasco wrote: > It seems that it doesn't happen in Scala API. Not exactly the same as in > python, but pretty close. > > https://gist.github.com/elnopintan/675968d2e4be68958df8 >

Re: Kafka createDirectStream ​issue

2015-06-23 Thread syepes
Yes, I have two clusters, one standalone and another using Mesos. Sebastian YEPES http://sebastian-yepes.com On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List] < ml-node+s1001560n23457...@n3.nabble.com> wrote: > Hi syepes, > Are u run the application in standalone mode? > Reg

Re: Kafka createDirectStream ​issue

2015-06-23 Thread drarse
Hi syepes, Are you running the application in standalone mode? Regards El 23/06/2015 22:48, "syepes [via Apache Spark User List]" < ml-node+s1001560n23456...@n3.nabble.com> escribió: > Hello, > > I ​am trying ​use the new Kafka ​consumer > ​​"KafkaUtils.createDirectStream"​ but I am having some issues m

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Hi DB Tsai, Thanks for your reply. I went through the source code of LinearRegression.scala. The algorithm minimizes square error L = 1/2n ||A weights - y||^2^. I cannot match this with the elasticNet loss function found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which is the s

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
It seems that it doesn't happen in Scala API. Not exactly the same as in python, but pretty close. https://gist.github.com/elnopintan/675968d2e4be68958df8 2015-06-23 23:11 GMT+02:00 Davies Liu : > I think it also happens in DataFrames API of all languages. > > On Tue, Jun 23, 2015 at 9:16 AM, Ig

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-23 Thread Guillaume Pitel
Hi, So I've done this "Node-centered accumulator", I've written a small piece about it : http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/ Hope it can help someone Guillaume 2015-06-18 15:17 GMT+02:00 Guillaume Pitel >:

Re: How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Richard Marscher
There should be no difference, assuming you don't use the intermediate stored RDD values you are creating (rdd1, rdd2) for anything else. In the first example it is still creating these intermediate RDD objects; you are just using them implicitly and not storing the value. It's also worth pointing
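A small illustration of that point (identity-style transformations, purely for shape):

    // Chained: the intermediate RDDs still exist, they just aren't bound to names.
    val out1 = rdd.map(_ + 1).map(_ * 2).map(_.toString)

    // Unchained: exactly the same lineage, with the intermediates named.
    val rdd1 = rdd.map(_ + 1)
    val rdd2 = rdd1.map(_ * 2)
    val out2 = rdd2.map(_.toString)

    // Both build the same DAG; nothing runs until an action such as out1.count(),
    // and neither version caches anything unless .cache()/.persist() is called.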

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Thanks a lot. It worked after keeping all versions the same: 1.2.0. On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das wrote: > Why are you mixing spark versions between streaming and core?? > Your core is 1.2.0 and streaming is 1.4.0. > > On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora > wrote: > >>

How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Ashish Soni
Hi All, what is the difference between the two forms below in terms of execution on a cluster with 1 or more worker nodes? rdd.map(...).map(...)...map(..) vs val rdd1 = rdd.map(...) val rdd2 = rdd1.map(...) val rdd3 = rdd2.map(...) Thanks, Ashish

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
I think it also happens in DataFrames API of all languages. On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco wrote: > That issue happens only in python dsl? > > El 23/6/2015 5:05 p. m., "Bob Corsaro" escribió: >> >> Thanks! The solution: >> >> https://gist.github.com/dokipen/018a1deeab668efdf455

Re: Spark standalone cluster - resource management

2015-06-23 Thread Nizan Grauer
I'm having 30G per machine This is the first (and only) job I'm trying to submit. So it's weird that for --total-executor-cores=20 it works, and for --total-executor-cores=4 it doesn't On Tue, Jun 23, 2015 at 10:46 PM, Igor Berman wrote: > probably there are already running jobs there > in addi

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora wrote: > It throws exception for WriteAheadLogUtils after excluding core and > streaming jar. > > Exception in thread "main" java.lang.NoClass

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Cody Koeninger
The exception $line49 is referring to a line of the spark shell. Have you tried it from an actual assembled job with spark-submit ? On Tue, Jun 23, 2015 at 3:48 PM, syepes wrote: > Hello, > > I ​am trying ​use the new Kafka ​consumer > ​​"KafkaUtils.createDirectStream"​ > but I am having so

Kafka createDirectStream ​issue

2015-06-23 Thread syepes
Hello, I am trying to use the new Kafka consumer "KafkaUtils.createDirectStream" but I am having some issues making it work. I have tried different versions of Spark v1.4.0 and branch-1.4 #8d6e363 and I am still getting the same strange exception "ClassNotFoundException: $line49.$read$$iwC$$i..

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
It throws exception for WriteAheadLogUtils after excluding core and streaming jar. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/util/WriteAheadLogUtils$ at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84) at org

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give accurate count for each batch, though for getting the rate you have to make sure that you streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the spark streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu shukl

kafka spark streaming with mesos

2015-06-23 Thread Bartek Radziszewski
Hey, I’m trying to run kafka spark streaming using mesos with following example: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.StringDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import org.apache.s

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Tathagata Das
Aaah, this could potentially be a major issue, as it may prevent metrics from a restarted streaming context from being published. Can you make it a JIRA? TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I'm running a program in Spark 1.4 where

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation, and the presence of multiple versions of Spark may throw off the classloaders in weird ways. So build the assembly marking those dependencies as scope=provided. On Tue, Jun 23, 20
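In a Maven build that would look roughly like this (the artifact versions below are placeholders matching the 1.2.0 mentioned in this thread; the point is only the provided scope on the Spark modules that ship with the cluster):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.2.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- spark-streaming-kafka is not part of the Spark installation,
         so it keeps the default compile scope and goes into the assembly -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>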

When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread commtech
Hi, I work at a large financial institution in New York. We're looking into Spark and trying to learn more about the deployment/use cases for real-time analytics with Spark. When would it be better to deploy standalone Spark versus Spark on top of a more comprehensive data management layer (Hadoop

Re: Spark standalone cluster - resource management

2015-06-23 Thread Igor Berman
Probably there are already running jobs there. In addition, memory is also a resource, so if you are running 1 application that took all your memory and you then try to run another application that asks for memory the cluster doesn't have, the second app won't run. so why are u

Re: Registering custom metrics

2015-06-23 Thread Otis Gospodnetić
Hi, Not sure if this will fit your needs, but if you are trying to collect+chart some metrics specific to your app, yet want to correlate them with what's going on in Spark, maybe Spark's performance numbers, you may want to send your custom metrics to SPM, so they can be visualized/analyzed/"dash

spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Hi, while using Spark Streaming (1.2) with Kafka, I am getting the below error; receivers are getting killed but jobs get scheduled at each stream interval. 15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82, ip(XX)): java.io.IOException: Failed to connect to ip(XXX

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static stuff is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora wrote: > I found the error so just posting on the list. > > It seems broadcast variables cannot be declared static. > If you do you get a null pointer exception. >

Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was able to successfully connect by building with the following: ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I see that in S

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers wrote: > just a heads up, i was doing some basic coding using DataFrame, Row, > StructType, etc. and i ended up with deadlocks in my sbt tests due to the > usage of > ScalaReflectionLock.synchronized in the spark sql code. > the

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora wrote: > btw. just for reference I have added the code in a gist: > > https://gist.githu

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?

Re: Limitations using SparkContext

2015-06-23 Thread Richard Marscher
Hi, can you detail the symptom further? Was it that only 12 requests were serviced and the other 440 timed out? I don't think that Spark is well suited for this kind of workload, or at least the way it is being represented. How long does a single request take Spark to complete? Even with fair sch

Limitations using SparkContext

2015-06-23 Thread daunnc
So the situation is the following: I have a spray server with a Spark context available (fair scheduling in a cluster mode, via spark-submit). There are some http urls which call a Spark RDD and collect information from accumulo / hdfs / etc (using the RDD). I noticed that there is a sort of limitation,

org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Koert Kuipers
Just a heads up: I was doing some basic coding using DataFrame, Row, StructType, etc. and I ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark sql code. The issue went away when I changed my tests to run consecutively...

Re: Help optimising Spark SQL query

2015-06-23 Thread Sabarish Sasidharan
64GB in parquet could be many billions of rows because of the columnar compression. And count distinct by itself is an expensive operation. This is not just on Spark, even on Presto/Impala, you would see performance dip with count distincts. And the cluster is not that powerful either. The one iss

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
I've only tried it in python On Tue, Jun 23, 2015 at 12:16 PM Ignacio Blasco wrote: > That issue happens only in python dsl? > El 23/6/2015 5:05 p. m., "Bob Corsaro" escribió: > >> Thanks! The solution: >> >> https://gist.github.com/dokipen/018a1deeab668efdf455 >> >> On Mon, Jun 22, 2015 at 4:3

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
That issue happens only in python dsl? El 23/6/2015 5:05 p. m., "Bob Corsaro" escribió: > Thanks! The solution: > > https://gist.github.com/dokipen/018a1deeab668efdf455 > > On Mon, Jun 22, 2015 at 4:33 PM Davies Liu wrote: > >> Right now, we can not figure out which column you referenced in >> `

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
I'm wondering if there is a real benefit to splitting my memory in two for the datanode/workers. Datanodes and the OS need memory to perform their business. I suppose there could be a loss of performance if they came to compete for memory with the worker(s). Any opinion? :-) -- View this message i

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I don't think I have explicitly check-pointed anywhere. Unless it's internal in some interface, I don't believe the application is checkpointed. Thanks for the suggestion though.. Nipun On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet wrote: > Are you using checkpointing? > > I had a similar

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies Liu wrote: > Right now, we can not figure out which column you referenced in > `select`, if there are multiple row with the same name in the joined > DataFrame (for example, two `va
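For anyone hitting the same ambiguity in Scala, the usual workaround is to alias the DataFrames before the join and qualify the column in the select; a rough sketch with made-up column names (not the code from the linked gist):

    // assumes a SQLContext named sqlContext for the $ column syntax
    import sqlContext.implicits._

    // df1 and df2 both contain a column named "value"
    val l = df1.as("l")
    val r = df2.as("r")

    val joined = l.join(r, $"l.id" === $"r.id")

    // qualify which "value" you mean instead of the ambiguous col("value")
    joined.select($"l.id", $"l.value", $"r.value")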

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Benjamin Fradet
Are you using checkpointing? I had a similar issue when recreating a streaming context from checkpoint as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote: > Hi, > > I have a spark streaming application where I need to access a model saved > in a HashMap. > I
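One workaround sometimes used for that case (a sketch, not from this thread) is to hold the broadcast in a lazily-initialized singleton, so it is re-created after recovery from a checkpoint rather than restored from one:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object ModelHolder {
      @volatile private var instance: Broadcast[Map[String, Double]] = _

      // the model is re-read and re-broadcast on first use after a restart
      def get(sc: SparkContext, model: => Map[String, Double]): Broadcast[Map[String, Double]] = {
        if (instance == null) synchronized {
          if (instance == null) instance = sc.broadcast(model)
        }
        instance
      }
    }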

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun A

[Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I have *no problems in running the same code with broadcast variables in the local installation.* However I get a *null pointer* *exception* when I deploy it on my spark test cluster. I have stored a mode

java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Juan Rodríguez Hortalá
Hi, I'm running a program in Spark 1.4 where several Spark Streaming contexts are created from the same Spark context. As pointed in https://spark.apache.org/docs/latest/streaming-programming-guide.html each Spark Streaming context is stopped before creating the next Spark Streaming context. The p

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
Thanks, will try this out and get back... On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das wrote: > Try adding the provided scopes > > > org.apache.spark > spark-core_2.10 > 1.4.0 > > *provided * > > org.apache.spark > spark-stre

Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
It all depends on what it is you need to do with the pages. If you’re just going to be collecting them then it’s really not much different than a groupByKey. If instead you’re looking to derive some other value from the series of pages then you could potentially partition by user id and run a m

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources; it is running very slowly. Is there a way to force Spark/YARN to only begin when it has the full set of resource

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread scar scar
Thank you Tathagata, It is great to know about this issue, but our problem is a little bit different. We have 3 nodes in our Spark cluster, and when the Zookeeper leader dies, the Master Spark gets shut down, and remains down, but a new master gets elected and loads the UI. I think if the problem

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
Thanks. Yes, unfortunately, they all need to be grouped. I guess I can partition the record by user id. However, I have millions of users, do you think partition by user id will help? Jianguo On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > You’re right

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I can not. I've already limited the number of cores to 10, so it gets 5 executors with 2 cores each... wt., 23.06.2015 o 13:45 użytkownik Akhil Das napisał: > Use *spark.cores.max* to limit the CPU per job, then you can easily > accommodate your third job also. > > Thanks > Best Regards > > On T

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła wrote: > I have set up small standalone cluster: 5 nodes, every node has 5GB of > memory an 8 cores. As you can see, node doe
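Set per application, that looks like this in the app configuration (values are only illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app-1")
      .set("spark.cores.max", "10")        // total cores this app may take cluster-wide
      .set("spark.executor.memory", "2g")  // per-executor memory, illustrative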

Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, the nodes don't have much RAM. I have 2 streaming apps; the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run o

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-23 Thread Matthew Johnson
Awesome, thanks Pawan – for now I’ll give spark-notebook a go until Zeppelin catches up to Spark 1.4 (and when Zeppelin has a binary release – my PC doesn’t seem too happy about building a Node.js app from source). Thanks for the detailed instructions!! *From:* pawan kumar [mailto:pkv...@gmail

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Dmitry Goldenberg
Yes, Akhil. We already have an origination timestamp in the body of the message when we send it. But we can't guarantee the network speed nor a precise enough synchronization of clocks across machines. Pulling the timestamp from Kafka itself would be a step forward although the broker is most l

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
Thanks for the suggestions everyone, appreciate the advice. I tried replacing DISTINCT for the nested GROUP BY, running on 1.4 instead of 1.3, replacing the date casts with a "between" operation on the corresponding long constants instead and changing COUNT(*) to COUNT(1). None of these seem to ha

How to disable parquet schema merging in 1.4?

2015-06-23 Thread Rex Xiong
I remember in a previous PR, schema merging can be disabled by setting spark.sql.hive.convertMetastoreParquet.mergeSchema to false. But in 1.4 release, I don't see this config anymore, is there a new way to do it? Thanks

Re: Velox Model Server

2015-06-23 Thread Sean Owen
Yes, and typically needs are <100ms. Now imagine even 10 concurrent requests. My experience has been that this approach won't nearly scale. The best you could probably do is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for yo

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Then apply a transformation over the dstream to pull those required information. :) Thanks Best Regards On Tue, Jun 23, 2015 at 3:22 PM, anshu shukla wrote: > Thanks alot , > > Because i just want to log timestamp and unique message id and not full > RDD . > > On Tue, Jun 23, 2015 at 12:41 PM,

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread anshu shukla
Thanks a lot. Because I just want to log the timestamp and unique message id and not the full RDD. On Tue, Jun 23, 2015 at 12:41 PM, Akhil Das wrote: > Why don't you do a normal .saveAsTextFiles? > > Thanks > Best Regards > > On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla > wrote: > >> Thanx for rep

RE: Web UI vs History Server Bugs

2015-06-23 Thread Evo Eftimov
Probably your application crashed or was terminated without invoking the stop method of the Spark context - in such cases it doesn't create the empty flag file which apparently tells the history server that it can safely show the log data - simply go to some of the other dirs of the history server t

Re: s3 - Can't make directory for path

2015-06-23 Thread Steve Loughran
> On 23 Jun 2015, at 00:09, Danny wrote: > > hi, > > have you tested > > "s3://ww-sandbox/name_of_path/" instead of "s3://ww-sandbox/name_of_path" > + make sure the bucket is there already. Hadoop s3 clients don't currently handle that step > or have you test to add your file extension wi

Calculating tuple count /input rate with time

2015-06-23 Thread anshu shukla
I am calculating the input rate using the following logic, and I think this foreachRDD always runs on the driver (printlns are seen on the driver). 1- Is there any other way to do this at lower cost? 2- Will this give me the correct count for the rate? //code - inputStream.foreachRDD(new Function, Void
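A Scala sketch of that approach for comparison; the count itself executes as a distributed job, only the println of its result runs on the driver:

    inputStream.foreachRDD { (rdd, time) =>
      val count = rdd.count()                          // distributed action
      println(s"batch $time: $count records received") // driver-side logging only
    }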

RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin, If possible, can you please provide a part of the code? Or perhaps try with the ut in RandomForestSuite to see if the issue repros. Regards, yuhao -Original Message- From: samsudhin [mailto:samsud...@pigstick.com] Sent: Tuesday, June 23, 2015 2:14 PM To: user@spark.apac

How to figure out how many records received by individual receiver

2015-06-23 Thread bit1...@163.com
Hi, I am using Spark 1.3.1 and have 2 receivers. On the web UI, I can only see the total records received by these 2 receivers, but I can't figure out the records received by each individual receiver. Not sure whether that information is shown on the UI in Spark 1.4. bit1...@163.com

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
This could be because of some subtle change in the classloaders used by executors. I think there has been issues in the past with libraries that use Class.forName to find classes by reflection. Because the executors load classes dynamically using custom class loaders, libraries that use Class.forNa

Re: Re: What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, Akhil, Thank you for the explanation! bit1...@163.com From: Akhil Das Date: 2015-06-23 16:29 To: bit1...@163.com CC: user Subject: Re: What does [Stage 0:> (0 + 2) / 2] mean on the console Well, you could that (Stage information) is an ASCII representation of the WebUI (running on port 4

Re: What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (Stage information) is an ASCII representation of the WebUI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are left with 2 threads to process ((0 + 2) <-- This 2 is your 2 threads.) And the
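In other words, the setup being discussed is roughly the following (topic names, group id and ZooKeeper address are illustrative, not from the original snippet):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("local[4]").setAppName("two-receivers")
    val ssc = new StreamingContext(conf, Seconds(5))

    val zkQuorum = "localhost:2181"   // illustrative
    val groupId = "g1"                // illustrative

    // each receiver-based stream permanently occupies one of the 4 local threads...
    val stream1 = KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("topicA" -> 1))
    val stream2 = KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("topicB" -> 1))
    // ...which leaves 4 - 2 = 2 threads to actually process the received batches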

What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, I have a spark streaming application that runs locally with two receivers, some code snippet is as follows: conf.setMaster("local[4]") //RPC Log Streaming val rpcStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, consumerParams, topicRPC, StorageLevel.MEMOR

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
Maybe while producing the messages, you can make it a KeyedMessage with the timestamp as key, and on the consumer end you can easily identify the key (which will be the timestamp) from the message. If the network is fast enough, then I think there could be a small millisecond lag. Thanks Best R

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD hadoopFile = sc.hadoopFile( "/sigmoid", DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 wrote: > Hello, everyone! I'm new in spark. I have already written programs i
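A self-contained Scala variant of that call, using the stock TextInputFormat (a custom InputFormat like the quoted DataInputFormat would slot into the same three class parameters):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // old-API InputFormat read: (byte offset, line) pairs from /sigmoid
    val lines = sc.hadoopFile("/sigmoid",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    lines.map { case (_, text) => text.toString }.take(5).foreach(println)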

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste y

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Murthy Chelankuri
Yes, in Spark standalone mode with the master URL. Jars are copied to the executors and the application is running fine, but it's failing at some point when Kafka is trying to load the classes using some reflection mechanisms for loading the Encoder and Partitioner classes. Here are my findings so fa
