Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
So you have Kafka in your classpath in your Java application, where you are creating the SparkContext with the Spark standalone master URL, right? The recommended way of submitting Spark applications to any cluster is using spark-submit. See https://spark.apache.org/docs/latest/submitting-applicati

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread Tathagata Das
Maybe this is a known issue with spark streaming and master web ui. Disable event logging, and it should be fine. https://issues.apache.org/jira/browse/SPARK-6270 On Mon, Jun 22, 2015 at 8:54 AM, scar scar wrote: > Sorry I was on vacation for a few days. Yes, it is on. This is what I have > in

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
Queue stream does not support driver checkpoint recovery since the RDDs in the queue are arbitrarily generated by the user and it's hard for Spark Streaming to keep track of the data in the RDDs (that's necessary for recovering from checkpoint). Anyway, queue stream is meant for testing and development,

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla wrote: > Thanx for reply !! > > YES , Either it should write on any machine of cluster or Can you please > help me ... that how to do this . Previously i was using writing using
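For reference, a minimal sketch of that suggestion, assuming a DStream[String] named lines (the stream name, record layout and output path are placeholders, not from the thread):

    // Each batch is written as a directory of part files named
    // <prefix>-<batch time>.<suffix>, so no manual FileWriter is needed.
    lines.map(record => s"${System.currentTimeMillis()},$record")
         .saveAsTextFiles("hdfs:///tmp/stream-output", "txt")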

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Murthy Chelankuri
Yes, in spark standalone mode with the master URL. Jars are copied to the executor and the application is running fine, but it's failing at some point when Kafka is trying to load the classes using some reflection mechanisms for loading the Encoder and Partitioner classes. Here are my findings so fa

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste y

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD hadoopFile = sc.hadoopFile( "/sigmoid", DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 wrote: > Hello, everyone! I'm new in spark. I have already written programs i
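For comparison, a sketch of the same call through the Scala API, using the stock Hadoop TextInputFormat in place of the custom DataInputFormat from the thread (the path is illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Returns an RDD[(LongWritable, Text)]; convert values to String before reuse,
    // since Hadoop recycles the Writable objects.
    val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/sigmoid")
    val lines = records.map { case (_, text) => text.toString }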

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
Maybe while producing the messages, you can make it a KeyedMessage with the timestamp as the key, and on the consumer end you can easily identify the key (which will be the timestamp) from the message. If the network is fast enough, then I think there could be a small millisecond lag. Thanks Best R
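A rough sketch of that idea with the Kafka 0.8 producer API (broker address, topic and payload are placeholders); the send time goes into the message key:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))

    // Key = producer-side timestamp, value = the actual payload.
    producer.send(new KeyedMessage[String, String]("mytopic",
      System.currentTimeMillis().toString, "payload"))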

What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, I have a spark streaming application that runs locally with two receivers, some code snippet is as follows: conf.setMaster("local[4]") //RPC Log Streaming val rpcStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, consumerParams, topicRPC, StorageLevel.MEMOR

Re: What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (Stage information) is an ASCII representation of the WebUI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you have 2 receivers, you are left with 2 threads to process ((0 + 2) <-- This 2 is your 2 threads.) And the

Re: Re: What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, Akhil, Thank you for the explanation! bit1...@163.com From: Akhil Das Date: 2015-06-23 16:29 To: bit1...@163.com CC: user Subject: Re: What does [Stage 0:> (0 + 2) / 2] mean on the console Well, you could that (Stage information) is an ASCII representation of the WebUI (running on port 4

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
This could be because of some subtle change in the classloaders used by executors. I think there have been issues in the past with libraries that use Class.forName to find classes by reflection. Because the executors load classes dynamically using custom class loaders, libraries that use Class.forNa

How to figure out how many records received by individual receiver

2015-06-23 Thread bit1...@163.com
Hi, I am using Spark 1.3.1 and have 2 receivers. On the web UI, I can only see the total records received by these 2 receivers, but I can't figure out the records received by each individual receiver. Not sure whether this information is shown on the UI in Spark 1.4. bit1...@163.com

RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin, If possible, can you please provide a part of the code? Or perhaps try with the ut in RandomForestSuite to see if the issue repros. Regards, yuhao -Original Message- From: samsudhin [mailto:samsud...@pigstick.com] Sent: Tuesday, June 23, 2015 2:14 PM To: user@spark.apac

Calculating tuple count /input rate with time

2015-06-23 Thread anshu shukla
I am calculating the input rate using the following logic, and I think this foreachRDD always runs on the driver (printlns are seen on the driver). 1- Is there any other way to do this at lower cost? 2- Will this give me the correct count for the rate? //code - inputStream.foreachRDD(new Function, Void
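A minimal Scala sketch of that per-batch counting pattern (the stream name follows the snippet above, everything else is illustrative); count() itself runs as a distributed job, and only the resulting number and the println happen on the driver:

    inputStream.foreachRDD { (rdd, time) =>
      val count = rdd.count()
      println(s"Batch at $time: $count records")
    }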

Re: s3 - Can't make directory for path

2015-06-23 Thread Steve Loughran
> On 23 Jun 2015, at 00:09, Danny wrote: > > hi, > > have you tested > > "s3://ww-sandbox/name_of_path/" instead of "s3://ww-sandbox/name_of_path" > + make sure the bucket is there already. Hadoop s3 clients don't currently handle that step > or have you test to add your file extension wi

RE: Web UI vs History Server Bugs

2015-06-23 Thread Evo Eftimov
Probably your application has crashed or was terminated without invoking the stop method of the spark context - in such cases it doesn't create the empty flag file which apparently tells the history server that it can safely show the log data - simply go to some of the other dirs of the history server t

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread anshu shukla
Thanks a lot, because I just want to log the timestamp and unique message id and not the full RDD. On Tue, Jun 23, 2015 at 12:41 PM, Akhil Das wrote: > Why don't you do a normal .saveAsTextFiles? > > Thanks > Best Regards > > On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla > wrote: > >> Thanx for rep

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Then apply a transformation over the dstream to pull out the required information. :) Thanks Best Regards On Tue, Jun 23, 2015 at 3:22 PM, anshu shukla wrote: > Thanks alot , > > Because i just want to log timestamp and unique message id and not full > RDD . > > On Tue, Jun 23, 2015 at 12:41 PM,

Re: Velox Model Server

2015-06-23 Thread Sean Owen
Yes, and typically needs are <100ms. Now imagine even 10 concurrent requests. My experience has been that this approach won't nearly scale. The best you could probably do is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for yo

How to disable parquet schema merging in 1.4?

2015-06-23 Thread Rex Xiong
I remember in a previous PR, schema merging can be disabled by setting spark.sql.hive.convertMetastoreParquet.mergeSchema to false. But in 1.4 release, I don't see this config anymore, is there a new way to do it? Thanks

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
Thanks for the suggestions everyone, appreciate the advice. I tried replacing DISTINCT for the nested GROUP BY, running on 1.4 instead of 1.3, replacing the date casts with a "between" operation on the corresponding long constants instead and changing COUNT(*) to COUNT(1). None of these seem to ha

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Dmitry Goldenberg
Yes, Akhil. We already have an origination timestamp in the body of the message when we send it. But we can't guarantee the network speed nor a precise enough synchronization of clocks across machines. Pulling the timestamp from Kafka itself would be a step forward although the broker is most l

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-23 Thread Matthew Johnson
Awesome, thanks Pawan – for now I’ll give spark-notebook a go until Zeppelin catches up to Spark 1.4 (and when Zeppelin has a binary release – my PC doesn’t seem too happy about building a Node.js app from source). Thanks for the detailed instructions!! *From:* pawan kumar [mailto:pkv...@gmail

Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, the nodes don't have much RAM. I have 2 streaming apps, the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run o

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła wrote: > I have set up small standalone cluster: 5 nodes, every node has 5GB of > memory an 8 cores. As you can see, node doe
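For example, a sketch of capping one app's share of the cluster at submit time (the app name and numbers are illustrative, not from the thread):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app-1")
      .set("spark.cores.max", "8")        // total cores claimed across the whole cluster
      .set("spark.executor.memory", "2g") // per-executor memory

Leaving some cores unclaimed this way is what makes room for additional jobs on a standalone cluster.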

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I can not. I've already limited the number of cores to 10, so it gets 5 executors with 2 cores each... Tue, 23.06.2015 at 13:45, Akhil Das wrote: > Use *spark.cores.max* to limit the CPU per job, then you can easily > accommodate your third job also. > > Thanks > Best Regards > > On T

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
Thanks. Yes, unfortunately, they all need to be grouped. I guess I can partition the records by user id. However, I have millions of users; do you think partitioning by user id will help? Jianguo On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > You’re right

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread scar scar
Thank you Tathagata, it is great to know about this issue, but our problem is a little bit different. We have 3 nodes in our Spark cluster, and when the Zookeeper leader dies, the Spark Master gets shut down and remains down, but a new master gets elected and loads the UI. I think if the problem

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources; it is running very slowly. Is there a way to force Spark/YARN to only begin when it has the full set of resource

Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
It all depends on what it is you need to do with the pages. If you’re just going to be collecting them then it’s really not much different than a groupByKey. If instead you’re looking to derive some other value from the series of pages then you could potentially partition by user id and run a m
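A rough sketch of that suggestion, assuming an RDD[(String, String)] of (userId, url) pairs named pages (both the name and record shape are hypothetical): hash-partition by user id so one user's pages land in a single partition, then derive the per-user value inside mapPartitions:

    import org.apache.spark.HashPartitioner

    val perUser = pages
      .partitionBy(new HashPartitioner(200))
      .mapPartitions { iter =>
        // Every record for a given user is in this partition, so grouping stays
        // local; deriving the number of distinct pages per user is just an example.
        iter.toSeq
          .groupBy { case (user, _) => user }
          .iterator
          .map { case (user, visits) => (user, visits.map(_._2).distinct.size) }
      }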

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
Thanks, will try this out and get back... On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das wrote: > Try adding the provided scopes > > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-core_2.10</artifactId> > <version>1.4.0</version> > <scope>provided</scope> > </dependency> > <dependency> > <groupId>org.apache.spark</groupId> > <artifactId>spark-stre

java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Juan Rodríguez Hortalá
Hi, I'm running a program in Spark 1.4 where several Spark Streaming contexts are created from the same Spark context. As pointed in https://spark.apache.org/docs/latest/streaming-programming-guide.html each Spark Streaming context is stopped before creating the next Spark Streaming context. The p

[Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I have *no problems in running the same code with broadcast variables in the local installation.* However I get a *null pointer* *exception* when I deploy it on my spark test cluster. I have stored a mode

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun A

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Benjamin Fradet
Are you using checkpointing? I had a similar issue when recreating a streaming context from checkpoint as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote: > Hi, > > I have a spark streaming application where I need to access a model saved > in a HashMap. > I
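One workaround often suggested for this (a sketch, not something proposed in this thread): keep the broadcast in a lazily initialized singleton so it is rebuilt after the driver recovers from a checkpoint rather than restored with it; the model type and contents here are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object ModelBroadcast {
      @volatile private var instance: Broadcast[Map[String, Double]] = _

      def getInstance(sc: SparkContext): Broadcast[Map[String, Double]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              // placeholder model; load the real one here
              instance = sc.broadcast(Map("feature" -> 1.0))
            }
          }
        }
        instance
      }
    }

Inside foreachRDD one would then call ModelBroadcast.getInstance(rdd.sparkContext) instead of capturing the broadcast in the closure.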

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies Liu wrote: > Right now, we can not figure out which column you referenced in > `select`, if there are multiple row with the same name in the joined > DataFrame (for example, two `va

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I don't think I have explicitly check-pointed anywhere. Unless it's internal in some interface, I don't believe the application is checkpointed. Thanks for the suggestion though.. Nipun On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet wrote: > Are you using checkpointing? > > I had a similar

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
I'm wondering if there is a real benefit for splitting my memory in two for the datanode/workers. Datanodes and the OS need memory to perform their business. I suppose there could be a loss of performance if they came to compete for memory with the worker(s). Any opinion? :-) -- View this message i

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
Does that issue happen only in the Python DSL? On 23/6/2015 5:05 p.m., "Bob Corsaro" wrote: > Thanks! The solution: > > https://gist.github.com/dokipen/018a1deeab668efdf455 > > On Mon, Jun 22, 2015 at 4:33 PM Davies Liu wrote: > >> Right now, we can not figure out which column you referenced in >> `

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
I've only tried it in Python. On Tue, Jun 23, 2015 at 12:16 PM Ignacio Blasco wrote: > That issue happens only in python dsl? > On 23/6/2015 5:05 p.m., "Bob Corsaro" wrote: > >> Thanks! The solution: >> >> https://gist.github.com/dokipen/018a1deeab668efdf455 >> >> On Mon, Jun 22, 2015 at 4:3

Re: Help optimising Spark SQL query

2015-06-23 Thread Sabarish Sasidharan
64GB in parquet could be many billions of rows because of the columnar compression. And count distinct by itself is an expensive operation. This is not just on Spark, even on Presto/Impala, you would see performance dip with count distincts. And the cluster is not that powerful either. The one iss

org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Koert Kuipers
just a heads up, I was doing some basic coding using DataFrame, Row, StructType, etc. and I ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark sql code. the issue went away when I changed my tests to run consecutively...
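The workaround Koert describes can be expressed as a build.sbt setting (a sketch, using sbt 0.13-style syntax):

    // Run the test suites in this project one at a time instead of in parallel,
    // so multiple suites don't contend on ScalaReflectionLock concurrently.
    parallelExecution in Test := false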

Limitations using SparkContext

2015-06-23 Thread daunnc
So the situation is the following: I've got a spray server with a Spark context available (fair scheduling in a cluster mode, via spark-submit). There are some http urls which call Spark RDDs and collect information from accumulo / hdfs / etc (using the RDDs). I noticed that there is a sort of limitation,

Re: Limitations using SparkContext

2015-06-23 Thread Richard Marscher
Hi, can you detail the symptom further? Was it that only 12 requests were serviced and the other 440 timed out? I don't think that Spark is well suited for this kind of workload, or at least the way it is being represented. How long does a single request take Spark to complete? Even with fair sch

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora wrote: > btw. just for reference I have added the code in a gist: > > https://gist.githu

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers wrote: > just a heads up, i was doing some basic coding using DataFrame, Row, > StructType, etc. and i ended up with deadlocks in my sbt tests due to the > usage of > ScalaReflectionLock.synchronized in the spark sql code. > the

Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was able to successfully connect by building with the following: ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I see that in S

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static stuff is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora wrote: > I found the error so just posting on the list. > > It seems broadcast variables cannot be declared static. > If you do you get a null pointer exception. >

spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Hi, while using spark streaming (1.2) with kafka, I am getting the below error; receivers are getting killed but jobs get scheduled at each stream interval. 15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82, ip(XX)): java.io.IOException: Failed to connect to ip(XXX

Re: Registering custom metrics

2015-06-23 Thread Otis Gospodnetić
Hi, Not sure if this will fit your needs, but if you are trying to collect+chart some metrics specific to your app, yet want to correlate them with what's going on in Spark, maybe Spark's performance numbers, you may want to send your custom metrics to SPM, so they can be visualized/analyzed/"dash

Re: Spark standalone cluster - resource management

2015-06-23 Thread Igor Berman
Probably there are already running jobs there. In addition, memory is also a resource, so if you are running one application that took all your memory and then you are trying to run another application that asks for memory the cluster doesn't have, then the second app won't be running, so why are u

When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread commtech
Hi, I work at a large financial institution in New York. We're looking into Spark and trying to learn more about the deployment/use cases for real-time analytics with Spark. When would it be better to deploy standalone Spark versus Spark on top of a more comprehensive data management layer (Hadoop

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation, and the presence of multiple versions of Spark may throw off the classloaders in weird ways. So make the assembly marking those dependencies as scope=provided. On Tue, Jun 23, 20
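As a sketch, in an sbt build the same advice looks roughly like this (versions chosen to match the 1.2.0 install mentioned later in the thread; adjust to your cluster):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.2.0" % "provided",
      // the Kafka connector is not part of the Spark install, so it stays in the assembly
      "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
    )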

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Tathagata Das
Aaah, this could be a potentially major issue as it may prevent metrics from a restarted streaming context from being published. Can you make it a JIRA? TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I'm running a program in Spark 1.4 where

kafka spark streaming with mesos

2015-06-23 Thread Bartek Radziszewski
Hey, I’m trying to run kafka spark streaming using mesos with following example: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.StringDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import org.apache.s

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give accurate count for each batch, though for getting the rate you have to make sure that you streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the spark streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu shukl

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
It throws exception for WriteAheadLogUtils after excluding core and streaming jar. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/util/WriteAheadLogUtils$ at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84) at org

Kafka createDirectStream issue

2015-06-23 Thread syepes
Hello, I am trying to use the new Kafka consumer "KafkaUtils.createDirectStream" but I am having some issues making it work. I have tried different versions of Spark v1.4.0 and branch-1.4 #8d6e363 and I am still getting the same strange exception "ClassNotFoundException: $line49.$read$$iwC$$i..

Re: Kafka createDirectStream issue

2015-06-23 Thread Cody Koeninger
The exception $line49 is referring to a line of the spark shell. Have you tried it from an actual assembled job with spark-submit ? On Tue, Jun 23, 2015 at 3:48 PM, syepes wrote: > Hello, > > I am trying to use the new Kafka consumer > "KafkaUtils.createDirectStream" > but I am having so

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora wrote: > It throws exception for WriteAheadLogUtils after excluding core and > streaming jar. > > Exception in thread "main" java.lang.NoClass

Re: Spark standalone cluster - resource management

2015-06-23 Thread Nizan Grauer
I'm having 30G per machine This is the first (and only) job I'm trying to submit. So it's weird that for --total-executor-cores=20 it works, and for --total-executor-cores=4 it doesn't On Tue, Jun 23, 2015 at 10:46 PM, Igor Berman wrote: > probably there are already running jobs there > in addi

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
I think it also happens in DataFrames API of all languages. On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco wrote: > That issue happens only in python dsl? > > El 23/6/2015 5:05 p. m., "Bob Corsaro" escribió: >> >> Thanks! The solution: >> >> https://gist.github.com/dokipen/018a1deeab668efdf455

How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Ashish Soni
Hi All , What is difference between below in terms of execution to the cluster with 1 or more worker node rdd.map(...).map(...)...map(..) vs val rdd1 = rdd.map(...) val rdd2 = rdd1.map(...) val rdd3 = rdd2.map(...) Thanks, Ashish

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Thanks a lot. It worked after keeping all versions the same (1.2.0). On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das wrote: > Why are you mixing spark versions between streaming and core?? > Your core is 1.2.0 and streaming is 1.4.0. > > On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora > wrote: > >>

Re: How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Richard Marscher
There should be no difference assuming you don't use the intermediately stored rdd values you are creating for anything else (rdd1, rdd2). In the first example it still is creating these intermediate rdd objects you are just using them implicitly and not storing the value. It's also worth pointing

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-23 Thread Guillaume Pitel
Hi, So I've done this "Node-centered accumulator", I've written a small piece about it : http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/ Hope it can help someone Guillaume 2015-06-18 15:17 GMT+02:00 Guillaume Pitel >:

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
It seems that it doesn't happen in Scala API. Not exactly the same as in python, but pretty close. https://gist.github.com/elnopintan/675968d2e4be68958df8 2015-06-23 23:11 GMT+02:00 Davies Liu : > I think it also happens in DataFrames API of all languages. > > On Tue, Jun 23, 2015 at 9:16 AM, Ig

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Hi DB Tsai, Thanks for your reply. I went through the source code of LinearRegression.scala. The algorithm minimizes square error L = 1/2n ||A weights - y||^2^. I cannot match this with the elasticNet loss function found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which is the s

Re: Kafka createDirectStream issue

2015-06-23 Thread drarse
Hi syepes, Are you running the application in standalone mode? Regards On 23/06/2015 22:48, "syepes [via Apache Spark User List]" < ml-node+s1001560n23456...@n3.nabble.com> wrote: > Hello, > > I am trying to use the new Kafka consumer > "KafkaUtils.createDirectStream" but I am having some issues m

Re: Kafka createDirectStream issue

2015-06-23 Thread syepes
Yes, I have two clusters, one standalone and another using Mesos. Sebastian YEPES http://sebastian-yepes.com On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List] < ml-node+s1001560n23457...@n3.nabble.com> wrote: > Hi syepes, > Are u run the application in standalone mode? > Reg

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
If you change it to ```val numbers2 = numbers```, then it has the same problem On Tue, Jun 23, 2015 at 2:54 PM, Ignacio Blasco wrote: > It seems that it doesn't happen in Scala API. Not exactly the same as in > python, but pretty close. > > https://gist.github.com/elnopintan/675968d2e4be68958df8 >
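For reference, one way to sidestep the ambiguity in the Scala DataFrame API (a sketch; it assumes a DataFrame named numbers with columns "id" and "value", both hypothetical): rename the columns on one side before the self-join so every reference stays unique:

    val right  = numbers.toDF("id_b", "value_b")            // same data, new column names
    val joined = numbers.join(right, numbers("id") === right("id_b"))
    joined.select("id", "value", "value_b").show()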

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
The regularization is handled after the objective function of data is computed. See https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala line 348 for L2. For L1, it's handled by OWLQN, so you don'

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Ke

RE: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Richard Catlin
How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a column? Is there an example? Will this store as a nested parquet file? Thanks. Richard Catlin

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Thanks DB Tsai, it is very helpful. Cheers, Wei 2015-06-23 16:00 GMT-07:00 DB Tsai : > Please see the current version of code for better documentation. > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala > > Sincerely, > > DB

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Roberto Congiu
I wrote a brief howto on building nested records in spark and storing them in parquet here: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/ 2015-06-23 16:12 GMT-07:00 Richard Catlin : > How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a > column? Is there an

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Michael Armbrust
You can also do this using a sequence of case classes (in the example stored in a tuple, though the outer container could also be a case class): case class MyRecord(name: String, location: String) val df = Seq((1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy", "Oakland".toDF("id", "peop
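For completeness, a full version of that snippet (class, column names and output path are illustrative; requires import sqlContext.implicits._ in 1.4):

    case class MyRecord(name: String, location: String)

    val df = Seq(
      (1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy", "Oakland")))
    ).toDF("id", "people")

    // "people" becomes an array<struct<name:string,location:string>> column,
    // and writing it out keeps the nesting in the Parquet schema.
    df.write.parquet("/tmp/nested")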

map V mapPartitions

2015-06-23 Thread ๏̯͡๏
I know when to use map(), but when should I use mapPartitions()? Which is faster? -- Deepak

Re: map V mapPartitions

2015-06-23 Thread Holden Karau
I think one of the primary cases where mapPartitions is useful is if you are going to be doing any setup work that can be re-used when processing each element; this way the setup work only needs to be done once per partition (for example creating an instance of jodatime). Both map and mapPartition
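A short sketch of that setup-reuse pattern, following the joda-time example (lines is a hypothetical RDD[String] of timestamps in a fixed layout):

    import org.joda.time.format.DateTimeFormat

    val parsed = lines.mapPartitions { iter =>
      // The formatter is constructed once per partition, not once per record.
      val fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
      iter.map(line => fmt.parseDateTime(line))
    }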

Re: Un-persist RDD in a loop

2015-06-23 Thread Tom Hubregtsen
I believe that as you are not persisting anything into the memory space defined by spark.storage.memoryFraction you also have nothing to clear from this area using the unpersist. FYI: The data will be kept in the OS-buffer/on disk at the point of the reduce (as this involves a wide dependency ->

[no subject]

2015-06-23 Thread ๏̯͡๏
I have a Spark job that has 7 stages. The first 3 stages complete and the fourth stage begins (joins two RDDs). This stage has multiple task failures, all with the below exception. Multiple tasks (100s of them) get the same exception with different hosts. How can all the hosts suddenly stop responding wh

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
A rough estimate of the worst case memory requirement for driver is about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put 2 at the beginning because the previous centers are still in memory while receiving new center updates. -Xiangrui On Fri, Jun 19, 2015 at 9:02 AM, Rogers Jeffrey w
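As a purely hypothetical illustration of that estimate: with k = 1,000 clusters, runs = 1, 10,000 features and 200 partitions, the driver would need on the order of 2 * 1,000 * 1 * 10,000 * 200 * 8 bytes ≈ 32 GB, which is why lowering runs, k, or the number of partitions brings the requirement down quickly.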

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-23 Thread Xiangrui Meng
This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct. We can still follow FP-Growth's steps: 1. find frequent items 2. filter transactions and keep only frequent items 3. do NOT order by frequency 4. use suffix to partition the transactions (whethe

Re: Issue running Spark 1.4 on Yarn

2015-06-23 Thread Matt Kapilevich
Hi Kevin I never did. I checked for free space in the root partition, don't think this was an issue. Now that 1.4 is officially out I'll probably give it another shot. On Jun 22, 2015 4:28 PM, "Kevin Markey" wrote: > Matt: Did you ever resolve this issue? When running on a cluster or > pseudoc

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-23 Thread Xiangrui Meng
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your dataset? 2. If you monitor the progress in the WebUI, how

Re: How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-23 Thread Xiangrui Meng
Please check the input path to your test data, and call `.count()` and see whether there are records in it. -Xiangrui On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue wrote: > Hey, > > I am testing the StreamingLinearRegressionWithSGD following the tutorial. > > > It works, but I could not output the p

Re: which mllib algorithm for large multi-class classification?

2015-06-23 Thread Xiangrui Meng
We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle it efficiently, we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an "array larger than MaxInt" e

Re: mutable vs. pure functional implementation - StatCounter

2015-06-23 Thread Xiangrui Meng
Creating millions of temporary (immutable) objects is bad for performance. It should be simple to do a micro-benchmark locally. -Xiangrui On Mon, Jun 22, 2015 at 7:25 PM, mzeltser wrote: > Using StatCounter as an example, I'd like to understand if "pure" functional > implementation would be more

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Tathagata Das
Please run it in your own application and not in the spark shell. I see that you are trying to stop the Spark context and create a new StreamingContext. That will lead to unexpected issues like the one you are seeing. Please make a standalone SBT/Maven app for Spark Streaming. On Tue, Jun 23, 2015 at 3:43

flume sinks supported by spark streaming

2015-06-23 Thread Hafiz Mujadid
Hi! I want to integrate Flume with Spark Streaming. I want to know which sink types of Flume are supported by Spark Streaming. I saw one example using avroSink. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/flume-sinks-supported-by-spark-streami

Re: Spark standalone cluster - resource management

2015-06-23 Thread canan chen
Check the available resources you have (cpu cores & memory ) on master web ui. The log you see means the job can't get any resources. On Wed, Jun 24, 2015 at 5:03 AM, Nizan Grauer wrote: > I'm having 30G per machine > > This is the first (and only) job I'm trying to submit. So it's weird that

Re: flume sinks supported by spark streaming

2015-06-23 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid wrote: > Hi! > > > I want to integrate flume with spark streaming. I want to know which sink > type of flume are suppo

Re: map V mapPartitions

2015-06-23 Thread canan chen
One example is that you'd like to set up a JDBC connection for each partition and share this connection across the records. mapPartitions is much more like the paradigm of a mapper in MapReduce. In the mapper of MapReduce, you have a setup method to do any initialization stuff before processing the spl

Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is the correct question. Spark can be deployed on different cluster manager frameworks like standalone, YARN & Mesos. Spark can't run without these cluster manager frameworks; that means Spark depends on a cluster manager framework. And the data management layer is the upstream

Re: Spark launching without all of the requested YARN resources

2015-06-23 Thread canan chen
Why do you want it to wait until all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources. On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra wrote: > Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark

Re: Yarn application ID for Spark job on Yarn

2015-06-23 Thread canan chen
I don't think there is YARN-related stuff to access in Spark. Spark doesn't depend on YARN. BTW, why do you want the YARN application id? On Mon, Jun 22, 2015 at 11:45 PM, roy wrote: > Hi, > > Is there a way to get Yarn application ID inside spark application, when > running spark Job on YARN

when cached RDD will unpersist its data

2015-06-23 Thread bit1...@163.com
I am kind of confused about when a cached RDD will unpersist its data. I know we can explicitly unpersist it with RDD.unpersist, but can it be unpersisted automatically by the Spark framework? Thanks. bit1...@163.com

Re: when cached RDD will unpersist its data

2015-06-23 Thread eric wong
In the case that memory cannot hold all the cached RDDs, the BlockManager will evict some older blocks to make room for new RDD blocks. Hope that will be helpful. 2015-06-24 13:22 GMT+08:00 bit1...@163.com : > I am kind of consused about when cached RDD will unpersist its data. I > know we can explicitl
