Re: in joins, does one side stream?

2015-09-19 Thread Rishitesh Mishra
Got it, thanks Reynold. On 20 Sep 2015 07:08, "Reynold Xin" wrote: > The RDDs themselves are not materialized, but the implementations can > materialize. > > E.g. in cogroup (which is used by RDD.join), it materializes all the data > during grouping. > > In SQL/DataFrame join, depending on the joi

Re: Using Spark for portfolio manager app

2015-09-19 Thread Jörn Franke
I think generally the way forward would be to put aggregate statistics into an external storage (e.g. HBase) - it should not have that much influence on latency. You will probably need it anyway if you need to store historical information. With regard to deltas - always a tricky topic. You may want to work wit
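
A minimal sketch of what pushing per-batch aggregates to external storage could look like from Spark Streaming, assuming the HBase 1.x client API; the table name, column family, and key layout are illustrative, not from the thread:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.streaming.dstream.DStream

    def persistAggregates(aggregates: DStream[(String, Double)]): Unit =
      aggregates.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // open one connection per partition rather than per record
          val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = conn.getTable(TableName.valueOf("portfolio_aggregates"))
          partition.foreach { case (key, value) =>
            val put = new Put(Bytes.toBytes(key))
            put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("value"), Bytes.toBytes(value))
            table.put(put)
          }
          table.close(); conn.close()
        }
      }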

Re: Using Spark for portfolio manager app

2015-09-19 Thread Huy Banh
Hi Thuy, You can check Rdd.lookup(). It requires that the RDD is partitioned, and of course, cached in memory. Or you may consider a distributed cache like Ehcache or AWS ElastiCache. I think an external storage is an option, too. Especially NoSQL databases - they can handle updates at high speed, at c
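
A small sketch of the lookup() path (Spark 1.x API; the ticker data is illustrative). The partitioner is what lets lookup() route to a single partition instead of scanning the whole RDD:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("lookup-demo").setMaster("local[2]"))
    val positions = sc.parallelize(Seq(("AAPL", 100), ("GOOG", 25)))
      .partitionBy(new HashPartitioner(8)) // partitioned, so lookup() hits one partition
      .cache()                             // cached, so repeated lookups stay in memory

    val qty: Seq[Int] = positions.lookup("AAPL") // Seq(100)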

Re: question building spark in a virtual machine

2015-09-19 Thread Eyal Altshuler
I allocated almost 6GB of RAM to the Ubuntu virtual machine and got the same problem. I will go over this post and try to zoom in on the Java VM settings. Meanwhile - can someone with a working Ubuntu machine specify their JVM settings? Thanks, Eyal On Sat, Sep 19, 2015 at 7:49 PM, Ted Yu w

DataGenerator for streaming application

2015-09-19 Thread Saiph Kappa
Hi, I am trying to build a data generator that feeds a streaming application. This data generator just reads a file and sends its lines through a socket. I get no errors in the logs, and the benchmark below always prints "Received 0 records". Am I doing something wrong? object MyDataGenerator {
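
The original generator is not quoted in full here, but a minimal sketch of a file-to-socket generator (port, file name, and throttle are illustrative) looks like this; the streaming side would read it with ssc.socketTextStream("localhost", 9999):

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.io.Source

    object MyDataGeneratorSketch {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)
        val socket = server.accept() // blocks until the streaming app connects
        val out = new PrintWriter(socket.getOutputStream, true)
        for (line <- Source.fromFile("input.txt").getLines()) {
          out.println(line)
          Thread.sleep(10) // throttle so records spread across batches
        }
        out.close(); socket.close(); server.close()
      }
    }

Note also that a socket receiver occupies one thread, so "Received 0 records" can simply mean the application was started with master local instead of local[2].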

Re: Using Spark for portfolio manager app

2015-09-19 Thread Thúy Hằng Lê
Thanks Adrian and Jorn for the answers. Yes, you're right, there are a lot of things I need to consider if I want to use Spark for my app. I still have a few concerns/questions from your information: 1/ I need to combine the trading stream with the tick stream, and I am planning to use Kafka for that. If I am usi
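
For question 1, a sketch of combining two Kafka topics per micro-batch (Spark 1.x direct-stream API; the broker address, topic names, and CSV parsing are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("trade-tick-join"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    def keyedBySymbol(topic: String) =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
        .map { case (_, line) => (line.split(",")(0), line) } // key by symbol, first CSV field

    // joins records that arrive in the same batch; use window() if they may not
    val joined = keyedBySymbol("trades").join(keyedBySymbol("ticks"))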

Re: PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-19 Thread Zhan Zhang
Hi Richard, I am not sure how to support user-defined types. But regarding your second question, you can use a workaround as follows. Suppose you have a struct a, and want to filter a.c with a.c > X. You can define an alias C for a.c, and add an extra column C to the schema of the relation, and
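
A sketch of that idea at the query level (df, the struct column a, and the threshold are illustrative stand-ins): exposing the nested field as a flat top-level column gives Spark SQL something it can push down as a plain filter.

    // assumes df has a struct column "a" with a numeric field "c"
    val withC    = df.withColumn("C", df("a.c"))
    val filtered = withC.filter(withC("C") > 10)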

Re: in joins, does one side stream?

2015-09-19 Thread Reynold Xin
The RDDs themselves are not materialized, but the implementations can materialize. E.g. in cogroup (which is used by RDD.join), it materializes all the data during grouping. In SQL/DataFrame join, depending on the join: 1. For broadcast join, only the smaller side is materialized in memory as a
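
A sketch of forcing the broadcast case in the 1.5-era DataFrame API (largeDf, smallDf, and the join key are illustrative); only the broadcast side is built into an in-memory hash table while the other side streams past it:

    import org.apache.spark.sql.functions.broadcast

    val joined = largeDf.join(broadcast(smallDf), "id")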

PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-19 Thread Richard Eggert
I defined my own relation (extending BaseRelation) and implemented the PrunedFilteredScan interface, but discovered that if the column referenced in a WHERE clause is a user-defined type or a field of a struct column, then Spark SQL passes NO filters to the PrunedFilteredScan.buildScan method, re
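
For reference, a minimal skeleton of the interface in question (Spark 1.x data sources API; the schema and scan contents are placeholders). With top-level atomic columns, pushable predicates arrive in the filters argument; per the report above, UDT and struct-field predicates never do:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.sql.{Row, SQLContext}

    class MyRelation(@transient val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("name", StringType, nullable = true)))

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] =
        sqlContext.sparkContext.parallelize(Seq(Row("example")))
    }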

Re: Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, figured it out - forgot to set the master as local[2]; at least two local threads are required. package com.examples /** * Created by kalit_000 on 19/09/2015. */ import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.sql.SQLContext import org.apache.spark.SparkConf imp
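
The fix in one line (app name illustrative) - a receiver occupies one thread, so a streaming job needs at least two local threads:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("KafkaStreamingDemo").setMaster("local[2]")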

Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, I am unable to see the output getting printed in the console - can anyone help? package com.examples /** * Created by kalit_000 on 19/09/2015. */ import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.sql.SQLContext import org.apache.spark.SparkConf impo

Re: Kafka createDirectStream ​issue

2015-09-19 Thread kali.tumm...@gmail.com
Hi, I am trying to develop the same code in IntelliJ IDEA and I am having the same issue - is there any workaround? Error in IntelliJ: cannot resolve symbol createDirectStream import kafka.serializer.StringDecoder import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.
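
This usually means the KafkaUtils import is missing, or the spark-streaming-kafka artifact (e.g. "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0" in sbt) is not on the project classpath. A sketch of the 1.x call, with the broker address and topic illustrative and ssc assumed to be an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("mytopic"))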

Re: in joins, does one side stream?

2015-09-19 Thread Rishitesh Mishra
Hi Reynold, Can you please elaborate on this? I thought an RDD also opens only an iterator. Does it get materialized for joins? Rishi On Saturday, September 19, 2015, Reynold Xin wrote: > Yes for RDD -- both are materialized. No for DataFrame/SQL - one side > streams. > > > On Thu, Sep 17, 2015 at

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-19 Thread Timothy Chen
You can still provide properties through the Docker container by putting configuration in the conf directory, but we try to pass through all properties submitted via spark-submit from the driver, which I believe will override the defaults. Is this not what you are seeing? Tim > On Sep 19, 2015, a
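
A sketch of what such container-baked defaults could look like in $SPARK_HOME/conf/spark-defaults.conf (values illustrative); anything passed explicitly to spark-submit takes precedence:

    # defaults baked into the Docker image; overridden by spark-submit flags
    spark.executor.memory   2g
    spark.executor.cores    2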

Re: question building spark in a virtual machine

2015-09-19 Thread Ted Yu
Please read this article: http://blogs.vmware.com/apps/2011/06/taking-a-closer-look-at-sizing-the-java-process.html Can you increase the memory given to the Ubuntu virtual machine? Cheers On Sat, Sep 19, 2015 at 9:30 AM, Eyal Altshuler wrote: > Hi, > > I allocate 4GB for the ubuntu virtual ma

Re: question building spark in a virtual machine

2015-09-19 Thread Eyal Altshuler
Hi, I allocated 4GB for the Ubuntu virtual machine - how can I check the maximum heap available to a JVM process? Regarding the thread - I see it's related to building on Windows. Thanks, Eyal On Sat, Sep 19, 2015 at 6:54 PM, Ted Yu wrote: > See also this thread: > > https://bukkit.org/threads/
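
One quick check, outside of Maven, is whether the JVM can reserve a given heap at all (size illustrative):

    # prints the Java version if a 3g heap can be reserved; fails with
    # "Could not reserve enough space for object heap" otherwise
    java -Xmx3g -version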

Re: word count (group by users) in spark

2015-09-19 Thread Aniket Bhatnagar
Using the Scala API, you can first group by user and then use combineByKey. Thanks, Aniket On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com wrote: > Hi All, > I would like to achieve the below output using Spark. I managed to write > it in Hive and call it in Spark but not in just Spark (Scala),
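
A sketch of that suggestion, assuming the input is an RDD of (user, tweetText) pairs: combineByKey keyed by user builds one word-count map per user.

    import org.apache.spark.rdd.RDD

    def wordCountsPerUser(tweets: RDD[(String, String)]): RDD[(String, Map[String, Int])] = {
      def count(text: String): Map[String, Int] =
        text.split("\\s+").groupBy(identity).mapValues(_.length).toMap

      def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
        (a.keySet ++ b.keySet).map(w => w -> (a.getOrElse(w, 0) + b.getOrElse(w, 0))).toMap

      tweets.combineByKey(
        (text: String) => count(text),                                    // createCombiner
        (acc: Map[String, Int], text: String) => merge(acc, count(text)), // mergeValue
        (m1: Map[String, Int], m2: Map[String, Int]) => merge(m1, m2)     // mergeCombiners
      )
    }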

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-19 Thread Alan Braithwaite
The assumption is that the executor has no default properties set in its environment through the Docker container. Correct me if I'm wrong, but any properties which are unset in the SparkContext will come from the environment of the executor, will they not? Thanks, - Alan On Sat, Sep 19, 2015 at 1:09

Re: question building spark in a virtual machine

2015-09-19 Thread Ted Yu
See also this thread: https://bukkit.org/threads/complex-craftbukkit-server-and-java-problem-could-not-reserve-enough-space-for-object-heap.155192/ Cheers On Sat, Sep 19, 2015 at 8:51 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Hi Eyal > > Can you check if your Ubuntu VM has enou

Re: question building spark in a virtual machine

2015-09-19 Thread Aniket Bhatnagar
Hi Eyal, Can you check if your Ubuntu VM has enough RAM allocated to run a JVM of size 3GB? Thanks, Aniket On Sat, Sep 19, 2015, 9:09 PM Eyal Altshuler wrote: > Hi, > > I had configured the MAVEN_OPTS environment variable the same as you wrote. > My java version is 1.7.0_75. > I didn't customize

Re: question building spark in a virtual machine

2015-09-19 Thread Eyal Altshuler
Hi, I had configured the MAVEN_OPTS environment variable the same as you wrote. My Java version is 1.7.0_75. I didn't customize the JVM heap size specifically. Is there any additional configuration I have to run besides the MAVEN_OPTS configuration? Thanks, Eyal On Sat, Sep 19, 2015 at 5:29 PM, T

Re: question building spark in a virtual machine

2015-09-19 Thread Ted Yu
Can you tell us how you configured the JVM heap size? Which version of Java are you using? When I build Spark, I do the following: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" Cheers On Sat, Sep 19, 2015 at 5:31 AM, Eyal Altshuler wrote: > Hi, > Trying to b

word count (group by users) in spark

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, I would like to achieve the below output using Spark. I managed to write it in Hive and call it in Spark, but not in just Spark (Scala). How do I group word counts by a particular user (column)? For example, imagine users and their tweets; I want to do a word count based on user name. Input:-

Docker/Mesos with Spark

2015-09-19 Thread John Omernik
I was searching the 1.5.0 docs on the Docker on Mesos capabilities and just found you CAN run it this way. Are there any user posts, blog posts, etc. on why and how you'd do this? Basically, at first I was questioning why you'd run Spark in a Docker container, i.e., if you run with tar balled e

question building spark in a virtual machine

2015-09-19 Thread Eyal Altshuler
Hi, Trying to build Spark in my Ubuntu virtual machine, I am getting the following error: "Error occurred during initialization of VM Could not reserve enough space for object heap Error: could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit". I have

Re: Using Spark for portfolio manager app

2015-09-19 Thread Jörn Franke
If you want to be able to let your users query their portfolio then you may want to think about storing the current state of the portfolios in HBase/Phoenix; alternatively a cluster of relational databases can make sense. For the rest you may use Spark. On Sat, Sep 19, 2015 at 4:43 AM, Thúy Hằng Lê

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-19 Thread Tim Chen
I guess I need a bit more clarification: what kind of assumptions was the dispatcher making? Tim On Thu, Sep 17, 2015 at 10:18 PM, Alan Braithwaite wrote: > Hi Tim, > > Thanks for the follow-up. It's not so much that I expect the executor to > inherit the configuration of the dispatcher as I

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-19 Thread Ewan Leith
yarn-client still runs the executor tasks on the cluster; the main difference is where the driver runs. Thanks, Ewan -- Original message -- From: shahab Date: Fri, 18 Sep 2015 13:11 To: Aniket Bhatnagar; Cc: user@spark.apache.org; Subject: Re: Zeppelin on Yarn : org.apache.spar
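
In spark-submit terms (class and jar names illustrative):

    # yarn-client: driver runs where spark-submit is launched, executors on the cluster
    spark-submit --master yarn-client --class com.example.Main app.jar
    # yarn-cluster: the driver also runs inside the cluster
    spark-submit --master yarn-cluster --class com.example.Main app.jar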