Could Spark batch processing live within Spark Streaming?

2015-06-11 Thread diplomatic Guru
Hello all, I was wondering if it is possible to have a high-latency batch-processing job coexist within a Spark Streaming job? If it's possible, could we share the state of the batch job with the Spark Streaming job? For example, when the long-running batch computation is complete, could we inf
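One way this is commonly done, as a minimal hedged sketch: run the batch computation on the streaming application's own SparkContext, cache the result as a pair RDD, and join each micro-batch against it inside transform. Paths, keys, and the events stream are hypothetical.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    // Batch side: compute (or load) the result once on the shared context
    // (jssc is an existing JavaStreamingContext; path is hypothetical).
    JavaPairRDD<String, Long> batchResults = jssc.sparkContext()
        .textFile("hdfs:///batch/output")
        .mapToPair(line -> {
          String[] p = line.split(",");
          return new Tuple2<>(p[0], Long.parseLong(p[1]));
        })
        .cache();

    // Streaming side: join every micro-batch against the batch state
    // (events is a hypothetical JavaPairDStream<String, String> keyed like batchResults).
    JavaPairDStream<String, Tuple2<String, Long>> enriched =
        events.transformToPair(rdd -> rdd.join(batchResults));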

Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread diplomatic Guru
Hello All, I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded". The job is trying to process a file of size 4.5G. I've tried the following Spark configuration: --num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G. I tried increasing more cor
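This error usually means too much concurrent task memory per executor rather than too little total memory. A hedged sketch of a configuration to try (class and jar names hypothetical): fewer cores per executor plus more partitions, so each task holds less data at once.

    # Fewer concurrent tasks per executor plus more input partitions
    # lowers per-task memory pressure (class/jar names hypothetical):
    spark-submit \
      --num-executors 6 \
      --executor-memory 6G \
      --executor-cores 2 \
      --driver-memory 3G \
      --conf spark.default.parallelism=200 \
      --class com.example.MyJob myjob.jar

Reading the file with a higher minimum partition count, e.g. sc.textFile(path, 200), similarly splits the 4.5G input into smaller tasks.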

load Java properties file in Spark

2015-06-29 Thread diplomatic Guru
I want to store the Spark application arguments, such as the input file and output file, in a Java properties file and pass that file to the Spark driver. I'm using spark-submit to submit the job but couldn't find a parameter for passing the properties file. Have you got any suggestions?
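Note that spark-submit's --properties-file flag is reserved for Spark's own configuration, but an application-level properties file can simply be passed as an ordinary program argument (and shipped with --files in yarn-cluster mode so it lands in the container's working directory). A minimal sketch; the property keys are hypothetical.

    import java.io.FileInputStream;
    import java.util.Properties;

    public class LoadAppProps {
      public static void main(String[] args) throws Exception {
        // Invoked as: spark-submit --class LoadAppProps app.jar /path/to/app.properties
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
          props.load(in);
        }
        String inputPath = props.getProperty("input.path");    // hypothetical keys
        String outputPath = props.getProperty("output.path");
        // ... build the SparkContext and use inputPath/outputPath ...
      }
    }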

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I got different execution times, which is expe

query on Spark + Flume integration using push model

2015-07-09 Thread diplomatic Guru
Hello all, I'm trying to configure Flume to push data into a sink so that my streaming job can pick up the data. My events are in JSON format, but the "Spark + Flume integration" document [1] only refers to the Avro sink. [1] https://spark.apache.org/docs/latest/streaming-flume-integration.html I
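The Avro sink in that guide is only the transport between Flume and Spark; the JSON payload rides opaquely in the event body. A hedged sketch of the push model (host and port hypothetical; jssc is an existing JavaStreamingContext):

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    // Flume's avro sink points at this host:port; Spark binds a receiver there.
    JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
        FlumeUtils.createStream(jssc, "receiver-host", 9988);

    // The JSON arrives as raw bytes in the event body; decode and parse it yourself.
    JavaDStream<String> jsonLines = flumeStream.map(event ->
        new String(event.event().getBody().array(), "UTF-8"));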

Re: query on Spark + Flume integration using push model

2015-07-10 Thread diplomatic Guru
wrote: > Here's an example https://github.com/przemek1990/spark-streaming > > Thanks > Best Regards > > On Thu, Jul 9, 2015 at 4:35 PM, diplomatic Guru > wrote: > >> Hello all, >> >> I'm trying to configure the flume to push data into a sink so that

[Streaming + MLlib] Is only linear regression supported by online learning?

2016-03-08 Thread diplomatic Guru
Hello all, I'm using Random Forest for my machine learning (batch), and I would like to do online prediction using a Streaming job. However, the documentation only mentions a linear algorithm for streaming regression. Could we not use other algorithms?

Re: [Streaming + MLlib] Is only linear regression supported by online learning?

2016-03-09 Thread diplomatic Guru
Could someone verify this for me? On 8 March 2016 at 14:06, diplomatic Guru wrote: > Hello all, > > I'm using Random Forest for my machine learning (batch), I would like to > use online prediction using Streaming job. However, the document only > states linear algorith

H2O + Spark Streaming?

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark Streaming for online prediction?

Could we use Sparkling Water Lib with Spark Streaming

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark Streaming for online prediction?

StreamingLinearRegression Java example

2016-05-09 Thread diplomatic Guru
Hello, I'm trying to find an example of using StreamingLinearRegression in Java, but couldn't find any. There are examples for Scala but not for Java. Has anyone got an example that I could take a look at? Thanks.
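A hedged Java sketch mirroring the Scala example in the MLlib docs; directories and the feature count are hypothetical, and it assumes Java 8 lambdas and that the LabeledPoint.parse companion method is callable from Java via its generated static forwarder.

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import scala.Tuple2;

    int numFeatures = 3;   // hypothetical feature count

    // Data arrives as text files in LabeledPoint format, e.g. "(y,[x1,x2,x3])".
    JavaDStream<LabeledPoint> trainingData =
        jssc.textFileStream("hdfs:///train").map(LabeledPoint::parse);
    JavaDStream<LabeledPoint> testData =
        jssc.textFileStream("hdfs:///test").map(LabeledPoint::parse);

    StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures));

    model.trainOn(trainingData);
    model.predictOnValues(testData.mapToPair(lp -> new Tuple2<>(lp.label(), lp.features())))
         .print();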

[Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
Hello all, I have a job for forecasting using linear regression, but sometimes I'm getting a negative prediction. How do I prevent this? Thanks.
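As Sean Owen notes in the follow-up below, a linear model can legitimately produce negative values. Two common workarounds, sketched with hypothetical model/point variables: clamp the output at zero, or train on a log-transformed target so the inverted prediction is always positive.

    // Option 1: clamp the raw prediction at zero.
    double forecast = Math.max(0.0, model.predict(point.features()));

    // Option 2: train on log1p(label), then invert with expm1 at prediction time.
    LabeledPoint logPoint = new LabeledPoint(Math.log1p(point.label()), point.features());
    // ... train the model on the transformed points, then:
    double positiveForecast = Math.expm1(model.predict(point.features()));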

Fwd: [Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
wanted to find out. Thanks. On 21 June 2016 at 13:55, Sean Owen wrote: > There's nothing inherently wrong with a regression predicting a > negative value. What is the issue, more specifically? > > On Tue, Jun 21, 2016 at 1:38 PM, diplomatic Guru > wrote: > > Hello all

[Spark + MLlib] how to update offline model with the online model

2016-06-22 Thread diplomatic Guru
Hello all, I have built a Spark batch model using MLlib and a Streaming online model. Now I would like to load the offline model in the streaming job, then apply and update the model. Could you please advise me how to do it? Is there an example to look at? The streaming model does not allow saving or loa
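One pattern that should work with the 1.x API, as a hedged sketch: the streaming model itself has no save/load, but its weights can be seeded from a saved batch model via setInitialWeights, and the latest weights read back from latestModel(). The model path and labeledStream are hypothetical.

    import org.apache.spark.mllib.regression.LinearRegressionModel;
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;

    // Load the offline (batch) model saved earlier with model.save(...).
    LinearRegressionModel offline =
        LinearRegressionModel.load(jssc.sparkContext().sc(), "hdfs:///models/offline");

    // Seed the online model with the offline weights and keep training on the stream.
    StreamingLinearRegressionWithSGD online = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(offline.weights());
    online.trainOn(labeledStream);   // hypothetical JavaDStream<LabeledPoint>

    // online.latestModel().weights() can be persisted periodically
    // so the batch job can pick the updated weights back up.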

[Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread diplomatic Guru
Hello guys, I've been trying to read an Avro file using Spark's DataFrame API, but it's throwing this error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.read()Lorg/apache/spark/sql/DataFrameReader; This is what I've done so far: I've added the dependency to pom.xml: co
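That particular NoSuchMethodError usually means the job was compiled against Spark 1.4+ (where SQLContext.read() was introduced) but is running on a 1.3.0 cluster. On 1.3 the older load(path, source) entry point should work, together with a spark-avro release built for 1.3. A hedged sketch with a hypothetical path:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    SQLContext sqlContext = new SQLContext(sc);   // sc: an existing JavaSparkContext

    // Spark 1.4+ style -- SQLContext.read() does not exist in 1.3.0,
    // which is exactly what the NoSuchMethodError is complaining about:
    //   DataFrame df = sqlContext.read().format("com.databricks.spark.avro").load(path);

    // Spark 1.3.0-compatible entry point:
    DataFrame df = sqlContext.load("hdfs:///data/events.avro", "com.databricks.spark.avro");
    df.printSchema();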

[MLlib] What is the best way to forecast the next month page visit?

2016-01-29 Thread diplomatic Guru
Hello guys, I'm trying to understand how I could predict next month's page views based on the previous access pattern. For example, I've collected statistics on page views: e.g. Page,UniqueView - pageA, 1 pageB, 999 ... pageZ,200 I aggregate the statistics monthly. I

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-01 Thread diplomatic Guru
Any suggestions please? On 29 January 2016 at 22:31, diplomatic Guru wrote: > Hello guys, > > I'm trying understand how I could predict the next month page views based > on the previous access pattern. > > For example, I've collected statistics on page views

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-02 Thread diplomatic Guru
se let me know what I'm doing wrong? PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0 on my local machine. Cheers. On 2 February 2016 at 10:25, Jorge Machado wrote: > Hi Guru, >

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-18 Thread diplomatic Guru
he end for Januar and PageA something like : > > LabeledPoint (label , (0,0,1,0,0,01,1.0,2.0,3.0)) > > Pass the LabeledPoint to the ML model. > > test it. > > PS: label is what you want to predict. > > On 02/02/2016, at 20:44, diplomatic Guru wrote: >
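A hedged Java rendering of the quoted suggestion: one-hot encode the month by hand (so no StringIndexer/OneHotEncoder is needed on 1.3) and append the recent monthly counts as numeric features; the label is the value to predict. All the input numbers are hypothetical.

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    // hypothetical inputs: month in 1..12, three trailing monthly view counts
    int month = 1;
    double[] features = new double[12 + 3];
    features[month - 1] = 1.0;   // one-hot month slot
    features[12] = 1.0;          // views two months ago
    features[13] = 2.0;          // views last month
    features[14] = 3.0;          // views this month
    LabeledPoint example =
        new LabeledPoint(/* nextMonthViews = */ 4.0, Vectors.dense(features));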

[MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hello guys, I think the default Loss algorithm is Squared Error for regression, but how do I change that to Absolute Error in Java? Could you please show me an example?

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
/org/apache/spark/mllib/tree/GradientBoostedTrees.html#GradientBoostedTrees(org.apache.spark.mllib.tree.configuration.BoostingStrategy)> > (BoostingStrategy <http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/tree/configuration/BoostingStrategy.html> boostingStrategy) >

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
lements the Loss interface. > For example. > > val loss = new AbsoluteError() > boostingStrategy.setLoss(loss) > > On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru > wrote: > >> Hi Kevin, >> >> Yes, I've set the bootingStrategy like that using the ex

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
16 at 15:49, Kevin Mellott wrote: > Looks like it should be present in 1.3 at > org.apache.spark.mllib.tree.loss.AbsoluteError > > > spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html > > On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
ing the Loss, you should be able to do something like: > > Losses.fromString("leastSquaresError") > > On Mon, Feb 29, 2016 at 10:03 AM, diplomatic Guru < > diplomaticg...@gmail.com> wrote: > >> It's strange as you are correct the doc does state it. Bu
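Putting the thread's suggestions together, a hedged Java sketch (trainingData is a hypothetical JavaRDD<LabeledPoint>; the loss name follows the Losses.fromString convention quoted above, where I believe absolute error is spelled "leastAbsoluteError"):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.GradientBoostedTrees;
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
    import org.apache.spark.mllib.tree.loss.Losses;
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;

    BoostingStrategy strategy = BoostingStrategy.defaultParams("Regression");
    strategy.setNumIterations(50);
    // Swap the squared-error default for absolute error:
    strategy.setLoss(Losses.fromString("leastAbsoluteError"));

    GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, strategy);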

How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
Hello All, I have a Spark Streaming job that should do some action only if the RDD is not empty. This can be done easily with a Spark batch RDD, as I could call .take(1) and check whether it is empty or not. But this cannot be done on a Spark Streaming DStream: JavaPairInputDStream input = ssc.fileS

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
I tried the code below, but it is still carrying out the action even though there is no new data. JavaPairInputDStream input = ssc.fileStream(iFolder, LongWritable.class,Text.class, TextInputFormat.class); if(input != null){ //do some action if it is not empty } On 21 October 2015 at 18:00, diplomatic

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
Das wrote: > What do you mean by checking when a "DStream is empty"? DStream represents > an endless stream of data, and at point of time checking whether it is > empty or not does not make sense. > > FYI, there is RDD.isEmpty() > > > > On Wed, Oct 21, 2015 at 10:03 A
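So, following TD's reply, the emptiness check belongs on each micro-batch's RDD rather than on the DStream itself. A minimal sketch (RDD.isEmpty() is available from Spark 1.3; the lambda form needs the 1.5+ foreachRDD overload, otherwise use a Function<..., Void> returning null):

    // input is the JavaPairInputDStream from above; the check runs per micro-batch.
    input.foreachRDD(rdd -> {
      if (!rdd.isEmpty()) {   // cheap compared to count()
        // ... do the action only when this batch actually contains new data ...
      }
    });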

[Spark Streaming] Why are some uncached RDDs growing?

2015-10-27 Thread diplomatic Guru
Hello All, When I checked my running streaming job on the WebUI, I can see that some RDDs are listed that were not requested to be cached. What's more, they are growing! I've not asked them to be cached. What are they? Are they the state (UpdateStateByKey)? Only the rows in white are being re

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-27 Thread diplomatic Guru
I know it uses a lazy evaluation model, which is why I was wondering. On 27 October 2015 at 19:02, Uthayan Suthakar wrote: > Hello all, > > What I wanted to do is configure the spark streaming job to read the > database using JdbcRDD and cache the results. This should occur only once > at the start of the jo
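A hedged sketch of that idea using the Java-friendly JdbcRDD.create: run the query once at startup, cache it, and force materialisation with an action so later batches reuse the cached copy. The JDBC URL, credentials, query, and bounds are all hypothetical; the SQL must keep the two '?' bound placeholders.

    import java.sql.DriverManager;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.rdd.JdbcRDD;

    JavaRDD<String> reference = JdbcRDD.create(
        jsc,   // an existing JavaSparkContext
        () -> DriverManager.getConnection("jdbc:oracle:thin:@db-host:1521:orcl", "user", "pass"),
        "SELECT key, value FROM ref_table WHERE id >= ? AND id <= ?",
        1L, 100000L, 4,
        rs -> rs.getString(1) + "," + rs.getString(2));

    reference.cache();
    reference.count();   // force the single read up front; later batches reuse the cache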

How to enable debug in Spark Streaming?

2015-11-03 Thread diplomatic Guru
I have an issue with a Spark Streaming job that appears to be running but is not producing any results. Therefore, I would like to enable debug mode to get as much logging as possible.
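One way to do this, hedged: raise the log level through log4j.properties (edit conf/log4j.properties on the nodes, or ship a custom copy). A minimal sketch that turns everything up to DEBUG, or scopes it to the streaming packages only:

    # conf/log4j.properties
    log4j.rootCategory=DEBUG, console
    log4j.logger.org.apache.spark.streaming=DEBUG
    log4j.logger.org.apache.spark.streaming.scheduler=DEBUG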

[SPARK STREAMING] multiple hosts and multiple ports for Stream job

2015-11-19 Thread diplomatic Guru
Hello team, I was wondering whether it is a good idea to have multiple hosts and multiple ports for a Spark job. Let's say that there are two hosts, and each has 2 ports; is this a good idea? If this is not an issue, then what is the best way to do it? Currently, we pass it as an argument comma sep
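A hedged sketch of the comma-separated-argument approach: create one socket receiver per host:port pair and union them into a single DStream. Note that each receiver pins a core, so the job needs more cores than receivers. The argument format is hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // args[0] e.g. "host1:9999,host1:9998,host2:9999" (hypothetical)
    List<JavaDStream<String>> streams = new ArrayList<>();
    for (String hostPort : args[0].split(",")) {
      String[] hp = hostPort.split(":");
      streams.add(jssc.socketTextStream(hp[0], Integer.parseInt(hp[1])));
    }
    JavaDStream<String> unified =
        jssc.union(streams.get(0), streams.subList(1, streams.size()));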

[Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread diplomatic Guru
Hello, I know how I could clear old state depending on the input value: if some condition determines that the state is old, then returning null will invalidate the record. But this is only feasible if a new record arrives that matches the old key. What if no new data arrives for
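One hedged workaround on the 1.x API: carry a last-updated timestamp inside the state. updateStateByKey calls the update function for every existing key on every batch, including keys with no new records, so stale entries can be evicted by returning absent. (Spark 1.6's mapWithState later added a built-in timeout.) The retention period and pairs stream are hypothetical; Spark 1.x's Java API uses Guava's Optional.

    import java.util.List;
    import com.google.common.base.Optional;
    import org.apache.spark.api.java.function.Function2;
    import scala.Tuple2;

    final long maxAgeMs = 24L * 60 * 60 * 1000;   // hypothetical retention period

    // State = (runningTotal, lastUpdatedMs)
    Function2<List<Long>, Optional<Tuple2<Long, Long>>, Optional<Tuple2<Long, Long>>> update =
        (newValues, state) -> {
          long now = System.currentTimeMillis();
          if (newValues.isEmpty()) {
            // No new data for this key: drop the state once it is old enough.
            if (state.isPresent() && now - state.get()._2() > maxAgeMs) {
              return Optional.absent();
            }
            return state;
          }
          long total = state.isPresent() ? state.get()._1() : 0L;
          for (long v : newValues) total += v;
          return Optional.of(new Tuple2<>(total, now));
        };

    // pairs is a hypothetical JavaPairDStream<String, Long>
    pairs.updateStateByKey(update);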

[SPARK] Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the start and finish times of the job (which includes overhead). However, I'm not sure how to get the other metrics, e.g. CPU, I/O, memory, etc. I want to measure the individual job, n

Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the start and finish times of the job (which includes overhead). However, I'm not sure how to get the other metrics, e.g. CPU, I/O, memory, etc. I want to measure the individual job, n
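Beyond wall-clock timing, Spark's metrics system can export per-executor and per-job counters through conf/metrics.properties; a minimal hedged sketch using a CSV sink (directory hypothetical). Per-stage task metrics are also exposed on the WebUI and its REST endpoints.

    # conf/metrics.properties -- enable a CSV sink for all instances
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=10
    *.sink.csv.unit=seconds
    *.sink.csv.directory=/tmp/spark-metrics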

Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Hello all, We are having a major performance issue with Spark, which is keeping us from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to parallelize by 4, as we don't want to establish too man

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
a performance > problem. > > Robin > > On 22 Jul 2015, at 19:11, diplomatic Guru > wrote: > > Hello all, > > We are having a major performance issue with the Spark, which is holding > us from going live. > > We have a job that carries out computation on lo

Re: Performance issue with Spark's foreachPartition method

2015-07-27 Thread diplomatic Guru
o > bulk inserts to oracle. > > On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru > wrote: > >> Thanks Robin for your reply. >> >> I'm pretty sure that writing to Oracle is taking longer as when writing >> to HDFS is only taking ~5minutes. >> >&
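For reference, the batching pattern suggested above, as a hedged sketch (URL, credentials, table, and columns hypothetical): open one connection per partition and flush inserts with JDBC batches instead of one round trip per record.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import scala.Tuple2;

    final String url = "jdbc:oracle:thin:@db-host:1521:orcl";   // hypothetical

    // results is a hypothetical JavaPairRDD<String, Long>
    results.foreachPartition(rows -> {
      try (Connection conn = DriverManager.getConnection(url, "user", "pass")) {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")) {
          int pending = 0;
          while (rows.hasNext()) {
            Tuple2<String, Long> row = rows.next();
            ps.setString(1, row._1());
            ps.setLong(2, row._2());
            ps.addBatch();
            if (++pending % 1000 == 0) ps.executeBatch();   // flush in chunks
          }
          ps.executeBatch();
        }
        conn.commit();
      }
    });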

Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
I'm testing the Flume + Spark integration example (flume count). I'm deploying the job in YARN cluster mode. I first logged into the YARN cluster, then submitted the job and passed in a specific worker node's IP to deploy the job. But when I checked the WebUI, it failed to bind to the specifie

Re: Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
he older stream? > > Such problems of binding used to occur in the older push-based approach, > hence we built the polling stream (pull-based). > > > On Tue, Aug 18, 2015 at 4:45 AM, diplomatic Guru > wrote: > >> I'm testing the Flume + Spark integration example
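For reference, a hedged sketch of the pull-based setup TD refers to: Flume writes into the Spark sink, and the streaming job polls it, so the receiver no longer has to bind on one particular worker. Host and port are hypothetical; the sink class comes from the spark-streaming-flume-sink artifact.

    # Flume agent side (agent/sink names hypothetical):
    agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
    agent.sinks.spark.hostname = flume-agent-host
    agent.sinks.spark.port = 9988

    // Spark side:
    JavaReceiverInputDStream<SparkFlumeEvent> events =
        FlumeUtils.createPollingStream(jssc, "flume-agent-host", 9988);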

How to calculate average from multiple values

2015-09-16 Thread diplomatic Guru
have a mapper that emits key/value pairs (composite keys and composite values, separated by commas). e.g. *key:* a,b,c,d *Value:* 1,2,3,4,5 *key:* a1,b1,c1,d1 *Value:* 5,4,3,2,1 ... ... *key:* a,b,c,d *Value:* 5,4,3,2,1 I could easily SUM these values using reduceByKey, e.g. reduceByKey(new F
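A hedged sketch of the usual sum-and-count trick: pair each composite value with a count of 1, reduceByKey element-wise, then divide. It assumes the comma-separated values have already been parsed into long[] arrays of equal length.

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // pairs: hypothetical JavaPairRDD<String, long[]>, parsed from "1,2,3,4,5"
    JavaPairRDD<String, Tuple2<long[], Long>> withCounts =
        pairs.mapValues(v -> new Tuple2<>(v, 1L));

    JavaPairRDD<String, Tuple2<long[], Long>> totals = withCounts.reduceByKey((a, b) -> {
      long[] sum = new long[a._1().length];
      for (int i = 0; i < sum.length; i++) sum[i] = a._1()[i] + b._1()[i];
      return new Tuple2<>(sum, a._2() + b._2());   // element-wise sum plus record count
    });

    JavaPairRDD<String, double[]> averages = totals.mapValues(t -> {
      double[] avg = new double[t._1().length];
      for (int i = 0; i < avg.length; i++) avg[i] = (double) t._1()[i] / t._2();
      return avg;
    });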

Re: How to calculate average from multiple values

2015-09-17 Thread diplomatic Guru
-- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 16 Sep 2015, at 15:46, diplomatic Guru > wro

[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team, Spark version: 1.6.0. I'm trying to persist some data into memory for reuse. However, when I call rdd.cache() or rdd.persist(StorageLevel.MEMORY_ONLY()), it does not store the data, as I cannot see any RDD information under the WebUI (Storage tab). Therefore I tried rdd.persist(St

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
; Chin Wei > > On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru > wrote: > >> Hello team, >> >> Spark version: 1.6.0 >> >> I'm trying to persist done data into memory for reusing them. However, >> when I call rdd.cache() OR rdd.persist(StorageLevel.

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
= 1224.6 MB. Storage limit = 1397.3 MB. Therefore, I repartitioned the RDDs for better memory utilisation, which resolved the issue. Kind regards, Guru On 11 October 2016 at 11:23, diplomatic Guru wrote: > @Song, I have called an action but it did not cache as you can see in the > provide
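For reference, the pattern that emerged from this thread, as a hedged sketch (path and partition count hypothetical): persist, force materialisation with an action, and repartition if individual partitions are too large for the storage memory.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    JavaRDD<String> data = sc.textFile("hdfs:///input")   // sc: JavaSparkContext
        .repartition(48)                                  // smaller partitions fit the storage pool
        .persist(StorageLevel.MEMORY_ONLY());
    data.count();   // caching is lazy; an action populates the Storage tab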

How do I access the SPARK SQL

2014-04-23 Thread diplomatic Guru
Hello Team, I'm new to SPARK and just came across SPARK SQL, which appears to be interesting, but I'm not sure how I could get it. I know it's an Alpha version, but I'm not sure if it's available to the community yet. Many thanks. Raj.

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
sembly, and then > try it out. We’re also going to post some release candidates soon that will > be pre-built. > > Matei > > On Apr 23, 2014, at 1:30 PM, diplomatic Guru > wrote: > > > Hello Team, > > > > I'm new to SPARK and just came across SPARK SQL,

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
w Or wrote: > >> Did you build it with SPARK_HIVE=true? >> >> >> On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru < >> diplomaticg...@gmail.com> wrote: >> >>> Hi Matei, >>> >>> I checked out the git repositor

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
> wrote: > >> Yeah, you'll need to run `sbt publish-local` to push the jars to your >> local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. >> >> >> On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru < >> diplomaticg..

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru wrote: > Many thanks for your prompt reply. I'll try your suggestions and will get > back to you. > > > > > On 24 April 2014 18:17, Michael Armbrust wrote: > >>

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread diplomatic Guru
Thanks Dean, very useful indeed! Best regards, Raj On 1 May 2014 14:46, Dean Wampler wrote: > That's great! Thanks. Let me know if it works ;) or what I could improve > to make it work. > > dean > > > On Thu, May 1, 2014 at 8:45 AM, ZhangYi wrote: > >> Very Useful material. Currently, I am