Hi All,
I have a JavaPairDStream. I want to insert this dstream into multiple
Cassandra tables on the basis of key. One approach is to filter each key
and then insert it into the corresponding Cassandra table. But this would
call the filter operation "n" times, depending on the number of keys.
Is there any better approach?
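A minimal single-pass sketch, not a tested implementation: route each record inside foreachPartition with a plain DataStax driver session, choosing the table from the key. The contact point, keyspace, key-to-table naming scheme and String key/value types are assumptions, and in real code you would reuse a session rather than build one per partition.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import scala.Tuple2;

pairDStream.foreachRDD(rdd -> {
  rdd.foreachPartition(iter -> {
    // Hypothetical: one session per partition; prefer a shared/pooled session in production.
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("my_keyspace")) {
      while (iter.hasNext()) {
        Tuple2<String, String> record = iter.next();
        String table = "table_" + record._1();   // hypothetical key-to-table mapping
        session.execute("INSERT INTO " + table + " (key, value) VALUES (?, ?)",
            record._1(), record._2());
      }
    }
  });
});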
You can use distinct over your data frame or rdd
>>
>> rdd.distinct
>>
>> It will give you the distinct rows across your data.
>>
>> On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand
>> wrote:
>>
>>> I have an rdd which contains 14 different columns. I need to
I have an rdd which contains 14 different columns. I need to find the
distinct rows across all the columns of the rdd and write them to HDFS.
How can I achieve this?
Is there any distributed data structure that I can use and keep on updating
it as I traverse the new rows ?
Regards,
Abhi
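Following the rdd.distinct suggestion above, a minimal Java sketch, assuming the rows are org.apache.spark.sql.Row and can be rendered as tab-delimited strings; rowsRdd and the output path are placeholders:

// distinct() then runs across whole rows, and the result is written to HDFS.
JavaRDD<String> lines = rowsRdd.map(row -> row.mkString("\t"));
lines.distinct().saveAsTextFile("hdfs:///path/to/distinct-output");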
> for (int i = 0; i < concatColumns.length; i++) {
> concatColumns[i]=df.col(array[i]);
> }
>
> return functions.concat(concatColumns).alias(fieldName);
> }
>
>
>
> On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand
> wrote:
>
>> Hi Nihed,
>>
>>
> for (int i = 0; i < columns.length; i++) {
> selectColumns[i]=df.col(columns[i]);
> }
>
>
> selectColumns[columns.length]=functions.concat(df.col("firstname"),
> df.col("lastname"));
>
> df.select(selectColumns).show();
> --
Hi,
I have a dataframe, say having C0, C1, C2 and so on as columns.
I need to create interaction variables to be taken as input for my program.
For example -
I need to create I1 as the concatenation of C0, C3, C5
Similarly, I2 = concat(C4, C5)
and so on ..
How can I achieve this in my Java code for concatenating these columns?
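A small sketch along the lines of the concat reply quoted above, assuming df already has columns C0..C5 and that I1/I2 are the new interaction columns:

import static org.apache.spark.sql.functions.concat;

DataFrame withInteractions = df
    .withColumn("I1", concat(df.col("C0"), df.col("C3"), df.col("C5")))
    .withColumn("I2", concat(df.col("C4"), df.col("C5")));
withInteractions.show();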
Hi ,
I have a dataframe which I want to convert to a labeled point.
DataFrame labeleddf = model.transform(newdf).select("label","features");
How can I convert this to a LabeledPoint to use in my logistic regression
model?
I could do this in Scala using
val trainData = labeleddf.map(row =>
Labeled
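For the Java side, a minimal sketch, assuming the label column is a double and the features column holds an mllib Vector (which is what the 1.x pipeline stages produce):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

// Column 0 is "label", column 1 is "features" from the select() above.
JavaRDD<LabeledPoint> trainData = labeleddf.javaRDD().map(row ->
    new LabeledPoint(row.getDouble(0), (Vector) row.get(1)));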
I also tried
jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2")
But it's still not working.
Any ideas why it's not working?
Abhi
On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand
wrote:
> My spark streaming checkpoint directory is being written
My spark streaming checkpoint directory is being written to HDFS with the
default replication factor of 3.
In my streaming application, where I am listening from Kafka and setting
dfs.replication = 2 as below, the files are still being written with
replication factor 3.
SparkConf sparkConfig = new
S
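One thing worth trying, as an assumption rather than a verified fix: properties prefixed with spark.hadoop. on the SparkConf are copied into the Hadoop Configuration that Spark uses when writing files, so the replication setting is not limited to the driver-side configuration object.

SparkConf sparkConfig = new SparkConf()
    .setAppName("my-streaming-app")                 // placeholder app name
    .set("spark.hadoop.dfs.replication", "2");      // propagated into the Hadoop conf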
type string) will be one-hot
> encoded automatically.
> So pre-processing like `as.factor` is not necessary, you can directly feed
> your data to the model training.
>
> Thanks
> Yanbo
>
> 2016-05-30 2:06 GMT-07:00 Abhishek Anand :
>
>> Hi ,
>>
>> I want to ru
Hi ,
I want to run the glm variant of SparkR for my data that is in a CSV file.
I see that the glm function in SparkR takes a Spark dataframe as input.
Now, when I read a file from CSV and create a Spark dataframe, how could I
take care of the factor variables/columns in my data?
Do I need t
I am building an ML pipeline for logistic regression.
val lr = new LogisticRegression()
lr.setMaxIter(100).setRegParam(0.001)
val pipeline = new
Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder,
devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
hourEnc
You can use this function to remove the header from your dataset (applicable
to RDDs):
import org.apache.spark.rdd.RDD

def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    // Only the first partition contains the header; skip one line there and
    // pass every other partition through untouched.
    if (idx == 0) lines.drop(1) else lines
  })
}
Abhi
On Wed, Apr 27, 2016 at 12:5
Hi All,
I am trying to build a logistic regression pipeline in ML.
How can I clear the threshold, which by default is 0.5? In mllib I am able
to clear the threshold to get the raw predictions using the
model.clearThreshold() function.
Regards,
Abhi
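spark.ml has no clearThreshold(); a hedged alternative is to read the raw scores that transform() already appends (the default rawPrediction and probability columns), or to set the threshold explicitly with setThreshold. A sketch, assuming a fitted LogisticRegressionModel named model and a test DataFrame testDf:

// The raw margins and class probabilities are available regardless of the
// 0.5 prediction threshold.
DataFrame scored = model.transform(testDf);
scored.select("rawPrediction", "probability", "prediction").show();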
Hi ,
Needed inputs for a couple of issues that I am facing in my production
environment.
I am using Spark Streaming on Spark version 1.4.0.
1) It so happens that the worker is lost on a machine and the executor
still shows up in the executors tab in the UI.
Even when I kill a worker using kill -9
What exactly is the timeout in mapWithState?
I want the keys to get removed from memory if there is no data
received on that key for 10 minutes.
How can I achieve this with mapWithState?
Regards,
Abhi
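A sketch of the idle timeout, assuming a JavaPairDStream<String, Integer> and a mapping function mappingFunc from your job: a key whose state has not been updated for 10 minutes becomes eligible for removal, and the mapping function is invoked one last time for it with state.isTimingOut() returning true.

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDStream =
    pairDStream.mapWithState(
        StateSpec.function(mappingFunc).timeout(Durations.minutes(10)));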
(SingleThreadEventExecutor.java:116)
... 1 more
Cheers !!
Abhi
On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand
wrote:
> This is what I am getting in the executor logs
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
Yu wrote:
> Can you show the stack trace ?
>
> The log message came from
> DiskBlockObjectWriter#revertPartialWritesAndClose().
> Unfortunately, the method doesn't throw exception, making it a bit hard
> for caller to know of the disk full condition.
>
> On Thu, Mar 3
Hi,
Why is it that when the disk space is full on one of the workers, the
executor on that worker becomes unresponsive and the jobs on that worker
fail with the following exception?
16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e
I have a spark streaming job where I am aggregating the data by doing
reduceByKeyAndWindow with an inverse function.
I am keeping the data in memory for up to 2 hours, and in order to output the
reduced data to external storage I conditionally need to push the data
to the DB, say at every 15th minute of
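A minimal sketch of gating the write on the batch time, assuming a JavaPairDStream named windowedStream produced by reduceByKeyAndWindow; the actual DB write is left as a placeholder:

windowedStream.foreachRDD((rdd, time) -> {
  // time is the batch time; only write on batches that fall on a 15-minute boundary.
  long minutesSinceEpoch = time.milliseconds() / (60 * 1000);
  if (minutesSinceEpoch % 15 == 0) {
    // push the reduced data for this batch to the external store here (placeholder)
  }
});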
wrote:
> Sorry that I forgot to tell you that you should also call `rdd.count()`
> for "reduceByKey" as well. Could you try it and see if it works?
>
> On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> I am using mapWithState
our new machines?
>
> On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand
> wrote:
>
>> Any insights on this ?
>>
>> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
>> wrote:
>>
>>> On changing the default compression codec which is snappy to lzf the
stateDStream.stateSnapshots();
>
>
> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> Reposting the code.
>>
>> Basically my use case is something like - I am receiving the web
>> impression logs and may get the noti
Any insights on this ?
On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
wrote:
> On changing the default compression codec which is snappy to lzf the
> errors are gone !!
>
> How can I fix this using snappy as the codec ?
>
> Is there any downside of using lzf as snappy is the
After changing the default compression codec from snappy to lzf, the errors
are gone !!
How can I fix this while keeping snappy as the codec?
Is there any downside to using lzf, since snappy is the default codec that
ships with Spark?
Thanks !!!
Abhi
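For reference, a sketch of how the codec switch is expressed in configuration; the property name is the standard Spark setting, and snappy is the default in this line of releases:

SparkConf conf = new SparkConf().set("spark.io.compression.codec", "lzf");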
On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand
Is there a way to query the JSON (or any other format) data stored in Kafka
using Spark SQL by providing the offset range on each of the brokers ?
I just want to be able to query all the partitions in a SQL manner.
Thanks !
Abhi
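A sketch of one way to do this with the 0.8 direct API in Spark 1.x, assuming JSON string messages and an existing JavaSparkContext jsc, SQLContext sqlContext and kafkaParams map; the topic, partition and offsets are placeholders:

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

// Build a batch RDD from explicit offset ranges, then let Spark SQL infer the schema.
OffsetRange[] ranges = { OffsetRange.create("mytopic", 0, 100L, 200L) };
JavaPairRDD<String, String> kafkaRdd = KafkaUtils.createRDD(
    jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, ranges);
DataFrame json = sqlContext.read().json(kafkaRdd.values());
json.registerTempTable("kafka_json");
sqlContext.sql("SELECT * FROM kafka_json").show();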
at 1:25 AM, Shixiong(Ryan) Zhu wrote:
> Hey Abhi,
>
> Could you post how you use mapWithState? By default, it should do
> checkpointing every 10 batches.
> However, there is a known issue that prevents mapWithState from
> checkpointing in some special cases:
> https://issues.ap
Hi ,
I am getting the following exception when running my spark streaming job.
The same job has been running fine for a long time, but when I added two new
machines to my cluster I see the job failing with the following exception.
16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0
Any Insights on this one ?
Thanks !!!
Abhi
On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand
wrote:
> I am now trying to use mapWithState in the following way using some
> example codes. But, by looking at the DAG it does not seem to checkpoint
> the state and when restarting the ap
I have a spark streaming application running in production. I am trying to
find a solution for a particular use case where my application has a
downtime of, say, 5 hours and is restarted. Now, when I start my streaming
application after 5 hours there would be a considerable amount of data then
in the Ka
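One knob that helps with this situation, as an assumption about the setup rather than a complete answer: cap the per-partition read rate of the direct Kafka stream and enable backpressure, so the backlog is drained over several batches instead of one huge first batch.

SparkConf conf = new SparkConf()
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // records/sec per partition, tune for your load
    .set("spark.streaming.backpressure.enabled", "true");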
Looking for an answer to this.
Is it safe to delete the older files using
find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \;
For a window duration of 2 hours, how old must the files be before we can
delete them?
Thanks.
On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand
wrote:
> Hi All,
>
forward to just use the normal cassandra client
> to save them from the driver.
>
> On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand
> wrote:
>
>> I have a kafka rdd and I need to save the offsets to cassandra table at
>> the begining of each batch.
>>
>> Basi
I have a kafka rdd and I need to save the offsets to a Cassandra table at the
beginning of each batch.
Basically I need to write the offsets, of the type Offsets below that I am
getting inside foreachRDD, to Cassandra. The javaFunctions API to write to
Cassandra needs an RDD. How can I create an RDD from
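A sketch of the suggestion in the reply above: the offsets are tiny, so rather than building an RDD for javaFunctions they can be written from the driver with a plain Cassandra session. The keyspace, table and session variable are assumptions.

import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

directStream.foreachRDD(rdd -> {
  // Extract the offset ranges for this batch from the direct-stream RDD.
  OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
  for (OffsetRange o : offsets) {
    // session: a driver-side com.datastax.driver.core.Session (hypothetical)
    session.execute(
        "INSERT INTO my_keyspace.kafka_offsets (topic, partition, from_offset, until_offset) VALUES (?, ?, ?, ?)",
        o.topic(), o.partition(), o.fromOffset(), o.untilOffset());
  }
  // ...then process rdd as usual...
});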
ince release of 1.6.0
>> e.g.
>> SPARK-12591 NullPointerException using checkpointed mapWithState with
>> KryoSerializer
>>
>> which is in the upcoming 1.6.1
>>
>> Cheers
>>
>> On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand wrote:
>>
Hi All,
Any ideas on this one ?
The size of this directory keeps on growing.
I can see there are many files from a day earlier too.
Cheers !!
Abhi
On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand
wrote:
> Hi Adrian,
>
> I am running spark in standalone mode.
>
> The spark ve
there. Is there any
other work around ?
Cheers!!
Abhi
On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu
wrote:
> Looks like mapWithState could help you?
> On 11 Feb 2016 8:40 p.m., "Abhishek Anand"
> wrote:
>
>> Hi All,
>>
>> I have an use case like follows
Hi All,
I have a use case as follows in my production environment, where I am
listening from kafka with a slideInterval of 1 min and a windowLength of 2
hours.
I have a JavaPairDStream where for each key I am getting the same key but
with a different value, which might appear in the same batch or some
Any insights on this ?
On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand
wrote:
> Hi All,
>
> Can someone help me with the following doubts regarding checkpointing :
>
> My code flow is something like follows ->
>
> 1) create direct stream from kafka
> 2) repartition k
Hi All,
Can someone help me with the following doubts regarding checkpointing :
My code flow is something like follows ->
1) create direct stream from kafka
2) repartition kafka stream
3) mapToPair followed by reduceByKey
4) filter
5) reduceByKeyAndWindow without the inverse function
6) writ
sues.apache.org/jira/browse/SPARK-10975
> With spark >= 1.6:
> https://issues.apache.org/jira/browse/SPARK-12430
> and also be aware of:
> https://issues.apache.org/jira/browse/SPARK-12583
>
>
> On 25/01/2016 07:14, Abhishek Anand wrote:
>
> Hi All,
>
> How long the s
Hi All,
How long are the shuffle files and data files stored in the block manager
folder of the workers?
I have a spark streaming job with window duration of 2 hours and slide
interval of 15 minutes.
When I execute the following command in my block manager path
find . -type f -cmin +150 -name "
Hi,
Is there a way to fetch the offsets from which spark streaming starts
reading from Kafka when my application starts ?
What I am trying to do is to create an initial RDD with offsets at a
particular time, passed as input from the command line, and the offsets from
where my spark streami
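A sketch of starting the direct stream from explicit offsets with the 0.8 API; how the per-partition offsets for "a particular time" are looked up is left out, jssc and kafkaParams are assumed to exist, and the topic, partition and offset values are placeholders.

import java.util.HashMap;
import java.util.Map;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Per-partition starting offsets, e.g. loaded from your own offset store.
Map<TopicAndPartition, Long> fromOffsets = new HashMap<>();
fromOffsets.put(new TopicAndPartition("mytopic", 0), 12345L);

JavaInputDStream<String> stream = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    String.class, kafkaParams, fromOffsets,
    (MessageAndMetadata<String, String> mmd) -> mmd.message());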
I am trying to use updateStateByKey but am receiving the following error
(Spark version 1.4.0).
Can someone please point out what might be the possible reason for this
error?
*The method
updateStateByKey(Function2<List<V>, Optional<S>, Optional<S>>)
in the type JavaPairDStream<K, V> is not applicable
for the arguments *
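For reference, a minimal update function with the expected shape in Spark 1.4, where the Java API uses Guava's Optional; a mismatch in the Optional class or in the three type parameters is a common cause of this compile error. The String/Integer types here are assumptions.

import java.util.List;
import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;

Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
    (values, state) -> {
      int sum = state.or(0);              // previous state, defaulting to 0
      for (Integer v : values) sum += v;  // fold in this batch's values
      return Optional.of(sum);            // Optional.absent() would drop the key
    };

JavaPairDStream<String, Integer> counts = pairDStream.updateStateByKey(updateFunc);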
> transform that allows you to specify a function with two params - the
> parent RDD and the batch time at which the RDD was generated.
>
> TD
>
> On Thu, Nov 26, 2015 at 1:33 PM, Abhishek Anand
> wrote:
>
>> Hi ,
>>
>> I need to use batch start time in my spark streaming job
Hi ,
I need to use the batch start time in my spark streaming job.
I need the value of the batch start time inside one of the functions that is
called within a flatMap function in Java.
Please suggest how this can be done.
I tried to use the StreamingListener class and set the value of a variable
in
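A sketch of the transform variant mentioned in the reply above: the two-argument transformToPair exposes the batch time, which can be attached to every record before the flatMap stage so the inner functions see it as ordinary data. The stream is assumed to be a JavaPairDStream<String, String>.

import scala.Tuple2;

// Each value becomes (batchTimeMillis, originalValue).
JavaPairDStream<String, Tuple2<Long, String>> withBatchTime =
    stream.transformToPair((rdd, time) ->
        rdd.mapToPair(r -> new Tuple2<>(r._1(), new Tuple2<>(time.milliseconds(), r._2()))));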
Hi ,
I need to get the batch time of the active batches which appear on the
Spark Streaming tab of the UI.
How can this be achieved in Java ?
BR,
Abhi
Hi ,
I am using spark streaming to write the aggregated output as Parquet files
to HDFS using SaveMode.Append. I have an external table created like :
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:"
);
I had an impression that in case o