Hi All,
I have a JavaPairDStream. I want to insert this dstream into multiple
Cassandra tables on the basis of key. One approach is to filter each key
and then insert it into the corresponding Cassandra table. But this would
call the filter operation "n" times, depending on the number of keys.
Is there any better approach?
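A minimal single-pass sketch, not a tested implementation: route each record inside foreachPartition with a plain DataStax driver session, choosing the table from the key. The contact point, keyspace, key-to-table naming scheme and String key/value types are assumptions, and in real code you would reuse a session rather than build one per partition.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import scala.Tuple2;

pairDStream.foreachRDD(rdd -> {
  rdd.foreachPartition(iter -> {
    // Hypothetical: one session per partition; prefer a shared/pooled session in production.
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("my_keyspace")) {
      while (iter.hasNext()) {
        Tuple2<String, String> record = iter.next();
        String table = "table_" + record._1();   // hypothetical key-to-table mapping
        session.execute("INSERT INTO " + table + " (key, value) VALUES (?, ?)",
            record._1(), record._2());
      }
    }
  });
});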
You can use distinct over your data frame or rdd
>>
>> rdd.distinct
>>
>> It will give you the distinct rows across your data.
>>
>> On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand
>> wrote:
>>
>>> I have an rdd which contains 14 different columns. I need to
I have an rdd which contains 14 different columns. I need to find the
distinct rows across all the columns of the rdd and write them to HDFS.
How can I achieve this?
Is there any distributed data structure that I can use and keep on updating
it as I traverse the new rows ?
Regards,
Abhi
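Following the rdd.distinct suggestion above, a minimal Java sketch, assuming the rows are org.apache.spark.sql.Row and can be rendered as tab-delimited strings; rowsRdd and the output path are placeholders:

// distinct() then runs across whole rows, and the result is written to HDFS.
JavaRDD<String> lines = rowsRdd.map(row -> row.mkString("\t"));
lines.distinct().saveAsTextFile("hdfs:///path/to/distinct-output");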
> for (int i = 0; i < concatColumns.length; i++) {
> concatColumns[i]=df.col(array[i]);
> }
>
> return functions.concat(concatColumns).alias(fieldName);
> }
>
>
>
> On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand
> wrote:
>
>> Hi Nihed,
>>
>>
> for (int i = 0; i < columns.length; i++) {
> selectColumns[i]=df.col(columns[i]);
> }
>
>
> selectColumns[columns.length]=functions.concat(df.col("firstname"),
> df.col("lastname"));
>
> df.select(selectColumns).show();
> --
Hi,
I have a dataframe, say having C0, C1, C2 and so on as columns.
I need to create interaction variables to be taken as input for my program.
For example -
I need to create I1 as the concatenation of C0, C3, C5
Similarly, I2 = concat(C4, C5)
and so on ..
How can I achieve this in my Java code for concatenating these columns?
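A small sketch along the lines of the concat reply quoted above, assuming df already has columns C0..C5 and that I1/I2 are the new interaction columns:

import static org.apache.spark.sql.functions.concat;

DataFrame withInteractions = df
    .withColumn("I1", concat(df.col("C0"), df.col("C3"), df.col("C5")))
    .withColumn("I2", concat(df.col("C4"), df.col("C5")));
withInteractions.show();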
Hi ,
I have a dataframe which I want to convert to a labeled point.
DataFrame labeleddf = model.transform(newdf).select("label","features");
How can I convert this to a LabeledPoint to use in my logistic regression
model?
I could do this in Scala using
val trainData = labeleddf.map(row =>
Labeled
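For the Java side, a minimal sketch, assuming the label column is a double and the features column holds an mllib Vector (which is what the 1.x pipeline stages produce):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

// Column 0 is "label", column 1 is "features" from the select() above.
JavaRDD<LabeledPoint> trainData = labeleddf.javaRDD().map(row ->
    new LabeledPoint(row.getDouble(0), (Vector) row.get(1)));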
I also tried
jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2")
But it's still not working.
Any ideas why it's not working?
Abhi
On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand
wrote:
> My spark streaming checkpoint directory is being written
My spark streaming checkpoint directory is being written to HDFS with the
default replication factor of 3.
In my streaming application, where I am listening from Kafka and setting
dfs.replication = 2 as below, the files are still being written with
replication factor 3.
SparkConf sparkConfig = new
S
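One thing worth trying, as an assumption rather than a verified fix: properties prefixed with spark.hadoop. on the SparkConf are copied into the Hadoop Configuration that Spark uses when writing files, so the replication setting is not limited to the driver-side configuration object.

SparkConf sparkConfig = new SparkConf()
    .setAppName("my-streaming-app")                 // placeholder app name
    .set("spark.hadoop.dfs.replication", "2");      // propagated into the Hadoop conf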
type string) will be one-hot
> encoded automatically.
> So pre-processing like `as.factor` is not necessary, you can directly feed
> your data to the model training.
>
> Thanks
> Yanbo
>
> 2016-05-30 2:06 GMT-07:00 Abhishek Anand :
>
>> Hi ,
>>
>> I want to ru
Hi ,
I want to run the glm variant of SparkR for my data that is in a CSV file.
I see that the glm function in SparkR takes a Spark dataframe as input.
Now, when I read a file from CSV and create a Spark dataframe, how could I
take care of the factor variables/columns in my data?
Do I need t
I am building an ML pipeline for logistic regression.
val lr = new LogisticRegression()
lr.setMaxIter(100).setRegParam(0.001)
val pipeline = new
Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder,
devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
hourEnc
You can use this function to remove the header from your dataset (applicable
to RDDs):
import org.apache.spark.rdd.RDD

def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    // Only the first partition contains the header; skip one line there and
    // pass every other partition through untouched.
    if (idx == 0) lines.drop(1) else lines
  })
}
Abhi
On Wed, Apr 27, 2016 at 12:5
Hi All,
I am trying to build a logistic regression pipeline in ML.
How can I clear the threshold, which by default is 0.5? In mllib I am able
to clear the threshold to get the raw predictions using the
model.clearThreshold() function.
Regards,
Abhi
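spark.ml has no clearThreshold(); a hedged alternative is to read the raw scores that transform() already appends (the default rawPrediction and probability columns), or to set the threshold explicitly with setThreshold. A sketch, assuming a fitted LogisticRegressionModel named model and a test DataFrame testDf:

// The raw margins and class probabilities are available regardless of the
// 0.5 prediction threshold.
DataFrame scored = model.transform(testDf);
scored.select("rawPrediction", "probability", "prediction").show();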
Hi ,
Needed inputs for a couple of issues that I am facing in my production
environment.
I am using Spark Streaming on Spark version 1.4.0.
1) It so happens that the worker is lost on a machine and the executor
still shows up in the executors tab in the UI.
Even when I kill a worker using kill -9
What exactly is the timeout in mapWithState?
I want the keys to get removed from memory if there is no data
received on that key for 10 minutes.
How can I achieve this with mapWithState?
Regards,
Abhi
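A sketch of the idle timeout, assuming a JavaPairDStream<String, Integer> and a mapping function mappingFunc from your job: a key whose state has not been updated for 10 minutes becomes eligible for removal, and the mapping function is invoked one last time for it with state.isTimingOut() returning true.

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDStream =
    pairDStream.mapWithState(
        StateSpec.function(mappingFunc).timeout(Durations.minutes(10)));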
(SingleThreadEventExecutor.java:116)
... 1 more
Cheers !!
Abhi
On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand
wrote:
> This is what I am getting in the executor logs
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
Yu wrote:
> Can you show the stack trace ?
>
> The log message came from
> DiskBlockObjectWriter#revertPartialWritesAndClose().
> Unfortunately, the method doesn't throw exception, making it a bit hard
> for caller to know of the disk full condition.
>
> On Thu, Mar 3
Hi,
Why is it that when the disk space is full on one of the workers, the
executor on that worker becomes unresponsive and the jobs on that worker
fail with the following exception?
16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e
I have a spark streaming job where I am aggregating the data by doing
reduceByKeyAndWindow with an inverse function.
I am keeping the data in memory for up to 2 hours, and in order to output the
reduced data to external storage I conditionally need to push the data
to the DB, say at every 15th minute of
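A minimal sketch of gating the write on the batch time, assuming a JavaPairDStream named windowedStream produced by reduceByKeyAndWindow; the actual DB write is left as a placeholder:

windowedStream.foreachRDD((rdd, time) -> {
  // time is the batch time; only write on batches that fall on a 15-minute boundary.
  long minutesSinceEpoch = time.milliseconds() / (60 * 1000);
  if (minutesSinceEpoch % 15 == 0) {
    // push the reduced data for this batch to the external store here (placeholder)
  }
});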
wrote:
> Sorry that I forgot to tell you that you should also call `rdd.count()`
> for "reduceByKey" as well. Could you try it and see if it works?
>
> On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> I am using mapWithState
our new machines?
>
> On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand
> wrote:
>
>> Any insights on this ?
>>
>> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
>> wrote:
>>
>>> On changing the default compression codec which is snappy to lzf the
stateDStream.stateSnapshots();
>
>
> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> Reposting the code.
>>
>> Basically my use case is something like - I am receiving the web
>> impression logs and may get the noti
Any insights on this ?
On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
wrote:
> On changing the default compression codec which is snappy to lzf the
> errors are gone !!
>
> How can I fix this using snappy as the codec ?
>
> Is there any downside of using lzf as snappy is the
After changing the default compression codec from snappy to lzf, the errors
are gone !!
How can I fix this while keeping snappy as the codec?
Is there any downside to using lzf, since snappy is the default codec that
ships with Spark?
Thanks !!!
Abhi
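For reference, a sketch of how the codec switch is expressed in configuration; the property name is the standard Spark setting, and snappy is the default in this line of releases:

SparkConf conf = new SparkConf().set("spark.io.compression.codec", "lzf");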
On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand
Is there a way to query the JSON (or any other format) data stored in Kafka
using Spark SQL by providing the offset range on each of the brokers ?
I just want to be able to query all the partitions in a SQL manner.
Thanks !
Abhi
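A sketch of one way to do this with the 0.8 direct API in Spark 1.x, assuming JSON string messages and an existing JavaSparkContext jsc, SQLContext sqlContext and kafkaParams map; the topic, partition and offsets are placeholders:

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

// Build a batch RDD from explicit offset ranges, then let Spark SQL infer the schema.
OffsetRange[] ranges = { OffsetRange.create("mytopic", 0, 100L, 200L) };
JavaPairRDD<String, String> kafkaRdd = KafkaUtils.createRDD(
    jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, ranges);
DataFrame json = sqlContext.read().json(kafkaRdd.values());
json.registerTempTable("kafka_json");
sqlContext.sql("SELECT * FROM kafka_json").show();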
at 1:25 AM, Shixiong(Ryan) Zhu wrote:
> Hey Abhi,
>
> Could you post how you use mapWithState? By default, it should do
> checkpointing every 10 batches.
> However, there is a known issue that prevents mapWithState from
> checkpointing in some special cases:
> https://issues.ap
Hi ,
I am getting the following exception when running my spark streaming job.
The same job has been running fine for a long time, but when I added two new
machines to my cluster I see the job failing with the following exception.
16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0
Any Insights on this one ?
Thanks !!!
Abhi
On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand
wrote:
> I am now trying to use mapWithState in the following way using some
> example codes. But, by looking at the DAG it does not seem to checkpoint
> the state and when restarting the ap
I have a spark streaming application running in production. I am trying to
find a solution for a particular use case where my application has a
downtime of, say, 5 hours and is restarted. Now, when I start my streaming
application after 5 hours there would be a considerable amount of data then
in the Ka
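One knob that helps with this situation, as an assumption about the setup rather than a complete answer: cap the per-partition read rate of the direct Kafka stream and enable backpressure, so the backlog is drained over several batches instead of one huge first batch.

SparkConf conf = new SparkConf()
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // records/sec per partition, tune for your load
    .set("spark.streaming.backpressure.enabled", "true");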
Looking for an answer to this.
Is it safe to delete the older files using
find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \;
For a window duration of 2 hours, how old must the files be before we can
delete them?
Thanks.
On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand
wrote:
> Hi All,
>
forward to just use the normal cassandra client
> to save them from the driver.
>
> On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand
> wrote:
>
>> I have a kafka rdd and I need to save the offsets to cassandra table at
>> the begining of each batch.
>>
>> Basi
I have a kafka rdd and I need to save the offsets to a Cassandra table at the
beginning of each batch.
Basically I need to write the offsets, of the type Offsets below that I am
getting inside foreachRDD, to Cassandra. The javaFunctions API to write to
Cassandra needs an RDD. How can I create an RDD from
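A sketch of the suggestion in the reply above: the offsets are tiny, so rather than building an RDD for javaFunctions they can be written from the driver with a plain Cassandra session. The keyspace, table and session variable are assumptions.

import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

directStream.foreachRDD(rdd -> {
  // Extract the offset ranges for this batch from the direct-stream RDD.
  OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
  for (OffsetRange o : offsets) {
    // session: a driver-side com.datastax.driver.core.Session (hypothetical)
    session.execute(
        "INSERT INTO my_keyspace.kafka_offsets (topic, partition, from_offset, until_offset) VALUES (?, ?, ?, ?)",
        o.topic(), o.partition(), o.fromOffset(), o.untilOffset());
  }
  // ...then process rdd as usual...
});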
ince release of 1.6.0
>> e.g.
>> SPARK-12591 NullPointerException using checkpointed mapWithState with
>> KryoSerializer
>>
>> which is in the upcoming 1.6.1
>>
>> Cheers
>>
>> On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand wrote:
>>
Hi All,
Any ideas on this one ?
The size of this directory keeps on growing.
I can see there are many files from a day earlier too.
Cheers !!
Abhi
On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand
wrote:
> Hi Adrian,
>
> I am running spark in standalone mode.
>
> The spark ve
there. Is there any
other work around ?
Cheers!!
Abhi
On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu
wrote:
> Looks like mapWithState could help you?
> On 11 Feb 2016 8:40 p.m., "Abhishek Anand"
> wrote:
>
>> Hi All,
>>
>> I have an use case like follows
Hi All,
I have a use case as follows in my production environment, where I am
listening from kafka with a slideInterval of 1 min and a windowLength of 2
hours.
I have a JavaPairDStream where for each key I am getting the same key but
with a different value, which might appear in the same batch or some
Any insights on this ?
On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand
wrote:
> Hi All,
>
> Can someone help me with the following doubts regarding checkpointing :
>
> My code flow is something like follows ->
>
> 1) create direct stream from kafka
> 2) repartition k
Hi All,
Can someone help me with the following doubts regarding checkpointing :
My code flow is something like follows ->
1) create direct stream from kafka
2) repartition kafka stream
3) mapToPair followed by reduceByKey
4) filter
5) reduceByKeyAndWindow without the inverse function
6) writ
sues.apache.org/jira/browse/SPARK-10975
> With spark >= 1.6:
> https://issues.apache.org/jira/browse/SPARK-12430
> and also be aware of:
> https://issues.apache.org/jira/browse/SPARK-12583
>
>
> On 25/01/2016 07:14, Abhishek Anand wrote:
>
> Hi All,
>
> How long the s
Hi All,
How long are the shuffle files and data files stored in the block manager
folder of the workers?
I have a spark streaming job with window duration of 2 hours and slide
interval of 15 minutes.
When I execute the following command in my block manager path
find . -type f -cmin +150 -name "
Hi,
Is there a way to fetch the offsets from which spark streaming starts
reading from Kafka when my application starts ?
What I am trying to do is to create an initial RDD with offsets at a
particular time, passed as input from the command line, and the offsets from
where my spark streami
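A sketch of starting the direct stream from explicit offsets with the 0.8 API; how the per-partition offsets for "a particular time" are looked up is left out, jssc and kafkaParams are assumed to exist, and the topic, partition and offset values are placeholders.

import java.util.HashMap;
import java.util.Map;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Per-partition starting offsets, e.g. loaded from your own offset store.
Map<TopicAndPartition, Long> fromOffsets = new HashMap<>();
fromOffsets.put(new TopicAndPartition("mytopic", 0), 12345L);

JavaInputDStream<String> stream = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    String.class, kafkaParams, fromOffsets,
    (MessageAndMetadata<String, String> mmd) -> mmd.message());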
I am trying to use updateStateByKey but am receiving the following error
(Spark version 1.4.0).
Can someone please point out what might be the possible reason for this
error?
*The method
updateStateByKey(Function2<List<V>, Optional<S>, Optional<S>>)
in the type JavaPairDStream<K, V> is not applicable
for the arguments *
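For reference, a minimal update function with the expected shape in Spark 1.4, where the Java API uses Guava's Optional; a mismatch in the Optional class or in the three type parameters is a common cause of this compile error. The String/Integer types here are assumptions.

import java.util.List;
import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;

Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
    (values, state) -> {
      int sum = state.or(0);              // previous state, defaulting to 0
      for (Integer v : values) sum += v;  // fold in this batch's values
      return Optional.of(sum);            // Optional.absent() would drop the key
    };

JavaPairDStream<String, Integer> counts = pairDStream.updateStateByKey(updateFunc);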
> transform that allows you to specify a function with two params - the
> parent RDD and the batch time at which the RDD was generated.
>
> TD
>
> On Thu, Nov 26, 2015 at 1:33 PM, Abhishek Anand
> wrote:
>
>> Hi ,
>>
>> I need to use batch start time in my spark streaming job
Hi ,
I need to use the batch start time in my spark streaming job.
I need the value of the batch start time inside one of the functions that is
called within a flatMap function in Java.
Please suggest how this can be done.
I tried to use the StreamingListener class and set the value of a variable
in
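A sketch of the transform variant mentioned in the reply above: the two-argument transformToPair exposes the batch time, which can be attached to every record before the flatMap stage so the inner functions see it as ordinary data. The stream is assumed to be a JavaPairDStream<String, String>.

import scala.Tuple2;

// Each value becomes (batchTimeMillis, originalValue).
JavaPairDStream<String, Tuple2<Long, String>> withBatchTime =
    stream.transformToPair((rdd, time) ->
        rdd.mapToPair(r -> new Tuple2<>(r._1(), new Tuple2<>(time.milliseconds(), r._2()))));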
Hi ,
I need to get the batch time of the active batches which appear on the
Spark Streaming tab of the UI.
How can this be achieved in Java ?
BR,
Abhi
Hi ,
I am using spark streaming to write the aggregated output as Parquet files
to HDFS using SaveMode.Append. I have an external table created like :
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:"
);
I had an impression that in case o