@Ángel Álvarez Pascua
Thanks; however, I am thinking of another solution that does not involve
saving the dataframe result. I will update this thread with details soon.
@daniel williams
Thanks, I will surely check spark-testing-base out.
Regards,
Abhishek Singla
On Thu, Apr 17, 2025 at 11
am missing something?
Regards,
Abhishek Singla
On Thu, Apr 17, 2025 at 6:25 AM daniel williams
wrote:
> The contract between the two is a dataset, yes; but did you construct the
> former via readStream? If not, it’s still batch.
>
> -dan
>
>
> On Wed, Apr 16, 2025 at 4:54 PM Abh
failures. I wanted to know if there is an
existing way in spark batch to checkpoint already processed rows of a
partition if using foreachPartition or mapPartitions, so that they are not
processed again on rescheduling of task due to failure or retriggering of
job due to failures.
Regards,
Abhish
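Spark batch has no built-in row-level checkpoint inside foreachPartition/mapPartitions; the usual workaround is an idempotent write plus an external per-partition progress record. A minimal Scala sketch, assuming a hypothetical ProgressStore and a deterministic row order per partition (neither is a Spark API):

import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical external progress store (e.g. a DB table keyed by job/partition);
// not a Spark or Kafka API. It must be reachable and serializable on executors.
trait ProgressStore extends Serializable {
  def processedCount(jobId: String, partition: Int): Long
  def markProcessed(jobId: String, partition: Int, count: Long): Unit
}

// Skip rows that a previous (failed) attempt of the same task already wrote.
// This only yields exactly-once *effects* if the write itself is idempotent and
// the partition's row order is stable across retries.
def writeIdempotently(df: DataFrame, jobId: String, store: ProgressStore)
                     (write: Row => Unit): Unit =
  df.rdd.foreachPartition { rows =>
    val pid  = TaskContext.getPartitionId()
    val done = store.processedCount(jobId, pid)
    rows.zipWithIndex.foreach { case (row, i) =>
      if (i >= done) {
        write(row)
        store.markProcessed(jobId, pid, i + 1L)
      }
    }
  }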
:
"org.apache.kafka.common.serialization.ByteArraySerializer"
}
"linger.ms" and "batch.size" are successfully set in Producer Config
(verified via startup logs) and are being honored.
kafka connector?
Regards,
Abhishek Singla
offset so that those 100 should not get processed again.
Regards,
Abhishek Singla
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 6
transactional.id = null
value.serializer = class
org.apache.kafka.common.serialization.ByteArraySerializer
On Wed, Apr 16, 2025 at 5:55 PM Abhishek Singla
wrote:
> Hi Daniel and Jungtaek,
>
> I am using Spark
10 records and they are flushed to kafka server immediately,
however kafka producer behaviour when publishing via kafka-clients using
foreachPartition is as expected. Am I missing something here or is
throttling not supported in the kafka connector?
Regards,
Abhishek Singla
On Thu, Mar 27, 2025 at 4:56
Isn't there a way to do it with the kafka connector instead of the kafka client?
Isn't there any way to throttle the kafka connector? It seems like a common
problem.
Regards,
Abhishek Singla
On Mon, Feb 24, 2025 at 7:24 PM daniel williams
wrote:
> I think you should be using a foreachPa
Config:
ProducerConfig values at runtime that they are not being set. Is there
something I am missing? Is there any other way to throttle kafka writes?
dataset.write().format("kafka").options(options).save();
Regards,
Abhishek Singla
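One thing worth checking against the thread above: the spark-sql-kafka connector only forwards producer properties whose option keys carry the kafka. prefix, and linger.ms/batch.size shape producer batching rather than providing end-to-end throttling. A minimal Scala sketch with illustrative broker, topic, and sizes; the dataset is assumed to already expose a value column (and optionally key):

import org.apache.spark.sql.DataFrame

def writeToKafka(dataset: DataFrame): Unit =
  dataset.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092") // illustrative address
    .option("kafka.linger.ms", "1000")                 // producer-side buffering window
    .option("kafka.batch.size", "1048576")             // producer batches of up to 1 MiB
    .option("topic", "my-topic")                       // illustrative topic
    .save()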
allowing it to perform
the real task.
Thanks Abhishek
Sent from Yahoo Mail for iPhone
Hi Team,
Could someone provide some insights into this issue?
Regards,
Abhishek Singla
On Wed, Jan 17, 2024 at 11:45 PM Abhishek Singla <
abhisheksingla...@gmail.com> wrote:
> Hi Team,
>
> Version: 3.2.2
> Java Version: 1.8.0_211
> Scala Version: 2.12.15
> Cluster: S
ionId, appConfig))
.option("checkpointLocation", appConfig.getChk().getPath())
.start()
.awaitTermination();
Regards,
Abhishek Singla
t:7077",
"spark.app.name": "app",
"spark.sql.streaming.kafka.useDeprecatedOffsetFetching": false,
"spark.sql.streaming.metricsEnabled": true
}
But these configs do not seem to be working as I can see Spark processing
batches of 3k-15k immediately one after another. Is there something I am
missing?
Ref:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Regards,
Abhishek Singla
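For the batch-size question above, rate limiting for the Kafka source is a reader option rather than a spark.* configuration entry. A minimal Scala sketch, assuming the intent is to cap each micro-batch and pace triggers; brokers, topic, path, and numbers are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("app").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .option("maxOffsetsPerTrigger", "1000")        // cap records fetched per micro-batch
  .load()

val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds")) // pace the micro-batches
  .option("checkpointLocation", "/tmp/chk")      // illustrative path
  .start()
query.awaitTermination()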
Hi Dongjoon Hyun,
Any inputs on the below issue would be helpful. Please let us know if we're
missing anything.
Thanks and Regards,
Abhishek
From: Patidar, Mohanlal (Nokia - IN/Bangalore)
Sent: Thursday, January 20, 2022 11:58 AM
To: user@spark.apache.org
Subject: Suspected SPAM
usage for RDD interfaces?
PS: The question is posted in stackoverflow as well: Link
<https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds>
Regards,
-
Abhishek Shakya
Senior Data Scientist 1,
Contact: +919002319890 | Em
aceImpl.java:434)
[error] at
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2788)
[error] ... 5 more
I am currently using spark-core-3.1.1.jar with hadoop-azure-3.2.2.jar, but
the same issue also occurs with hadoop-azure-3.3.1.jar. Please
advise how I should solve this issue.
Thanks,
Abhishek
Hi Sean,
Thanks for the quick response. We’ll look into this.
Thanks and Regards,
Abhishek
From: Sean Owen
Sent: Wednesday, June 30, 2021 6:30 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: User
Subject: Re: Inclusive terminology usage in Spark
This was covered and mostly done last year
ement ticket to track this.
https://issues.apache.org/jira/browse/SPARK-35952
Thanks and Regards,
Abhishek
Hi Maziyar, Mich
Do we have any ticket to track this? Any idea if this is going to be fixed in
3.1.2?
Thanks and Regards,
Abhishek
From: Mich Talebzadeh
Sent: Friday, April 9, 2021 2:11 PM
To: Maziyar Panahi
Cc: User
Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x
Hi,
Regarding
0,
"op_is_file" : 2,
"S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
},
"diagnostics" : {
"fs.s3a.metadatastore.impl" :
"org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
"fs.s3a.committer.magic.en
Hi All,
I had a question about modeling a user-session kind of analytics use case
in Spark Structured Streaming. Is there a way to model something like this
using arbitrary stateful Spark streaming?
User session -> reads a few FAQs on a website and then decides to create a
ticket or not
FAQ Deflec
were seeing discrepancy in query execution time on S3 with Spark 3.0.0.
Thanks and Regards,
Abhishek
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Yeah… Not sure if I’m missing any configuration that is causing this issue.
Any suggestions?
Thanks and Regards,
Abhishek
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking
Hi Gourav,
Yes. We’re using s3a.
Thanks and Regards,
Abhishek
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi,
are you using
GB of data whereas in case of HDFS, it
is only 4.5 GB.
Any idea why this difference is there?
Thanks and Regards,
Abhishek
From: Luca Canali
Sent: Monday, August 24, 2020 7:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: RE: Spark 3.0 using S3 taking long time fo
,
Abhishek
From: Subash K
Sent: Monday, June 22, 2020 9:00 AM
To: user@spark.apache.org
Subject: Spark Thrift Server in Kubernetes deployment
Hi,
We are currently using Spark 2.4.4 with Spark Thrift Server (STS) to expose a
JDBC interface to the reporting tools to generate report from Spark tables.
Now
overlay network.
Thanks and Regards,
Abhishek
From: manish gupta
Sent: 01 October 2019 PM 09:20
To: Prudhvi Chennuru (CONT)
Cc: user
Subject: Re: [External Sender] Spark Executor pod not getting created on
kubernetes cluster
Kube-api server logs are not enabled. I will enable and check and get back
node) for now. There is an option of using NodePort as well. That also works.
Thanks and Regards,
Abhishek
From: Yaniv Harpaz
Sent: Tuesday, August 27, 2019 7:34 PM
To: user@spark.apache.org
Subject: web access to sparkUI on docker or k8s pods
hello guys,
when I launch driver pods or even when I use
I realised that the build instructions in the README.md were not very clear
due to some recent changes. I have updated those now.
Thanks,
Abhishek Somani
On Sun, Jul 28, 2019 at 7:53 AM naresh Goud
wrote:
> Thanks Abhishek.
> I will check it out.
>
> Thank you,
> Naresh
>
u can just use
it as:
spark-shell --packages qubole:spark-acid:0.4.0-s_2.11
...and it will be automatically fetched and used.
Thanks,
Abhishek
On Sun, Jul 28, 2019 at 4:42 AM naresh Goud
wrote:
> It looks there is some internal dependency missing.
>
> libraryDependencies ++= Seq(
>
Hey Naresh,
Thanks for your question. Yes it will work!
Thanks,
Abhishek Somani
On Fri, Jul 26, 2019 at 7:08 PM naresh Goud
wrote:
> Thanks Abhishek.
>
> Will it work on hive acid table which is not compacted ? i.e table having
> base and delta files?
>
> Let’s say hive a
e
tables via Spark as well.
The datasource is also available as a spark package, and instructions on
how to use it are available on the Github page
<https://github.com/qubole/spark-acid>.
We welcome your feedback and suggestions.
Thanks,
Abhishek Somani
case. You could try to
build the container by placing the log4j.properties at some other location and
set the same in spark.driver.extraJavaOptions
Thanks and Regards,
Abhishek
From: Dave Jaffe
Sent: Tuesday, June 11, 2019 6:45 AM
To: user@spark.apache.org
Subject: Spark on Kubernetes
let me know if it is the expected behavior ?
Regards,
Abhishek Jain
for driver and executor/s)
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties"
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties"
Regards,
Abhishek Jain
From: em...@yeikel.com
Sent: Friday, February 15, 201
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/
log4j.logger.org.apache.spark=
This means log4j will roll the log file at 50 MB and keep only the 5 most recent
files. These files are saved in the /var/log/spark directory, with the filename
specified above.
Regards,
Abhishek Jain
From
=,
log4j.logger.org.apache.parquet= etc..
These properties can be set in the conf/log4j.properties file.
Hope this helps! 😊
Regards,
Abhishek Jain
From: Deepak Sharma
Sent: Thursday, February 14, 2019 12:10 PM
To: spark users
Subject: Spark streaming filling the disk with logs
Hi All
I am running a spark
spark.eventLog.dir
Thanks and Regards,
Abhishek
From: Battini Lakshman
Sent: Wednesday, January 23, 2019 1:55 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Subject: Re: Spark UI History server on Kubernetes
Hi Abhishek,
Thank you for your response. Could you please let me know the properties you
configured
Hi,
We’ve set up the spark-history service (based on spark 2.4) on K8S. The UI works
perfectly fine when running on NodePort. We’re facing some issues when running
on ingress.
Please let us know what kind of inputs you need.
Thanks and Regards,
Abhishek
From: Battini Lakshman
Sent: Tuesday, January 22
/6f838adf6651491bd4f263956f403c74
Thanks.
Best Regards,
*Abhishek Tripath*
On Thu, Jul 19, 2018 at 10:02 AM Abhishek Tripathi
wrote:
> Hello All!
> I am using spark 2.3.1 on kubernetes to run a structured streaming spark
> job which read stream from Kafka , perform some window aggregation and
> output s
1 (both topics have 20 partitions and get
almost 5k records/s)
Hadoop version (using hdfs for checkpointing): 2.7.2
Thank you for any help.
Best Regards,
*Abhishek Tripathi*
ata-sources/sql-databases.html>The
problem I am facing is that I don't have a numeric column that can be used
for partitioning.
Any help would be appreciated.
Thank You
--Abhishek
t; machine configuration).
>>
>> I really don't understand why this is happening since the same
>> configuration but using a Spark 2.0.0 is running fine within Vagrant.
>> Could someone please help?
>>
>> thanks in advance,
>> Richard
>>
>
Hi All,
I have a JavaPairDStream. I want to insert this dstream into multiple
cassandra tables on the basis of key. One approach is to filter each key
and then insert it into cassandra table. But this would call filter
operation "n" times depending on the number of keys.
Is there any better appro
Can it handle state that is larger than what memory will hold?
.
Thanks,
Abhishek
I have tried with hdfs/tmp location but it didn't work. Same error.
On 23 Sep 2016 19:37, "Aditya" wrote:
> Hi Abhishek,
>
> Try below spark submit.
> spark-submit --master yarn --deploy-mode cluster --files hdfs://
> abc.com:8020/tmp/abc.drl --class com.abc.Star
wrong, please help to correct it.
Aditya:
I have attached the code here for reference. The --files option will distribute
the reference file to all nodes, but the Kie session is not able to pick it up.
Thanks,
Abhishek
On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran
wrote:
>
> On 23 Sep 2016, at 08:33, ABHI
pl.java:58)
... 19 more
--
Cheers,
Abhishek
n use distinct over you data frame or rdd
>>
>> rdd.distinct
>>
>> It will give you distinct across your row.
>>
>> On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand
>> wrote:
>>
>>> I have an rdd which contains 14 different columns. I need to
I have an rdd which contains 14 different columns. I need to find the
distinct across all the columns of the rdd and write it to hdfs.
How can I achieve this?
Is there any distributed data structure that I can use and keep on updating
it as I traverse the new rows?
Regards,
Abhi
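As the reply above suggests, a whole-row distinct is enough; a minimal Scala sketch, with an illustrative output path:

import org.apache.spark.rdd.RDD

// Distinct over whole rows relies on the row type's equals/hashCode
// (tuples, case classes, or a delimited String all work).
def writeDistinct(rows: RDD[String], outputPath: String): Unit =
  rows.distinct().saveAsTextFile(outputPath) // e.g. "hdfs:///user/abhi/distinct-out"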
up incorrect path. Did anyone encounter a similar
problem with spark 2.0?
With Thanks,
Abhishek
catColumns.length; i++) {
> concatColumns[i]=df.col(array[i]);
> }
>
> return functions.concat(concatColumns).alias(fieldName);
> }
>
>
>
> On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand
> wrote:
>
>> Hi Nihed,
>>
>>
< columns.length; i++) {
> selectColumns[i]=df.col(columns[i]);
> }
>
>
> selectColumns[columns.length]=functions.concat(df.col("firstname"),
> df.col("lastname"));
>
> df.select(selectColumns).show();
> --
Hi,
I have a dataframe with, say, C0, C1, C2 and so on as columns.
I need to create interaction variables to be taken as input for my program.
For example:
I need to create I1 as the concatenation of C0, C3, C5
Similarly, I2 = concat(C4, C5)
and so on ..
How can I achieve this in my Java code for conca
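A minimal sketch of building such interaction columns with functions.concat, written in Scala for brevity (the Java Column API mirrors it); the column names are the ones from the example above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.concat

// Add one interaction column built by concatenating an arbitrary list of columns.
def withInteraction(df: DataFrame, name: String, cols: Seq[String]): DataFrame =
  df.withColumn(name, concat(cols.map(df(_)): _*))

// e.g. withInteraction(withInteraction(df, "I1", Seq("C0", "C3", "C5")), "I2", Seq("C4", "C5"))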
Hi ,
I have a dataframe which I want to convert to labeled points.
DataFrame labeleddf = model.transform(newdf).select("label","features");
How can I convert this to a LabeledPoint to use in my Logistic Regression
model?
I could do this in scala using
val trainData = labeleddf.map(row =>
Labeled
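A minimal Scala sketch of that mapping for the Spark 1.x APIs used here, assuming the label is column 0 and the features column holds an mllib Vector:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Map each (label, features) row of the transformed DataFrame to a LabeledPoint.
def toLabeledPoints(labeleddf: DataFrame): RDD[LabeledPoint] =
  labeleddf.rdd.map(row => LabeledPoint(row.getDouble(0), row.getAs[Vector](1)))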
I also tried
jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2")
But it's still not working.
Any ideas why it's not working?
Abhi
On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand
wrote:
> My spark streaming checkpoint directory is being written
My spark streaming checkpoint directory is being written to HDFS with
default replication factor of 3.
In my streaming application, where I am listening to kafka and setting
dfs.replication = 2 as below, the files are still being written with a
replication factor of 3.
SparkConf sparkConfig = new
S
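One hedged alternative worth trying: SparkConf entries prefixed with spark.hadoop. are copied into the Hadoop Configuration that Spark builds, which is one way to get dfs.replication in place before the streaming context (and its checkpoint writer) is created; whether the checkpoint writer honors it may depend on the version. A Scala sketch with an illustrative app name:

import org.apache.spark.SparkConf

// "spark.hadoop."-prefixed entries land in the Hadoop Configuration Spark uses,
// which is where dfs.replication is read from.
val sparkConfig = new SparkConf()
  .setAppName("streaming-app")              // illustrative app name
  .set("spark.hadoop.dfs.replication", "2") // ask HDFS for replication factor 2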
type string) will be one-hot
> encoded automatically.
> So pre-processing like `as.factor` is not necessary, you can directly feed
> your data to the model training.
>
> Thanks
> Yanbo
>
> 2016-05-30 2:06 GMT-07:00 Abhishek Anand :
>
>> Hi ,
>>
>> I want to ru
Hi ,
I want to run the glm variant of SparkR on my data, which is in a csv file.
I see that the glm function in SparkR takes a Spark dataframe as input.
Now, when I read a file from csv and create a Spark dataframe, how could I
take care of the factor variables/columns in my data?
Do I need t
java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with Java?
Thanks
From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) ;
user@spark.apache.org
Subject: RE: GraphX Java API
GraphX APIs are
Hi,
We are trying to consume the Java API for GraphX, but there is no documentation
available online on the usage or examples. It would be great if we could get
some examples in Java.
Thanks and regards,
Abhishek Kumar
Products & Services | iLab
Deloitte Consulting LLP
Block ‘C’, Divya
I am building a ML pipeline for logistic regression.
val lr = new LogisticRegression()
lr.setMaxIter(100).setRegParam(0.001)
val pipeline = new
Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder,
devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
hourEnc
You can use this function to remove the header from your dataset (applicable
to RDDs):
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    // Only the first partition contains the header line, so drop it there.
    if (idx == 0) lines.drop(1) else lines
  })
}
Abhi
On Wed, Apr 27, 2016 at 12:5
Hi All,
I am trying to build a logistic regression pipeline in ML.
How can I clear the threshold, which by default is 0.5? In mllib I am able
to clear the threshold to get the raw predictions using the
model.clearThreshold() function.
Regards,
Abhi
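spark.ml has no clearThreshold(); one hedged workaround is to read the un-thresholded scores that transform() already emits alongside the prediction column. A minimal Scala sketch, assuming the usual label/features columns:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

// The fitted model's transform() output keeps rawPrediction and probability
// columns regardless of the 0.5 threshold applied to "prediction".
def rawScores(train: DataFrame, test: DataFrame): DataFrame = {
  val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.001)
  lr.fit(train).transform(test).select("probability", "rawPrediction")
}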
You should be doing something like this:
data = sc.textFile('file:///path1/path/test1.csv')
header = data.first() #extract header
#print header
data = data.filter(lambda x:x !=header)
#print data
Hope it helps.
Sincerely,
Abhishek
+91-7259028700
From: nihed mbarek [mailto:nihe...
Hi ,
Needed inputs for a couple of issues that I am facing in my production
environment.
I am using spark version 1.4.0 spark streaming.
1) It so happens that the worker is lost on a machine and the executor
still shows up in the executor's tab in the UI.
Even when I kill a worker using kill -9
What exactly is timeout in mapWithState?
I want the keys to get removed from the memory if there is no data
received on that key for 10 minutes.
How can I achieve this in mapWithState?
Regards,
Abhi
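The timeout in StateSpec is an idle timeout per key: a key that receives no new records for the given duration is flagged once with state.isTimingOut() and its state is then dropped. A minimal Scala sketch; the key/value types and the update logic are illustrative:

import org.apache.spark.streaming.{Minutes, State, StateSpec}

val mappingFunc = (key: String, value: Option[Int], state: State[Long]) => {
  if (state.isTimingOut()) {
    (key, state.get())                    // last chance to emit before removal
  } else {
    val total = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(total)
    (key, total)
  }
}
val spec = StateSpec.function(mappingFunc).timeout(Minutes(10))
// stream.mapWithState(spec)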
(SingleThreadEventExecutor.java:116)
... 1 more
Cheers !!
Abhi
On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand
wrote:
> This is what I am getting in the executor logs
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
&
Yu wrote:
> Can you show the stack trace ?
>
> The log message came from
> DiskBlockObjectWriter#revertPartialWritesAndClose().
> Unfortunately, the method doesn't throw exception, making it a bit hard
> for caller to know of the disk full condition.
>
> On Thu, Mar 3
Hi,
Why is it so that when the disk space is full on one of the workers, the
executor on that worker becomes unresponsive and the jobs on that worker
fail with the exception
16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e
I have a spark streaming job where I am aggregating the data by doing
reduceByKeyAndWindow with inverse function.
I am keeping the data in memory for up to 2 hours, and in order to output the
reduced data to an external storage I conditionally need to push the data
to the DB, say at every 15th minute of
wrote:
> Sorry that I forgot to tell you that you should also call `rdd.count()`
> for "reduceByKey" as well. Could you try it and see if it works?
>
> On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> I am using mapWithState
our new machines?
>
> On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand
> wrote:
>
>> Any insights on this ?
>>
>> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
>> wrote:
>>
>>> On changing the default compression codec which is snappy to lzf the
&g
stateDStream.stateSnapshots();
>
>
> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand
> wrote:
>
>> Hi Ryan,
>>
>> Reposting the code.
>>
>> Basically my use case is something like - I am receiving the web
>> impression logs and may get the noti
Any insights on this ?
On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand
wrote:
> On changing the default compression codec which is snappy to lzf the
> errors are gone !!
>
> How can I fix this using snappy as the codec ?
>
> Is there any downside of using lzf as snappy is the
On changing the default compression codec from snappy to lzf, the errors
are gone!
How can I fix this while using snappy as the codec?
Is there any downside to using lzf, given that snappy is the default codec
that ships with spark?
Thanks !!!
Abhi
On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand
Hello All,
If someone has any leads on this please help me.
Sincerely,
Abhishek
From: Mishra, Abhishek
Sent: Wednesday, February 24, 2016 5:11 PM
To: user@spark.apache.org
Subject: LDA topic Modeling spark + python
Hello All,
I am building an LDA model; please guide me.
I
ed status. The topic length
being 2000 and the value of k, or number of words, being 3.
Please, if you can provide me with some link or code base on Spark with
Python, I would be grateful.
Looking forward for a reply,
Sincerely,
Abhishek
alue
grouped=pairs.groupByKey()#grouping values as per key
grouped_val= grouped.map(lambda x : (list(x[1]))).collect()
print grouped_val
Thanks in Advance,
Sincerely,
Abhishek
Is there a way to query the json (or any other format) data stored in kafka
using spark sql by providing the offset range on each of the brokers?
I just want to be able to query all the partitions in a sq manner.
Thanks !
Abhi
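With the batch Kafka source added later in Spark 2.x, offset ranges can be given per topic-partition as JSON and the payload parsed for SQL. A Scala sketch with illustrative brokers, topic, offsets, and schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("kafka-sql").getOrCreate()

val raw = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")            // illustrative
  .option("subscribe", "events")                                 // illustrative topic
  .option("startingOffsets", """{"events":{"0":100,"1":250}}""") // per-partition start
  .option("endingOffsets",   """{"events":{"0":500,"1":900}}""") // per-partition end
  .load()

val schema = StructType(Seq(StructField("user", StringType), StructField("action", StringType)))
raw.select(from_json(col("value").cast("string"), schema).as("j"))
  .select("j.*")
  .createOrReplaceTempView("events")
spark.sql("SELECT action, count(*) FROM events GROUP BY action").show()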
Thank you Everyone.
I am to work on a PoC with 2 types of images; basically these will be two PoCs:
face recognition and map data processing.
I am looking into these links and hopefully will get an idea. Thanks again. Will
post the queries as and when I get doubts.
Sincerely,
Abhishek
From: ndj
at 1:25 AM, Shixiong(Ryan) Zhu wrote:
> Hey Abhi,
>
> Could you post how you use mapWithState? By default, it should do
> checkpointing every 10 batches.
> However, there is a known issue that prevents mapWithState from
> checkpointing in some special cases:
> https://issues.ap
Hi ,
I am getting the following exception on running my spark streaming job.
The same job has been running fine for a long time, but when I added two new
machines to my cluster I see the job failing with the following exception.
16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0
Any Insights on this one ?
Thanks !!!
Abhi
On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand
wrote:
> I am now trying to use mapWithState in the following way using some
> example codes. But, by looking at the DAG it does not seem to checkpoint
> the state and when restarting the ap
Hello,
I am working on image processing samples. Was wondering if anyone has worked on
an image processing project in Spark. Please let me know if any sample project
or example is available.
Please guide in this.
Sincerely,
Abhishek
I have a spark streaming application running in production. I am trying to
find a solution for a particular use case when my application has a
downtime of say 5 hours and is restarted. Now, when I start my streaming
application after 5 hours there would be considerable amount of data then
in the Ka
Looking for an answer to this.
Is it safe to delete the older files using
find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \;
For a window duration of 2 hours, how old do the files need to be before we can
delete them?
Thanks.
On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand
wrote:
> Hi All,
>
forward to just use the normal cassandra client
> to save them from the driver.
>
> On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand
> wrote:
>
>> I have a kafka rdd and I need to save the offsets to cassandra table at
>> the begining of each batch.
>>
>> Basi
PS - I don't get this behaviour in all the cases. I did many runs of the
same job & I get this behaviour in around 40% of the cases.
Task 4 is the bottom row in the metrics table
Thank you,
Abhishek
e: abshkm...@gmail.com
p: 91-8233540996
On Tue, Feb 16, 2016 at 11:19 PM, Abhishek Mo
Darren: this is not the last task of the stage.
Thank you,
Abhishek
e: abshkm...@gmail.com
p: 91-8233540996
On Tue, Feb 16, 2016 at 6:52 PM, Darren Govoni wrote:
> There were some posts in this group about it. Another person also saw the
> deadlock on next to last or last stag
using 20 executors with 1 core for each
executor. The cached rdd has 60 blocks. The problem is that for every 2-3 runs
of the job, there is a task which has an abnormally large deserialisation
time. Screenshot attached.
Thank you,
Abhishek
-
I'm doing a mapPartitions on a rdd cached in memory followed by a reduce.
Here is my code snippet
// myRdd is an rdd consisting of Tuple2[Int,Long]
myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2))
//The rangify function
def rangify(l: Iterator[ Tuple2[Int,Long] ]) : Iterato
I have a kafka rdd and I need to save the offsets to a cassandra table at the
beginning of each batch.
Basically I need to write the offsets of the type Offsets below, which I am
getting inside foreachRDD, to cassandra. The javafunctions api to write to
cassandra needs an rdd. How can I create a rdd from
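A minimal Scala sketch, assuming the spark-streaming-kafka (0.8) direct-stream connector of that era: the driver-side OffsetRange array can be turned into an RDD with sc.parallelize (or, as the reply further up notes, simply written from the driver with a plain Cassandra client):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.OffsetRange

// One row per topic-partition offset range, ready for the Cassandra connector.
case class OffsetRow(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

def offsetsAsRdd(sc: SparkContext, ranges: Array[OffsetRange]): RDD[OffsetRow] =
  sc.parallelize(ranges.map(r => OffsetRow(r.topic, r.partition, r.fromOffset, r.untilOffset)))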
ince release of 1.6.0
>> e.g.
>> SPARK-12591 NullPointerException using checkpointed mapWithState with
>> KryoSerializer
>>
>> which is in the upcoming 1.6.1
>>
>> Cheers
>>
>> On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand > > wrote:
>>
Hi All,
Any ideas on this one?
The size of this directory keeps on growing.
I can see there are many files from a day earlier too.
Cheers !!
Abhi
On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand
wrote:
> Hi Adrian,
>
> I am running spark in standalone mode.
>
> The spark ve
there. Is there any
other workaround?
Cheers!!
Abhi
On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu
wrote:
> Looks like mapWithState could help you?
> On 11 Feb 2016 8:40 p.m., "Abhishek Anand"
> wrote:
>
>> Hi All,
>>
>> I have an use case like follows