Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning IDs to each key explicitly.
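A rough sketch of that approach (PySpark; the names and details here are
illustrative, not the exact code):

# Collect the distinct keys, give each one its own partition id, and
# partition with an explicit lookup instead of the default hash.
keys = rdd.keys().distinct().collect()
key_to_id = {k: i for i, k in enumerate(keys)}

partitioned = rdd.partitionBy(len(keys), partitionFunc=lambda k: key_to_id[k])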
Regards
Sumit Chawla
On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:
> On Thu, Jun 21, 2018 at 4:51 PM, Cha
Based on a code read, it looks like Spark does a modulo of the key's hash to
pick the partition. Keys b and c end up mapping to the same value. What's the
best partitioning scheme to deal with this?
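A quick way to see what the default does (a sketch assuming the default
partitionFunc, portable_hash; the partition count is illustrative):

from pyspark.rdd import portable_hash

# The default placement is roughly portable_hash(key) % numPartitions, so two
# distinct keys can land in the same partition.
num_partitions = 4
for k in ['a', 'b', 'c', 'd']:
    print(k, portable_hash(k) % num_partitions)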
Regards
Sumit Chawla
On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit
wrote:
> Hi
>
> I have been trying to th
Hi
I have been trying to do this simple operation. I want to land all values
with one key in the same partition, and not have any other key in that
partition. Is this possible? b and c always end up mixed together in the
same partition.
rdd = sc.parallelize([('a', 5), ('d', 8), ('b
Hi
Anybody got any pointers on this one?
Regards
Sumit Chawla
On Tue, Mar 6, 2018 at 8:58 AM, Chawla,Sumit wrote:
> No, this is the only stack trace I get. I have tried DEBUG but didn't
> notice much of a log change.
>
> Yes, I have tried bumping MaxDirectMemorySize t
this number to appropriate
value.
Regards
Sumit Chawla
On Tue, Mar 6, 2018 at 8:07 AM, Vadim Semenov wrote:
> Do you have a trace? i.e. what's the source of `io.netty.*` calls?
>
> And have you tried bumping `-XX:MaxDirectMemorySize`?
>
> On Tue, Mar 6, 2018 at 12:45 AM, Chawl
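For anyone hitting the same thing, a hedged sketch of bumping the
direct-memory cap discussed above (the 2g value is illustrative; for the
driver this normally has to be set via spark-submit or spark-defaults.conf):

from pyspark import SparkConf, SparkContext

# Netty allocates its buffers off-heap, so -XX:MaxDirectMemorySize is the
# JVM flag that caps them on the executors. Value is illustrative.
conf = SparkConf().set(
    "spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
sc = SparkContext(conf=conf)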
Hi All
I have a job which processes a large dataset. All items in the dataset are
unrelated. To save on cluster resources, I process these items in
chunks. Since chunks are independent of each other, I start and shut down
the Spark context for each chunk. This allows me to keep the DAG smaller a
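A minimal sketch of that pattern (chunks and process_chunk are placeholders):

from pyspark import SparkConf, SparkContext

for chunk in chunks:
    # A fresh context per independent chunk keeps each DAG small;
    # stopping it releases the cluster resources between chunks.
    sc = SparkContext(conf=SparkConf().setAppName("chunk-%s" % chunk))
    try:
        process_chunk(sc, chunk)
    finally:
        sc.stop()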
ly the same thing. The default for that setting is
>> 1h instead of 0. It’s better to have a non-zero default to avoid what
>> you’re seeing.
>>
>> rb
>>
>>
>> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit
>> wrote:
>>
>>> I am seeing a
I am seeing a strange issue. I had a badly behaving slave that failed the
entire job. I have set spark.task.maxFailures to 8 for my job. It seems
all task retries happen on the same slave when a failure occurs. My
expectation was that the task would be retried on a different slave after a
failure, and
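For anyone searching later, a hedged sketch of the blacklist settings being
discussed (property names are from Spark 2.1+; check the version you run):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.task.maxFailures", "8")
        .set("spark.blacklist.enabled", "true")  # keeps retries off the bad node
        .set("spark.blacklist.timeout", "1h"))   # the non-zero default mentioned above
sc = SparkContext(conf=conf)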
Hi All
I have an RDD, which I partition based on some key, and then call sc.runJob
for each partition.
Inside this function, I assign each partition a unique key using the following:
"%s_%s" % (id(part), int(round(time.time())))
This is to make sure that each partition produces separate bookkeeping st
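A hedged variation (illustrative, not the code in use): deriving the key from
the partition index via mapPartitionsWithIndex keeps it unique even when two
partitions start in the same second:

import time
import uuid

def process_partition(index, records):
    # index is unique per partition; the uuid suffix guards against reuse across runs
    bookkeeping_key = "%s_%s_%s" % (index, int(time.time()), uuid.uuid4().hex[:8])
    for record in records:
        yield (bookkeeping_key, record)

result = rdd.mapPartitionsWithIndex(process_partition)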
Would this work for you?

def processRDD(rdd):
    analyzer = ShortTextAnalyzer(root_dir)
    rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

ssc.union(*streams) \
    .filter(lambda x: x[1] is not None) \
    .foreachRDD(lambda rdd: processRDD(rdd))
Regards
Sumit Chawla
On Wed, Dec 2
will also go through
>> mesos, it's better to tune it lower, otherwise mesos could become the
>> bottleneck.
>>
>> spark.task.maxDirectResultSize
>>
>> On Mon, Dec 19, 2016 at 3:23 PM, Chawla,Sumit
>> wrote:
>> > Tim,
>> >
>> &
sticking around solved in Dynamic
> >> Resource Allocation? Is there some timeout after which idle executors can
> >> just shut down and clean up their resources.
> >
> > Yes, that's exactly what dynamic allocation does. But again I have no idea
> > what t
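For reference, a hedged sketch of the dynamic-allocation settings in question
(the timeout value is illustrative; the external shuffle service must also be
running on the agents):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")  # required by dynamic allocation
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))
sc = SparkContext(conf=conf)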
lob/v1.6.3/docs/running-on-mesos.md)
> and spark.task.cpus (https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
>
> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit
> wrote:
>
>> Ah, thanks. Looks like I skipped reading this *"Neither will executors
>>
Dec 19, 2016 at 11:26 AM, Timothy Chen wrote:
>
>> Hi Chawla,
>>
>> One possible reason is that Mesos fine grain mode also takes up cores
>> to run the executor per host, so if you have 20 agents running Fine
>> grained executor it will take up 20 cores while it
ng the
> fine-grained scheduler, and no one seemed too dead-set on keeping it. I'd
> recommend you move over to coarse-grained.
>
> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit
> wrote:
>
>> Hi
>>
>> I am using Spark 1.6. I have one query about Fine Grained
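A hedged sketch of the coarse-grained switch recommended above (property names
as of Spark 1.6 on Mesos; values are illustrative):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.mesos.coarse", "true")  # coarse-grained instead of fine-grained
        .set("spark.cores.max", "48")       # illustrative cap on total cores
        .set("spark.task.cpus", "1"))
sc = SparkContext(conf=conf)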
Hi
I am using Spark 1.6. I have one query about the fine-grained model in Spark.
I have a simple Spark application which transforms A -> B. It's a single-stage
application. The program starts with 48 partitions. When it is running, the
Mesos UI shows 48 tasks and 48 CPUs a
Sorry for hijacking this thread.
@irving, how do you restart a Spark job from a checkpoint?
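In case it is useful, a hedged sketch of restarting from a checkpoint with
StreamingContext.getOrCreate (the path, source and batch interval are
illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///tmp/my-app-checkpoint"  # illustrative path

def create_context():
    sc = SparkContext(appName="my-streaming-app")
    ssc = StreamingContext(sc, batchDuration=10)
    lines = ssc.socketTextStream("localhost", 9999)  # illustrative source
    lines.pprint()
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

# Recovers from the checkpoint if one exists, otherwise builds a fresh context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()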
Regards
Sumit Chawla
On Fri, Dec 16, 2016 at 2:24 AM, Selvam Raman wrote:
> Hi,
>
> Actually my requirement is to read the Parquet file, which has 100 partitions.
> Then I use foreachPartition to read the data
n that you can run
> arbitrary code after all.
>
>
> On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit
> wrote:
>
>> Any suggestions on this one?
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit
>> w
Any suggestions on this one?
Regards
Sumit Chawla
On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit
wrote:
> Hi All
>
> I have a workflow with different steps in my program. Let's say these are
> steps A, B, C, D. Step B produces some temp files on each executor node.
> How can I a
Hi All
I have a workflow with different steps in my program. Let's say these are
steps A, B, C, D. Step B produces some temp files on each executor node.
How can I add another step E which consumes these files?
I understand the easiest choice is to copy all these temp files to a
shared location
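A hedged sketch of the shared-location option (the directory and helpers are
placeholders, assuming a path visible to every executor, e.g. an NFS or HDFS
mount):

import json
import os

SHARED_DIR = "/shared/tmp/step_b"  # placeholder: must be reachable from all executors

def step_b(index, records):
    # Write this partition's output where step E can find it from any node.
    path = os.path.join(SHARED_DIR, "part-%05d.jsonl" % index)
    with open(path, "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    yield path

paths = rdd.mapPartitionsWithIndex(step_b).collect()
step_e_input = sc.textFile(",".join(paths))  # step E consumes the files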
on-relay if it serves
> your need.
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Wed, Dec 7, 2016 at 1:10 AM, Chawla,Sumit
> wrote:
>
>> Any pointers on this?
>
Any pointers on this?
Regards
Sumit Chawla
On Mon, Dec 5, 2016 at 8:30 PM, Chawla,Sumit wrote:
> An example implementation I found is: https://github.com/groupon/spark-metrics
>
> Does anyone have experience using this? I am more interested in something
> for PySpark specif
spend some time reading it, but any quick pointers will be
appreciated.
Regards
Sumit Chawla
On Mon, Dec 5, 2016 at 8:17 PM, Chawla,Sumit wrote:
> Hi Manish
>
> I am specifically looking for something similar to following:
>
> https://ci.apache.org/projects/flink/flink-docs
If you are using Mesos as the resource manager, Mesos exposes metrics as well
> for the running job.
>
> Manish
>
> ~Manish
>
>
>
> On Mon, Dec 5, 2016 at 4:17 PM, Chawla,Sumit
> wrote:
>
>> Hi All
>>
>> I have a long running job which takes hours and hour
Hi All
I have a long running job which takes hours and hours to process data. How
can I monitor the operational efficiency of this job? I am interested in
something like Storm/Flink-style user metrics/aggregators, which I can
monitor while my job is running. Using these metrics I want to monitor
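In case it helps, a hedged, rough stand-in using PySpark accumulators (they
are only readable on the driver, so they are coarser than Flink's live user
metrics; the names and condition are illustrative):

records_seen = sc.accumulator(0)
records_skipped = sc.accumulator(0)

def tag(record):
    records_seen.add(1)
    if record is None:  # placeholder condition
        records_skipped.add(1)
    return record

rdd.map(tag).count()  # accumulators update as the action runs
print("seen=%d skipped=%d" % (records_seen.value, records_skipped.value))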
Hi Sean
Could you please elaborate on how this can be done on a per-partition
basis?
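For what it is worth, a hedged sketch of per-partition initialization (Client
is a placeholder for whatever resource needs initializing on the executor):

def process_partition(records):
    client = Client()  # placeholder: created once per partition, on the executor
    try:
        for record in records:
            yield client.process(record)
    finally:
        client.close()  # cleaned up once the partition is exhausted

result = rdd.mapPartitions(process_partition)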
Regards
Sumit Chawla
On Thu, Oct 27, 2016 at 7:44 AM, Walter rakoff
wrote:
> Thanks for the info Sean.
>
> I'm initializing them in a singleton but Scala objects are evaluated
> lazily.
> So it gets initializ
ation is not being mentioned in your
>> program. But the rest looks good.
>>
>>
>>
>> Output:
>>
>>
>> By the way, I have used the README.md template
>> https://gist.github.com/jxson/1784669
>>
>> Thanks & Regards,
>> Gokula Krishn
+1 Jakob. Thanks for the link
Regards
Sumit Chawla
On Wed, Sep 21, 2016 at 2:54 PM, Jakob Odersky wrote:
> One option would be to use Apache Toree. A quick setup guide can be
> found here https://toree.incubator.apache.org/documentation/user/quick-start
>
> On Wed, Sep 21, 2016 at 2:02 PM,
Hi All
I am trying to test a simple Spark app using Scala.
import org.apache.spark.SparkContext

object SparkDemo {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    // to run in local mode
    val sc = new SparkContext("local", "Simple Ap
How are you producing data? I just tested your code and I can receive the
messages from Kafka.
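If it helps, a hedged sketch of a minimal consumer to confirm messages arrive
(broker address and topic are illustrative; needs the spark-streaming-kafka
package matching your Spark version):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)  # 5-second batches, reusing the existing SparkContext
stream = KafkaUtils.createDirectStream(
    ssc, ["test-topic"], {"metadata.broker.list": "localhost:9092"})
stream.map(lambda kv: kv[1]).pprint()  # print message values each batch

ssc.start()
ssc.awaitTermination()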
Regards
Sumit Chawla
On Sun, Sep 18, 2016 at 7:56 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:
> I am very new to *Spark Streaming* and I am implementing a small exercise
> like sending