Re: Is it possible to rate limit an UDP?

2019-01-09 Thread Ramandeep Singh
Backpressure is the suggested way out here and is the correct approach; it rate limits at the source itself for safety. Imagine a service with throttling enabled: it can outright reject your calls. Even if you split your df, that alone won't achieve your purpose. You can combine that with backpre

Re: Troubleshooting Spark OOM

2019-01-09 Thread Dillon Dukek
I think most Spark technical support people would really recommend upgrading to Spark 2.0+ for starters. However, I understand that's not always possible. In this case, I would double-check to make sure that you don't have a situation where you have a join key that has many records associated with i

Re: Troubleshooting Spark OOM

2019-01-09 Thread William Shen
Thank you for the tips. We are running Spark 1.6 (Scala), and the OOM happens with SparkSQL trying to join a few large datasets together for processing/transformation... On Wed, Jan 9, 2019 at 3:42 PM Ramandeep Singh wrote: > Hi, > > Here are a few suggestions that you can try. > > OOM Issues that, I

Re: Troubleshooting Spark OOM

2019-01-09 Thread Ramandeep Singh
Hi, Here are a few suggestions that you can try. OOM issues that I have faced with Spark: *Not enough shuffle partitions*: Increase them. *Too little memory overhead*: Boost it to around 12 percent. You usually get this as an error message in your executors. *Large Executor Configs*: They can

Re: Troubleshooting Spark OOM

2019-01-09 Thread Dillon Dukek
Hi William, Just to get started, can you describe the Spark version you are using and the language? It doesn't sound like you are using PySpark; however, problems arising from that can be different, so I just want to be sure. Also, can you talk through the scenario under which you are dealing wi

Troubleshooting Spark OOM

2019-01-09 Thread William Shen
Hi there, We've encountered Spark executor Java OOM issues for our Spark application. Any tips on how to troubleshoot to identify what objects are occupying the heap? In the past, dealing with JVM OOM, we've worked with analyzing heap dumps, but we are having a hard time with locating Spark heap d

Re: Is it possible to rate limit an UDP?

2019-01-09 Thread Sonal Goyal
Have you tried controlling the number of partitions of the dataframe? Say you have 5 partitions: that means you are making 5 concurrent calls to the web service. The throughput of the web service would be your bottleneck and Spark workers would be waiting for tasks, but if you can't control the REST s

Reading as Parquet a directory created by Spark Structured Streaming - problems

2019-01-09 Thread Phillip Henry
Hi, I write a stream of (String, String) tuples to HDFS partitioned by the first ("_1") member of the pair. Everything looks great when I list the directory via "hadoop fs -ls ...". However, when I try to read all the data as a single dataframe, I get unexpected results (see below). I notice th

P-values logistic regression

2019-01-09 Thread Simon Dirmeier
Dear all, when fitting a logistic regression model, for some data no p-values are computed. I cannot really tell under what circumstances this happens, though. Is there an explanation why and when this might be the case? Thank you, Simon