Re: Facing memory leak with Pyarrow enabled and toPandas()

2021-01-22 Thread Gourav Sengupta
Hi, can you please mention the Spark version, share the code that sets up the Spark session, and describe the operation you are referring to? It would also be good to know how much memory your system has and how many executors you are using per system. In general I have faced issues when doing g…
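For context, the session setup being asked for typically looks like the sketch below: a minimal Arrow-enabled PySpark session. This assumes Spark 3.x (on Spark 2.x the key was `spark.sql.execution.arrow.enabled`); the app name and batch-size value are illustrative, not taken from the original poster's job.

```python
# Minimal sketch (assumption): PySpark session with Arrow-backed toPandas().
# Config keys shown are the Spark 3.x names.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("arrow-topandas-repro")  # illustrative name
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Cap the Arrow batch size so one batch cannot exhaust driver memory
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
    .getOrCreate()
)

df = spark.range(1_000_000)
pdf = df.toPandas()  # uses the Arrow path when enabled
```

Sharing a snippet like this (plus driver/executor memory settings) makes memory-leak reports much easier to reproduce.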

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Jacek Laskowski
Hi Filip, care to share the code behind "The only thing I found so far involves using forEachBatch and manually updating my aggregates"? I'm not completely sure I understand your use case and hope the code could shed more light on it. Thank you. Regards, Jacek Laskowski https://about.m…

Using same rdd from two threads

2021-01-22 Thread jelmer
Hi, I have a piece of code in which an RDD is created from a main method. It then does work on this RDD from two different threads running in parallel. When running this code as part of a test with a local master, it will sometimes make Spark hang (one task will never complete). If I make a copy…

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Filip
Hi, I don't have any code for the forEachBatch approach; I mentioned it because of this response to my question on SO: https://stackoverflow.com/a/65803718/1017130 I have added some very simple code below that I think shows what I'm trying to do: val schema = StructType(Array(StructFiel…
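For readers following the thread, the foreachBatch pattern referenced in the SO answer can be sketched as below in PySpark: compute several aggregates over the same micro-batch inside one streaming query. All paths, the schema, and the choice of aggregates are illustrative assumptions, not the original poster's actual job.

```python
# Sketch (assumption): many aggregations over one input stream via foreachBatch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-agg-sketch").getOrCreate()

stream = (
    spark.readStream
    .schema("device STRING, value DOUBLE, ts TIMESTAMP")  # DDL schema string
    .json("/data/in")  # illustrative path
)

def write_aggregates(batch_df, batch_id):
    # Cache the micro-batch once so each aggregate reuses the same scan.
    batch_df.persist()
    batch_df.groupBy("device").count() \
        .write.mode("append").parquet("/data/out/counts")
    batch_df.groupBy("device").agg(F.avg("value").alias("avg_value")) \
        .write.mode("append").parquet("/data/out/averages")
    batch_df.unpersist()

query = (
    stream.writeStream
    .foreachBatch(write_aggregates)
    .option("checkpointLocation", "/data/chk")
    .start()
)
```

The design trade-off is that foreachBatch gives batch-style write APIs and a single pass over the input, at the cost of managing output modes and any cross-batch state yourself.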

Re: Using same rdd from two threads

2021-01-22 Thread Sean Owen
RDDs are immutable, and Spark itself is thread-safe. This should be fine; something else is going on in your code. On Fri, Jan 22, 2021 at 7:59 AM jelmer wrote: > …
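Sean's point, that concurrent actions over one immutable dataset are safe, can be illustrated without a Spark cluster using plain threads over a shared immutable sequence (a stand-in for the RDD; the names and "actions" are illustrative):

```python
# Sketch: two threads running independent "actions" (count and sum) over the
# same shared, immutable data, analogous to two Spark jobs sharing one RDD.
from concurrent.futures import ThreadPoolExecutor

shared_data = tuple(range(1000))  # immutable, like an RDD

def count_action(data):
    return len(data)

def sum_action(data):
    return sum(data)

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count_action, shared_data)
    f2 = pool.submit(sum_action, shared_data)
    results = (f1.result(), f2.result())

print(results)  # (1000, 499500)
```

Because neither thread mutates the shared data, no locking is needed; a hang like the one described usually points at something else, e.g. a scheduler or resource configuration issue, not RDD sharing itself.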

K8S spark-submit Loses Successful Driver Completion

2021-01-22 Thread Marshall Markham
Hi, I am running Airflow + Spark + AKS (Azure Kubernetes Service). Sporadically, when a Spark job completes, my spark-submit process does not notice that the driver has succeeded and continues to track the job as running. Does anyone know how the spark-submit process monitors driver processes on k8s? My exp…
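When debugging a case like this, it can help to compare what Kubernetes itself reports for the driver pod against what spark-submit believes. A hedged sketch (the namespace is illustrative; the `spark-role=driver` label is the one Spark's Kubernetes scheduler applies to driver pods):

```shell
# Sketch (assumption): list driver pods and their phases as Kubernetes sees
# them, to compare against spark-submit's view of the job.
kubectl get pods -n spark-jobs -l spark-role=driver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```

If the pod phase is `Succeeded` while spark-submit still reports the job as running, the watch connection between spark-submit and the Kubernetes API may have been dropped without being re-established.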