RE: Enabling fully disaggregated shuffle on Spark

2019-11-27 Thread Prakhar Jain
Great work Ben. At Microsoft, we are also working on disaggregating shuffle from Spark. Please add me to the invite.

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Ah yes, right, I forgot about its existence. Thanks! I'm aware of some implementations for approximate calculations (I guess what we call an approximate median is an approximate percentile at 50%), but I didn't know about implementation details like support for cumulative updates. Given current source values of …
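For context, the approximate percentile being discussed is exposed directly in Spark SQL; a minimal sketch (the accuracy value here is illustrative, not from the thread):

    # Minimal sketch: Spark's built-in approximate percentile, where the
    # median is just the 50th percentile. The accuracy argument (10000 is
    # the default) trades memory per partial aggregate against error.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(1000000).selectExpr("id AS v")
    approx_median = df.select(F.expr("percentile_approx(v, 0.5, 10000)")).first()[0]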

Re: [DISCUSS] PostgreSQL dialect

2019-11-27 Thread Dongjoon Hyun
+1 Bests, Dongjoon. On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro wrote: > Yea, +1, that looks pretty reasonable to me. > > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it > from the codebase before it's too late. Currently we only have 3 features > under the PostgreSQL dialect …

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
Yep, that's clear. That's a reasonable case. There are already approximate median computations that can be done cumulatively, as you say, implemented in Spark. I think it's reasonable to consider this for performance, as it can be faster with just a small error tolerance. But yeah, up to you if you …

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Thanks all for providing inputs! Maybe I wasn't clear about my intention. The issue I focus on is this: there are plenty of metrics defined in a stage for SQL, and each metric has a value for each task; these are grouped later to calculate aggregated values (e.g. the metric for "elapsed time" is shown …
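To make the setting concrete, a toy illustration (not the actual SQLAppStatusListener code; the values are made up) of how per-task values become the aggregated metrics shown in the UI:

    # Toy illustration, not the actual SQLAppStatusListener code.
    task_values = [120, 95, 400, 88, 310]  # hypothetical per-task "elapsed time" (ms)
    total = sum(task_values)
    lo, hi = min(task_values), max(task_values)
    # The exact median forces us to keep and sort every task's value:
    med = sorted(task_values)[len(task_values) // 2]
    print(f"sum={total}, min={lo}, med={med}, max={hi}")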

Debug "Java gateway process exited before sending the driver its port number"

2019-11-27 Thread Li Jin
Dear Spark devs, I am debugging a weird "Java gateway process exited before sending the driver its port number" error when creating a SparkSession with pyspark. I am running the following simple code with pytest: " from pyspark.sql import SparkSession def test_spark(): spark = SparkSession.builder…
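The mail is cut off; a plausible reconstruction of the test (everything after SparkSession.builder is a guess, not from the source) would be:

    # Plausible reconstruction of the truncated test; the body after
    # SparkSession.builder is a guess, not from the original mail.
    from pyspark.sql import SparkSession

    def test_spark():
        spark = SparkSession.builder.master("local[*]").getOrCreate()
        assert spark.range(10).count() == 10
        spark.stop()

pyspark raises this error when the spark-submit child process exits before printing its port, so the usual suspects are environment problems (JAVA_HOME pointing at a missing or incompatible JDK, or a SPARK_HOME mismatch) rather than the test code itself.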

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
How big is the overhead, at scale? If it has a non-trivial effect for most jobs, I could imagine reusing the existing approximate quantile support to more efficiently find a pretty-close median. On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim wrote: > > Hi Spark devs, > > The change might be specific …
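The existing support referred to here is presumably DataFrame.approxQuantile, which is based on a variant of the Greenwald-Khanna algorithm; a minimal sketch:

    # Minimal sketch of the existing approximate-quantile support: a
    # relativeError of 0.01 bounds the rank error at 1% of the row count.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(1000000).selectExpr("id AS v")
    approx_median = df.approxQuantile("v", [0.5], 0.01)[0]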

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Mayur Rustagi
Another option could be to use a sketch to get an approximate median (extendable to quantiles as well). When tasks are few, the sketch would give an accurate value; for a larger number of tasks, the benefit will be significant. Regards, Mayur Rustagi Ph: +1 (650) 937 9673 http://www.sigmoid.com
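For illustration only, a toy mergeable summary in the spirit of this suggestion (a real implementation would use a proper quantile sketch such as t-digest or KLL; nothing below is from the thread):

    import random

    CAPACITY = 256  # bounded memory per task, regardless of task count

    def make_sketch(values):
        # Keep a bounded random sample of one task's metric values.
        return random.sample(values, min(len(values), CAPACITY))

    def merge(a, b):
        # Sketches merge associatively, so partial results can be
        # combined without ever holding all raw task values at once.
        combined = a + b
        return random.sample(combined, min(len(combined), CAPACITY))

    def approx_median(sketch):
        return sorted(sketch)[len(sketch) // 2]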

Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Hi Spark devs, The change might be specific to the SQLAppStatusListener, but given that it may change the metric values shown in the UI, I would like to hear some voices on this. When we aggregate the SQL metrics between tasks, we apply "sum", "min", "median", and "max", all of which are cumulative except …
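A short illustration of the cumulative/non-cumulative distinction being raised (the values are made up):

    # sum/min/max can be folded one task value at a time in O(1) state:
    acc = {"sum": 0, "min": float("inf"), "max": float("-inf")}
    for v in [120, 95, 400, 88, 310]:
        acc["sum"] += v
        acc["min"] = min(acc["min"], v)
        acc["max"] = max(acc["max"], v)
    # No comparable O(1) update exists for an exact median; it needs all
    # task values retained (or an approximate, mergeable summary instead).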