Re: python tests: any reason for a huge tests.py?

2018-09-12 Thread Imran Rashid
I've had some offline discussion around this, so I'd like to clarify.
SPARK-25344 may involve some non-trivial work, as it's a significant
refactoring.

But can we agree on an *immediate* first step: all new Python tests should
go into their own files?  Is there some reason not to do that right away?

I understand that in some cases you'll want to add a test case that really
is related to an existing test already in those giant files, and it makes
sense to keep them close.  It's fine to decide on a case-by-case basis
whether we should do the relevant refactoring for that bit at the same time
or just put it in the same file.  But we should still keep this *goal* in
mind, and put tests in their own files in the cases where they really are
independent.
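
To make the first step concrete, here's a minimal sketch of a standalone
test file (the file name, class, and test are hypothetical; real suites
would follow the existing pyspark test conventions):

    # test_foo.py -- hypothetical standalone module; run-tests.py schedules
    # test files in parallel, so independent tests in their own files finish
    # sooner than when appended to one giant tests.py
    import unittest

    from pyspark.sql import SparkSession


    class FooTests(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.master("local[2]").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_range_count(self):
            # trivial check against a local SparkSession
            self.assertEqual(self.spark.range(10).count(), 10)


    if __name__ == "__main__":
        unittest.main()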

That avoids making the problem worse until we get to SPARK-25344, and
furthermore it will allow work on SPARK-25344 to eventually proceed without
never-ending merge conflicts with other changes that are also adding new
tests.

On Wed, Sep 5, 2018 at 1:27 PM Imran Rashid  wrote:

> I filed https://issues.apache.org/jira/browse/SPARK-25344
>
> On Fri, Aug 24, 2018 at 11:57 AM Reynold Xin  wrote:
>
>> We should break it.
>>
>> On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid 
>> wrote:
>>
>>> Hi,
>>>
>>> Another question from looking more at Python recently.  Is there any
>>> reason we've got a ton of tests in one humongous tests.py file, rather than
>>> breaking it out into smaller files?
>>>
>>> Having one huge file doesn't seem great for code organization, and it
>>> also makes the test parallelization in run-tests.py not work as well.  On
>>> my laptop, tests.py takes 150s, and the next longest test file takes only
>>> 20s.
>>>
>>> Can we at least try to put new tests into smaller files?
>>>
>>> thanks,
>>> Imran
>>>
>>


Kubernetes Big-Data-SIG notes, September 12

2018-09-12 Thread Erik Erlandson
Spark
Work on pod template parameters is mostly done. The main remaining design
discussion is around how (or whether) to specify which container in the pod
template is the driver container.

HDFS
We had some discussions about the possibility of adding an HDFS operator,
in addition to the system of helm charts that Kimoon Kim has developed.
Yinan Li is looking into working on this.

Link to meeting minutes:
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA


Re: RE: [VOTE] SPARK 2.3.2 (RC5)

2018-09-12 Thread jonathan . schuff
2.3.2 includes crucial changes for Python 3.7 support as well.

Thanks,
Jon

On 2018/09/07 03:26:30, Sharanabasappa G Keriwaddi  wrote: 
> Hi –
> 
> Are there any blocking issues open for 2.3.2?
> 
> 2.3.1 had a few critical issues, so I feel it would be better to publish 2.3.2 
> with all those critical bug fixes.
> 
> 
> Thanks and Regards
> Sharan
> 
> 
> 
> From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
> Sent: 07 September 2018 08:30
> To: 441586683 <441586...@qq.com>
> Cc: dev 
> Subject: Re: [VOTE] SPARK 2.3.2 (RC5)
> 
> Hi,
> 
> PMC members asked me to hold on a bit while they're dealing with some other 
> things. Please wait a bit.
> 
> Thanks
> Saisai
> 
> 
> zzc <441586...@qq.com> wrote on Thursday, September 6, 2018, at 4:27 PM:
> Hi Saisai:
>   Spark 2.4 was cut; is there any new progress on 2.3.2?




[DISCUSS][CORE] Exposing application status metrics via a source

2018-09-12 Thread Stavros Kontopoulos
Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes
application status metrics (related jira: SPARK-25394).

So far, metrics tooling needs to scrape the metrics REST API to get metrics
like job delay, stages failed, stages completed, etc.
From a devops perspective it is good to standardize on a unified way of
gathering metrics.
The need came up on the K8s side, where the jmx prometheus exporter is
commonly used to scrape metrics for several components such as Kafka and
Cassandra, but the need is not limited to K8s.
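
To illustrate the status quo, here's a minimal sketch of scraping the status
REST API from Python (the driver address and field handling are assumptions;
the endpoint paths are the standard ones from the monitoring docs):

    # Hypothetical polling of the driver's status REST API -- the approach
    # metrics tooling has to take today, which this PR aims to complement
    # with a proper metrics source.
    import json
    from urllib.request import urlopen

    BASE = "http://localhost:4040/api/v1"  # driver UI address is a placeholder

    def fetch(path):
        with urlopen(BASE + path) as resp:
            return json.load(resp)

    app_id = fetch("/applications")[0]["id"]
    stages = fetch("/applications/%s/stages" % app_id)
    failed = sum(1 for s in stages if s["status"] == "FAILED")
    print("failed stages:", failed)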

Check the comment here:
"The rest api is great for UI and consolidated analytics, but monitoring
through it is not as straightforward as when the data emits directly from
the source like this. There is all kinds of nice context that we get when
the data from this spark node is collected directly from the node itself,
and not proxied through another collector / reporter. It is easier to build
a monitoring data model across the cluster when node, jmx, pod, resource
manifests, and spark data all align by virtue of coming from the same
collector. Building a similar view of the cluster just from the rest api,
as a comparison, is simply harder and quite challenging to do in general
purpose terms."

The PR is OK to be merged, but the major concern here is the mirroring of
the metrics. I think that mirroring is OK, since people may not want to
check the UI and just want to integrate with JMX only (my use case) and
gather metrics in Grafana (a common case out there).
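
For reference, wiring that up today is just the standard metrics config plus
the exporter: a line like
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink in
conf/metrics.properties enables Spark's JMX sink, and the jmx prometheus
exporter then runs as a java agent on the JVM to expose those MBeans for
Prometheus/Grafana to scrape. The new application status source would be
exposed through that same path.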

Do any of the committers or the community have an opinion on this?
Is there agreement about moving forward with this? Note that the addition
does not change much, and it can always be refactored if we come up with a
new plan for the metrics story in the future.

Thanks,
Stavros