Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>
> Yeah, I tend to be positive about leveraging the Python type hints in
> general.
>
> However, just to clarify, I don’t think we should just port the type
> hints into the main code yet, but maybe think about
> having/porting Maciej's work, the pyi files, as stubs. For now, I tend to
> think adding type hints to the code makes it difficult to backport or
> revert and
>
That's probably a one-time overhead, so it is not a big issue. In my
opinion, a bigger one is the possible complexity. Annotations tend to
introduce a lot of cyclic dependencies in the Spark codebase. This can be
addressed, but it doesn't look great.

Merging stubs into the project structure, on the other hand, has almost no
overhead.
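
To make the difference concrete, the same (hypothetical, simplified)
signature can either be annotated inline in the module or described in a
sibling .pyi stub that type checkers pick up automatically. A minimal
sketch, not the actual pyspark source:

# functions.py ‒ inline variant: the implementation carries the annotations
from typing import Union

class Column: ...

def upper(col: Union[Column, str]) -> Column:
    return Column()

# functions.pyi ‒ stub variant: the .py file stays unannotated, and a stub
# shipped next to it contains only the signatures, e.g.
#
#   def upper(col: Union[Column, str]) -> Column: ...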

> more difficult to discuss typing on its own, especially considering that
> typing is arguably still premature.
>
> It is also interesting to take a look at other projects and how they
> did it. I took a look for the PySpark friends
> such as pandas or NumPy. Seems
>
>   * NumPy case had it as a separate project numpy-stubs and it was
> merged into the main project successfully as pyi files.
>   * pandas case, I don’t see the work being done yet. I found an issue
> related to this but it seems closed.
>
Actually there is quite a lot of ongoing work.
https://github.com/pandas-dev/pandas/issues/28142 is one ticket, but
individual work is handled separately (quite a few core modules already
have decent annotations). That being said, it seems unlikely that this
will be considered stable any time soon.

> Another important concern might be generic typing in Spark’s DataFrame
> as an example. Looks like that’s also one of the concerns on the pandas side.
> For instance, how would we support variadic generic typing, for
> example, DataFrame[int, str, str] or DataFrame[a: int, b: str, c: str]?
> Last time I checked, Python didn’t support this. Presumably at least
> Python 3.6 through 3.8 won't support it.
> I am experimentally trying this in another project that I am working
> on but it requires a bunch of hacks and doesn’t play well with MyPy.
>
It doesn't, but considering the structure of the API, I am not sure how
useful this would be in the first place. Additionally, generics are
somewhat limited anyway ‒ even in the best-case scenario you can re

In practice, the biggest advantage is actually support for completion,
not type checking (which works in simple cases).
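
For context, the kind of hack mentioned above usually boils down to
something like this (purely illustrative, not part of pyspark):

from typing import Any

class DataFrame:
    """Stand-in for a Spark DataFrame, for illustration only."""

    # Runtime-only hack (Python 3.7+): accept DataFrame[int, str, str] by
    # overriding __class_getitem__ and ignoring the parameters. mypy either
    # rejects the subscription or treats the result as a plain DataFrame,
    # so no per-column checking actually happens.
    def __class_getitem__(cls, item: Any) -> type:
        return cls

schema_hint = DataFrame[int, str, str]  # accepted at runtime, opaque to type checkers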

>  
> I currently don't have a strong feeling about it for now though I tend
> to agree.
> If we should do this, I would like to take a more conservative path
> such as having some separation
> for now e.g.) separate repo in Apache if feasible or separate module,
> and then see how it goes and users like it.
>
As said before ‒ I am happy to transfer ownership of the stubs to ASF if
there is a will to maintain these (either as a standalone or an inlined
variant).

However, I am strongly against adding random annotations in the codebase
over a prolonged time, as it is likely to break existing type hints (there
is limited support for merging, but it doesn't work well), with no
obvious replacement soon.

If merging or transferring ownership is not an option, more involvement
from the contributors would be more than enough to reduce maintenance
overhead and provide some opportunity for KT and such.

>
>
> On Wed, Jul 22, 2020 at 6:10 AM, Driesprong, Fokko
> wrote:
>
> Fully agree Holden, would be great to include the Outreachy
> project. Adding annotations is a very friendly way to get familiar
> with the codebase.
>
> I've also created a PR to see what's needed to get mypy
> in: https://github.com/apache/spark/pull/29180 From there on we
> can start adding annotations.
>
> Cheers, Fokko
>
>
> On Tue, 21 Jul 2020 at 21:40, Holden Karau
>  wrote:
>
> Yeah I think this could be a great project now that we're only
> Python 3.5+. One potential is making this an Outreachy project
> to get more folks from different backgrounds involved in Spark.
>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko
>  wrote:
>
> Since we've recently dropped support for Python <=3.5, I think it
> would be nice to add support for type annotations. Having
> this in the main repository allows us to do type checking
> using MyPy in the CI itself.
>
> This is now handled by stub
> files: https://www.python.org/dev/peps/pep-0484/#stub-files. However,
> I think it is nicer to integrate the types with the code
> itself to keep everything in sync, and to make it easier for
> the people who work on the codebase itself. A first step
> would be to move the stubs into the codebase, starting
> with the public API, which is the most
> important part.

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/21/20 9:40 PM, Holden Karau wrote:
> Yeah I think this could be a great project now that we're only Python
> 3.5+. One potential is making this an Outreachy project to get more
> folks from different backgrounds involved in Spark.

I am honestly not sure if that's really the case.

At the moment I maintain an almost complete set of annotations for the
project. These could be ported in a single step with relatively little
effort.

As for further maintenance ‒ this will have to be done along with the
codebase changes to keep things in sync, so if outreach means
low-hanging fruit, it is unlikely to serve this purpose.

Additionally, there are at least two considerations:

  * At some point (in general, when things are heavy in generics, which
is the case here), annotations become somewhat painful to write.
  * In the ideal case, API design has to be linked (to a reasonable extent)
with annotation design ‒ not every signature can be annotated in a
meaningful way, which is already a problem with some chunks of Spark
code.

>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko
>  wrote:
>
> Since we've recently dropped support for Python <=3.5, I think it would be
> nice to add support for type annotations. Having this in the main
> repository allows us to do type checking using MyPy
> in the CI itself.
>
> This is now handled by stub
> files: https://www.python.org/dev/peps/pep-0484/#stub-files. However,
> I think it is nicer to integrate the types with the code itself to
> keep everything in sync, and to make it easier for the people who
> work on the codebase itself. A first step would be to move the
> stubs into the codebase, starting with the public
> API, which is the most important part. Having the types with the
> code itself makes it much easier to understand. For example, if
> you can supply a str or Column
> here: 
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PRs on Python should
> cover annotations on the public APIs. Curious what the rest of
> the community thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> On Tue, 21 Jul 2020 at 20:04, zero323
>  wrote:
>
> Given a discussion related to the SPARK-32320 PR, I'd like to
> resurrect this
> thread. Is there any interest in migrating annotations to the main
> repository?
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
>
>
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the code makes it
> difficult to backport or revert and
> more difficult to discuss typing on its own, especially considering that
> typing is arguably still premature.

About being premature ‒ since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and supporting new features (for example, Literals proved to be very
useful) without waiting for the next Spark release.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC






Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
That's probably a one-time overhead, so it is not a big issue. In my opinion,
a bigger one is the possible complexity. Annotations tend to introduce a lot of
cyclic dependencies in the Spark codebase. This can be addressed, but it
doesn't look great.


This is not true (anymore). With Python 3.6 you can use string annotations,
e.g. -> 'DenseVector', and from Python 3.7 onward this is fixed by
postponed evaluation: https://www.python.org/dev/peps/pep-0563/
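
A minimal sketch of the string-annotation form (the methods are made up, not
the actual pyspark.ml API):

class SparseVector:
    def to_dense(self) -> "DenseVector":  # forward reference as a string (3.6+)
        return DenseVector()

class DenseVector:
    def to_sparse(self) -> "SparseVector":
        return SparseVector()

# With PEP 563 (`from __future__ import annotations` at the top of the module,
# opt-in since Python 3.7), the quotes can be dropped because annotations are
# no longer evaluated at definition time.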

Merging stubs into the project structure, on the other hand, has almost no
overhead.


This feels awkward to me; it is like having the docstring in a separate
file. In my opinion you want to have the signatures and the functions
together for transparency and maintainability.

I think DBT is a very nice project where they use annotations very well:
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstrings, since they are available in
the annotations themselves.
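
A minimal sketch of that style ‒ a hypothetical helper, not code from DBT or
Spark ‒ where the types live only in the signature:

from typing import Optional

def get_node(queue: dict, node_id: str, default: Optional[dict] = None) -> Optional[dict]:
    """Return the queued node for node_id, or default when it is not queued.

    The docstring no longer repeats the parameter types; they live in the
    signature and are picked up by mypy and IDEs.
    """
    return queue.get(node_id, default)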

In practice, the biggest advantage is actually support for completion, not
type checking (which works in simple cases).


Agreed.

Would you be interested in writing up the Outreachy proposal for work on
this?


I would be, and also happy to mentor. But I think we first need to agree
as a Spark community if we want to add the annotations to the code, and to
what extent.

At some point (in general when things are heavy in generics, which is the
case here), annotations become somewhat painful to write.


That's true, but that might also be a pointer that it is time to refactor
the function/code :)

For now, I tend to think adding type hints to the codes make it difficult
to backport or revert and more difficult to discuss about typing only
especially considering typing is arguably premature yet.


This feels a bit weird to me, since you want to keep this in sync, right? Do
you provide different stubs for different versions of Python? I had to look
up the literals: https://www.python.org/dev/peps/pep-0586/
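
For reference, a Literal annotation constrains an argument to specific
values; a sketch with a made-up signature:

from typing import Literal  # Python 3.8+; typing_extensions.Literal on 3.6/3.7

def save(path: str, mode: Literal["append", "overwrite", "error"]) -> None:
    ...

save("/tmp/out", "overwrite")  # OK
save("/tmp/out", "replace")    # flagged by mypy: not one of the allowed literals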

Cheers, Fokko

On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz <
mszymkiew...@gmail.com> wrote:

>
> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> > For now, I tend to think adding type hints to the codes make it
> > difficult to backport or revert and
> > more difficult to discuss about typing only especially considering
> > typing is arguably premature yet.
>
> About being premature ‒ since typing ecosystem evolves much faster than
> Spark it might be preferable to keep annotations as a separate project
> (preferably under the ASF / Spark umbrella). It allows for faster iterations
> and supporting new features (for example Literals proved to be very
> useful), without waiting for the next Spark release.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
>


[DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Kun H .

Hi Spark developers,

My team has an internal storage format. It already has an implementation of
data source V2.

Now we want to add catalog support for it. I expect each partition to be
stored in this format, with the Spark catalog managing partition columns,
just like when using ORC and Parquet.

After checking the logic of DataSource.resolveRelation, I wonder if introducing
another FileFormat for my storage spec is the only way to support catalog-managed
partitions. Could any expert help confirm?

Another question is about the following comment: "now catalog for data source V2
is under development". Does anyone know the progress or design of this feature?


lazy val providingClass: Class[_] = {
  val cls = DataSource.lookupDataSource(className, sparkSession.sessionState.conf)
  // `providingClass` is used for resolving data source relation for catalog tables.
  // As now catalog for data source V2 is under development, here we fall back all the
  // [[FileDataSourceV2]] to [[FileFormat]] to guarantee the current catalog works.
  // [[FileDataSourceV2]] will still be used if we call the load()/save() method in
  // [[DataFrameReader]]/[[DataFrameWriter]], since they use method `lookupDataSource`
  // instead of `providingClass`.
  cls.newInstance() match {
    case f: FileDataSourceV2 => f.fallbackFileFormat
    case _ => cls
  }
}

Thanks,
Kun


Re: [DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Russell Spitzer
There is now a full catalog API you can implement which should give you the
control you are looking for. It is in Spark 3.0 and here is an example
implementation for supporting Cassandra.

https://github.com/datastax/spark-cassandra-connector/blob/master/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraCatalog.scala

I would definitely recommend using this API rather than messing with
Catalyst directly.
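
For reference, such a catalog implementation is plugged in through
configuration; a minimal PySpark sketch (the class name is hypothetical):

from pyspark.sql import SparkSession

# Register a custom V2 catalog under the name "mycatalog" (Spark 3.0+).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.mycatalog", "com.example.MyStorageCatalog")
    .getOrCreate()
)

# Namespaces and tables under this name are then served by the custom catalog.
spark.sql("SHOW NAMESPACES IN mycatalog").show()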

On Wed, Jul 22, 2020, 7:58 AM Kun H.  wrote:

>
> Hi Spark developers,
>
> My team has an internal storage format. It already has an implementation of
> data source V2.
>
> Now we want to add catalog support for it. I expect each partition to be
> stored in this format, with the Spark catalog managing partition columns,
> just like when using ORC and Parquet.
>
> After checking the logic of DataSource.resolveRelation, I wonder if
> introducing another FileFormat for my storage spec is the only way to
> support catalog-managed partitions. Could any expert help confirm?
>
> Another question is about the following comment: "*now catalog for data source
> V2 is under development*". Does anyone know the progress or design of this feature?
>
> lazy val providingClass: Class[_] = {
>   val cls = DataSource.lookupDataSource(className, sparkSession.sessionState.conf)
>   // `providingClass` is used for resolving data source relation for catalog tables.
>   // *As now catalog for data source V2 is under development*, here we fall back all the
>   // [[FileDataSourceV2]] to [[FileFormat]] to guarantee the current catalog works.
>   // [[FileDataSourceV2]] will still be used if we call the load()/save() method in
>   // [[DataFrameReader]]/[[DataFrameWriter]], since they use method `lookupDataSource`
>   // instead of `providingClass`.
>   cls.newInstance() match {
>     case f: FileDataSourceV2 => f.fallbackFileFormat
>     case _ => cls
>   }
> }
>
>
> Thanks,
> Kun
>


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On Wednesday, 22 July 2020, Driesprong, Fokko
wrote:

> That's probably a one-time overhead, so it is not a big issue. In my
> opinion, a bigger one is the possible complexity. Annotations tend to introduce
> a lot of cyclic dependencies in the Spark codebase. This can be addressed, but
> it doesn't look great.
>
>
> This is not true (anymore). With Python 3.6 you can add string annotations
> -> 'DenseVector', and in the future with Python 3.7 this is fixed by having
> postponed evaluation: https://www.python.org/dev/peps/pep-0563/
>

As far as I recall, the linked PEP addresses forward references, not cyclic
dependencies, which weren't a big issue in the first place.

What I mean is actually cyclic stuff ‒ for example, pyspark.context
depends on pyspark.rdd and the other way around. These dependencies are not
explicit at the moment.
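
One common way to make such a dependency explicit for the type checker only
‒ a sketch, not the current pyspark code ‒ is the TYPE_CHECKING guard:

# pyspark/context.py (sketch)
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported only during type checking, so no cycle at runtime
    from pyspark.rdd import RDD

class SparkContext:
    def range(self, end: int) -> "RDD[int]":  # string annotation, resolved by mypy
        ...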



> Merging stubs into the project structure, on the other hand, has almost no
> overhead.
>
>
> This feels awkward to me, this is like having the docstring in a separate
> file. In my opinion you want to have the signatures and the functions
> together for transparency and maintainability.
>
>
I guess that's a matter of preference. From a maintainability perspective
it is actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking, which don't really exist in the real code (protocols,
aliases, code-generated signatures for complex overloads), as well as
some monkey-patched entities.

Additionally, it is often easier to see inconsistencies when typing is
separate.

However, I am not implying that this should be a persistent state.

In general I see two non-breaking paths here:

 - Merge pyspark-stubs as a separate subproject within the main Spark repo,
keep it in sync there with a common CI pipeline, and transfer ownership of
the PyPI package to ASF.
 - Move the stubs directly into python/pyspark and then apply individual stubs
to the modules of choice.

Of course, the first proposal could be an initial step toward the latter one.


>
> I think DBT is a very nice project where they use annotations very well:
> https://github.com/fishtown-analytics/dbt/blob/dev/marian-
> anderson/core/dbt/graph/queue.py
>
> Also, they left out the types in the docstring, since they are available
> in the annotations itself.
>
>

> In practice, the biggest advantage is actually support for completion, not
> type checking (which works in simple cases).
>
>
> Agreed.
>
> Would you be interested in writing up the Outreachy proposal for work on
> this?
>
>
> I would be, and also happy to mentor. But, I think we first need to agree
> as a Spark community if we want to add the annotations to the code, and in
> which extend.
>





> At some point (in general when things are heavy in generics, which is the
> case here), annotations become somewhat painful to write.
>
>
> That's true, but that might also be a pointer that it is time to refactor
> the function/code :)
>

That might be the case, but it is more often a matter of capturing useful
properties, combined with the requirement to keep things in sync with Scala
counterparts.



> For now, I tend to think adding type hints to the codes make it difficult
> to backport or revert and more difficult to discuss about typing only
> especially considering typing is arguably premature yet.
>
>
> This feels a bit weird to me, since you want to keep this in sync right?
> Do you provide different stubs for different versions of Python? I had to
> look up the literals: https://www.python.org/dev/peps/pep-0586/
>

I think it is more about portability between Spark versions.

>
>
> Cheers, Fokko
>

> Op wo 22 jul. 2020 om 09:40 schreef Maciej Szymkiewicz <
> mszymkiew...@gmail.com>:
>
>>
>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>> > For now, I tend to think adding type hints to the codes make it
>> > difficult to backport or revert and
>> > more difficult to discuss about typing only especially considering
>> > typing is arguably premature yet.
>>
>> About being premature ‒ since typing ecosystem evolves much faster than
>> Spark it might be preferable to keep annotations as a separate project
>> (preferably under the ASF / Spark umbrella). It allows for faster iterations
>> and supporting new features (for example Literals proved to be very
>> useful), without waiting for the next Spark release.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>>
>>

-- 

Best regards,
Maciej Szymkiewicz


Re: [DISCUSS] Amend the commiter guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Imran Rashid
Hi Holden,

thanks for leading this discussion, I'm in favor in general.  I have one
specific question -- these two sections seem to contradict each other
slightly:

> If there is a -1 from a non-committer, multiple committers or the PMC
should be consulted before moving forward.
>
>If the original person who cast the veto can not be reached in a
reasonable time frame given likely holidays, it is up to the PMC to decide
the next steps within the guidelines of the ASF. This must be decided by a
consensus vote under the ASF voting rules.

I think the intent here is that if a *committer* gives a -1, then the PMC
has to have a consensus vote?  And if a non-committer gives a -1, then
multiple committers should be consulted?  How about combining those two
into something like

"All -1s with justification merit discussion.  A -1 from a non-committer
can be overridden only with input from multiple committers.  A -1 from a
committer requires a consensus vote of the PMC under ASF voting rules".


thanks,
Imran


On Tue, Jul 21, 2020 at 3:41 PM Holden Karau  wrote:

> Hi Spark Developers,
>
> There has been a rather active discussion regarding the specific vetoes
> that occurred during Spark 3. From that I believe we are now mostly in
> agreement that it would be best to clarify our rules around code vetoes &
> merging in general. Personally I believe this change is important to help
> improve the appearance of a level playing field in the project.
>
> Once discussion settles I'll run this by a copy editor, my grammar isn't
> amazing, and bring forward for a vote.
>
> The current Spark committer guide is at
> https://spark.apache.org/committers.html. I am proposing we add a section
> on when it is OK to merge PRs directly above the section on how to merge
> PRs. The text I am proposing to amend our committer guidelines with is:
>
> PRs shall not be merged during active on topic discussion except for
> issues like critical security fixes of a public vulnerability. Under
> extenuating circumstances PRs may be merged during active off topic
> discussion and the discussion directed to a more appropriate venue. Time
> should be given prior to merging for those involved with the conversation
> to explain if they believe they are on topic.
>
> Lazy consensus requires giving time for discussion to settle, while
> understanding that people may not be working on Spark as their full time
> job and may take holidays. It is believed that by doing this we can limit
> how often people feel the need to exercise their veto.
>
> For the purposes of a -1 on code changes, a qualified voter includes all
> PMC members and committers in the project. For a -1 to be a valid veto it
> must include a technical reason. The reason can include things like the
> change may introduce a maintenance burden or is not the direction of Spark.
>
> If there is a -1 from a non-committer, multiple committers or the PMC
> should be consulted before moving forward.
>
>
> If the original person who cast the veto can not be reached in a
> reasonable time frame given likely holidays, it is up to the PMC to decide
> the next steps within the guidelines of the ASF. This must be decided by a
> consensus vote under the ASF voting rules.
>
> These policies serve to reiterate the core principle that code must not be
> merged with a pending veto or before a consensus has been reached (lazy or
> otherwise).
>
> It is the PMC’s hope that vetoes continue to be infrequent, and when they
> occur all parties take the time to build consensus prior to additional
> feature work.
>
>
> Being a committer means exercising your judgement, while working in a
> community with diverse views. There is nothing wrong in getting a second
> (or 3rd or 4th) opinion when you are uncertain. Thank you for your
> dedication to the Spark project, it is appreciated by the developers and
> users of Spark.
>
>
> It is hoped that these guidelines do not slow down development; rather, by
> removing some of the uncertainty, they make it easier for us to reach
> consensus. If you have ideas on how to improve these guidelines, or other
> parts of how the Spark project operates you should reach out on the dev@
> list to start the discussion.
>
>
>
> Kind Regards,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau  wrote:

> Hi Folks,
>
> In Spark SQL there is the ability to have Spark do it's partition
> discovery/file listing in parallel on the worker nodes and also avoid
> locality lookups. I'd like to expose this in core, but given the Hadoop
> APIs it's a bit more complicated to do right. I
>

That's ultimately fixable, if we can sort out what's good from the app side
and reconcile that with "what is not pathologically bad across both HDFS
and object stores".

Bad: globStatus, anything which returns an array rather than a remote
iterator, anything that encourages a treewalk.
Good: deep recursive listings; remote-iterator results for incremental/async
fetch of the next page of a listing; soon: the option for the iterator,
if cast to IOStatisticsSource, to actually serve up stats on IO performance
during the listing (e.g. number of list calls, mean time to get a list
response back, store throttle events).

Also look at LocatedFileStatus to see how it parallelises its work. It's not
perfect because wildcards are supported, which means globStatus gets used.

happy to talk about this some more, and I'll review the patch

-steve


> made a quick POC and two potential different paths we could do for
> implementation and wanted to see if anyone had thoughts -
> https://github.com/apache/spark/pull/29179.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Holden Karau
Wonderful. To be clear, the patch is more to start the discussion about how
we want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran  wrote:

>
>
> On Wed, 22 Jul 2020 at 00:51, Holden Karau  wrote:
>
>> Hi Folks,
>>
>> In Spark SQL there is the ability to have Spark do it's partition
>> discovery/file listing in parallel on the worker nodes and also avoid
>> locality lookups. I'd like to expose this in core, but given the Hadoop
>> APIs it's a bit more complicated to do right. I
>>
>
> That's ultimately fixable, if we can sort out what's good from the app
> side and reconcile that with 'what is not pathologically bad across both
> HDFS and object stores".
>
> Bad: globStatus, anything which returns an array rather than a remote
> iterator, encourages treewalk
> Good: deep recursive listings, remote iterator results for:
> incremental/async fetch of next page of listing, soon: option for iterator,
> if cast to IOStatisticsSource, actually serve up stats on IO performance
> during the listing. (e.g. #of list calls, mean time to get a list
> response back., store throttle events)
>
> Also look at LocatedFileStatus to see how it parallelises its work. its
> not perfect because wildcards are supported, which means globStatus gets
> used
>
> happy to talk about this some more, and I'll review the patch
>
> -steve
>
>
>> made a quick POC and two potential different paths we could do for
>> implementation and wanted to see if anyone had thoughts -
>> https://github.com/apache/spark/pull/29179.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Amend the commiter guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Holden Karau
On Wed, Jul 22, 2020 at 7:39 AM Imran Rashid < iras...@apache.org > wrote:

> Hi Holden,
>
> thanks for leading this discussion, I'm in favor in general.  I have one
> specific question -- these two sections seem to contradict each other
> slightly:
>
> > If there is a -1 from a non-committer, multiple committers or the PMC
> should be consulted before moving forward.
> >
> >If the original person who cast the veto can not be reached in a
> reasonable time frame given likely holidays, it is up to the PMC to decide
> the next steps within the guidelines of the ASF. This must be decided by a
> consensus vote under the ASF voting rules.
>
> I think the intent here is that if a *committer* gives a -1, then the PMC
> has to have a consensus vote?  And if a non-committer gives a -1, then
> multiple committers should be consulted?  How about combining those two
> into something like
>
> "All -1s with justification merit discussion.  A -1 from a non-committer
> can be overridden only with input from multiple committers.  A -1 from a
> committer requires a consensus vote of the PMC under ASF voting rules".
>
I can work with that although it wasn’t quite what I was originally going
for. I didn’t intend to have committer -1s be eligible for override. I
believe committers have demonstrated sufficient merit; they are the same as
PMC member -1s in our project.

My aim was just if something weird happens (like say I had a pending -1
before my motorcycle crash last year) we go to the PMC and take a binding
vote on what to do, and most likely someone on the PMC will reach out to
the ASF for understanding around the guidelines.

What about:

All -1s with justification merit discussion.  A -1 from a non-committer can
be overridden only with input from multiple committers and suitable time
for any committer to raise concerns.  A -1 from a committer who can not be
reached requires a consensus vote of the PMC under ASF voting rules to
determine the next steps within the ASF guidelines for vetos.

>
>
> thanks,
> Imran
>
>
> On Tue, Jul 21, 2020 at 3:41 PM Holden Karau  wrote:
>
>> Hi Spark Developers,
>>
>> There has been a rather active discussion regarding the specific vetoes
>> that occurred during Spark 3. From that I believe we are now mostly in
>> agreement that it would be best to clarify our rules around code vetoes &
>> merging in general. Personally I believe this change is important to help
>> improve the appearance of a level playing field in the project.
>>
>> Once discussion settles I'll run this by a copy editor, my grammar isn't
>> amazing, and bring forward for a vote.
>>
>> The current Spark committer guide is at
>> https://spark.apache.org/committers.html. I am proposing we add a section on when it is OK to
>> merge PRs directly above the section on how to merge PRs. The text I am
>> proposing to amend our committer guidelines with is:
>>
>> PRs shall not be merged during active on topic discussion except for
>> issues like critical security fixes of a public vulnerability. Under
>> extenuating circumstances PRs may be merged during active off topic
>> discussion and the discussion directed to a more appropriate venue. Time
>> should be given prior to merging for those involved with the conversation
>> to explain if they believe they are on topic.
>>
>> Lazy consensus requires giving time for discussion to settle, while
>> understanding that people may not be working on Spark as their full time
>> job and may take holidays. It is believed that by doing this we can limit
>> how often people feel the need to exercise their veto.
>>
>> For the purposes of a -1 on code changes, a qualified voter includes all
>> PMC members and committers in the project. For a -1 to be a valid veto it
>> must include a technical reason. The reason can include things like the
>> change may introduce a maintenance burden or is not the direction of Spark.
>>
>> If there is a -1 from a non-committer, multiple committers or the PMC
>> should be consulted before moving forward.
>>
>>
>> If the original person who cast the veto can not be reached in a
>> reasonable time frame given likely holidays, it is up to the PMC to decide
>> the next steps within the guidelines of the ASF. This must be decided by a
>> consensus vote under the ASF voting rules.
>>
>> These policies serve to reiterate the core principle that code must not
>> be merged with a pending veto or before a consensus has been reached (lazy
>> or otherwise).
>>
>> It is the PMC’s hope that vetoes continue to be infrequent, and when they
>> occur all parties take the time to build consensus prior to additional
>> feature work.
>>
>>
>> Being a committer means exercising your judgement, while working in a
>> community with diverse views. There is nothing wrong in getting a second
>> (or 3rd or 4th) opinion when you are uncertain. Thank you for your
>> dedication to the Spark project, it is appreciated by the developers and
>> users of Spark.
>>
>>
>> It is hoped that these guid

[DISCUSS] [Spark confs] Making spark.jars conf take precedence over spark default classpath

2020-07-22 Thread nupurshukla
Hello, 

I am prototyping a change in the behavior of spark.jars conf for my
use-case.  spark.jars conf is used to specify a list of jars to include on
the driver and executor classpaths. 

*Current behavior:*  spark.jars conf value is not read until after the JVM
has already started and the system classloader has already loaded, and hence
the jars added using this conf get “appended” to the spark classpath. This
means that spark looks for the jar in its default classpath first and then
looks at the path specified in spark.jars conf. 

*Proposed prototype:* I am proposing a new behavior where spark.jars takes
precedence over the spark default classpath in terms of how jars are
discovered. This can be achieved by using the
spark.{driver,executor}.extraClassPath conf. This conf modifies the actual
launch command of the driver (or executors), and hence this path is
"prepended" to the classpath and thus takes precedence over the default
classpath. Could the behavior of the spark.jars conf be modified by adding
the conf value of spark.jars to the conf value of
spark.{driver,executor}.extraClassPath during argument parsing in
SparkSubmitArguments.scala, so that we achieve the precedence order: jars
specified in spark.jars > spark.{driver,executor}.extraClassPath > spark
default classpath (left-to-right precedence order)?

*Pseudo sample code:*
In loadEnvironmentArguments():
if (jars != null) {
  if (driverExtraClassPath != null) {
    driverExtraClassPath = driverExtraClassPath + "," + jars
  } else {
    driverExtraClassPath = jars
  }
}


*As an example*, consider the following jars:
- sample-jar-1.0.0.jar, present in spark’s default classpath
- sample-jar-2.0.0.jar, present on all nodes of the cluster at path //
- new-jar-1.0.0.jar, present on all nodes of the cluster at path //
  (and not in spark’s default classpath)

And consider two scenarios where spark jobs are submitted with the following
spark.jars conf values:


 


What are your thoughts on this? Could this have any undesired side-effects?
Or has this already been explored and there are some known issues with this
approach?

Thanks,
Nupur



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Felix Cheung
+1


From: Holden Karau 
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran 
Cc: dev 
Subject: Re: Exposing Spark parallelized directory listing & non-locality 
listing in core

Wonderful. To be clear the patch is more to start the discussion about how we 
want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran
 wrote:


On Wed, 22 Jul 2020 at 00:51, Holden Karau
 wrote:
Hi Folks,

In Spark SQL there is the ability to have Spark do it's partition 
discovery/file listing in parallel on the worker nodes and also avoid locality 
lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit 
more complicated to do right. I

That's ultimately fixable, if we can sort out what's good from the app side and 
reconcile that with 'what is not pathologically bad across both HDFS and object 
stores".

Bad: globStatus, anything which returns an array rather than a remote iterator, 
encourages treewalk
Good: deep recursive listings, remote iterator results for: incremental/async 
fetch of next page of listing, soon: option for iterator, if cast to 
IOStatisticsSource, actually serve up stats on IO performance during the 
listing. (e.g. #of list calls, mean time to get a list response back., store 
throttle events)

Also look at LocatedFileStatus to see how it parallelises its work. its not 
perfect because wildcards are supported, which means globStatus gets used

happy to talk about this some more, and I'll review the patch

-steve

made a quick POC and two potential different paths we could do for 
implementation and wanted to see if anyone had thoughts - 
https://github.com/apache/spark/pull/29179.

Cheers,

Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau