Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Reynold Xin
Late +1

On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh < vii...@gmail.com > wrote:

> 
> 
> 
> Hi devs,
> 
> 
> 
> Thanks for all the input. Overall, the feedback from the Spark community is
> positive about having a RocksDB state store as an external module, so let's
> go forward in this direction to improve Structured Streaming.
> I will keep the JIRA (SPARK-34198) updated.
> 
> 
> 
> Thanks all again for the inputs and discussion.
> 
> 
> 
> Liang-Chi Hsieh
> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> 
> 
> 
> 
> 
>
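
For reference, a minimal sketch of enabling the RocksDB state store provider that later shipped built in with Spark 3.2+ (the provider class name below is the one in the Spark distribution; the streaming query itself is only illustrative):

```
# Point stateful streaming queries at the RocksDB-backed state store
# (assumes Spark 3.2+, where this provider class ships with Spark).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# Any stateful query (e.g., a streaming aggregation) now keeps its state in
# RocksDB instead of the default in-memory, HDFS-backed store.
counts = spark.readStream.format("rate").load().groupBy("value").count()
```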



Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Reynold Xin
Enrico - do feel free to reopen the PRs or email people directly, unless you 
are told otherwise.

On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> On Thu, Feb 18, 2021 at 10:34 AM Sean Owen < sro...@gmail.com > wrote:
> 
> 
>> There is no way to force people to review or commit something of course.
>> And keep in mind we get a lot of, shall we say, unuseful pull requests.
>> There is occasionally some blowback to closing someone's PR, so the path
>> of least resistance is often the timeout / 'soft close'. That is, it takes
>> a lot more time to satisfactorily debate down the majority of PRs that
>> probably shouldn't get merged, and there just isn't that much bandwidth.
>> That said of course it's bad if lots of good PRs are getting lost in the
>> shuffle and I am sure there are some.
>> 
>> 
>> One other aspect is that a committer is taking some degree of
>> responsibility for merging a change, so the ask is more than just a few
>> minutes of eyeballing. If it breaks something, the merger pretty much owns
>> resolving it, and the whole project owns any consequences of the change
>> going forward.
>> 
> 
> 
> 
> +1
>



Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-24 Thread Reynold Xin
+1 Correctness issues are serious!

On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan < mri...@gmail.com > 
wrote:

> 
> That is indeed cause for concern.
> +1 on extending the voting deadline until we finish investigation of this.
> 
> 
> 
> 
> Regards,
> Mridul
> 
> 
> 
> On Wed, Feb 24, 2021 at 12:55 PM Xiao Li < gatorsm...@gmail.com > wrote:
> 
> 
>> -1 Could we extend the voting deadline?
>> 
>> 
>> A few TPC-DS queries (q17, q18, q39a, q39b) are returning different
>> results between Spark 3.0 and Spark 3.1. We need a few more days to
>> understand whether these changes are expected.
>> 
>> 
>> Xiao
>> 
>> 
>> Mridul Muralidharan < mri...@gmail.com > wrote on Wednesday, February 24, 2021 at 10:41 AM:
>> 
>> 
>>> 
>>> 
>>> Sounds good, thanks for clarifying Hyukjin !
>>> +1 on release.
>>> 
>>> 
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> 
>>> 
>>> On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
>>> 
 
 
 I remember HiveExternalCatalogVersionsSuite was flaky for a while, which was
 fixed in
 https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
 
 and KafkaDelegationTokenSuite is flaky
 ( https://issues.apache.org/jira/browse/SPARK-31250 ).
 
 
 
 On Wed, Feb 24, 2021 at 5:19 PM, Mridul Muralidharan < mri...@gmail.com > wrote:
 
 
> 
> 
> Signatures, digests, etc check out fine.
> 
> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
> -Phive-thriftserver -Pmesos -Pkubernetes
> 
> 
> I keep getting test failures with
> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> (Note: I remove the $HOME/.m2 and $HOME/.ivy2 paths before the build.)
> 
> 
> 
> Removing these suites gets the build through, though - does anyone have
> suggestions on how to fix this? I did not face this with RC1.
> 
> 
> 
> Regards,
> Mridul
> 
> 
> 
> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon < gurwls...@gmail.com >
> wrote:
> 
> 
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>> 
>> 
>> The vote is open until February 24th 11PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> 
>> 
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>> 
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> 
>> The tag to be voted on is v3.1.1-rc3 (commit
>> 1d550c4e90275ab418b9161925049239227f3dc9):
>> https://github.com/apache/spark/tree/v3.1.1-rc3
>> 
>> 
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>> 
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1367
>> 
>> 
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>> 
>> 
>> 
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>> 
>> 
>> This release is using the release script of the tag v3.1.1-rc3.
>> 
>> 
>> FAQ
>> 
>> ===
>> What happened to 3.1.0?
>> ===
>> 
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see 
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
>> for more details.
>> 
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and then
>> reporting any regressions.
>> 
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install 
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In Java/Scala, you can add the staging repository to your project's
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>> 
>> 
>> ===
>

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Reynold Xin
I don't think we should deprecate existing APIs.

Spark's own Python API is relatively stable and not difficult to support. It
has a pretty large number of users and existing code, and it's also pretty easy
for data engineers to learn.

The pandas API is great for data science, but isn't that great for some other
tasks. It's super wide. It's great for data scientists who have learned it, or
for copy-pasting from Stack Overflow.
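
As a rough sketch of the API difference being discussed (assuming the Koalas package, `databricks.koalas`, as it existed at the time of this thread; the data and column names are illustrative):

```
# The same group-by count written against the native PySpark API and against
# the pandas-style API that Koalas layers on top of Spark.
from pyspark.sql import SparkSession, functions as F
import databricks.koalas as ks  # the external Koalas package discussed here

spark = SparkSession.builder.getOrCreate()

# Native PySpark DataFrame API
sdf = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "val"])
sdf.groupBy("id").agg(F.count("val").alias("n")).show()

# pandas-style API via Koalas, backed by the same Spark engine
kdf = ks.DataFrame({"id": [1, 2, 2], "val": ["a", "b", "c"]})
print(kdf.groupby("id")["val"].count())
```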

On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
> 
> 
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> > complaints such as "why does PySpark follow pascalCase?" or "PySpark APIs
> > are difficult to learn", and APIs are very difficult to change in Spark
> > (as I emphasized above).
> 
> 
> 
> 
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon < gurwls...@gmail.com > wrote:
> 
> 
> 
>> 
>> 
>> Firstly, my biggest reason is that I would like to promote this as built-in
>> support, because it is simply important to have given its impact on a large
>> user group, and the needs are increasing as the charts indicate. I usually
>> think that features or add-ons stay as third parties when they are for a
>> smaller set of users, address a corner case of needs, etc. I think this is
>> similar to the data sources we have added. Spark ported CSV and Avro because
>> more and more people used them, and it became important to have them as
>> built-in support.
>> 
>> 
>> 
>> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
>> experts in the bigger community. The Koalas team aren't experts in all of
>> these areas, and there are many missing corner cases to fix; some require
>> deep expertise in specific areas.
>> 
>> 
>> 
>> One example is type hints. Koalas uses type hints for schema inference.
>> Due to limitations in Python's type hinting, Koalas added its own (hacky)
>> way (
>> https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas
>> ).
>> Fortunately, the approach Koalas implemented is now partially proposed into
>> Python officially (PEP 646).
>> But Koalas could have done better by interacting with the Python community
>> more and actively joining the design discussions together, to reach the best
>> outcome that benefits both and more projects.
>> 
>> 
>> 
>> Thirdly, I would like to contribute to the growth of PySpark. The growth of
>> Koalas is very fast given the internal and external stats; the number of
>> users has roughly doubled almost every 4 ~ 6 months.
>> I think Koalas will provide good momentum to keep Spark growing.
>> 
>> 
>> Fourthly, PySpark is still not Pythonic enough. For example, I hear
>> complaints such as "why does
>> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
>> APIs are very difficult to change
>> in Spark (as I emphasized above). This set of Koalas APIs will be able to
>> address these concerns
>> in PySpark.
>> 
>> Lastly, I really think PySpark needs its native plotting features. As I
>> emphasized before with
>> elaboration, I do think this is an important feature missing in PySpark
>> that users need.
>> I do think Koalas completes what PySpark is currently missing.
>> 
>> 
>> 
>> 
>> 
>>> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen < sro...@gmail.com > wrote:
>> 
>> 
>>> I like Koalas a lot. Playing devil's advocate, why not just let it
>>> continue to live as an add-on? Usually the argument is that it'll be
>>> maintained better in Spark, but it's already well maintained. Conversely,
>>> it adds some overhead to maintaining Spark. On the upside it makes it a
>>> little more discoverable. Are there more 'synergies'?
>>> 
>>> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
 
 
 Hi all,
 
 
 
 
 I would like to start the discussion on supporting pandas API layer on
 Spark.
 
 
 
 
 
 
 
 If we have a general consensus on having it in PySpark, I will initiate
 and drive an SPIP with a detailed explanation about the implementation’s
 overview and structure.
 
 
 
 I would appreciate it if I can know whether you guys support this or not
 before starting the SPIP.
 
 
 
 
  What do you want to propose?
 
 
 
 
 I have been working on the Koalas ( https://github.com/databricks/koalas )
 project that is essentially: pandas API support on Spark, and I would like
 to propose embracing Koalas in PySpark.
 
 
 
 
 
 
 
 More specifically, I am thinking about adding a separate package to
 PySpark for pandas APIs on PySpark. Therefore i

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Reynold Xin
+1. Would open up a huge persona for Spark.

On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < cutl...@gmail.com > wrote:

> 
> +1 (non-binding)
> 
> 
> On Fri, Mar 26, 2021 at 9:49 AM Maciej < mszymkiew...@gmail.com > wrote:
> 
> 
>> +1 (nonbinding)
>> 
>> 
>> 
>> On 3/26/21 3:52 PM, Hyukjin Kwon wrote:
>> 
>> 
>>> 
>>> 
>>> Hi all,
>>> 
>>> 
>>> 
>>> I’d like to start a vote for SPIP: Support pandas API layer on PySpark.
>>> 
>>> 
>>> 
>>> The proposal is to embrace Koalas in PySpark to have the pandas API layer
>>> on PySpark.
>>> 
>>> 
>>> 
>>> 
>>> Please also refer to:
>>> 
>>> 
>>> 
>>> * Previous discussion in dev mailing list: [DISCUSS] Support pandas API
>>> layer on PySpark (
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html
>>> ).
>>> * JIRA: SPARK-34849 ( https://issues.apache.org/jira/browse/SPARK-34849 )
>>> * Koalas internals documentation: 
>>> https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Please vote on the SPIP for the next 72 hours:
>>> 
>>> 
>>> 
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Best regards,
>> Maciej Szymkiewicz
>> 
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>> 
> 
>



Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-07 Thread Reynold Xin
+1

On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang < wgy...@gmail.com > wrote:

> 
> +1 (non-binding).
> 
> 
> On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
> 
> 
>> +1 for Apache Spark 3.2.0 RC7.
>> 
>> 
>> It looks good to me. I tested with EKS 1.21 additionally.
>> 
>> 
>> Cheers,
>> Dongjoon.
>> 
>> 
>> 
>> On Thu, Oct 7, 2021 at 7:46 PM 郑瑞峰 < ruife...@foxmail.com > wrote:
>> 
>> 
>>> +1 (non-binding)
>>> 
>>> 
>>> 
>>> -- Original Message --
>>> *From:* "Sean Owen" < sro...@gmail.com >;
>>> *Sent:* Thursday, October 7, 2021, 10:23 PM
>>> *To:* "Gengliang Wang" < ltn...@gmail.com >;
>>> *Cc:* "dev" < dev@spark.apache.org >;
>>> *Subject:* Re: [VOTE] Release Spark 3.2.0 (RC7)
>>> 
>>> 
>>> +1 again. Looks good in Scala 2.12, 2.13, and in Java 11.
>>> I note that the mem requirements for Java 11 tests seem to need to be
>>> increased but we're handling that separately. It doesn't really affect
>>> users.
>>> 
>>> On Wed, Oct 6, 2021 at 11:49 AM Gengliang Wang < ltn...@gmail.com > wrote:
>>> 
>>> 
 Please vote on releasing the following candidate as Apache Spark version
 3.2.0.
 
 
 
 The vote is open until 11:59pm Pacific time October 11 and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 
 
 
 [ ] +1 Release this package as Apache Spark 3.2.0
 
 [ ] -1 Do not release this package because ...
 
 
 
 To learn more about Apache Spark, please see http://spark.apache.org/
 
 
 
 The tag to be voted on is v3.2.0-rc7 (commit
 5d45a415f3a29898d92380380cfd82bfc7f579ea):
 
 https://github.com/apache/spark/tree/v3.2.0-rc7
 
 
 
 The release files, including signatures, digests, etc. can be found at:
 
 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-bin/
 
 
 
 Signatures used for Spark RCs can be found in this file:
 
 https://dist.apache.org/repos/dist/dev/spark/KEYS
 
 
 
 The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1394
 
 
 
 
 The documentation corresponding to this release can be found at:
 
 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-docs/
 
 
 
 The list of bug fixes going into 3.2.0 can be found at the following URL:
 
 https://issues.apache.org/jira/projects/SPARK/versions/12349407
 
 
 This release is using the release script of the tag v3.2.0-rc7.
 
 
 
 
 
 FAQ
 
 
 
 =
 
 How can I help test this release?
 
 =
 
 If you are a Spark user, you can help us test this release by taking
 
 an existing Spark workload and running on this release candidate, then
 
 reporting any regressions.
 
 
 
 If you're working in PySpark you can set up a virtual env, install
 
 the current RC, and see if anything important breaks. In Java/Scala
 
 you can add the staging repository to your project's resolvers and test
 
 with the RC (make sure to clean up the artifact cache before/after so
 
 you don't end up building with an out-of-date RC going forward).
 
 
 
 ===
 
 What should happen to JIRA tickets still targeting 3.2.0?
 
 ===
 
 The current list of open tickets targeted at 3.2.0 can be found at:
 
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.2.0
 
 
 
 Committers should look at those and triage. Extremely important bug
 
 fixes, documentation, and API tweaks that impact compatibility should
 
 be worked on immediately. Everything else please retarget to an
 
 appropriate release.
 
 
 
 ==
 
 But my bug isn't fixed?
 
 ==
 
 In order to make timely releases, we will typically not hold the
 
 release unless the bug in question is a regression fro

Re: spark binary map

2021-10-16 Thread Reynold Xin
Read up on Unsafe here: https://mechanical-sympathy.blogspot.com/

On Sat, Oct 16, 2021 at 12:41 AM, Rohan Bajaj < rohanbaja...@gmail.com > wrote:

> 
> In 2015 Reynold Xin made improvements to Spark that basically moved some
> structures that were on the Java heap off the heap.
> 
> 
> In particular it seemed like the memory did not require any
> serialization/deserialization.
> 
> 
> How was this performed? Was the data memory-mapped? If it was, then to
> avoid serialization/deserialization I'm assuming some sort of wrapper was
> introduced to allow access to that data.
> 
> 
> Something like this:
> 
> 
> struct DataType
> {
> long pointertoData;
> 
> 
> method1();
> method2();
> }
> 
> 
> but if it was done this way there is an extra indirection, and I'm
> assuming it benchmarked positively.
> 
> 
> Just trying to learn.
>
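
As a toy illustration of the idea in the question (not how Spark itself does it — Spark uses `sun.misc.Unsafe` and its `UnsafeRow` binary format on the JVM), fixed-width fields can be read by offset from one flat buffer through a thin wrapper, so no per-record object deserialization is needed:

```
# Toy illustration only: fixed-width records packed into one contiguous buffer,
# read back by offset with struct.unpack_from instead of deserializing objects.
import struct

RECORD = struct.Struct("<qd")          # one 8-byte long + one 8-byte double per record
buf = bytearray(RECORD.size * 3)

# Write three (id, value) records directly into the buffer.
for i, (rid, val) in enumerate([(1, 1.5), (2, 2.5), (3, 3.5)]):
    RECORD.pack_into(buf, i * RECORD.size, rid, val)

# A thin "wrapper" that exposes accessors over the raw bytes -- the extra
# indirection the question mentions -- without materializing objects up front.
class Record:
    def __init__(self, buf, offset):
        self.buf, self.offset = buf, offset
    def id(self):
        return struct.unpack_from("<q", self.buf, self.offset)[0]
    def value(self):
        return struct.unpack_from("<d", self.buf, self.offset + 8)[0]

r = Record(buf, RECORD.size)           # points at the second record
print(r.id(), r.value())               # prints: 2 2.5
```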



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
tl;dr: there's no easy way to implement aggregate expressions that require
multiple passes over the data. It is simply not something that's supported, and
doing so would come at a very high cost.

Would you be OK using approximate percentile? That's relatively cheap.
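
For reference, a minimal sketch of the approximate-percentile route suggested here, using the built-in `percentile_approx` function (available in Spark 3.1+); the data and column name are illustrative:

```
# Approximate median via percentile_approx: single pass, bounded memory.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 1001).withColumnRenamed("id", "x")   # illustrative data

df.agg(
    F.percentile_approx("x", 0.5).alias("approx_median"),
    # A smaller accuracy parameter trades precision for memory (10000 is the default).
    F.percentile_approx("x", 0.5, 100).alias("approx_median_lo_acc"),
).show()
```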

On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> No takers here? :)
> 
> 
> I can see now why a median function is not available in most data
> processing systems. It's pretty annoying to implement!
> 
> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas < nicholas.cham...@gmail.com > wrote:
> 
> 
>> I'm trying to create a new aggregate function. It's my first time working
>> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>> 
>> 
>> My goal is to create a function to calculate the median (
>> https://issues.apache.org/jira/browse/SPARK-26589 ).
>> 
>> 
>> As a very simple solution, I could just define median to be an alias of
>> `Percentile(col, 0.5)`. However, the leading comment on the Percentile expression (
>> https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39
>> ) highlights that it's very memory-intensive and can easily lead to
>> OutOfMemory errors.
>> 
>> 
>> So instead of using Percentile, I'm trying to create an Expression that
>> calculates the median without needing to hold everything in memory at
>> once. I'm considering two different approaches:
>> 
>> 
>> 1. Define Median as a combination of existing expressions: The median can
>> perhaps be built out of the existing expressions for Count (
>> https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48
>> ) and NthValue (
>> https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675
>> ).
>> 
>> 
>> 
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>> 
>> 
>> 
>> 
>> 
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>> 
>> 
>> 
>> 
>> 2. Another memory-light approach to calculating the median requires
>> multiple passes over the data to converge on the answer. The approach is 
>> described
>> here (
>> https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers
>> ). (I posted a sketch implementation of this approach using Spark's
>> user-level API here (
>> https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081
>> ).)
>> 
>> 
>> 
>>> I am also struggling to understand how I would build an aggregate function
>>> like this, since it requires multiple passes over the data. From what I
>>> can see, Catalyst's aggregate functions are designed to work with a single
>>> pass over the data.
>>> 
>>> 
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>> 
>> 
>> 
>> Again, this is my first serious foray into Catalyst. Any specific
>> implementation guidance is appreciated!
>> 
>> 
>> Nick
>> 
> 
>



Re: Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Reynold Xin
This is why RoundRobinPartitioning shouldn't be used ...
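
A short sketch of the distinction behind this comment (column names are illustrative): `repartition(n)` without columns uses round-robin partitioning, whose output depends on the input row order, while repartitioning by a column is a pure function of the row:

```
# Round-robin vs. hash repartitioning (column names are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).withColumnRenamed("id", "key")

rr = df.repartition(8)             # RoundRobinPartitioning: which partition a row lands in
                                   # depends on input order, so a recomputed upstream task
                                   # after a FetchFailure can place rows differently
hashed = df.repartition(8, "key")  # HashPartitioning: a row's partition is a deterministic
                                   # function of its key, so recomputation is consistent
```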

On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:

> 
> Hi Spark community,
> 
> I reported a data correctness issue in
> https://issues.apache.org/jira/browse/SPARK-38388 .
> In short, non-deterministic data + Repartition + FetchFailure could result
> in incorrect data. This is an issue we ran into in production pipelines; I
> have an example that reproduces the bug in the ticket.
> 
> I'm reporting it here to bring more attention. Could you help confirm that
> it's a bug and worth the effort to investigate and fix further? Thank you in
> advance for your help!
> 
> Thanks,
> Jason Xu
>



Re: Stickers and Swag

2022-06-14 Thread Reynold Xin
Nice! Going to order a few items myself ...

On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang < ltn...@gmail.com > wrote:

> 
> FYI now you can find the shopping information on
> https://spark.apache.org/community as well :)
> 
> 
> 
> Gengliang
> 
> 
> 
> 
> 
> 
>> On Jun 14, 2022, at 7:47 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote:
>> 
>> Woohoo
>> 
>> On Tue, 14 Jun 2022 at 15:04, Xiao Li < gatorsm...@gmail.com > wrote:
>> 
>> 
>>> Hi, all,
>>> 
>>> 
>>> The ASF has an official store at RedBubble (
>>> https://www.redbubble.com/people/comdev/shop ) that Apache Community
>>> Development (ComDev) runs. If you are interested in buying Spark Swag, 70
>>> products featuring the Spark logo are available:
>>> https://www.redbubble.com/shop/ap/113203780
>>> 
>>> 
>>> Go Spark!
>>> 
>>> 
>>> Xiao
>>> 
>> 
>> 
> 
>



Re: Re: [VOTE][SPIP] Spark Connect

2022-06-15 Thread Reynold Xin
+1, super excited about this. I think it'd make Spark a lot more usable in
application development and cloud settings:

(1) Makes it easier to embed in applications with thinner client dependencies.
(2) Easier to isolate user code vs system code in the driver.

(3) Opens up the potential to upgrade the server side for better performance 
and security updates without application changes.

One thing related to (2) I'd love to discuss, but at a separate thread, is 
whether it'd make sense to expand or work on a project to better specify system 
code vs user code boundary in the executors as well. Then we can really have 
complete system code vs user code isolation in execution.

On Wed, Jun 15, 2022 at 9:21 AM, Xiao Li < gatorsm...@gmail.com > wrote:

> 
> +1
> 
> 
> Xiao
> 
> beliefer < belie...@163.com > wrote on Tuesday, June 14, 2022 at 03:35:
> 
> 
> 
>> +1
>> Yeah, I tried to use Apache Livy so that we could run interactive queries.
>> But the Spark driver in Livy looks heavy.
>> 
>> 
>> The SPIP may resolve the issue.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> At 2022-06-14 18:11:21, "Wenchen Fan" < cloud0...@gmail.com > wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng < ruife...@foxmail.com > wrote:
>>> 
>>> 
 +1
 
 
 
 
 -- Original Message --
 *From:* "huaxin gao" < huaxin.ga...@gmail.com >;
 *Sent:* Tuesday, June 14, 2022, 8:47 AM
 *To:* "L. C. Hsieh" < vii...@gmail.com >;
 *Cc:* "Spark dev list" < dev@spark.apache.org >;
 
 *Subject:* Re: [VOTE][SPIP] Spark Connect
 
 
 +1
 
 
 On Mon, Jun 13, 2022 at 5:42 PM L. C. Hsieh < vii...@gmail.com > wrote:
 
 
> +1
> 
> On Mon, Jun 13, 2022 at 5:41 PM Chao Sun < sunc...@apache.org > wrote:
> >
> > +1 (non-binding)
> >
> >> On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
> >>
> >> +1
> >>
> >>> On Tue, 14 Jun 2022 at 08:50, Yuming Wang < wgy...@gmail.com > wrote:
> >>>
> >>> +1.
> >>>
> >>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia < matei.zaha...@gmail.com > wrote:
> 
>  +1, very excited about this direction.
> 
>  Matei
> 
> On Jun 13, 2022, at 11:07 AM, Herman van Hovell < her...@databricks.com.INVALID > wrote:
> 
>  Let me kick off the voting...
> 
>  +1
> 
> > On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell < her...@databricks.com > wrote:
> >
> > Hi all,
> >
> > I’d like to start a vote for SPIP: "Spark Connect"
> >
> > The goal of the SPIP is to introduce a Dataframe based client/server
> API for Spark
> >
> > Please also refer to:
> >
> > - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
> Connect - A client and server interface for Apache Spark.
> > - Design doc: Spark Connect - A client and server interface for
> Apache Spark.
> > - JIRA: SPARK-39375
> >
> > Please vote on the SPIP for the next 72 hours:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don’t think this is a good idea because …
> >
> > Kind Regards,
> > Herman
> 
> 
> 
 
 
 
>>> 
>>> 
>> 
>> 
> 
>



Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Reynold Xin
Spark Connect :)

(It’s work in progress)
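
For reference, a minimal sketch of what this looks like with the Spark Connect client that later shipped in Spark 3.4+, assuming a Connect server is already running locally on the default port (15002); both processes attach to the same remote Spark instance instead of each starting their own driver:

```
# Process A and process B can each create a thin client session against the
# same Spark Connect server (assumes pyspark[connect] 3.4+ and a running server).
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
```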

On Mon, Dec 12 2022 at 2:29 PM, Kevin Su < pings...@gmail.com > wrote:

> 
> Hey there, how can I get the same Spark context in two different Python
> processes?
> Let’s say I create a context in process A, and then I want to use Python
> subprocess B to get the Spark context created by process A. How can I
> achieve that?
> 
> 
> I've tried
> pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it
> will create a new spark context.
>



Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Reynold Xin
+1

On Thu, Jan 12, 2023 at 9:46 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1 for the proposal (guiding only without any code change).
> 
> 
> Thanks,
> Dongjoon.
> 
> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu < zsxw...@gmail.com > wrote:
> 
> 
>> +1
>> 
>> 
>> 
>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das < tathagata.das1...@gmail.com > wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
 +1
 
 
 On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim < kabhwan.opensou...@gmail.com > wrote:
 
 
> bump for more visibility.
> 
> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim < kabhwan.opensou...@gmail.com > wrote:
> 
> 
>> Hi dev,
>> 
>> 
>> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
>> promoting Structured Streaming.
>> (Sorry for the late proposal; if we don't make the change in 3.4, we will
>> have to wait another 6 months.)
>> 
>> 
>> We have been focusing on Structured Streaming for years (across multiple
>> major and minor versions), and during that time we haven't made any
>> improvements to DStream. Furthermore, we recently updated the DStream doc
>> to explicitly say DStream is a legacy project:
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>> 
>> 
>> 
>> The baseline for deprecation is that we don't see a particular use case
>> which only DStream solves. This is a different story from GraphX and
>> MLlib, as we don't have replacements for those.
>> 
>> 
>> The proposal does not mean we will remove the API soon, in line with how the
>> Spark project has handled deprecation of public APIs. I don't intend to
>> propose a target version for removal. The goal is to guide users to
>> refrain from constructing new workloads with DStream. We might want to go
>> further in the future, but that would require a new discussion thread at
>> that time.
>> 
>> 
>> What do you think?
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
> 
> 
 
 
>>> 
>>> 
>> 
>> 
> 
>



Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Reynold Xin
+1

This is a great idea.
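
For context, a minimal sketch of the kind of utility the SPIP describes, using the DataFrame equality helper that eventually shipped in `pyspark.testing` (Spark 3.5+); the test itself is illustrative:

```
# A tiny example test using the DataFrame equality util
# (assumes pyspark >= 3.5, where pyspark.testing.assertDataFrameEqual exists).
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

def test_uppercase_names():
    spark = SparkSession.builder.getOrCreate()
    actual = (
        spark.createDataFrame([("tom",), ("bob",)], ["name"])
        .selectExpr("upper(name) AS name")
    )
    expected = spark.createDataFrame([("TOM",), ("BOB",)], ["name"])
    assertDataFrameEqual(actual, expected)  # raises with a readable diff on mismatch
```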

On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> I’d like to start with a +1, better Python testing tools integrated into
> the project make sense.
> 
> On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu < amandastephanie...@gmail.com > wrote:
> 
> 
>> Hi all,
>> 
>> I'd like to start the vote for SPIP: PySpark Test Framework.
>> 
>> The high-level summary for the SPIP is that it proposes an official test
>> framework for PySpark. Currently, there are only disparate open-source
>> repos and blog posts for PySpark testing resources. We can streamline and
>> simplify the testing process by incorporating test features, such as a
>> PySpark Test Base class (which allows tests to share Spark sessions) and
>> test util functions (for example, asserting dataframe and schema
>> equality).
>> 
>> *SPIP doc:*
>> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>> 
>> 
>> *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042
>> 
>> *Discussion thread:* https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>> 
>> Please vote on the SPIP for the next 72 hours:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>> 
>> Thank you!
>> 
>> Best,
>> Amanda Liu
>> 
>> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Reynold Xin
Personally I'd love this, but I agree with some of the earlier comments that 
this should not be Python specific (meaning I should be able to implement a 
data source in Python and then make it usable across all languages Spark 
supports). I think we should find a way to make this reusable beyond Python 
(especially for SQL).

Python is the most popular programming language by a large margin, in general 
and among Spark users. Many of the organizations that use Spark often don't 
even have a single person that knows Scala. What if they want to implement a 
custom data source to fetch some data? Today we'd have to tell them to learn 
Scala/Java and the fairly complex data source API (v1 or v2).

Maciej - I understand your concern about endpoint throttling etc. And this goes
well beyond querying REST endpoints. I personally had that concern too
when we were adding the JDBC data source (what if somebody launches a 512-node
Spark cluster to query my single-node MySQL cluster?!). But the built-in JDBC
data source is one of the most popular data sources (I just looked up its usage
on Databricks and it's by far the #1 data source outside of files, used by >
1 organizations every day).
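
For context, a rough sketch of what a custom "fetch from an API" source looks like today without a Python data source API — a plain RDD that fans the calls out and is then converted to a DataFrame (the endpoint, the `requests` dependency on the executors, and the work-item list are all illustrative assumptions):

```
# Fan an API out over executors with a plain RDD, then convert to a DataFrame.
from pyspark.sql import SparkSession
import requests  # assumed to be installed on the executors

spark = SparkSession.builder.getOrCreate()
item_ids = list(range(100))  # hypothetical work items to fetch

def fetch_partition(ids):
    # One HTTP session per partition, one request per item (placeholder endpoint).
    with requests.Session() as s:
        for i in ids:
            resp = s.get(f"https://example.com/api/items/{i}")
            yield (i, resp.status_code)

rdd = spark.sparkContext.parallelize(item_ids, numSlices=8).mapPartitions(fetch_partition)
df = rdd.toDF(["id", "status"])
df.show()
```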

On Sun, Jun 25, 2023 at 1:38 AM, Maciej < mszymkiew...@gmail.com > wrote:

> 
> 
> 
> Thanks for your feedback Martin.
> 
> However, if the primary intended purpose of this API is to provide an
> interface for endpoint querying, then I find this proposal even less
> convincing.
> 
> 
> 
> Neither the Spark execution model nor the data source API (full or
> restricted as proposed here) are a good fit for handling problems arising
> from massive endpoint requests, including, but not limited to, handling
> quotas and rate limiting.
> 
> 
> 
> Consistency and streamlined development are, of course, valuable.
> Nonetheless, they are not sufficient, especially if they cannot deliver
> the expected user experience in terms of reliability and execution cost.
> 
> 
> 
> 
> 
> 
> Best regards,
> Maciej Szymkiewicz
> 
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
> On 6/24/23 23:42, Martin Grund wrote:
> 
> 
>> Hey,
>> 
>> 
>> I would like to express my strong support for Python Data Sources even
>> though they might not be immediately as powerful as Scala-based data
>> sources. One element that is easily lost in this discussion is how much
>> faster the iteration speed is with Python compared to Scala. Due to the
>> dynamic nature of Python, you can design and build a data source while
>> running in a notebook and continuously change the code until it works as
>> you want. This behavior is unparalleled!
>> 
>> 
>> There exists a litany of Python libraries connecting to all kinds of
>> different endpoints that could provide data that is usable with Spark. I
>> personally can imagine implementing a data source on top of the AWS SDK to
>> extract EC2 instance information. Now I don't have to switch tools and can
>> keep my pipeline consistent.
>> 
>> 
>> Let's say you want to query an API in parallel from Spark using Python.
>> Today's way would be to create a Python RDD, implement the planning and
>> execution process manually, and finally call `toDF` at the end. While the
>> actual code of the DS and the RDD-based implementation would be very
>> similar, the abstraction provided by the DS is much more powerful
>> and future-proof. Dynamic partition elimination and filter
>> push-down can all be implemented at a later point in time.
>> 
>> 
>> Comparing a DS to batch calling from a UDF is not great because the
>> execution pattern would be very brittle. Imagine something like
>> `spark.range(10).withColumn("data",
>> fetch_api).explode(col("data")).collect()`. Here you're encoding
>> partitioning logic and data transformation in simple ways, but you can't
>> reason about the structural integrity of the query, and tiny changes in the
>> UDF interface might already cause a lot of downstream issues.
>> 
>> 
>> 
>> 
>> Martin
>> 
>> 
>> 
>> On Sat, Jun 24, 2023 at 1:44 AM Maciej < mszymkiew...@gmail.com > wrote:
>> 
>> 
>>> 
>>> 
>>> With such a limited scope (both language availability and features), do we
>>> have any representative examples of sources that could significantly
>>> benefit from providing this API, compared to other available options such
>>> as batch imports, direct queries from vectorized UDFs, or even interfacing
>>> with sources through 3rd-party FDWs?
>>> 
>>> 
>>> Best regards,
>>> Maciej Szymkiewicz
>>> 
>>> Web: https://zero323.net
>>> PGP: A30CEF0C31A501EC
>>> On 6/20/23 16:23, Wenchen Fan wrote:
>>> 
>>> 
 In an ideal world, every data source you want to connect to already has a
 Spark data source implementation (either v1 or v2), in which case this Python
 API is useless. But I feel it's common that people want to do quick data
 exploration, and the target data system is not popular eno

Re: [VOTE][SPIP] Python Data Source API

2023-07-07 Thread Reynold Xin
+1!

On Fri, Jul 7 2023 at 11:58 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> +1
> 
> 
> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao < huaxin.ga...@gmail.com > wrote:
> 
> 
> 
>> +1
>> 
>> 
>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh < mich.talebza...@gmail.com
>> > wrote:
>> 
>> 
>>> +1 for me
>>> 
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ** view my Linkedin profile (
>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ )
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from
>>> such loss, damage or destruction.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, 7 Jul 2023 at 11:05, Martin Grund 
>>> wrote:
>>> 
>>> 
 +1 (non-binding)
 
 
 On Fri, Jul 7, 2023 at 12:05 AM Denny Lee < denny.g@gmail.com > wrote:
 
 
 
> +1 (non-binding)
> 
> On Fri, Jul 7, 2023 at 00:50 Maciej < mszymkiew...@gmail.com > wrote:
> 
> 
>> 
>> 
>> +0
>> 
>> 
>> Best regards,
>> Maciej Szymkiewicz
>> 
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>> On 7/6/23 17:41, Xiao Li wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> Xiao
>>> 
>>> Hyukjin Kwon < gurwls...@apache.org > wrote on Wednesday, July 5, 2023 at 17:28:
>>> 
>>> 
 +1.
 
 See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
 
 
 On Thu, 6 Jul 2023 at 09:15, Allison Wang 
 
 ( allison.w...@databricks.com.invalid ) wrote:
 
 
> Hi all,
> 
> I'd like to start the vote for SPIP: Python Data Source API.
> 
> The high-level summary for the SPIP is that it aims to introduce a 
> simple
> API in Python for Data Sources. The idea is to enable Python 
> developers to
> create data sources without learning Scala or dealing with the
> complexities of the current data source APIs. This would make Spark 
> more
> accessible to the wider Python developer community.
> 
> 
> References:
> 
> 
> * SPIP doc (
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
> )
> * JIRA ticket ( https://issues.apache.org/jira/browse/SPARK-44076 )
> * Discussion thread (
> https://lists.apache.org/thread/w621zn14ho4rw61b0s139klnqh900s8y )
> 
> 
> 
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
> 
> Thanks,
> Allison
> 
 
 
>>> 
>>> 
>> 
>> 
> 
> 
 
 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> ( https://amzn.to/2MaRAG9 )
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>



Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Reynold Xin
It should be the same as SQL. Otherwise it takes away a lot of potential future 
optimization opportunities.
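
As a small illustration of the point (this loosely mirrors the Tom/Bob example from the `DataFrame.offset` docstring the thread refers to; `offset` is available in recent PySpark releases, and the ages are illustrative):

```
# Without an explicit sort, which row offset(1) skips is implementation-dependent;
# with an orderBy, the result is well defined.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Tom", 20), ("Bob", 30)], ["name", "age"])

df.offset(1).show()                  # ordering not guaranteed: either row could remain
df.orderBy("name").offset(1).show()  # deterministic: "Bob" sorts first, so "Tom" remains
```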

On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com > 
wrote:

> 
> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
> 
> 
> In SQL, the result order of any query is implementation-dependent without
> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
> table;` 10 times in a row and get 10 different orderings.
> 
> 
> I thought the same applied to DataFrames, but the docstring for the
> recently added method DataFrame.offset (
> https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301
> ) implies otherwise.
> 
> 
> This example will work fine in practice, of course. But if DataFrames are
> technically unordered without an explicit ordering clause, then in theory
> a future implementation change may result in “Bob" being the “first” row
> in the DataFrame, rather than “Tom”. That would make the example
> incorrect.
> 
> 
> Is that not the case?
> 
> 
> Nick
>



Re: [DISCUSS] SPIP: ShuffleManager short name registration via SparkPlugin

2023-11-04 Thread Reynold Xin
Why do we need this? The reason data source APIs need it is because it will be 
used by very unsophisticated end users and used all the time (for each 
connection / query). Shuffle is something you set up once, presumably by fairly 
sophisticated admins / engineers.
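
For context, a sketch of the configuration surface discussed in the quoted SPIP below: today `spark.shuffle.manager` takes a fully qualified class name, and the proposal would additionally allow a plugin-registered short name (all class names and the short name here are hypothetical):

```
# Today: spark.shuffle.manager takes a fully qualified ShuffleManager class name.
# Under the SPIP, a driver plugin could register a short name for it instead.
# The class names and the "myshuffle" short name below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.example.MyDriverPlugin")            # hypothetical plugin
    .config("spark.shuffle.manager", "com.example.MyShuffleManager")  # today: full class name
    # proposed: .config("spark.shuffle.manager", "myshuffle")         # short name via the plugin
    .getOrCreate()
)
```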

On Sat, Nov 04, 2023 at 2:42 PM, Alessandro Bellina < abell...@gmail.com > 
wrote:

> 
> Hello devs,
> 
> 
> I would like to start discussion on the SPIP "ShuffleManager short name
> registration via SparkPlugin"
> 
> 
> The idea behind this change is to allow a driver plugin (spark.plugins) to
> export ShuffleManagers via short names, along with sensible default
> configurations. Users can then use this short name to enable this
> ShuffleManager + configs using spark.shuffle.manager.
> 
> 
> SPIP:
> https://docs.google.com/document/d/1flijDjMMAAGh2C2k-vg1u651RItaRquLGB_sVudxf6I/edit#heading=h.vqpecs4nrsto
> JIRA: https://issues.apache.org/jira/browse/SPARK-45792
> 
> 
> I look forward to hearing your feedback.
> 
> 
> Thanks
> 
> 
> Alessandro
>



Re: [VOTE] SPIP: Testing Framework for Spark UI Javascript files

2023-11-24 Thread Reynold Xin
+1

On Fri, Nov 24, 2023 at 10:19 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1
> 
> 
> Thanks,
> Dongjoon.
> 
> On Fri, Nov 24, 2023 at 7:14 PM Ye Zhou < zhouye...@gmail.com > wrote:
> 
> 
>> +1(non-binding)
>> 
>> On Fri, Nov 24, 2023 at 11:16 Mridul Muralidharan < mri...@gmail.com > wrote:
>> 
>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On Fri, Nov 24, 2023 at 8:21 AM Kent Yao < y...@apache.org > wrote:
>>> 
>>> 
 Hi Spark Dev,
 
 Following the discussion [1], I'd like to start the vote for the SPIP [2].
 
 
 The SPIP aims to improve test coverage and the development experience for
 Spark UI-related JavaScript code.
 
 This thread will be open for at least the next 72 hours. Please vote
 accordingly:
 
 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …
 
 
 Thank you!
 Kent Yao
 
 [1] https://lists.apache.org/thread/5rqrho4ldgmqlc173y2229pfll5sgkff
 [2] https://docs.google.com/document/d/1hWl5Q2CNNOjN5Ubyoa28XmpJtDyD9BtGtiEG2TT94rg/edit?usp=sharing
 
>>> 
>>> 
>>> 
>> 
>> 
> 
>



Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Reynold Xin
+1 on doing this in 3.0.

On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com > 
wrote:

> 
> I’m +1 if 3.0
> 
> 
> 
>  
> *From:* Sean Owen < sro...@gmail.com >
> *Sent:* Monday, March 25, 2019 6:48 PM
> *To:* Hyukjin Kwon
> *Cc:* dev; Bryan Cutler; Takuya UESHIN; shane knapp
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>  
> I don't know a lot about Arrow here, but seems reasonable. Is this for
> Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
> seems right.
> 
> On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
> >
> > Hi all,
> >
> > We really need to upgrade the minimal version soon. It's actually
> slowing down PySpark development, for instance, through the overhead of
> sometimes having to test the full matrix of Arrow and pandas versions. Also,
> it currently requires adding some weird hacks or ugly code. Some bugs
> exist in lower versions, and some features are not supported in older
> PyArrow, for instance.
> >
> > Per Bryan's recommendation (he's an Apache Arrow + Spark committer, FWIW) and
> my opinion as well, we should increase the minimal version to
> 0.12.x. (Also, note that pandas <> Arrow integration is an experimental feature.)
> >
> > So, Bryan and I will proceed with this in roughly a few days if there are no
> objections, assuming we're fine with increasing it to 0.12.x. Please let me
> know if there are any concerns.
> >
> > For clarification, this requires some jobs in Jenkins to upgrade the
> minimal version of PyArrow (I cc'ed Shane as well).
> >
> > PS: I heard that Shane's busy with some work stuff .. but this is
> kind of important from my perspective.
> >
> 
>

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Reynold Xin
At some point we should celebrate having the largest RC number ever in Spark ...

On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote:

> 
> 
> 
> RC9 was just cut. Will send out another thread once the build is finished.
> 
> 
> 
> 
> Sincerely,
> 
> 
> 
> DB Tsai
> -- Web: https://www.dbtsai.com/
> PGP Key ID: 42E5B25A8F7A82C1
> 
> 
> 
> On Mon, Mar 25, 2019 at 5:10 PM Sean Owen < sro...@apache.org > wrote:
> 
> 
>> 
>> 
>> That's all merged now. I think you're clear to start an RC.
>> 
>> 
>> 
>> On Mon, Mar 25, 2019 at 4:06 PM DB Tsai < dbt...@dbtsai.com.invalid > wrote:
>> 
>> 
>>> 
>>> 
>>> I am going to cut a 2.4.1 RC9 tonight. Besides SPARK-26961
>>> ( https://github.com/apache/spark/pull/24126 ), is there anything critical
>>> that we have to wait for before the 2.4.1 release? Thanks!
>>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> 
>>> 
>>> DB Tsai
>>> -- Web: https://www.dbtsai.com/
>>> PGP Key ID: 42E5B25A8F7A82C1
>>> 
>>> 
>>> 
>>> On Sun, Mar 24, 2019 at 8:19 PM Sean Owen < sro...@apache.org > wrote:
>>> 
>>> 
 
 
 Still waiting on a successful test - hope this one works.
 
 
 
 On Sun, Mar 24, 2019, 10:13 PM DB Tsai < dbt...@dbtsai.com > wrote:
 
 
> 
> 
> Hello Sean,
> 
> 
> 
> Looking at the SPARK-26961 PR, it seems ready to go. Do you think we can
> merge it into the 2.4 branch soon?
> 
> 
> 
> Sincerely,
> 
> 
> 
> DB Tsai
> -- Web: https://www.dbtsai.com/
> PGP Key ID: 42E5B25A8F7A82C1
> 
> 
> 
> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen < sro...@apache.org > wrote:
> 
> 
>> 
>> 
>> I think we can/should get in SPARK-26961 too; it's all but ready to
>> commit.
>> 
>> 
>> 
>> On Sat, Mar 23, 2019 at 2:02 PM DB Tsai < dbt...@dbtsai.com > wrote:
>> 
>> 
>>> 
>>> 
>>> -1
>>> 
>>> 
>>> 
>>> I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
>>> SPARK-27178, SPARK-27112. Please let me know if there is any critical PR
>>> that has to be back-ported into branch-2.4.
>>> 
>>> 
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> 
>>> 
>>> DB Tsai
>>> -- Web: https://www.dbtsai.com/
>>> PGP Key ID: 42E5B25A8F7A82C1
>>> 
>>> 
>>> 
>>> On Fri, Mar 22, 2019 at 12:28 AM DB Tsai < dbt...@dbtsai.com > wrote:
>>> 
>>> 
 
 
 Since we have a couple of concerns and hesitations about releasing RC8, how
 about we give it a couple of days and have another vote on Monday, March 25?
 In that case, I will cut another RC9 on Monday morning.
 
 
 
 Darcy, as Dongjoon mentioned,
 https://github.com/apache/spark/pull/24092 conflicts with
 branch-2.4; can you make another PR against branch-2.4 so we can include
 the ORC fix in 2.4.1?
 
 
 
 Thanks.
 
 
 
 Sincerely,
 
 
 
 DB Tsai
 -- Web: https://www.dbtsai.com/
 PGP Key ID: 42E5B25A8F7A82C1
 
 
 
 On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung < felixcheun...@hotmail.com > wrote:
 
 
> 
> 
> Reposting for shane here
> 
> 
> 
> [SPARK-27178]
> https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> 
> 
> 
> is problematic too, and it’s not in the RC8 cut
> 
> 
> 
> https://github.com/apache/spark/commits/branch-2.4
> 
> 
> 
> (Personally I don’t want to delay 2.4.1 either..)
> 
> 
> 
> 
> From

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have been thinking about some of these issues. Some of them are harder
to do, e.g. Spark DataFrames are fundamentally immutable, and making the
logical plan mutable is a significant deviation from the current paradigm
that might confuse the hell out of some users. We are considering building
a shim layer as a separate project on top of Spark (so we can make rapid
releases based on feedback) just to test this out and see how well it could
work in practice.

On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari 
wrote:

> Hi,
> I was doing some spark to pandas (and vice versa) conversion because some
> of the pandas codes we have don't work on huge data. And some spark codes
> work very slow on small data.
>
> It was nice to see that pyspark had some similar syntax for the common
> pandas operations that the python community is used to.
>
> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
> Column selects: df[['col1', 'col2']]
> Row Filters: df[df['col1'] < 3.0]
>
> I was wondering about a bunch of other functions in pandas which seemed
> common. And thought there must've been a discussion about it in the
> community - hence started this thread.
>
> I was wondering whether there has been discussion on adding the following
> functions:
>
> *Column setters*:
> In Pandas:
> df['col3'] = df['col1'] * 3.0
> While I do the following in PySpark:
> df = df.withColumn('col3', df['col1'] * 3.0)
>
> *Column apply()*:
> In Pandas:
> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
> While I do the following in PySpark:
> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>
> I understand that this one cannot be as simple as in pandas due to the
> output-type that's needed here. But could be done like:
> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>
> Multi column in pandas is:
> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0)
> Maybe this can be done in pyspark as or if we can send a pyspark.sql.Row
> directly it would be similar (?):
> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
> 'float')
>
> *Rename*:
> In Pandas:
> df.rename(columns={...})
> While I do the following in PySpark:
> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>
> *To Dictionary*:
> In Pandas:
> df.to_dict(orient='list')
> While I do the following in PySpark:
> {f.name: [row[i] for row in df.collect()] for i, f in
> enumerate(df.schema.fields)}
>
> I thought I'd start the discussion with these and come back to some of the
> others I see that could be helpful.
>
> *Note*: (with the column functions in mind) I understand the concept of
> the DataFrame cannot be modified. And I am not suggesting we change that
> nor any underlying principle. Just trying to add syntactic sugar here.
>
>


Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have some early stuff there but not quite ready to talk about it in
public yet (I hope soon though). Will shoot you a separate email on it.

On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari 
wrote:

> Thanks for the reply Reynold - Has this shim project started ?
> I'd love to contribute to it - as it looks like I have started making a
> bunch of helper functions to do something similar for my current task and
> would prefer not doing it in isolation.
> Was considering making a git repo and pushing stuff there just today
> morning. But if there's already folks working on it - I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin  wrote:
>
>> We have been thinking about some of these issues. Some of them are harder
>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable is a significant deviation from the current paradigm
>> that might confuse the hell out of some users. We are considering building
>> a shim layer as a separate project on top of Spark (so we can make rapid
>> releases based on feedback) just to test this out and see how well it could
>> work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <
>> abdealikoth...@gmail.com> wrote:
>>
>>> Hi,
>>> I was doing some spark to pandas (and vice versa) conversion because
>>> some of the pandas codes we have don't work on huge data. And some spark
>>> codes work very slow on small data.
>>>
>>> It was nice to see that pyspark had some similar syntax for the common
>>> pandas operations that the python community is used to.
>>>
>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>> Column selects: df[['col1', 'col2']]
>>> Row Filters: df[df['col1'] < 3.0]
>>>
>>> I was wondering about a bunch of other functions in pandas which seemed
>>> common. And thought there must've been a discussion about it in the
>>> community - hence started this thread.
>>>
>>> I was wondering whether there has been discussion on adding the
>>> following functions:
>>>
>>> *Column setters*:
>>> In Pandas:
>>> df['col3'] = df['col1'] * 3.0
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>
>>> *Column apply()*:
>>> In Pandas:
>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(
>>> df['col1']))
>>>
>>> I understand that this one cannot be as simple as in pandas due to the
>>> output-type that's needed here. But could be done like:
>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>
>>> Multi column in pandas is:
>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0)
>>> Maybe this can be done in pyspark as or if we can send a pyspark.sql.Row
>>> directly it would be similar (?):
>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
>>> 'float')
>>>
>>> *Rename*:
>>> In Pandas:
>>> df.rename(columns={...})
>>> While I do the following in PySpark:
>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>
>>> *To Dictionary*:
>>> In Pandas:
>>> df.to_dict(orient='list')
>>> While I do the following in PySpark:
>>> {f.name: [row[i] for row in df.collect()] for i, f in
>>> enumerate(df.schema.fields)}
>>>
>>> I thought I'd start the discussion with these and come back to some of
>>> the others I see that could be helpful.
>>>
>>> *Note*: (with the column functions in mind) I understand the concept of
>>> the DataFrame cannot be modified. And I am not suggesting we change that
>>> nor any underlying principle. Just trying to add syntactic sugar here.
>>>
>>>


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We just made the repo public: https://github.com/databricks/spark-pandas

On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter < timhun...@databricks.com > 
wrote:

> 
> To add more details to what Reynold mentioned: as you said, there are going
> to be some slight differences between Pandas and Spark in any case, simply
> because Spark needs to know the return types of the functions. In your case,
> you would need to slightly refactor your apply method to the following (in
> Python 3) to add type hints:
> 
> 
> ```
> def f(x) -> float: return x * 3.0
> df['col3'] = df['col1'].apply(f)
> ```
> 
> 
> This has the benefit of keeping your code fully compliant with both pandas
> and pyspark. We will share more information in the future.
> 
> 
> Tim
> 
> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon < gurwls223@ gmail. com (
> gurwls...@gmail.com ) > wrote:
> 
> 
>> BTW, I am working on documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to describe the
>> difference
>> 
>> 2019년 3월 26일 (화) 오후 3:34, Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) >님이 작성:
>> 
>> 
>>> We have some early stuff there but not quite ready to talk about it in
>>> public yet (I hope soon though). Will shoot you a separate email on it.
>>> 
>>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari < abdealikothari@ gmail. 
>>> com
>>> ( abdealikoth...@gmail.com ) > wrote:
>>> 
>>> 
>>>> Thanks for the reply Reynold - Has this shim project started ?
>>>> I'd love to contribute to it - as it looks like I have started making a
>>>> bunch of helper functions to do something similar for my current task and
>>>> would prefer not doing it in isolation.
>>>> Was considering making a git repo and pushing stuff there just today
>>>> morning. But if there's already folks working on it - I'd prefer
>>>> collaborating.
>>>> 
>>>> 
>>>> Note - I'm not recommending we make the logical plan mutable (as I am
>>>> scared of that too!). I think there are other ways of handling that - but
>>>> we can go into details later.
>>>> 
>>>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> We have been thinking about some of these issues. Some of them are harder
>>>>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>>>>> logical plan mutable is a significant deviation from the current paradigm
>>>>> that might confuse the hell out of some users. We are considering building
>>>>> a shim layer as a separate project on top of Spark (so we can make rapid
>>>>> releases based on feedback) just to test this out and see how well it
>>>>> could work in practice.
>>>>> 
>>>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari < abdealikothari@ gmail. 
>>>>> com
>>>>> ( abdealikoth...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Hi,
>>>>>> I was doing some spark to pandas (and vice versa) conversion because some
>>>>>> of the pandas codes we have don't work on huge data. And some spark codes
>>>>>> work very slow on small data.
>>>>>> 
>>>>>> It was nice to see that pyspark had some similar syntax for the common
>>>>>> pandas operations that the python community is used to.
>>>>>> 
>>>>>> 
>>>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>>>> 
>>>>>> Column selects: df[['col1', 'col2']]
>>>>>> 
>>>>>> Row Filters: df[df['col1'] < 3.0]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I was wondering about a bunch of other functions in pandas which seemed
>>>>>> common. And thought there must've been a discussion about it in the
>>>>>> community - hence started this thread.
>>>>>> 
>>>>>> 
>>>>>> I was wondering whether there has been discussion on adding the following
>>>>>> functions

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
A 26% improvement is underwhelming if it requires massive refactoring of the
codebase. Also, you can't just add the benefits up this way, because:

- Both vectorization and codegen reduce the overhead of virtual function calls

- Vectorized code is more friendly to compilers / CPUs, but requires
materializing a lot of data in memory (or cache)

- Codegen reduces the amount of data that flows through memory, but for complex
queries the generated code might not be very compiler / CPU friendly

I see massive benefits in leveraging GPUs (and other accelerators) for numeric 
workloads (e.g. machine learning), so I think it makes a lot of sense to be 
able to get data out of Spark quickly into UDFs for such workloads.

I don't see as much benefits for general data processing, for a few reasons:

1. GPU machines are much more expensive & difficult to get (e.g. in the cloud 
they are 3 to 5x more expensive, with limited availability based on my 
experience), so it is difficult to build a farm

2. Bandwidth from the host system to GPUs is usually limited, so if you can fit the
working set in GPU memory and repeatedly work on it (e.g. machine learning),
it works well; otherwise it doesn't.

3. It's a massive effort.

In general it's a cost-benefit trade-off. I'm not aware of any general
framework that allows us to write code once and have it work reliably against both
GPUs and CPUs. If such a framework existed, it would change the equation.
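
To make the first two bullets concrete, here is a toy Scala sketch (illustrative
only, not Spark internals): the row-at-a-time style pays per-value indirection,
while the columnar style has to materialize the column but gives the JIT a tight,
SIMD-friendly loop.

```scala
object RowVsColumn {
  // Row-at-a-time: one indirection (and potentially a virtual call) per value.
  def sumTimesThreeRows(rows: Iterator[Array[Double]], colIdx: Int): Double =
    rows.foldLeft(0.0)((acc, row) => acc + row(colIdx) * 3.0)

  // Columnar: the column must be materialized up front (memory/cache cost),
  // but the loop body is branch-light and easy for the JIT to unroll/vectorize.
  def sumTimesThreeColumn(col: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < col.length) { acc += col(i) * 3.0; i += 1 }
    acc
  }
}
```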

On Tue, Mar 26, 2019 at 6:57 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> Cloudera reports a 26% improvement in hive query runtimes by enabling
> vectorization. I would expect to see similar improvements but at the cost
> of keeping more data in memory.  But remember this also enables a number
> of different hardware acceleration techniques.  If the data format is
> arrow compatible and off-heap someone could offload the processing to
> native code which typically results in a 2x improvement over java (and the
> cost of a JNI call would be amortized over processing an entire batch at
> once).  Also, we plan on adding in GPU acceleration and ideally making it
> a standard part of Spark.  In our initial prototype, we saw queries which
> we could make fully columnar/GPU enabled being 5-6x faster.  But that
> really was just a proof of concept and we expect to be able to do quite a
> bit better when we are completely done.  Many commercial GPU enabled SQL
> engines claim to be 20x to 200x faster than Spark, depending on the use
> case. Digging deeply you see that they are not apples to apples
> comparisons, i.e. reading from cached GPU memory and having spark read
> from a file, or using parquet as input but asking spark to read CSV.  That
> being said I would expect that we can achieve something close to the 20x
> range for most queries and possibly more if they are computationally
> intensive.
> 
> 
> Also as a side note, we initially thought that the conversion would not be
> too expensive and that we could just move computationally intensive
> processing onto the GPU piecemeal with conversions on both ends.  In
> practice, we found that the cost of conversion quickly starts to dominate
> the queries we were testing.
> 
> On Mon, Mar 25, 2019 at 11:53 PM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> Do you have some initial perf numbers? It seems fine to me to remain
>> row-based inside Spark with whole-stage-codegen, and convert rows to
>> columnar batches when communicating with external systems.
>> 
>> On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans < bobby@ apache. org (
>> bo...@apache.org ) > wrote:
>> 
>> 
>>> 
>>> 
>>> This thread is to discuss adding in support for data frame processing
>>> using an in-memory columnar format compatible with Apache Arrow.  My main
>>> goal in this is to lay the groundwork so we can add in support for GPU
>>> accelerated processing of data frames, but this feature has a number of
>>> other benefits.  Spark currently supports Apache Arrow formatted data as
>>> an option to exchange data with python for pandas UDF processing. There
>>> has also been discussion around extending this to allow for exchanging
>>> data with other tools like pytorch, tensorflow, xgboost,... If Spark
>>> supports processing on Arrow compatible data it could eliminate the
>>> serialization/deserialization overhead when going between these systems. 
>>> It also would allow for doing optimizations on a CPU with SIMD
>>> instructions similar to what Hive currently supports. Accelerated
>>> processing using a GPU is something that we will start a separate
>>> discussion thread on, but I wanted to set the context a bit.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Jason Lowe, Tom Graves, and I created a prototype over the past few months
>>> to try and understand how to make this work.  What we are proposing is
>>> based off of lessons learned when building this prototype, but we really
>>> wanted to get feedback early on from the community. 

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Yes, this is known and it's a performance issue. Do you have any thoughts on
how to fix this?
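
For reference, one existing API that avoids the per-row round trip is the typed
Aggregator (org.apache.spark.sql.expressions.Aggregator): its buffer stays a plain
JVM object between calls, and the encoders only kick in at serialization
boundaries. A hedged sketch, assuming a typed Dataset[Double]:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(var sum: Double, var count: Long)

// The buffer is mutated in place as an object; there is no Row ser/de per input row.
object MyAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer = { b.sum += a; b.count += 1; b }
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  def finish(b: AvgBuffer): Double = if (b.count == 0L) 0.0 else b.sum / b.count
  // Encoders are used only at serialization boundaries (e.g. the shuffle).
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a typed Dataset[Double]:  ds.select(MyAverage.toColumn)
```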

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson  wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>


Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
They are unfortunately all pretty substantial (which is why this problem 
exists) ...

On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> 
> At a high level, some candidate strategies are:
> 
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
> 
> 3. As a workaround, allow users to define their own sub-classes of
> DataType.  It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op.  I tried
> doing this and it will compile, but spark's internals only consider a
> predefined universe of DataType classes.
> 
> 
> All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor or more substantial.
> 
> 
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> Yes this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>> 
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@ redhat. com (
>> eerla...@redhat.com ) > wrote:
>> 
>> 
>>> I describe some of the details here:
>>> 
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27296 (
>>> https://issues.apache.org/jira/browse/SPARK-27296 )
>>> 
>>> 
>>> 
>>> The short version of the story is that aggregating data structures (UDTs)
>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>> row in a data frame.
>>> Cheers,
>>> Erik
>>> 
>> 
>> 
> 
>

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Not that I know of. We did do some work to make it work faster in the case of 
lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949

On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> 
> BTW, if this is known, is there an existing JIRA I should link to?
> 
> 
> On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson < eerlands@ redhat. com (
> eerla...@redhat.com ) > wrote:
> 
> 
>> 
>> 
>> At a high level, some candidate strategies are:
>> 
>> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
>> trait itself) so that the update method can do the right thing.
>> 2. Expose TypedImperativeAggregate to users for defining their own, since
>> it already does the right thing.
>> 
>> 3. As a workaround, allow users to define their own sub-classes of
>> DataType.  It would essentially allow one to define the sqlType of the UDT
>> to be the aggregating object itself and make ser/de a no-op.  I tried
>> doing this and it will compile, but spark's internals only consider a
>> predefined universe of DataType classes.
>> 
>> 
>> All of these options are likely to have implications for the catalyst
>> systems. I'm not sure if they are minor or more substantial.
>> 
>> 
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Yes this is known and an issue for performance. Do you have any thoughts
>>> on how to fix this?
>>> 
>>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@ redhat. com (
>>> eerla...@redhat.com ) > wrote:
>>> 
>>> 
>>>> I describe some of the details here:
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27296 (
>>>> https://issues.apache.org/jira/browse/SPARK-27296 )
>>>> 
>>>> 
>>>> 
>>>> The short version of the story is that aggregating data structures (UDTs)
>>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>>> row in a data frame.
>>>> Cheers,
>>>> Erik
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Reynold Xin
We tried enabling blacklisting for some customers, and in the cloud they very
quickly ended up with 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud
deployments and shouldn't be on by default. The heart of the issue is that the
current implementation is not good at distinguishing transient errors from
catastrophic ones.
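
(For anyone who wants to reproduce this: the feature is driven entirely by configs,
e.g. the following turns on both blacklisting and the executor-kill behavior for a
test run. Config keys as of Spark 2.4; everything here is off by default today.)

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                  // master switch
  .set("spark.blacklist.killBlacklistedExecutors", "true") // item 2 in the proposal below
  .set("spark.blacklist.timeout", "1h")
```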

+Chris who was involved with those tests.

On Thu, Mar 28, 2019 at 3:32 PM, Ankur Gupta < ankur.gu...@cloudera.com.invalid 
> wrote:

> 
> Hi all,
> 
> 
> This is a follow-on to my PR: https:/ / github. com/ apache/ spark/ pull/ 
> 24208
> ( https://github.com/apache/spark/pull/24208 ) , where I aimed to enable
> blacklisting for fetch failure by default. From the comments, there is
> interest in the community to enable overall blacklisting feature by
> default. I have listed down 3 different things that we can do and would
> like to gather feedback and see if anyone has objections with regards to
> this. Otherwise, I will just create a PR for the same.
> 
> 
> 1. *Enable blacklisting feature by default*. The blacklisting feature was
> added as part of SPARK-8425 and is available since 2.2.0. This feature was
> deemed experimental and was disabled by default. The feature blacklists an
> executor/node from running a particular task, any task in a particular
> stage or all tasks in application based on number of failures. There are
> various configurations available which control those thresholds.
> Additionally, the executor/node is only blacklisted for a configurable
> time period. The idea is to enable blacklisting feature with existing
> defaults, which are following:
> * spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
> 
> * spark.blacklist.task.maxTaskAttemptsPerNode = 2
> 
> * spark.blacklist.stage.maxFailedTasksPerExecutor = 2
> 
> * spark.blacklist.stage.maxFailedExecutorsPerNode = 2
> 
> * spark.blacklist.application.maxFailedTasksPerExecutor = 2
> 
> * spark.blacklist.application.maxFailedExecutorsPerNode = 2
> 
> * spark.blacklist.timeout = 1 hour
> 
> 2. *Kill blacklisted executors/nodes by default*. This feature was added
> as part of SPARK-16554 and is available since 2.2.0. This is a follow-on
> feature to blacklisting, such that if an executor/node is blacklisted for
> the application, then it also terminates all running tasks on that
> executor for faster failure recovery.
> 
> 
> 3. *Remove legacy blacklisting timeout config* :
> spark.scheduler.executorTaskBlacklistTime
> 
> 
> Thanks,
> Ankur
>

Do you use single-quote syntax for the DataFrame API?

2019-03-30 Thread Reynold Xin
As part of evolving the Scala language, the Scala team is considering removing the
single-quote syntax for representing symbols. Single-quote syntax is one of the
ways to refer to a column in Spark's DataFrame API. While I personally don't
use it (I prefer just using strings for column names, or the expr function), I see
it used quite a lot in other people's code, e.g.

df.select('id, 'name).show()

I want to bring this to more people's attention, in case they are depending on 
this. The discussion thread is: 
https://contributors.scala-lang.org/t/proposal-to-deprecate-and-remove-symbol-literals/2953
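
For anyone auditing code for this, the drop-in alternatives all exist in the Scala
API today; a quick sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().master("local[*]").appName("symbol-syntax").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

df.select('id, 'name).show()             // symbol syntax under discussion
df.select($"id", $"name").show()         // string interpolator
df.select(col("id"), col("name")).show() // plain column names
df.select(expr("id"), expr("name")).show()
df.selectExpr("id", "name").show()
```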

Re: [DISCUSS] Spark Columnar Processing

2019-04-01 Thread Reynold Xin
I just realized I didn't make my stance here very clear ... here's another
try:

I think it's a no-brainer to have a good columnar UDF interface. This would
facilitate a lot of high-performance applications, e.g. GPU-based acceleration
for machine learning algorithms.

On rewriting the entire internals of Spark SQL to leverage columnar processing, 
I don't see enough evidence to suggest that's a good idea yet.
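
For context, a hedged sketch of what column-oriented consumption looks like against
the vectorized classes Spark already exposes (org.apache.spark.sql.vectorized); the
plumbing for how such batches would reach a user function is exactly what is being
debated here.

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

def sumTimesThree(batch: ColumnarBatch, colIdx: Int): Double = {
  val col: ColumnVector = batch.column(colIdx)
  var acc = 0.0
  var i = 0
  while (i < batch.numRows()) {
    if (!col.isNullAt(i)) acc += col.getDouble(i) * 3.0
    i += 1
  }
  acc
}
```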

On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> Kazuaki Ishizaki,
> 
> 
> Yes, ColumnarBatchScan does provide a framework for doing code generation
> for the processing of columnar data.  I have to admit that I don't have a
> deep understanding of the code generation piece, so if I get something
> wrong please correct me.  From what I had seen only input formats
> currently inherent from ColumnarBatchScan, and from comments in the trait
> 
> 
>   /**
>    * Generate [[ColumnVector]] expressions for our parent to consume as
> rows.
>    * This is called once per [[ColumnarBatch]].
>    */
> https:/ / github. com/ apache/ spark/ blob/ 
> 956b52b1670985a67e49b938ac1499ae65c79f6e/
> sql/ core/ src/ main/ scala/ org/ apache/ spark/ sql/ execution/ 
> ColumnarBatchScan.
> scala#L42-L43 (
> https://github.com/apache/spark/blob/956b52b1670985a67e49b938ac1499ae65c79f6e/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L42-L43
> )
> 
> 
> 
> It appears that ColumnarBatchScan is really only intended to pull out the
> data from the batch, and not to process that data in a columnar fashion. 
> The Loading stage that you mentioned.
> 
> 
> > The SIMDzation or GPUization capability depends on a compiler that
> translates native code from the code generated by the whole-stage codegen.
> 
> To be able to support vectorized processing Hive stayed with pure java and
> let the JVM detect and do the SIMDzation of the code.  To make that happen
> they created loops to go through each element in a column and remove all
> conditionals from the body of the loops.  To the best of my knowledge that
> would still require a separate code path like I am proposing to make the
> different processing phases generate code that the JVM can compile down to
> SIMD instructions.  The generated code is full of null checks for each
> element which would prevent the operations we want.  Also, the
> intermediate results are often stored in UnsafeRow instances.  This is
> really fast for row-based processing, but the complexity of how they work
> I believe would prevent the JVM from being able to vectorize the
> processing.  If you have a better way to take java code and vectorize it
> we should put it into OpenJDK instead of spark so everyone can benefit
> from it.
> 
> 
> Trying to compile directly from generated java code to something a GPU can
> process is something we are tackling but we decided to go a different
> route from what you proposed.  From talking with several compiler experts
> here at NVIDIA my understanding is that IBM in partnership with NVIDIA
> attempted in the past to extend the JVM to run at least partially on GPUs,
> but it was really difficult to get right, especially with how java does
> memory management and memory layout.
> 
> 
> To avoid that complexity we decided to split the JITing up into two
> separate pieces.  I didn't mention any of this before because this
> discussion was intended to just be around the memory layout support, and
> not GPU processing.  The first part would be to take the Catalyst AST and
> produce CUDA code directly from it.  If properly done we should be able to
> do the selection and projection phases within a single kernel.  The
> biggest issue comes with UDFs as they cannot easily be vectorized for the
> CPU or GPU.  So to deal with that we have a prototype written by the
> compiler team that is trying to tackle SPARK-14083 which can translate
> basic UDFs into catalyst expressions.  If the UDF is too complicated or
> covers operations not yet supported it will fall back to the original UDF
> processing.  I don't know how close the team is to submit a SPIP or a
> patch for it, but I do know that they have some very basic operations
> working.  The big issue is that it requires java 11+ so it can use
> standard APIs to get the byte code of scala UDFs.  
> 
> 
> We split it this way because we thought it would be simplest to implement,
> and because it would provide a benefit to more than just GPU accelerated
> queries.
> 
> 
> Thanks,
> 
> 
> Bobby
> 
> On Tue, Mar 26, 2019 at 11:59 PM Kazuaki Ishizaki < ISHIZAKI@ jp. ibm. com
> ( ishiz...@jp.ibm.com ) > wrote:
> 
> 
>> Looks interesting discussion.
>> Let me describe the current structure and remaining issues. This is
>> orthogonal to cost-benefit trade-off discussion.
>> 
>> The code generation basically consists of three parts.
>> 1. Loading
>> 2. Selection (map, filter, ...)
>> 3. Projection
>> 
>> 1. Columnar storage (e.g. Parquet, Orc, Arrow , and table cache) is 

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
I just realized we had an earlier SPIP on a similar topic: 
https://issues.apache.org/jira/browse/SPARK-24579

Perhaps we should tie the two together. IIUC, you'd want to expose the existing
ColumnarBatch API, but also provide utilities to directly convert from/to Arrow.
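
For the "from Arrow" direction, some of the pieces already exist as public classes;
a hedged sketch (it assumes the Arrow ValueVectors have been populated elsewhere):

```scala
import org.apache.arrow.vector.ValueVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

def toColumnarBatch(arrowVectors: Seq[ValueVector], numRows: Int): ColumnarBatch = {
  val columns: Array[ColumnVector] =
    arrowVectors.map(v => new ArrowColumnVector(v): ColumnVector).toArray
  val batch = new ColumnarBatch(columns)
  batch.setNumRows(numRows)
  batch
}
```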

On Thu, Apr 11, 2019 at 7:13 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> The SPIP has been up for almost 6 days now with really no discussion on
> it.  I am hopeful that means it's okay and we are good to call a vote on
> it, but I want to give everyone one last chance to take a look and
> comment.  If there are no comments by tomorrow I this we will start a vote
> for this.
> 
> 
> Thanks,
> 
> 
> Bobby
> 
> On Fri, Apr 5, 2019 at 2:24 PM Bobby Evans < bobby@ apache. org (
> bo...@apache.org ) > wrote:
> 
> 
>> I just filed SPARK-27396 as the SPIP for this proposal.  Please use that
>> JIRA for further discussions.
>> 
>> 
>> Thanks for all of the feedback,
>> 
>> 
>> Bobby
>> 
>> On Wed, Apr 3, 2019 at 7:15 PM Bobby Evans < bobby@ apache. org (
>> bo...@apache.org ) > wrote:
>> 
>> 
>>> I am still working on the SPIP and should get it up in the next few days. 
>>> I have the basic text more or less ready, but I want to get a high-level
>>> API concept ready too just to have something more concrete.  I have not
>>> really done much with contributing new features to spark so I am not sure
>>> where a design document really fits in here because from http:/ / spark. 
>>> apache.
>>> org/ improvement-proposals. html (
>>> http://spark.apache.org/improvement-proposals.html ) and http:/ / spark. 
>>> apache.
>>> org/ contributing. html ( http://spark.apache.org/contributing.html ) it
>>> does not mention a design anywhere.  I am happy to put one up, but I was
>>> hoping the API concept would cover most of that.
>>> 
>>> 
>>> Thanks,
>>> 
>>> 
>>> Bobby
>>> 
>>> On Tue, Apr 2, 2019 at 9:16 PM Renjie Liu < liurenjie2008@ gmail. com (
>>> liurenjie2...@gmail.com ) > wrote:
>>> 
>>> 
>>>> Hi, Bobby:
>>>> Do you have design doc? I'm also interested in this topic and want to help
>>>> contribute.
>>>> 
>>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans < bobby@ apache. org (
>>>> bo...@apache.org ) > wrote:
>>>> 
>>>> 
>>>>> Thanks to everyone for the feedback.
>>>>> 
>>>>> 
>>>>> Overall the feedback has been really positive for exposing columnar as a
>>>>> processing option to users.  I'll write up a SPIP on the proposed changes
>>>>> to support columnar processing (not necessarily implement it) and then
>>>>> ping the list again for more feedback and discussion.
>>>>> 
>>>>> 
>>>>> Thanks again,
>>>>> 
>>>>> 
>>>>> Bobby
>>>>> 
>>>>> On Mon, Apr 1, 2019 at 5:09 PM Reynold Xin < rxin@ databricks. com (
>>>>> r...@databricks.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> I just realized I didn't make it very clear my stance here ... here's
>>>>>> another try:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I think it's a no brainer to have a good columnar UDF interface. This
>>>>>> would facilitate a lot of high performance applications, e.g. GPU-based
>>>>>> accelerations for machine learning algorithms.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On rewriting the entire internals of Spark SQL to leverage columnar
>>>>>> processing, I don't see enough evidence to suggest that's a good idea 
>>>>>> yet.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans < bobby@ apache. org (
>>>>>> bo...@apache.org ) > wrote:
>>>>>> 
>>>>>>> Kazuaki Ishizaki,
>>>>>>> 
>>>>>>> 
>>>>>>> Yes, ColumnarBatchScan does provide a framework for doing code 
>>>>>>> generation
>>>>>>> for the p

Re: pyspark.sql.functions ide friendly

2019-04-17 Thread Reynold Xin
Are you talking about the ones that are defined in a dictionary? If yes, that 
was actually not that great in hindsight (makes it harder to read & change), so 
I'm OK changing it.

E.g.

_functions = {
    'lit': _lit_doc,
    'col': 'Returns a :class:`Column` based on the given column name.',
    'column': 'Returns a :class:`Column` based on the given column name.',
    'asc': 'Returns a sort expression based on the ascending order of the given column name.',
    'desc': 'Returns a sort expression based on the descending order of the given column name.',
}

On Wed, Apr 17, 2019 at 4:35 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> I use IntelliJ and have never seen an issue parsing the pyspark
> functions... you're just saying the linter has an optional inspection to
> flag it? just disable that?
> I don't think we want to complicate the Spark code just for this. They are
> declared at runtime for a reason.
> 
> 
> 
> On Wed, Apr 17, 2019 at 6:27 AM educhana@ gmail. com ( educh...@gmail.com )
> < educhana@ gmail. com ( educh...@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> Hi,
>> 
>> 
>> 
>> I'm aware of various workarounds to make this work smoothly in various
>> IDEs, but wouldn't better to solve the root cause?
>> 
>> 
>> 
>> I've seen the code and don't see anything that requires such level of
>> dynamic code, the translation is 99% trivial.
>> 
>> 
>> 
>> On 2019/04/16 12:16:41, 880f0464 < 880f0464@ protonmail. com. INVALID (
>> 880f0...@protonmail.com.INVALID ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Hi.
>>> 
>>> 
>>> 
>>> That's a problem with Spark as such and in general can be addressed on IDE
>>> to IDE basis - see for example https:/ / stackoverflow. com/ q/ 40163106 (
>>> https://stackoverflow.com/q/40163106 ) for some hints.
>>> 
>>> 
>>> 
>>> Sent with ProtonMail Secure Email.
>>> 
>>> 
>>> 
>>> ‐‐‐ Original Message ‐‐‐
>>> On Tuesday, April 16, 2019 2:10 PM, educhana < educhana@ gmail. com (
>>> educh...@gmail.com ) > wrote:
>>> 
>>> 
 
 
 Hi,
 
 
 
 Currently using pyspark.sql.functions from an IDE like PyCharm is causing
 the linters complain due to the functions being declared at runtime.
 
 
 
 Would a PR fixing this be welcomed? Is there any problems/difficulties I'm
 unaware?
 
 
 
 --
 
 
 
 
 Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
 ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
 
 
 
 --
 
 
 
 To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
 dev-unsubscr...@spark.apache.org )
 
 
>>> 
>>> 
>>> 
>>> - To
>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>> dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>> 
>> 
>> 
>> - To
>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>> dev-unsubscr...@spark.apache.org )
>> 
>> 
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
For Jackson - are you worried about JSON parsing breaking for users, or about
internal Spark functionality breaking?

On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:

> There's only one other item on my radar, which is considering updating
> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> behavior non-trivially. That said back-porting the update PR to 2.4
> worked out OK locally. Any strong opinions on this one?
>
> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan  wrote:
> >
> > I volunteer to be the release manager for 2.4.2, as I was also going to
> propose 2.4.2 because of the reverting of SPARK-25250. Is there any other
> ongoing bug fixes we want to include in 2.4.2? If no I'd like to start the
> release process today (CST).
> >
> > Thanks,
> > Wenchen
> >
> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
> >>
> >> I think the 'only backport bug fixes to branches' principle remains
> sound. But what's a bug fix? Something that changes behavior to match what
> is explicitly supposed to happen, or implicitly supposed to happen --
> implied by what other similar things do, by reasonable user expectations,
> or simply how it worked previously.
> >>
> >> Is this a bug fix? I guess the criteria that matches is that behavior
> doesn't match reasonable user expectations? I don't know enough to have a
> strong opinion. I also don't think there is currently an objection to
> backporting it, whatever it's called.
> >>
> >>
> >> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
> >>
> >>
> >> The meta question remains: is a 'bug fix' definition even agreed, and
> being consistently applied? There aren't correct answers, only best guesses
> from each person's own experience, judgment and priorities. These can
> differ even when applied in good faith.
> >>
> >> Sometimes the variance of opinion comes because people have different
> info that needs to be surfaced. Here, maybe it's best to share what about
> that offline conversation was convincing, for example.
> >>
> >> I'd say it's also important to separate what one would prefer from what
> one can't live with(out). Assuming one trusts the intent and experience of
> the handful of others with an opinion, I'd defer to someone who wants X and
> will own it, even if I'm moderately against it. Otherwise we'd get little
> done.
> >>
> >> In that light, it seems like both of the PRs at issue here are not
> _wrong_ to backport. This is a good pair that highlights why, when there
> isn't a clear reason to do / not do something (e.g. obvious errors,
> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
> >>
> >>
> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
> >>>
> >>> Sorry, I should be more clear about what I'm trying to say here.
> >>>
> >>> In the past, Xiao has taken the opposite stance. A good example is PR
> #21060 that was a very similar situation: behavior didn't match what was
> expected and there was low risk. There was a long argument and the patch
> didn't make it into 2.3 (to my knowledge).
> >>>
> >>> What we call these low-risk behavior fixes doesn't matter. I called it
> a bug on #21060 but I'm applying Xiao's previous definition here to make a
> point. Whatever term we use, we clearly have times when we want to allow a
> patch because it is low risk and helps someone. Let's just be clear that
> that's perfectly fine.
> >>>
> >>> On Wed, Apr 17, 2019 at 9:34 AM Ryan Blue  wrote:
> 
>  How is this a bug fix?
> 
>  On Wed, Apr 17, 2019 at 9:30 AM Xiao Li 
> wrote:
> >
> > Michael and I had an offline discussion about this PR
> https://github.com/apache/spark/pull/24365. He convinced me that this is
> a bug fix. The code changes of this bug fix are very tiny and the risk is
> very low. To avoid any behavior change in the patch releases, this PR also
> added a legacy flag whose default value is off.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
We should have shaded all Spark’s dependencies :(
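
In the meantime, a downstream application that needs a different Jackson than the
one it inherits from Spark's classpath can relocate its own copy; with sbt-assembly
that looks roughly like this (the relocation target name is arbitrary):

```scala
// build.sbt fragment (sbt-assembly plugin); illustrative, not Spark's own build setup.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "myapp.shaded.jackson.@1").inAll
)
```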

On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:

> For users that would inherit Jackson and use it directly, or whose
> dependencies do. Spark itself (with modifications) should be OK with
> the change.
> It's risky and normally wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
> >
> > For Jackson - are you worrying about JSON parsing for users or internal
> Spark functionality breaking?
> >
> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
> >>
> >> There's only one other item on my radar, which is considering updating
> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> >> behavior non-trivially. That said back-porting the update PR to 2.4
> >> worked out OK locally. Any strong opinions on this one?
> >>
> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
> wrote:
> >> >
> >> > I volunteer to be the release manager for 2.4.2, as I was also going
> to propose 2.4.2 because of the reverting of SPARK-25250. Is there any
> other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
> start the release process today (CST).
> >> >
> >> > Thanks,
> >> > Wenchen
> >> >
> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
> >> >>
> >> >> I think the 'only backport bug fixes to branches' principle remains
> sound. But what's a bug fix? Something that changes behavior to match what
> is explicitly supposed to happen, or implicitly supposed to happen --
> implied by what other similar things do, by reasonable user expectations,
> or simply how it worked previously.
> >> >>
> >> >> Is this a bug fix? I guess the criteria that matches is that
> behavior doesn't match reasonable user expectations? I don't know enough to
> have a strong opinion. I also don't think there is currently an objection
> to backporting it, whatever it's called.
> >> >>
> >> >>
> >> >> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
> >> >>
> >> >>
> >> >> The meta question remains: is a 'bug fix' definition even agreed,
> and being consistently applied? There aren't correct answers, only best
> guesses from each person's own experience, judgment and priorities. These
> can differ even when applied in good faith.
> >> >>
> >> >> Sometimes the variance of opinion comes because people have
> different info that needs to be surfaced. Here, maybe it's best to share
> what about that offline conversation was convincing, for example.
> >> >>
> >> >> I'd say it's also important to separate what one would prefer from
> what one can't live with(out). Assuming one trusts the intent and
> experience of the handful of others with an opinion, I'd defer to someone
> who wants X and will own it, even if I'm moderately against it. Otherwise
> we'd get little done.
> >> >>
> >> >> In that light, it seems like both of the PRs at issue here are not
> _wrong_ to backport. This is a good pair that highlights why, when there
> isn't a clear reason to do / not do something (e.g. obvious errors,
> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
> >> >>
> >> >>
> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
> >> >>>
> >> >>> Sorry, I should be more clear about what I'm trying to say here.
> >> >>>
> >> >>> In the past, Xiao has taken the opposite stance. A good example is
> PR #21060 that was a very similar situation: behavior didn't match what was
> expected and there was low risk. There was a long argument and the patch
> didn't make it into 2.3 (to my know

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin
"if others think it would be helpful, we can cancel this vote, update the SPIP 
to clarify exactly what I am proposing, and then restart the vote after we have 
gotten more agreement on what APIs should be exposed"

That'd be very useful. I, at least, was confused about what the SPIP was
proposing. There's no point voting on something when there is still a lot of
confusion about what it is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans < reva...@gmail.com > wrote:

> 
> 
> 
> Xiangrui Meng,
> 
> 
> 
> I provided some examples in the original discussion thread.
> 
> 
> 
> https:/ / lists. apache. org/ thread. html/ 
> f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@
> %3Cdev. spark. apache. org%3E (
> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
> )
> 
> 
> 
> But the concrete use case that we have is GPU accelerated ETL on Spark.
> Primarily as data preparation and feature engineering for ML tools like
> XGBoost, which by the way exposes a Spark specific scala API, not just a
> python one. We built a proof of concept and saw decent performance gains.
> Enough gains to more than pay for the added cost of a GPU, with the
> potential for even better performance in the future. With that proof of
> concept, we were able to make all of the processing columnar end-to-end
> for many queries so there really wasn't any data conversion costs to
> overcome, but we did want the design flexible enough to include a
> cost-based optimizer. \
> 
> 
> 
> It looks like there is some confusion around this SPIP especially in how
> it relates to features in other SPIPs around data exchange between
> different systems. I didn't want to update the text of this SPIP while it
> was under an active vote, but if others think it would be helpful, we can
> cancel this vote, update the SPIP to clarify exactly what I am proposing,
> and then restart the vote after we have gotten more agreement on what APIs
> should be exposed.
> 
> 
> 
> Thanks,
> 
> 
> 
> Bobby
> 
> 
> 
> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng < mengxr@ gmail. com (
> men...@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
>> think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API, *does *vectorization*, and
>> significantly *boosts the performance *even with data conversion overhead.
>> 
>> 
>> 
>> 
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> SPIP and it can be done without exposing any public APIs.
>> 
>> 
>> 
>> Depending how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw three options mentioned in the JIRA and discussion threads:
>> 
>> 
>> 
>> 1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow
>> library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
>> by Spark internals. It makes us hard to change Spark internals in the
>> future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internal becomes different from Arrow.
>> 
>> 
>> 
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
>> 
>> 
>> 
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves < tgraves_cs@ yahoo. com (
>> tgraves...@yahoo.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Based on there is still discussion and Spark Summit is this week, I'm
>>> going to extend the vote til Friday the 26th.
>>> 
>>> 
>>> 
>>> Tom
>>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans < revans2@ gmail. com
>>> ( reva...@gmail.com ) > wrote:
>>> 
>>> 
>>> 
>>> Yes, it is technically possible for the layout to change. No, it is not
>>> going to happen. It is already baked into several different official
>>> libraries which are widely used, not just for holding and processing the
>>> data, but also for transfer of the data between the various
>>> implementations. There would have to be a really serious reason to force
>>> an incompatible change at this point. So in the worst case, we can version
>>> the layout and bake that into the API that exposes the internal layout of
>>> the data. That way code that wants to program against a JAVA API can do so
>>> using the API that Spark provides, those who want to interface with
>>> something that expects the data in arrow format will already have to know
>>> what version of the format it was programmed against and in the worst case
>>> if the layout does change we can support the new layout if needed.
>>> 
>>> 
>>> 
>>> On Sun, Apr 21, 2019 at 12:45 AM Bryan C

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-26 Thread Reynold Xin
I do feel it'd be better not to switch the default Scala version in a minor
release. I don't know how much of the downstream ecosystem this impacts. .NET for
Apache Spark (dotnet/spark) is a good data point. Has anybody else hit this issue?

On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:

> 
> 
> 
> Very much interested in hearing what you folks decide. We currently have a
> couple asking us questions at https:/ / github. com/ dotnet/ spark/ issues
> ( https://github.com/dotnet/spark/issues ).
> 
> 
> 
> Thanks,
> Terry
> 
> 
> 
> --
> Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
> ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Looks like a great idea to make changes in Spark 3.0 to prepare for the Scala 2.13
upgrade.

Are there breaking changes that would require us to maintain two different source
trees for 2.12 vs 2.13?
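
For the SparkConf.setAll(Traversable) example Sean cites below, the prep work would
roughly be the usual deprecate-and-replace pattern; a toy sketch (not the actual
SparkConf source):

```scala
// Compiles on 2.12 today; the deprecated overload is what gets dropped once 2.13 is supported.
class Conf {
  private val settings = scala.collection.mutable.Map.empty[String, String]

  def set(key: String, value: String): Conf = { settings(key) = value; this }

  @deprecated("Use setAll(Iterable) instead", "3.0.0")
  def setAll(kvs: Traversable[(String, String)]): Conf = setAll(kvs.toIterable)

  // Iterable exists in both 2.12 and 2.13, so this signature survives the upgrade.
  def setAll(kvs: Iterable[(String, String)]): Conf = {
    kvs.foreach { case (k, v) => set(k, v) }
    this
  }
}
```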

On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> While that's not happening soon (2.13 isn't out), note that some of the
> changes to collections will be fairly breaking changes.
> 
> 
> 
> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
> https://issues.apache.org/jira/browse/SPARK-25075 )
> https:/ / docs. scala-lang. org/ overviews/ core/ collections-migration-213.
> html (
> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
> )
> 
> 
> 
> Some of this may impact a public API, so may need to start proactively
> fixing stuff for 2.13 before 3.0 comes out where possible.
> 
> 
> 
> Here's an example: Traversable goes away. We have a method
> SparkConf.setAll(Traversable). We can't support 2.13 while that still
> exists. Of course, we can decide to deprecate it with replacement (use
> Iterable) and remove it in the version that supports 2.13. But that would
> mean a little breaking change, and we either have to accept that for a
> future 3.x release, or it waits until 4.x.
> 
> 
> 
> I wanted to put that on the radar now to gather opinions about whether
> this will probably be acceptable, or whether we really need to get methods
> like that changed before 3.0.
> 
> 
> 
> Also: there's plenty of straightforward but medium-sized changes we can
> make now in anticipation of 2.13 support, like, make the type of Seq we
> use everywhere explicit (will be good for a like 1000 file change I'm
> sure). Or see if we can swap out Traversable everywhere. Remove
> MutableList, etc.
> 
> 
> 
> I was going to start fiddling with that unless it just sounds too
> disruptive.
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Yea, my main point is that when we do support 2.13, it'd be great if we don't have
to break APIs. That's why doing the prep work in 3.0 would be great.

On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < iras...@cloudera.com > wrote:

> 
> +1 on making whatever api changes we can now for 3.0.
> 
> 
> I don't think that is making any commitments to supporting scala 2.13 in
> any specific version.  We'll have to deal with all the other points you
> raised when we do cross that bridge, but hopefully those are things we can
> cover in a minor release.
> 
> On Fri, May 10, 2019 at 2:31 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> I really hope we don't have to have separate source trees for some files,
>> but yeah it's an option too. OK, will start looking into changes we can
>> make now that don't break things now, and deprecations we need to make now
>> proactively.
>> 
>> 
>> I should also say that supporting Scala 2.13 will mean dependencies have
>> to support Scala 2.13, and that could take a while, because there are a
>> lot.
>> In particular, I think we'll find our SBT 0.13 build won't make it,
>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>> and it seemed to need quite a lot of rewrite, again in part due to how
>> newer plugin versions changed. I failed and gave up.
>> 
>> 
>> At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>> 2.13 upgrade.
>>> 
>>> 
>>> 
>>> Are there breaking changes that would require us to have two different
>>> source code for 2.12 vs 2.13?
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> 
>>>> 
>>>> 
>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>> changes to collections will be fairly breaking changes.
>>>> 
>>>> 
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
>>>> https://issues.apache.org/jira/browse/SPARK-25075 )
>>>> https:/ / docs. scala-lang. org/ overviews/ core/ 
>>>> collections-migration-213.
>>>> html (
>>>> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
>>>> )
>>>> 
>>>> 
>>>> 
>>>> Some of this may impact a public API, so may need to start proactively
>>>> fixing stuff for 2.13 before 3.0 comes out where possible.
>>>> 
>>>> 
>>>> 
>>>> Here's an example: Traversable goes away. We have a method
>>>> SparkConf.setAll(Traversable). We can't support 2.13 while that still
>>>> exists. Of course, we can decide to deprecate it with replacement (use
>>>> Iterable) and remove it in the version that supports 2.13. But that would
>>>> mean a little breaking change, and we either have to accept that for a
>>>> future 3.x release, or it waits until 4.x.
>>>> 
>>>> 
>>>> 
>>>> I wanted to put that on the radar now to gather opinions about whether
>>>> this will probably be acceptable, or whether we really need to get methods
>>>> like that changed before 3.0.
>>>> 
>>>> 
>>>> 
>>>> Also: there's plenty of straightforward but medium-sized changes we can
>>>> make now in anticipation of 2.13 support, like, make the type of Seq we
>>>> use everywhere explicit (will be good for a like 1000 file change I'm
>>>> sure). Or see if we can swap out Traversable everywhere. Remove
>>>> MutableList, etc.
>>>> 
>>>> 
>>>> 
>>>> I was going to start fiddling with that unless it just sounds too
>>>> disruptive.
>>>> 
>>>> 
>>>> 
>>>> - To
>>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>>> dev-unsubscr...@spark.apache.org )
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-11 Thread Reynold Xin
If the number of changes that would require two source trees is small, another
thing we can do is reach out to the Scala team and kindly ask whether they could
change Scala 2.13 itself so that it'd be easier to maintain compatibility with
Scala 2.12.

On Sat, May 11, 2019 at 4:25 PM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> For those interested, here's the first significant problem I see that will
> require separate source trees or a breaking change: https:/ / issues. apache.
> org/ jira/ browse/ SPARK-27683?focusedCommentId=16837967&page=com. atlassian.
> jira. plugin. system. issuetabpanels%3Acomment-tabpanel#comment-16837967 (
> https://issues.apache.org/jira/browse/SPARK-27683?focusedCommentId=16837967&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16837967
> )
> 
> 
> 
> Interested in thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
> 
> 
> 
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> 
>> 
>> Yea my main point is when we do support 2.13, it'd be great if we don't
>> have to break APIs. That's why doing the prep work in 3.0 would be great.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < irashid@ cloudera. com (
>> iras...@cloudera.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> +1 on making whatever api changes we can now for 3.0.
>>> 
>>> 
>>> 
>>> I don't think that is making any commitments to supporting scala 2.13 in
>>> any specific version. We'll have to deal with all the other points you
>>> raised when we do cross that bridge, but hopefully those are things we can
>>> cover in a minor release.
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 2:31 PM Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> I really hope we don't have to have separate source trees for some files,
>>>> but yeah it's an option too. OK, will start looking into changes we can
>>>> make now that don't break things now, and deprecations we need to make now
>>>> proactively.
>>>> 
>>>> 
>>>> 
>>>> I should also say that supporting Scala 2.13 will mean dependencies have
>>>> to support Scala 2.13, and that could take a while, because there are a
>>>> lot. In particular, I think we'll find our SBT 0.13 build won't make it,
>>>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>>>> and it seemed to need quite a lot of rewrite, again in part due to how
>>>> newer plugin versions changed. I failed and gave up.
>>>> 
>>>> 
>>>> 
>>>> At some point maybe we figure out whether we can remove the SBT-based
>>>> build if it's super painful, but only if there's not much other choice.
>>>> That is for a future thread.
>>>> 
>>>> 
>>>> 
>>>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>>>> 2.13 upgrade.
>>>>> 
>>>>> 
>>>>> 
>>>>> Are there breaking changes that would require us to have two different
>>>>> source code for 2.12 vs 2.13?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < srowen@ gmail. com (
>>>>> sro...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>>>> changes to collections will be fairly breaking changes.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
>>>>>> https://issues.apache.org/jira/browse/SPARK-25075 )
>>>>>> https:/ / docs. scala-lang. org/ overviews/ core/ 
>>>>>> collections-migration-213.
>>>>>> html (
>>>>>> https://docs.scala-lang.org/overviews/core/coll

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-25 Thread Reynold Xin
Can we push this to June 1st? I have been meaning to read it but
unfortunately I keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun 
wrote:

> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
>
>> +1 on exposing the APIs for columnar processing support.
>>
>> I understand that the scope of this SPIP doesn't cover AI / ML
>> use-cases. But I saw a good performance gain when I converted data
>> from rows to columns to leverage on SIMD architectures in a POC ML
>> application.
>>
>> With the exposed columnar processing support, I can imagine that the
>> heavy lifting parts of ML applications (such as computing the
>> objective functions) can be written as columnar expressions that
>> leverage on SIMD architectures to get a good speedup.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>> >
>> > It would allow for the columnar processing to be extended through the
>> shuffle.  So if I were doing say an FPGA accelerated extension it could
>> replace the ShuffleExechangeExec with one that can take a ColumnarBatch as
>> input instead of a Row. The extended version of the ShuffleExchangeExec
>> could then do the partitioning on the incoming batch and instead of
>> producing a ShuffleRowRDD for the exchange they could produce something
>> like a ShuffleBatchRDD that would let the serializing and deserializing
>> happen in a column based format for a faster exchange, assuming that
>> columnar processing is also happening after the exchange. This is just like
>> providing a columnar version of any other catalyst operator, except in this
>> case it is a bit more complex of an operator.
>> >
>> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid
>>  wrote:
>> >>
>> >> sorry I am late to the discussion here -- the jira mentions using this
>> extensions for dealing with shuffles, can you explain that part?  I don't
>> see how you would use this to change shuffle behavior at all.
>> >>
>> >> On Tue, May 14, 2019 at 10:59 AM Thomas graves 
>> wrote:
>> >>>
>> >>> Thanks for replying, I'll extend the vote til May 26th to allow your
>> >>> and other people feedback who haven't had time to look at it.
>> >>>
>> >>> Tom
>> >>>
>> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau 
>> wrote:
>> >>> >
>> >>> > I’d like to ask this vote period to be extended, I’m interested but
>> I don’t have the cycles to review it in detail and make an informed vote
>> until the 25th.
>> >>> >
>> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
>> wrote:
>> >>> >>
>> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I
>> don't feel strongly about it. I would still suggest doing the following:
>> >>> >>
>> >>> >> 1. Link the POC mentioned in Q4. So people can verify the POC
>> result.
>> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
>> check. Beside ColumnarBatch and ColumnarVector, we also need to make the
>> following public. People who are familiar with SQL internals should help
>> assess the risk.
>> >>> >> * ColumnarArray
>> >>> >> * ColumnarMap
>> >>> >> * unsafe.types.CalendarInterval
>> >>> >> * ColumnarRow
>> >>> >> * UTF8String
>> >>> >> * ArrayData
>> >>> >> * ...
>> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't
>> match the purpose of this SPIP. It does make some code cleaner. But I guess
>> for ETL use cases, it won't bring much value.
>> >>> >>
>> >>> > --
>> >>> > Twitter: https://twitter.com/holdenkarau
>> >>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [RESULT][VOTE] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Reynold Xin
Thanks Tom.

I finally had time to look at the updated SPIP 10 mins ago. I support the high 
level idea and +1 on the SPIP.

That said, I think the proposed API is too complicated and too invasive a change to 
the existing internals. A much simpler API would be to expose a columnar batch 
iterator interface, i.e. an uber column-oriented UDF with the ability to manage its 
life cycle. Once we have that, we can also refactor the existing Python UDFs to 
use that interface.

As I said earlier (a couple of months ago when this was first surfaced?), I support 
the idea of enabling *external* column-oriented processing logic, but not 
changing Spark itself to have two processing modes, which is simply very 
complicated and would create a very high maintenance burden for the project.
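
For illustration, a rough sketch of what such an interface could look like
(hypothetical names, not an existing Spark API; ColumnarBatch and StructType are
the public classes from org.apache.spark.sql.vectorized and
org.apache.spark.sql.types):

  import org.apache.spark.sql.types.StructType
  import org.apache.spark.sql.vectorized.ColumnarBatch

  // An uber column-oriented "UDF": consumes batches, produces batches, and owns
  // its own life cycle (setup and release of off-heap or device resources).
  trait ColumnarBatchTransform extends AutoCloseable {
    // acquire buffers, device memory, etc. before the first batch arrives
    def open(inputSchema: StructType): Unit
    // transform one input batch into one output batch
    def transform(batch: ColumnarBatch): ColumnarBatch
    // close() (from AutoCloseable) releases everything acquired in open()
  }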

On Wed, May 29, 2019 at 9:49 PM, Thomas graves < tgra...@apache.org > wrote:

> 
> 
> 
> Hi all,
> 
> 
> 
> The vote passed with 9 +1's (4 binding) and 1 +0 and no -1's.
> 
> 
> 
> +1s (* = binding) :
> Bobby Evans*
> Thomas Graves*
> DB Tsai*
> Felix Cheung*
> Bryan Cutler
> Kazuaki Ishizaki
> Tyson Condie
> Dongjoon Hyun
> Jason Lowe
> 
> 
> 
> +0s:
> Xiangrui Meng
> 
> 
> 
> Thanks,
> Tom Graves
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:

> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first?

On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers with
> more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
> PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the result
> right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to apologize
> for that. I hope we can take advantage of the existing GitHub features to
> serve Apache Spark community in a way better than yesterday.
>
> How do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used for
> some purposes, but these features are out of the scope of this email.
>


Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-01 Thread Reynold Xin
That's a good idea. We should only be using squash.

On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Apache Spark PMC members and committers.
> 
> 
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
> 
> 
>     1. https:/ / github. com/ apache/ spark-website/ commits/ asf-site (
> https://github.com/apache/spark-website/commits/asf-site )
> 
>     2. https:/ / github. com/ apache/ spark/ commits/ master (
> https://github.com/apache/spark/commits/master )
> 
> 
> In order to be consistent with our previous behavior,
> can we disable `Allow Merge Commits` from GitHub `Merge Button` setting
> explicitly?
> 
> 
> I hope we can enforce it in both `spark-website` and `spark` repository
> consistently.
> 
> 
> Bests,
> Dongjoon.
>

Revisiting Python / pandas UDF

2019-07-05 Thread Reynold Xin
Hi all,

In the past two years, the pandas UDFs are perhaps the most important changes 
to Spark for Python data science. However, these functionalities have evolved 
organically, leading to some inconsistencies and confusions among users. I 
created a ticket and a document summarizing the issues, and a concrete proposal 
to fix them (the changes are pretty small). Thanks Xiangrui for initially 
bringing this to my attention, and Li Jin, Hyukjin, for offline discussions.

Please take a look: 

https://issues.apache.org/jira/browse/SPARK-28264

https://docs.google.com/document/u/1/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will 
also at some point run out of memory with too big of a query plan tree or 
become incredibly slow due to query planning complexity. I've seen queries that 
are tens of MBs in size.
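
For anyone curious where the planning-complexity wall is, a quick way to generate
an arbitrarily large query text for testing (a spark-shell sketch, assuming the
usual `spark` session):

  // builds a query whose text grows with n; planning cost grows with the tree size
  val n = 10000
  val bigQuery = (1 to n).map(i => s"SELECT $i AS id").mkString(" UNION ALL ")
  spark.sql(bigQuery).count()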

On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmont...@126.com > wrote:

> 
> I have a question about the maximum length of a SQL statement that is
> supported in Spark SQL. I can't find the answer in the Spark documentation.
> 
> 
> Maybe Integer.MAX_VALUE or not?
> 
> 
> 
>

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code.

On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > 
wrote:

> 
> Hi Reynold,
> 
> 
> I am genuinely curious about queries which are more than 1 MB and am
> stunned by tens of MB's. Any samples to share :) 
> 
> 
> Regards,
> Gourav
> 
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also at some point run out of memory with too big of a query plan
>> tree or become incredibly slow due to query planning complexity. I've seen
>> queries that are tens of MBs in size.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmontree@ 126. com (
>> alemmont...@126.com ) > wrote:
>> 
>>> I have a question about the maximum length of a SQL statement that is
>>> supported in Spark SQL. I can't find the answer in the Spark documentation.
>>> 
>>> 
>>> Maybe Integer.MAX_VALUE or not?
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-23 Thread Reynold Xin
I like the spirit, but not sure about the exact proposal. Take a look at
k8s':
https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md



On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon  wrote:

> (Plus, it helps to track history too. Spark's commit logs are growing and
> now it's pretty difficult to track the history and see what change
> introduced a specific behaviour)
>
> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>
> Hi all,
>
> I would like to discuss about some new sections under "## What changes
> were proposed in this pull request?":
>
> ### Do the changes affect _any_ user/dev-facing input or output?
>
> (Please answer yes or no. If yes, answer the questions below)
>
> ### What was the previous behavior?
>
> (Please provide the console output, description and/or reproducer about the 
> previous behavior)
>
> ### What is the behavior the changes propose?
>
> (Please provide the console output, description and/or reproducer about the 
> behavior the changes propose)
>
> See
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>  .
>
> From my experience so far in the Spark community, and judging from the
> interaction with other
> committers and contributors, it is pretty critical to know before/after
> behaviour changes even if it
> was a bug. In addition, I think this is requested by reviewers often.
>
> The new sections will make review process much easier, and we're able to
> quickly judge how serious the changes are.
> Given that the Spark community still suffers from open PRs just queueing up
> without review, I think this can help
> both reviewers and PR authors.
>
> I do describe them often when I think it's useful and possible.
> For instance see https://github.com/apache/spark/pull/24927 - I am sure
> you guys have a clear idea of what the
> PR fixes.
>
> I cc'ed some guys I can currently think of for now FYI. Please let me know
> if you guys have any thought on this!
>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
Matt what do you mean by maximizing 3, while allowing not throwing errors when 
any operations overflow? Those two seem contradicting.

On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> I’m -1, simply from disagreeing with the premise that we can afford to not
> be maximal on standard 3. The correctness of the data is non-negotiable,
> and whatever solution we settle on cannot silently adjust the user’s data
> under any circumstances.
> 
> 
> 
>  
> 
> 
> 
> I think the existing behavior is fine, or perhaps the behavior can be
> flagged by the destination writer at write time.
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Hyukjin Kwon < gurwls...@gmail.com >
> *Date:* Monday, July 29, 2019 at 11:33 PM
> *To:* Wenchen Fan < cloud0...@gmail.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> From my look, +1 on the proposal, considering ANSI and other DBMSes in
> general.
> 
> 
> 
> 
>  
> 
> 
> 
> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
> 
> 
> 
> 
>> 
>> 
>> We can add a config for a certain behavior if it makes sense, but the most
>> important thing we want to reach an agreement here is: what should be the
>> default behavior?
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Let's explore the solution space of table insertion behavior first:
>> 
>> 
>> 
>> 
>> At compile time,
>> 
>> 
>> 
>> 
>> 1. always add cast
>> 
>> 
>> 
>> 
>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>> int is forbidden but long to int is allowed)
>> 
>> 
>> 
>> 
>> 3. only add cast if it's 100% safe
>> 
>> 
>> 
>> 
>> At runtime,
>> 
>> 
>> 
>> 
>> 1. return null for invalid operations
>> 
>> 
>> 
>> 
>> 2. throw exceptions at runtime for invalid operations
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The standards to evaluate a solution:
>> 
>> 
>> 
>> 
>> 1. How robust the query execution is. For example, users usually don't
>> want to see the query fail midway.
>> 
>> 
>> 
>> 
>> 2. How tolerant it is to user queries. For example, a user would like to write
>> long values to an int column as he knows all the long values won't exceed
>> int range.
>> 
>> 
>> 
>> 
>> 3. How clean the result is. For example, users usually don't want to see
>> silently corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V1 tables: always add cast and
>> return null for invalid operations. This maximizes standard 1 and 2, but
>> the result is least clean and users are very likely to see silently
>> corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
>> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
>> queries may fail to compile, even if these queries can run on other SQL
>> systems. Note that, people can still see silently corrupted data because
>> cast is not the only one that can return corrupted data. Simple operations
>> like ADD can also return corrupted data if overflow happens. e.g. INSERT
>> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2 
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The proposal here: add cast following ANSI SQL store assignment rule, and
>> return null for invalid operations. This maximizes standard 1, and also
>> fits standard 2 well: if a query can't compile in Spark, it usually can't
>> compile in other mainstream databases as well. I think that's tolerant
>> enough. For standard 3, this proposal doesn't maximize it but can avoid
>> many invalid operations already.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Technically we can't make the result 100% clean at compile-time, we have
>> to handle things like overflow at runtime. I think the new proposal makes
>> more sense as the default behavior.
>> 
>> 
>> 
>> 
>>   
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer < russell.spit...@gmail.com
>> > wrote:
>> 
>> 
>> 
>>> 
>>> 
>>> I understand Spark is making the decisions, I'm saying the actual final
>>> effect of the null decision would be different depending on the insertion
>>> target if the target has different behaviors for null.
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>> 
>>> 
>>> 
>>> 
 
 
 > I'm a big -1 on null values for invalid casts.
 
 
 
  
 
 
 
 
 This is why we want to introduce the ANSI mode, so that invalid cast fails
 at runtime. But we have to keep the null behavior for a while, to keep
 backward compatibility. Spark returns null for invalid cast since the
 first day of Spark SQL, we c

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
OK to push back: "disagreeing with the premise that we can afford to not be 
maximal on standard 3. The correctness of the data is non-negotiable, and 
whatever solution we settle on cannot silently adjust the user’s data under any 
circumstances."

This blanket statement sounds great on surface, but there are a lot of 
subtleties. "Correctness" is absolutely important, but engineering/prod 
development are often about tradeoffs, and the industry has consistently traded 
correctness for performance or convenience, e.g. overflow checks, null 
pointers, consistency in databases ...

It all depends on the use cases and to what degree use cases can tolerate. For 
example, while I want my data engineering production pipeline to throw any 
error when the data doesn't match my expectations (e.g. type widening, 
overflow), if I'm doing some quick and dirty data science, I don't want the job 
to just fail due to outliers.

On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> Sorry I meant the current behavior for V2, which fails the query
> compilation if the cast is not safe.
> 
> 
> 
>  
> 
> 
> 
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might be warranted to do so.
> 
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Wednesday, July 31, 2019 at 9:58 AM
> *To:* Matt Cheah < mch...@palantir.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >, Hyukjin Kwon < gurwls...@gmail.com
> >, Wenchen Fan < cloud0...@gmail.com >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Matt what do you mean by maximizing 3, while allowing not throwing errors
> when any operations overflow? Those two seem contradicting.
> 
> 
> 
> 
>  
> 
> 
> 
> 
>  
> 
> 
> 
> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:
> 
> 
> 
>> 
>> 
>> I’m -1, simply from disagreeing with the premise that we can afford to not
>> be maximal on standard 3. The correctness of the data is non-negotiable,
>> and whatever solution we settle on cannot silently adjust the user’s data
>> under any circumstances.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> I think the existing behavior is fine, or perhaps the behavior can be
>> flagged by the destination writer at write time.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> -Matt Cheah
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> *From:* Hyukjin Kwon < gurwls...@gmail.com >
>> *Date:* Monday, July 29, 2019 at 11:33 PM
>> *To:* Wenchen Fan < cloud0...@gmail.com >
>> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
>> linguin@gmail.com
>> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
>> >rb...@netflix.com
>> >, Spark dev list < dev@spark.apache.org >
>> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> From my look, +1 on the proposal, considering ANSI and other DBMSes in
>> general.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
>> 
>> 
>> 
>> 
>>> 
>>> 
>>> We can add a config for a certain behavior if it makes sense, but the most
>>> important thing we want to reach an agreement here is: what should be the
>>> default behavior?
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> Let's explore the solution space of table insertion behavior first:
>>> 
>>> 
>>> 
>>> 
>>> At compile time,
>>> 
>>> 
>>> 
>>> 
>>> 1. always add cast
>>> 
>>> 
>>> 
>>> 
>>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>>> int is forbidden but long to int is allowed)
>>> 
>>> 
>>> 
>>> 
>>&

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Reynold Xin
We can also just write using one partition, which will be sufficient for
most use cases.
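
For example, a rough sketch of that with the built-in JDBC sink (`df` and
`jdbcUrl` are placeholders, and whether the write is truly atomic still depends
on the sink's commit behavior):

  // funnel the whole write through one task, i.e. one JDBC connection
  df.coalesce(1)
    .write
    .format("jdbc")
    .option("url", jdbcUrl)            // placeholder connection string
    .option("dbtable", "target_table") // placeholder target table
    .mode("append")
    .save()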

On Mon, Aug 5, 2019 at 7:48 PM Matt Cheah  wrote:

> There might be some help from the staging table catalog as well.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Wenchen Fan 
> *Date: *Monday, August 5, 2019 at 7:40 PM
> *To: *Shiv Prashant Sood 
> *Cc: *Ryan Blue , Jungtaek Lim ,
> Spark Dev List 
> *Subject: *Re: DataSourceV2 : Transactional Write support
>
>
>
> I agree with the temp table approach. One idea is: maybe we only need one
> temp table, and each task writes to this temp table. At the end we read the
> data from the temp table and write it to the target table. AFAIK JDBC can
> handle concurrent table writing very well, and it's better than creating
> thousands of temp tables for one write job(assume the input RDD has
> thousands of partitions).
>
>
>
> On Tue, Aug 6, 2019 at 7:57 AM Shiv Prashant Sood 
> wrote:
>
> Thanks all for the clarification.
>
>
>
> Regards,
>
> Shiv
>
>
>
> On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue 
> wrote:
>
> > What you could try instead is intermediate output: inserting into
> temporal table in executors, and move inserted records to the final table
> in driver (must be atomic)
>
>
>
> I think that this is the approach that other systems (maybe sqoop?) have
> taken. Insert into independent temporary tables, which can be done quickly.
> Then for the final commit operation, union and insert into the final table.
> In a lot of cases, JDBC databases can do that quickly as well because the
> data is already on disk and just needs to added to the final table.
>
>
>
> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim  wrote:
>
> I asked similar question for end-to-end exactly-once with Kafka, and
> you're correct distributed transaction is not supported. Introducing
> distributed transaction like "two-phase commit" requires huge change on
> Spark codebase and the feedback was not positive.
>
>
>
> What you could try instead is intermediate output: inserting into temporal
> table in executors, and move inserted records to the final table in driver
> (must be atomic).
>
>
>
> Thanks,
>
> Jungtaek Lim (HeartSaVioR)
>
>
>
> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood 
> wrote:
>
> All,
>
>
>
> I understood that DataSourceV2 supports Transactional write and wanted to
> implement that in JDBC DataSource V2 connector ( PR#25211 ).
>
>
>
> Don't see how this is feasible for JDBC based connector.  The FW suggest
> that EXECUTOR send a commit message  to DRIVER, and actual commit should
> only be done by DRIVER after receiving all commit confirmations. This will
> not work for JDBC  as commits have to happen on the JDBC Connection which
> is maintained by the EXECUTORS and JDBCConnection  is not serializable that
> it can be sent to the DRIVER.
>
>
>
> Am i right in thinking that this cannot be supported for JDBC? My goal is
> to either fully write or roll back the dataframe write operation.
>
>
>
> Thanks in advance for your help.
>
>
>
> Regards,
>
> Shiv
>
>
>
>
> --
>
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>


Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Would it be possible to have one build that works for both?

On Mon, Aug 26, 2019 at 10:22 AM Dongjoon Hyun 
wrote:

> Thank you all!
>
> Let me add more explanation on the current status.
>
> - If you want to run on JDK8, you need to build on JDK8
> - If you want to run on JDK11, you need to build on JDK11.
>
> The other combinations will not work.
>
> Currently, we have two Jenkins jobs. (1) is the one I pointed, and (2) is
> the one for the remaining community work.
>
> 1) Build and test on JDK11 (spark-master-test-maven-hadoop-3.2-jdk-11)
> 2) Build on JDK8 and test on JDK11
> (spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing)
>
> To keep JDK11 compatibility, the following is merged today.
>
> [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11
> support for pull request builds
>
> But we still have many things to do for Jenkins/Release and we need your
> support for JDK11. :)
>
> Bests,
> Dongjoon.
>
>
> On Sun, Aug 25, 2019 at 10:30 PM Takeshi Yamamuro 
> wrote:
>
>> Cool, congrats!
>>
>> Bests,
>> Takeshi
>>
>> On Mon, Aug 26, 2019 at 1:01 PM Hichame El Khalfi 
>> wrote:
>>
>>> That's Awesome !!!
>>>
>>> Thanks to everyone that made this possible :cheers:
>>>
>>> Hichame
>>>
>>> *From:* cloud0...@gmail.com
>>> *Sent:* August 25, 2019 10:43 PM
>>> *To:* lix...@databricks.com
>>> *Cc:* felixcheun...@hotmail.com; ravishankar.n...@gmail.com;
>>> dongjoon.h...@gmail.com; dev@spark.apache.org; u...@spark.apache.org
>>> *Subject:* Re: JDK11 Support in Apache Spark
>>>
>>> Great work!
>>>
>>> On Sun, Aug 25, 2019 at 6:03 AM Xiao Li  wrote:
>>>
 Thank you for your contributions! This is a great feature for Spark
 3.0! We finally achieved it!

 Xiao

 On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> That’s great!
>
> --
> *From:* ☼ R Nair 
> *Sent:* Saturday, August 24, 2019 10:57:31 AM
> *To:* Dongjoon Hyun 
> *Cc:* dev@spark.apache.org ; user @spark/'user
> @spark'/spark users/user@spark 
> *Subject:* Re: JDK11 Support in Apache Spark
>
> Finally!!! Congrats
>
> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Thanks to your many many contributions,
>> Apache Spark master branch starts to pass on JDK11 as of today.
>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>> (JDK11 is used for building and testing.)
>>
>> We already verified all UTs (including PySpark/SparkR) before.
>>
>> Please feel free to use JDK11 in order to build/test/run `master`
>> branch and
>> share your experience including any issues. It will help Apache Spark
>> 3.0.0 release.
>>
>> For the follow-ups, please follow
>> https://issues.apache.org/jira/browse/SPARK-24417 .
>> The next step is `how to support JDK8/JDK11 together in a single
>> artifact`.
>>
>> Bests,
>> Dongjoon.
>>
>

 --
 [image: Databricks Summit - Watch the talks]
 

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Exactly - I think it's important to be able to create a single binary build. 
Otherwise downstream users (the 99.99% who won't be building their own Spark but 
will just pull it from Maven) will have to deal with the mess, and it's even worse 
for libraries.

On Mon, Aug 26, 2019 at 10:51 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Oh, right. If you want to publish something to Maven, it will inherit the
> situation.
> Thank you for feedback. :)
> 
> On Mon, Aug 26, 2019 at 10:37 AM Michael Heuer < heuermh@ gmail. com (
> heue...@gmail.com ) > wrote:
> 
> 
>> That is not true for any downstream users who also provide a library. 
>> Whatever build mess you create in Apache Spark, we'll have to inherit it. 
>> ;)
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Aug 26, 2019, at 12:32 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>> dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> As Shane wrote, not yet.
>>> 
>>> 
>>> `one build for works for both` is our aspiration and the next step
>>> mentioned in the first email.
>>> 
>>> 
>>> 
>>> > The next step is `how to support JDK8/JDK11 together in a single
>>> artifact`.
>>> 
>>> 
>>> For the downstream users who build from the Apache Spark source, that will
>>> not be a blocker because they will prefer a single JDK.
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Mon, Aug 26, 2019 at 10:28 AM Shane Knapp < sknapp@ berkeley. edu (
>>> skn...@berkeley.edu ) > wrote:
>>> 
>>> 
>>>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>>>> 
>>>> 
>>>> also, i busted dev/run-tests.py in my changes
>>>> to support java11 in PRBs:
>>>> https:/ / github. com/ apache/ spark/ pull/ 25585 (
>>>> https://github.com/apache/spark/pull/25585 )
>>>> 
>>>> 
>>>> 
>>>> quick fix, testing now.
>>>> 
>>>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> Would it be possible to have one build that works for both?
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Reynold Xin
Having three modes is a lot. Why not just use ANSI mode as the default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standards compliant and easy to understand. We also don't need to invent a 
standard just for Spark.
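
For reference, this is the knob being voted on, as I read the proposal below
(a sketch only; `t_int` is a hypothetical table with a single INT column):

  // pick the store assignment policy for the session: ANSI / LEGACY / STRICT
  spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
  // under ANSI, inserting a string literal into an INT column is rejected at
  // analysis time instead of silently producing null
  spark.sql("INSERT INTO t_int VALUES ('abc')")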

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> +1
> 
> 
> To be honest I don't like the legacy policy. It's too loose and easy for
> users to make mistakes, especially when Spark returns null if a function
> hit errors like overflow.
> 
> 
> The strict policy is not good either. It's too strict and stops valid use
> cases like writing timestamp values to a date type column. Users do expect
> truncation to happen without adding cast manually in this case. It's also
> weird to use a spark specific policy that no other database is using.
> 
> 
> The ANSI policy is better. It stops invalid use cases like writing string
> values to an int type column, while keeping valid use cases like timestamp
> -> date.
> 
> 
> I think it's no doubt that we should use ANSI policy instead of legacy
> policy for v1 tables. Except for backward compatibility, ANSI policy is
> literally better than the legacy policy.
> 
> 
> The v2 table is arguable here. Although the ANSI policy is better than
> strict policy to me, this is just the store assignment policy, which only
> partially controls the table insertion behavior. With Spark's "return null
> on error" behavior, the table insertion is more likely to insert invalid
> null values with the ANSI policy compared to the strict policy.
> 
> 
> I think we should use ANSI policy by default for both v1 and v2 tables,
> because
> 1. End-users don't care how the table is implemented. Spark should provide
> consistent table insertion behavior between v1 and v2 tables.
> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
> compatibility issue. That said, the baseline to judge which policy is
> better should be the table insertion behavior in Spark 2.x, which is the
> legacy policy + "return null on error". ANSI policy is better than the
> baseline.
> 3. We expect more and more users to migrate their data sources to the V2
> API. The strict policy can be a stopper as it's a too big breaking change,
> which may break many existing queries.
> 
> 
> Thanks,
> Wenchen 
> 
> 
> 
> 
> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang < gengliang. wang@ databricks.
> com ( gengliang.w...@databricks.com ) > wrote:
> 
> 
>> Hi everyone,
>> 
>> I'd like to call for a vote on SPARK-28885 (
>> https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
>> assignment rules in table insertion by default".  
>> When inserting a value into a column with the different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice,
>> the behavior is mostly the same as PostgreSQL. It disallows certain
>> unreasonable type conversions such as converting `string` to `int` and
>> `double` to `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules were originally for the Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>> 
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both
>> V1 and V2 in Spark 3.0.
>> 
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in
>> the dev mailing list.
>> 
>> This vote is open until next Thurs (Sept. 12th).
>> 
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>> 
>> Thank you!
>> Gengliang
>> 
>> 
> 
>

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Reynold Xin
+1! Long due for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge < jzhuge@ apache. org (
> jzh...@apache.org ) > wrote:
> 
> 
>> +1  Like the idea as a user and a DSv2 contributor.
>> 
>> 
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim < kabhwan@ gmail. com (
>> kabh...@gmail.com ) > wrote:
>> 
>> 
>>> +1 (as a contributor) from me to have a preview release on Spark 3 as it
>>> would help to test the features. When to cut the preview release is
>>> questionable, as major work should ideally be done before that - if we
>>> intend to introduce new features before the official release, that
>>> should work regardless of this, but if we intend to have an opportunity
>>> to test earlier, ideally it should.
>>> 
>>> 
>>> As a one of contributors in structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I pick some items for new features which committers
>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>> consumer pool as improvement) I hope we provide some gifts for structured
>>> streaming users in Spark 3.0 envelope.
>>> 
>>> 
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> 
>>> It's a correctness issue with multiple users reported, being reported at
>>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>>> submitted at Jan. 2019 to fix it.
>>> 
>>> 
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>>> start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>> 
>>> 
>>> There're some more new features/improvements items in SS, but given we're
>>> talking about ramping-down, above list might be realistic one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin < jgp@ jgp. net (
>>> j...@jgp.net ) > wrote:
>>> 
>>> 
 As a user/non committer, +1
 
 
 I love the idea of an early 3.0.0 so we can test current dev against it, I
 know the final 3.x will probably need another round of testing when it
 gets out, but less for sure... I know I could check out and compile, but
 having a “packaged” preview version is great if it does not take too much of
 the team's time...
 
 jg
 
 
 
 On Sep 11, 2019, at 20:40, Hyukjin Kwon < gurwls223@ gmail. com (
 gurwls...@gmail.com ) > wrote:
 
 
 
> +1 from me too but I would like to know what other people think too.
> 
> 
> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
> 
> 
>> Thank you, Sean.
>> 
>> 
>> I'm also +1 for the following three.
>> 
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps 
>> it
>> a lot.
>> 
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>> 
>> 
>> - https:/ / spark. apache. org/ versioning-policy. html (
>> https://spark.apache.org/versioning-policy.html )
>> 
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer < heuermh@ gmail. com (
>> heue...@gmail.com ) > wrote:
>> 
>> 
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility 
>>> problems
>>> resolved, e.g.
>>> 
>>> 
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25588 (
>>> https://issues.apache.org/jira/browse/SPARK-25588 )
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27781 (
>>> https://issues.apache.org/jira/browse/SPARK-27781 )
>>> 
>>> 
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far 
>>> as
>>> I know, Parquet has not cut a release based on this new version.
>>> 
>>> 
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> 
>>> 
>>> https:/ / github. com/ apache/ spark/ pull/ 24851 (
>>> https://github.com/apache/spark/pull/24851 )
>>> https:/ / github. com/ apache/ spark/ pull/ 24297 (
>>> https://github.com/apache/spark/pull/24297 )
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
DSv2 is far from stable right? All the actual data types are unstable and you 
guys have completely ignored that. We'd need to work on that and that will be a 
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that 
seems too invasive of a change to backport once you consider the parts needed 
to make dsv2 stable.

On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> Hi everyone,
> 
> 
> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
> 
> 
> A Spark 2.5 release with these two additions will help people migrate to
> Spark 3.0 when it is released because they will be able to use a single
> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
> upgrading to 3.0 won't also require updating to Java 11 because users
> could update to Java 11 with the 2.5 release and have fewer major changes.
> 
> 
> 
> Another reason to consider a 2.5 release is that many people are
> interested in a release with the latest DSv2 API and support for DSv2 SQL.
> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
> it makes sense to share this work with the community.
> 
> 
> This release line would just consist of backports like DSv2 and Java 11
> that assist compatibility, to keep the scope of the release small. The
> purpose is to assist people moving to 3.0 and not distract from the 3.0
> release.
> 
> 
> Would a Spark 2.5 release help anyone else? Are there any concerns about
> this plan?
> 
> 
> 
> 
> rb
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
To push back, while I agree we should not drastically change "InternalRow", 
there are a lot of changes that need to happen to make it stable. For example, 
none of the publicly exposed interfaces should be in the Catalyst package or 
the unsafe package. External implementations should be decoupled from the 
internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to 
work towards making it stable in the future, assuming we will start with an 
unstable API temporarily. You can't just make a bunch of internal APIs tightly 
coupled with other internal pieces public and stable and call it a day, just 
because it happens to satisfy some use cases temporarily, assuming the rest of 
Spark doesn't change.
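
To be clear about what "decoupled" could look like, a purely hypothetical sketch
(not a proposal of concrete names): a minimal row view for connectors, backed by
the internal representation, with conversions kept cheap:

  // a stable, minimal row accessor exposed to connectors; the engine would
  // provide cheap adapters to and from its internal row format
  trait ConnectorRow {
    def isNullAt(ordinal: Int): Boolean
    def getInt(ordinal: Int): Int
    def getLong(ordinal: Int): Long
    def getDouble(ordinal: Int): Double
    def getString(ordinal: Int): String
    // ... one accessor per supported data type
  }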

On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue < rb...@netflix.com > wrote:

> 
> > DSv2 is far from stable right?
> 
> 
> No, I think it is reasonably stable and very close to being ready for a
> release.
> 
> 
> > All the actual data types are unstable and you guys have completely
> ignored that.
> 
> 
> I think what you're referring to is the use of `InternalRow`. That's a
> stable API and there has been no work to avoid using it. In any case, I
> don't think that anyone is suggesting that we delay 3.0 until a
> replacement for `InternalRow` is added, right?
> 
> 
> While I understand the motivation for a better solution here, I think the
> pragmatic solution is to continue using `InternalRow`.
> 
> 
> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
> invasive of a change to backport once you consider the parts needed to
> make dsv2 stable.
> 
> 
> I believe that those of us working on DSv2 are confident about the current
> stability. We set goals for what to get into the 3.0 release months ago
> and have very nearly reached the point where we are ready for that
> release.
> 
> 
> I don't think instability would be a problem in maintaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. Because the goals we set for the 3.0 release have been reached
> with the current API and if we are ready to release 3.0, we can release a
> 2.5 with the same API.
> 
> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make dsv2 stable.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rblue@ netflix. com. invalid
>> ( rb...@netflix.com.invalid ) > wrote:
>> 
>>> Hi everyone,
>>> 
>>> 
>>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>> 
>>> 
>>> A Spark 2.5 release with these two additions will help people migrate to
>>> Spark 3.0 when it is released because they will be able to use a single
>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>> 
>>> 
>>> 
>>> Another reason to consider a 2.5 release is that many people are
>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>> it makes sense to share this work with the community.
>>> 
>>> 
>>> This release line would just consist of backports like DSv2 and Java 11
>>> that assist compatibility, to keep the scope of the release small. The
>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>> release.
>>> 
>>> 
>>> Would a Spark 2.5 release help anyone else? Are there any concerns about
>>> this plan?
>>> 
>>> 
>>> 
>>> 
>>> rb
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
I don't think we need to gate a 3.0 release on making a more stable version of 
InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change 
and progress towards more stable, but will have to break certain APIs) and 2.x 
seems like a false premise.

To point out some problems with InternalRow that you think are already 
pragmatic and stable:

The class is in catalyst, which states: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala

/**

* Catalyst is a library for manipulating relational query plans.  All classes 
in catalyst are

* considered an internal API to Spark SQL and are subject to change between 
minor releases.

*/

There is no even any annotation on the interface.

The entire dependency chain was created to be private, and tightly coupled 
with internal implementations. For example, 

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

/**

* A UTF-8 String for internal Spark use.

* 

* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,

* search, see http://en.wikipedia.org/wiki/UTF-8 for details.

* 

* Note: This is not designed for general use cases, should not be used outside 
SQL.

*/

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala

(which again is in catalyst package)

If you want to argue this way, you might as well argue we should make the 
entire catalyst package public to be pragmatic and not allow any changes.

On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue < rb...@netflix.com > wrote:

> 
> 
>> 
>> 
>> When you created the PR to make InternalRow public
>> 
>> 
> 
> 
> 
> This isn’t quite accurate. The change I made was to use InternalRow instead
> of UnsafeRow , which is a specific implementation of InternalRow. Exposing
> this API has always been a part of DSv2 and while both you and I did some
> work to avoid this, we are still in the phase of starting with that API.
> 
> 
> 
> Note that any change to InternalRow would be very costly to implement
> because this interface is widely used. That is why I think we can
> certainly consider it stable enough to use here, and that’s probably why 
> UnsafeRow
> was part of the original proposal.
> 
> 
> 
> In any case, the goal for 3.0 was not to replace the use of InternalRow ,
> it was to get the majority of SQL working on top of the interface added
> after 2.4. That’s done and stable, so I think a 2.5 release with it is
> also reasonable.
> 
> 
> 
> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin < r...@databricks.com > wrote:
> 
> 
> 
>> To push back, while I agree we should not drastically change
>> "InternalRow", there are a lot of changes that need to happen to make it
>> stable. For example, none of the publicly exposed interfaces should be in
>> the Catalyst package or the unsafe package. External implementations
>> should be decoupled from the internal implementations, with cheap ways to
>> convert back and forth.
>> 
>> 
>> 
>> When you created the PR to make InternalRow public, the understanding was
>> to work towards making it stable in the future, assuming we will start
>> with an unstable API temporarily. You can't just make a bunch of internal
>> APIs tightly coupled with other internal pieces public and stable and call
>> it a day, just because it happens to satisfy some use cases temporarily,
>> assuming the rest of Spark doesn't change.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue < rb...@netflix.com > wrote:
>> 
>>> > DSv2 is far from stable right?
>>> 
>>> 
>>> No, I think it is reasonably stable and very close to being ready for a
>>> release.
>>> 
>>> 
>>> > All the actual data types are unstable and you guys have completely
>>> ignored that.
>>> 
>>> 
>>> I think what you're referring to is the use of `InternalRow`. That's a
>>> stable API and there has been no work to avoid using it. In any case, I
>>> don't think that anyone is suggesting that we delay 3.0 until a
>>> replacement for `InternalRow` is added, right?
>>> 
>>> 
>>> While I understand the motivation for a better solution here, I think the
>>> pragmatic solution is to continue using `InternalRow`.
>>> 
>>> 
>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>> invasive of a change to backport once you consider the parts needed 

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
How would you not make incompatible changes in 3.x? As discussed the
InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:

> > Making downstream to diverge their implementation heavily between minor
> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>
> You're right that the API has been evolving in the 2.x line. But, it is
> now reasonably stable with respect to the current feature set and we should
> not need to break compatibility in the 3.x line. Because we have reached
> our goals for the 3.0 release, we can backport at least those features to
> 2.x and confidently have an API that works in both a 2.x release and is
> compatible with 3.0, if not 3.1 and later releases as well.
>
> > I'd rather say preparation of Spark 2.5 should be started after Spark
> 3.0 is officially released
>
> The reason I'm suggesting this is that I'm already going to do the work to
> backport the 3.0 release features to 2.4. I've been asked by several people
> when DSv2 will be released, so I know there is a lot of interest in making
> this available sooner than 3.0. If I'm already doing the work, then I'd be
> happy to share that with the community.
>
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
> about complete so we can easily release the same set of features and API in
> 2.5 and 3.0.
>
> If we decide for some reason to wait until after 3.0 is released, I don't
> know that there is much value in a 2.5. The purpose is to be a step toward
> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
> wouldn't get these features out any sooner than 3.0, as a 2.5 release
> probably would, given the work needed to validate the incompatible changes
> in 3.0.
>
> > DSv2 change would be the major backward incompatibility which Spark 2.x
> users may hesitate to upgrade
>
> As I pointed out, DSv2 has been changing in the 2.x line, so this is
> expected. I don't think it will need incompatible changes in the 3.x line.
>
> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:
>
>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>> deal with this as the change made confusion on my PRs...), but my bet is
>> that DSv2 would be already changed in incompatible way, at least who works
>> for custom DataSource. Making downstream to diverge their implementation
>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>> experience - especially we are not completely closed the chance to further
>> modify DSv2, and the change could be backward incompatible.
>>
>> If we really want to bring the DSv2 change to 2.x version line to let end
>> users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say
>> preparation of Spark 2.5 should be started after Spark 3.0 is officially
>> released, honestly even later than that, say, getting some reports from
>> Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark
>> 2.5 be a kind of "tech-preview" which Spark 2.4 users may be frustrated to
>> upgrade to next minor version.
>>
>> Btw, do we have any specific target users for this? Personally DSv2
>> change would be the major backward incompatibility which Spark 2.x users
>> may hesitate to upgrade, so they might be already prepared to migrate to
>> Spark 3.0 if they are prepared to migrate to new DSv2.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
>> wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (
>>> https://spark.apache.org/versioning-policy.html ).
>>>
>>> > We just won’t add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>>> wrote:
>>>
>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>> seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only su

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
Because for example we'd need to move the location of InternalRow, breaking the 
package name. If you insist we shouldn't change the unstable temporary API in 
3.x to maintain compatibility with 3.0, which is totally different from my 
understanding of the situation when you exposed it, then I'd say we should gate 
3.0 on having a stable row interface.

I also don't get this backporting a giant feature to 2.x line ... as suggested 
by others in the thread, DSv2 would be one of the main reasons people upgrade 
to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 
3.0 entirely and backport all the features to 2.x?

On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue < rb...@netflix.com > wrote:

> 
> Why would that require an incompatible change?
> 
> 
> We *could* make an incompatible change and remove support for InternalRow,
> but I think we would want to carefully consider whether that is the right
> decision. And in any case, we would be able to keep 2.5 and 3.0
> compatible, which is the main goal.
> 
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
> 
>> How would you not make incompatible changes in 3.x? As discussed the
>> InternalRow API is not stable and needs to change. 
>> 
>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue < rb...@netflix.com > wrote:
>> 
>> 
>>> > Making downstream to diverge their implementation heavily between minor
>>> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>> 
>>> 
>>> You're right that the API has been evolving in the 2.x line. But, it is
>>> now reasonably stable with respect to the current feature set and we
>>> should not need to break compatibility in the 3.x line. Because we have
>>> reached our goals for the 3.0 release, we can backport at least those
>>> features to 2.x and confidently have an API that works in both a 2.x
>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>> 
>>> 
>>> 
>>> > I'd rather say preparation of Spark 2.5 should be started after Spark
>>> 3.0 is officially released
>>> 
>>> 
>>> The reason I'm suggesting this is that I'm already going to do the work to
>>> backport the 3.0 release features to 2.4. I've been asked by several
>>> people when DSv2 will be released, so I know there is a lot of interest in
>>> making this available sooner than 3.0. If I'm already doing the work, then
>>> I'd be happy to share that with the community.
>>> 
>>> 
>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>> about complete so we can easily release the same set of features and API
>>> in 2.5 and 3.0.
>>> 
>>> 
>>> If we decide for some reason to wait until after 3.0 is released, I don't
>>> know that there is much value in a 2.5. The purpose is to be a step toward
>>> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
>>> wouldn't get these features out any sooner than 3.0, as a 2.5 release
>>> probably would, given the work needed to validate the incompatible changes
>>> in 3.0.
>>> 
>>> 
>>> > DSv2 change would be the major backward incompatibility which Spark 2.x
>>> users may hesitate to upgrade
>>> 
>>> 
>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>> 
>>> 
>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim < kabh...@gmail.com > wrote:
>>> 
>>> 
>>>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>>>> deal with this as the change made confusion on my PRs...), but my bet is
>>>> that DSv2 would be already changed in incompatible way, at least who works
>>>> for custom DataSource. Making downstream to diverge their implementation
>>>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>>>> experience - especially we are not completely closed the chance to further
>>>> modify DSv2, and the change could be backward incompatible.
>>>> 
>>>> 
>>>> If we really want to bring the DSv2 change to 2.x version line to let end
>>>> users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say
>>>> preparation

Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
A while ago we changed it so the task gets broadcasted too, so I think the two 
are fairly similar.

On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com > 
wrote:

> 
> I was wondering if anyone could help with this question.
> 
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, < dhruba. work@ gmail. com
> ( dhruba.w...@gmail.com ) > wrote:
> 
> 
>> Hi,
>> 
>> 
>> I have a question regarding passing a dictionary from driver to executors
>> in spark on yarn. This dictionary is needed in an udf. I am using pyspark.
>> 
>> 
>> As I understand this can be passed in two ways:
>> 
>> 
>> 1. Broadcast the variable and then use it in the udfs
>> 
>> 
>> 2. Pass the dictionary in the udf itself, in something like this:
>> 
>> 
>>   def udf1(col1, dict):
>>    ..
>>   def udf1_fn(dict):
>>     return udf(lambda col_data: udf1(col_data, dict))
>> 
>> 
>>   df.withColumn("column_new", udf1_fn(dict)("old_column"))
>> 
>> 
>> Well I have tested with both the ways and it works both ways.
>> 
>> 
>> Now I am wondering what is fundamentally different between the two. I
>> understand how broadcast works, but I am not sure how the data is passed
>> across in the 2nd way. Is the dictionary passed to each executor every
>> time a new task runs on that executor, or is it passed only once? Also,
>> how is the data passed to the Python processes? They are Python UDFs, so
>> I think they are executed natively in Python (please correct me if I am
>> wrong). So the data will be serialised and passed to Python.
>> 
>> So, in summary, my question is: which is the better/more efficient way to
>> write the whole thing, and why?
>> 
>> 
>> Thank you!
>> 
>> 
>> Regards,
>> Dhrub
>> 
> 
>
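
A minimal PySpark sketch of the two approaches discussed above (broadcast
variable vs. capturing the dictionary in the UDF's closure); the DataFrame,
column, and dictionary here are illustrative assumptions, not taken from the
original question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",)], ["old_column"])
    lookup = {"a": "alpha", "b": "beta"}

    # Approach 1: broadcast the dictionary once and reference it inside the UDF.
    bc_lookup = spark.sparkContext.broadcast(lookup)
    via_broadcast = udf(lambda v: bc_lookup.value.get(v), StringType())

    # Approach 2: capture the dictionary in the UDF's closure; it is shipped with
    # the serialized task (and recent Spark versions broadcast the task as well).
    def make_udf(d):
        return udf(lambda v: d.get(v), StringType())

    df.withColumn("c1", via_broadcast("old_column")) \
      .withColumn("c2", make_udf(lookup)("old_column")) \
      .show()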

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
Whoever created the JIRA years ago didn't describe dpp correctly, but the 
linked jira in Hive was correct (which unfortunately is much more terse than 
any of the patches we have in Spark 
https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's description was 
also correct.

On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> Where can I find a design doc for dynamic partition pruning that explains
> how it works?
> 
> 
> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
> pruning (as pointed out by Henry R.) and doesn't have any comments about
> the implementation's approach. And the PR description also doesn't have
> much information. It lists 3 cases for how a filter is built, but nothing
> about the overall approach or design that helps when trying to find out
> where it should be placed in the optimizer rules. It also isn't clear why
> this couldn't be part of spark-catalyst.
> 
> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
> 
> 
>> dynamic partition pruning rule generates "hidden" filters that will be
>> converted to real predicates at runtime, so it doesn't matter where we run
>> the rule.
>> 
>> 
>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better
>> to run it before join reorder.
>> 
>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue < rb...@netflix.com.invalid >
>> wrote:
>> 
>> 
>>> 
>>> 
>>> Hi everyone,
>>> 
>>> 
>>> 
>>> I have been working on a PR that moves filter and projection pushdown into
>>> the optimizer for DSv2, instead of when converting to physical plan. This
>>> will make DSv2 work with optimizer rules that depend on stats, like join
>>> reordering.
>>> 
>>> 
>>> 
>>> While adding the optimizer rule, I found that some rules appear to be out
>>> of order. For example, PruneFileSourcePartitions that handles filter
>>> pushdown for v1 scans is in SparkOptimizer (spark-sql) in a batch that will
>>> run after all of the batches in Optimizer (spark-catalyst) including 
>>> CostBasedJoinReorder
>>> .
>>> 
>>> 
>>> 
>>> SparkOptimizer also adds the new “dynamic partition pruning” rules after 
>>> both
>>> the cost-based join reordering and the v1 partition pruning rule. I’m not
>>> sure why this should run after join reordering and partition pruning,
>>> since it seems to me like additional filters would be good to have before
>>> those rules run.
>>> 
>>> 
>>> 
>>> It looks like this might just be that the rules were written in the
>>> spark-sql module instead of in catalyst. That makes some sense for the v1
>>> pushdown, which is altering physical plan details ( FileIndex ) that have
>>> leaked into the logical plan. I’m not sure why the dynamic partition
>>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>>> pushdown.
>>> 
>>> 
>>> 
>>> Can someone more familiar with these rules clarify why they appear to be
>>> out of order?
>>> 
>>> 
>>> 
>>> Assuming that this is an accident, I think it’s something that should be
>>> fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning
>>> may still need to be addressed.
>>> 
>>> 
>>> 
>>> rb
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
No there is no separate write up internally.

On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue  wrote:

> Thanks for the pointers, but what I'm looking for is information about the
> design of this implementation, like what requires this to be in spark-sql
> instead of spark-catalyst.
>
> Even a high-level description, like what the optimizer rules are and what
> they do would be great. Was there one written up internally that you could
> share?
>
> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue 
> wrote:
>
>> > It lists 3 cases for how a filter is built, but nothing about the
>> overall approach or design that helps when trying to find out where it
>> should be placed in the optimizer rules.
>>
>> The overall idea/design of DPP can be simply put as using the result of
>> one side of the join to prune partitions of a scan on the other side. The
>> optimal situation is when the join is a broadcast join and the table being
>> partition-pruned is on the probe side. In that case, by the time the probe
>> side starts, the filter will already have the results available and ready
>> for reuse.
>>
>> Regarding the place in the optimizer rules, it's preferred to happen late
>> in the optimization, and definitely after join reorder.
>>
>>
>> Thanks,
>> Maryann
>>
>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin  wrote:
>>
>>> Whoever created the JIRA years ago didn't describe dpp correctly, but
>>> the linked jira in Hive was correct (which unfortunately is much more terse
>>> than any of the patches we have in Spark
>>> https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description
>>> was also correct.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue 
>>> wrote:
>>>
>>>> Where can I find a design doc for dynamic partition pruning that
>>>> explains how it works?
>>>>
>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>> the implementation's approach. And the PR description also doesn't have
>>>> much information. It lists 3 cases for how a filter is built, but
>>>> nothing about the overall approach or design that helps when trying to find
>>>> out where it should be placed in the optimizer rules. It also isn't clear
>>>> why this couldn't be part of spark-catalyst.
>>>>
>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan  wrote:
>>>>
>>>>> dynamic partition pruning rule generates "hidden" filters that will be
>>>>> converted to real predicates at runtime, so it doesn't matter where we run
>>>>> the rule.
>>>>>
>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's
>>>>> better to run it before join reorder.
>>>>>
>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>>> into the optimizer for DSv2, instead of when converting to physical plan.
>>>>>> This will make DSv2 work with optimizer rules that depend on stats, like
>>>>>> join reordering.
>>>>>>
>>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>>> out of order. For example, PruneFileSourcePartitions that handles
>>>>>> filter pushdown for v1 scans is in SparkOptimizer (spark-sql) in a
>>>>>> batch that will run after all of the batches in Optimizer
>>>>>> (spark-catalyst) including CostBasedJoinReorder.
>>>>>>
>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>>> pruning rule. I’m not sure why this should run after join reordering and
>>>>>> partition pruning, since it seems to me like additional filters would be
>>>>>> good to have before those rules run.
>>>>>>
>>>>>> It looks like this might just be that the rules were written in the
>>>>>> spark-sql module instead of in catalyst. That makes some sense for the v1
>>>>>> pushdown, which is altering physical plan details (FileIndex) that
>>>>>> have leaked into the logical plan. I’m not sure why the dynamic partition
>>>>>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>>>>>> pushdown.
>>>>>>
>>>>>> Can someone more familiar with these rules clarify why they appear to
>>>>>> be out of order?
>>>>>>
>>>>>> Assuming that this is an accident, I think it’s something that should
>>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” 
>>>>>> pruning
>>>>>> may still need to be addressed.
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
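
As a concrete illustration of the DPP idea described above, here is a small
sketch (assuming Spark 3.0+ with dynamic partition pruning enabled; the table
and column names are invented for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A fact table partitioned on part_key and a small, filtered dimension table.
    spark.range(1000).selectExpr("id", "id % 10 AS part_key") \
        .write.mode("overwrite").partitionBy("part_key").saveAsTable("fact")
    spark.range(10).selectExpr("id AS part_key", "id % 2 = 0 AS keep") \
        .write.mode("overwrite").saveAsTable("dim")

    # The filter on dim is only known at runtime; DPP turns the join into a
    # partition filter on fact.part_key so only matching partitions are scanned.
    q = spark.sql("""
        SELECT f.id
        FROM fact f JOIN dim d ON f.part_key = d.part_key
        WHERE d.keep = true
    """)
    q.explain()  # look for a dynamic pruning expression in the partition filters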


Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
I just looked at the PR. I think there is some follow-up work that needs to be 
done, e.g. we shouldn't create a top-level package 
org.apache.spark.sql.dynamicpruning.

On Wed, Oct 02, 2019 at 1:52 PM, Maryann Xue < maryann@databricks.com > 
wrote:

> 
> There is no internal write up, but I think we should at least give some
> up-to-date description on that JIRA entry.
> 
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> No there is no separate write up internally.
>> 
>> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue < rb...@netflix.com > wrote:
>> 
>> 
>>> Thanks for the pointers, but what I'm looking for is information about the
>>> design of this implementation, like what requires this to be in spark-sql
>>> instead of spark-catalyst.
>>> 
>>> 
>>> Even a high-level description, like what the optimizer rules are and what
>>> they do would be great. Was there one written up internally that you could
>>> share?
>>> 
>>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue < maryann@databricks.com >
>>> wrote:
>>> 
>>> 
>>>> > It lists 3 cases for how a filter is built, but nothing about the
>>>> overall approach or design that helps when trying to find out where it
>>>> should be placed in the optimizer rules.
>>>> 
>>>> 
>>>> The overall idea/design of DPP can be simply put as using the result of
>>>> one side of the join to prune partitions of a scan on the other side. The
>>>> optimal situation is when the join is a broadcast join and the table being
>>>> partition-pruned is on the probe side. In that case, by the time the probe
>>>> side starts, the filter will already have the results available and ready
>>>> for reuse.
>>>> 
>>>> 
>>>> Regarding the place in the optimizer rules, it's preferred to happen late
>>>> in the optimization, and definitely after join reorder.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Maryann
>>>> 
>>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin < r...@databricks.com > wrote:
>>>> 
>>>> 
>>>> 
>>>>> Whoever created the JIRA years ago didn't describe dpp correctly, but the
>>>>> linked jira in Hive was correct (which unfortunately is much more terse
>>>>> than any of the patches we have in Spark 
>>>>> https://issues.apache.org/jira/browse/HIVE-9152
>>>>> ). Henry R's description was also correct.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>>> Where can I find a design doc for dynamic partition pruning that explains
>>>>>> how it works?
>>>>>> 
>>>>>> 
>>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>>>> the implementation's approach. And the PR description also doesn't have
>>>>>> much information. It lists 3 cases for how a filter is built, but nothing
>>>>>> about the overall approach or design that helps when trying to find out
>>>>>> where it should be placed in the optimizer rules. It also isn't clear why
>>>>>> this couldn't be part of spark-catalyst.
>>>>>> 
>>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>> 
>>>>>> 
>>>>>>> dynamic partition pruning rule generates "hidden" filters that will be
>>>>>>> converted to real predicates at runtime, so it doesn't matter where we 
>>>>>>> run
>>>>>>> the rule.
>>>>>>> 
>>>>>>> 
>>>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's 
>>>>>>> better
>>>>>>> to run it before join reorder.
>>>>>>> 
>>>>>>> On Sun, Sep 29, 2019 a

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Reynold Xin
Can we just tag master?

On Wed, Oct 16, 2019 at 12:34 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> Does anybody remember what we did for 2.0 preview? Personally I'd like to
> avoid cutting branch-3.0 right now, otherwise we need to merge PRs into
> two branches in the following several months.
> 
> 
> Thanks,
> Wenchen
> 
> On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang < jiangxb1987@ gmail. com (
> jiangxb1...@gmail.com ) > wrote:
> 
> 
>> Hi Dongjoon,
>> 
>> 
>> I'm not sure about the best practice of maintaining a preview release
>> branch, since new features might still go into Spark 3.0 after preview
>> release, I guess it might make more sense to have separated  branches for
>> 3.0.0 and 3.0-preview.
>> 
>> 
>> However, I'm open to both solutions, if we really want to reuse the branch
>> to also release Spark 3.0.0, then I would be happy to create a new one.
>> 
>> 
>> Thanks!
>> 
>> 
>> Xingbo
>> 
>> Dongjoon Hyun < dongjoon. hyun@ gmail. com ( dongjoon.h...@gmail.com ) >
>> 于2019年10月16日周三 上午6:26写道:
>> 
>> 
>>> Hi, 
>>> 
>>> 
>>> It seems that we have `branch-3.0-preview` branch.
>>> 
>>> 
>>> https:/ / github. com/ apache/ spark/ commits/ branch-3. 0-preview (
>>> https://github.com/apache/spark/commits/branch-3.0-preview )
>>> 
>>> 
>>> 
>>> Can we have `branch-3.0` instead of `branch-3.0-preview`?
>>> 
>>> 
>>> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
>>> `v3.0.0` later.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>> 
>> 
> 
>

Re: Add spark dependency on on org.opencypher:okapi-shade.okapi

2019-10-16 Thread Reynold Xin
Just curious - did we discuss why this shouldn't be another Apache sister 
project?

On Wed, Oct 16, 2019 at 10:21 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> We don't all have to agree on whether to add this -- there are like 10
> people with an opinion -- and I certainly would not veto it. In practice a
> medium-sized changes needs someone to review/merge it all the way through,
> and nobody strongly objecting. I too don't know what to make of the
> situation; what happened to the supporters here?
> 
> 
> 
> I am concerned about maintenance, as inevitably any new module falls on
> everyone to maintain to some degree, and people come and go despite their
> intentions. But that isn't the substance of why I personally wouldn't
> merge it. Just doesn't seem like it must live in Spark. But again this is
> my opinion; you don't need to convince me, just need to
> (re?)-convince a shepherd, sponsor for this change.
> 
> 
> 
> Voting on the dependency part or whatever is also not important. It's a
> detail, and already merged even.
> 
> 
> 
> The issue to hand is: if nobody supports reviewing and merging the rest of
> the change, what then? we can't leave it half implemented. The fallback
> plan is just to back it out and reconsider later. This would be a poor
> outcome process-wise, but better than leaving it incomplete.
> 
> 
> 
> On Wed, Oct 16, 2019 at 3:15 AM Martin Junghanns
> < martin. junghanns@ neo4j. com ( martin.jungha...@neo4j.com ) > wrote:
> 
> 
>> 
>> 
>> I'm slightly confused about this discussion. I worked on all of the
>> aforementioned PRs: the module PR that has been merged, the current PR
>> that introduces the Graph API and the PoC PR that contains the full
>> implementation. The issues around shading were addressed and the module PR
>> eventually got merged. Two PMC members including the SPIP shepherd are
>> working with me (and others) on the current API PR. The SPIP to bring
>> Spark Graph into Apache Spark itself has been successfully voted on
>> earlier this year. I presented this work at Spark Summit in San Fransisco
>> in May and was asked by the organizers to present the topic at the
>> European Spark Summit. I'm currently sitting in the speakers room of that
>> conference preparing for the talk and reading this thread. I hope you
>> understand my confusion.
>> 
>> 
>> 
>> I admit - and Xiangrui pointed this out in the other thread, too - that we
>> made the early mistake of not bringing more Spark committers on board
>> which lead to a stagnation period during summer when Xiangrui wasn't
>> around to help review and bring progress to the project.
>> 
>> 
>> 
>> Sean, if your concern is the lack of maintainers of that module, I
>> personally would like to volunteer to maintain Spark Graph. I'm also a
>> contributor to the Okapi stack and am able to work on whatever issue might
>> come up on that end including updating dependencies etc. FWIW, Okapi is
>> actively maintained by a team at Neo4j.
>> 
>> 
>> 
>> Best, Martin
>> 
>> 
>> 
>> On Wed, 16 Oct 2019, 4:35 AM Sean Owen < srowen@ gmail. com (
>> sro...@gmail.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> I do not have a very informed opinion here, so take this with a grain of
>>> salt.
>>> 
>>> 
>>> 
>>> I'd say that we need to either commit a coherent version of this for Spark
>>> 3, or not at all. If it doesn't have support, I'd back out the existing
>>> changes.
>>> I was initially skeptical about how much this needs to be in Spark vs a
>>> third-party package, and that still stands.
>>> 
>>> 
>>> 
>>> The addition of another dependency isn't that big a deal IMHO, but, yes,
>>> it does add something to the maintenance overhead. But that's all the more
>>> true of a new module.
>>> 
>>> 
>>> 
>>> I don't feel strongly about it, but if this isn't obviously getting
>>> support from any committers, can we keep it as a third party library for
>>> now?
>>> 
>>> 
>>> 
>>> On Tue, Oct 15, 2019 at 8:53 PM Weichen Xu < weichen. xu@ databricks. com (
>>> weichen...@databricks.com ) > wrote:
>>> 
>>> 
 
 
 Hi Mats Rydberg,
 
 
 
 Although this dependency "org.opencypher:okapi-shade.okapi" was added into
 spark, but Xiangrui raised two concerns (see above mail) about it, so we'd
 better rethink on this and consider whether this is a good choice, so I
 call this vote.
 
 
 
 Thanks!
 
 
 
 On Tue, Oct 15, 2019 at 10:56 PM Mats Rydberg < mats@ neo4j. org. invalid (
 m...@neo4j.org.invalid ) > wrote:
 
 
> 
> 
> Hello Weichen, community
> 
> 
> 
> I'm sorry, I'm feeling a little bit confused about this vote. Is this
> about the PR ( https:/ / github. com/ apache/ spark/ pull/ 24490 (
> https://github.com/apache/spark/pull/24490 ) ) that was merged in early
> June and introduced the spark-graph module including the okapi-shade
> dependency?
> 
> 
> 
> Regarding the okapi-shade dependency which was deve

Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Reynold Xin
Does the description make sense? This is a preview release so there is no
need to retarget versions.

On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview.
>
> The vote is open until November 2 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.0-preview
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.0.0-preview-rc1 (commit
> 5eddbb5f1d9789696927f435c55df887e50a1389):
> https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1334/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
>
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Why Spark generates Java code and not Scala?

2019-11-09 Thread Reynold Xin
It’s mainly due to compilation speed. The Scala compiler is known to be slow.
Even javac is quite slow. We use Janino, which is a simpler compiler, to get
faster compilation speed at runtime.

Also, for low-level code we can’t use (due to perf concerns) any of the
edges Scala has over Java, e.g. we can’t use the Scala collection library,
functional programming, map/flatMap. So using Scala doesn’t really buy
anything even if there were no compilation-speed concerns.
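
For those curious what the generated code actually looks like, a minimal
PySpark sketch using the EXPLAIN CODEGEN command (the query is an arbitrary
example, and the exact output format varies by Spark version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Print the Java source emitted by whole-stage codegen for a trivial query.
    spark.sql("EXPLAIN CODEGEN SELECT id + 1 FROM range(10)").show(truncate=False)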

On Sat, Nov 9, 2019 at 9:52 AM Holden Karau  wrote:

>
> Switching this from user to dev
>
> On Sat, Nov 9, 2019 at 9:47 AM Bartosz Konieczny 
> wrote:
>
>> Hi there,
>>
>> Few days ago I got an intriguing but hard to answer question:
>> "Why Spark generates Java code and not Scala code?"
>> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>>
>> Since I'm not sure about the exact answer, I'd like to ask you to confirm
>> or not my thinking. I was looking for the reasons in the JIRA and the
>> research paper "Spark SQL: Relational Data Processing in Spark" (
>> http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
>> found nothing explaining why Java over Scala. The single task I found was
>> about why Scala and not Java but concerning data types (
>> https://issues.apache.org/jira/browse/SPARK-5193) That's why I'm writing
>> here.
>>
>> My guesses about choosing Java code are:
>> - Java runtime compiler libs are more mature and prod-ready than the
>> Scala's - or at least, they were at the implementation time
>> - Scala compiler tends to be slower than the Java's
>> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>>
> From the discussions when I was doing some code gen (in MLlib not SQL) I
> think this is the primary reason why.
>
>>
>> 
>> - Scala compiler seems to be more complex, so debugging & maintaining it
>> would be harder
>>
> this was also given as a secondary reason
>
>> - it was easier to represent a pure Java OO design than mixed FP/OO in
>> Scala
>>
> no one brought up this point. Maybe it was a consideration and it just
> wasn’t raised.
>
>> ?
>>
>> Thank you for your help.
>>
>>
>> --
>> Bartosz Konieczny
>> data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Spark 3.0 preview release 2?

2019-12-08 Thread Reynold Xin
If the cost is low, why don't we just do monthly previews until we code freeze? 
If it is high, maybe we should discuss and do it when there are people that 
volunteer 

On Sun, Dec 08, 2019 at 10:32 PM, Xiao Li < gatorsm...@gmail.com > wrote:

> 
> 
> 
> I got many great feedbacks from the community about the recent 3.0 preview
> release. Since the last 3.0 preview release, we already have 353 commits [
> https://github.com/apache/spark/compare/v3.0.0-preview...master (
> https://github.com/apache/spark/compare/v3.0.0-preview...master ) ]. There
> are various important features and behavior changes we want the community
> to try before entering the official release candidates of Spark 3.0. 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Below is my selected items that are not part of the last 3.0 preview but
> already available in the upstream master branch: 
> 
> 
> 
> 
> 
> 
> 
> * Support JDK 11 with Hadoop 2.7
> * Spark SQL will respect its own default format (i.e., parquet) when users
> do CREATE TABLE without USING or STORED AS clauses
> * Enable Parquet nested schema pruning and nested pruning on expressions
> by default
> * Add observable Metrics for Streaming queries
> * Column pruning through nondeterministic expressions
> * RecordBinaryComparator should check endianness when compared by long 
> * Improve parallelism for local shuffle reader in adaptive query execution
> 
> * Upgrade Apache Arrow to version 0.15.1
> * Various interval-related SQL support
> * Add a mode to pin Python thread into JVM's
> * Provide option to clean up completed files in streaming query
> 
> 
> 
> 
> 
> 
> 
> 
> I am wondering if we can have another preview release for Spark 3.0? This
> can help us find the design/API defects as early as possible and avoid the
> significant delay of the upcoming Spark 3.0 release
> 
> 
> 
> 
> 
> 
> 
> 
> Also, any committer is willing to volunteer as the release manager of the
> next preview release of Spark 3.0, if we have such a release? 
> 
> 
> 
> 
> 
> 
> 
> 
> Cheers,
> 
> 
> 
> 
> 
> 
> 
> 
> Xiao
> 
> 
>

Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Reynold Xin
We've pushed out 3.0 multiple times. The latest release window documented on 
the website ( http://spark.apache.org/versioning-policy.html ) says we'd code 
freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from 
the tragedy of the commons, that nobody is pushing for getting the release out. 
I understand the natural tendency for each individual is to finish or extend 
the feature/bug that the person has been working on. At some point we need to 
say "this is it" and get the release out. I'm happy to help drive this process.

To be realistic, I don't think we should just code freeze *today*. Although 
we have updated the website, contributors have all been operating under the 
assumption that all active developments are still going on. I propose we *cut 
the branch on Jan 31, and code freeze and switch over to bug squashing 
mode, and try to get the 3.0 official release out in Q1*. That is, by default 
no new features can go into the branch starting Jan 31.

What do you think?

And happy holidays everybody.

Re: [SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Reynold Xin
Can this perhaps exist as a utility function outside Spark?
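
As a rough illustration of what such a utility could look like outside Spark,
here is a sketch built only on the existing DataFrame API (this is not the API
proposed in the PR, just an assumption-laden example):

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import lit

    def diff(left: DataFrame, right: DataFrame) -> DataFrame:
        # Rows present only in `left` are tagged "left", rows present only in
        # `right` are tagged "right"; rows that appear in both (with the same
        # multiplicity) are dropped.
        only_left = left.exceptAll(right).withColumn("side", lit("left"))
        only_right = right.exceptAll(left).withColumn("side", lit("right"))
        return only_left.unionByName(only_right)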

On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev > 
wrote:

> 
> 
> 
> Hi Devs,
> 
> 
> 
> I'd like to get your thoughts on this Dataset feature proposal. Comparing
> datasets is a central operation when regression testing your code changes.
> 
> 
> 
> 
> It would be super useful if Spark's Datasets provide this transformation
> natively.
> 
> 
> 
> https:/ / github. com/ apache/ spark/ pull/ 26936 (
> https://github.com/apache/spark/pull/26936 )
> 
> 
> 
> Regards,
> Enrico
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-10 Thread Reynold Xin
Introducing a new data type has high overhead, both in terms of internal 
complexity and users' cognitive load. Introducing two data types would have 
even higher overhead.

I looked quickly and it looks like both Redshift and Snowflake, two of the most 
recent SQL analytics successes, have only one interval type, and don't support 
storing it. That gets me thinking that, in reality, storing interval types is not 
that useful.

Do we really need to do this? One of the worst things we can do as a community 
is to introduce features that are almost never used, but at the same time have 
high internal complexity for maintenance.

On Fri, Jan 10, 2020 at 10:45 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for clarification.
> 
> 
> Bests,
> Dongjoon.
> 
> On Fri, Jan 10, 2020 at 10:07 AM Kent Yao < yaooqinn@ qq. com (
> yaooq...@qq.com ) > wrote:
> 
> 
>> 
>> Hi Dongjoon,
>> 
>> 
>> Yes, As we want make CalenderIntervalType deprecated and so far, we just
>> find
>> 1. The make_interval function that produces legacy CalenderIntervalType
>> values, 
>> 2. `interval` -> CalenderIntervalType support in the parser
>> 
>> 
>> Thanks
>> 
>> 
>> *Kent Yao*
>> Data Science Center, Hangzhou Research Institute, Netease Corp.
>> PHONE: (86) 186-5715-3499
>> EMAIL: hzyaoqin@ corp. netease. com ( hzyao...@corp.netease.com )
>> 
>> 
>> On 01/11/2020 01:57 , Dongjoon Hyun (
>> dongjoon.h...@gmail.com ) wrote:
>> 
>>> Hi, Kent. 
>>> 
>>> 
>>> Thank you for the proposal.
>>> 
>>> 
>>> Does your proposal need to revert something from the master branch?
>>> I'm just asking because it's not clear in the proposal document.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Fri, Jan 10, 2020 at 5:31 AM Dr. Kent Yao < yaooqinn@ qq. com (
>>> yaooq...@qq.com ) > wrote:
>>> 
>>> 
 Hi, Devs
 
 I’d like to propose to add two new interval types which are year-month and
 
 day-time intervals for better ANSI support and future improvements. We
 will
 keep the current CalenderIntervalType but mark it as deprecated until we
 find the right time to remove it completely. The backward compatibility of
 
 the old interval type usages in 2.4 will be guaranteed.
 
 Here is the design doc:
 
 [SPIP] Support Year-Month and Day-Time Intervals -
 https:/ / docs. google. com/ document/ d/ 
 1JNRzcBk4hcm7k2cOXSG1A9U9QM2iNGQzBSXZzScUwAU/
 edit?usp=sharing (
 https://docs.google.com/document/d/1JNRzcBk4hcm7k2cOXSG1A9U9QM2iNGQzBSXZzScUwAU/edit?usp=sharing
 )
 
 All comments are welcome!
 
 Thanks,
 
 Kent Yao
 
 
 
 
 --
 Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
 ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
 
 -
 To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
 dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Reynold Xin
This seems reasonable!

On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1, I'm supporting the following proposal.
> 
> 
> > this mirror as the primary repo in the build, falling back to Central if
> needed.
> 
> 
> Thanks,
> Dongjoon.
> 
> 
> 
> On Tue, Jan 21, 2020 at 14:37 Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> See https:/ / github. com/ apache/ spark/ pull/ 27307 (
>> https://github.com/apache/spark/pull/27307 ) for some context. We've
>> had to add, in at least one place, some settings to resolve artifacts
>> from a mirror besides Maven Central to work around some build
>> problems.
>> 
>> Now, we find it might be simpler to just use this mirror as the
>> primary repo in the build, falling back to Central if needed.
>> 
>> The question is: any objections to that?
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread Reynold Xin
Thanks for writing this up. 

Usually when people talk about push-based shuffle, they are motivating it 
primarily to reduce the latency of short queries, by pipelining the map phase, 
shuffle phase, and the reduce phase (which this design isn't going to address). 
It's interesting you are targeting throughput by optimizing for random reads 
instead.

My main questions are ...

1. This is designed for HDDs. But SSD prices have gone lower than HDDs this 
year, so most new data center storage will be using SSDs from now on. Are we 
introducing a lot of complexity to address a problem that only exists with 
legacy hardware that will be phased out soon?

2. Is there a simpler way to address this? E.g. you can simply merge map 
outputs for each node locally, without involving any type of push. It seems to 
me you'd address the same issues you have, with the same limitations (of memory 
buffer limiting the number of concurrent streams you can write to).

On Tue, Jan 21, 2020 at 6:13 PM, mshen < ms...@apache.org > wrote:

> 
> 
> 
> I'd like to start a discussion on enabling push-based shuffle in Spark.
> This is meant to address issues with existing shuffle inefficiency in a
> large-scale Spark compute infra deployment.
> Facebook's previous talks on SOS shuffle
> < https:/ / databricks. com/ session/ sos-optimizing-shuffle-i-o (
> https://databricks.com/session/sos-optimizing-shuffle-i-o ) > and Cosco
> shuffle service
> < https:/ / databricks. com/ session/ 
> cosco-an-efficient-facebook-scale-shuffle-service
> (
> https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service
> ) > are solutions dealing with a similar problem.
> Note that this is somewhat orthogonal to the work in SPARK-25299
> < https:/ / issues. apache. org/ jira/ browse/ SPARK-25299 (
> https://issues.apache.org/jira/browse/SPARK-25299 ) > , which is to use
> remote storage to store shuffle data.
> More details of our proposed design is in SPARK-30602
> < https:/ / issues. apache. org/ jira/ browse/ SPARK-30602 (
> https://issues.apache.org/jira/browse/SPARK-30602 ) > , with SPIP attached.
> Would appreciate comments and discussions from the community.
> 
> 
> 
> -
> Min Shen
> Staff Software Engineer
> LinkedIn
> --
> Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
> ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-01-21 Thread Reynold Xin
If your UDF itself is very CPU intensive, it probably won't make that much of a 
difference, because the UDF itself will dwarf the serialization/deserialization 
overhead.

If your UDF is cheap, it will help tremendously.
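
To make the trade-off concrete, a small sketch comparing a built-in expression
with an equivalent Python UDF. The thread is about a custom scorer implemented
as a native Catalyst expression in Scala; the built-in levenshtein function is
used here only to illustrate the per-row serialization overhead:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, levenshtein, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("kitten", "sitting")], ["a", "b"])

    # Built-in expression: evaluated in the JVM, no per-row Python serialization.
    df.select(levenshtein(col("a"), col("b")).alias("dist")).show()

    # Equivalent Python UDF: each row is serialized to a Python worker and back,
    # which dominates the cost when the function itself is cheap.
    @udf(IntegerType())
    def py_levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    df.select(py_levenshtein(col("a"), col("b")).alias("dist")).show()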

On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
>  
> 
> 
> 
> I read online [1] that for the best UDF performance it is possible to
> implement UDFs using internal Spark expressions, and I also saw a couple
> of pull requests such as [2] and [3] where this was put into practice (not
> sure if for that reason or just to extend the API).
> 
> 
> 
>  
> 
> 
> 
> We have an algorithm that computes a score similar to what the Levenshtein
> distance does and it takes about 30%-40% of the overall time of our job.
> We are looking for ways to improve it without adding more resources.
> 
> 
> 
>  
> 
> 
> 
> I was wondering if it would be advisable to implement it by extending 
> BinaryExpression
> like [1], and if it would result in any performance gains.
> 
> 
> 
>  
> 
> 
> 
> Thanks for your help!
> 
> 
> 
>  
> 
> 
> 
> [1] https:/ / hackernoon. com/ 
> apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
> (
> https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
> )
> 
> 
> 
> [2] https:/ / github. com/ apache/ spark/ pull/ 7214 (
> https://github.com/apache/spark/pull/7214 )
> 
> 
> 
> [3] https:/ / github. com/ apache/ spark/ pull/ 7236 (
> https://github.com/apache/spark/pull/7236 )
> 
> 
>

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-01-29 Thread Reynold Xin
t;>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>>> default
>>>>> SPARK-27296 Efficient User Defined Aggregators
>>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>>> cause driver pods to hang
>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>>> barrier stage
>>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>>> Aggregate
>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>>> SPARK-26425 Add more constraint checks in file streaming source to avoid
>>>>> checkpoint corruption
>>>>> SPARK-25843 Redesign rangeBetween API
>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>>> SPARK-25752 Add trait to easily whitelist logical operators that produce
>>>>> named output from CleanupAliases
>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>>> aggregate
>>>>> SPARK-25531 new write APIs for data source v2
>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>>>> 
>>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>>> MesosFineGrainedSchedulerBackend
>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>>> execution mode
>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>>> Spec
>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>>> Spark
>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list
>>>>> of structures
>>>>> SPARK-22386 Data Source V2 improvements
>>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin < rxin@ databricks. com (
>>>>> r...@databricks.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> We've pushed out 3.0 multiple times. The latest release window documented
>>>>>> on the website ( http://spark.apache.org/versioning-policy.html ) says
>>>>>> we'd code freeze and cut branch-3.0 early Dec. It looks like we are
>>>>>> suffering a bit from the tragedy of the commons, that nobody is pushing
>>>>>> for getting the release out. I understand the natural tendency for each
>>>>>> individual is to finish or extend the feature/bug that the person has 
>>>>>> been
>>>>>> working on. At some point we need to say "this is it" and get the release
>>>>>> out. I'm happy to help drive this process.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> To be realistic, I don't think we should just code freeze * today *.
>>>>>> Although we have updated the website, contributors have all been 
>>>>>> operating
>>>>>> under the assumption that all active developments are still going on. I
>>>>>> propose we *cut the branch on* *Jan 31* *, and code freeze and switch 
>>>>>> over
>>>>>> to bug squashing mode, and try to get the 3.0 official release out in 
>>>>>> Q1*.
>>>>>> That is, by default no new features can go into the branch starting Jan 
>>>>>> 31
>>>>>> .
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> What do you think?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> And happy holidays everybody.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Databricks Summit - Watch the talks (
>>>> https://databricks.com/sparkaisummit/north-america ) 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> --
>> ---
>> Takeshi Yamamuro
>> 
> 
>

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-01 Thread Reynold Xin
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get 
the release out!

On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> Just a reminder - code freeze is coming this Fri !
> 
> 
> 
> There can always be exceptions, but those should be exceptions and
> discussed on a case by case basis rather than becoming the norm.
> 
> 
> 
> 
> 
> 
> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim < kabhwan. opensource@ gmail.
> com ( kabhwan.opensou...@gmail.com ) > wrote:
> 
>> Jan 31 sounds good to me.
>> 
>> 
>> Just curious, do we allow some exception on code freeze? One thing came
>> into my mind is that some feature could have multiple subtasks and part of
>> subtasks have been merged and other subtask(s) are in reviewing. In this
>> case do we allow these subtasks to have more days to get reviewed and
>> merged later?
>> 
>> 
>> Happy Holiday!
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro < linguin. m. s@ gmail. com
>> ( linguin@gmail.com ) > wrote:
>> 
>> 
>>> Looks nice, happy holiday, all!
>>> 
>>> 
>>> Bests,
>>> Takeshi
>>> 
>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> 
>>>> +1 for January 31st.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li < lixiao@ databricks. com (
>>>> lix...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> Jan 31 is pretty reasonable. Happy Holidays! 
>>>>> 
>>>>> 
>>>>> Xiao
>>>>> 
>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen < srowen@ gmail. com (
>>>>> sro...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? it's all 
>>>>>> arbitrary
>>>>>> but indeed this has been in progress for a while, and there's a downside
>>>>>> to not releasing it, to making the gap to 3.0 larger. 
>>>>>> On my end I don't know of anything that's holding up a release; is it
>>>>>> basically DSv2?
>>>>>> 
>>>>>> BTW these are the items still targeted to 3.0.0, some of which may not
>>>>>> have been legitimately tagged. It may be worth reviewing what's still 
>>>>>> open
>>>>>> and necessary, and what should be untargeted.
>>>>>> 
>>>>>> 
>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>> SPARK-29345 Add an API that allows a user to define and observe arbitrary
>>>>>> metrics on streaming queries
>>>>>> SPARK-29348 Add observable metrics
>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after
>>>>>> some operations
>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>>> relation table properly
>>>>>> SPARK-27986 Support Aggregate Expressions with filter
>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable 
>>>>>> smoother
>>>>>> upgrade
>>>>>> SPARK-27714 Support Jo

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Reynold Xin
This is really cool. We should also be more opinionated about how we specify 
time and intervals.

On Wed, Feb 12, 2020 at 3:15 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you, Wenchen.
> 
> 
> The new policy looks clear to me. +1 for the explicit policy.
> 
> 
> So, are we going to revise the existing conf names before 3.0.0 release?
> 
> 
> Or, is it applied to new up-coming configurations from now?
> 
> 
> Bests,
> Dongjoon.
> 
> On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> Hi all,
>> 
>> 
>> I'd like to discuss the naming policy of Spark configs, as for now it
>> depends on personal preference, which leads to inconsistent naming.
>> 
>> 
>> In general, the config name should be a noun that describes its meaning
>> clearly.
>> Good examples:
>> spark.sql.session.timeZone
>> 
>> spark.sql.streaming.continuous.executorQueueSize
>> 
>> spark.sql.statistics.histogram.numBins
>> 
>> Bad examples:
>> spark.sql.defaultSizeInBytes (default size for what?)
>> 
>> 
>> 
>> Also note that, config name has many parts, joined by dots. Each part is a
>> namespace. Don't create namespace unnecessarily.
>> Good example:
>> spark.sql.execution.rangeExchange.sampleSizePerPartition
>> 
>> spark.sql.execution.arrow.maxRecordsPerBatch
>> 
>> Bad examples:
>> spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a
>> useful namespace; better to be .buffer.inMemoryThreshold)
>> 
>> 
>> 
>> For a big feature, usually we need to create an umbrella config to turn it
>> on/off, and other configs for fine-grained controls. These configs should
>> share the same namespace, and the umbrella config should be named like 
>> featureName.enabled
>> . For example:
>> spark.sql.cbo.enabled
>> 
>> spark.sql.cbo.starSchemaDetection
>> 
>> spark.sql.cbo.starJoinFTRatio
>> spark.sql.cbo.joinReorder.enabled
>> spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
>> 
>> spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)
>> 
>> 
>> 
>> 
>> For boolean configs, in general it should end with a verb, e.g. 
>> spark.sql.join.preferSortMergeJoin
>> . If the config is for a feature and you can't find a good verb for the
>> feature, featureName.enabled is also good.
>> 
>> 
>> I'll update https:/ / spark. apache. org/ contributing. html (
>> https://spark.apache.org/contributing.html ) after we reach a consensus
>> here. Any comments are welcome!
>> 
>> 
>> Thanks,
>> Wenchen
>> 
> 
>

Re: [DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Reynold Xin
It's a good discussion to have though: should we deprecate DStream, and what do 
we need to do to make that happen? My experience working with a lot of Spark 
users is that in general I recommend they stay away from DStream, due to a 
lot of design and architectural issues.

On Mon, Mar 02, 2020 at 9:32 AM, Prashant Sharma < scrapco...@gmail.com > wrote:

> 
> I may have speculated, or believed the unauthorised sources, nevertheless
> I am happy to be corrected. 
> 
> On Mon, Mar 2, 2020 at 8:05 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> Er, who says it's deprecated? I have never heard anything like that.
>> Why would it be?
>> 
>> On Mon, Mar 2, 2020 at 4:52 AM Prashant Sharma < scrapcodes@ gmail. com (
>> scrapco...@gmail.com ) > wrote:
>> >
>> > Hi All,
>> >
>> > It is noticed that some users of Spark Streaming do not
>> immediately realise that it is a deprecated component, and it would be
>> scary if they end up with it in production. Now that we are in a position
>> to release Spark 3.0.0, maybe we should discuss - should Spark
>> Streaming carry an explicit notice that it is deprecated and not under
>> active development?
>> >
>> > I have opened an issue already, but I think a mailing list discussion
>> would be more appropriate. https:/ / issues. apache. org/ jira/ browse/ 
>> SPARK-31006
>> ( https://issues.apache.org/jira/browse/SPARK-31006 )
>> >
>> > Thanks,
>> > Prashant.
>> >
> 
> 
>

smime.p7s
Description: S/MIME Cryptographic Signature


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Reynold Xin
+1

On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzh...@apache.org > wrote:

> 
> +1 (non-binding)
> 
> 
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heuermh@ gmail. com (
> heue...@gmail.com ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> I am disappointed however that this only mentions API and not dependencies
>> and transitive dependencies.
>> 
>> 
>> As Spark does not provide separation between its runtime classpath and the
>> classpath used by applications, I believe Spark's dependencies and
>> transitive dependencies should be considered part of the API for this
>> policy.  Breaking dependency upgrades and incompatible dependency versions
>> are the source of much frustration.
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN < ueshin@ happy-camper. st (
>>> ues...@happy-camper.st ) > wrote:
>>> 
>>> +1 (binding)
>>> 
>>> 
>>> 
>>> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang < jiangxb1987@ gmail. com (
>>> jiangxb1...@gmail.com ) > wrote:
>>> 
>>> 
 +1 (non-binding)
 
 
 Cheers,
 
 
 Xingbo
 
 On Mon, Mar 9, 2020 at 9:35 AM Xiao Li < lixiao@ databricks. com (
 lix...@databricks.com ) > wrote:
 
 
> +1 (binding)
> 
> 
> Xiao
> 
> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee < denny. g. lee@ gmail. com (
> denny.g@gmail.com ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon < gurwls223@ gmail. com (
>> gurwls...@gmail.com ) > wrote:
>> 
>> 
>>> The proposal itself seems good as the factors to consider, Thanks 
>>> Michael.
>>> 
>>> 
>>> Several concerns mentioned look good points, in particular:
>>> 
>>> > ... assuming that this is for public stable APIs, not APIs that are
>>> marked as unstable, evolving, etc. ...
>>> I would like to confirm this. We already have API annotations such as
>>> Experimental, Unstable, etc. and the implication of each is still
>>> effective. If it's for stable APIs, it makes sense to me as well.
>>> 
>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>> proposing to diverge from semver. ...
>>> 
>>> I think this is a good point. If we're proposing to divert from semver,
>>> the delta compared to semver will have to be clarified to avoid 
>>> different
>>> personal interpretations of the somewhat general principles.
>>> 
>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>> Apache Spark 3.0+? ...
>>> 
>>> Assuming these concerns will be addressed, +1 (binding).
>>> 
>>>  
>>> 2020년 3월 9일 (월) 오후 4:53, Takeshi Yamamuro < linguin. m. s@ gmail. com (
>>> linguin@gmail.com ) >님이 작성:
>>> 
>>> 
 +1 (non-binding)
 
 
 Bests,
 Takeshi
 
 On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang < gengliang. wang@ 
 databricks.
 com ( gengliang.w...@databricks.com ) > wrote:
 
 
> +1 (non-binding)
> 
> 
> Gengliang
> 
> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia < matei. zaharia@ 
> gmail. com
> ( matei.zaha...@gmail.com ) > wrote:
> 
> 
>> +1 as well.
>> 
>> 
>> Matei
>> 
>> 
>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan < cloud0fan@ gmail. com (
>>> cloud0...@gmail.com ) > wrote:
>>> 
>>> +1 (binding), assuming that this is for public stable APIs, not 
>>> APIs that
>>> are marked as unstable, evolving, etc.
>>> 
>>> 
>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía < iemejia@ gmail. com (
>>> ieme...@gmail.com ) > wrote:
>>> 
>>> 
 +1 (non-binding)
 
 Michael's section on the trade-offs of maintaining / removing an 
 API are
 one of
 the best reads I have seeing in this mailing list. Enthusiast +1
 
 On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun < dongjoon. hyun@ 
 gmail. com (
 dongjoon.h...@gmail.com ) > wrote:
 >
 > This new policy has a good indention, but can we narrow down on 
 > the
 migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
 >
 > I saw that there already exists a reverting PR to bring back 
 > Spark 1.4
 and 1.5 APIs based on this AS-IS suggestion.
 >
 > The AS-IS policy is clearly mentioning that JVM/Scala-level 
 > difficulty,
 and it's nice.
 >
 > However, for the other cases, it sounds like `recommending older 
 > APIs as
 much as possible` due to the following.
 >
 >      

Re: FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Reynold Xin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of
both new and old users?

For old users, their old code that was working for char(3) would now stop
working.

For new users, depending on whether the underlying metastore char(3) is
either supported but different from ansi Sql (which is not that big of a
deal if we explain it) or not supported.

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark has suffered from a known consistency issue in `CHAR`
> type behavior among its usages and configurations. However, the evolution
> direction has been gradually moving toward being consistent inside Apache
> Spark, because we don't have `CHAR` officially. The following is the summary.
>
> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` has the following different result.
> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 2.4.0, `STORED AS ORC` became consistent.
> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
> behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
> consistent.
> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
> fallback to Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a 2
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
> following syntax to be safe.
>
> CREATE TABLE t(a CHAR(3));
> https://github.com/apache/spark/pull/27902
>
> This email is sent out to inform you based on the new policy we voted.
> The recommendation is always using Apache Spark's native type `String`.
>
> Bests,
> Dongjoon.
>
> References:
> 1. "CHAR implementation?", 2017/09/15
>
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of 
databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html : "Snowflake currently deviates from common CHAR semantics in that strings 
shorter than the maximum length are not space-padded at the end."

MySQL: 
https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> 
> 
> Please see the following for the context.
> 
> 
> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
> https://issues.apache.org/jira/browse/SPARK-31136 )
> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax"
> 
> 
> I raised the above issue according to the new rubric, and the banning was
> the proposed alternative to reduce the potential issue.
> 
> 
> Please give us your opinion since it's still PR.
> 
> 
> Bests,
> Dongjoon.
> 
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>> of both new and old users?
>> 
>> 
>> For old users, their old code that was working for char(3) would now stop
>> working. 
>> 
>> 
>> For new users, depending on whether the underlying metastore char(3) is
>> either supported but different from ansi Sql (which is not that big of a
>> deal if we explain it) or not supported. 
>> 
>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>> 
>>> Hi, All.
>>> 
>>> Apache Spark has been suffered from a known consistency issue on `CHAR`
>>> type behavior among its usages and configurations. However, the evolution
>>> direction has been gradually moving forward to be consistent inside Apache
>>> Spark because we don't have `CHAR` offically. The following is the
>>> summary.
>>> 
>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>> Hive behavior.)
>>> 
>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>> 
>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>> behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>> consistent.
>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>> following syntax to be safe.
>>> 
>>>     CREATE TABLE t(a CHAR(3));
>>>    https:/ / github. com/ apache/ spark/ pull/ 27902 (
>>> https://github.com/apache/spark/pull/27902 )
>>> 
>>> This email is sent out to inform you based on the new policy we voted.
>>> The recommendation is always using Apache Spark's native type `String`.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> References:
>>> 1. "CHAR implementation?", 2017/09/15
>>>      https:/ / lists. apache. org/ thread. html/ 
>>> 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>> )
>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>    https:/ / lists. apache. org/ thread. html/ 
>>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> )
>>> 
>> 
>> 
> 
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I haven't spent enough time thinking about it to give a strong opinion, but 
this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently 
swapped the two arguments. And trim is also quite commonly used from a long 
time ago.

CHAR is an undocumented data type without clearly defined semantics. It's not 
great that we are changing the value here, but the value is already fucked up. 
It depends on the underlying data source, and random configs that are seemingly 
unrelated (orc) would impact the value.

On Mon, Mar 16, 2020 at 4:01 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> (And +Michael Armbrust)
> 
> 
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted `TRIM` functions then?
> 
> 
> > Are we sure "not padding" is "incorrect"?
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> 
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav. sengupta@ gmail.
> com ( gourav.sengu...@gmail.com ) > wrote:
> 
> 
>> Hi,
>> 
>> 
>> 100% agree with Reynold.
>> 
>> 
>> 
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> 
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>> 
>>> 
>>> 
>>> https:/ / docs. snowflake. net/ manuals/ sql-reference/ data-types-text. 
>>> html
>>> (
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#:~:text=CHAR%20%2C%20CHARACTER,(1)%20is%20the%20default.&text=Snowflake%20currently%20deviates%20from%20common,space%2Dpadded%20at%20the%20end.
>>> ) : "Snowflake currently deviates from common CHAR semantics in that
>>> strings shorter than the maximum length are not space-padded at the end."
>>> 
>>> 
>>> 
>>> MySQL: https:/ / stackoverflow. com/ questions/ 53528645/ 
>>> why-char-dont-have-padding-in-mysql
>>> (
>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>> )
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>>> Hi, Reynold.
>>>> 
>>>> 
>>>> Please see the following for the context.
>>>> 
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
>>>> https://issues.apache.org/jira/browse/SPARK-31136 )
>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>> syntax"
>>>> 
>>>> 
>>>> I raised the above issue according to the new rubric, and the banning was
>>>> the proposed alternative to reduce the potential issue.
>>>> 
>>>> 
>>>> Please give us your opinion since it's still PR.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>> of both new and old users?
>>>>> 
>>>>> 
>>>>> For old users, their old code that was working for char(3) would now stop
>>>>> working. 
>>>>> 
>>>>> 
>>>>> For new users, depending on whether the underlying metastore char(3) is
>>>>> either supported but different from ansi Sql (which is not that big of a
>>>>> deal if we explain it) or not supported. 
>>>>> 
>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>>>> ( dongjoon.h...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> Apache Spark has been suffered from a known consistency issue on `

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I looked up our usage logs (sorry I can't share this publicly) and trim has at 
least four orders of magnitude higher usage than char.

On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you, Stephen and Reynold.
> 
> 
> To Reynold.
> 
> 
> The way I see the following is a little different.
> 
> 
>       > CHAR is an undocumented data type without clearly defined
> semantics.
> 
> Let me describe in Apache Spark User's View point.
> 
> 
> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
> Apache Spark 1.x without much documentation. In addition, there still
> exists an effort which is trying to keep it in 3.0.0 age.
> 
>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
> https://issues.apache.org/jira/browse/SPARK-31088 )
>        Add back HiveContext and createExternalTable
> 
> Historically, we tried to make many SQL-based customer migrate their
> workloads from Apache Hive into Apache Spark through `HiveContext`.
> 
> Although Apache Spark didn't have a good document about the inconsistent
> behavior among its data sources, Apache Hive has been providing its
> documentation and many customers rely the behavior.
> 
>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
> LanguageManual+Types
> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
> 
> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
> many existing huge tables were created by Apache Hive, not Apache Spark.
> And, Apache Spark is used for boosting SQL performance with its *caching*.
> This was true because Apache Spark was added into the Hadoop-vendor
> products later than Apache Hive.
> 
> 
> Until the turning point at Apache Spark 2.0, we tried to catch up more
> features to be consistent at least with Hive tables in Apache Hive and
> Apache Spark because two SQL engines share the same tables.
> 
> For the following, technically, while Apache Hive doesn't changed its
> existing behavior in this part, Apache Spark evolves inevitably by moving
> away from the original Apache Spark old behaviors one-by-one.
> 
> 
>       >  the value is already fucked up
> 
> 
> The following is the change log.
> 
>       - When we switched the default value of `convertMetastoreParquet`.
> (at Apache Spark 1.2)
>       - When we switched the default value of `convertMetastoreOrc` (at
> Apache Spark 2.4)
>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
> `PARQUET` table at Apache Spark 3.0)
> 
> To sum up, this has been a well-known issue in the community and among the
> customers.
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au (
> s...@infomedia.com.au ) > wrote:
> 
> 
>> Hi there,
>> 
>> 
>> I’m kind of new around here, but I have had experience with all of all the
>> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>> Server as well as Postgresql.
>> 
>> 
>> They all support the notion of “ANSI padding” for CHAR columns - which
>> means that such columns are always space padded, and they default to
>> having this enabled (for ANSI compliance).
>> 
>> 
>> MySQL also supports it, but it defaults to leaving it disabled for
>> historical reasons not unlike what we have here.
>> 
>> 
>> In my opinion we should push toward standards compliance where possible
>> and then document where it cannot work.
>> 
>> 
>> If users don’t like the padding on CHAR columns then they should change to
>> VARCHAR - I believe that was its purpose in the first place, and it does
>> not dictate any sort of “padding".
>> 
>> 
>> I can see why you might “ban” the use of CHAR columns where they cannot be
>> consistently supported, but VARCHAR is a different animal and I would
>> expect it to work consistently everywhere.
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> Steve C
>> 
>> 
>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>> dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>> 
>>> 
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>>> 
>>> 
>>> > Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was 
merely pointing out that if we deviate away from SQL standard in any way we are 
considered "wrong" or "incorrect". That argument itself is flawed when plenty 
of other popular database systems also deviate away from the standard on this 
specific behavior.

On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> 
> 
> 
> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
> ( dongjoon.h...@gmail.com ) > wrote:
> 
>> Thank you, Stephen and Reynold.
>> 
>> 
>> To Reynold.
>> 
>> 
>> The way I see the following is a little different.
>> 
>> 
>>       > CHAR is an undocumented data type without clearly defined
>> semantics.
>> 
>> Let me describe in Apache Spark User's View point.
>> 
>> 
>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
>> Apache Spark 1.x without much documentation. In addition, there still
>> exists an effort which is trying to keep it in 3.0.0 age.
>> 
>>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
>> https://issues.apache.org/jira/browse/SPARK-31088 )
>>        Add back HiveContext and createExternalTable
>> 
>> Historically, we tried to make many SQL-based customer migrate their
>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>> 
>> Although Apache Spark didn't have a good document about the inconsistent
>> behavior among its data sources, Apache Hive has been providing its
>> documentation and many customers rely the behavior.
>> 
>>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
>> LanguageManual+Types
>> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
>> 
>> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
>> many existing huge tables were created by Apache Hive, not Apache Spark.
>> And, Apache Spark is used for boosting SQL performance with its *caching*.
>> This was true because Apache Spark was added into the Hadoop-vendor
>> products later than Apache Hive.
>> 
>> 
>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>> features to be consistent at least with Hive tables in Apache Hive and
>> Apache Spark because two SQL engines share the same tables.
>> 
>> For the following, technically, while Apache Hive doesn't changed its
>> existing behavior in this part, Apache Spark evolves inevitably by moving
>> away from the original Apache Spark old behaviors one-by-one.
>> 
>> 
>>       >  the value is already fucked up
>> 
>> 
>> The following is the change log.
>> 
>>       - When we switched the default value of `convertMetastoreParquet`.
>> (at Apache Spark 1.2)
>>       - When we switched the default value of `convertMetastoreOrc` (at
>> Apache Spark 2.4)
>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
>> `PARQUET` table at Apache Spark 3.0)
>> 
>> To sum up, this has been a well-known issue in the community and among the
>> customers.
>> 
>> Bests,
>> Dongjoon.
>> 
>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au (
>> s...@infomedia.com.au ) > wrote:
>> 
>> 
>>> Hi there,
>>> 
>>> 
>>> I’m kind of new around here, but I have had experience with all of all the
>>> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>>> Server as well as Postgresql.
>>> 
>>> 
>>> They all support the notion of “ANSI padding” for CHAR columns - which
>>> means that such columns are always space padded, and they default to
>>> having this enabled (for ANSI compliance).
>>> 
>>> 
>>> MySQL also supports it, but it defaults to leaving it disabled for
>>> historical reasons not unlike what we have here.
>>> 
>>> 
>>> In my opinion we should push toward standards compliance where possible
>>> and then document where it cannot work.
>>> 
>>> 
>>> If users don’t like the padding on CHAR columns then they should change to
>>> VARCHAR - I believe that was its purpose in the first place, and it does
>>> not dictate any sort of “padding".
>>> 
>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
−User

char barely showed up (honestly negligible). I was comparing select vs select.

On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
> statements with `CHAR`?
> 
> > I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> We need to discuss more about what to do. This thread is what I expected
> exactly. :)
> 
> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
> it). I was merely pointing out that if we deviate away from SQL standard
> in any way we are considered "wrong" or "incorrect". That argument itself
> is flawed when plenty of other popular database systems also deviate away
> from the standard on this specific behavior.
> 
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
>> I was merely pointing out that if we deviate away from SQL standard in any
>> way we are considered "wrong" or "incorrect". That argument itself is
>> flawed when plenty of other popular database systems also deviate away
>> from the standard on this specific behavior.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>>> I looked up our usage logs (sorry I can't share this publicly) and trim
>>> has at least four orders of magnitude higher usage than char.
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>>> Thank you, Stephen and Reynold.
>>>> 
>>>> 
>>>> To Reynold.
>>>> 
>>>> 
>>>> The way I see the following is a little different.
>>>> 
>>>> 
>>>>       > CHAR is an undocumented data type without clearly defined
>>>> semantics.
>>>> 
>>>> Let me describe in Apache Spark User's View point.
>>>> 
>>>> 
>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
>>>> Apache Spark 1.x without much documentation. In addition, there still
>>>> exists an effort which is trying to keep it in 3.0.0 age.
>>>> 
>>>>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
>>>> https://issues.apache.org/jira/browse/SPARK-31088 )
>>>>        Add back HiveContext and createExternalTable
>>>> 
>>>> Historically, we tried to make many SQL-based customer migrate their
>>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>> 
>>>> Although Apache Spark didn't have a good document about the inconsistent
>>>> behavior among its data sources, Apache Hive has been providing its
>>>> documentation and many customers rely the behavior.
>>>> 
>>>>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
>>>> LanguageManual+Types
>>>> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
>>>> 
>>>> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
>>>> many existing huge tables were created by Apache Hive, not Apache Spark.
>>>> And, Apache Spark is used for boosting SQL performance with its *caching*.
>>>> This was true because Apache Spark was added into the Hadoop-vendor
>>>> products later than Apache Hive.
>>>> 
>>>> 
>>>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>>>> features to be consistent at least with Hive tables in Apache Hive and
>>>> Apache Spark because two SQL engines share the same tables.
>>>> 
>>>> For the following, technically, while Apache Hive doesn't changed its
>>>> existing behavior in this part, Apache Spark evolves inevitably by moving
>>>> away from the original Apache Spark old behaviors one-by-one.
>>>> 
>>>> 
>>>>       >  the value is already fucked up
>>>> 
>>>> 
>>>> The following is the change log.
>>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
For sure.

There's another reason I feel char is not that important and it's more 
important to be internally consistent (e.g. all data sources support it with 
the same behavior, vs one data source doing one thing and another doing the 
other). char was created at a time when CPUs were slow and storage was expensive, 
and being able to pack things nicely at fixed length was highly useful. The 
fact that it was padded was initially done for performance, not for the padding 
itself. A lot has changed since char was invented, and with modern technologies 
(columnar, dictionary encoding, etc.) there is little reason to use a char data 
type for anything. As a matter of fact, Spark internally converts the char type to 
string to work with.

I see two solutions really.

1. We require padding, and ban all uses of char when it is not properly padded. 
This would ban all the native data sources, which are the primary way people 
use Spark. This leaves char support only for tables going through Hive 
serdes, which are slow to begin with. It is basically Dongjoon and Wenchen's 
suggestion. This turns char support into a compatibility feature only for some 
Hive tables that cannot be converted into Spark native data sources. This leads to 
confusing end-user behavior because depending on whether that Hive table is 
converted into Spark native data sources, we might or might not support the char 
type.

An extension to the above is to introduce padding for the char type across the 
board, and make char a first-class data type. There is a lot of work to 
introduce another data type, especially for one that has virtually no usage ( 
https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string ) 
and whose usage will likely continue to decline in the future (just reasoning from 
first principles based on why char was introduced in the first place).

Now I'm assuming it's a lot of work to do char properly. But if it is not the 
case (e.g. just a simple rule to insert padding at planning time), then maybe 
it's worth doing it this way. I'm totally OK with this too.
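
As a rough illustration only (not an actual analyzer rule), read-side padding for 
the t1 example earlier in the thread boils down to an rpad on the string-backed 
column, e.g.:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, length, rpad}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // pad the string-backed CHAR(3) column on read
    val padded = spark.table("t1").withColumn("a", rpad(col("a"), 3, " "))
    padded.select(col("a"), length(col("a"))).show()  // length comes back as 3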

What I'd oppose is to just ban char for the native data sources, and do not 
have a plan to address this problem systematically.

2. Just forget about padding, like what Snowflake and MySQL have done. Document 
that char(x) is just an alias for string. And then move on. Almost no work 
needs to be done...

On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for sharing and confirming.
> 
> 
> We had better consider all heterogeneous customers in the world. And, I
> also have experiences with the non-negligible cases in on-prem.
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> −User
>> 
>> 
>> 
>> char barely showed up (honestly negligible). I was comparing select vs
>> select.
>> 
>> 
>> 
>> 
>> 
>> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>>> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
>>> statements with `CHAR`?
>>> 
>>> > I looked up our usage logs (sorry I can't share this publicly) and trim
>>> has at least four orders of magnitude higher usage than char.
>>> 
>>> We need to discuss more about what to do. This thread is what I expected
>>> exactly. :)
>>> 
>>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>> it). I was merely pointing out that if we deviate away from SQL standard
>>> in any way we are considered "wrong" or "incorrect". That argument itself
>>> is flawed when plenty of other popular database systems also deviate away
>>> from the standard on this specific behavior.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < rxin@ databricks. com (
>>> r...@databricks.com ) > wrote:
>>> 
>>> 
>>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
>>>> I was merely pointing out that if we deviate away from SQL standard in any
>>>> way we are considered "wrong" or "incorrect". That argument itself is
>>>> flawed when plenty of other popular database systems also deviate away
>>>> from the standard on this specific behavior.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
You are joking when you said " informed widely and discussed in many ways 
twice" right?

This thread doesn't even talk about char/varchar: 
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

(Yes it talked about changing the default data source provider, but that's just 
one of the ways we are exposing this char/varchar issue).

On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1 for Wenchen's suggestion.
> 
> I believe that the difference and effects are informed widely and
> discussed in many ways twice.
> 
> First, this was shared on last December.
> 
>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>    https:/ / lists. apache. org/ thread. html/ 
> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> )
> 
> Second (at this time in this thread), this has been discussed according to
> the new community rubric.
> 
>     - https:/ / spark. apache. org/ versioning-policy. html (
> https://spark.apache.org/versioning-policy.html ) (Section: "Considerations
> When Breaking APIs")
> 
> 
> Thank you all.
> 
> 
> Bests,
> Dongjoon.
> 
> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> OK let me put a proposal here:
>> 
>> 
>> 1. Permanently ban CHAR for native data source tables, and only keep it
>> for Hive compatibility.
>> It's OK to forget about padding like what Snowflake and MySQL have done.
>> But it's hard for Spark to require consistent behavior about CHAR type in
>> all data sources. Since CHAR type is not that useful nowadays, seems OK to
>> just ban it. Another way is to document that the padding of CHAR type is
>> data source dependent, but it's a bit weird to leave this inconsistency in
>> Spark.
>> 
>> 
>> 2. Leave VARCHAR unchanged in 3.0
>> VARCHAR type is so widely used in databases and it's weird if Spark
>> doesn't support it. VARCHAR type is exactly the same as Spark StringType
>> when the length limitation is not hit, and I'm fine to temporarily leave
>> this flaw in 3.0 and users may hit behavior changes when the string values
>> hit the VARCHAR length limitation.
>> 
>> 
>> 3. Finalize the VARCHAR behavior in 3.1
>> For now I have 2 ideas:
>> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
>> should support VARCHAR, and CREATE TABLE should fail if a column is
>> VARCHAR type and the underlying data source doesn't support it (e.g.
>> JSON/CSV). Type cast, type coercion, table insertion, etc. should be
>> updated as well.
>> b) Simply document that, the underlying data source may or may not enforce
>> the length limitation of VARCHAR(x).
>> 
>> 
>> Please let me know if you have different ideas.
>> 
>> 
>> Thanks,
>> Wenchen
>> 
>> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust < michael@ databricks. com
>> ( mich...@databricks.com ) > wrote:
>> 
>> 
>>> 
 What I'd oppose is to just ban char for the native data sources, and do
 not have a plan to address this problem systematically.
 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>>  
>>> 
 Just forget about padding, like what Snowflake and MySQL have done.
 Document that char(x) is just an alias for string. And then move on.
 Almost no work needs to be done...
 
>>> 
>>> 
>>> 
>>> +1 
>>> 
>> 
>> 
> 
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
I agree it sucks. We started with some decision that might have made sense back 
in 2013 (let's use Hive as the default source, and guess what, pick the slowest 
possible serde by default). We are paying that debt ever since.

Thanks for bringing this thread up though. We don't have a clear solution yet, 
but at least it made a lot of people aware of the issues.

On Thu, Mar 19, 2020 at 8:56 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Technically, I have been struggling with (1) `CREATE TABLE` due to many
> differences for a long time (since 2017). So, I had a wrong assumption about
> the implication of "(2) FYI: SPARK-30098 Use default datasource as
> provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel
> the same way. However, it was a lot to me. Also, switching
> `convertMetastoreOrc` at 2.4 was a big change to me although there will be
> no difference for Parquet-only users.
> 
> 
> Dongjoon.
> 
> 
> > References:
> > 1. "CHAR implementation?", 2017/09/15
> >      https:/ / lists. apache. org/ thread. html/ 
> > 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> )
> > 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
> >    https:/ / lists. apache. org/ thread. html/ 
> > 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> )
> 
> 
> 
> 
> 
> 
> On Thu, Mar 19, 2020 at 8:47 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> You are joking when you said " informed widely and discussed in many ways
>> twice" right?
>> 
>> 
>> 
>> This thread doesn't even talk about char/varchar: https:/ / lists. apache.
>> org/ thread. html/ 
>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>> spark. apache. org%3E (
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>> )
>> 
>> 
>> 
>> (Yes it talked about changing the default data source provider, but that's
>> just one of the ways we are exposing this char/varchar issue).
>> 
>> 
>> 
>> 
>> 
>> On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>>> +1 for Wenchen's suggestion.
>>> 
>>> I believe that the difference and effects are informed widely and
>>> discussed in many ways twice.
>>> 
>>> First, this was shared on last December.
>>> 
>>>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>    https:/ / lists. apache. org/ thread. html/ 
>>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> )
>>> 
>>> Second (at this time in this thread), this has been discussed according to
>>> the new community rubric.
>>> 
>>>     - https:/ / spark. apache. org/ versioning-policy. html (
>>> https://spark.apache.org/versioning-policy.html ) (Section: "Considerations
>>> When Breaking APIs")
>>> 
>>> 
>>> Thank you all.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan < cloud0fan@ gmail. com (
>>> cloud0...@gmail.com ) > wrote:
>>> 
>>> 
>>>> OK let me put a proposal here:
>>>> 
>>>> 
>>>> 1. Permanently ban CHAR for native data source tables, and only keep it
>>>> for Hive compatibility.
>>>> It's OK to forget about padding like what Snowflake and MySQL have done.
>>>> But it's hard for Spark to require consistent behavior about CHAR type in
>>>> all data sources. Since CHAR type is not that useful nowadays, seems OK to
>>>> just ban it. Another way is to document that the padding of CHAR type is
>>>> data source dependent, but it's a bit weird to leave this inconsistency in
>>>> Spark.
>>

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-24 Thread Reynold Xin
I actually think we should start cutting RCs. We can cut RCs even with blockers.

On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, All.
> 
> First of all, always "Community Over Code"!
> I wish you the best health and happiness.
> 
> As we know, we are still in the QA period and haven't reached the RC stage.
> It seems that we need to make the website up-to-date once more.
> 
>    https:/ / spark. apache. org/ versioning-policy. html (
> https://spark.apache.org/versioning-policy.html )
> 
> If possible, it would be really great if we can get `3.0.0` release
> manager's official `branch-3.0` assessment because we have only 1 week
> before the end of March.
> 
> 
> Could you, the 3.0.0 release manager, share your thoughts and update the
> website, please?
> 
> 
> Bests
> Dongjoon.
>



Re: results of taken(3) not appearing in console window

2020-03-26 Thread Reynold Xin
bcc dev, +user

You need to print out the result. Take itself doesn't print. You only got the 
results printed to the console because the Scala REPL automatically prints the 
returned value from take.

On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman < zahidr1...@gmail.com > wrote:

> 
> I am running the same code with the same libraries but not getting same
> output.
> scala>  case class flight (DEST_COUNTRY_NAME: String,
>      |                      ORIGIN_COUNTRY_NAME:String,
>      |                      count: BigInt)
> defined class flight
> 
> scala>     val flightDf = spark. read. parquet ( http://spark.read.parquet/
> ) ("/data/flight-data/parquet/2010-summary.parquet/")
> flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string,
> ORIGIN_COUNTRY_NAME: string ... 1 more field]
> 
> scala> val flights = flightDf. as ( http://flightdf.as/ ) [flight]
> flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME:
> string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
> 
> scala> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME !=
> "Canada").map(flight_row => flight_row).take(3)
> *res0: Array[flight] = Array(flight(United States,Romania,1),
> flight(United States,Ireland,264), flight(United States,India,69))
> *
> 
> 
>  
> 20/03/26 19:09:00 INFO SparkContext: Running Spark version 3.0.0-preview2
> 20/03/26 19:09:00 INFO ResourceUtils:
> ==
> 20/03/26 19:09:00 INFO ResourceUtils: Resources for spark.driver:
> 
> 20/03/26 19:09:00 INFO ResourceUtils:
> ==
> 20/03/26 19:09:00 INFO SparkContext: Submitted application:  chapter2
> 20/03/26 19:09:00 INFO SecurityManager: Changing view acls to: kub19
> 20/03/26 19:09:00 INFO SecurityManager: Changing modify acls to: kub19
> 20/03/26 19:09:00 INFO SecurityManager: Changing view acls groups to:
> 20/03/26 19:09:00 INFO SecurityManager: Changing modify acls groups to:
> 20/03/26 19:09:00 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(kub19);
> groups with view permissions: Set(); users  with modify permissions:
> Set(kub19); groups with modify permissions: Set()
> 20/03/26 19:09:00 INFO Utils: Successfully started service 'sparkDriver'
> on port 46817.
> 20/03/26 19:09:00 INFO SparkEnv: Registering MapOutputTracker
> 20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMaster
> 20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: Using org. apache. spark.
> storage. DefaultTopologyMapper (
> http://org.apache.spark.storage.defaulttopologymapper/ ) for getting
> topology information
> 20/03/26 19:09:00 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up
> 20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 20/03/26 19:09:00 INFO DiskBlockManager: Created local directory at
> /tmp/blockmgr-0baf5097-2595-4542-99e2-a192d7baf37c
> 20/03/26 19:09:00 INFO MemoryStore: MemoryStore started with capacity
> 127.2 MiB
> 20/03/26 19:09:01 INFO SparkEnv: Registering OutputCommitCoordinator
> 20/03/26 19:09:01 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 20/03/26 19:09:01 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 20/03/26 19:09:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http:/
> / localhost:4041 ( http://localhost:4041 )
> 20/03/26 19:09:01 INFO Executor: Starting executor ID driver on host
> localhost
> 20/03/26 19:09:01 INFO Utils: Successfully started service ' org. apache. 
> spark.
> network. netty. NettyBlockTransferService (
> http://org.apache.spark.network.netty.nettyblocktransferservice/ ) ' on
> port 38135.
> 20/03/26 19:09:01 INFO NettyBlockTransferService: Server created on
> localhost:38135
> 20/03/26 19:09:01 INFO BlockManager: Using org. apache. spark. storage. 
> RandomBlockReplicationPolicy
> ( http://org.apache.spark.storage.randomblockreplicationpolicy/ ) for block
> replication policy
> 20/03/26 19:09:01 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:38135 with 127.2 MiB RAM, BlockManagerId(driver,
> localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManager: Initialized BlockManager:
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO SharedState: Setting hive.metastore.warehouse.dir
> ('null') to the value of spark.sql.warehouse.dir
> ('file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse').
> 
> 20/03/26 19:09:01 INFO SharedState: Warehouse path is
> 'file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse'.
> 
> 20/03/26 19:09:02 INFO SparkContext: Starting job: parquet at
> chapter2.sc

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Reynold Xin
Let's start cutting RC next week.

On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> I'm also curious - there no open blockers for 3.0 but I know a few are
> still floating around open to revert changes. What is the status there?
> From my field of view I'm not aware of other blocking issues.
> 
> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim < kabhwan. opensource@ gmail.
> com ( kabhwan.opensou...@gmail.com ) > wrote:
> 
> 
>> Now the end of March is just around the corner. I'm not qualified to say
>> (and honestly don't know) where we are, but if we were intended to be in
>> blocker mode it doesn't seem to work; lots of developments still happen,
>> and priority/urgency doesn't seem to be applied to the sequence of
>> reviewing.
>> 
>> 
>> How about listing (or linking to epic, or labelling) JIRA issues/PRs which
>> are blockers (either from priority or technically) for Spark 3.0 release,
>> and make clear we should try to review these blockers first? Github PR
>> label may help here to filter out other PRs and concentrate these things.
>> 
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> 
>> 
>> On Wed, Mar 25, 2020 at 1:52 PM Xiao Li < lixiao@ databricks. com (
>> lix...@databricks.com ) > wrote:
>> 
>> 
>>> Let us try to finish the remaining major blockers in the next few days.
>>> For example, https:/ / issues. apache. org/ jira/ browse/ SPARK-31085 (
>>> https://issues.apache.org/jira/browse/SPARK-31085 )
>>> 
>>> 
>>> +1 to cut the RC even if we still have the blockers that will fail the
>>> RCs. 
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Xiao
>>> 
>>> 
>>> 
>>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> 
>>>> +1
>>>> 
>>>> 
>>>> Thanks,
>>>> Dongjoon.
>>>> 
>>>> On Tue, Mar 24, 2020 at 14:49 Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> I actually think we should start cutting RCs. We can cut RCs even with
>>>>> blockers.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. 
>>>>> com
>>>>> ( dongjoon.h...@gmail.com ) > wrote:
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> First of all, always "Community Over Code"!
>>>>>> I wish you the best health and happiness.
>>>>>> 
>>>>>> As we know, we are still in the QA period and haven't reached the RC stage.
>>>>>> It seems that we need to make the website up-to-date once more.
>>>>>> 
>>>>>>    https:/ / spark. apache. org/ versioning-policy. html (
>>>>>> https://spark.apache.org/versioning-policy.html )
>>>>>> 
>>>>>> If possible, it would be really great if we can get `3.0.0` release
>>>>>> manager's official `branch-3.0` assessment because we have only 1 week
>>>>>> before the end of March.
>>>>>> 
>>>>>> 
>>>>>> Could you, the 3.0.0 release manager, share your thoughts and update the
>>>>>> website, please?
>>>>>> 
>>>>>> 
>>>>>> Bests
>>>>>> Dongjoon.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> ( https://databricks.com/sparkaisummit/north-america )
>>> 
>> 
>> 
> 
>



[VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until 11:59pm Pacific time Fri Apr 3 , and passes if a 
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc1 (commit 
6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):

https://github.com/apache/spark/tree/v3.0.0-rc1

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1341/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc1.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).
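
For example, a minimal sbt sketch against this staging repository (the module and
version shown are assumptions; adjust them to your project):

    // build.sbt: resolve the RC artifacts from the staging repository
    resolvers += "Apache Spark 3.0.0 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1341/"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"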

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Note: I fully expect this RC to fail.



Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Reynold Xin
The Apache Software Foundation requires voting before any release can be 
published.

On Tue, Mar 31, 2020 at 11:27 PM, Stephen Coy < s...@infomedia.com.au.invalid > 
wrote:

> 
> 
>> On 1 Apr 2020, at 5:20 pm, Sean Owen < srowen@ gmail. com (
>> sro...@gmail.com ) > wrote:
>> 
>> It can be published as "3.0.0-rc1" but how do we test that to vote on it
>> without some other RC1 RC1
>> 
> 
> 
> I’m not sure what you mean by this question?
> 
> 
> 
> 
> This email contains confidential information of and is the copyright of
> Infomedia. It must not be forwarded, amended or disclosed without consent
> of the sender. If you received this message by mistake, please advise the
> sender and delete all copies. Security of transmission on the internet
> cannot be guaranteed, could be infected, intercepted, or corrupted and you
> should ensure you have suitable antivirus protection in place. By sending
> us your or any third party personal details, you consent to (or confirm
> you have obtained consent from such third parties) to Infomedia’s privacy
> policy. http:/ / www. infomedia. com. au/ privacy-policy/ (
> http://www.infomedia.com.au/privacy-policy/ )
>



Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
The RDD is the DAG.

On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:

> 
> Hello everyone,
> 
> I am implementing a caching mechanism for analytic workloads running on
> top of Spark and I need to retrieve the Spark DAG right after it is
> generated and the DAG scheduler. I would appreciate it if you could give
> me some hints or reference me to some documents about where the DAG is
> generated and inputs assigned to it. I found the DAG Scheduler class (
> https://github.com/apache/spark/blob/55dea9be62019d64d5d76619e1551956c8bb64d0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
> ) but I am not sure if it is a good starting point.
> 
> 
> 
> Regards
> Mania
>



Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
If you are talking about a tree, then the RDDs are nodes, and the dependencies 
are the edges.

If you are talking about a DAG, then the partitions in the RDDs are the nodes, 
and the dependencies between the partitions are the edges.
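
A minimal sketch (assuming a local session) of looking at this from user code:
every RDD keeps its parent `dependencies` (the edges), and `toDebugString` prints
the whole lineage that the DAGScheduler later splits into stages:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("dag-inspect").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)

    println(rdd.toDebugString)  // textual view of the lineage (the DAG)
    println(rdd.dependencies)   // direct edges to the parent RDD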

On Thu, Apr 16, 2020 at 4:02 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:

> 
> Is it correct to say, the nodes in the DAG are RDDs and the edges are
> computations?
> 
> 
> On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> The RDD is the DAG.
>> 
>> 
>> 
>> On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi. ma@ husky. neu. edu (
>> abdi...@husky.neu.edu ) > wrote:
>> 
>>> Hello everyone,
>>> 
>>> I am implementing a caching mechanism for analytic workloads running on
>>> top of Spark and I need to retrieve the Spark DAG right after it is
>>> generated and the DAG scheduler. I would appreciate it if you could give
>>> me some hints or reference me to some documents about where the DAG is
>>> generated and inputs assigned to it. I found the DAG Scheduler class (
>>> https://github.com/apache/spark/blob/55dea9be62019d64d5d76619e1551956c8bb64d0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
>>> ) but I am not sure if it is a good starting point.
>>> 
>>> 
>>> 
>>> Regards
>>> Mania
>>> 
>> 
>> 
> 
>



Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread Reynold Xin
The con is much more than just more effort to maintain a parallel API. It
puts the burden on all libraries and library developers to maintain a
parallel API as well. That’s one of the primary reasons we moved away from
this RDD vs JavaRDD approach in the old RDD API.
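
To make the trade-off concrete, here is a hypothetical sketch (the class and
method names are invented, not an existing Spark API) of the parallel-API pattern
under discussion: a Scala method returning a Scala type next to a Java-specific
variant built with the collection converters. Every such pair has to be added,
documented and kept in sync twice, which is the maintenance burden above.

    import scala.collection.JavaConverters._

    class StatusStore {
      // Scala API: returns a Scala Map
      def executorSummaries: Map[String, String] = Map("driver" -> "ACTIVE")

      // Java-specific counterpart: the same data as a java.util.Map
      def executorSummariesAsJava: java.util.Map[String, String] =
        executorSummaries.asJava
    }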


On Tue, Apr 28, 2020 at 12:38 AM ZHANG Wei  wrote:

> Be frankly, I also love the pure Java type in Java API and Scala type in
> Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
> can adopt the status of option 1, the specific Java classes. (But I don't
> like the `Java` prefix, which is redundant when I'm coding Java app,
> such as JavaRDD, why not distinct it by package namespace...) The specific
> Java API can also leverage some native Java language features with new
> versions.
>
> And just since the friendly relationship between Scala and Java, the Java
> user can call Scala API with `.asScala` or `.asJava`'s help if Java API
> is not ready. Then switch to Java API when it's well cooked.
>
> The cons is more efforts to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon  wrote:
>
> > The problem is that calling Scala instances in Java side is discouraged
> in
> > general up to my best knowledge.
> > A Java user won't likely know asJava in Scala but a Scala user will
> likely
> > know both asScala and asJava.
> >
> >
> > 2020년 4월 28일 (화) 오전 11:35, ZHANG Wei 님이 작성:
> >
> > > How about making a small change on option 4:
> > >   Keep Scala API returning Scala type instance with providing a
> > >   `asJava` method to return a Java type instance.
> > >
> > > Scala 2.13 has provided CollectionConverter [1][2][3], in the following
> > > Spark dependences upgrade, which can be supported by nature. For
> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava`[4]
> > > as what Scala 2.13 does and add implicit conversions.
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1]
> > > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4]
> > > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon  wrote:
> > >
> > > > I would like to make sure I am open for other options that can be
> > > > considered situationally and based on the context.
> > > > It's okay, and I don't target to restrict this here. For example,
> DSv2, I
> > > > understand it's written in Java because Java
> > > > interfaces arguably brings better performance. That's why vectorized
> > > > readers are written in Java too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding
> APIs to
> > > > return a Java instance is still
> > > > rather rare in general given my few years monitoring.
> > > > The problem I would more like to deal with is more about when we
> need to
> > > > add one or a couple of user-facing
> > > > Java-specific APIs to return Java instances, which is relatively more
> > > > frequent compared to when we need a bunch
> > > > of Java specific APIs.
> > > >
> > > > In this case, I think it should be guided to use 4. approach. There
> are
> > > > pros and cons between 3. and 4., of course.
> > > > But it looks to me 4. approach is closer to what Spark has targeted
> so
> > > far.
> > > >
> > > >
> > > >
> > > > 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> > > >
> > > > > > One thing we could do here is use Java collections internally and
> > > make
> > > > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > > > Then adding a method to the Scala API would require adding it to
> the
> > > > > Java API and we would keep the 

[VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc2 (commit 
29853eca69bceefd227cbe8421a09c116b7b753a):

https://github.com/apache/spark/tree/v3.0.0-rc2

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1345/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc2.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.



[vote] Apache Spark 3.0 RC3

2020-06-06 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc3 (commit 
3fdfce3120f307147244e5eaf46d61419a723d50):

https://github.com/apache/spark/tree/v3.0.0-rc3

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1350/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc3.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.


