CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines

2018-10-24 Thread Sean Owen
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
1.3.x release branch and later, including master

Description:
Spark's Apache Maven-based build includes a convenience script, 'build/mvn',
that downloads and runs a zinc server to speed up compilation. This server
will accept connections from external hosts by default. A specially-crafted
request to the zinc server could cause it to reveal information in files
readable to the developer account running the build. Note that this issue
does not affect end users of Spark, only developers building Spark from
source code.

Mitigation:
Spark users are not affected, as zinc is only a part of the build process.
Spark developers may simply use a local Maven installation's 'mvn' command
to build, and avoid running build/mvn and zinc.
Spark developers building actively-developed branches (2.2.x, 2.3.x, 2.4.x,
master) may update their branches to receive mitigations already patched
onto the build/mvn script.
Spark developers running zinc separately may include "-server 127.0.0.1" in
its command line, and consider additional flags like "-idle-timeout 30m" to
achieve similar mitigation.

Credit:
Andre Protas, Apple Information Security

References:
https://spark.apache.org/security.html




Re: Hadoop-Token-Across-Kerberized -Cluster

2018-10-24 Thread Davinder Kumar
Any update on this? Has anybody faced a similar issue? Any suggestion will
be appreciated.



Thanks

-Davinder


From: Davinder Kumar 
Sent: Wednesday, October 17, 2018 11:01 AM
To: dev
Subject: Hadoop-Token-Across-Kerberized -Cluster


Hello All,


I need some help with a Kerberized setup. I have two Ambari clusters, Cluster A and
Cluster B, both Kerberized against the same KDC.


The use case: access Hive data on Cluster B from Cluster A.


Actions taken so far:


- The remote Cluster B principal and keytab are provided to Cluster A (admadmin is
the user).

- The remote cluster's Hive metastore principal/keytab are provided to Cluster A.

- The Spark job runs on Cluster A (Spark on YARN) to access the data from
Cluster B.

- Cluster A is able to connect to the Hive metastore of remote Cluster B.

- Now I am getting an error related to Hadoop tokens (any help or suggestion is
appreciated). The error logs look like this:



18/10/16 20:33:55 INFO RMProxy: Connecting to ResourceManager at 
davinderrc15.c.ampool-141120.internal/10.128.15.198:8030
18/10/16 20:33:55 INFO YarnRMClient: Registering the ApplicationMaster
18/10/16 20:33:55 INFO YarnAllocator: Will request 2 executor container(s), 
each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/10/16 20:33:55 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: 
ApplicationMaster registered as 
NettyRpcEndpointRef(spark://YarnAM@10.128.15.198:38524)
18/10/16 20:33:55 INFO YarnAllocator: Submitted 2 unlocalized container 
requests.
18/10/16 20:33:55 INFO ApplicationMaster: Started progress reporter thread with 
(heartbeat : 3000, initial allocation : 200) intervals
18/10/16 20:33:56 INFO AMRMClientImpl: Received new token for : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:56 INFO YarnAllocator: Launching container 
container_e07_1539521606680_0045_02_02 on host 
davinderrc15.c.ampool-141120.internal for executor with ID 1
18/10/16 20:33:56 INFO YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
18/10/16 20:33:56 INFO ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
18/10/16 20:33:56 INFO ContainerManagementProtocolProxy: Opening proxy : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:57 INFO YarnAllocator: Launching container 
container_e07_1539521606680_0045_02_03 on host 
davinderrc15.c.ampool-141120.internal for executor with ID 2
18/10/16 20:33:57 INFO YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
18/10/16 20:33:57 INFO ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
18/10/16 20:33:57 INFO ContainerManagementProtocolProxy: Opening proxy : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered 
executor NettyRpcEndpointRef(spark-client://Executor) (10.128.15.198:39686) 
with ID 1
18/10/16 20:33:59 INFO BlockManagerMasterEndpoint: Registering block manager 
davinderrc15.c.ampool-141120.internal:36291 with 366.3 MB RAM, 
BlockManagerId(1, davinderrc15.c.ampool-141120.internal, 36291, None)
18/10/16 20:33:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered 
executor NettyRpcEndpointRef(spark-client://Executor) (10.128.15.198:39704) 
with ID 2
18/10/16 20:33:59 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
18/10/16 20:33:59 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook 
done
18/10/16 20:33:59 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045/container_e07_1539521606680_0045_02_01/spark-warehouse').
18/10/16 20:33:59 INFO SharedState: Warehouse path is 
'file:/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045/container_e07_1539521606680_0045_02_01/spark-warehouse'.
18/10/16 20:33:59 INFO BlockManagerMasterEndpoint: Registering block manager 
davinderrc15.c.ampool-141120.internal:45507 with 366.3 MB RAM, 
BlockManagerId(2, davinderrc15.c.ampool-141120.internal, 45507, None)
18/10/16 20:34:00 INFO HiveUtils: Initializing HiveMetastoreConnection version 
1.2.1 using Spark classes.
18/10/16 20:34:00 INFO HiveClientImpl: Attempting to login to Kerberos using 
principal: admadmin/ad...@ampool.io and keytab: 
admadmin.keytab-a796621d-bacd-47e2-bd97-077090fe8aa8
18/10/16 20:34:00 INFO UserGroupInformation: Login successful for user 
admadmin/ad...@ampool.io using keytab file 
admadmin.keytab-a796621d-bacd-47e2-bd97-077090fe8aa8
18/10/16 20:34:01 INFO metastore: Trying to connect to metastore with URI 
thrift://10.128.0.39:9083
18/10/16 20:34:01 INFO metastore: Connected to metastore.
18/10/16 20:34:01 INFO SessionState: Created local directory: 
/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045
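
For reference, the Cluster A side of this setup looks roughly like the sketch
below. It is only an illustration: the metastore host, realm, database/table
names and keytab path are placeholders, not the real values.

import org.apache.spark.sql.SparkSession

// Rough sketch of the Cluster A job (placeholder values only). The job
// authenticates against the shared KDC with a keytab and points the Hive
// client at Cluster B's kerberized metastore.
val spark = SparkSession.builder()
  .appName("cross-cluster-hive-read")
  .config("hive.metastore.uris", "thrift://<cluster-b-metastore-host>:9083")
  .config("hive.metastore.sasl.enabled", "true")
  .config("hive.metastore.kerberos.principal", "hive/_HOST@<REALM>")
  .enableHiveSupport()
  .getOrCreate()

// The principal/keytab are passed on spark-submit so YARN can obtain and
// renew delegation tokens, e.g.
//   --principal admadmin@<REALM> --keytab /path/to/admadmin.keytab
// and, if HDFS on Cluster B is read directly, possibly also something like
//   --conf spark.yarn.access.hadoopFileSystems=hdfs://<cluster-b-namenode>:8020

spark.sql("SELECT * FROM remote_db.some_table LIMIT 10").show()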

What's a blocker?

2018-10-24 Thread Sean Owen
Shifting this to dev@. See the PR https://github.com/apache/spark/pull/22144
for more context.

There will be no objective, complete definition of a blocker, or even of a
regression or correctness issue. Many cases are clear, some are not. We can
draw up more guidelines, and feel free to open PRs against the
'contributing' doc. But in general these are the same consensus-driven
decisions we negotiate all the time.

What isn't said that should be is that there is a cost to not releasing.
Keep in mind we have, also, decided on a 'release train' cadence. That does
properly change the calculus about what's a blocker; the right decision
could change within even a week.

I wouldn't mind some verbiage around what a regression is. Since the last
minor release?

We can VOTE on anything we like, but we already VOTE on the release.
Weirdly, technically, the release vote criterion is a simple majority, FWIW:
http://www.apache.org/legal/release-policy.html#release-approval

Yes, actually, it is only the PMC's votes that literally matter. Those
votes are, surely, based on input from others too. But that is actually
working as intended.


Let's understand statements like "X is not a blocker" to mean "I don't
think that X is a blocker". Interpretations not proclamations, backed up by
reasons, not all of which are appeals to policy and precedent.

I find it hard to argue about these in the abstract, because I believe it's
already widely agreed, and written down in ASF policy, that nobody makes
decisions unilaterally. Done, yes.

Practically speaking, the urgent issue is the 2.4 release. I don't see
process failures here that need fixing or debate. I do think those
outstanding issues merit technical discussion. The outcome will be a
tradeoff among some subjective concerns, not read off of a policy sheet.
Let's speak freely about those technical issues and
try to find the consensus position.


On Wed, Oct 24, 2018 at 12:21 PM Mark Hamstra 
wrote:

> Thanks @tgravescs  for your latest posts --
> they've saved me from posting something similar in many respects but more
> strongly worded.
>
> What is bothering me (not just in the discussion of this PR, but more
> broadly) is that we have individuals making declarative statements about
> whether something can or can't block a release, or that something "is not
> that important to Spark at this point", etc. -- things for which there is
> no supporting PMC vote or declared policy. It may be your opinion,
> @cloud-fan  , that Hive compatibility
> should no longer be important to the Apache Spark project, and I have no
> problem with you expressing your opinion on the matter. That may even be
> the internal policy at your employer, I don't know. But you are just not in
> a position on your own to make this declaration for the Apache Spark
> project.
>
> I don't mean to single you out, @cloud-fan 
> , as the only offender, since this isn't a unique instance. For example,
> heading into a recent release we also saw individual declarations that the
> data correctness issue caused by the shuffle replay partitioning issue was
> not a blocker because it was not a regression or that it was not
> significant enough to alter the release schedule. Rather, my point is that
> things like release schedules, the declaration of release candidates,
> labeling JIRA tickets with "blocker", and de facto or even declared policy
> on regressions and release blockers are just tools in the service of the
> PMC. If, as was the case with the shuffle data correctness issue, PMC
> members think that the issue must be fixed before the next release, then
> release schedules, RC-status, other individuals' perceptions of importance
> to the project or of policy ultimately don't matter -- only the vote of the
> PMC does. What is concerning me is that, instead of efforts to persuade the
> PMC members that something should not block the next release or should not
> be important to the project, I am seeing flat declarations that an issue is
> not a blocker or not important. That may serve to stifle work to
> immediately fix a bug, or to discourage other contributions, but I can
> assure that trying to make the PMC serve the tools instead of the other way
> around won't serve to persuade at least some PMC members on how they should
> vote.
>
> Sorry, I guess I can't avoid wording things strongly after all.
>


KryoSerializer Implementation - Not using KryoPool

2018-10-24 Thread Patrick Brown
Hi,

I am wondering about the implementation of KryoSerializer, specifically the
lack of use of KryoPool, which is recommended by Kryo themselves.

Looking at the code, it seems that KryoSerializer.newInstance is called
frequently, followed by a serialize, and then the instance goes out of scope.
This appears to cause frequent creation of Kryo instances, something the Kryo
documentation says is expensive.

From flame graphs of our own running software (it processes a lot of small
jobs), it seems a good amount of time is spent on this.

I have a small patch we are using internally that adds a reused KryoPool
inside KryoSerializer (not KryoSerializerInstance), in order to avoid creating
many Kryo instances. I wonder if I am missing something as to why this isn't
done already. If not, is this a patch that Spark would be interested in
merging, and how might I go about that?
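
Roughly, the internal patch amounts to a shared pool along the lines of the
sketch below. This is a simplified illustration using Kryo's KryoPool API, not
the actual Spark change; the Kryo configuration/registration that
KryoSerializer normally applies is omitted here.

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}

// Factory that builds a configured Kryo; the real change would reuse the
// configuration KryoSerializer already applies when it creates a Kryo.
val factory = new KryoFactory {
  override def create(): Kryo = new Kryo()
}

// Soft references let the pool shrink under memory pressure.
val pool: KryoPool = new KryoPool.Builder(factory).softReferences().build()

// Borrow and release instead of constructing a new Kryo per serialization.
val kryo = pool.borrow()
try {
  // ... serialize with this kryo ...
} finally {
  pool.release(kryo)
}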

Thanks,

Patrick


Re: KryoSerializer Implementation - Not using KryoPool

2018-10-24 Thread Sean Owen
I don't know; possibly just because it wasn't available whenever Kryo
was first used in the project.

Skimming the code, the KryoSerializerInstance looks like a wrapper
that provides a Kryo object to do work. It already maintains a 'pool'
of just 1 instance. Is the point that KryoSerializer can share a
KryoPool across KryoSerializerInstances that provides them with a Kryo
rather than allocating a new one? That makes sense, though I believe the
concern is always whether that somehow shares state or config in a way
that breaks something. I see there's already a reset() call in here to
try to avoid that.

Well, seems worth a PR, especially if you can demonstrate some
performance gains.

On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
 wrote:
>
> Hi,
>
> I am wondering about the implementation of KryoSerializer, specifically the 
> lack of use of KryoPool, which is recommended by Kryo themselves.
>
> Looking at the code, it seems that frequently KryoSerializer.newInstance is 
> called, followed by a serialize and then this instance goes out of scope, 
> this seems like it causes frequent creation of Kryo instances, something 
> which the Kryo documentation says is expensive.
>
> By doing flame graphs on our own running software (it processes a lot of 
> small jobs) it seems like a good amount of time is spent on this.
>
> I have a small patch we are using internally which implements a reused 
> KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid 
> the creation of many Kryo instances. I am wonder if I am missing something as 
> to why this isn't done already. If not I am wondering if this might be a 
> patch that Spark would be interested in merging in, and how I might go about 
> that.
>
> Thanks,
>
> Patrick




Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Hi, All.

-0 due to the following issue. From Spark 2.4.0, users may get an incorrect
result when they use the new `map_filter` with `map_concat` functions.

https://issues.apache.org/jira/browse/SPARK-25823

SPARK-25823 is only aiming to fix the data correctness issue from
`map_filter`.

PMC members are able to lower the priority; as always, I respect the PMC's
decision.

I'm sending this email to draw more attention to this bug and to warn the
community about the new feature's limitations.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: queryable state & streaming

2018-10-24 Thread Arun Mahadevan
I don't think a separate API, RPCs, etc. would be necessary for queryable
state if the state can be exposed as just another data source. Then SQL
queries can be issued against it just like SQL queries against any other
data source.

For now, I think the "memory" sink could be used and queries run against it,
but I agree it does not scale for large states.
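
As a rough illustration of that interim approach (a sketch with assumed names:
`aggregatedDf` stands in for some streaming aggregation DataFrame), the state
can be mirrored into an in-memory table and queried with SQL:

// Write the streaming aggregation into the in-memory table "counts" and
// query it with plain SQL. The whole result is kept in the driver's memory,
// so this only works for small state.
val query = aggregatedDf.writeStream
  .format("memory")
  .queryName("counts")
  .outputMode("complete")
  .start()

spark.sql("SELECT * FROM counts WHERE value > 10").show()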

On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim  wrote:

> It doesn't seem Spark has workarounds other than storing output into
> external storages, so +1 on having this.
>
> My major concern on implementing queryable state in structured streaming
> is "Are all states available on executors at any time while query is
> running?" Querying state shouldn't affect the running query. Given that
> state is huge and default state provider is loading state in memory, we may
> not want to load one more redundant snapshot of state: we want to always
> load "current state" which query is also using. (For sure, Queryable state
> should be read-only.)
>
> Regarding improvement of local state, I guess it is ideal to leverage
> embedded db, like Kafka and Flink are doing. The difference will not be
> only reading state from non-heap, but also how to take a snapshot and store
> delta. We may want to check snapshotting works well with small batch
> interval, and find alternative approach when it doesn't. Sounds like it is
> a huge item and can be handled individually.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos wrote:
>
>> Nice, I was looking for a JIRA. I agree we should justify why we are
>> building something. In that direction, here is what I have seen from my
>> experience.
>> People quite often use state within their streaming app and may have
>> large states (TBs). Shortening the pipeline by not having to copy data (to
>> Cassandra, for example, for serving) is an advantage, in terms of at least
>> latency and complexity.
>> This can be true if we take advantage of state checkpointing (locally this
>> could be RocksDB, or HDFS more generally; the latter is currently
>> supported), along with an API to efficiently query data.
>> Some use cases I see:
>>
>> - real-time dashboards and real-time reporting, the faster the better
>> - monitoring of state for operational reasons, app health etc...
>> - integrating with external services via an API eg. making accessible
>>  aggregations over time windows to some third party service within your
>> system
>>
>> Regarding requirements here are some of them:
>> - support of an API to expose state (could be done at the spark driver),
>> like rest.
>> - supporting dynamic allocation (not sure how it affects state
>> management)
>> - an efficient way to talk to executors to get the state (rpc?)
>> - making local state more efficient and more easily accessible with an
>> embedded DB (I don't think this is supported, from what I see; maybe I'm wrong)
>> Some people are already working with such techs and some stuff could be
>> re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>
>> Best,
>> Stavros
>>
>>
>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust > > wrote:
>>
>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>
>>> I don't believe anyone is working on it yet.  I think the most useful
>>> thing is to start enumerating requirements and use cases and then we can
>>> talk about how to build it.
>>>
>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>> st.kontopou...@gmail.com> wrote:
>>>
 Cool Burak do you have a pointer, should I take the initiative for a
 first design document or Databricks is working on it?

 Best,
 Stavros

 On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz  wrote:

> Hi Stavros,
>
> Queryable state is definitely on the roadmap! We will revamp the
> StateStore API a bit, and a queryable StateStore is definitely one of the
> things we are thinking about during that revamp.
>
> Best,
> Burak
>
> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
> st.kontopou...@gmail.com> wrote:
>
>> Just to re-phrase my question: Would query-able state make a viable
>> SPIP?
>>
>> Regards,
>> Stavros
>>
>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>> st.kontopou...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Maybe this has been discussed before. Given the fact that many
>>> streaming apps out there use state extensively, could be a good idea to
>>> make Spark expose streaming state with an external API like other
>>> systems do (Kafka streams, Flink etc), in order to facilitate
>>> interactive queries?
>>>
>>> Regards,
>>> Stavros
>>>
>>
>>

>>>
>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.

The problem is not about the function `map_filter`, but about how map
type values are created in Spark when there are duplicated keys.

In programming languages like Java/Scala, when creating a map, the later
entry wins, e.g. in Scala:
scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

However, in Spark, the earlier entry wins:
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).

But there are several bugs in Spark:

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
The displayed string of map values has a bug and we should deduplicate the
entries. This is tracked by SPARK-25824.


scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
++
| map|
++
|[1 -> 3]|
++
The Hive map value convert has a bug, we should respect the "earlier entry
wins" semantic. No ticket yet.


scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
The same bug happens with `collect`. No ticket yet.

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable whether the "earlier entry wins" semantic is reasonable. Fixing it
is a behavior change, and we can only apply it to the master branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just
a symptom of the Hive map value converter bug. I think it's a non-blocker.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> -0 due to the following issue. From Spark 2.4.0, users may get an
> incorrect result when they use new `map_fitler` with `map_concat` functions.
>
> https://issues.apache.org/jira/browse/SPARK-25823
>
> SPARK-25823 is only aiming to fix the data correctness issue from
> `map_filter`.
>
> PMC members are able to lower the priority. Always, I respect PMC's
> decision.
>
> I'm sending this email to draw more attention to this bug and to give some
> warning on the new feature's limitation to the community.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 26 PST and passes if a majority +1 PMC
>> votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc4 (commit
>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>>
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Thank you for the follow-ups.

Then, will Spark 2.4.1 return `{1:2}`, differently from all of the following
(including Spark/Scala), in the end?

I hoped to fix `map_filter`, but now Spark looks inconsistent in many
ways.

scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+


spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
{1:3}


hive> select map(1,2,1,3);  // Hive 1.2.2
OK
{1:3}


presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3]));
// Presto 0.212
 _col0
---
 {1=3}


Bests,
Dongjoon.


On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:

> Hi Dongjoon,
>
> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>
> The problem is not about the function `map_filter`, but about how the map
> type values are created in Spark, when there are duplicated keys.
>
> In programming languages like Java/Scala, when creating map, the later
> entry wins. e.g. in scala
> scala> Map(1 -> 2, 1 -> 3)
> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>
> scala> Map(1 -> 2, 1 -> 3).get(1)
> res1: Option[Int] = Some(3)
>
> However, in Spark, the earlier entry wins
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
>
> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>
> But there are several bugs in Spark
>
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> The displayed string of map values has a bug and we should deduplicate the
> entries, This is tracked by SPARK-25824.
>
>
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> ++
> | map|
> ++
> |[1 -> 3]|
> ++
> The Hive map value convert has a bug, we should respect the "earlier entry
> wins" semantic. No ticket yet.
>
>
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> Same bug happens at `collect`. No ticket yet.
>
> I'll create tickets and list all of them as known issues in 2.4.0.
>
> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
> it is a behavior change and we can only apply it to master branch.
>
> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
> just a symptom of the hive map value converter bug. I think it's a
> non-blocker.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> -0 due to the following issue. From Spark 2.4.0, users may get an
>> incorrect result when they use new `map_fitler` with `map_concat` functions.
>>
>> https://issues.apache.org/jira/browse/SPARK-25823
>>
>> SPARK-25823 is only aiming to fix the data correctness issue from
>> `map_filter`.
>>
>> PMC members are able to lower the priority. Always, I respect PMC's
>> decision.
>>
>> I'm sending this email to draw more attention to this bug and to give
>> some warning on the new feature's limitation to the community.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 26 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc4 (commit
>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important b

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the thrift-server? Then maybe this is caused by the bug
in `Dataset.collect` that I mentioned above.

I think map_filter is implemented correctly. map(1,2,1,3) is actually
map(1,2) according to the "earlier entry wins" semantic. I don't think this
will change in 2.4.1.

On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
> (including Spark/Scala) in the end?
>
> I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
> ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3);  // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3]));
> // Presto 0.212
>  _col0
> ---
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how the map
>> type values are created in Spark, when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating map, the later
>> entry wins. e.g. in scala
>> scala> Map(1 -> 2, 1 -> 3)
>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>
>> scala> Map(1 -> 2, 1 -> 3).get(1)
>> res1: Option[Int] = Some(3)
>>
>> However, in Spark, the earlier entry wins
>> scala> sql("SELECT map(1,2,1,3)[1]").show
>> +--+
>> |map(1, 2, 1, 3)[1]|
>> +--+
>> | 2|
>> +--+
>>
>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>
>> But there are several bugs in Spark
>>
>> scala> sql("SELECT map(1,2,1,3)").show
>> ++
>> | map(1, 2, 1, 3)|
>> ++
>> |[1 -> 2, 1 -> 3]|
>> ++
>> The displayed string of map values has a bug and we should deduplicate
>> the entries, This is tracked by SPARK-25824.
>>
>>
>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>> res11: org.apache.spark.sql.DataFrame = []
>>
>> scala> sql("select * from t").show
>> ++
>> | map|
>> ++
>> |[1 -> 3]|
>> ++
>> The Hive map value convert has a bug, we should respect the "earlier
>> entry wins" semantic. No ticket yet.
>>
>>
>> scala> sql("select map(1,2,1,3)").collect
>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>> Same bug happens at `collect`. No ticket yet.
>>
>> I'll create tickets and list all of them as known issues in 2.4.0.
>>
>> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
>> it is a behavior change and we can only apply it to master branch.
>>
>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>> just a symptom of the hive map value converter bug. I think it's a
>> non-blocker.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>> incorrect result when they use new `map_fitler` with `map_concat` functions.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>
>>> SPARK-25823 is only aiming to fix the data correctness issue from
>>> `map_filter`.
>>>
>>> PMC members are able to lower the priority. Always, I respect PMC's
>>> decision.
>>>
>>> I'm sending this email to draw more attention to this bug and to give
>>> some warning on the new feature's limitation to the community.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.4.0.

 The vote is open until October 26 PST and passes if a majority +1 PMC
 votes are cast, with
 a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.4.0-rc4 (commit
 e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
 https://github.com/apache/spark/tree/v2.4.0-rc4

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1290

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.

Re: What's a blocker?

2018-10-24 Thread Hyukjin Kwon
> Let's understand statements like "X is not a blocker" to mean "I don't
think that X is a blocker". Interpretations not proclamations, backed up by
reasons, not all of which are appeals to policy and precedent.
It might not be a big deal, and it's a bit off topic, but I would rather hope
people explicitly avoid saying things like "X is not a blocker", though. It
certainly does sound like some kind of proclamation.


2018년 10월 25일 (목) 오전 3:09, Sean Owen 님이 작성:

> Shifting this to dev@. See the PR
> https://github.com/apache/spark/pull/22144 for more context.
>
> There will be no objective, complete definition of blocker, or even
> regression or correctness issue. Many cases are clear, some are not. We can
> draw up more guidelines, and feel free to open PRs against the
> 'contributing' doc. But in general these are the same consensus-driven
> decisions we negotiate all the time.
>
> What isn't said that should be is that there is a cost to not releasing.
> Keep in mind we have, also, decided on a 'release train' cadence. That does
> properly change the calculus about what's a blocker; the right decision
> could change within even a week.
>
> I wouldn't mind some verbiage around what a regression is. Since the last
> minor release?
>
> We can VOTE on anything we like, but we already VOTE on the release.
> Weirdly, technically, the release vote criteria is simple majority, FWIW:
> http://www.apache.org/legal/release-policy.html#release-approval
>
> Yes, actually, it is only the PMC's votes that literally matter. Those
> votes are, surely, based on input from others too. But that is actually
> working as intended.
>
>
> Let's understand statements like "X is not a blocker" to mean "I don't
> think that X is a blocker". Interpretations not proclamations, backed up by
> reasons, not all of which are appeals to policy and precedent.
>
> I find it hard to argue about these in the abstract, because I believe
> it's already widely agreed, and written down in ASF policy, that nobody
> makes decisions unilaterally. Done, yes.
>
> Practically speaking, the urgent issue is the 2.4 release. I don't see
> process failures here that need fixing or debate. I do think those
> outstanding issues merit technical discussion. The outcome will be a
> tradeoff of some subjective issues, not read off of a policy sheet, and
> will entail tradeoffs. Let's speak freely about those technical issues and
> try to find the consensus position.
>
>
> On Wed, Oct 24, 2018 at 12:21 PM Mark Hamstra 
> wrote:
>
>> Thanks @tgravescs  for your latest posts
>> -- they've saved me from posting something similar in many respects but
>> more strongly worded.
>>
>> What is bothering me (not just in the discussion of this PR, but more
>> broadly) is that we have individuals making declarative statements about
>> whether something can or can't block a release, or that something "is not
>> that important to Spark at this point", etc. -- things for which there is
>> no supporting PMC vote or declared policy. It may be your opinion,
>> @cloud-fan  , that Hive compatibility
>> should no longer be important to the Apache Spark project, and I have no
>> problem with you expressing your opinion on the matter. That may even be
>> the internal policy at your employer, I don't know. But you are just not in
>> a position on your own to make this declaration for the Apache Spark
>> project.
>>
>> I don't mean to single you out, @cloud-fan 
>> , as the only offender, since this isn't a unique instance. For example,
>> heading into a recent release we also saw individual declarations that the
>> data correctness issue caused by the shuffle replay partitioning issue was
>> not a blocker because it was not a regression or that it was not
>> significant enough to alter the release schedule. Rather, my point is that
>> things like release schedules, the declaration of release candidates,
>> labeling JIRA tickets with "blocker", and de facto or even declared policy
>> on regressions and release blockers are just tools in the service of the
>> PMC. If, as was the case with the shuffle data correctness issue, PMC
>> members think that the issue must be fixed before the next release, then
>> release schedules, RC-status, other individuals' perceptions of importance
>> to the project or of policy ultimately don't matter -- only the vote of the
>> PMC does. What is concerning me is that, instead of efforts to persuade the
>> PMC members that something should not block the next release or should not
>> be important to the project, I am seeing flat declarations that an issue is
>> not a blocker or not important. That may serve to stifle work to
>> immediately fix a bug, or to discourage other contributions, but I can
>> assure that trying to make the PMC serve the tools instead of the other way
>> around won't serve to persuade at least some PMC members on how they should
>> vote.
>>
>> Sorry, I guess I can't avoi

Re: What's a blocker?

2018-10-24 Thread Saisai Shao
Just my two cents from past experience. As the release manager for Spark
2.3.2, I felt significant delay during the release due to blocker issues; the
vote failed several times because of one or two "blocker issues". I think that
during the RC period, each "blocker issue" should be carefully evaluated by the
relevant PMC members and the release manager. Issues that are not so critical,
or that only matter to one or two firms, should be marked as blockers only with
care, to avoid delaying the release.

Thanks
Saisai


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
For the first question, that is the `bin/spark-sql` result. I didn't check STS
(the Thrift server), but it should return the same as `bin/spark-sql`.

> I think map_filter is implemented correctly. map(1,2,1,3) is actually
map(1,2) according to the "earlier entry wins" semantic. I don't think this
will change in 2.4.1.

For the second one, the `map_filter` issue is not about the "earlier entry
wins" semantic. Please see the following examples.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {1:2}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {1:3}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {}

In other words, `map_filter` works like a pushed-down filter on the map in
terms of the output result,
while users would assume that `map_filter` works on top of the result of `m`.

This is a function semantics issue.


On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the bug
> in `Dateset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>> (including Spark/Scala) in the end?
>>
>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
>> ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---+
>> |map(1, 2, 1, 3)|
>> +---+
>> |Map(1 -> 3)|
>> +---+
>>
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>>
>> hive> select map(1,2,1,3);  // Hive 1.2.2
>> OK
>> {1:3}
>>
>>
>> presto> SELECT map_concat(map(array[1],array[2]),
>> map(array[1],array[3])); // Presto 0.212
>>  _col0
>> ---
>>  {1=3}
>>
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>
>>> The problem is not about the function `map_filter`, but about how the
>>> map type values are created in Spark, when there are duplicated keys.
>>>
>>> In programming languages like Java/Scala, when creating map, the later
>>> entry wins. e.g. in scala
>>> scala> Map(1 -> 2, 1 -> 3)
>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>
>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>> res1: Option[Int] = Some(3)
>>>
>>> However, in Spark, the earlier entry wins
>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>> +--+
>>> |map(1, 2, 1, 3)[1]|
>>> +--+
>>> | 2|
>>> +--+
>>>
>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>
>>> But there are several bugs in Spark
>>>
>>> scala> sql("SELECT map(1,2,1,3)").show
>>> ++
>>> | map(1, 2, 1, 3)|
>>> ++
>>> |[1 -> 2, 1 -> 3]|
>>> ++
>>> The displayed string of map values has a bug and we should deduplicate
>>> the entries, This is tracked by SPARK-25824.
>>>
>>>
>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>> res11: org.apache.spark.sql.DataFrame = []
>>>
>>> scala> sql("select * from t").show
>>> ++
>>> | map|
>>> ++
>>> |[1 -> 3]|
>>> ++
>>> The Hive map value convert has a bug, we should respect the "earlier
>>> entry wins" semantic. No ticket yet.
>>>
>>>
>>> scala> sql("select map(1,2,1,3)").collect
>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>> Same bug happens at `collect`. No ticket yet.
>>>
>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>
>>> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
>>> it is a behavior change and we can only apply it to master branch.
>>>
>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>> just a symptom of the hive map value converter bug. I think it's a
>>> non-blocker.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 -0 due to the following issue. From Spark 2.4.0, users may get an
 incorrect result when they use new `map_fitler` with `map_concat` 
 functions.

 https://issues.apache.org/jira/browse/SPARK-25823

 SPARK-25823 is only aiming to fix the data correctness issue from
 `map_filter`.

 PMC members are able to lower the priority. Always, I respect PMC's
 decision.

 I'm sending this email to draw more attention to this bug and to give
 some warning on the new feature's limitation to the community.

 Bests,
 Dongjoon.


 On Mon, Oct

Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release
candidates, it's not as big a deal if something gets labeled as a blocker.
Once we are into an RC, I'd like to see any discussions as to whether
something is or isn't a blocker at least cross-referenced in the RC VOTE
thread so that PMC members can more easily be aware of the discussion and
potentially weigh in.

On Wed, Oct 24, 2018 at 7:12 PM Saisai Shao  wrote:

> Just my two cents of the past experience. As a release manager of Spark
> 2.3.2, I felt significantly delay during the release by block issues. Vote
> was failed several times by one or two "block issue". I think during the RC
> time, each "block issue" should be carefully evaluated by the related PMCs
> and release manager. Some issues which are not so critical or only matters
> to one or two firms should be carefully marked as blocker, to avoid the
> delay of the release.
>
> Thanks
> Saisai
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Ah, now I see the problem. `map_filter` has a very weird semantic that is
neither "earlier entry wins" nor "later entry wins".

I've opened https://github.com/apache/spark/pull/22821 to remove these
newly added map-related functions from FunctionRegistry (for 2.4.0), so that
they are invisible to end users and the weird behavior of the Spark map type
with duplicated keys is not escalated. We should fix it ASAP in the master
branch.

If others are OK with it, I'll start a new RC after that PR is merged.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
wrote:

> For the first question, it's `bin/spark-sql` result. I didn't check STS,
> but it will return the same with `bin/spark-sql`.
>
> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> For the second one, `map_filter` issue is not about `earlier entry wins`
> stuff. Please see the following example.
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:2}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:3}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {}
>
> In other words, `map_filter` works like `push-downed filter` to the map in
> terms of the output result
> while users assumed that `map_filter` works on top of the result of `m`.
>
> This is a function semantic issue.
>
>
> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>
>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> > {1:3}
>>
>> Are you running in the thrift-server? Then maybe this is caused by the
>> bug in `Dateset.collect` as I mentioned above.
>>
>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the follow-ups.
>>>
>>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>>> (including Spark/Scala) in the end?
>>>
>>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
>>> many ways.
>>>
>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>> +---+
>>> |map(1, 2, 1, 3)|
>>> +---+
>>> |Map(1 -> 3)|
>>> +---+
>>>
>>>
>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> {1:3}
>>>
>>>
>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>> OK
>>> {1:3}
>>>
>>>
>>> presto> SELECT map_concat(map(array[1],array[2]),
>>> map(array[1],array[3])); // Presto 0.212
>>>  _col0
>>> ---
>>>  {1=3}
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>>>
 Hi Dongjoon,

 Thanks for reporting it! This is indeed a bug that needs to be fixed.

 The problem is not about the function `map_filter`, but about how the
 map type values are created in Spark, when there are duplicated keys.

 In programming languages like Java/Scala, when creating map, the later
 entry wins. e.g. in scala
 scala> Map(1 -> 2, 1 -> 3)
 res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

 scala> Map(1 -> 2, 1 -> 3).get(1)
 res1: Option[Int] = Some(3)

 However, in Spark, the earlier entry wins
 scala> sql("SELECT map(1,2,1,3)[1]").show
 +--+
 |map(1, 2, 1, 3)[1]|
 +--+
 | 2|
 +--+

 So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).

 But there are several bugs in Spark

 scala> sql("SELECT map(1,2,1,3)").show
 ++
 | map(1, 2, 1, 3)|
 ++
 |[1 -> 2, 1 -> 3]|
 ++
 The displayed string of map values has a bug and we should deduplicate
 the entries, This is tracked by SPARK-25824.


 scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
 res11: org.apache.spark.sql.DataFrame = []

 scala> sql("select * from t").show
 ++
 | map|
 ++
 |[1 -> 3]|
 ++
 The Hive map value convert has a bug, we should respect the "earlier
 entry wins" semantic. No ticket yet.


 scala> sql("select map(1,2,1,3)").collect
 res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
 Same bug happens at `collect`. No ticket yet.

 I'll create tickets and list all of them as known issues in 2.4.0.

 It's arguable if the "earlier entry wins" semantic is reasonable.
 Fixing it is a behavior change and we can only apply it to master branch.

 Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
 just a symptom of the hive map value converter bug. I think it's a
>>>

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Xiao Li
@Dongjoon Hyun Thanks! This is a blocking
ticket. It returns a wrong result due to our undefined behavior. I agree we
should revert the newly added map-oriented functions. In the 3.0 release, we
need to define the behavior of duplicate keys in the MAP data type and fix
all the related issues that are confusing to our end users.

Thanks,

Xiao

On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:

> Ah now I see the problem. `map_filter` has a very weird semantic that is
> neither "earlier entry wins" or "latter entry wins".
>
> I've opened https://github.com/apache/spark/pull/22821 , to remove these
> newly added map-related functions from FunctionRegistry(for 2.4.0), so that
> they are invisible to end-users, and the weird behavior of Spark map type
> with duplicated keys are not escalated. We should fix it ASAP in the master
> branch.
>
> If others are OK with it, I'll start a new RC after that PR is merged.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
> wrote:
>
>> For the first question, it's `bin/spark-sql` result. I didn't check STS,
>> but it will return the same with `bin/spark-sql`.
>>
>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> For the second one, `map_filter` issue is not about `earlier entry wins`
>> stuff. Please see the following example.
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:2}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:3}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {}
>>
>> In other words, `map_filter` works like `push-downed filter` to the map
>> in terms of the output result
>> while users assumed that `map_filter` works on top of the result of `m`.
>>
>> This is a function semantic issue.
>>
>>
>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>>
>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> > {1:3}
>>>
>>> Are you running in the thrift-server? Then maybe this is caused by the
>>> bug in `Dateset.collect` as I mentioned above.
>>>
>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for the follow-ups.

 Then, Spark 2.4.1 will return `{1:2}` differently from the followings
 (including Spark/Scala) in the end?

 I hoped to fix the `map_filter`, but now Spark looks inconsistent in
 many ways.

 scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
 +---+
 |map(1, 2, 1, 3)|
 +---+
 |Map(1 -> 3)|
 +---+


 spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
 {1:3}


 hive> select map(1,2,1,3);  // Hive 1.2.2
 OK
 {1:3}


 presto> SELECT map_concat(map(array[1],array[2]),
 map(array[1],array[3])); // Presto 0.212
  _col0
 ---
  {1=3}


 Bests,
 Dongjoon.


 On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan 
 wrote:

> Hi Dongjoon,
>
> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>
> The problem is not about the function `map_filter`, but about how the
> map type values are created in Spark, when there are duplicated keys.
>
> In programming languages like Java/Scala, when creating map, the later
> entry wins. e.g. in scala
> scala> Map(1 -> 2, 1 -> 3)
> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>
> scala> Map(1 -> 2, 1 -> 3).get(1)
> res1: Option[Int] = Some(3)
>
> However, in Spark, the earlier entry wins
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
>
> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2)
> .
>
> But there are several bugs in Spark
>
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> The displayed string of map values has a bug and we should deduplicate
> the entries, This is tracked by SPARK-25824.
>
>
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> ++
> | map|
> ++
> |[1 -> 3]|
> ++
> The Hive map value convert has a bug, we should respect the "earlier
> entry wins" semantic. No ticket