How to resolve UnresolvedRelations (to explore FindDataSourceTable)?

2018-01-16 Thread Jacek Laskowski
Hi,

I've been exploring Spark Analyzer's FindDataSourceTable rule and found it
hard to explain why, in the following example, one of the two
UnresolvedRelations is not resolved.

Could you help me with the places marked as FIXME?

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

// Create tables
// FIXME Is there a more idiomatic way of creating tables for demos?
spark.range(10).write.saveAsTable("t1")
spark.range(0).write.saveAsTable("t2")
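// One possibly more idiomatic alternative (an untested sketch): CTAS via
// SQL with the range table-valued function:
// spark.sql("CREATE TABLE t1 USING parquet AS SELECT * FROM range(10)")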

import org.apache.spark.sql.catalyst.dsl.plans._
val plan = table("t1").insertInto(tableName = "t2", overwrite = true)

// Transform the logical plan with ResolveRelations logical rule first
// so UnresolvedRelations become UnresolvedCatalogRelations
import spark.sessionState.analyzer.ResolveRelations
val planWithUnresolvedCatalogRelations = ResolveRelations(plan)
scala> println(planWithUnresolvedCatalogRelations.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- 'SubqueryAlias t1
02    +- 'UnresolvedCatalogRelation `default`.`t1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

// Let's resolve UnresolvedCatalogRelations then
import org.apache.spark.sql.execution.datasources.FindDataSourceTable
val r = new FindDataSourceTable(spark)
val tablesResolvedPlan = r(planWithUnresolvedCatalogRelations)
// FIXME Why is t2 not resolved?!
scala> println(tablesResolvedPlan.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- SubqueryAlias t1
02    +- Relation[id#10L] parquet

Why is t2 not resolved?! Have I missed a rule to apply to the logical plan?
Which one? Could it be that this is not supposed to work, given "Inserting
into an RDD-based table is not allowed." [1]?

[1]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L483-L488
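
For completeness, here's what I plan to try next (a sketch, not a
confirmed answer): since the analyzer runs its rule batches to a fixed
point, applying ResolveRelations once more after FindDataSourceTable, or
simply running the full analyzer, might resolve t2 as well.

// Sketch: a second round of ResolveRelations over the partially-resolved plan
val round2 = ResolveRelations(tablesResolvedPlan)

// Sketch: run all analyzer rule batches to their fixed point
val fullyResolved = spark.sessionState.analyzer.execute(plan)
println(fullyResolved.numberedTreeString)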

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


Re: Whole-stage codegen and SparkPlan.newPredicate

2018-01-16 Thread Jacek Laskowski
Thanks for looking into it, Kazuaki!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jan 2, 2018 at 4:27 AM, Kazuaki Ishizaki wrote:

> Thank you for your correction :)
>
> I also made a mistake in my report. What I reported at first never occurs
> with the correct Java bean class.
> Finally, I can reproduce a problem that Jacek reported even using the
> master. In my environment, this problem occurs with or without whole-stage
> codegen. I updated the JIRA ticket.
>
> I am still working on this.
>
> Kazuaki Ishizaki
>
>
>
> From: Herman van Hövell tot Westerflier
> To: Kazuaki Ishizaki
> Cc: Jacek Laskowski, dev
> Date: 2018/01/02 04:12
>
> Subject: Re: Whole-stage codegen and SparkPlan.newPredicate
> --
>
>
>
> Wrong ticket: https://issues.apache.org/jira/browse/SPARK-22935
>
> Thanks for working on this :)
>
> On Mon, Jan 1, 2018 at 2:22 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> I ran the program from the Stack Overflow URL with Spark 2.2.1 and
> master. I cannot see the exception even when I disabled whole-stage
> codegen. Am I wrong?
> We would appreciate it if you could create a JIRA entry with a simple
> standalone repro.
>
> In addition to this report, I realized that this program produces
> incorrect results. I created a JIRA entry
> https://issues.apache.org/jira/browse/SPARK-22934.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Herman van Hövell tot Westerflier <hvanhov...@databricks.com>
> To: Jacek Laskowski <ja...@japila.pl>
> Cc: dev <dev@spark.apache.org>
> Date: 2017/12/31 21:44
> Subject: Re: Whole-stage codegen and SparkPlan.newPredicate
> --
>
>
>
> Hi Jacek,
>
> In this case whole stage code generation is turned off. However we still
> use code generation for a lot of other things: projections, predicates,
> orderings & encoders. You are currently seeing a compile time failure while
> generating a predicate. There is currently no easy way to turn code
> generation off entirely.
>
> The error itself is not great, but it still captures the problem in a
> relatively timely fashion. We should have caught this during analysis
> though. Can you file a ticket?
>
> - Herman
>
> On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> While working on an issue with whole-stage codegen, as reported at
> https://stackoverflow.com/q/48026060/1305344, I found out that
> spark.sql.codegen.wholeStage=false does *not* turn whole-stage codegen
> off completely.
>
>
> It looks like SparkPlan.newPredicate [1] gets called regardless of the
> value of spark.sql.codegen.wholeStage property.
>
> $ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false
> ...
> scala> spark.sessionState.conf.wholeStageEnabled
> res7: Boolean = false
>
> That leads to an issue in the SO question with whole-stage codegen
> regardless of the value:
>
> ...
>   at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:385)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:214)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:213)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:816)
> ...
>
> Is this a bug or does it work as intended? Why?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386

[build system] yet another power outage at our colo

2018-01-16 Thread shane knapp
all non-UPS machines (read:  all jenkins workers) temporarily lost power a
few minutes ago, and i will need to reconnect them to the master.

this means no builds for ~20 mins.

i will also be installing a plugin for the spark-on-k8s builds (
https://wiki.jenkins.io/display/JENKINS/Parameterized+Trigger+Plugin).

sorry for the inconvenience!

shane


Re: [build system] yet another power outage at our colo

2018-01-16 Thread shane knapp
ok, we're back up and ready to build.  sorry for the inconvenience.

On Tue, Jan 16, 2018 at 9:59 AM, shane knapp  wrote:

> all non-UPS machines (read:  all jenkins workers) temporarily lost power a
> few minutes ago, and i will need to reconnect them to the master.
>
> this means no builds for ~20 mins.
>
> i will also be installing a plugin for the spark-on-k8s builds (
> https://wiki.jenkins.io/display/JENKINS/Parameterized+Trigger+Plugin).
>
> sorry for the inconvenience!
>
> shane
>


Re: [build system] currently experiencing git timeouts when building

2018-01-16 Thread shane knapp
update:  we're seeing about 3% of builds timing out, but today is (so far)
a bad one (see the numbers below).

i've reached out to github about this and am waiting to hear back.

$ get_timeouts.py 10
Timeouts by project:
  17  spark-branch-2.3-lint
  10  spark-branch-2.3-test-maven-hadoop-2.6
   7  spark-branch-2.3-compile-maven-hadoop-2.7
  41  spark-master-lint
  23  spark-master-compile-maven-hadoop-2.6
  32  spark-master-compile-maven-hadoop-2.7
   5  spark-branch-2.2-compile-maven-hadoop-2.6
   3  spark-branch-2.2-test-maven-hadoop-2.7
   6  spark-branch-2.2-compile-maven-scala-2.10
   6  spark-master-test-maven-hadoop-2.6
   3  spark-master-test-maven-hadoop-2.7
  14  spark-branch-2.3-compile-maven-hadoop-2.6
   3  spark-branch-2.2-lint
   1  spark-branch-2.3-test-maven-hadoop-2.7

Timeouts by day:
2018-01-09    4
2018-01-10   13
2018-01-11   27
2018-01-12   74
2018-01-13    9
2018-01-14    2
2018-01-15    8
2018-01-16   34

Total builds: 4112
Total timeouts: 171
Percentage of all builds timing out: 4.15856031128

On Wed, Jan 10, 2018 at 9:54 AM, shane knapp  wrote:

> i just noticed we're starting to see the once-yearly rash of git timeouts
> when building.
>
> i'll be looking in to this today...  i'm at our lab retreat, so my
> attention will be divided during the day but i will report back here once i
> have some more information.
>
> in the meantime, if your jobs have a git timeout, please just retrigger
> them and we will hope for the best.
>
> shane
>


Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-16 Thread Ted Yu
Is there going to be another RC?

With KafkaContinuousSourceSuite hanging, it is hard to get the rest of the
tests going.

Cheers

On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen  wrote:

> The signatures and licenses look OK. Except for the missing k8s package,
> the contents look OK. Tests look pretty good with "-Phive -Phadoop-2.7
> -Pyarn" on Ubuntu 17.10, except that KafkaContinuousSourceSuite seems to
> hang forever. That was just fixed and needs to get into an RC?
>
> Aside from the Blockers just filed for R docs, etc., we have:
>
> Blocker:
> SPARK-23000 Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in
> Spark 2.3
> SPARK-23020 Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> SPARK-23051 job description in Spark UI is broken
>
> Critical:
> SPARK-22739 Additional Expression Support for Objects
>
> I actually don't think any of those Blockers should be Blockers; not sure
> if the last one is really critical either.
>
> I think this release will have to be re-rolled so I'd say -1 to RC1.
>
> On Fri, Jan 12, 2018 at 4:42 PM Sameer Agarwal wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc1:
>> https://github.com/apache/spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1261/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you can
>> add the staging repository to your projects resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.0?
>> ===
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>> appropriate.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said, if
>> there is something which is a regression from 2.2.0 that has not been
>> correctly targeted please ping me or a committer to help target the issue
>> (you can see the open issues listed as impacting Spark 2.3.0 at
>> https://s.apache.org/WmoI).
>>
>> ===
>> What are the unresolved issues targeted for 2.3.0?
>> ===
>>
>> Please see https://s.apache.org/oXKi. At the time of the writing, there
>> are 19 JIRA issues targeting 2.3.0 tracking various QA/audit tasks, test
>> failures and other feature/bugs. In particular, we've currently marked 3
>> JIRAs as release blockers that are being actively worked on:
>>
>> 1. SPARK-23051 that tracks a regression in the Spark UI
>> 2. SPARK-23020 and SPARK-23000 that track a couple of flaky tests that
>> are responsible for build failures. Additionally,
>> https://github.com/apache/spark/pull/20242 fixes a few Java linter
>> errors in RC1.
>>
>> Given that these blockers are fairly isolated, in the spirit of starting a
>> thorough QA early, this RC1 aims to serve as a good approximation of the
>> functionality of the final release.
>>
>> Regards,
>> Sameer
>>
>


Re: Distinct on Map data type -- SPARK-19893

2018-01-16 Thread Tejas Patil
There is a JIRA for making Map types orderable:
https://issues.apache.org/jira/browse/SPARK-18134. Given that this is a
non-trivial change, it will take time.
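
In the meantime, a workaround that tends to come up is to canonicalize
the map into an orderable surrogate column and deduplicate on that. A
hedged sketch (df, mapCol and the String key/value types below are
placeholders for your actual DataFrame and schema):

import org.apache.spark.sql.functions.{col, udf}

// Turn the (unorderable) map into a sorted "k=v" string so that rows can
// be deduplicated even though MapType itself does not support ordering.
val canonicalize = udf { m: Map[String, String] =>
  Option(m).map(_.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(","))
}

val deduped = df
  .withColumn("_mapKey", canonicalize(col("mapCol")))
  .dropDuplicates("_mapKey")
  .drop("_mapKey")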

On Sat, Jan 13, 2018 at 9:50 PM, ckhari4u  wrote:

> Wan, thanks a lot! I see the issue now.
>
> Do we have any JIRAs open for the future work to be done on this?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-16 Thread Sameer Agarwal
Yes, I'll cut an RC2 as soon as the remaining blockers are resolved. In the
meantime, please continue to report any other issues here.

Here's a quick update on progress towards the next RC:

- SPARK-22908 (KafkaContinuousSourceSuite) has been reverted
- SPARK-23051 (Spark UI), SPARK-23063 (k8s packaging) and SPARK-23065 (R
API docs) have all been resolved
- A fix for SPARK-23020 (SparkLauncherSuite) has been merged. We're
monitoring the builds to make sure that the flakiness has been resolved.



On 16 January 2018 at 13:21, Ted Yu  wrote:

> Is there going to be another RC?
>
> With KafkaContinuousSourceSuite hanging, it is hard to get the rest of
> the tests going.
>
> Cheers
>
> On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen  wrote:
>
>> The signatures and licenses look OK. Except for the missing k8s package,
>> the contents look OK. Tests look pretty good with "-Phive -Phadoop-2.7
>> -Pyarn" on Ubuntu 17.10, except that KafkaContinuousSourceSuite seems to
>> hang forever. That was just fixed and needs to get into an RC?
>>
>> Aside from the Blockers just filed for R docs, etc., we have:
>>
>> Blocker:
>> SPARK-23000 Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in
>> Spark 2.3
>> SPARK-23020 Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
>> SPARK-23051 job description in Spark UI is broken
>>
>> Critical:
>> SPARK-22739 Additional Expression Support for Objects
>>
>> I actually don't think any of those Blockers should be Blockers; not sure
>> if the last one is really critical either.
>>
>> I think this release will have to be re-rolled so I'd say -1 to RC1.
>>
>> On Fri, Jan 12, 2018 at 4:42 PM Sameer Agarwal wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00 am UTC
>>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc1:
>>> https://github.com/apache/spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1261/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>>> appropriate.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.2.0. That being said, if
>>> there is something which is a regression from 2.2.0 that has not been
>>> correctly targeted please ping me or a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.3.0 at
>>> https://s.apache.org/WmoI).
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the 

Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-16 Thread Holden Karau
So looking at
http://pgp.mit.edu/pks/lookup?op=vindex&search=0xA1CEDBA8AD0C022A
it seems like Sameer's key isn't in the Apache web of trust yet. This
shouldn't block the RC process, but before we publish it's important to
get the key into the Apache web of trust.

On Tue, Jan 16, 2018 at 3:00 PM, Sameer Agarwal wrote:

> Yes, I'll cut an RC2 as soon as the remaining blockers are resolved. In
> the meantime, please continue to report any other issues here.
>
> Here's a quick update on progress towards the next RC:
>
> - SPARK-22908 (KafkaContinuousSourceSuite) has been reverted
> - SPARK-23051 (Spark UI), SPARK-23063 (k8s packaging) and SPARK-23065 (R
> API docs) have all been resolved
> - A fix for SPARK-23020 (SparkLauncherSuite) has been merged. We're
> monitoring the builds to make sure that the flakiness has been resolved.
>
>
>
> On 16 January 2018 at 13:21, Ted Yu  wrote:
>
>> Is there going to be another RC?
>>
>> With KafkaContinuousSourceSuite hanging, it is hard to get the rest of
>> the tests going.
>>
>> Cheers
>>
>> On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen  wrote:
>>
>>> The signatures and licenses look OK. Except for the missing k8s package,
>>> the contents look OK. Tests look pretty good with "-Phive -Phadoop-2.7
>>> -Pyarn" on Ubuntu 17.10, except that KafkaContinuousSourceSuite seems
>>> to hang forever. That was just fixed and needs to get into an RC?
>>>
>>> Aside from the Blockers just filed for R docs, etc., we have:
>>>
>>> Blocker:
>>> SPARK-23000 Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in
>>> Spark 2.3
>>> SPARK-23020 Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
>>> SPARK-23051 job description in Spark UI is broken
>>>
>>> Critical:
>>> SPARK-22739 Additional Expression Support for Objects
>>>
>>> I actually don't think any of those Blockers should be Blockers; not
>>> sure if the last one is really critical either.
>>>
>>> I think this release will have to be re-rolled so I'd say -1 to RC1.
>>>
>>> On Fri, Jan 12, 2018 at 4:42 PM Sameer Agarwal wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00
 am UTC and passes if a majority of at least 3 PMC +1 votes are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc1:
 https://github.com/apache/spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)

 List of JIRA tickets resolved in this release can be found here:
 https://issues.apache.org/jira/projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1261/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html


 FAQ

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala you
 can add the staging repository to your projects resolvers and test with the
 RC (make sure to clean up the artifact cache before/after so you don't end
 up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.0?
 ===

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
 appropriate.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.2.0. That being
 said, i