Re: [SQL] parse_url does not work for Internationalized domain names ?

2018-01-12 Thread yash datta
Thanks for the prompt reply!

Opened a ticket here: https://issues.apache.org/jira/browse/SPARK-23056


BR
Yash

On Fri, Jan 12, 2018 at 3:41 PM, StanZhai  wrote:

> This problem was introduced by a change that was designed to
> improve the performance of PARSE_URL().
>
> The same issue exists in the following SQL:
>
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
>
> // returns null in Spark 2.1+
> // returns ["abc"] in Spark versions before 2.1
> ```
>
> I think it's a regression.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


Re: Accessing the SQL parser

2018-01-12 Thread Michael Shtelma
Hi AbdealiJK,

In order to get the AST, you can parse your query with the Spark parser:

LogicalPlan logicalPlan = sparkSession.sessionState().sqlParser()
    .parsePlan("select * from myTable");

Afterwards, you can implement your custom logic and execute it in this way:

Dataset<Row> ds = Dataset.ofRows(sparkSession, logicalPlan);
ds.show();

Alternatively, you can manually resolve and optimize the plan and
maybe do something else afterwards:

QueryExecution queryExecution =
    sparkSession.sessionState().executePlan(logicalPlan);
SparkPlan plan = queryExecution.executedPlan();
RDD<InternalRow> rdd = plan.execute();
System.out.println("rdd.count() = " + rdd.count());
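
For illustration, if the goal (as in the question below) is to list the tables a
query references, a rough sketch along the same lines; note that this walks
Spark's internal Catalyst API, so class and method names may differ between
versions:

```java
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;

// Parse without resolving, then inspect the unresolved leaf relations.
LogicalPlan parsed = sparkSession.sessionState().sqlParser()
    .parsePlan("SELECT * FROM db1.myTable t JOIN otherTable o ON t.id = o.id");

scala.collection.Seq<LogicalPlan> leaves = parsed.collectLeaves();
for (int i = 0; i < leaves.size(); i++) {
  LogicalPlan leaf = leaves.apply(i);
  if (leaf instanceof UnresolvedRelation) {
    // TableIdentifier holds the optional database and the table name.
    System.out.println(((UnresolvedRelation) leaf).tableIdentifier());
  }
}
```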

Best,
Michael


On Fri, Jan 12, 2018 at 5:39 AM, Abdeali Kothari
 wrote:
> I was writing some code to try to auto find a list of tables and databases
> being used in a SparkSQL query. Mainly I was looking to auto-check the
> permissions and owners of all the tables a query will be trying to access.
>
> I was wondering whether PySpark has some method for me to directly use the
> AST that SparkSQL uses?
>
> Or is there some documentation on how I can generate and understand the AST
> in Spark?
>
> Regards,
> AbdealiJK
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Compiling Spark UDF at runtime

2018-01-12 Thread Michael Shtelma
Hi all,

I would like to be able to compile Spark UDFs at runtime. Right now I
am using Janino for that.
My problem is that, in order to make my compiled functions visible to
Spark, I have to set the Janino classloader (Janino gives me a classloader
with the compiled UDF classes) as the context classloader before I create
the Spark session. This approach works locally for debugging purposes,
but it is not going to work in cluster mode, because the UDF classes will
not be distributed to the worker nodes.
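
For reference, a minimal sketch of that Janino-based approach, with made-up
class and UDF names; it illustrates the local-only limitation rather than
solving it:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.codehaus.janino.SimpleCompiler;

// UDF source generated at runtime (raw UDF1 to keep the generated code simple).
String source =
    "public class MyUpper implements org.apache.spark.sql.api.java.UDF1 {\n"
  + "  public Object call(Object s) {\n"
  + "    return s == null ? null : s.toString().toUpperCase();\n"
  + "  }\n"
  + "}\n";

SimpleCompiler compiler = new SimpleCompiler();
compiler.cook(source);  // compiles MyUpper into an in-memory classloader

// The compiled class has to be visible before the session is created; this
// only affects the driver JVM, which is why it works in local mode only.
Thread.currentThread().setContextClassLoader(compiler.getClassLoader());

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

UDF1<String, String> udf = (UDF1<String, String>)
    compiler.getClassLoader().loadClass("MyUpper").newInstance();
spark.udf().register("my_upper", udf, DataTypes.StringType);

spark.sql("SELECT my_upper('spark') AS v").show();
```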

An alternative is to register the UDF via Hive functionality and generate
a temporary jar somewhere, which, at least in standalone cluster mode,
will be made available to Spark workers via the embedded HTTP server. As
far as I understand, this is not going to work in YARN mode.

I am wondering now how best to approach this problem. My current best
idea is to develop my own small Netty-based file web server and use it to
distribute my custom jar, which can be created on the fly, to workers in
both standalone and YARN modes. Can I reference the jar as an HTTP URL
using extra driver options and then register the UDFs contained in this
jar using the spark.udf().* methods?

Does anybody have any better ideas?
Any assistance would be greatly appreciated!

Thanks,
Michael

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Compiling Spark UDF at runtime

2018-01-12 Thread Georg Heiler
You could store the jar in HDFS. Then your proposed workaround should work
even in YARN cluster mode.
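
For illustration, a hedged sketch of that route, assuming the generated jar was
uploaded to HDFS and passed to spark-submit via --jars so the class ends up on
both the driver and executor classpaths (the path and class name below are made
up):

```java
// Assumes the job was launched with something like:
//   spark-submit --jars hdfs:///apps/udfs/generated-udfs.jar ...
// and that com.example.MyUpper implements org.apache.spark.sql.api.java.UDF1.
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession.builder().getOrCreate();

UDF1<String, String> udf = (UDF1<String, String>)
    Class.forName("com.example.MyUpper").newInstance();
spark.udf().register("my_upper", udf, DataTypes.StringType);

spark.sql("SELECT my_upper('spark') AS v").show();
```
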
Michael Shtelma  wrote on Fri, 12 Jan 2018 at 12:58:

> Hi all,
>
> I would like to be able to compile Spark UDF at runtime. Right now I
> am using Janino for that.
> My problem is, that in order to make my compiled functions visible to
> spark, I have to set janino classloader (janino gives me classloader
> with compiled UDF classes) as context class loader before I create
> Spark Session. This approach is working locally for debugging purposes
> but is not going to work in cluster mode, because the UDF classes will
> not be distributed to the worker nodes.
>
> An alternative is to register UDF via Hive functionality and generate
> temporary jar somewhere, which at least in Standalone cluster mode
> will be made available to spark workers using embedded http server. As
> far as I understand, this is not going to work in yarn mode.
>
> I am wondering now, how is it better to approach this problem? My
> current best idea is to develop own small netty based file web server
> and use it in order to distribute my custom jar, which can be created
> on the fly, to workers both in standalone and in yarn modes. Can I
> reference the jar in form  of http url using extra driver options and
> then register UDFs contained in this jar using spark.udf().* methods?
>
> Does anybody have any better ideas?
> Any assistance would be greatly appreciated!
>
> Thanks,
> Michael
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Schema Evolution in Apache Spark

2018-01-12 Thread Dongjoon Hyun
This is about Spark-layer test cases on **read-only** CSV, JSON, Parquet, and
ORC files. You can find more details and comparisons of Spark's support
coverage in the PR.
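
For illustration, a minimal sketch of the add-a-column and widen-a-type
scenarios listed in the quoted message below (the path is hypothetical; whether
each case actually passes per format and reader is exactly what the proposed
test suite is meant to pin down):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// Write data whose physical schema has a single INT column "a".
spark.range(3).selectExpr("CAST(id AS INT) AS a")
    .write().parquet("/tmp/schema_evolution_demo");

// Read it back with an "evolved" schema: "a" widened to BIGINT plus a new column "b".
StructType evolved = new StructType()
    .add("a", DataTypes.LongType)     // change a column type (safe widening)
    .add("b", DataTypes.StringType);  // added column, expected to come back as null
spark.read().schema(evolved).parquet("/tmp/schema_evolution_demo").show();
```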

Bests,
Dongjoon.


On Thu, Jan 11, 2018 at 22:19 Georg Heiler 
wrote:

> Isn't this related to the data format used, i.e. parquet, Avro, ... which
> already support changing schema?
>
> Dongjoon Hyun  wrote on Fri, 12 Jan 2018 at 02:30:
>
>> Hi, All.
>>
>> A data schema can evolve in several ways and Apache Spark 2.3 already
>> supports the followings for file-based data sources like
>> CSV/JSON/ORC/Parquet.
>>
>> 1. Add a column
>> 2. Remove a column
>> 3. Change a column position
>> 4. Change a column type
>>
>> Can we guarantee users some schema evolution coverage on file-based data
>> sources by adding schema evolution test suites explicitly? So far, there
>> are some test cases.
>>
>> For simplicity, I have several assumptions on schema evolution.
>>
>> 1. A safe evolution without data loss.
>> - e.g. from small types to larger types like int-to-long, not vice
>> versa.
>> 2. Final schema is given by users (or Hive)
>> 3. Simple Spark data types supported by Spark vectorized execution.
>>
>> I made a test case PR to receive your opinions for this.
>>
>> [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
>> data sources
>> - https://github.com/apache/spark/pull/20208
>>
>> Could you take a look and give some opinions?
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Kubernetes: why use init containers?

2018-01-12 Thread Marcelo Vanzin
On Fri, Jan 12, 2018 at 4:13 AM, Eric Charles  wrote:
>> Again, I don't see what is all this hoopla about fine grained control
>> of dependency downloads. Spark solved this years ago for Spark
>> applications. Don't reinvent the wheel.
>
> Init-containers are used today to download dependencies. I may be wrong and
> may open another facet of the discussion, but I see init container usage in
> a more generic way and not only restricted to dependencies download.

I'm not trying to discuss the general benefits of init containers as
they pertain to the Kubernetes framework. I'm sure they were added for
a reason.

I'm trying to discuss them in the restricted scope of the spark-on-k8s
integration. And there, there is a single use for the single init
container that the Spark code itself injects into the pod: downloading
dependencies, which is something that spark-submit already does.

There's an option to override that one init container image with
another, where you can completely change its behavior. Given that
there is no contract currently that explains how these images should
behave, doing so is very, very risky and might break the application
completely (e.g. because dependencies are now not being downloaded, or
placed in the wrong location).

And you can do the exact same thing by overriding the main Spark image
itself. Just run the same code in your custom entry point before the
Spark-provided entry point runs. Same results and caveats as above
apply.

So again, the specific init container used by spark-on-k8s, as far as
I can see, seems to cause more problems than it solves.

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-12 Thread Xin Lu
Seems like someone should investigate what caused the build time to go up
by an hour and whether or not that is expected.

On Thu, Jan 11, 2018 at 7:37 PM, Dongjoon Hyun 
wrote:

> Hi, All and Shane.
>
> Can we increase the build time for `branch-2.3` during 2.3 RC period?
>
> There are two known test issues, but the Jenkins on branch-2.3 with
> hadoop-2.7 fails with build timeout. So, it's difficult to monitor whether
> the branch is healthy or not.
>
> Build timed out (after 255 minutes). Marking the build as aborted.
> Build was aborted
> ...
> Finished: ABORTED
>
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%
> 20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%
> 20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console
>
> Bests,
> Dongjoon.
>


Re: Kubernetes: why use init containers?

2018-01-12 Thread Marcelo Vanzin
BTW I most probably will not have time to get back to this at any time
soon, so if anyone is interested in doing some clean up, I'll leave my
branch up.

I'm seriously thinking about proposing that we document the k8s
backend as experimental in 2.3; it seems there is still a lot to be
cleaned up in terms of user interface (as in extensibility and
customizability), documentation, and mainly testing, and we're pretty
far into the 2.3 cycle for all of those to be sorted out.

On Thu, Jan 11, 2018 at 8:19 AM, Anirudh Ramanathan
 wrote:
> If we can separate concerns those out, that might make sense in the short
> term IMO.
> There are several benefits to reusing spark-submit and spark-class as you
> pointed out previously,
> so, we should be looking to leverage those irrespective of how we do
> dependency management -
> in the interest of conformance with the other cluster managers.
>
> I like the idea of passing arguments through in a way that it doesn't
> trigger the dependency management code for now.
> In the interest of time for 2.3, if we could target the just that (and
> revisit the init containers afterwards),
> there should be enough time to make the change, test and release with
> confidence.
>
> On Wed, Jan 10, 2018 at 3:45 PM, Marcelo Vanzin  wrote:
>>
>> On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan
>>  wrote:
>> > We can start by getting a PR going perhaps, and start augmenting the
>> > integration testing to ensure that there are no surprises - with/without
>> > credentials, accessing GCS, S3 etc as well.
>> > When we get enough confidence and test coverage, let's merge this in.
>> > Does that sound like a reasonable path forward?
>>
>> I think it's beneficial to separate this into two separate things as
>> far as discussion goes:
>>
>> - using spark-submit: the code should definitely be starting the
>> driver using spark-submit, and potentially the executor using
>> spark-class.
>>
>> - separately, we can decide on whether to keep or remove init containers.
>>
>> Unfortunately, code-wise, those are not separate. If you get rid of
>> init containers, my current p.o.c. has most of the needed changes
>> (only lightly tested).
>>
>> But if you keep init containers, you'll need to mess with the
>> configuration so that spark-submit never sees spark.jars /
>> spark.files, so it doesn't trigger its dependency download code. (YARN
>> does something similar, btw.) That will surely mean different changes
>> in the current k8s code (which I wanted to double check anyway because
>> I remember seeing some oddities related to those configs in the logs).
>>
>> To comment on one point made by Andrew:
>> > there's almost a parallel here with spark.yarn.archive, where that
>> > configures the cluster (YARN) to do distribution pre-runtime
>>
>> That's more of a parallel to the docker image; spark.yarn.archive
>> points to a jar file with Spark jars in it so that YARN can make Spark
>> available to the driver / executors running in the cluster.
>>
>> Like the docker image, you could include other stuff that is not
>> really part of standard Spark in that archive too, or even not have
>> Spark at all there, if you want things to just fail. :-)
>>
>> --
>> Marcelo
>
>
>
>
> --
> Anirudh Ramanathan



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kubernetes: why use init containers?

2018-01-12 Thread Andrew Ash
+1 on the first release being marked experimental.  Many major features
coming into Spark in the past have gone through a stabilization process.

On Fri, Jan 12, 2018 at 1:18 PM, Marcelo Vanzin  wrote:

> BTW I most probably will not have time to get back to this at any time
> soon, so if anyone is interested in doing some clean up, I'll leave my
> branch up.
>
> I'm seriously thinking about proposing that we document the k8s
> backend as experimental in 2.3; it seems there still a lot to be
> cleaned up in terms of user interface (as in extensibility and
> customizability), documentation, and mainly testing, and we're pretty
> far into the 2.3 cycle for all of those to be sorted out.
>
> On Thu, Jan 11, 2018 at 8:19 AM, Anirudh Ramanathan
>  wrote:
> > If we can separate concerns those out, that might make sense in the short
> > term IMO.
> > There are several benefits to reusing spark-submit and spark-class as you
> > pointed out previously,
> > so, we should be looking to leverage those irrespective of how we do
> > dependency management -
> > in the interest of conformance with the other cluster managers.
> >
> > I like the idea of passing arguments through in a way that it doesn't
> > trigger the dependency management code for now.
> > In the interest of time for 2.3, if we could target the just that (and
> > revisit the init containers afterwards),
> > there should be enough time to make the change, test and release with
> > confidence.
> >
> > On Wed, Jan 10, 2018 at 3:45 PM, Marcelo Vanzin 
> wrote:
> >>
> >> On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan
> >>  wrote:
> >> > We can start by getting a PR going perhaps, and start augmenting the
> >> > integration testing to ensure that there are no surprises -
> with/without
> >> > credentials, accessing GCS, S3 etc as well.
> >> > When we get enough confidence and test coverage, let's merge this in.
> >> > Does that sound like a reasonable path forward?
> >>
> >> I think it's beneficial to separate this into two separate things as
> >> far as discussion goes:
> >>
> >> - using spark-submit: the code should definitely be starting the
> >> driver using spark-submit, and potentially the executor using
> >> spark-class.
> >>
> >> - separately, we can decide on whether to keep or remove init
> containers.
> >>
> >> Unfortunately, code-wise, those are not separate. If you get rid of
> >> init containers, my current p.o.c. has most of the needed changes
> >> (only lightly tested).
> >>
> >> But if you keep init containers, you'll need to mess with the
> >> configuration so that spark-submit never sees spark.jars /
> >> spark.files, so it doesn't trigger its dependency download code. (YARN
> >> does something similar, btw.) That will surely mean different changes
> >> in the current k8s code (which I wanted to double check anyway because
> >> I remember seeing some oddities related to those configs in the logs).
> >>
> >> To comment on one point made by Andrew:
> >> > there's almost a parallel here with spark.yarn.archive, where that
> >> > configures the cluster (YARN) to do distribution pre-runtime
> >>
> >> That's more of a parallel to the docker image; spark.yarn.archive
> >> points to a jar file with Spark jars in it so that YARN can make Spark
> >> available to the driver / executors running in the cluster.
> >>
> >> Like the docker image, you could include other stuff that is not
> >> really part of standard Spark in that archive too, or even not have
> >> Spark at all there, if you want things to just fail. :-)
> >>
> >> --
> >> Marcelo
> >
> >
> >
> >
> > --
> > Anirudh Ramanathan
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Kubernetes: why use init containers?

2018-01-12 Thread Anirudh Ramanathan
I'd like to discuss the criteria here for graduating from experimental
status (as a fork, we were mentioned in the documentation as experimental).

As I understand it, the bigger changes discussed here, like the init
containers, are more on the implementation side than user-facing or
behavioral changes - which is why it seemed okay to pursue them post-2.3
as well.

If the reasoning is mostly a lack of confidence in testing or documentation,
we are working on that, but we would love to have more visibility into what
we're missing, so we can prioritize and augment it, and maybe even get
there by this release - or at least have a clear path to graduating
in the next one.



On Jan 12, 2018 1:18 PM, "Marcelo Vanzin"  wrote:

> BTW I most probably will not have time to get back to this at any time
> soon, so if anyone is interested in doing some clean up, I'll leave my
> branch up.
>
> I'm seriously thinking about proposing that we document the k8s
> backend as experimental in 2.3; it seems there still a lot to be
> cleaned up in terms of user interface (as in extensibility and
> customizability), documentation, and mainly testing, and we're pretty
> far into the 2.3 cycle for all of those to be sorted out.
>
> On Thu, Jan 11, 2018 at 8:19 AM, Anirudh Ramanathan
>  wrote:
> > If we can separate concerns those out, that might make sense in the short
> > term IMO.
> > There are several benefits to reusing spark-submit and spark-class as you
> > pointed out previously,
> > so, we should be looking to leverage those irrespective of how we do
> > dependency management -
> > in the interest of conformance with the other cluster managers.
> >
> > I like the idea of passing arguments through in a way that it doesn't
> > trigger the dependency management code for now.
> > In the interest of time for 2.3, if we could target the just that (and
> > revisit the init containers afterwards),
> > there should be enough time to make the change, test and release with
> > confidence.
> >
> > On Wed, Jan 10, 2018 at 3:45 PM, Marcelo Vanzin 
> wrote:
> >>
> >> On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan
> >>  wrote:
> >> > We can start by getting a PR going perhaps, and start augmenting the
> >> > integration testing to ensure that there are no surprises -
> with/without
> >> > credentials, accessing GCS, S3 etc as well.
> >> > When we get enough confidence and test coverage, let's merge this in.
> >> > Does that sound like a reasonable path forward?
> >>
> >> I think it's beneficial to separate this into two separate things as
> >> far as discussion goes:
> >>
> >> - using spark-submit: the code should definitely be starting the
> >> driver using spark-submit, and potentially the executor using
> >> spark-class.
> >>
> >> - separately, we can decide on whether to keep or remove init
> containers.
> >>
> >> Unfortunately, code-wise, those are not separate. If you get rid of
> >> init containers, my current p.o.c. has most of the needed changes
> >> (only lightly tested).
> >>
> >> But if you keep init containers, you'll need to mess with the
> >> configuration so that spark-submit never sees spark.jars /
> >> spark.files, so it doesn't trigger its dependency download code. (YARN
> >> does something similar, btw.) That will surely mean different changes
> >> in the current k8s code (which I wanted to double check anyway because
> >> I remember seeing some oddities related to those configs in the logs).
> >>
> >> To comment on one point made by Andrew:
> >> > there's almost a parallel here with spark.yarn.archive, where that
> >> > configures the cluster (YARN) to do distribution pre-runtime
> >>
> >> That's more of a parallel to the docker image; spark.yarn.archive
> >> points to a jar file with Spark jars in it so that YARN can make Spark
> >> available to the driver / executors running in the cluster.
> >>
> >> Like the docker image, you could include other stuff that is not
> >> really part of standard Spark in that archive too, or even not have
> >> Spark at all there, if you want things to just fail. :-)
> >>
> >> --
> >> Marcelo
> >
> >
> >
> >
> > --
> > Anirudh Ramanathan
>
>
>
> --
> Marcelo
>


Re: Kubernetes: why use init containers?

2018-01-12 Thread Marcelo Vanzin
On Fri, Jan 12, 2018 at 1:53 PM, Anirudh Ramanathan
 wrote:
> As I understand, the bigger change discussed here are like the init
> containers, which will be more on the implementation side than a user facing
> change/behavioral change - which is why it seemed okay to pursue it post 2.3
> as well.

It's not just a code change.

There are multiple configurations exposed to control the init
container. There's a whole step - the running of the init container -
that currently can be customized (even though there is no
documentation on how to safely do that). If you ship that as "stable",
you cannot later change it in way that will break applications. So
you'd not only be stuck with the existence of the init container, but
with all its current behavior.

Marking as experimental gives us time to stabilize these details. Not
just whether the init container exists, but what is its actual
behavior and how the user can affect it. A lot of the replies here
always mention that init containers can be customized, but I just want
to point out again that there is currently zero documentation about
how to do that and not break the assumptions the spark-on-k8s
submission code makes.

The same applies to the other images, by the way.

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kubernetes: why use init containers?

2018-01-12 Thread Anirudh Ramanathan
That's fair - I guess it would be a stretch to assume users wouldn't put
custom logic in their init containers if that hook is provided to them. :)

Experimental sounds like a good idea for 2.3. Gives us enough wriggle room
for the next one, and hopefully user feedback in the meantime.

Thanks,
Anirudh

On Jan 12, 2018 2:00 PM, "Marcelo Vanzin"  wrote:

> On Fri, Jan 12, 2018 at 1:53 PM, Anirudh Ramanathan
>  wrote:
> > As I understand, the bigger change discussed here are like the init
> > containers, which will be more on the implementation side than a user
> facing
> > change/behavioral change - which is why it seemed okay to pursue it post
> 2.3
> > as well.
>
> It's not just a code change.
>
> There are multiple configurations exposed to control the init
> container. There's a whole step - the running of the init container -
> that currently can be customized (even though there is no
> documentation on how to safely do that). If you ship that as "stable",
> you cannot later change it in way that will break applications. So
> you'd not only be stuck with the existence of the init container, but
> with all its current behavior.
>
> Marking as experimental gives us time to stabilize these details. Not
> just whether the init container exists, but what is its actual
> behavior and how the use can affect it. A lot of the replies here
> always mention that init containers can be customized, but I just want
> to point out again that there is currently zero documentation about
> how to do that and not break the assumptions the spark-on-k8s
> submission code makes.
>
> The same applies to the other images, by the way.
>
> --
> Marcelo
>


[VOTE] Spark 2.3.0 (RC1)

2018-01-12 Thread Sameer Agarwal
Please vote on releasing the following candidate as Apache Spark version
2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00 am UTC
and passes if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc1:
https://github.com/apache/spark/tree/v2.3.0-rc1
(964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)

List of JIRA tickets resolved in this release can be found here:
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1261/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html


FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the
current RC and see if anything important breaks, in the Java/Scala you can
add the staging repository to your projects resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
appropriate.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.2.0. That being said, if
there is something which is a regression from 2.2.0 that has not been
correctly targeted please ping me or a committer to help target the issue
(you can see the open issues listed as impacting Spark 2.3.0 at
https://s.apache.org/WmoI).

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are
19 JIRA issues targeting 2.3.0, tracking various QA/audit tasks, test
failures, and other features/bugs. In particular, we've currently marked 3
JIRAs as release blockers that are being actively worked on:

1. SPARK-23051 that tracks a regression in the Spark UI
2. SPARK-23020 and SPARK-23000 that track a couple of flaky tests that are
responsible for build failures. Additionally,
https://github.com/apache/spark/pull/20242 fixes a few Java linter errors
in RC1.

Given that these blockers are fairly isolated, in the spirit of starting a
thorough QA early, this RC1 aims to serve as a good approximation of the
functionality of the final release.

Regards,
Sameer


Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-12 Thread Shixiong(Ryan) Zhu
FYI, we reverted a commit in
https://github.com/apache/spark/commit/55dbfbca37ce4c05f83180777ba3d4fe2d96a02e
to fix the issue.

On Fri, Jan 12, 2018 at 11:45 AM, Xin Lu  wrote:

> seems like someone should investigate what caused the build time to go up
> an hour and if it's expected or not.
>
> On Thu, Jan 11, 2018 at 7:37 PM, Dongjoon Hyun 
> wrote:
>
>> Hi, All and Shane.
>>
>> Can we increase the build time for `branch-2.3` during 2.3 RC period?
>>
>> There are two known test issues, but the Jenkins on branch-2.3 with
>> hadoop-2.7 fails with build timeout. So, it's difficult to monitor whether
>> the branch is healthy or not.
>>
>> Build timed out (after 255 minutes). Marking the build as aborted.
>> Build was aborted
>> ...
>> Finished: ABORTED
>>
>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>> t%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>> t%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console
>>
>> Bests,
>> Dongjoon.
>>
>
>


Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-12 Thread Dongjoon Hyun
For this issue, during SPARK-23028, Shane shared that the Jenkins
server-side limit is already higher than the script timeout.

1. Xiao Li first increased the timeout of the Spark test script for the
`master` branch in the following commit.

[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT


2. Marco Gaido reported a flaky test suite, and it turned out that the test
suite hangs (SPARK-23055).


3. Sameer Agarwal swiftly reverted it.

Thank you all!

Let's wait and see the dashboard.

Bests,
Dongjoon.



On Fri, Jan 12, 2018 at 3:22 PM, Shixiong(Ryan) Zhu  wrote:

> FYI, we reverted a commit in https://github.com/apache/spark/commit/
> 55dbfbca37ce4c05f83180777ba3d4fe2d96a02e to fix the issue.
>
> On Fri, Jan 12, 2018 at 11:45 AM, Xin Lu  wrote:
>
>> seems like someone should investigate what caused the build time to go up
>> an hour and if it's expected or not.
>>
>> On Thu, Jan 11, 2018 at 7:37 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All and Shane.
>>>
>>> Can we increase the build time for `branch-2.3` during 2.3 RC period?
>>>
>>> There are two known test issues, but the Jenkins on branch-2.3 with
>>> hadoop-2.7 fails with build timeout. So, it's difficult to monitor whether
>>> the branch is healthy or not.
>>>
>>> Build timed out (after 255 minutes). Marking the build as aborted.
>>> Build was aborted
>>> ...
>>> Finished: ABORTED
>>>
>>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>>> t%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
>>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>>> t%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>>
>


Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-12 Thread Anirudh Ramanathan
Felix just pointed out to me that staging is missing the spark-kubernetes
package. I think we missed updating release-build.sh, which is why staging
and the binary release are missing spark-kubernetes. Created SPARK-23063 to
track.

On Fri, Jan 12, 2018 at 2:42 PM, Sameer Agarwal  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00 am UTC
> and passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc1: https://github.com/apache/
> spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1261/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-
> docs/_site/index.html
>
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you can
> add the staging repository to your projects resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.0?
> ===
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.1 or 2.3.0 as
> appropriate.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 that has not been
> correctly targeted please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
> ===
> What are the unresolved issues targeted for 2.3.0?
> ===
>
> Please see https://s.apache.org/oXKi. At the time of the writing, there
> are 19 JIRA issues targeting 2.3.0 tracking various QA/audit tasks, test
> failures and other feature/bugs. In particular, we've currently marked 3
> JIRAs as release blockers that are being actively worked on:
>
> 1. SPARK-23051 that tracks a regression in the Spark UI
> 2. SPARK-23020 and SPARK-23000 that track a couple of flaky tests that are
> responsible for build failures. Additionally, https://github.com/apache/
> spark/pull/20242 fixes a few Java linter errors in RC1.
>
> Given that these blockers are fairly isolated, in the sprit of starting a
> thorough QA early, this RC1 aims to serve as a good approximation of the
> functionality of final release.
>
> Regards,
> Sameer
>



-- 
Anirudh Ramanathan


Distinct on Map data type -- SPARK-19893

2018-01-12 Thread ckhari4u
I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not see
a clear justification for why SPARK-19893 is important and needed. I have a
sample table which works fine with an earlier build of Spark 2.1.0. Now that
the latest build has the backport of SPARK-19893, it is failing with the
error:

Error in query: Cannot have map type columns in DataFrame which calls set
operations(intersect, except, etc.), but the type of column metrics is
map;;
Distinct


*In Old Build of Spark 2.1.0, I tried the below:*


create TABLE map_demo2
(
country_id BIGINT,
metrics MAP 
);

insert into table map_demo2 select 2,map("chaka",102) ;
insert into table map_demo2 select 3,map("chaka",102) ;
insert into table map_demo2 select 4,map("mangaa",103) ;


spark-sql> select distinct metrics from map_demo2;
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
create the Initialization Vector used by CryptoStream
{"mangaa":103}
{"chaka":102}
{"chaka":103}
Time taken: 15.331 seconds, Fetched 3 row(s)

Here the simple distinct query works fine in Spark. Any thoughts on why the
DISTINCT/EXCEPT/INTERSECT operators are not supported on map data types?
From the PR, it says:
// TODO: although map type is not orderable, technically map type should be
// able to be used in equality comparison, remove this type check once we
// support it.

I could not figure out the issue caused by using the aforementioned operators.





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Distinct on Map data type -- SPARK-19893

2018-01-12 Thread Wenchen Fan
Actually, Spark 2.1.0 doesn't work for your case; it may give you a wrong
result...
We are still working on adding this feature, but before that, we should
fail earlier instead of returning a wrong result.
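
For instance, a hypothetical illustration (not from this thread): the same
logical map built with a different entry order may compare as different at the
binary-row level, so a pre-SPARK-19893 build may fail to de-duplicate it.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

Dataset<Row> df = spark.sql(
    "SELECT map('chaka', 102, 'mangaa', 103) AS metrics " +
    "UNION ALL " +
    "SELECT map('mangaa', 103, 'chaka', 102) AS metrics");

// On builds without the check this runs and may keep both rows instead of one;
// on builds with SPARK-19893 it fails fast with the "Cannot have map type
// columns" error shown in the original message.
df.distinct().show();
```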

On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u  wrote:

> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
> see
> a clear justification for why SPARK 19893 is important and needed. I have a
> sample table which works fine with an earlier build of Spark 2.1.0. Now
> that
> the latest build is having the backport of SPARK-19893, its failing with
> error:
>
> Error in query: Cannot have map type columns in DataFrame which calls set
> operations(intersect, except, etc.), but the type of column metrics is
> map;;
> Distinct
>
>
> *In Old Build of Spark 2.1.0, I tried the below:*
>
>
> create TABLE map_demo2
> (
> country_id BIGINT,
> metrics MAP 
> );
>
> insert into table map_demo2 select 2,map("chaka",102) ;
> insert into table map_demo2 select 3,map("chaka",102) ;
> insert into table map_demo2 select 4,map("mangaa",103) ;
>
>
> spark-sql> select distinct metrics from map_demo2;
> [Stage 0:>  (0 + 4)
> / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
> create the Initialization Vector used by CryptoStream
> [Stage 1:===>   (1[Stage
> 1:===>   (1[Stage
> 1:==>(1
> {"mangaa":103}
> {"chaka":102}
> {"chaka":103}
> Time taken: 15.331 seconds, Fetched 3 row(s)
>
> Here the simple distinct query works fine in Spark. Any thoughts why
> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types.
> From the PR, it says,
> // TODO: although map type is not orderable, technically map type should be
> able to be
>  +  // used inequality comparison, remove this type check once we
> support it.
>
> Could not figure out the issue caused by using the aforementioned
> operators?
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Distinct on Map data type -- SPARK-19893

2018-01-12 Thread HariKrishnan CK
Hi Wenchen, could you please be more specific about the scenarios where it
will give wrong results? I checked the distinct and intersect operators in
many of the use cases I have and could not figure out a failure scenario
giving wrong results.

Thanks

On Jan 12, 2018 7:36 PM, "Wenchen Fan"  wrote:

Actually Spark 2.1.0 doesn't work for your case, it may give you wrong
result...
We are still working on adding this feature, but before that, we should
fail earlier instead of returning wrong result.

On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u  wrote:

> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
> see
> a clear justification for why SPARK 19893 is important and needed. I have a
> sample table which works fine with an earlier build of Spark 2.1.0. Now
> that
> the latest build is having the backport of SPARK-19893, its failing with
> error:
>
> Error in query: Cannot have map type columns in DataFrame which calls set
> operations(intersect, except, etc.), but the type of column metrics is
> map;;
> Distinct
>
>
> *In Old Build of Spark 2.1.0, I tried the below:*
>
>
> create TABLE map_demo2
> (
> country_id BIGINT,
> metrics MAP 
> );
>
> insert into table map_demo2 select 2,map("chaka",102) ;
> insert into table map_demo2 select 3,map("chaka",102) ;
> insert into table map_demo2 select 4,map("mangaa",103) ;
>
>
> spark-sql> select distinct metrics from map_demo2;
> [Stage 0:>  (0 + 4)
> / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
> create the Initialization Vector used by CryptoStream
> [Stage 1:
> 
> ===
> 
> >   (1[Stage
> 
> 1:===>   (1[Stage
> 1:==>(1
> {"mangaa":103}
> {"chaka":102}
> {"chaka":103}
> Time taken: 15.331 seconds, Fetched 3 row(s)
>
> Here the simple distinct query works fine in Spark. Any thoughts why
> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types.
> From the PR, it says,
> // TODO: although map type is not orderable, technically map type should be
> able to be
>  +  // used inequality comparison, remove this type check once we
> support it.
>
> Could not figure out the issue caused by using the aforementioned
> operators?
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>