Re: Apache Spark 2.3.3

2019-01-16 Thread Takeshi Yamamuro
Hi, all

I took some time to check the recent Jenkins test failures in branch-2.3
(See https://github.com/apache/spark/pull/23507 for details).

I'm re-publishing a candidate now, so I think I'll start the first vote for
v2.3.3-rc1 in a few days,
once the Jenkins tests have passed.

Best,
Takeshi


On Sun, Jan 13, 2019 at 9:42 AM Takeshi Yamamuro 
wrote:

> Hi, all
>
> re-sent as a new thread;
> ===
> I'm planning to start the release vote for v2.3.3 at the start of next
> week.
> # I've already checked that all the tests pass in branch-2.3 and
> # that the release scripts run cleanly in a dry run.
>
> If there is any problem, please ping me.
>
> Best,
> Takeshi
>
> --
> ---
> Takeshi Yamamuro
>


-- 
---
Takeshi Yamamuro


How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
I know some implementations of model save/load in MLlib use an
explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some
just decide based on the version of Spark that wrote the model.

Is one or the other preferred?

See https://github.com/apache/spark/pull/23549#discussion_r248318392
for example. In cases like this, is it simpler still to just select
all the values written in the model and decide what to do based on the
presence or absence of columns? That seems a little more robust. It
wouldn't be so much an option if the contents or meaning of the
columns had changed.
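
For concreteness, a rough sketch of the column-presence approach (the column
names here are hypothetical, not taken from any actual MLlib model):

import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnBasedLoadSketch {
  def load(spark: SparkSession, path: String): DataFrame = {
    val data = spark.read.parquet(s"$path/data")
    if (data.columns.contains("bound")) {
      // Newer writers persisted the extra "bound" column; keep it as is.
      data.select("coefficients", "bound")
    } else {
      // Older writers did not write it; fill in a default instead.
      data.selectExpr("coefficients", "CAST(NULL AS DOUBLE) AS bound")
    }
  }
}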




RE: How to implement model versions in MLlib?

2019-01-16 Thread Ilya Matiach
Hi Sean and Jatin,
Could you point to some examples of load() methods that use the Spark version
vs. the model version (or the columns available)?
I see only cases where we use the Spark version (e.g.
https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239)

Logically, I think it would be better to use a separate versioning mechanism
for models than to rely on the Spark version; IMHO it is more reliable that way.
Especially since we sometimes patch Spark versions by merging fixes back,
it seems less reliable to depend on a specific Spark version in the code.
In addition, models don't change as often as the Spark version, and an explicit
versioning mechanism makes it clearer how often the saved model structure has
changed over time. Having said that, I think either approach could be
implemented without issues if the code is written carefully - but logically,
if I had to choose, I would prefer a separate versioning mechanism for models.
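For illustration, here is a minimal sketch of the kind of separate versioning
I have in mind (not MLlib's actual persistence code; the version constants and
file layout are just assumptions):

import org.apache.spark.sql.SparkSession

object ExplicitModelVersionSketch {
  val CurrentFormatVersion = "2.0"

  def save(spark: SparkSession, path: String): Unit = {
    import spark.implicits._
    // Persist an explicit format version alongside the model data.
    Seq((CurrentFormatVersion, System.currentTimeMillis()))
      .toDF("formatVersion", "timestamp")
      .write.json(s"$path/metadata")
  }

  def load(spark: SparkSession, path: String): Unit = {
    val formatVersion =
      spark.read.json(s"$path/metadata").head().getAs[String]("formatVersion")
    formatVersion match {
      case "1.0" => // read the old layout here
      case "2.0" => // read the current layout here
      case other =>
        throw new IllegalArgumentException(
          s"Unsupported model format version: $other")
    }
  }
}
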
Thank you, Ilya



-Original Message-
From: Sean Owen  
Sent: Wednesday, January 16, 2019 10:12 AM
To: dev 
Cc: Jatin Puri 
Subject: How to implement model versions in MLlib?

I know some implementations of model save/load in MLlib use an explicit version 
1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the 
version of Spark that wrote the model.

Is one or the other preferred?

See 
https://github.com/apache/spark/pull/23549#discussion_r248318392
for example. In cases like this, is it simpler still to just select all the 
values written in the model and decide what to do based on the presence or 
absence of columns? That seems a little more robust. It wouldn't be so much an 
option if the contents or meaning of the columns had changed.




Re: How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
I'm thinking of mechanisms like:
https://github.com/apache/spark/blob/c5daccb1dafca528ccb4be65d63c943bf9a7b0f2/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L99
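
That is, something along these lines (a standalone sketch of the pattern, not
the FPGrowth code itself; the column names and the 2.4 cutoff below are
hypothetical):

object SparkVersionBasedLoadSketch {
  // Parse "major.minor" out of a version string such as "2.3.2".
  def majorMinor(sparkVersion: String): (Int, Int) = {
    val parts = sparkVersion.split("\\.")
    (parts(0).toInt, parts(1).toInt)
  }

  def selectCountColumn(sparkVersionOfWriter: String): String = {
    val (major, minor) = majorMinor(sparkVersionOfWriter)
    // Hypothetical: suppose models written before Spark 2.4 used a column
    // named "count" and later ones renamed it to "numTrainingRecords".
    if (major < 2 || (major == 2 && minor < 4)) "count"
    else "numTrainingRecords"
  }
}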

On Wed, Jan 16, 2019 at 9:46 AM Ilya Matiach  wrote:
>
> Hi Sean and Jatin,
> Could you point to some examples of load() methods that use the Spark version
> vs. the model version (or the columns available)?
> I see only cases where we use the Spark version (e.g.
> https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239)
>
> Logically, I think it would be better to use a separate versioning mechanism
> for models than to rely on the Spark version; IMHO it is more reliable that
> way.
> Especially since we sometimes patch Spark versions by merging fixes back,
> it seems less reliable to depend on a specific Spark version in the code.
> In addition, models don't change as often as the Spark version, and an
> explicit versioning mechanism makes it clearer how often the saved model
> structure has changed over time. Having said that, I think either approach
> could be implemented without issues if the code is written carefully - but
> logically, if I had to choose, I would prefer a separate versioning
> mechanism for models.
> Thank you, Ilya
>
>
>
> -Original Message-
> From: Sean Owen 
> Sent: Wednesday, January 16, 2019 10:12 AM
> To: dev 
> Cc: Jatin Puri 
> Subject: How to implement model versions in MLlib?
>
> I know some implementations of model save/load in MLlib use an explicit 
> version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based 
> on the version of Spark that wrote the model.
>
> Is one or the other preferred?
>
> See 
> https://github.com/apache/spark/pull/23549#discussion_r248318392
> for example. In cases like this, is it simpler still to just select all the 
> values written in the model and decide what to do based on the presence or 
> absence of columns? That seems a little more robust. It wouldn't be so much 
> an option if the contents or meaning of the columns had changed.
>
>




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
+1 for what Marcelo and Hyukjin said.

In particular, I agree that we can't expect Hive to release a version that
is now more than 3 years old just to solve a problem for Spark. Maybe that
would have been a reasonable ask instead of publishing a fork years ago,
but I think this is now Spark's problem.

On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin  wrote:

> +1 to that. HIVE-16391 by itself means we're giving up things like
> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
> problem that we created.
>
> The current PR is basically a Spark-side fix for that bug. It does
> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
> it's really the right path to take here.
>
> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
> >
> > Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the
> > fixes from our Hive fork (correct me if I am mistaken).
> >
> > To be honest, and as a personal opinion, that basically asks Hive to
> > take care of Spark's dependency.
> > Hive looks to be moving ahead to 3.1.x, and no one would use a new
> > 1.2.x release. In practice, Spark doesn't make 1.6.x releases anymore,
> > for instance.
> >
> > Frankly, my impression is that it's our mistake to fix. Since the
> > Spark community is big enough, I think we should try to fix it
> > ourselves first.
> > I am not saying upgrading is the only way to get through this, but I
> > think we should at least try first and see what's next.
> >
> > It does sound riskier to upgrade it on our side, but I think it's
> > worth checking and trying to see if it's possible.
> > I think upgrading the dependency is a more standard approach than
> > using the fork or asking the Hive side to release another 1.2.x.
> >
> > If we fail to upgrade it for critical or unavoidable reasons, yes, we
> > could find an alternative, but that basically means we're going to
> > stay on 1.2.x for a long time (say, until Spark 4.0.0?).
> >
> > I know this has become a sensitive topic but, to be honest, I think
> > we should give it a try.
> >
>
>
> --
> Marcelo
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Xiao Li
Thanks for your feedback!

I am working with Yuming to reduce the risks to stability and quality, and
will keep you posted when the proposal is ready.

Cheers,

Xiao

On Wed, Jan 16, 2019 at 9:27 AM Ryan Blue  wrote:

> +1 for what Marcelo and Hyukjin said.
>
> In particular, I agree that we can't expect Hive to release a version that
> is now more than 3 years old just to solve a problem for Spark. Maybe that
> would have been a reasonable ask instead of publishing a fork years ago,
> but I think this is now Spark's problem.
>
> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
> wrote:
>
>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> problem that we created.
>>
>> The current PR is basically a Spark-side fix for that bug. It does
>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> it's really the right path to take here.
>>
>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>> >
>> > Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the
>> > fixes from our Hive fork (correct me if I am mistaken).
>> >
>> > To be honest, and as a personal opinion, that basically asks Hive to
>> > take care of Spark's dependency.
>> > Hive looks to be moving ahead to 3.1.x, and no one would use a new
>> > 1.2.x release. In practice, Spark doesn't make 1.6.x releases anymore,
>> > for instance.
>> >
>> > Frankly, my impression is that it's our mistake to fix. Since the
>> > Spark community is big enough, I think we should try to fix it
>> > ourselves first.
>> > I am not saying upgrading is the only way to get through this, but I
>> > think we should at least try first and see what's next.
>> >
>> > It does sound riskier to upgrade it on our side, but I think it's
>> > worth checking and trying to see if it's possible.
>> > I think upgrading the dependency is a more standard approach than
>> > using the fork or asking the Hive side to release another 1.2.x.
>> >
>> > If we fail to upgrade it for critical or unavoidable reasons, yes, we
>> > could find an alternative, but that basically means we're going to
>> > stay on 1.2.x for a long time (say, until Spark 4.0.0?).
>> >
>> > I know this has become a sensitive topic but, to be honest, I think
>> > we should give it a try.
>> >
>>
>>
>> --
>> Marcelo
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


There is no way to force partition discovery if _spark_metadata exists

2019-01-16 Thread Dmitry
Hello,
I have a two-stage processing pipeline:
1. A Spark Streaming job receives data from Kafka and saves it to
partitioned ORC.
2. A Spark ETL job runs once per day and compacts each partition (I have
two partitioning variables, e.g. dt=20180529/location=mumbai), merging
small files into bigger ones. The argument to the compactor job is the
full path to a partition, so the compactor job cannot update the metadata.
The next time I try to read this table as ORC (reading it as a Hive table
works), Spark reads the metadata directory, finds the structure of the ORC
table (the partitions and the files placed in them), tries to read some
file, and fails with "file not found", because the compactor job has
already removed that file and merged it into another one. I see three
workarounds:
1. Remove _spark_metadata manually.
2. Modify the compactor job so that it updates the metadata.
3. Find a configuration property that turns on ignoring the Spark metadata.
Options 1 and 2 are fine, but I may not have the access rights for them.
So does option 3 exist? (I checked
https://github.com/apache/spark/blob/56e9e97073cf1896e301371b3941c9307e42ff77/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L199
and could not find any such property.) If it doesn't, I think it should be
added to Spark in some way. Maybe it should not be a global property but a
per-query property.
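
In case it is useful, here is a minimal sketch of workaround 1 (no new Spark
property involved, just deleting the metadata log before the batch read; it
assumes delete permission on the directory, which, as noted above, I may not
have, and the table path is a placeholder):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropStreamingMetadataSketch {
  def readWithoutMetadata(spark: SparkSession, tablePath: String): DataFrame = {
    val metadataDir = new Path(tablePath, "_spark_metadata")
    val fs = metadataDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(metadataDir)) {
      fs.delete(metadataDir, true)  // recursively delete the metadata log
    }
    // With the log gone, the read should fall back to ordinary file listing
    // and partition discovery (dt=... / location=... become partition columns).
    spark.read.format("orc").load(tablePath)
  }
}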