Re: Apache Spark 2.3.3
Hi all,

I took some time to check the recent Jenkins test failures in branch-2.3 (see https://github.com/apache/spark/pull/23507 for details). I'm re-publishing a candidate now, so I think I'll start the first vote for v2.3.3-rc1 in a few days, once the Jenkins tests pass.

Best,
Takeshi

On Sun, Jan 13, 2019 at 9:42 AM Takeshi Yamamuro wrote:
> Hi all,
>
> Re-sent as a new thread:
> ===
> I'm planning to start the release vote for v2.3.3 at the start of next week.
> # I've already checked that all the tests pass in branch-2.3 and
> # that the release scripts run without problems in dry-run.
>
> If there is any problem, please ping me.
>
> Best,
> Takeshi
>
> --
> ---
> Takeshi Yamamuro

--
---
Takeshi Yamamuro
How to implement model versions in MLlib?
I know some implementations of model save/load in MLlib use an explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the version of Spark that wrote the model.

Is one or the other preferred?

See https://github.com/apache/spark/pull/23549#discussion_r248318392 for example. In cases like this, is it simpler still to just select all the values written in the model and decide what to do based on the presence or absence of columns? That seems a little more robust. It wouldn't be so much an option if the contents or meaning of the columns had changed.
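A minimal Scala sketch of the column-presence idea above; all names here (SimpleModelLoader, the coefficients/intercept columns, the path layout) are illustrative assumptions, not actual MLlib code:

```scala
// Hypothetical loader that branches on which columns exist in the saved data,
// rather than on an explicit format version. Column and object names are
// illustrative only.
import org.apache.spark.sql.SparkSession

object SimpleModelLoader {
  // Returns (coefficients, optional intercept); the intercept column is assumed
  // to have been added in a later release, so older saves may not contain it.
  def load(spark: SparkSession, path: String): (Array[Double], Option[Double]) = {
    val data = spark.read.parquet(path + "/data")
    val row = data.head()
    val coefficients = row.getAs[Seq[Double]]("coefficients").toArray
    val intercept =
      if (data.columns.contains("intercept")) Some(row.getAs[Double]("intercept"))
      else None  // model written by an older version: fall back to "no intercept"
    (coefficients, intercept)
  }
}
```

As the message notes, this only works while existing columns keep their meaning; a presence check alone cannot distinguish old data from new if a column's semantics change.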
RE: How to implement model versions in MLlib?
Hi Sean and Jatin,

Could you point to some examples of load() methods that use the Spark version vs. the model version (or the columns available)? I only see cases where we use the Spark version (e.g. https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239).

Logically, I think it would be better to use a separate versioning mechanism for models than to rely on the Spark version; IMHO it is more reliable that way. Especially since we sometimes patch Spark versions by merging fixes back, it seems less reliable to depend on a specific Spark version in the code. In addition, models don't change as frequently as the Spark version, and an explicit versioning mechanism makes it clearer how often the saved model structure has changed over time. Having said that, I think either approach could be implemented without issues if the code is written carefully - but if I had to choose, I would prefer a separate versioning mechanism for models.

Thank you,
Ilya

-----Original Message-----
From: Sean Owen
Sent: Wednesday, January 16, 2019 10:12 AM
To: dev
Cc: Jatin Puri
Subject: How to implement model versions in MLlib?

I know some implementations of model save/load in MLlib use an explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the version of Spark that wrote the model.

Is one or the other preferred?

See https://github.com/apache/spark/pull/23549#discussion_r248318392 for example. In cases like this, is it simpler still to just select all the values written in the model and decide what to do based on the presence or absence of columns? That seems a little more robust. It wouldn't be so much an option if the contents or meaning of the columns had changed.
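For contrast, a rough sketch of the Spark-version-based pattern that the LogisticRegression loader linked above follows; the helper below is a simplified stand-in, not the actual ml.util code:

```scala
// Simplified stand-in for version-gated loading based on the Spark version
// recorded at save time. The real MLlib metadata handling is more involved.
object SparkVersionGate {
  // Parse "major.minor" out of strings like "2.4.0" or "3.0.0-SNAPSHOT".
  def majorMinor(sparkVersion: String): (Int, Int) = {
    val parts = sparkVersion.split("\\.")
    (parts(0).toInt, parts(1).toInt)
  }

  // Decide which on-disk layout to expect for a model saved by `savedVersion`.
  def layoutFor(savedVersion: String): String = {
    val (major, minor) = majorMinor(savedVersion)
    if (major > 2 || (major == 2 && minor >= 3)) "layout-with-new-fields"
    else "legacy-layout"
  }
}
```

The concern raised above is that this couples the model format to release numbering, whereas an explicit format version changes exactly when the saved layout does.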
Re: How to implement model versions in MLlib?
I'm thinking of mechanisms like:
https://github.com/apache/spark/blob/c5daccb1dafca528ccb4be65d63c943bf9a7b0f2/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L99

On Wed, Jan 16, 2019 at 9:46 AM Ilya Matiach wrote:
>
> Hi Sean and Jatin,
>
> Could you point to some examples of load() methods that use the Spark version vs. the model version (or the columns available)? I only see cases where we use the Spark version (e.g. https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239).
>
> Logically, I think it would be better to use a separate versioning mechanism for models than to rely on the Spark version; IMHO it is more reliable that way. Especially since we sometimes patch Spark versions by merging fixes back, it seems less reliable to depend on a specific Spark version in the code. In addition, models don't change as frequently as the Spark version, and an explicit versioning mechanism makes it clearer how often the saved model structure has changed over time. Having said that, I think either approach could be implemented without issues if the code is written carefully - but if I had to choose, I would prefer a separate versioning mechanism for models.
>
> Thank you,
> Ilya
>
> -----Original Message-----
> From: Sean Owen
> Sent: Wednesday, January 16, 2019 10:12 AM
> To: dev
> Cc: Jatin Puri
> Subject: How to implement model versions in MLlib?
>
> I know some implementations of model save/load in MLlib use an explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the version of Spark that wrote the model.
>
> Is one or the other preferred?
>
> See https://github.com/apache/spark/pull/23549#discussion_r248318392 for example. In cases like this, is it simpler still to just select all the values written in the model and decide what to do based on the presence or absence of columns? That seems a little more robust. It wouldn't be so much an option if the contents or meaning of the columns had changed.
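The FPGrowth-style mechanism linked above boils down to writing a small metadata record with a format version and matching on it at load time. A hedged sketch of that shape, with assumed names and layout rather than the real code:

```scala
// Sketch of explicit format versioning: the saver records a version string in
// a metadata file; the loader matches on it. Paths and field names are assumed.
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods._

object ExplicitVersionLoader {
  def load(spark: SparkSession, path: String): Unit = {
    implicit val formats: Formats = DefaultFormats
    val metadataJson = spark.sparkContext.textFile(path + "/metadata").first()
    val version = (parse(metadataJson) \ "formatVersion").extract[String]
    version match {
      case "1.0" => // read the original on-disk layout
      case "2.0" => // read the later layout, e.g. with extra columns
      case other =>
        throw new IllegalArgumentException(s"Unsupported model format version: $other")
    }
  }
}
```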
Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
+1 for what Marcelo and Hyukjin said.

In particular, I agree that we can't expect Hive to release a version that is now more than 3 years old just to solve a problem for Spark. Maybe that would have been a reasonable ask instead of publishing a fork years ago, but I think this is now Spark's problem.

On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin wrote:
> +1 to that. HIVE-16391 by itself means we're giving up things like
> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
> problem that we created.
>
> The current PR is basically a Spark-side fix for that bug. It does
> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
> it's really the right path to take here.
>
> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon wrote:
> >
> > Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the fixes from our Hive fork (correct me if I am mistaken).
> >
> > To be honest, and just as a personal opinion, that basically asks Hive to take care of Spark's dependency.
> > Hive looks to be moving ahead with 3.1.x, and no one would use a newer 1.2.x release. For comparison, Spark doesn't make 1.6.x releases anymore either.
> >
> > Frankly, my impression is that this is, honestly, our mistake to fix. Since the Spark community is big enough, I think we should try to fix it ourselves first.
> > I am not saying upgrading is the only way to get through this, but I think we should at least try it first and see what comes next.
> >
> > It does, yes, sound riskier to upgrade on our side, but I think it's worth checking and trying to see whether it's possible.
> > I think upgrading the dependency is a more standard approach than using the fork or asking the Hive side to release another 1.2.x.
> >
> > If we fail to upgrade for critical or unavoidable reasons, then yes, we could find an alternative, but that basically means
> > we're going to stay on 1.2.x for, at least, a long time (say... until Spark 4.0.0?).
> >
> > I know this has somehow become a sensitive topic, but to be literally honest with myself, I think we should give it a try.
>
>
> --
> Marcelo

--
Ryan Blue
Software Engineer
Netflix
Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
Thanks for your feedback! I'm working with Yuming to reduce the stability and quality risks. I will keep you posted when the proposal is ready.

Cheers,
Xiao

On Wed, Jan 16, 2019 at 9:27 AM, Ryan Blue wrote:
> +1 for what Marcelo and Hyukjin said.
>
> In particular, I agree that we can't expect Hive to release a version that is now more than 3 years old just to solve a problem for Spark. Maybe that would have been a reasonable ask instead of publishing a fork years ago, but I think this is now Spark's problem.
>
> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin wrote:
>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> problem that we created.
>>
>> The current PR is basically a Spark-side fix for that bug. It does
>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> it's really the right path to take here.
>>
>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon wrote:
>> >
>> > Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the fixes from our Hive fork (correct me if I am mistaken).
>> >
>> > To be honest, and just as a personal opinion, that basically asks Hive to take care of Spark's dependency.
>> > Hive looks to be moving ahead with 3.1.x, and no one would use a newer 1.2.x release. For comparison, Spark doesn't make 1.6.x releases anymore either.
>> >
>> > Frankly, my impression is that this is, honestly, our mistake to fix. Since the Spark community is big enough, I think we should try to fix it ourselves first.
>> > I am not saying upgrading is the only way to get through this, but I think we should at least try it first and see what comes next.
>> >
>> > It does, yes, sound riskier to upgrade on our side, but I think it's worth checking and trying to see whether it's possible.
>> > I think upgrading the dependency is a more standard approach than using the fork or asking the Hive side to release another 1.2.x.
>> >
>> > If we fail to upgrade for critical or unavoidable reasons, then yes, we could find an alternative, but that basically means
>> > we're going to stay on 1.2.x for, at least, a long time (say... until Spark 4.0.0?).
>> >
>> > I know this has somehow become a sensitive topic, but to be literally honest with myself, I think we should give it a try.
>>
>> --
>> Marcelo
>
> --
> Ryan Blue
> Software Engineer
> Netflix
There is no way to force partition discovery if _spark_metadata exists
Hello,

I have a two-stage processing pipeline:

1. A Spark Streaming job receives data from Kafka and saves it to partitioned ORC.
2. A Spark ETL job runs once per day and compacts each partition, merging small files into bigger ones. I partition by two variables, e.g. dt=20180529/location=mumbai. The argument to the compactor job is the full path to a partition, so the compactor job cannot update the metadata.

The next time I want to read this table as ORC (reading it as a Hive table works), Spark reads the _spark_metadata directory, finds the structure of the ORC table (the partitions and the files placed in them), tries to read one of those files, and fails with "file not found" because the compactor job has already removed that file and merged it into another one.

I see three workarounds:

1. Remove _spark_metadata manually.
2. Modify the compactor job so that it updates the metadata.
3. Find a configuration property that turns on ignoring the Spark metadata.

Options 1 and 2 would work, but I may not have the access rights for them. So does option 3 exist? I checked https://github.com/apache/spark/blob/56e9e97073cf1896e301371b3941c9307e42ff77/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L199 and could not find any such property. If it does not exist, I think it should be added to Spark in some form. Maybe it should not be a global property but a per-query option.
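Not a real configuration switch, but one more possible workaround sketch, assuming you can enumerate the compacted partition directories yourself: point the reader at the partition paths directly and pass the table root via the basePath option, so partition columns are still recovered while the _spark_metadata directory under the root is not visited. Paths below are illustrative.

```scala
// Workaround sketch (paths are illustrative): read partition directories
// directly so the _spark_metadata folder at the table root is bypassed.
// The basePath option keeps dt and location as partition columns.
import org.apache.spark.sql.SparkSession

object ReadCompactedOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-compacted-orc").getOrCreate()

    val df = spark.read
      .option("basePath", "/data/events")                 // table root (contains _spark_metadata)
      .orc("/data/events/dt=20180529/location=mumbai")    // one or more partition paths
    df.printSchema()  // schema includes dt and location
    spark.stop()
  }
}
```

Whether this is practical depends on how many partitions a query needs; it does not replace a proper per-query option in Spark itself.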