There is no way to force partition discovery if _spark_metadata exists

2019-01-16 Thread Dmitry
Hello, I have a two-stage processing pipeline: 1. A Spark Streaming job receives data from Kafka and saves it as partitioned ORC. 2. A Spark ETL job runs once per day and compacts each partition (I have two partitioning variables, dt=20180529/location=mumbai), merging small files into bigger…
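A minimal sketch of the daily compaction step described above, assuming the ORC data lives under a base path partitioned by dt and location. The base path, partition values, target file count, and the `_compacted` side directory are all illustrative, not from the original message; it needs a Spark runtime to execute.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactPartition {
  // Builds the directory for one partition,
  // e.g. /data/events/dt=20180529/location=mumbai
  def partitionPath(base: String, dt: String, location: String): String =
    s"$base/dt=$dt/location=$location"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-compaction")
      .getOrCreate()

    val src = partitionPath("/data/events", "20180529", "mumbai")

    // Read the many small ORC files written by the streaming job and
    // rewrite them as a few larger files. Writing to a side directory and
    // swapping it in afterwards avoids losing data if the job dies mid-write.
    spark.read.orc(src)
      .coalesce(4)
      .write
      .mode(SaveMode.Overwrite)
      .orc(src + "_compacted")

    spark.stop()
  }
}
```

Note that overwriting a directory the streaming sink also tracks via `_spark_metadata` is exactly where the original poster's problem arises: the file-sink metadata log, not a directory listing, is treated as the source of truth for that path.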

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Xiao Li
Thanks for your feedback! I am working with Yuming to reduce the risks to stability and quality. Will keep you posted when the proposal is ready. Cheers, Xiao. Ryan Blue wrote on Wed, Jan 16, 2019 at 9:27 AM: > +1 for what Marcelo and Hyukjin said. > > In particular, I agree that we can't expect Hive to release a…

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
+1 for what Marcelo and Hyukjin said. In particular, I agree that we can't expect Hive to release a version that is now more than 3 years old just to solve a problem for Spark. Maybe that would have been a reasonable ask instead of publishing a fork years ago, but I think this is now Spark's problem…

Re: How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
I'm thinking of mechanisms like: https://github.com/apache/spark/blob/c5daccb1dafca528ccb4be65d63c943bf9a7b0f2/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L99 On Wed, Jan 16, 2019 at 9:46 AM Ilya Matiach wrote: > > Hi Sean and Jatin, > Could you point to some examples of load()…

RE: How to implement model versions in MLlib?

2019-01-16 Thread Ilya Matiach
Hi Sean and Jatin, Could you point to some examples of load() methods that use the Spark version vs. the model version (or the columns available)? I see only cases where we use the Spark version (e.g. https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/…

How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
I know some implementations of model save/load in MLlib use an explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the version of Spark that wrote the model. Is one or the other preferred? See https://github.com/apache/spark/pull/23549#discussion_r248318392 for…
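The explicit-version mechanism mentioned above can be sketched in plain Scala: the saved metadata carries a format version string, and load() dispatches on it. All names here (the model class, its fields, the version values) are made up for illustration; real MLlib loaders read the format version from a JSON metadata file saved alongside the model data.

```scala
// Illustrative model with two hyperparameters.
case class FPModel(minSupport: Double, numItems: Int)

// Dispatch on the format version recorded at save time, not on the
// Spark version that wrote the model.
def loadFPModel(formatVersion: String, fields: Map[String, String]): FPModel =
  formatVersion match {
    case "1.0" =>
      // Suppose v1.0 metadata predates the numItems field: default it.
      FPModel(fields("minSupport").toDouble, numItems = 0)
    case "2.0" =>
      FPModel(fields("minSupport").toDouble, fields("numItems").toInt)
    case other =>
      sys.error(s"Unsupported model format version: $other")
  }
```

The advantage of an explicit format version over keying on the writing Spark version is that the on-disk format can stay stable across many Spark releases, and the loader only grows a new branch when the format itself actually changes.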

Re: Apache Spark 2.3.3

2019-01-16 Thread Takeshi Yamamuro
Hi, all. I took some time to check the recent Jenkins test failures in branch-2.3 (see https://github.com/apache/spark/pull/23507 for details). I'm re-publishing a candidate now, so I think I'll start a first vote for v2.3.3-rc1 in a few days, after the Jenkins tests are checked. Best, Takeshi On S…