Re: Start releasing the master branch

Stamatis Zampetakis Wed, 09 Mar 2022 08:01:36 -0800

I just logged HIVE-26022 [1] which seems to be another potential blocker
for 4.0.0-alpha-1.


Best,
Stamatis

[1] https://issues.apache.org/jira/browse/HIVE-26022

On Thu, Mar 3, 2022 at 3:54 PM Peter Vary <pv...@cloudera.com> wrote:

> Hi Team,
>
> Here is our status:
> We collected the blocker tickets and marked them with fixVersion
> 4.0.0-alpha-1:
>
> https://issues.apache.org/jira/issues/?filter=-1&jql=project%20%3D%20HIVE%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%204.0.0-alpha-1
>
>    - HIVE-26002 - Create db scripts for 4.0.0-alpha-1
>    - HIVE-25994 - Analyze table runs into ClassNotFoundException-s in
>    case binary distribution is used
>    - HIVE-25935 - Cleanup IMetaStoreClient#getPartitionsByNames APIs
>
> Please create a jira and mark it with fixVersion 4.0.0-alpha-1, if you
> happen to know of any other blockers.
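
For anyone who wants to script this check, here is a minimal sketch (not from the thread) of how the filter link above can be built from a fixVersion. The function name `blocker_filter_url` and the constant `JIRA_SEARCH` are made up for illustration, and the `filter=-1` parameter of the original link is left out on the assumption that it is not needed.

```python
from urllib.parse import quote, urlencode

# Base of the Jira issue search page used by the links in this thread.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def blocker_filter_url(fix_version: str, project: str = "HIVE") -> str:
    """Return a search URL for unresolved tickets with the given fixVersion.

    Hypothetical helper: the JQL text mirrors the filter quoted above.
    """
    jql = (
        f"project = {project} AND resolution = Unresolved "
        f"AND fixVersion = {fix_version}"
    )
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return JIRA_SEARCH + "?" + urlencode({"jql": jql}, quote_via=quote)


if __name__ == "__main__":
    print(blocker_filter_url("4.0.0-alpha-1"))
```

Changing the argument (for example to a later alpha) gives the matching filter without hand-editing the percent-encoded query.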
>
> We plan to fix these jiras, and then release the following artifacts
> together:
>
>    - Storage API - 4.0.0-alpha-1
>    - Standalone Metastore - 4.0.0-alpha-1
>    - Hive - 4.0.0-alpha-1
>
>
> Thanks,
> Peter
>
>
> On 2022. Mar 2., at 11:50, Peter Vary <pv...@cloudera.com> wrote:
>
> We will continue this discussion on the #hive ASF Slack. If you are
> interested, please join.
> We will post updates here from time to time, so those who are not using
> Slack can participate that way.
>
> On 2022. Mar 2., at 11:11, Peter Vary <pv...@cloudera.com> wrote:
>
> Good idea Zoltan, joined the channel.
> I would like to scope reasonably small, so I agree with focusing on
> 4.0.0-alpha-1
>
> On 2022. Mar 2., at 11:01, Zoltan Haindrich <k...@rxd.hu> wrote:
>
> Hey,
>
> regarding 4.0.0 / 4.0.0-alpha-1 target/fix versions in the jira:
> * I think we should change all already resolved tickets with fix version
> 4.0.0 to have fix version 4.0.0-alpha-1
> ** this could be postponed until we are actually releasing the thing as I
> think everyone committing to the master is entering 4.0.0 as fix version
> without much afterthought...this could probably change after we get the
> first release out.
> * regarding the existing tickets with fix version/target version 4.0.0
> - I think that would be a bit too much (>200 tickets)
> ** some numbers:
> *** 239 tickets open now
> *** 224 were not updated in the last 90 days
> *** 216 were not updated in the last 180 days
> *** 178 were not updated in the last 360 days
> ** as a matter of fact I think many of these tickets shouldn't even have a
> target or fix version - and most of them should be unassigned...I don't
> want to get lost in this right now...I think for now we should keep the
> scope small and only care about 4.0.0-alpha-1 tickets
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20hive%20and%20resolutiondate%20%20is%20empty%20and%20(fixVersion%20%20in%20(%274.0.0%27)%20or%20cf%5B12310320%5D%20%20in%20(%274.0.0%27))
>
> I think for faster communication regarding these things we could also
> utilize the #hive channel on the ASF slack - what do you guys think?
>
> cheers,
> Zoltan
>
> On 3/2/22 9:51 AM, Stamatis Zampetakis wrote:
>
> Agree with Peter, creating JIRAs is the way to go.
> Putting the appropriate priority (e.g., BLOCKER) and version (4.0.0 or
> 4.0.0-alpha-1) when creating the JIRA should be enough to keep us on track.
> I am mentioning both 4.0.0 and 4.0.0-alpha-1 because eventually I think we
> are gonna move everything with target 4.0.0 to 4.0.0-alpha-1.
>
> On Wed, Mar 2, 2022 at 9:37 AM Peter Vary <pv...@cloudera.com.invalid> wrote:
>
> Hi Team,
>
> Could we create tickets for the issues?
> I think it would be good to collect the issues/potential blockers in the
> jira instead of having a complicated mail thread.
>
> If we set the target version to 4.0.0-alpha-1, then we can easily use the
> following filter to see the status of the tasks:
>
>
> https://issues.apache.org/jira/issues/?jql=project%3D%22HIVE%22%20AND%20%22Target%20Version%2Fs%22%3D%224.0.0-alpha-1%22
>
> @Stamatis: Sadly I have missed your letter/jira and created my own with
> the fix for building from the src package:
> https://issues.apache.org/jira/browse/HIVE-25997
> If you have time, I would like to ask you to review.
>
> If anyone knows of any blocker I would like to ask them to create a jira
> for that too.
>
> Thanks,
> Peter
>
>
> On 2022. Mar 2., at 7:04, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
>
> Hello Alessandro,
>
> For the latest commit, loading ORC tables fails (with the log message
> shown below). Let me try to find a commit that introduces this bug and
> create a JIRA ticket.
>
>
> --- Sungwoo
>
> 2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats task
> java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
>     at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)
>     at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-10000/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
>     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)
>     ... 7 more
>
> On Tue, 1 Mar 2022, Alessandro Solimando wrote:
>
> Hi Sungwoo,
> The last time I tried to run a TPC-DS-based benchmark I stumbled upon a
> similar situation. I eventually found that statistics were not computed,
> so CBO was not kicking in, and the automatic retry runs with CBO off,
> which was failing for something like 10 queries (subqueries that cannot
> be decorrelated, but also some runtime errors).
>
> Making sure that (column) statistics were correctly computed fixed the
> problem.
>
> Can you check if this is the case for you?
>
> HTH,
> Alessandro
>
> On Tue, 1 Mar 2022 at 15:28, POSTECH CT <c...@pl.postech.ac.kr> wrote:
>
> Hello Hive team,
>
> I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
> the master branch recently.  We occasionally run TPC-DS system tests
> using the master branch, and the tests don't succeed completely. Here
> is how our TPC-DS tests proceed.
>
> 1. Compile and run Hive on Tez (not Hive-LLAP)
> 2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
> 3. Run 99 TPC-DS queries which were slightly modified to return
> varying number of rows (rather than 100 rows)
> 4. Compare the results against the previous results
>
> The previous results were obtained and cross-checked by running Hive
> 3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
> correctness.
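
Step 4 above (comparing against previous results) could look roughly like the sketch below. It is only an illustration under assumptions not stated in the thread: each query result is already loaded as a list of rows of strings, and row order is ignored, which would not be appropriate for queries whose ordering is part of the expected result. The name `diff_results` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Sequence[str]


def diff_results(expected: Iterable[Row],
                 actual: Iterable[Row]) -> Dict[str, List[Tuple[str, ...]]]:
    """Compare two query results as multisets of rows (order is ignored).

    Returns rows that are missing from `actual` and rows that appear in
    `actual` but not in `expected`; duplicates are counted.
    """
    exp = Counter(tuple(row) for row in expected)
    act = Counter(tuple(row) for row in actual)
    return {
        "missing": sorted((exp - act).elements()),
        "unexpected": sorted((act - exp).elements()),
    }


if __name__ == "__main__":
    baseline = [("1", "a"), ("2", "b")]
    current = [("2", "b"), ("3", "c")]
    print(diff_results(baseline, current))
```

A query would count as passing when both lists are empty. Comparing rows as strings also sidesteps, rather than solves, small floating-point differences between engines.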
>
> For the latest commit in the master branch, step 2 fails. For earlier
> commits (for example, commits in February 2021), step 3 fails where
> several queries either fail or return wrong results.
>
> We can compile and report the test results in this mailing list, but
> would like to know if similar results have been reproduced by the Hive
> team, in order to make sure that we did not make errors in our tests.
>
> If it is okay to open a JIRA ticket that only reports failures in the
> TPC-DS test, we could also perform git bisect to locate the commit
> that began to generate wrong results.
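
The idea behind the bisect mentioned above is a binary search over the commit history. The sketch below shows that search in isolation; it does not call git, and `first_bad_commit` and `is_bad` are hypothetical names. In practice `is_bad` would stand for "build this commit and run the TPC-DS check". It assumes the oldest commit in the list is known good, the newest is known bad, and that every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in `commits` (ordered oldest to newest).

    Assumes commits[0] is good and commits[-1] is bad, so the list must
    contain at least two entries.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


if __name__ == "__main__":
    history = [f"c{i}" for i in range(10)]       # placeholder commit ids
    print(first_bad_commit(history, lambda c: int(c[1:]) >= 6))
```

The number of builds needed grows with the logarithm of the number of commits, which is what makes this practical even when each test run takes hours.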
>
> --- Sungwoo Park
>
> On Tue, 1 Mar 2022, Zoltan Haindrich wrote:
>
> Hey,
>
> Great to hear that we are on the same side regarding these things :)
>
> For around a week now - we have nightly builds for the master branch:
> http://ci.hive.apache.org/job/hive-nightly/12/
>
> I think we have 1 blocker issue:
> https://issues.apache.org/jira/browse/HIVE-25665
>
> I know about one more thing I would rather get fixed before we release it:
> https://issues.apache.org/jira/browse/HIVE-25994
> The best would be to introduce smoke tests (HIVE-22302) to ensure that
> something like this will not happen in the future - but we should probably
> start moving forward.
>
> I think we could call the first iteration of this "4.0.0-alpha-1" :)
>
> I've added 4.0.0-alpha-1 as a version - and added the above two tickets
> to it.
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%204.0.0-alpha-1
>
>
> Are there any more things you guys know which would be needed?
>
> cheers,
> Zoltan
>
>
> On 2/22/22 12:18 PM, Peter Vary wrote:
>
> I would vote for 4.0.0-alpha-1 or similar for all of the components.
>
> When we have more stable releases I would keep the 4.x.x scheme, since
> everyone is familiar with it, and I do not see a really good reason to
> change it.
>
> Thanks,
> Peter
>
>
> On 2022. Feb 10., at 3:34, Szehon Ho <szehon.apa...@gmail.com> wrote:
>
> +1 that would be awesome to see Hive master released after so long.
>
> Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would pick
> any 3.x or calendar date (which could tend to slip and be more
> confusing?).
>
> Thanks in any case to get the ball rolling.
> Szehon
>
> On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich <k...@rxd.hu> wrote:
>
> Hey,
>
> Thank you guys for chiming in; versioning is for sure something we should
> get to some common ground.
> It's a triple problem right now; I think we have the following things:
>
> * storage-api
> ** we have "2.7.3-SNAPSHOT" in the repo
> *** https://github.com/apache/hive/blob/0d1cffffc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
> ** meanwhile we already have 2.8.1 released to maven central
> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
>
> * standalone-metastore
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
> * hive
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
>
> Regarding the actual version number I'm not entirely sure where we should
> start the numbering - that's why I was referring to it as Hive-X in my
> first letter.
>
> I think the key point here would be to start shipping releases regularly
> and not the actual version number we will use - I'm kinda open to any
> versioning scheme which reflects that this is a newer release than 3.1.2.
>
> I could imagine the following ones:
> (A) start with something less expected; but keep 3 in the prefix to
> reflect that this is not yet 4.0
>   I can imagine the following numbers:
>   3.900.0, 3.901.0, ...
>   3.9.0, 3.9.1, ...
> (B) start 4.0.0
>   4.0.0, 4.1.0, ...
> (C) jump to some calendar based version number like 2022.2.9
>   trunk based development has pros and cons...making a move like this
>   irreversibly pledges trunk based development; and makes release branches
>   hard to introduce
> (X) somewhat orthogonal is to (also) use some suffixes
>   4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
>   this is probably the most tempting to use - but this versioning
>   scheme with a non-changing MINOR and PATCH number will
>   also suggest that the actual software is fully compatible - and only
>   bugs are being fixed - which will not be true...
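
To make the ordering implied by options (A), (B) and (X) concrete, here is a simplified sketch of a sort key for such version strings. It is an assumption-laden illustration, not Maven's actual comparison algorithm: it only accepts `MAJOR.MINOR.PATCH` with an optional `-alphaN`, `-alpha-N`, `-betaN` or `-beta-N` suffix, and it ranks alpha below beta below the unsuffixed release. The names `version_key` and `_QUALIFIER_RANK` are made up.

```python
import re

# Assumed ranking for this sketch: alpha < beta < final (no suffix).
_QUALIFIER_RANK = {"alpha": 0, "beta": 1, "": 2}

_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta)-?(\d+))?")


def version_key(version: str) -> tuple:
    """Return a tuple that sorts versions in release order (simplified)."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version}")
    major, minor, patch, qualifier, number = match.groups()
    return (int(major), int(minor), int(patch),
            _QUALIFIER_RANK[qualifier or ""], int(number or 0))


if __name__ == "__main__":
    candidates = ["4.0.0", "3.1.2", "4.0.0-beta1", "4.0.0-alpha2",
                  "4.0.0-alpha1", "3.900.0"]
    print(sorted(candidates, key=version_key))
```

Under these assumptions every candidate scheme sorts after 3.1.2, and the alpha/beta builds sort before the final 4.0.0, which is the property the thread cares about.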
>
> I really like the idea to suffix these releases with alpha or beta - which
> will communicate our level of commitment: that these are not 100%
> production ready artifacts.
>
> I think we could fix HIVE-25665; and probably experiment with
> 4.0.0-alpha1 for a start...
>
> > This also means there should *not* be a branch-4 after releasing Hive 4.0
> > and let that diverge (and becomes the next, super-ignored branch-3),
>
> correct; no need to keep a branch we don't maintain...but in any case I
> think we can postpone this decision until there will be something to
> release... :)
>
> cheers,
> Zoltan
>
>
>
> On 2/9/22 10:23 AM, László Bodor wrote:
>
> Hi All!
>
> A purely technical question: what will the SNAPSHOT version become after
> releasing Hive 4.0.0? I think this is important, as it defines and reflects
> the future release plans.
>
> Currently, it's 4.0.0-SNAPSHOT, I guess it's since Hive 3.0 + branch-3.
>
> Hive is an evolving and super-active project: if we want to make regular
> releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT,
> which clearly says that we can release Hive 4.1 anytime we want, without
> being frustrated about "whether we included enough cool stuff to release
> 5.0".
>
> This also means there should *not* be a branch-4 after releasing Hive 4.0
> and let that diverge (and becomes the next, super-ignored branch-3), only
> when we end up bringing a minor backward-incompatible thing that needs a
> 4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a
> branch called *branch-4.0* doesn't imply either I can expect cool releases
> in the future from there or the branch is maintained and tries to be in
> sync with the *master*.
>
> Regards,
> Laszlo Bodor
>
> On Tue, Feb 8, 2022 at 16:42, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>
> Hello everyone,
> thank you for starting this discussion.
>
> I agree that releasing the master branch regularly and sufficiently often
> is welcome and vital for the health of the community.
>
> It would be great to hear from others too, especially PMC members and
> committers, but even simple contributors/followers as myself.
>
> Best regards,
> Alessandro
>
> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis <zabe...@gmail.com> wrote:
>
> Hello,
>
> Thanks for starting the discussion Zoltan.
>
> I strongly believe that it is important to have regular and frequent
> releases, otherwise people will create and maintain separate Hive forks.
> The latter is not good for the project and the community may lose valuable
> members because of it.
>
> Going forward I fully agree that there is no point bringing up strong
> blockers for the next release. For sure there are many backward
> incompatible changes and possibly unstable features but unless we get a
> release out it will be difficult to determine what is broken and what needs
> to be fixed.
>
> Due to the big number of changes that are going to appear in the next
> version I would suggest using the terms Hive X-alpha, Hive X-beta for the
> first few releases. This will make it clear to the end users that they need
> to be careful when upgrading from an older version and it will give us a
> bit more time and freedom to treat issues that the users will likely
> discover.
>
> The only real blocker that we may want to treat is HIVE-25665 [1] but we
> can continue the discussion under that ticket and re-evaluate if necessary.
>
> Best,
> Stamatis
>
> [1] https://issues.apache.org/jira/browse/HIVE-25665
>
>
> On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich <k...@rxd.hu> wrote:
>
>
> Hey All,
>
> We haven't made a release for a long time now (3.1.2 was released on 26
> August 2019) - and I think because we didn't make that many branch-3
> releases, not too many fixes were ported there - which made that release
> branch kinda erode away.
>
>
> We have a lot of new features/changes in the current master.
> I think instead of aiming for big feature-packed releases we should aim
> for making a regular release every few months - we should make regular
> releases which people could install and use.
> After all, releasing Hive after more than 2 years would be a big step
> forward in itself - we have so many improvements that I can't even
> count...
>
>
> But I may not know every aspect of the project / the state of some
> internal features - so I would like to ask you:
> What would be the bare minimum requirements before we could release the
> current master as Hive X?
>
> There are many nice-to-have-s like:
> * hadoop upgrade
> * jdk11
> * remove HoS or MR
> * ?
> but I don't think these are blockers...we can make any of these in the
> next release if we start making them...
>
> cheers,
> Zoltan
>
>
>
