> I think Sijie is referring to using KubernetesRuntime to deploy functions where each function/source/sink runs as an independent statefulset in K8s. In this scenario, it is possible to have fine grained control over which version of the function container the function is using.
Not everybody is using the KubernetesRuntime yet (especially since the Helm charts aren't feature-complete), and it appears that those who aren't running KubernetesRuntime would be impacted the most by this issue.

Devin G. Bost

On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <devin.b...@gmail.com> wrote:

> > For example, if you are upgrading Flink from one version to another, you have to take a savepoint in the previous version for all the Flink jobs, upgrade the Flink cluster, and resume the jobs in the new version.
> >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
> >
> > So it is not unreasonable to ask people to do that when upgrading a centralized computing engine.
>
> One difference with Flink is that organizations running Flink in job mode or application mode can upgrade jobs independently of one another, so teams can upgrade jobs when they are ready without impacting other teams. In the Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would break every tenant simultaneously and would block the flow of all messages until all functions are upgraded. If one team takes a year to upgrade their one function, the cluster could not be upgraded until that happened. Also, after all the functions have been upgraded, there would be production downtime while deploying all the upgraded functions, which would be a major outage. It might be possible to write a script to speed up the deployment and shrink the outage window, but there's currently a bug that wipes out existing userConfigs when a function is upgraded, which adds to the complexity of upgrading all the functions, since someone would need to know the userConfigs of every function.
>
> So, I don't think we're really comparing the same things here.
>
> Devin G. Bost
>
> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <guosi...@gmail.com> wrote:
>
>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>> >
>> > I agree that the best we can do right now is to just clearly document this as a potential problem when upgrading from 2.7 to 2.8.
>> >
>> > We should definitely make every attempt not to make BC-breaking changes. However, there are times when we have to make these tough decisions for one reason or another. The bigger problem I see here is not necessarily that a BC-breaking change occurred, but rather that we didn't know about it beforehand, so that we could clearly document this caveat when 2.8 was released. Perhaps this is where we can improve our backwards compatibility testing. We already have some, but probably not enough, as highlighted by this case.
>> >
>> > Regarding
>> >
>> > > This is partially correct, because you can wait to upgrade the workers pod, but there is no fine grained control over which version of each pod will be running your function, especially in a big cluster with many tenants and functions with this problem
>> >
>> > I think Sijie is referring to using KubernetesRuntime to deploy functions where each function/source/sink runs as an independent statefulset in K8s. In this scenario, it is possible to have fine grained control over which version of the function container the function is using. There might not currently be tools that make this easy, but using kubectl, one can definitely determine which container version is running and potentially update the container version on a per-function basis.
>>
>> Jerry - Thank you! That was what I meant.
>>
>> >
>> > Best,
>> >
>> > Jerry
>> >
>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>> >
>> > > Sijie,
>> > > Thank you for your feedback.
>> > > Some additional considerations inline.
>> > >
>> > > On Mon, Jul 19, 2021, 06:47 Sijie Guo <guosi...@gmail.com> wrote:
>> > >
>> > > > I don't think this is a big problem, because people can recompile the function and submit the function. Most of the computing/streaming engines ask users to recompile the jobs and resubmit them when they upgrade to a new version.
>> > >
>> > > Unfortunately, this is not easily feasible if the org that manages the Pulsar service is different from the org that develops the Functions. And in particular, it is nearly impossible to avoid a service interruption.
>> > >
>> > > BTW, I believe that there is no way to fix this at this point.
>> > >
>> > > > The best approach here is to document this behavior.
>> > >
>> > > I agree that the best thing we can do is to document this requirement.
>> > >
>> > > Going forward, we must ensure that we don't fall into this kind of issue again.
>> > >
>> > > Pulsar is increasingly used by large enterprises, and backward compatibility is very valuable.
>> > >
>> > > Fortunately, not all Functions need rebuilding.
>> > >
>> > > > Also, if you are using the Kubernetes runtime to schedule functions, you are not really impacted.
>> > > >
>> > > This is partially correct, because you can wait to upgrade the workers pod, but there is no fine grained control over which version of each pod will be running your function, especially in a big cluster with many tenants and functions with this problem.
>> > >
>> > > Enrico
>> > >
>> > > > - Sijie
>> > > >
>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>> > > > >
>> > > > > Hello,
>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to 2.8.
>> > > > > More information is on the ticket, but the short version of the story is that in Pulsar 2.8 we introduced a breaking change in the Schema API by switching SchemaInfo from a class to an interface.
>> > > > >
>> > > > > This leads to an IncompatibleClassChangeError when you have a Function or a Connector that is using Schema.JSON(Pojo.class) and you upgrade your Pulsar cluster (the functions worker pod, for instance) from Pulsar 2.7.x to Pulsar 2.8.0.
>> > > > >
>> > > > > The real problem is that you cannot upgrade Pulsar without interrupting the service and coordinating with the upgrade of the Functions. Your functions need to be recompiled against the Pulsar 2.8 API and deployed again in production.
>> > > > >
>> > > > > I have tried to turn SchemaInfo back into an "abstract class", but without success, because then you run into other errors.
>> > > > >
>> > > > > I am not sure there is a way to provide a good "upgrade path" for Functions/IO users.
>> > > > >
>> > > > > If we do not find a way, we have to document the upgrade in the official Pulsar Documentation.
>> > > > >
>> > > > > We must do our best to prevent users from falling into this bad situation again.
>> > > > >
>> > > > > Any suggestions or thoughts?
>> > > > >
>> > > > > Regards
>> > > > > Enrico
>> > > > >
>> > > > > [1] https://github.com/apache/pulsar/issues/11338
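For readers less familiar with this class of failure, here is a minimal, self-contained Java sketch of the binary break Enrico describes: code compiled while a type is a class stops linking once that type becomes an interface. It is a hypothetical stand-in, not Pulsar code. The names api.SchemaInfo, api.Factory and app.Client are invented for illustration, and the sketch assumes JDK 11 or later, because it compiles both API versions in-process through javax.tools. The same client source is compiled once against each API version and then run against both.

```java
import java.lang.reflect.InvocationTargetException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical reproduction of a "class became an interface" binary break.
// api.SchemaInfo, api.Factory and app.Client are invented stand-ins, not Pulsar classes.
public class ClassToInterfaceDemo {
    // Old API: SchemaInfo is a concrete class.
    static final String CLASS_API =
        "package api; public class SchemaInfo { public String getName() { return \"class\"; } }";
    static final String CLASS_FACTORY =
        "package api; public class Factory { public static SchemaInfo create() { return new SchemaInfo(); } }";
    // New API: SchemaInfo is an interface with the same method.
    static final String INTERFACE_API =
        "package api; public interface SchemaInfo { String getName(); }";
    static final String INTERFACE_FACTORY =
        "package api; public class Factory { public static SchemaInfo create() { return () -> \"interface\"; } }";
    // User code (the "Function"): identical source, compiled against one API version at a time.
    static final String CLIENT =
        "package app; public class Client { public static String run() { return api.Factory.create().getName(); } }";

    static Path write(Path dir, String relPath, String source) throws Exception {
        Path file = dir.resolve(relPath);
        Files.createDirectories(file.getParent());
        Files.writeString(file, source);
        return file;
    }

    // Compiles the sources into outDir with classPath on the compile-time class path.
    static Path compile(Path outDir, String classPath, Path... sources) throws Exception {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("a JDK (not just a JRE) is required");
        }
        Files.createDirectories(outDir);
        String[] args = new String[sources.length + 4];
        args[0] = "-d";
        args[1] = outDir.toString();
        args[2] = "-cp";
        args[3] = classPath;
        for (int i = 0; i < sources.length; i++) {
            args[i + 4] = sources[i].toString();
        }
        if (javac.run(null, null, null, args) != 0) {
            throw new IllegalStateException("compilation failed");
        }
        return outDir;
    }

    // Builds one version of the API and returns the directory holding its .class files.
    static Path buildApi(Path root, String name, String schemaInfo, String factory) throws Exception {
        Path src = root.resolve(name + "-src");
        Path a = write(src, "api/SchemaInfo.java", schemaInfo);
        Path b = write(src, "api/Factory.java", factory);
        return compile(root.resolve(name), src.toString(), a, b);
    }

    // Loads app.Client plus one API version in a fresh class loader and calls Client.run().
    // Returns run()'s result, or the simple name of the error it threw.
    static String runClient(Path clientDir, Path apiDir) throws Exception {
        URL[] urls = {clientDir.toUri().toURL(), apiDir.toUri().toURL()};
        try (URLClassLoader loader = new URLClassLoader(urls, ClassLoader.getPlatformClassLoader())) {
            Class<?> client = Class.forName("app.Client", true, loader);
            try {
                return (String) client.getMethod("run").invoke(null);
            } catch (InvocationTargetException e) {
                // Linkage errors raised inside run() arrive wrapped by reflection.
                return e.getCause().getClass().getSimpleName();
            }
        }
    }

    // Returns the results of the four build/run combinations.
    public static String[] demo() throws Exception {
        Path root = Files.createTempDirectory("class-to-interface");
        Path classApi = buildApi(root, "class-api", CLASS_API, CLASS_FACTORY);
        Path ifaceApi = buildApi(root, "iface-api", INTERFACE_API, INTERFACE_FACTORY);
        Path clientSrc = write(root.resolve("client-src"), "app/Client.java", CLIENT);
        Path builtVsClass = compile(root.resolve("client-vs-class"), classApi.toString(), clientSrc);
        Path builtVsIface = compile(root.resolve("client-vs-iface"), ifaceApi.toString(), clientSrc);
        return new String[] {
            runClient(builtVsClass, classApi), // built and run against the class API
            runClient(builtVsClass, ifaceApi), // built against the class, run against the interface
            runClient(builtVsIface, ifaceApi), // rebuilt against the interface API
            runClient(builtVsIface, classApi), // built against the interface, run against the class
        };
    }

    public static void main(String[] args) throws Exception {
        for (String result : demo()) {
            System.out.println(result);
        }
    }
}
```

It can be run with `java ClassToInterfaceDemo.java`. In this sketch the two mismatched combinations fail with IncompatibleClassChangeError while the matched ones return normally: a call compiled against a class is linked through a class method reference, which the JVM refuses to resolve against an interface, and a call compiled against an interface fails the same way against a class. That is why recompiling the Function against the new API works, and why, at least in this simplified model, turning the type back into an abstract class would break code already compiled against the interface.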