> This leads to an IncompatibleClassChangeError when you have a Function or
> a Connector that is using Schema.JSON(Pojo.class)
I just noticed this detail. Do we have a sense of how often people are using
Schema.JSON in Functions/Connectors? Most of our functions are using a string
schema, so it's not clear to me if they would be impacted.

Devin G. Bost

On Mon, Jul 19, 2021 at 12:41 PM Devin Bost <devin.b...@gmail.com> wrote:

> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > where each function/source/sink runs as an independent statefulset in K8s.
> > In this scenario, it is possible to have fine grained control over which
> > version of the function container the function is using.
>
> Not everybody is using the KubernetesRuntime yet (especially since the
> Helm charts aren't feature-complete), and it appears that those who aren't
> running KubernetesRuntime would be impacted the most by this issue.
>
> Devin G. Bost
>
>
> On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <devin.b...@gmail.com> wrote:
>
>> > For example, if you are upgrading Flink from one version to the other
>> > version, you have to make a save point in the previous version for all
>> > the Flink jobs.
>> > Upgrade the Flink cluster and resume jobs in a new version.
>> >
>> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
>> >
>> > So it is not unreasonable for asking people to do that when dealing
>> > with upgrading a centralized computing engine.
>>
>> One difference with Flink is that organizations running Flink in job mode
>> or application mode can upgrade jobs independently of one another, so teams
>> can upgrade jobs when they are ready without impacting other teams. In the
>> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
>> break every tenant simultaneously and would block the flow of all messages
>> until all functions are upgraded. If one team takes a year to upgrade their
>> one function, the cluster could not be upgraded until that happened.
>> Also, after all the functions have been upgraded, there would be production
>> downtime while deploying all the upgraded functions, which would be a major
>> outage... It might be possible to write a script to speed up the deployment
>> to shrink the outage window, but there's currently a bug that wipes out
>> existing userConfigs when a function is upgraded, so that adds to the
>> complexity of upgrading all the functions since someone would need to know
>> all the userConfigs for all the functions.
>>
>> So, I don't think we're really comparing the same things here.
>>
>> Devin G. Bost
>>
>>
>> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <guosi...@gmail.com> wrote:
>>
>>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>> >
>>> > I agree that the best we can do right now is to just clearly document this
>>> > as a potential problem when updating 2.7 to 2.8.
>>> >
>>> > We should definitely make every attempt to not make BC breaking changes.
>>> > However, there are times when we have to make these tough decisions for one
>>> > reason or another. The bigger problem I see here is not necessarily a BC
>>> > breaking change occurred, but rather we didn't know about it beforehand so
>>> > we can clearly document this caveat when 2.8 is released. Perhaps this is
>>> > where we can improve our backwards compatibility testing. We already have
>>> > some but probably not enough as highlighted by this case.
>>> >
>>> > In regards to
>>> >
>>> > > This is partially correct, because you can wait to upgrade the workers pod,
>>> > > but there is no fine grained control over which version of each pod will
>>> > > be running your function, especially in a big cluster with many tenants and
>>> > > functions with this problem
>>> > >
>>> >
>>> >
>>> > I think Sijie is referring to using KubernetesRuntime to deploy functions
>>> > where each function/source/sink runs as an independent statefulset in K8s.
>>> > In this scenario, it is possible to have fine grained control over which
>>> > version of the function container the function is using. There currently
>>> > might not be tools to easily allow users to do this but using kubectl one
>>> > can definitely determine which container version is running and potentially
>>> > update the container version on a per function basis.
>>>
>>> Jerry - Thank you! That was what I meant.
>>>
>>> >
>>> > Best,
>>> >
>>> > Jerry
>>> >
>>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>> >
>>> > > Sijie,
>>> > > Thank you for your feedback
>>> > > Some additional considerations inline
>>> > >
>>> > > On Mon, Jul 19, 2021, 06:47 Sijie Guo <guosi...@gmail.com> wrote:
>>> > >
>>> > > > I don't think this is a big problem. Because people can recompile the
>>> > > > function and submit the function. Most of the computing/streaming
>>> > > > engines ask users to recompile the jobs and resubmit the jobs when it
>>> > > > upgrades to a new version.
>>> > >
>>> > >
>>> > > Unfortunately this is not easily feasible if the org that is managing the
>>> > > Pulsar service is different from the org who is developing the Functions.
>>> > > And especially it is quite impossible to prevent service interruption.
>>> > >
>>> > > BTW I believe that there is no way to fix this at this point.
>>> > >
>>> > > > The best approach here is to document this
>>> > > > behavior.
>>> > > >
>>> > >
>>> > > I agree that the best thing we can do is to document this requirement.
>>> > >
>>> > > Therefore we must ensure in the future that we won't fall again into this
>>> > > kind of issues.
>>> > >
>>> > > Pulsar is becoming more and more used by large enterprises and backward
>>> > > compatibility is a big value.
>>> > >
>>> > > Fortunately not all the Functions need rebuilding.
>>> > >
>>> > >
>>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
>>> > > > are not really impacted.
>>> > > >
>>> > >
>>> > > This is partially correct, because you can wait to upgrade the workers pod,
>>> > > but there is no fine grained control over which version of each pod will
>>> > > be running your function, especially in a big cluster with many tenants and
>>> > > functions with this problem
>>> > >
>>> > >
>>> > > Enrico
>>> > >
>>> > >
>>> > > > - Sijie
>>> > > >
>>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>> > > > >
>>> > > > > Hello,
>>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to 2.8.
>>> > > > > More information is on the ticket, but the short version of the story is
>>> > > > > that in Pulsar 2.8 we introduced a breaking change in the Schema API, by
>>> > > > > switching SchemaInfo from a class to an interface.
>>> > > > >
>>> > > > > This leads to an IncompatibleClassChangeError when you have a Function or
>>> > > > > a Connector that is using Schema.JSON(Pojo.class) and you upgrade your
>>> > > > > Pulsar cluster (the functions worker pod for instance) from Pulsar 2.7.x to
>>> > > > > Pulsar 2.8.0.
>>> > > > >
>>> > > > > The bad problem is that you cannot upgrade Pulsar without interrupting the
>>> > > > > service and coordinating with the upgrade of the Functions.
>>> > > > > Your functions need to be recompiled against the Pulsar 2.8 API and
>>> > > > > deployed again in production.
>>> > > > >
>>> > > > > I have tried to move back SchemaInfo to an "abstract class" but without
>>> > > > > success, because then you fall into errors.
>>> > > > >
>>> > > > > I am not sure there is a way to provide a good "upgrade path" for
>>> > > > > Functions/IO users.
>>> > > > >
>>> > > > > If we do not find a way we have to document the upgrade in the official
>>> > > > > Pulsar Documentation.
>>> > > > >
>>> > > > > We must do our best to prevent users from falling again into this bad
>>> > > > > situation.
>>> > > > >
>>> > > > > Any suggestions or thoughts ?
>>> > > > >
>>> > > > > Regards
>>> > > > > Enrico
>>> > > > >
>>> > > > > [1] https://github.com/apache/pulsar/issues/11338
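
For readers following the thread, the shape of the change described in the original report (a type switching from a class to an interface) can be sketched in plain Java. This is only an illustration under stated assumptions, not Pulsar code: InfoAsClass and InfoAsInterface are hypothetical stand-ins for SchemaInfo before and after 2.8, and the two describe() methods stand in for a Function compiled against each version. The source of both callers is identical, which is why recompiling is enough to fix things, but javac emits a different call instruction for each, which is why an already-compiled jar is expected to fail with IncompatibleClassChangeError once the type it was built against becomes an interface.

```java
public class ClassToInterfaceSketch {

    // Hypothetical "before" shape: the type is a concrete class.
    static class InfoAsClass {
        private final String name;

        InfoAsClass(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }
    }

    // Hypothetical "after" shape: same method, but the type is an interface.
    interface InfoAsInterface {
        String getName();
    }

    // Caller built against the class: javac compiles this call to invokevirtual.
    static String describe(InfoAsClass info) {
        return "class:" + info.getName();
    }

    // Caller built against the interface: javac compiles this call to invokeinterface.
    // The source line is the same as above; only the bytecode differs.
    static String describe(InfoAsInterface info) {
        return "interface:" + info.getName();
    }

    public static void main(String[] args) {
        InfoAsClass before = new InfoAsClass("pojo");
        InfoAsInterface after = () -> "pojo";

        System.out.println(describe(before));
        System.out.println(describe(after));

        // The runtime kind of the type is what the JVM checks when it
        // resolves a call site that was compiled earlier.
        System.out.println(InfoAsClass.class.isInterface());
        System.out.println(InfoAsInterface.class.isInterface());
    }
}
```

The sketch keeps both shapes side by side in one file, so it runs without error; the failure discussed in the thread only appears when an old jar meets the new type at runtime. You can inspect the differing call instructions with javap -c on the compiled class.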