Hello,

Which Ignite version do you use? Please share the exception details that follow "Exception during start processors, node will be stopped and close connections" (the log should state the reason why the page delta can't be applied).
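
In case it helps with collecting those details, here is a minimal sketch (plain Python, not part of Ignite) that prints the marker line and whatever follows it in a node log. The function name `lines_after_marker`, the `limit` default and the log path passed on the command line are all hypothetical; it only assumes the marker text appears on a single line in the actual log file.

```python
import sys
from typing import Iterable, List

# Marker text taken from the log excerpt in this thread.
MARKER = "Exception during start processors"


def lines_after_marker(lines: Iterable[str],
                       marker: str = MARKER,
                       limit: int = 200) -> List[str]:
    """Return the first line containing `marker` plus up to `limit` lines after it."""
    collected: List[str] = []
    found = False
    for line in lines:
        if not found and marker in line:
            found = True
        if found:
            collected.append(line.rstrip("\n"))
            # Stop once we have the marker line plus `limit` following lines.
            if len(collected) > limit:
                break
    return collected


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python extract.py /path/to/ignite-node.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        print("\n".join(lines_after_marker(log_file)))
```

The exception class, message and stack trace are normally logged right after that marker, so the output of this script is the part worth pasting into the reply.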

On Tue, 18 Jul 2023 at 05:05, Raymond Wilson <raymond_wil...@trimble.com> wrote:

> Hi,
>
> We run a dev/alpha stack of our application in Azure Kubernetes.
> Persistent storage is contained in Azure Files NAS storage volumes, one per
> server node.
>
> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
> update various pods were stopped and restarted as is normal for an update.
> This included nodes running the dev/alpha stack.
>
> At least one node (of a cluster of four server nodes in the cluster)
> failed to restart after the update, with the following logging:
>
>     2023-07-18 01:23:55.171 [1] INF Restoring checkpoint after logical
> recovery, will start physical recovery from back pointer: WALPointer
> [idx=2431, fileOff=209031823, len=29]
>     2023-07-18 01:23:55.205 [28] ERR Failed to apply page delta.
> rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057,
> pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord
> [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41,
> chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41],
> type=PAGES_LIST_REMOVE_PAGE]]]]
>     2023-07-18 01:23:55.217 [1] INF Cleanup cache stores [total=0, left=0,
> cleanFiles=false]
>     2023-07-18 01:23:55.218 [1] ERR Got exception while starting (will
> rollback startup routine).
>     2023-07-18 01:23:55.218 [1] ERR Exception during start processors,
> node will be stopped and close connections
>
> I know Apache Ignite is very good at surviving 'Big Red Switch' scenarios,
> and we have our data regions configured with the strictest update protocol
> (full sync after each write), however it's possible the NAS implementation
> does something different!
>
> I think if we delete the WAL files from the nodes that won't restart then
> the node may be happy, though we will lose any updates since the last
> checkpoint (but then, it has low use and checkpoints are every 30-45
> seconds or so, so this won't be significant).
>
> Is this an error anyone else has noticed?
> Has anyone else had similar issues with Azure Files when using strict
> update/sync semantics?
>
> Thanks,
> Raymond.
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>