Re: Possible WAL corruption on running system during K8s update

Alex Plehanov Tue, 18 Jul 2023 07:55:14 -0700

Hello,

Which Ignite version are you using?
Please share the exception details logged after "Exception during start
processors, node will be stopped and close connections" (the log should
state the reason why the page delta can't be applied).
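
In case it helps with collecting those details, here is a minimal Python sketch that prints the marker line and whatever follows it in a log. The function name, the 200-line limit and the sample lines are assumptions made for illustration; nothing here is part of Ignite.

```python
from typing import Iterable, List

# Marker text taken from the log quoted in this thread. The message is
# wrapped across two lines in the mail, so only its first part is matched.
MARKER = "Exception during start processors"


def lines_after_marker(lines: Iterable[str],
                       marker: str = MARKER,
                       limit: int = 200) -> List[str]:
    """Return the first line containing `marker` plus up to `limit` lines after it."""
    out: List[str] = []
    capturing = False
    for line in lines:
        if not capturing and marker in line:
            capturing = True
        if capturing:
            out.append(line.rstrip("\n"))
            if len(out) > limit:  # marker line + `limit` following lines
                break
    return out


# Sample input built only from lines quoted in this thread; a real log file
# is expected to continue with the exception details after the marker.
sample = [
    "Got exception while starting (will rollback startup routine).",
    "Exception during start processors, node will be stopped and close connections",
]
for entry in lines_after_marker(sample):
    print(entry)
```

For a real file, pass an open file object instead of the sample list, for example `lines_after_marker(open("ignite.log"))`, where `ignite.log` is a placeholder path.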


On Tue, 18 Jul 2023 at 05:05, Raymond Wilson <[email protected]> wrote:

> Hi,
>
> We run a dev/alpha stack of our application in Azure Kubernetes.
> Persistent storage is contained in Azure Files NAS storage volumes, one per
> server node.
>
> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
> update various pods were stopped and restarted as is normal for an update.
> This included nodes running the dev/alpha stack.
>
> At least one node (of the four server nodes in the cluster) failed to
> restart after the update, with the following logging:
>
>   2023-07-18 01:23:55.171 [1] INF    Restoring checkpoint after logical
> recovery, will start physical recovery from back pointer: WALPointer
> [idx=2431, fileOff=209031823, len=29]
>  2023-07-18 01:23:55.205  [28] ERR    Failed to apply page delta.
> rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057,
> pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord
> [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41,
> chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41],
> type=PAGES_LIST_REMOVE_PAGE]]]]
>  2023-07-18 01:23:55.217 [1] INF    Cleanup cache stores [total=0, left=0,
> cleanFiles=false]
>  2023-07-18 01:23:55.218 [1] ERR    Got exception while starting (will
> rollback startup routine).
>  2023-07-18 01:23:55.218 [1] ERR    Exception during start processors,
> node will be stopped and close connections
>
> I know Apache Ignite is very good at surviving 'Big Red Switch' scenarios,
> and we have our data regions configured with the strictest update protocol
> (full sync after each write). However, it's possible the NAS implementation
> does something different!
>
> I think if we delete the WAL files from the nodes that won't restart then
> those nodes may start cleanly, though we will lose any updates since the
> last checkpoint (but then, the stack has low use and checkpoints are every
> 30-45 seconds or so, so this won't be significant).
>
> Is this an error anyone else has noticed?
> Has anyone else had similar issues with Azure Files when using strict
> update/sync semantics?
>
> Thanks,
> Raymond.
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> [email protected]

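The two WALPointer values in the quoted log can be compared mechanically to see how far the failing record lies from the checkpoint back pointer. Below is a minimal Python sketch; the class and function names are hypothetical, and the log text is an abbreviated copy of the lines quoted above.

```python
import re
from typing import List, NamedTuple


class WalPointer(NamedTuple):
    idx: int       # WAL segment index
    file_off: int  # byte offset inside that segment
    length: int    # record length in bytes


# Matches the textual form seen in the quoted log, e.g.
# "WALPointer [idx=2431, fileOff=209031823, len=29]". The [\s>]* parts
# tolerate line wraps and "> " quote markers added by mail clients.
POINTER_RE = re.compile(
    r"WALPointer[\s>]*\[idx=(\d+),[\s>]*fileOff=(\d+),[\s>]*len=(\d+)\]"
)


def find_pointers(text: str) -> List[WalPointer]:
    """Extract every WALPointer found in `text`, in order of appearance."""
    return [WalPointer(int(i), int(o), int(n))
            for i, o, n in POINTER_RE.findall(text)]


# Abbreviated copy of the two relevant log lines from this thread.
log_text = (
    "will start physical recovery from back pointer: WALPointer\n"
    "> [idx=2431, fileOff=209031823, len=29]\n"
    "Failed to apply page delta. pos=WALPointer "
    "[idx=2431, fileOff=209169155, len=41], type=PAGES_LIST_REMOVE_PAGE"
)

back, failed = find_pointers(log_text)
same_segment = back.idx == failed.idx
distance = failed.file_off - back.file_off
print(same_segment, distance)
```

For the values in this thread, both pointers are in segment 2431 and the failing record starts 137332 bytes after the back pointer.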