Re: Possible WAL corruption on running system during K8s update

2023-07-18 Thread Alex Plehanov
Hello,

Which Ignite version do you use?
Please share the exception details after "Exception during start processors,
node will be stopped and close connections" (there should be a reason in
the log explaining why the page delta can't be applied).

Tue, 18 Jul 2023 at 05:05, Raymond Wilson :

> Hi,
>
> We run a dev/alpha stack of our application in Azure Kubernetes.
> Persistent storage is contained in Azure Files NAS storage volumes, one per
> server node.
>
> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
> update various pods were stopped and restarted as is normal for an update.
> This included nodes running the dev/alpha stack.
>
> At least one node (of a cluster of four server nodes in the cluster)
> failed to restart after the update, with the following logging:
>
>   2023-07-18 01:23:55.171 [1] INF Restoring checkpoint after logical
> recovery, will start physical recovery from back pointer: WALPointer
> [idx=2431, fileOff=209031823, len=29]
>  2023-07-18 01:23:55.205  [28] ERR Failed to apply page delta.
> rec=[PagesListRemovePageRecord [rmvdPageId=010100010057,
> pageId=010100010004, grpId=-1476359018, super=PageDeltaRecord
> [grpId=-1476359018, pageId=010100010004, super=WALRecord [size=41,
> chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41],
> type=PAGES_LIST_REMOVE_PAGE
>  2023-07-18 01:23:55.217 [1] INF Cleanup cache stores [total=0, left=0,
> cleanFiles=false]
>  2023-07-18 01:23:55.218 [1] ERR Got exception while starting (will
> rollback startup routine).
>  2023-07-18 01:23:55.218 [1] ERR Exception during start processors,
> node will be stopped and close connections
>
> I know Apache Ignite is very good at surviving 'Big Red Switch' scenarios,
> and we have our data regions configured with the strictest update protocol
> (full sync after each write); however, it's possible the NAS implementation
> does something different!
>
> I think if we delete the WAL files from the nodes that won't restart then
> the node may be happy, though we will lose any updates since the last
> checkpoint (but then, it has low use and checkpoints are every 30-45
> seconds or so, so this won't be significant).
>
> Is this an error anyone else has noticed?
> Has anyone else had similar issues with Azure Files when using strict
> update/sync semantics?
>
> Thanks,
> Raymond.
>
> --
> 
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
>
>
> 
>
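
For reference, the "full sync after each write" setting mentioned above
corresponds to WALMode.FSYNC in Ignite's data storage configuration. A
minimal Java sketch of such a node setup (the region name is taken from logs
later in this thread; everything else is an illustrative assumption, and
Ignite.NET exposes the equivalent WalMode property on its
DataStorageConfiguration):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class FsyncWalSketch {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            // FSYNC syncs to the storage device on every WAL write: the
            // strictest (and slowest) of Ignite's WAL durability modes.
            .setWalMode(WALMode.FSYNC)
            .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                .setName("Default-Immutable")
                .setPersistenceEnabled(true));

        Ignition.start(new IgniteConfiguration()
            .setDataStorageConfiguration(dsCfg));
    }
}

Whether a NAS-backed volume such as Azure Files honors each fsync all the way
to durable storage is, as the message notes, a separate question from the
Ignite configuration itself.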


Re: Ignite SQL

2023-07-18 Thread Stephen Darlington
“Correct” is hard to quantify without knowing your use case, but option 1 is 
probably what you want. Spark pushes down SQL execution to Ignite, so you get 
all the distribution, use of indexes, etc. 
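
A minimal sketch of option 1 in Java (the table name "person", the config
path, and the query are assumptions for illustration; it requires the
ignite-spark integration module on the classpath):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IgniteSparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ignite-sql")
            .master("local[*]")
            .getOrCreate();

        // Read the table previously written to Ignite back as a DataFrame.
        Dataset<Row> person = spark.read()
            .format("ignite")                                   // DataSource from ignite-spark
            .option("config", "/opt/ignite/default-config.xml") // assumed path
            .option("table", "person")                          // assumed table
            .load();

        // Register a view so plain spark.sql() works; simple filters and
        // column pruning are pushed down to the Ignite cluster.
        person.createOrReplaceTempView("person");
        spark.sql("SELECT id, name FROM person WHERE id < 100").show();
    }
}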

> On 14 Jul 2023, at 16:12, Arunima Barik  wrote:
> 
> Hello team
> 
> What is the correct way out of these? 
> 
> 1. Write a spark dataframe to ignite
> Read the same back and perform spark.sql() on that
> 
> 2. Write the spark dataframe to ignite
> Connect to server via a thin client
> Perform client.sql() 
> 
> Regards
> Arunima



Re: Read Write through cache

2023-07-18 Thread Stephen Darlington
Write through works regardless of how you insert data into Ignite.
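
As an illustration, a minimal write-through sketch in Java (the LogStore
below is a hypothetical stand-in that only logs; a real store would persist
to the external system). With writeThrough enabled, every write, whether it
arrives as a SQL INSERT/UPDATE or as a key-value put, is forwarded to the
configured CacheStore:

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

final class WriteThroughSketch {
    /** Hypothetical store that only logs writes. */
    public static class LogStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) { return null; }

        @Override public void write(Cache.Entry<? extends Long, ? extends String> e) {
            System.out.println("write-through: " + e.getKey() + " -> " + e.getValue());
        }

        @Override public void delete(Object key) { /* no-op for the sketch */ }
    }

    static CacheConfiguration<Long, String> cacheConfig() {
        CacheConfiguration<Long, String> ccfg = new CacheConfiguration<>("person");
        ccfg.setWriteThrough(true); // forward every write to the store
        ccfg.setCacheStoreFactory(FactoryBuilder.factoryOf(LogStore.class));
        return ccfg;
    }
}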

I’m not clear what you mean by federated query. Are the records in Spark a 
subset of those in the cache?

Assuming not, create a data frame with a SQL query against Ignite. Create a 
data frame with a SQL query against your Spark data frame. Union together.
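
A sketch of that union in Java, under the assumption that igniteDf was read
back via the "ignite" format and sparkDf holds the rest (the "id" column is
an assumption):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class FederatedUnionSketch {
    static Dataset<Row> federate(SparkSession spark, Dataset<Row> igniteDf, Dataset<Row> sparkDf) {
        // unionByName aligns columns by name rather than position;
        // dropDuplicates guards against rows present on both sides
        // (omit it if the two sets are known to be disjoint).
        Dataset<Row> all = igniteDf.unionByName(sparkDf)
            .dropDuplicates(new String[] {"id"});

        all.createOrReplaceTempView("person_all");
        return spark.sql("SELECT COUNT(*) AS n FROM person_all");
    }
}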

> On 13 Jul 2023, at 08:27, Arunima Barik  wrote:
> 
> Hello Team
> 
> I want to build a read write through Ignite cache over Spark
> 
> If I have 50 rows in the cache and the entire 100-row dataset in Spark, then
> how can I use federated queries?
> 
> Also, how to perform write through using SQL queries?
> 
> Regards
> Arunima



Re: Possible WAL corruption on running system during K8s update

2023-07-18 Thread Raymond Wilson
Hi Alex,

We are using Ignite v2.15.

I will track down the additional log information and reply on this thread.

Raymond.


On Wed, Jul 19, 2023 at 2:55 AM Alex Plehanov 
wrote:

> Hello,
>
> Which Ignite version do you use?
> Please share the exception details after "Exception during start processors,
> node will be stopped and close connections" (there should be a reason in
> the log explaining why the page delta can't be applied).
>
> Tue, 18 Jul 2023 at 05:05, Raymond Wilson :
>
>> Hi,
>>
>> We run a dev/alpha stack of our application in Azure Kubernetes.
>> Persistent storage is contained in Azure Files NAS storage volumes, one per
>> server node.
>>
>> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
>> update various pods were stopped and restarted as is normal for an update.
>> This included nodes running the dev/alpha stack.
>>
>> At least one node (of a cluster of four server nodes in the cluster)
>> failed to restart after the update, with the following logging:
>>
>>   2023-07-18 01:23:55.171 [1] INF Restoring checkpoint after logical
>> recovery, will start physical recovery from back pointer: WALPointer
>> [idx=2431, fileOff=209031823, len=29]
>>  2023-07-18 01:23:55.205  [28] ERR Failed to apply page delta.
>> rec=[PagesListRemovePageRecord [rmvdPageId=010100010057,
>> pageId=010100010004, grpId=-1476359018, super=PageDeltaRecord
>> [grpId=-1476359018, pageId=010100010004, super=WALRecord [size=41,
>> chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41],
>> type=PAGES_LIST_REMOVE_PAGE
>>  2023-07-18 01:23:55.217 [1] INF Cleanup cache stores [total=0,
>> left=0, cleanFiles=false]
>>  2023-07-18 01:23:55.218 [1] ERR Got exception while starting (will
>> rollback startup routine).
>>  2023-07-18 01:23:55.218 [1] ERR Exception during start processors,
>> node will be stopped and close connections
>>
>> I know Apache Ignite is very good at surviving 'Big Red Switch'
>> scenarios, and we have our data regions configured with the strictest
>> update protocol (full sync after each write); however, it's possible the NAS
>> implementation does something different!
>>
>> I think if we delete the WAL files from the nodes that won't restart then
>> the node may be happy, though we will lose any updates since the last
>> checkpoint (but then, it has low use and checkpoints are every 30-45
>> seconds or so, so this won't be significant).
>>
>> Is this an error anyone else has noticed?
>> Has anyone else had similar issues with Azure Files when using strict
>> update/sync semantics?
>>
>> Thanks,
>> Raymond.
>>
>> --
>> 
>> Raymond Wilson
>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> raymond_wil...@trimble.com
>>
>>
>> 
>>
>

-- 

Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com




Re: Possible WAL corruption on running system during K8s update

2023-07-18 Thread Raymond Wilson
Hi Alex,

Here is the log from the Ignite startup. It's fairly short but shows
everything, I think:

2023-07-17 22:38:55,061 [1] DBG [ImmutableCacheComputeServer]   Starting
Ignite.NET 2.15.0.23172
2023-07-17 22:38:55,065 [1] DBG [ImmutableCacheComputeServer]
2023-07-17 22:38:55,068 [1] DBG [ImmutableCacheComputeServer]
2023-07-17 22:38:55,070 [1] DBG [ImmutableCacheComputeServer]
2023-07-17 22:38:55,070 [1] DBG [ImmutableCacheComputeServer]
2023-07-17 22:38:55,073 [1] DBG [ImmutableCacheComputeServer]
2023-07-17 22:38:55,471 [1] DBG [ImmutableCacheComputeServer]   JVM
started.
2023-07-17 22:38:56,340 [1] WRN [ImmutableCacheComputeServer]   Consistent
ID is not set, it is recommended to set consistent ID for production
clusters (use IgniteConfiguration.setConsistentId property)
2023-07-17 22:38:56,382 [1] INF [ImmutableCacheComputeServer]
>>>    __________  ________________
>>>   /  _/ ___/ |/ /  _/_  __/ __/
>>>  _/ // (7 7    // /  / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.15.0#20230425-sha1:f98f7f35
>>> 2023 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: https://ignite.apache.org

2023-07-17 22:38:56,383 [1] INF [ImmutableCacheComputeServer]   Config URL:
n/a
2023-07-17 22:38:56,414 [1] INF [ImmutableCacheComputeServer]
IgniteConfiguration [igniteInstanceName=TRex-Immutable, pubPoolSize=250,
svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=250,
mgmtPoolSize=4, dataStreamerPoolSize=8, utilityCachePoolSize=8,
utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=8,
buildIdxPoolSize=1, igniteHome=/trex/, igniteWorkDir=/persist/Immutable,
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6e46d9f4,
nodeId=4e70ba5e-5829-4b2d-b349-6539918990b5, marsh=BinaryMarshaller [],
marshLocJobs=false, p2pEnabled=false, netTimeout=5000,
netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3,
metricsHistSize=10000, metricsUpdateFreq=2000,
metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi
[addrRslvr=null, addressFilter=null, sockTimeout=0, ackTimeout=0,
marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0,
forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null,
skipAddrsRandomization=false], segPlc=USE_FAILURE_HANDLER,
segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true,
segChkFreq=10000, commSpi=TcpCommunicationSpi
[connectGate=org.apache.ignite.spi.communication.tcp.internal.ConnectGateway@5bb3d42d,
ctxInitLatch=java.util.concurrent.CountDownLatch@5bf61e67[Count = 1],
stopping=false, clientPool=null, nioSrvWrapper=null, stateProvider=null],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@2c1dc8e,
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [],
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@61019f59,
addrRslvr=null,
encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@62e8f862,
tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@26f3d90c,
clientMode=false, rebalanceThreadPoolSize=1, rebalanceTimeout=10000,
rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0,
rebalanceBatchSize=524288, txCfg=TransactionConfiguration
[txSerEnabled=false, dfltIsolation=REPEATABLE_READ,
dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0,
txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000,
pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null,
txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true,
discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100,
locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100,
failureDetectionTimeout=60000, sysWorkerBlockedTimeout=null,
clientFailureDetectionTimeout=60000, metricsLogFreq=30000,
connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11212,
noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768,
idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0,
selectorCnt=2, idleTimeout=7000, sslEnabled=false, sslClientAuth=false,
sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=8,
msgInterceptor=null], odbcCfg=null, warmupClos=null,
atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED,
backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null,
platformCfg=PlatformDotNetConfiguration [binaryCfg=null],
binaryCfg=BinaryConfiguration [idMapper=null, nameMapper=null,
serializer=null, compactFooter=true], memCfg=null, pstCfg=null,
dsCfg=DataStorageConfiguration [pageSize=4096, concLvl=2,
sysDataRegConf=org.apache.ignite.configuration.SystemDataRegionConfiguration@55a8dc49,
dfltDataRegConf=DataRegionConfiguration [name=Default-Immutable,
maxSize=8589934592, initSize=8589934592, swapPath=null,
pageEvictionMode=DISABLED, pageReplacementMode=CLOCK,
evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false,
metricsSubIntervalCount=5, metricsRateTimeInterval=60000,
persistenceEnabled=true, checkpointPageBufSize=0,
lazyMemoryAllocation=true, warmUpCfg=null, memoryAllocator=null,
cdcE

Cache write synchronization mode

2023-07-18 Thread Raymond Wilson
I have a query regarding the CacheWriteSynchronizationMode in
CacheConfiguration.

This enum is defined like this in the .Net client:

  public enum CacheWriteSynchronizationMode
  {
    /// <summary>
    /// Mode indicating that Ignite should wait for write or commit replies
    /// from all nodes. This behavior guarantees that whenever any of the
    /// atomic or transactional writes complete, all other participating
    /// nodes which cache the written data have been updated.
    /// </summary>
    FullSync,

    /// <summary>
    /// Flag indicating that Ignite will not wait for write or commit
    /// responses from participating nodes, which means that remote nodes
    /// may get their state updated a bit after any of the cache write
    /// methods complete, or after {@link Transaction#commit()} method
    /// completes.
    /// </summary>
    FullAsync,

    /// <summary>
    /// This flag only makes sense for {@link CacheMode#PARTITIONED} mode.
    /// When enabled, Ignite will wait for write or commit to complete on
    /// primary node, but will not wait for backups to be updated.
    /// </summary>
    PrimarySync,
  }

We have some replicated caches (where cfg.CacheMode =
CacheMode.Replicated), but we don't specify the WriteSynchronizationMode.

I note in the comment for PrimarySync (the default) that this "only makes
sense" for Partitioned caches. Given that we don't set this mode for our
replicated caches, they will be using the PrimarySync write
synchronization mode.

The core Ignite documentation does not distinguish these synchronization
modes by cache mode and strongly implies that all three have equivalent
consistency guarantees, but the comment above implies that replicated caches
should use either FullSync or FullAsync to ensure all replicas
receive the written value.
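
For anyone wanting to rule this out, making FullSync explicit is a one-line
change. A minimal Java sketch (cache name assumed; Ignite.NET's
CacheConfiguration exposes the same WriteSynchronizationMode property):

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

final class ReplicatedFullSyncSketch {
    static CacheConfiguration<Long, String> cacheConfig() {
        return new CacheConfiguration<Long, String>("my-replicated-cache")
            .setCacheMode(CacheMode.REPLICATED)
            // FULL_SYNC: the write completes only after every node holding a
            // copy (all server nodes, for a REPLICATED cache) applies it.
            .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
    }
}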

As background, I am investigating an issue in our system that could be
explained by replicated caches not holding consistent values, and I am writing
some triage tooling to prove whether that is the case by comparing the
stored values on each of the replicated cache nodes. However, I'm also
doing some due diligence on our configuration and ran into this item.

Thanks,
Raymond.


-- 

Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com




Re: Ignite SQL

2023-07-18 Thread Arunima Barik
I have a huge dataset and I am keeping a few (say 100) rows in Ignite while
the entire dataset remains in Spark.

When I query Ignite, I want to write an SQL query to do so.

Does option 1 still hold good?
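
If the goal is one SQL statement over both the cached subset and the full
dataset, one sketch in Java (view and column names are assumptions) is to
register both sides as temp views and let spark.sql() span them:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class HotColdJoinSketch {
    /** hotDf: the ~100 rows read from Ignite; fullDf: the complete dataset in Spark. */
    static Dataset<Row> query(SparkSession spark, Dataset<Row> hotDf, Dataset<Row> fullDf) {
        hotDf.createOrReplaceTempView("person_hot");
        fullDf.createOrReplaceTempView("person_all");

        // Spark plans across both sources: the Ignite-backed view is scanned
        // through the ignite-spark DataSource, the rest stays in Spark.
        return spark.sql(
            "SELECT a.id, a.name " +
            "FROM person_all a JOIN person_hot h ON a.id = h.id");
    }
}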

On Tue, 18 Jul, 2023, 10:40 pm Stephen Darlington, <
stephen.darling...@gridgain.com> wrote:

> “Correct” is hard to quantify without knowing your use case, but option 1
> is probably what you want. Spark pushes down SQL execution to Ignite, so
> you get all the distribution, use of indexes, etc.
>
> > On 14 Jul 2023, at 16:12, Arunima Barik 
> wrote:
> >
> > Hello team
> >
> > What is the correct way out of these?
> >
> > 1. Write a spark dataframe to ignite
> > Read the same back and perform spark.sql() on that
> >
> > 2. Write the spark dataframe to ignite
> > Connect to server via a thin client
> > Perform client.sql()
> >
> > Regards
> > Arunima
>
>


Re: Read Write through cache

2023-07-18 Thread Arunima Barik
How does write through work? I mean, if I add a row to an Ignite dataframe,
how does it reflect in Spark?

I have 50 rows in Ignite and all 100 rows in Spark.
If I perform a union all, won't the performance degrade?
I mean, it will be slower than just querying Spark.

On Tue, 18 Jul, 2023, 10:43 pm Stephen Darlington, <
stephen.darling...@gridgain.com> wrote:

> Write through works regardless of how you insert data into Ignite.
>
> I’m not clear what you mean by federated query. Are the records in Spark a
> subset of those in the cache?
>
> Assuming not, create a data frame with a SQL query against Ignite. Create
> a data frame with a SQL query against your Spark data frame. Union together.
>
> > On 13 Jul 2023, at 08:27, Arunima Barik 
> wrote:
> >
> > Hello Team
> >
> > I want to build a read write through Ignite cache over Spark
> >
> > If I have 50 rows in the cache and the entire 100-row dataset in Spark,
> > then how can I use federated queries?
> >
> > Also, how to perform write through using SQL queries?
> >
> > Regards
> > Arunima
>
>