Glad you got to the bottom of it!

On Sat, 24 Feb 2024 at 00:19, Aleksej Avrutin <alexavru...@gmail.com> wrote:
> Stephen,
>
> Thank you for the message. At last, I've found the root cause of the
> issue. It was an application bug (expected) but it wasn't the most apparent
> one. Out of despair I decided to check all the components of the
> application including Ignite. The good thing is that now I have better
> knowledge of how to troubleshoot issues like this.
>
> My best,
> Alex Avrutin
>
>
> On Fri, Feb 23, 2024 at 10:38 AM Stephen Darlington <sdarling...@apache.org> wrote:
>
>> Is there a pattern to the lost records? Is it old records? Records for a
>> particular customer? Records stored on a specific node or partition?
>>
>> On Thu, 22 Feb 2024 at 21:14, Aleksej Avrutin <alexavru...@gmail.com>
>> wrote:
>>
>>> Jeremy,
>>>
>>> Thank you for the response. I reviewed the cache properties using GG
>>> Control Center and nothing in them suggested that any expiry policy/TTL
>>> is set up for the cache. It wasn't set at the operation level, either.
>>>
>>> I decided to delete the cache entirely and re-create it. Tomorrow I'll
>>> check if it helps.
>>>
>>> My best,
>>> Alex Avrutin
>>>
>>>
>>> On Thu, Feb 22, 2024 at 3:56 AM Jeremy McMillan <jeremy.mcmil...@gridgain.com> wrote:
>>>
>>>> First, logging should be configured to at least WARN level if not INFO.
>>>>
>>>> Ignite manages data internally at the page level. If you see errors
>>>> about pages, those are very low level Ignite problems. The next level up
>>>> is partitions. Errors involving partitions are mid-to-low level Ignite
>>>> problems. The next level up is caches. Errors at the cache level are
>>>> mid-to-high level problems. The next level is cache records. Errors in
>>>> cache record handling are at a high level of abstraction, and the next
>>>> level is client application operations.
>>>>
>>>> The lower the level of abstraction at which the errors appear, the less
>>>> chance operations in general will succeed. Since the cache appears to
>>>> operate mostly as expected, and there are no obvious errors in the Ignite
>>>> logs, most likely there is some client side logic which is deleting
>>>> records, and Ignite does not consider this behavior to be in error.
>>>>
>>>> I would recommend fine tuning cache delete method log coverage. First
>>>> identify if the deletion is happening on a client connection thread pool or
>>>> a thread for server initiated operations.
>>>>
>>>> My guess is that a client is connecting, getting a cache object, and
>>>> then setting expiration on that cache connection so that all cache adds
>>>> under that cache connection will have expiration applied to them.
>>>>
>>>> https://ignite.apache.org/docs/2.14.0/configuring-caches/expiry-policies#configuration
>>>>
>>>> "You can also change or set Expiry Policy for individual cache
>>>> operations. This policy is used for each operation invoked on the returned
>>>> cache instance."
>>>>
>>>> https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Client.Cache.ICacheClient-2.html?q=withExpiryPolicy#Apache_Ignite_Core_Client_Cache_ICacheClient_2_WithExpiryPolicy_Apache_Ignite_Core_Cache_Expiry_IExpiryPolicy_
>>>>
>>>> On Wed, Feb 21, 2024, 19:17 Aleksej Avrutin <alexavru...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> A couple of days ago I encountered a strange phenomenon in our
>>>>> application based on Apache Ignite .Net 2.14 with persistence (3 nodes, 1
>>>>> backup per cache).
>>>>> Data in a cache started disappearing for seemingly no reason and the
>>>>> number of records could be halved (220K to 108K) overnight. I spent a
>>>>> couple of days trying to find a problem in the application, crunched
>>>>> hundreds of megabytes of application logs but didn't manage to find a
>>>>> reason to blame the application. Retention/TTL is not set for the cache.
>>>>> Apache Ignite logs with the option -DIGNITE_QUIET=false also don't reveal
>>>>> any anomalies (or I don't know what to look for). The data shares are
>>>>> expected to be durable (based on Azure Disk) and we never had any issues
>>>>> with them.
>>>>> RAM utilisation is normal and there's plenty of available RAM.
>>>>> The Ignite cluster is hosted in a 3-node Kubernetes cluster on Azure.
>>>>>
>>>>> The question is: how would you recommend investigating issues like
>>>>> this? What metrics and logs can I check? Is it possible to log and track
>>>>> individual Remove() operations as well as SQL queries at Ignite engine
>>>>> level?
>>>>>
>>>>> The application has been working on Ignite for years already and we
>>>>> didn't encounter data loss at such a scale before. It's possible that the
>>>>> app wasn't used so extensively before as it is now and the problem went
>>>>> unnoticed.
>>>>>
>>>>> My best,
>>>>> Alex Avrutin
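
One way to approach Stephen's question about a pattern in the lost records is to diff two snapshots of the cache keys and group the missing keys by some attribute. The Python sketch below is only an illustration under stated assumptions and is not taken from this thread: the key layout `customer:recordId`, the helper names `loss_by_group`, `customer_of` and `bucket_of`, and the way the snapshots are obtained are all hypothetical. In particular, `bucket_of` is a stand-in grouping function and is not Ignite's affinity function, so it does not tell you which real partition a key lives in.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable


def loss_by_group(before: Iterable[Hashable],
                  after: Iterable[Hashable],
                  group_of: Callable[[Hashable], Hashable]) -> dict:
    """Return {group: (lost, total, lost_fraction)} for keys that are
    present in `before` but absent from `after`."""
    before_set, after_set = set(before), set(after)
    total = Counter(group_of(k) for k in before_set)
    lost = Counter(group_of(k) for k in before_set - after_set)
    return {g: (lost[g], n, lost[g] / n) for g, n in total.items()}


def customer_of(key: str) -> str:
    # Assumption: keys look like "customer:recordId" (hypothetical layout).
    return key.split(":", 1)[0]


def bucket_of(key: str, buckets: int = 4) -> int:
    # Stand-in grouping only; this is NOT Ignite's affinity function.
    return sum(key.encode()) % buckets


if __name__ == "__main__":
    # Hypothetical snapshots, e.g. key dumps taken a day apart.
    before = ["acme:1", "acme:2", "acme:3", "globex:1", "globex:2"]
    after = ["acme:1", "globex:1", "globex:2"]
    for group, stats in sorted(loss_by_group(before, after, customer_of).items()):
        print(group, stats)
```

If the loss is concentrated in one group (one customer, one age range, one bucket), that points towards application logic touching those records; a loss spread evenly across groups is harder to attribute and suggests looking at something that acts on the cache as a whole.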