Re: Data loss in an Ignite application

Stephen Darlington Mon, 26 Feb 2024 01:34:21 -0800

Glad you got to the bottom of it!

On Sat, 24 Feb 2024 at 00:19, Aleksej Avrutin <[email protected]> wrote:


> Stephen,
>
> Thank you for the message. At last, I've found the root cause of the
> issue. It was an application bug (expected) but it wasn't the most apparent
> one. Out of despair I decided to check all the components of the
> application including Ignite. The good thing is that now I have better
> knowledge of how to troubleshoot issues like this.
>
> My best,
> Alex Avrutin
>
>
> On Fri, Feb 23, 2024 at 10:38 AM Stephen Darlington <
> [email protected]> wrote:
>
>> Is there a pattern to the lost records? Is it old records? Records for a
>> particular customer? Records stored on a specific node or partition?
>>
>> On Thu, 22 Feb 2024 at 21:14, Aleksej Avrutin <[email protected]>
>> wrote:
>>
>>> Jeremy,
>>>
>>> Thank you for the response. I reviewed cache properties using GG Control
>>> Center and there was nothing in the cache props that would lead me to the
>>> conclusion that any expiry policy/TTL is set up for the cache. It wasn't
>>> set on the operation level, either.
>>>
>>> I decided to delete the cache entirely and re-create it. Tomorrow I'll
>>> check if it helps.
>>>
>>> My best,
>>> Alex Avrutin
>>>
>>>
>>> On Thu, Feb 22, 2024 at 3:56 AM Jeremy McMillan <
>>> [email protected]> wrote:
>>>
>>>> First, logging should be configured to at least WARN level if not INFO.
>>>>
>>>> Ignite manages data internally at the page level. If you see errors
>>>> about pages, those are the lowest-level Ignite problems. The next level
>>>> up is partitions: errors involving partitions are mid-to-low-level
>>>> Ignite problems. The next level up is caches: errors at the cache level
>>>> are mid-to-high-level problems. Above that are cache records: errors in
>>>> cache record handling sit at a high level of abstraction. The final
>>>> level is client application operations.
>>>>
>>>> The lower the level of abstraction at which errors appear, the less
>>>> likely operations in general are to succeed. Since the cache appears to
>>>> operate mostly as expected, and there are no obvious errors in the Ignite
>>>> logs, most likely some client-side logic is deleting records, and
>>>> Ignite does not consider this behavior to be an error.
>>>>
>>>> I would recommend fine-tuning log coverage of the cache delete methods.
>>>> First, identify whether the deletion is happening on a client connection
>>>> thread pool or on a thread for server-initiated operations.
>>>>
>>>> My guess is that a client is connecting, getting a cache instance, and
>>>> then setting an expiry policy on that instance, so that every entry added
>>>> through that instance has the expiration applied to it.
>>>>
>>>>
>>>> https://ignite.apache.org/docs/2.14.0/configuring-caches/expiry-policies#configuration
>>>>
>>>> "You can also change or set Expiry Policy for individual cache
>>>> operations. This policy is used for each operation invoked on the returned
>>>> cache instance."
>>>>
>>>>
>>>> https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Client.Cache.ICacheClient-2.html?q=withExpiryPolicy#Apache_Ignite_Core_Client_Cache_ICacheClient_2_WithExpiryPolicy_Apache_Ignite_Core_Cache_Expiry_IExpiryPolicy_
>>>>
>>>> On Wed, Feb 21, 2024, 19:17 Aleksej Avrutin <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> A couple of days ago I encountered a strange phenomenon in our
>>>>> application based on Apache Ignite .Net 2.14 with persistence (3 nodes, 1
>>>>> backup per cache).
>>>>> Data in a cache started disappearing for seemingly no reason and the
>>>>> number of records could be halved (220K to 108K) overnight. I spent a
>>>>> couple of days trying to find a problem in the application, crunched
>>>>> hundreds of megabytes of application logs but didn't manage to find a
>>>>> reason to blame the application. Retention/TTL is not set for the cache. Apache
>>>>> Ignite logs with the option -DIGNITE_QUIET=false also don't reveal any
>>>>> anomalies (or I don't know what to look for). The data shares are expected
>>>>> to be durable (based on Azure Disk) and we never had any issues with them.
>>>>> RAM utilisation is normal and there's plenty of available RAM.
>>>>> The Ignite cluster is hosted in a 3 node Kubernetes cluster on Azure.
>>>>>
>>>>> The question is: how would you recommend investigating issues like
>>>>> this? What metrics and logs can I check? Is it possible to log and track
>>>>> individual Remove() operations as well as SQL queries at Ignite engine
>>>>> level?
>>>>>
>>>>> The application has been working on Ignite for years already and we
>>>>> didn't encounter data loss at such a scale before. It's possible that the
>>>>> app wasn't used as extensively before as it is now and the problem went
>>>>> unnoticed.
>>>>>
>>>>> My best,
>>>>> Alex Avrutin
>>>>>
>>>>
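
The advice in the thread to tighten log coverage around cache deletes, and to find out which thread performs them, can be sketched on the application side. The snippet below is only an illustration under stated assumptions: AuditedCache is a hypothetical wrapper, a plain Python dict stands in for the real cache, and nothing here is part of the Ignite or Ignite.NET API. The point is that every removal is logged with the key, the calling thread, and the call site before it is delegated.

```python
import logging
import threading
import traceback


class AuditedCache:
    """Hypothetical wrapper that logs every removal before delegating.

    The backing store is a plain dict standing in for a real cache;
    this is not an Ignite API.
    """

    def __init__(self, backing, name, logger=None):
        self._backing = backing
        self._name = name
        self._log = logger or logging.getLogger("cache.audit")

    def put(self, key, value):
        self._backing[key] = value

    def get(self, key):
        return self._backing.get(key)

    def remove(self, key):
        # With limit=2 the stack holds the caller's frame and this frame;
        # index 0 is the caller, i.e. the code that asked for the removal.
        caller = traceback.extract_stack(limit=2)[0]
        existed = key in self._backing
        self._log.warning(
            "remove cache=%s key=%r existed=%s thread=%s caller=%s:%d",
            self._name,
            key,
            existed,
            threading.current_thread().name,
            caller.filename,
            caller.lineno,
        )
        self._backing.pop(key, None)
        return existed

    def __len__(self):
        return len(self._backing)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    cache = AuditedCache({}, "orders")
    cache.put(42, "payload")
    cache.remove(42)  # logged with thread name and call site
```

Routing all application deletes through one such choke point makes it possible to tell, from the application log alone, which code path and which thread removed a given key, and whether the key was still present at that time.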
