Hi Jeff,

It is happening on multiple servers, and even in different DCs. The schema contains two keyspaces, as follows:
CREATE KEYSPACE ks1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;

CREATE TABLE ks1.tbl1 (
    account_id text,
    metric_name text,
    day_of_month int,
    hour_of_day int,
    insert_time timestamp,
    metric_value text,
    PRIMARY KEY ((account_id, metric_name, day_of_month, hour_of_day), insert_time)
) WITH CLUSTERING ORDER BY (insert_time DESC)
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND default_time_to_live = 2764800
    AND gc_grace_seconds = 864000;

CREATE TABLE ks1.tbl2 (
    account_id text,
    lambda_uuid uuid,
    metric_name text,
    day_of_month int,
    insert_time timestamp,
    hour_of_day int,
    metric_value text,
    PRIMARY KEY ((account_id, lambda_uuid, metric_name, day_of_month), insert_time)
) WITH CLUSTERING ORDER BY (insert_time DESC)
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND default_time_to_live = 2764800
    AND gc_grace_seconds = 864000;

CREATE KEYSPACE ks2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;

CREATE TABLE ks2.tbl3 (
    account_id text,
    lambda_uuid uuid,
    insert_time timestamp,
    data text,
    error_count int,
    info_count int,
    request_id text,
    warn_count int,
    PRIMARY KEY ((account_id, lambda_uuid), insert_time)
) WITH CLUSTERING ORDER BY (insert_time DESC)
    AND default_time_to_live = 2764800
    AND gc_grace_seconds = 864000;

I'm afraid I am unable to provide the actual key cache file, but for what it's worth, here's a data example from each of the above tables (one row each, transposed for readability):

*ks1.tbl1:*
 account_id   | 12345678
 metric_name  | com.selfbuilds.faas.system.invocation-finished-successful
 day_of_month | 1
 hour_of_day  | 10
 insert_time  | 2020-01-01 10:59:03.588000+0000
 metric_value | {"invocationIdentifier":"51176dd4-aa34-4834-bcf5-7e3d5b5494d1","lambdaUUID":"ADD3A5B4-7497-499A-8364-A97108C7400B","event":"","invocationState":"SUCCEEDED","executionTimeInMs":3392}

*ks1.tbl2:*
 account_id   | 12345678
 lambda_uuid  | ed7f0629-8d5c-47dc-a66d-4a1197aa01bc
 metric_name  | com.selfbuilds.faas.system.invocation-triggered
 day_of_month | 9
 insert_time  | 2020-01-09 13:03:06.653000+0000
 hour_of_day  | 13
 metric_value | {"invocationIdentifier":"3631e9ea-b230-4c4d-a051-c7b4241ee231","lambdaUUID":"ED7F0629-8D5C-47DC-A66D-4A1197AA01BC","event":"","lambdaName":"1578574875000","lastModifierOfLambda":"FaaSAll"}

*ks2.tbl3:*
 account_id  | 12345678
 lambda_uuid | 9f4eaa4f-42de-4c6e-9c96-02a4749281f3
 insert_time | 2020-01-01 01:30:22.302000+0000
 data        | [{"timestamp":1577842222458,"level":"Info","message":"Best Logs in town","extras":[]}]
 error_count | 0
 info_count  | 1
 request_id  | 2966683f-cf37-4ea3-9d82-1de46207d51e
 warn_count  | 0

Thanks for your help on this one!

On Thu, Jan 23, 2020 at 5:40 PM Jeff Jirsa <jji...@gmail.com> wrote:

> 489MB key seems huge. Unlikely it’s really that large.
>
> If it’s happening on multiple servers it’d be interesting to see the
> schema and/or saved cache file if it’s not incredibly sensitive
>
> On Jan 23, 2020, at 3:48 AM, Shalom Sagges <shalomsag...@gmail.com> wrote:
>
> Hi All,
>
> Cassandra 3.11.4.
> On one of our clusters, during startup, I see two types of "Harmless
> error" notification regarding the keycache:
>
> *Server 1:*
>
> INFO [pool-3-thread-1] 2020-01-23 04:34:46,167 AutoSavingCache.java:263 - *Harmless error reading saved cache* /path/to/saved_caches/KeyCache-e.db
> *java.lang.NegativeArraySizeException: null*
>     at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:401)
>     at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:472)
>     at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:220)
>     at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:165)
>     at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:161)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> INFO [pool-3-thread-1] 2020-01-23 04:34:46,168 AutoSavingCache.java:174 - Completed loading (188 ms; 10916 keys) KeyCache cache
>
> *Server 2:*
>
> INFO [pool-3-thread-1] 2020-01-15 11:18:35,797 AutoSavingCache.java:263 - Harmless error reading saved cache /path/to/saved_caches//KeyCache-e.db
> *java.io.IOException: Corrupted key cache. Key length of 489735494 is longer than maximum of 65535*
>     at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:469)
>     at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:220)
>     at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:165)
>     at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:161)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> INFO [pool-3-thread-1] 2020-01-15 11:18:35,799 AutoSavingCache.java:174 - Completed loading (285 ms; 20681 keys) KeyCache cache
>
> The KeyCache size is a lot less than the max default (100MB). So does this
> mean that the key itself is the culprit?
>
> I'd really appreciate some guidance on what this message actually means in
> the logs during startup and what problems may arise from it.
> I see that one of the nodes experiences high CPU load and even gets kernel
> panic from time to time, so I want to understand if the above may cause
> such issues.
>
> Thanks a lot!
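P.S. In case it helps frame the question: my understanding (simplified — this is
*not* Cassandra's actual code, just an illustration) is that the saved-cache loader
reads a length prefix and then that many key bytes. If the file is truncated or the
reader gets misaligned, the same garbage bytes can surface either as a negative
length (the NegativeArraySizeException on Server 1) or as an absurdly large one
(the 489735494 > 65535 on Server 2). A tiny Python sketch of that failure mode:

```python
import struct

MAX_KEY_LENGTH = 65535  # the cap quoted in the Server 2 log message

def read_key(buf: bytes, offset: int) -> bytes:
    """Read one length-prefixed key, roughly how a saved-cache loader would."""
    (length,) = struct.unpack_from(">i", buf, offset)  # signed big-endian int32
    if length < 0:
        # In Java, allocating `new byte[length]` here throws NegativeArraySizeException
        raise ValueError(f"negative key length {length}")
    if length > MAX_KEY_LENGTH:
        raise OSError(f"Corrupted key cache. Key length of {length} "
                      f"is longer than maximum of {MAX_KEY_LENGTH}")
    return buf[offset + 4 : offset + 4 + length]

# Well-formed entry: 4-byte length prefix, then the key bytes.
good = struct.pack(">i", 5) + b"hello"
print(read_key(good, 0))  # b'hello'

# Misaligned read: interpreting key bytes as a length prefix yields a
# huge bogus value, much like the 489735494 in the Server 2 log.
print(struct.unpack_from(">i", good, 2)[0])  # 354405, way over 65535
```

If that picture is right, the specific numbers in the logs are just noise from a
corrupted or stale cache file, not evidence of a genuinely 489MB key.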