[ 
https://issues.apache.org/jira/browse/FC-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060134#comment-18060134
 ] 

Ben Manes edited comment on FC-327 at 2/22/26 4:32 PM:
-------------------------------------------------------

Shawn, thank you for the details. I believe calling this a "cache" is 
misleading for the search use-case and what you want is replication.

For terminology, a cache is a hot subset of transient data for a transparent 
speedup by reducing lookups to the system-of-record. A cache miss is common and 
expected, with the side effect of a slower but correct operation. Since the 
cache avoids costly lookups it can return stale data, so consistency skew must 
be acceptable, e.g. the TTL you mentioned.

A replica is a complete copy of a data set. Full versus partial replication 
describes how many data sets are copied, e.g. a partial replica holds the 
entirety of some SQL tables but only a subset of all tables in the database. 
That differs from a cache, which holds only a subset of an individual data set 
(e.g. a partial copy of a SQL table). A _cold replica_ is a one-time snapshot, 
a _warm replica_ is a periodic snapshot, and a _hot replica_ is a snapshot that 
is being continuously updated (streaming).

For simple key-value lookups, a _cache_ is perfectly reasonable. It cannot 
perform search queries itself, but can store the results of previous searches. 
For example a query cache would have the query itself be a cache key, the 
matching record ids the cache value, and a cache miss would execute the query 
against the external data store. The query cache would avoid expensive 
redundant searches and could be combined with a data cache to fetch the records 
by a multi-get batch load. The caches would always be key-value maps that speed 
up single key lookups.
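
As a rough illustration of the query cache plus data cache pairing, here is a 
minimal Java sketch. The class name, the {{runQuery}} and {{loadAll}} hooks, 
and the use of strings for records are all assumptions made for the example, 
and plain concurrent maps stand in for a real cache, so there is no eviction 
or TTL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: a query cache (query -> ids) layered over a data cache (id -> record). */
final class QueryCacheSketch {
  // Plain maps stand in for a real cache; they have no eviction or TTL.
  private final Map<String, List<Integer>> queryCache = new ConcurrentHashMap<>();
  private final Map<Integer, String> dataCache = new ConcurrentHashMap<>();

  // Hypothetical hooks into the external data store (the system of record).
  private final Function<String, List<Integer>> runQuery;
  private final Function<List<Integer>, Map<Integer, String>> loadAll;

  QueryCacheSketch(Function<String, List<Integer>> runQuery,
      Function<List<Integer>, Map<Integer, String>> loadAll) {
    this.runQuery = runQuery;
    this.loadAll = loadAll;
  }

  List<String> search(String query) {
    // On a miss, execute the query against the data store and remember the ids.
    List<Integer> ids = queryCache.computeIfAbsent(query, runQuery);

    // Collect the ids absent from the data cache and fetch them in one batch.
    List<Integer> missing = new ArrayList<>();
    for (Integer id : ids) {
      if (!dataCache.containsKey(id)) {
        missing.add(id);
      }
    }
    if (!missing.isEmpty()) {
      dataCache.putAll(loadAll.apply(missing));
    }

    // Materialize the results, skipping ids the data store no longer has.
    List<String> records = new ArrayList<>();
    for (Integer id : ids) {
      String record = dataCache.get(id);
      if (record != null) {
        records.add(record);
      }
    }
    return records;
  }
}
```

Repeating the same query is then served from the two maps without touching 
the data store. The cached id list can go stale, which is the consistency 
skew that a TTL would bound in a real cache.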

For the actual search query evaluation, a cache cannot fulfill the use-case 
because the entire data set is needed for the evaluation. Here replication is 
required to build a complete search index to query against. The search index 
would contain all records in the data set, but as thin records that store only 
the searchable fields to match against, with the full records materialized 
from the data store. Think Elasticsearch / Solr vs Memcached / Redis.

Your usage of Ehcache v2 required the entire data set to be in the cache or 
else it gave incorrect responses, as search evaluated only a subset of data. 
You can certainly emulate this with any other cache by similarly scanning over 
its contents, but it can be very misleading and confusing. A more 
straightforward approach is to periodically load all of the searchable portions 
of your data set, e.g. [(123, john, doe), (456, dan, smith)], to evaluate the 
queries against and then return the full records through a cache, e.g. [123 -> 
(john, doe, admin, sales), 456 -> (dan, smith, member, engineering)]. The 
search index 
can be an immutable {{List}} that is replaced by a scheduled task, scanned to 
match against a query's criteria, and the results fully materialized by cache 
lookups.
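
A minimal Java sketch of that arrangement, under stated assumptions: the 
{{IndexEntry}} and {{User}} records, the {{loadIndex}} and {{loadUser}} hooks, 
and the class name are hypothetical, and a plain concurrent map stands in for 
the cache. It is meant to show the shape (full thin index, linear scan, cache 
lookups for materialization), not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Thin index entry: only the record id and the searchable fields. */
record IndexEntry(int id, String firstName, String lastName) {}

/** Full record as returned to callers. */
record User(String firstName, String lastName, String role, String department) {}

final class SearchIndexSketch {
  // Immutable snapshot of every searchable entry, swapped as a whole on refresh.
  private volatile List<IndexEntry> index = List.of();
  // A plain map stands in for a real cache (no eviction or TTL here).
  private final Map<Integer, User> cache = new ConcurrentHashMap<>();

  private final Supplier<List<IndexEntry>> loadIndex; // hypothetical: reads all thin entries
  private final Function<Integer, User> loadUser;     // hypothetical: reads one full record

  SearchIndexSketch(Supplier<List<IndexEntry>> loadIndex, Function<Integer, User> loadUser) {
    this.loadIndex = loadIndex;
    this.loadUser = loadUser;
  }

  /** Reloads the complete index; intended to be run by a scheduled task. */
  void refresh() {
    index = List.copyOf(loadIndex.get());
  }

  /** Schedules periodic refreshes; the caller owns and shuts down the executor. */
  void scheduleRefresh(ScheduledExecutorService executor, long period, TimeUnit unit) {
    executor.scheduleWithFixedDelay(this::refresh, 0, period, unit);
  }

  /** Scans the whole snapshot, then materializes the matches through the cache. */
  List<User> search(Predicate<IndexEntry> criteria) {
    List<User> results = new ArrayList<>();
    for (IndexEntry entry : index) {
      if (criteria.test(entry)) {
        User user = cache.computeIfAbsent(entry.id(), loadUser);
        if (user != null) { // the record may have been deleted since the last refresh
          results.add(user);
        }
      }
    }
    return results;
  }
}
```

Because the query is evaluated against the full snapshot, evicting entries 
from the cache only makes materialization slower; it does not change which 
records match.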

(x) {{Search == Cache}}
(/) {{Search + Cache}}


> Upgrade from ehcache v2
> -----------------------
>
>                 Key: FC-327
>                 URL: https://issues.apache.org/jira/browse/FC-327
>             Project: FORTRESS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Shawn McKinney
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Fortress core uses ehcache v2. It is getting long in the tooth, has a number 
> of CVEs, and needs to be replaced. Here we'll look at alternatives.


