[
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204365#comment-16204365
]
Gus Heck commented on SOLR-11299:
---------------------------------
I was thinking about the timezone bit. Just as applications normally store
data as UTC and convert only when needed, we should dodge the timezone
metadata read-only issue and always name our partitions in terms of UTC.
When we are not receiving UTC, the conversion can be done based on the
timezone portion of the key date field; if there is no timezone specifier,
assume UTC.
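
To make the UTC rule concrete, here is a minimal sketch in Python (not Solr code). The helper names `to_utc` and `to_epoch_millis` are hypothetical, and the sketch assumes the key date field arrives as an ISO-8601 string:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc(value: str) -> datetime:
    """Parse an ISO-8601 timestamp and return it as an aware UTC datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit zero offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No timezone specifier: assume the value is already UTC.
        return parsed.replace(tzinfo=timezone.utc)
    # Otherwise convert using the offset carried by the value itself.
    return parsed.astimezone(timezone.utc)


def to_epoch_millis(value: str) -> int:
    """Return the UTC instant as integer epoch milliseconds."""
    # Integer division by a timedelta avoids float rounding.
    return (to_utc(value) - EPOCH) // timedelta(milliseconds=1)
```

With this, "2017-10-14T09:30:00+02:00" and "2017-10-14T07:30:00" resolve to the same instant, so both would land in the same partition.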
I've seen an implementation of this sort of thing where routing was based on
DateFormat parsing the partition names on each request. I could also imagine
simplifying things by naming the partitions by epoch milliseconds, which could
also be kept in alias metadata as a sorted list of partition boundaries, with
each partition named for its (inclusive) lower bound. Allowing pretty,
human-friendly collection names that are formatted versions of the lower
bounds, and mapping the collection start time values to those names, could be
a follow-on enhancement that just adds a layer of indirection.
> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
> Key: SOLR-11299
> URL: https://issues.apache.org/jira/browse/SOLR-11299
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: David Smiley
> Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think
> logs or sensor data / IOT) itself without a lot of manual/external work. The
> most naive and painless approach today is to create a collection with a high
> numShards with hash routing but this isn't as good as partitioning the
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change. (No need
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries:
> ** can search fewer shards, reducing overall load
> ** realtime search is more tractable (since most shards are stable --
> good caches)
> ** "recent" shards (that might be queried more) can be allocated to
> faster hardware
> ** aged out data is simply removed, not marked as deleted. Deleted docs
> still have search overhead.
> * Outages of a shard result in a degraded but sometimes still useful system
> (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection
> (potentially actually an alias) in a normal way (search or update), letting
> Solr handle the addition of new partitions, removing of old ones, and
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it
> all happen -- either subtasks or issue linking.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)