[
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204466#comment-16204466
]
David Smiley commented on SOLR-11299:
-------------------------------------
The timezone bit is for two things:
* interpreting the partition time size. A timezone is useful, and in fact
necessary, for the same reasons it is needed by facet.range.gap on date
fields, which supports it. See SOLR-2690 for context as to why {{TZ}} exists.
* allowing for shorter, friendlier collection names like
mycollection_2017-10-13 instead of having to go down to the hour. This isn't
a big deal, granted. I
don't really like millisecond collection names, sorry. Hey [~hossman] I recall
we both attended an LSR presentation (Rocana?) that described a time
partitioning strategy with the dubious choice of milliseconds in the name and
you were like, oh yeah, ol collection 1507953042461 -- there's some great data
in there :-)
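
To illustrate both points, here is a minimal sketch (not Solr code) of deriving a day-granular collection name from a document timestamp. The function name {{partition_name}}, the "<alias>_YYYY-MM-DD" naming scheme, and the fixed UTC-4 offset are assumptions for illustration only; a real setup would presumably use a named zone such as the one passed in {{TZ}}. It reuses the millisecond value quoted above to show that the same instant lands in different day partitions depending on the timezone.

```python
from datetime import datetime, timedelta, timezone


def partition_name(alias, epoch_millis, tz):
    """Return a hypothetical day-level collection name for a timestamp.

    The day boundary is evaluated in `tz`, so the same instant can fall
    into different partitions under different timezones.
    """
    local = datetime.fromtimestamp(epoch_millis / 1000, tz=tz)
    return f"{alias}_{local:%Y-%m-%d}"


ts = 1507953042461  # 2017-10-14T03:50:42.461Z

# Interpreted in UTC, the document belongs to the Oct 14 partition.
print(partition_name("mycollection", ts, timezone.utc))
# -> mycollection_2017-10-14

# Interpreted at a fixed UTC-4 offset (assumed here as a stand-in for a
# named zone), it is still Oct 13 locally.
print(partition_name("mycollection", ts, timezone(timedelta(hours=-4))))
# -> mycollection_2017-10-13
```

Either name is much easier to recognize than a raw millisecond value, but the choice of timezone changes which partition a document near midnight belongs to.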
RE alias metadata for storing partition ranges... yeah I suppose that's
possible, but I admit I like the lean sufficiency of the names themselves, in
series, being adequate. The only problem I can think of with using the names
alone is that you must have a complete, contiguous series, with no gaps from
collections that haven't been created. That doesn't seem like a serious
limitation, I think? If we wanted metadata on each partition, like the start
and end of its range, I'm not inclined to think the alias is where it goes --
more likely it's metadata on the collection.
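
As a rough sketch of what "names alone" could look like (again, not Solr code), the following assumes one-day partitions named "<prefix>_YYYY-MM-DD"; the helpers {{partition_starts}}, {{missing_days}} and {{route}} are hypothetical. Each partition's range is implied to run from its own start to the start of the next name in the series, which is why a gap in the series is a problem.

```python
from bisect import bisect_right
from datetime import date, timedelta


def partition_starts(names, prefix):
    """Parse and sort the start day encoded in each collection name."""
    return sorted(date.fromisoformat(n[len(prefix) + 1:]) for n in names)


def missing_days(names, prefix):
    """Return days between the first and last partition that have no collection."""
    starts = partition_starts(names, prefix)
    present = set(starts)
    gaps = []
    day = starts[0]
    while day < starts[-1]:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def route(names, prefix, day):
    """Pick the latest partition whose start is on or before `day`."""
    starts = partition_starts(names, prefix)
    i = bisect_right(starts, day) - 1
    if i < 0:
        raise ValueError(f"{day} precedes the first partition")
    return f"{prefix}_{starts[i].isoformat()}"


names = ["logs_2017-10-11", "logs_2017-10-12", "logs_2017-10-14"]
print(missing_days(names, "logs"))            # the 13th was never created
print(route(names, "logs", date(2017, 10, 13)))
# -> logs_2017-10-12 (wrong partition, a consequence of the gap)
```

With a gap, a document for the 13th is silently routed to the collection for the 12th, which is the limitation described above; with a contiguous series the names carry enough information on their own.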
> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
> Key: SOLR-11299
> URL: https://issues.apache.org/jira/browse/SOLR-11299
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: David Smiley
> Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think
> logs or sensor data / IOT) itself without a lot of manual/external work. The
> most naive and painless approach today is to create a collection with a high
> numShards with hash routing but this isn't as good as partitioning the
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change. (No need
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries:
> ** can search fewer shards, reducing overall load
> ** realtime search is more tractable (since most shards are stable --
> good caches)
> ** "recent" shards (that might be queried more) can be allocated to
> faster hardware
> ** aged out data is simply removed, not marked as deleted. Deleted docs
> still have search overhead.
> * Outages of a shard result in a degraded but sometimes still useful system
> (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection
> (potentially actually an alias) in a normal way (search or update), letting
> Solr handle the addition of new partitions, removing of old ones, and
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it
> all happen -- either subtasks or issue linking.