[jira] [Commented] (SOLR-11299) Time partitioned collections (umbrella issue)

Gus Heck (JIRA) Fri, 13 Oct 2017 16:51:31 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204365#comment-16204365
 ]


Gus Heck commented on SOLR-11299:
---------------------------------

I was thinking about the timezone bit. It seems to me that, just as in 
applications where one normally stores data as UTC and converts when needed, we 
should dodge the timezone metadata read-only issue and always name our 
partitions in terms of UTC. Conversions can be done based on the timezone 
portion of the key date field in cases where we are not receiving UTC. If there 
is no timezone specifier, assume UTC.
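
As a rough illustration of the "convert to UTC, and assume UTC when no zone is 
given" rule, here is a minimal sketch using java.time. The class and method 
names (UtcNormalizer, toUtcInstant) are hypothetical and not part of Solr, and 
it assumes the key date field arrives as an ISO-8601 string.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

// Hypothetical helper, not an existing Solr class.
final class UtcNormalizer {

    // Parses an ISO-8601 date-time and returns the corresponding UTC instant.
    // If the value carries an offset (e.g. "-07:00" or "Z"), it is honored;
    // if there is no timezone specifier, the value is assumed to be UTC.
    static Instant toUtcInstant(String value) {
        TemporalAccessor parsed = DateTimeFormatter.ISO_DATE_TIME.parseBest(
                value, OffsetDateTime::from, LocalDateTime::from);
        if (parsed instanceof OffsetDateTime) {
            return ((OffsetDateTime) parsed).toInstant();
        }
        // No offset present: interpret the local date-time as UTC.
        return ((LocalDateTime) parsed).toInstant(ZoneOffset.UTC);
    }
}
```

With this, "2017-10-13T16:51:31-07:00" and "2017-10-13T23:51:31" both map to 
the same instant, so both would be routed to the same UTC-named partition.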

I've seen an implementation of this sort of thing where routing was based on 
DateFormat parsing the partition names on each request. I could also imagine 
simplifying things by naming the partitions based on epoch milliseconds, which 
could also be kept in alias metadata as a sorted list of partition boundaries, 
with each partition named for its (inclusive) lower bound. Allowing pretty, 
human-friendly collection names that are formatted versions of the lower bounds, 
and mapping the collection start time values to those names, could be a 
follow-on enhancement that just adds a layer of indirection.
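
To make the epoch-milliseconds idea concrete, here is a minimal sketch of 
routing against a sorted set of inclusive lower bounds. Everything here is an 
assumption for illustration: the class name EpochPartitionRouter, the 
"<alias>_<lowerBoundMillis>" naming scheme, and the in-memory TreeSet standing 
in for whatever would actually be stored in alias metadata.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical router, not an existing Solr class.
final class EpochPartitionRouter {
    private final String alias;
    // Sorted partition boundaries: each entry is a partition's inclusive
    // lower bound in epoch milliseconds (UTC).
    private final TreeSet<Long> lowerBounds = new TreeSet<>();

    EpochPartitionRouter(String alias, List<Long> lowerBoundsMillis) {
        this.alias = alias;
        this.lowerBounds.addAll(lowerBoundsMillis);
    }

    // Returns the name of the partition whose lower bound is the greatest
    // one <= epochMillis, or empty if the timestamp precedes every partition.
    Optional<String> partitionFor(long epochMillis) {
        Long lower = lowerBounds.floor(epochMillis);
        return lower == null
                ? Optional.empty()
                : Optional.of(alias + "_" + lower); // assumed naming scheme
    }
}
```

Because the lookup is a floor search on numbers, no date parsing of partition 
names is needed per request. A pretty-name layer could later be a map from 
each lower bound to a formatted collection name.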



> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
>                 Key: SOLR-11299
>                 URL: https://issues.apache.org/jira/browse/SOLR-11299
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think 
> logs or sensor data / IOT) itself without a lot of manual/external work.  The 
> most naive and painless approach today is to create a collection with a high 
> numShards with hash routing but this isn't as good as partitioning the 
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change.  (No need 
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries: 
>     ** can search fewer shards, reducing overall load
>     ** realtime search is more tractable (since most shards are stable -- 
> good caches)
>     ** "recent" shards (that might be queried more) can be allocated to 
> faster hardware
>     ** aged out data is simply removed, not marked as deleted.  Deleted docs 
> still have search overhead.
> * Outages of a shard result in a degraded but sometimes still useful system 
> (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection 
> (potentially actually an alias) in a normal way (search or update), letting 
> Solr handle the addition of new partitions, removing of old ones, and 
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it 
> all happen -- either subtasks or issue linking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
