[jira] [Commented] (SOLR-11299) Time partitioned collections (umbrella issue)

David Smiley (JIRA) Fri, 06 Oct 2017 08:21:20 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194736#comment-16194736
 ]


David Smiley commented on SOLR-11299:
-------------------------------------

Hi Gus.

bq. One thought that comes to mind is that with deletions of old collections, 
we could more or less think of it as solr collection based ring buffer...

Perhaps in an abstract sense, but I don't think modeling it physically (creating 
X collections with some ordinal name suffix in advance) makes sense. I don't 
think it's a big deal to delete collections and create new ones.  Doing so 
adapts easily to changes in how much data to retain, whereas an actual ring 
buffer design is rigid.
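
As a rough illustration of why delete-and-create is more flexible than a fixed 
ring of pre-created collections, the sketch below picks the collections to drop 
for a given retention setting. It is not Solr code; the function name, the 
"logs_" naming scheme, and the name-to-start-date mapping are all hypothetical.

```python
from datetime import date, timedelta


def collections_to_delete(existing, today, retain_days):
    """Return the names of partitions that fall outside the retention window.

    `existing` maps a (hypothetical) collection name to the first day of
    data it holds. Nothing is pre-created, so changing `retain_days` only
    changes which names come back.
    """
    cutoff = today - timedelta(days=retain_days)
    return sorted(name for name, start in existing.items() if start < cutoff)


existing = {
    "logs_2017-10-01": date(2017, 10, 1),
    "logs_2017-10-02": date(2017, 10, 2),
    "logs_2017-10-04": date(2017, 10, 4),
    "logs_2017-10-06": date(2017, 10, 6),
}
expired = collections_to_delete(existing, date(2017, 10, 6), retain_days=3)
```

Raising or lowering `retain_days` needs no change to the set of collections 
that already exist, which is the flexibility a physical ring buffer lacks.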

bq. The implicit assumption seems to be that writes are "mostly ordered" and 
that severely out of order writes might be rejected? I think that that's 
probably a critical assumption since I imagine that we'll have an alias that's 
moving from collection to collection for writes.  ...

My proposed design does not call for a so-called write alias, which would be a 
limitation for out-of-order writes.  Instead there is a URP (or an add-on to 
DistributedURP) that routes each document to the proper partition.  For fixed 
time-based partitions, it shouldn't be a big deal to add data out of order.  For 
size-capped partitions, out-of-order writes are definitely incompatible.  For 
documents dated far in the future, instead of creating too many intermediate 
collections, we might very well reject them.
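
The routing idea for fixed time-based partitions can be sketched as follows. 
This is only an illustration under stated assumptions, not the proposed URP: 
the daily partition size, the one-day future limit, the "logs" alias name, and 
every function name here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical settings, not taken from the proposal.
PARTITION_SIZE = timedelta(days=1)   # fixed time-based partitions
MAX_FUTURE = timedelta(days=1)       # how far ahead of "now" is accepted
ALIAS = "logs"                       # hypothetical alias name
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class RejectedDocument(Exception):
    """Raised when a document's timestamp is too far in the future."""


def partition_start(ts):
    """Floor a UTC timestamp to the start of its fixed-size partition."""
    buckets = (ts - EPOCH) // PARTITION_SIZE
    return EPOCH + buckets * PARTITION_SIZE


def route(ts, now):
    """Return the name of the collection that should receive the document."""
    if ts > now + MAX_FUTURE:
        # Avoid creating many empty intermediate collections.
        raise RejectedDocument(f"{ts.isoformat()} is too far in the future")
    return f"{ALIAS}_{partition_start(ts):%Y-%m-%d}"
```

Because the target is computed from the document's own timestamp rather than 
from a moving write alias, a late document simply lands in an older partition.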

bq. Thoughts on the possible URP/DURP maybe it's always present by default, but 
a silent no-op unless it sees that a time partitioned collection is being 
accessed, and only then does it do anything?  ...

Yeah maybe; more investigation is needed to help us pick. Perhaps each collection 
involved in a time series has a boolean piece of metadata denoting that it is 
part of a time series?  Or a string back-reference to the alias?

bq. Another thought is that while date/time is the objective here, it would 
seem that any numeric field should work...

I've thought of this, but the time-based use case is so prevalent that I 
doubt it's worth bothering to add non-time support.  It could theoretically be 
added in the future.  In the meantime, such a user could abuse their number by 
treating it as a time in order to use this feature.
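
One way such a workaround might look, purely as an assumption for illustration: 
interpret the number as milliseconds since the Unix epoch, so that a daily 
partition covers a range of 86,400,000 consecutive values. The function name 
is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def number_as_time(n):
    """Interpret an integer as milliseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(milliseconds=n)
```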

> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
>                 Key: SOLR-11299
>                 URL: https://issues.apache.org/jira/browse/SOLR-11299
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think 
> logs or sensor data / IOT) itself without a lot of manual/external work.  The 
> most naive and painless approach today is to create a collection with a high 
> numShards with hash routing but this isn't as good as partitioning the 
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change.  (No need 
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries: 
>     ** can search fewer shards, reducing overall load
>     ** realtime search is more tractable (since most shards are stable -- 
> good caches)
>     ** "recent" shards (that might be queried more) can be allocated to 
> faster hardware
>     ** aged out data is simply removed, not marked as deleted.  Deleted docs 
> still have search overhead.
> * Outages of a shard result in a degraded but sometimes still useful system 
> (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection 
> (potentially actually an alias) in a normal way (search or update), letting 
> Solr handle the addition of new partitions, removing of old ones, and 
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it 
> all happen -- either subtasks or issue linking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

