[
https://issues.apache.org/jira/browse/LUCENE-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005576#comment-14005576
]
David Smiley commented on LUCENE-5648:
--------------------------------------
I was putting some thought into the different ways of indexing durations,
listing pros & cons. The approach here should work very well but it has two
main down-sides of note:
* Overlapping or adjacent ranges are effectively coalesced, which affects the
semantics of Contains & Within. To be clear, it's a non-issue if the multiple
durations for a given field on a document don't touch. But if you wanted to
index, say, [2000 TO 2014] and [2006 TO 2007], then it's as if the 2nd range
doesn't even exist: the document won't match an IsWithin query of
[2006 TO 2008].
* The worst-case number of terms generated for a range at index time is pretty
high. If you wanted to index Long.MIN_VALUE+1 TO Long.MAX_VALUE-1 (which spans
hundreds of millions of years), we're talking about 14k terms(*). But it's
rarely that bad unless you are indexing random milliseconds in random
millennia. Indexing a duration of two adjacent months in the same year takes
only 7 terms. At search time, a large number of hypothetical terms in a
duration isn't an issue for RPT's algorithms in the common case of a sparsely
populated term space.
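
The coalescing effect described in the first bullet can be illustrated with a small standalone sketch. It does not use the Lucene spatial API. The class and the `coalesce` / `anyWithin` helpers are hypothetical names. The sketch also assumes that the index effectively sees only the union of a document's ranges, which is how the behavior is described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CoalesceSketch {

    /** Merges overlapping or adjacent inclusive ranges given as {start, end}. */
    static List<long[]> coalesce(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            // "Touching" means overlapping or directly adjacent; guard against overflow.
            boolean touches = last != null
                && (last[1] == Long.MAX_VALUE || r[0] <= last[1] + 1);
            if (touches) {
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new long[] {r[0], r[1]}); // copy so the input is not mutated
            }
        }
        return merged;
    }

    /** True if at least one range lies entirely inside [qStart, qEnd]. */
    static boolean anyWithin(List<long[]> ranges, long qStart, long qEnd) {
        for (long[] r : ranges) {
            if (r[0] >= qStart && r[1] <= qEnd) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<long[]> doc = Arrays.asList(new long[] {2000, 2014}, new long[] {2006, 2007});
        // Each range considered on its own: [2006 TO 2007] is within [2006 TO 2008].
        System.out.println(anyWithin(doc, 2006, 2008));           // true
        // After coalescing, only [2000 TO 2014] remains, so the match is lost.
        System.out.println(anyWithin(coalesce(doc), 2006, 2008)); // false
    }
}
```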
Interestingly, using a 2D prefix-tree for single-dimensional durations
expressed as points doesn't have these shortcomings. But that approach is
slower to search than this approach (more possible terms in a search area; it's
half of the square of the number of terms in this 1D tree), and is not amenable
to terms-enumeration style interval faceting that I'll be doing next.
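
For comparison, here is a minimal sketch of the 2D alternative, in which each duration becomes the point (x = start, y = end) and each predicate becomes an axis-aligned rectangle. This is plain Java rather than the Lucene spatial API, and the `Rect` class and the `*Query` helpers are hypothetical names used only for illustration. Because every duration is its own point, multiple durations on one document are never coalesced.

```java
import java.util.Arrays;
import java.util.List;

public class DurationAsPointSketch {

    /** Axis-aligned rectangle over (x = start, y = end); all bounds inclusive. */
    static final class Rect {
        final long minX, maxX, minY, maxY;

        Rect(long minX, long maxX, long minY, long maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        boolean containsPoint(long x, long y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Duration is within the query: start >= qStart and end <= qEnd. */
    static Rect isWithinQuery(long qStart, long qEnd) {
        return new Rect(qStart, Long.MAX_VALUE, Long.MIN_VALUE, qEnd);
    }

    /** Duration contains the query: start <= qStart and end >= qEnd. */
    static Rect containsQuery(long qStart, long qEnd) {
        return new Rect(Long.MIN_VALUE, qStart, qEnd, Long.MAX_VALUE);
    }

    /** Duration intersects the query: start <= qEnd and end >= qStart. */
    static Rect intersectsQuery(long qStart, long qEnd) {
        return new Rect(Long.MIN_VALUE, qEnd, qStart, Long.MAX_VALUE);
    }

    /** True if any of the document's durations, as a point, falls in the rectangle. */
    static boolean matches(List<long[]> durations, Rect query) {
        for (long[] d : durations) {
            if (query.containsPoint(d[0], d[1])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<long[]> doc = Arrays.asList(new long[] {2000, 2014}, new long[] {2006, 2007});
        System.out.println(matches(doc, isWithinQuery(2006, 2008)));   // true
        System.out.println(matches(doc, containsQuery(2001, 2010)));   // true
        System.out.println(matches(doc, intersectsQuery(2015, 2020))); // false
    }
}
```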
(*) The number of terms currently being generated would be cut by ~40-50% once
LUCENE-4942 gets done.
> Index/search multi-valued time durations
> ----------------------------------------
>
> Key: LUCENE-5648
> URL: https://issues.apache.org/jira/browse/LUCENE-5648
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/spatial
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: LUCENE-5648.patch, LUCENE-5648.patch, LUCENE-5648.patch,
> LUCENE-5648.patch
>
>
> If you need to index a date/time duration, then the way to do that is to have
> a pair of date fields; one for the start and one for the end -- pretty
> straight-forward. But if you need to index a variable number of durations per
> document, then the options aren't pretty, ranging from denormalization, to
> joins, to using Lucene spatial with 2D as described
> [here|http://wiki.apache.org/solr/SpatialForTimeDurations]. Ideally it would
> be easier to index durations, and work in a more optimal way.
> This issue implements the aforementioned feature using Lucene-spatial with a
> new single-dimensional SpatialPrefixTree implementation. Unlike the other two
> SPT implementations, it's not based on floating point numbers. It will have a
> Date based customization that indexes levels at meaningful quantities like
> seconds, minutes, hours, etc. The point of that alignment is to make it
> faster to query across meaningful ranges (e.g. [2000 TO 2014]) and to enable
> a follow-on issue to facet on the data in a really fast way.
> I expect to have a working patch up this week.