[jira] [Commented] (LUCENE-5648) Index/search multi-valued time durations

David Smiley (JIRA) Wed, 21 May 2014 20:31:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005576#comment-14005576
 ]


David Smiley commented on LUCENE-5648:
--------------------------------------

I was putting some thought into the different ways of indexing durations, 
listing pros & cons.  The approach here should work very well, but it has two 
main downsides of note:
* Overlapping or adjacent ranges are effectively coalesced, which impacts the 
semantics of Contains & Within.  To be clear, it's a non-issue if the multiple 
durations for a given field on a document don't touch.  But if you wanted to 
index, say, [2000 TO 2014] and [2006 TO 2007], then it's as if the 2nd range 
doesn't even exist.  The document won't match an IsWithin query of 
[2006 TO 2008].
* The worst-case number of terms generated for a range at index time is pretty 
high.  If you wanted to index Long.MIN_VALUE+1 TO Long.MAX_VALUE-1 (which spans 
hundreds of millions of years), we're talking about 14k terms(*).  But it's 
rarely that bad unless you are indexing random milliseconds in random 
millennia.  Indexing a duration of 2 adjacent months in the same year takes 
only 7 terms.  At search time, a large number of hypothetical terms in a 
duration isn't an issue for RPT's algorithms in the common case of a sparsely 
populated term space.
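
To illustrate the first downside, here is a minimal sketch of the coalescing 
effect.  It is not the code in the patch: the class and method names are 
hypothetical, years are modeled as plain longs, and it assumes a per-value 
reading of IsWithin (a document matches if any one of its durations lies 
inside the query range).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or the LUCENE-5648 patch.
class CoalescedDurations {

    /** Merges overlapping or adjacent inclusive [start, end] ranges into new arrays. */
    public static List<long[]> coalesce(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            // Touching means overlapping, or starting right after the previous end.
            boolean touches = last != null
                    && (last[1] == Long.MAX_VALUE || r[0] <= last[1] + 1);
            if (touches) {
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new long[] {r[0], r[1]});
            }
        }
        return merged;
    }

    /** Assumed semantics: true if any single range lies entirely inside [qStart, qEnd]. */
    public static boolean anyWithin(List<long[]> ranges, long qStart, long qEnd) {
        for (long[] r : ranges) {
            if (r[0] >= qStart && r[1] <= qEnd) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<long[]> doc = Arrays.asList(new long[] {2000, 2014}, new long[] {2006, 2007});
        System.out.println(anyWithin(doc, 2006, 2008));           // true
        System.out.println(anyWithin(coalesce(doc), 2006, 2008)); // false
    }
}
```

Once the two ranges are merged into [2000 TO 2014], nothing is left that fits 
inside [2006 TO 2008], which is why the document stops matching.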

Interestingly, using a 2D prefix-tree for single-dimensional durations 
expressed as points doesn't have these shortcomings.  But that approach is 
slower to search than this one (a search area covers more possible terms: 
roughly half the square of the number of terms in this 1D tree), and it is not 
amenable to the terms-enumeration style of interval faceting that I'll be doing 
next.
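
As a rough sketch of the 2D alternative, each duration [start, end] can be 
treated as the point (x = start, y = end), and each predicate then becomes a 
rectangle test on that point.  The class and method names below are 
hypothetical and the code does not use the Lucene spatial API; it only shows 
why separately indexed points are not coalesced.

```java
// Hypothetical illustration only; not the Lucene spatial API.
class DurationPoints {

    /** A duration [start, end] viewed as the 2D point (x = start, y = end). */
    public static final class Point {
        final long x;
        final long y;

        public Point(long start, long end) {
            this.x = start;
            this.y = end;
        }
    }

    /** The duration lies inside the query: start >= qStart and end <= qEnd. */
    public static boolean isWithin(Point p, long qStart, long qEnd) {
        return p.x >= qStart && p.y <= qEnd;
    }

    /** The duration covers the query: start <= qStart and end >= qEnd. */
    public static boolean contains(Point p, long qStart, long qEnd) {
        return p.x <= qStart && p.y >= qEnd;
    }

    /** The duration overlaps the query: start <= qEnd and end >= qStart. */
    public static boolean intersects(Point p, long qStart, long qEnd) {
        return p.x <= qEnd && p.y >= qStart;
    }

    public static void main(String[] args) {
        Point outer = new Point(2000, 2014);
        Point inner = new Point(2006, 2007);
        System.out.println(isWithin(inner, 2006, 2008)); // true
        System.out.println(isWithin(outer, 2006, 2008)); // false
        System.out.println(contains(outer, 2006, 2008)); // true
    }
}
```

Because [2006 TO 2007] remains its own point, it still satisfies IsWithin for 
[2006 TO 2008] even though [2000 TO 2014] is on the same document.  Valid 
points always have end >= start, so they fill only the triangle above the 
diagonal, which is consistent with the "half of the square" estimate.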

(*) The number of terms currently being generated would be cut by ~40-50% once 
LUCENE-4942 gets done.

> Index/search multi-valued time durations
> ----------------------------------------
>
>                 Key: LUCENE-5648
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5648
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/spatial
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE-5648.patch, LUCENE-5648.patch, LUCENE-5648.patch, 
> LUCENE-5648.patch
>
>
> If you need to index a date/time duration, then the way to do that is to have 
> a pair of date fields; one for the start and one for the end -- pretty 
> straight-forward. But if you need to index a variable number of durations per 
> document, then the options aren't pretty, ranging from denormalization, to 
> joins, to using Lucene spatial with 2D as described 
> [here|http://wiki.apache.org/solr/SpatialForTimeDurations].  Ideally it would 
> be easier to index durations, and work in a more optimal way.
> This issue implements the aforementioned feature using Lucene-spatial with a 
> new single-dimensional SpatialPrefixTree implementation. Unlike the other two 
> SPT implementations, it's not based on floating point numbers. It will have a 
> Date based customization that indexes levels at meaningful quantities like 
> seconds, minutes, hours, etc.  The point of that alignment is to make it 
> faster to query across meaningful ranges (e.g. [2000 TO 2014]) and to enable 
> a follow-on issue to facet on the data in a really fast way.
> I expect to have a working patch up this week.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
