suppose I have two lists of intervals, one significantly larger than
the other.
For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain
thousands
of elements while listB (of the same form) might contain hundreds of
thousands
or millions of elements.
I want to count how many intervals in listB are contained within every
listA. For example, if listA = [(10, 30), (600, 800)] and listB =
[(20, 25), (12, 18)] is the input, then the output should be that (10,
30) has 2 intervals from listB contained within it, while (600, 800)
has 0. (Elements of listB can be contained within many intervals in
listA, not just one.)

What is an efficient way to this?  One simple way is:

for a_range in listA:
  for b_range in listB:
    is_within(b_range, a_range):
      # accumulate a counter here

Could you detail the is_within() function? most importantly, is it inclusive, such that "a1 <= b1 <= b2 <= a2", or is it overlapping such that "a1 <= b2 or a2 <= b1". Additionally, do you want to count how many intervals of A overlap with intervals of B, or do you just want a count of how many intervals in B have *any* overlap within A?

My leaning would be to make a custom "is this in A" function, iterate over B and test against an "appropriate subset" of A:

  listA = sorted([...])
  min_a1 = min(a1 for (a1, a2) in listA)
  max_a2 = max(a2 for (a1, a2) in listA)
  def in_a(b1, b2):
    for a1, a2 in smartly_chosen_subset(listA, b1, b2):
      if is_within((b1, b2), (a1, a2)): return True
    return False
  i = 0
  for b1, b2 in sorted(listB):
    if b2 < min_a1: continue
    if b1 > max_a2: break
    if in_a(b1, b2): i += 1

The in_a() function can be optimized to terminate early if a1>b2 or a2 < b1 (perhaps even choosing an smart starting point for iterating). Short-circuiting by returning True as soon as you know there's a match will trim some of the time.

How distributed are the endpoints? If there's a lot of commonality in the data, you might be able to cache results to cut even further.

Just a few ideas, and questions that might help develop better solutions.

-tkc



--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to