RE: union of times algorithm

Jackson, Harry Thu, 06 Jun 2002 03:45:01 -0700

> -----Original Message-----
> From: Jonathan E. Paton [mailto:[EMAIL PROTECTED]]
> 
> 
>  --- "Jackson, Harry" <[EMAIL PROTECTED]> wrote: > 
> > I have been looking back at this problem and here is what I 
> have found.
> > 
> > Lets take the following set of times
> >      A  , B
> > 1   (4  , 5)
> > 2   (9  , 10)
> > 
> > If we sort by column A the set inherits the following 
> property or Rule 
> > 
> >     a1 <= a2 is always true
> > 
> > This means that we are only interested b1 properties in 
> relation to a2 and
> > b2.  If (b1 >= b2) then because of Rule 1 we can ignore the 
> values in the
> > second set and remove them.
> 
> This seems wrong, why (b1 >= b2)?  That's an assertion we 
> aren't looking
> backwards in time (lets assume that the input has been 
> checked for this
> - it makes the calculations easier (I could solve it with a 
> double pass)).


I made the assertion of the stop time always being greater than or equal to
its associated start time because of the wording in the original post. A
stop (time) cannot be negative in respect to its start time. I know its an
assumption but for what was asked it is relatively safe. If the stop time
was to be negative i.e.

Column a holds the most recent start time and Column B hold the last stop
time and you where working out the downtime of a server then your point is
valid but if you swap column A for B then run it through again its fixed.  



> 
> If I assume you meant (a1 >= b2), then it is not valid:

I did not mean the above.

> 
> >      A  , B
> > 1   (4  , 5)
> > 2   (9  , 10)
> 
> Which is already sorted on A, now we can see that if a1>=b2 
> should never be
> true.  So, lets assume (a2 >= b1):
> 
> >      A  , B
> > 1   (4  , 5)
> > 2   (9  , 10)


Again I did not mean (a2 >= b1). The reason I do not compare the value of a1
with anything is because we know one of its properties proved by Rule 1
(after it has been sorted on column A).



> 
> Now, if 5 > 9, then we could merge these (in this case they 
> aren't).  With
> the new set being (4, 10) [if 5 > 9!]
> 
> > If this is not the case then if (b1 < a2) we can safely 
> deduce that it has
> > no overlap in any other set from rule 1 and calculate the 
> time and store it.
> 
> I think this might be true, proving you have collapsed the 
> above.  It warrants
> someone better than I at Algorithms to look at.


If you look at the source you will see that I calculated the value then
removed the first element and collapsed the array. I then go back to the
start of the rules and apply them again.


> 
> > If however (b1 < a2) is false then because of Rule 1 we can 
> safely replace
> > b1 with b2 and remove the second set.
> 
> I'm lost.  :P

If the above is false then we can assume that b1 is greater than or equal to
a1. It overlaps. We have checked the other two rules before hand and so far
we have the following logic.

We know a1 =< a2                FROM RULE 1
We know b2 > b1         FROM RULE 2
We know b1 > a2         FROM RULE 3

therefore we can replace b1 with b2 and remove the second set which I have
used splice to do.    

> 
> > The code I have used to get the results is at the end. I 
> have no idea if it
> > is a fast solution. I am not sure if it is the best way to 
> code it either
> > and I am sure some people can optimise it for best 
> performance. I wanted to
> > use a for loop but went for the easier while loop instead. 
> 
> If you only have to sort once, and you can step through in a 
> linear fashion
> then it will be pretty fast.  It is not dissimilar to what I 
> did (and nobody
> has complained yet), it does however have an unfortunate side effect:
> 
>   the original sample of data gets destroyed

It could be written so that the shift and splice will fill another array as
the bits are taken off. Or pairs could be passed in for processing and the
original array not touched.


> 
> but that is easily fixable.  The time complexity of our 
> solutions is going
> to be the same [I think], however improvements can be made to 
> make it even
> more efficient - if we pre-process the input data, which is 
> not unlikely.
> Imagine a calendar with associated timeslices, where you can do some
> pre-processing and even store to disk.
> 
> I was going to try using an AVL tree, but someone hasn't 
> written Tree::AVL
> yet!  I don't have time just now, so it's going to have to be 
> on the back
> burner for the moment.

I have drawn a representation of what I was doing in a tree like structure
and it comes out well, although I am unsure if my method could be used as a
binary tree. What I found was that I was left with node and leafs whose sums
would add up to the total time used. The main advantage of a binary tree is
sorting and I am trying to reason what the advantage would be in this case.
For me if I wanted a range I would remove the range from the array and
calculate it as above. 

> 
> > For thousands of records if we find the greatest stop time 
> then if our
> > searching set every has this value as its stop time  we can 
> calculate the
> > total and exit without iterating over any more sets 
> providing the set has
> > been sorted on column A.

When I read it back it does not make much sense ;-) . What I mean is that
depending on the spread or set size if we know the largest stop time using
the rules above if we ever find (b1 = (largest stop time)) then we have no
need to search any further along the array. We would need to find the
largest stop time before we start traversing the array.


> Huh?
> 
> > When we have sets with matching start times we are only 
> > interested in the set with the largest difference and 
> > vice versa for the stop times.
> 
> Time complexity to detect that case is poor, is it not?

Again this depends, if we have start times that vary by only one hour but
millions of them it may be quicker to sort them into a hash, after sorting
column B in ascending order where column A is the key (I think a hash
automatically inserts a key value pair over an entry where it finds two keys
equal which means we will only have the entries with the largest spread
(could someone better at Perl than me confirm this)). This would greatly
reduce the number of pairs that we would have to work with. Obviously if the
stop times are all close then we could use the stops as keys.

Please bare with me as I am no Maths Nut and am just bumbling through this
trying to apply what little common sense I have.

Harry




*************************************************************************************
COLT Telecommunications
Registered in England No. 2452736
Registered Office: Bishopsgate Court, 4 Norton Folgate, London E1 6DQ
Tel. +44 20 7390 3900

This message is subject to and does not create or vary any contractual
relationship between COLT Telecommunications, its subsidiaries or 
affiliates ("COLT") and you. Internet communications are not secure
and therefore COLT does not accept legal responsibility for the
contents of this message.  Any view or opinions expressed are those of
the author. The message is intended for the addressee only and its
contents and any attached files are strictly confidential. If you have
received it in error, please telephone the number above. Thank you.
*************************************************************************************


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: union of times algorithm

Reply via email to