On Wed, 6 Jun 2001 09:43:09 -0700 , Xitong Li said:
>
> Case 1: overlap is in the beginning of a string
> $string1 =
> "longoverlapstringfollowedbylongrandomstringadfjakjekjtanc;zitkl;nv;ihqkktls
> adfjkalfdjladfjladfl"
> $string2 =
> "longoverlapstringfollowedbylongrandomstringqiutrnzuktqtkionkajgnyinlcvkjgk"
>
> Case 2: overlap is in the middle of a string
> $string1 =
> "pnuuermnoucnkdjantrtaklongoverlapstringfollowedbylongrandomstringadfjakjekj
> tanc;zitkl;nv;ihqkktlsadfjkalfdjladfjladfl"
> $string2 =
> "clkjakninkczjfginmrtlongoverlapstringfollowedbylongrandomstringqiutrnzuktqt
> kionkajgnyinlcvkjgk"
>
> Question:
> Is there any good way to parse out the overlap string (should be
> "longoverlapstringfollowedbylongrandomstrin") other than tedious repetitive
> while loop to test one character at a time?
Hello
Randal Schwartz recommended using the LCS function (part of the
Algorithm::Diff). However, LCS, or the longest common sequence, is a different
problem than the one asked here.
LCS: given 2 arrays, find the longest list L s.t. forall i, j=i+1 if Li comes
before Lj in the sequence
L then Li comes before Lj in both given arrays.
Example:
a b c d e f g
b c a e f g
-----------
- b c - e f g
However, the problem here is given 2 strings, find the longest str s.t. str is
a substring in both 2 strings, which is "efg" in the prev example.
I don't think there is a modules for solving this problem (since I don't think
it's that common). However, to roll your own is not that hard. Let me give
you a clue: given 2 strings, split them to make 2 lists, then create a 2-dim
array (list of lists) W s.a. $W[$i][$j] = ($str1[$i] eq $str2[$j])? '1':'0';
if you print this array, you will see the pattern. Ones are matches and Zeros
are none.
1 0
0 1 for a common string of length 2
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1 for a common string of length 4
It shouldn't be hard to find the longest common substring from this bitmap.
First, we can rotate one of the strings as we fill the array from left to
right. This way, we get the following shape for a substring of length 4:
1 1 1 1
0 0 0 0
0 0 0 0
0 0 0 0
And from the entire map, the problem now is locating the longest horizontal
sequence of Ones. If the matrix was filled with $W[$i][$j] = ($str1[$i] eq
$str2[$j])? $str1[$i] :'1', then you can see the longest matching string in the
matrix.
I hope this helped writing an effecient perl algorithm.
Aziz,,,
_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com