----- Original Message ----- From: "Wijaya Edward" <[EMAIL PROTECTED]>
Subject: Mismatch Positions of Two Ambiguous Strings



Hi,
I have two strings and I want to count the number of mismatches between them. The two strings are the same length. Let's call them the 'source' string and the 'target' string. The problem is that either string may be ambiguous, meaning that a single position may contain more than one (up to 4) characters. An ambiguous position is marked with a square-bracketed region such as [ATCG].
The example is as follows:

Example 1 (where the source is ambiguous):

my $source1  = '[TCG]GGGG[AT]'; # ambiguous
my $target1   = 'AGGGGC'; # No of mismatch = 2  on position 1 and 6
my $target2  = 'TGGGGC'; # No of mismatch = 1  on position 6 only


Example 2 (where the source is NOT ambiguous):

my $source2  =  'TGGGGT'; # not-ambiguous
my $target1  = 'AGGGGC'; # No of mismatch = 2  on position 1 and 6
my $target3  = 'TGGGGT'; # No of mismatch = 0  all position matches


Example 3 (where both source and target are ambiguous)
my $source1  = '[TCG]GGGG[AT]'; # ambiguous
my $target1 = 'AGGGG[CT]'; # ambiguous, no of mismatch = 1 only at position 1

For example, I can use a bitwise operator to do it.
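
As an illustration of the bitwise idea for the fully unambiguous case only (Example 2), here is a minimal sketch. It assumes both strings are plain, equal-length strings with no bracketed regions; count_mismatches_xor is a hypothetical name, not something from the original snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: count differing positions between two
# equal-length, unambiguous strings using string XOR.
sub count_mismatches_xor {
    my ($s, $t) = @_;
    die "strings differ in length\n" unless length($s) == length($t);
    my $diff = $s ^ $t;          # identical characters XOR to "\0"
    return $diff =~ tr/\0//c;    # count the characters that are not "\0"
}

print count_mismatches_xor('TGGGGT', 'AGGGGC'), "\n";   # Example 2, target1
print count_mismatches_xor('TGGGGT', 'TGGGGT'), "\n";   # Example 2, target3
```

This does not understand the [..] notation at all, so it cannot be applied directly to Example 1 or Example 3.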

I have no problem when dealing with Example 1 and 2 above.
But I'm stuck on Example 3, where both source and target are ambiguous.


Here is the current snippet I have, which doesn't do the job:

[snip]

Hello Edward

This code will handle ambiguous cases in the source, the target, both, or neither. I 'lifted' the expand_fasta function mostly from Perlmonks (the link is given in the comments on that section of code). I think that code is pretty much bulletproof. I tested all your cases and got good answers, as well as for additional cases I constructed (where all characters mismatched, for example).

This does not provide the positions at which the mismatches occur, but I didn't think that was what you were after. If it is, the mismatches() function would need to be different.

Chris



#!/usr/bin/perl
use strict;
use warnings;
use Set::CrossProduct;

my $source = '[TCG]GGGG[AT]';
my $sources = expand_fasta($source);

my $target = 'AGGGG[CT]';
my $targets = expand_fasta($target);

my $mismatches = mismatches($sources, $targets);
print "$mismatches\n";

# 'lifted' most of this function, (w/one correction) from Perlmonks
# http://www.perlmonks.org/index.pl?node_id=510756
sub expand_fasta {
my $str = shift;
my $strings = [];
my @wildcard;
$str =~ s{ \[ ([ATCG]+) \] }{ push @wildcard, $1; "%s"; }xge;

# if string was ATC[TG]CC, then
# now your string looks like "ATC%sCC"

if( @wildcard ) {
    my @set = map [ split //, $_ ], @wildcard;

    # now @set contains ( [ 'T', 'G' ] )
    # and we weave each possible combination
    # into the %s placeholders in $str

    if (@set > 1) { # this if/else needed for crossproduct to work OK
     my $xp = Set::CrossProduct->new( \@set );
     while( my @tuple = $xp->get ) {
         push @$strings, sprintf $str, @tuple;
     }
    }
    else {
     for (@{$set[0]}) {
      push @$strings, sprintf $str, $_;
     }
    }
}
else {
    push @$strings, $str;
}
return $strings;
}




sub mismatches {
my ($source_ary, $target_ary) = @_;
my $length = length $source_ary->[0];
my $mismatch = $length; # mismatches starts out equal to length of word

for my $src (@$source_ary) {
 for my $target (@$target_ary) {
  my $compare = $src ^ $target;
  my $matches = $compare =~ tr /\0//;
  my $no_match = $length - $matches; # number of mismatches
  if ($no_match == 0) {
   return 0;
  }
  else {
   if ($no_match < $mismatch) {
    $mismatch = $no_match;
   }
  }
 }
}
return $mismatch;
}
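
If the mismatch positions are wanted, one possible alternative is to skip the expansion and compare the set of allowed characters at each position directly. The following is only a sketch under stated assumptions: parse_positions and mismatch_positions are hypothetical names, input is assumed to contain only A, T, C, G and [..] groups, and anything else is silently skipped. A position counts as a mismatch when the source and target sets share no character. Since positions are independent, the number of such positions should agree with the minimum that mismatches() finds over all expansions.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 'A[CT]G' into ( {A=>1}, {C=>1,T=>1}, {G=>1} ),
# one hash of allowed characters per position.
sub parse_positions {
    my ($str) = @_;
    my @positions;
    while ($str =~ / \[ ([ATCG]+) \] | ([ATCG]) /xg) {
        my $chars = defined $1 ? $1 : $2;
        push @positions, +{ map { $_ => 1 } split //, $chars };
    }
    return @positions;
}

# Hypothetical helper: return the 1-based positions where source and
# target have no character in common.
sub mismatch_positions {
    my ($source, $target) = @_;
    my @src = parse_positions($source);
    my @tgt = parse_positions($target);
    die "strings differ in length\n" unless @src == @tgt;

    my @mismatches;
    for my $i (0 .. $#src) {
        # number of characters allowed on both sides at this position
        my $shared = grep { $tgt[$i]{$_} } keys %{ $src[$i] };
        push @mismatches, $i + 1 unless $shared;
    }
    return @mismatches;
}

my @pos = mismatch_positions('[TCG]GGGG[AT]', 'AGGGG[CT]');
print scalar(@pos), " mismatch(es) at position(s): @pos\n";
```

For Example 3 this reports a single mismatch at position 1, which matches the expected answer in the question. It also avoids building every expanded string, which may matter when many positions are ambiguous.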


