Re: how to optimize a string search

R. Joseph Newton Mon, 10 Nov 2003 23:57:32 -0800

Ramprasad A Padmanabhan wrote:

> I know this is more of an algorithm question but please bear with me.
>
> In my program I am checking wether a emailid exists in a list
> I have in the complete_list a string like
> $complete_string="<[EMAIL PROTECTED]> <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> ... 
> <[EMAIL PROTECTED]>";
>
> #  that is a string of emailids sorted
>
> Now I want to find out if another ID  "<[EMAIL PROTECTED]>" exists in this list
>
> How is it possible to optimize the search  given that the complete list
> string is sorted in order


Don't optimize--at least not until you are sure your getting the information
right.  It is just the wrong emphasis.

Here are a couple of approaches that have worked for me.  The first, used in
building an index, is predicated on the concept of saving all possible
information.  The second, used in loading a tree structure, takes as a design
imperative that any node should focus only on one prime reference, in this case
the post being replied to.  The wealth of possibilities can be a recipe for
inifnite loops when loading a tree, hence it's better to throw away some
information than to follow potentially circular links.

First approach:

Setup:

The only information available to this routine is the $message_info hash
reference.  This has the form:

$message_info => {
     $sequence_number1 => {
          'Message-ID' => $message_id,
          Subject      => $subject
           ...
          }
      ...
}

The references item is a potentially multi-item string, and this routine
handles this.  $reference_index and $no_references are sent into the
gather_References_items function to load their data structures.  The latter,
no_references, is an afterthought.


sub create_References_index {
  my $message_info = $_[0];
  my $reference_index = {};
  my $no_references = [];

  gather_References_items($message_info, $reference_index, $no_references);

  open OUT, ">dbm/referenc.ndi" or die
   "Could not open file to write Reference index: $!";
  foreach my $column_value (sort {uc $a cmp uc $b} keys %$reference_index) {
    my $index_row = column_index_line($reference_index, $column_value);
    print OUT "$index_row\n";
  }
  close OUT or die
   "Could not close file after writing Reference index: $!";
  create_no_reference_index($no_references);
}

sub gather_References_items {
  my ($message_info, $reference_index, $no_references) = @_;

  foreach my $msg_key (keys %$message_info) {
    my $data_field = $message_info->{$msg_key}->{'References'};
    if (!$data_field) {
      push(@$no_references, $msg_key);
    }
    next if !$data_field or $data_field !~ /\S/;
    $data_field =~ s/^<(.*)>$/$1/;
    my @references = split />\s*</, $data_field;
    foreach my $reference (@references) {
      $reference_index->{$reference} = []   # create the anonymous target array

       if !defined($reference_index->{$reference});  # if it doesn't exist
      push @{$reference_index->{$reference}}, $msg_key;
    }
  }
}


The above very hadily preserves all sequence numbers of posts referencing any
given Message-ID.  The following is more focused on getting just one [hoefully
the best, but more importantly cogent].  It is used in the active process of
loading a threaded mail index.  As I said, the focus here is on keeping the
data lean and focused.  That is why, instead of dealing with the whole list of
References items, I just pop until I find the one that seems best suited, then
get out, leaving extraneous data behind.  I tried using different data
structures to hold different kinds of relationships, and ended up fighting with
the loading routine all day.  This works:


sub establish_parentage {
  my ($message_details, $subjects, $message_ids, $threads) = @_;

  my $no_references = [];
  my $candidate_roots = [sort {$a <=> $b} keys %$message_details];
  my $parents = {};
  my $temp = [];
  while (my $candidate = shift @$candidate_roots) {
    seek_explicit_references($candidate, $message_details,
     $no_references, $parents, $message_ids);
  }
  $candidate_roots = $no_references;
  $no_references = [];
  while (my $candidate = shift @$candidate_roots) {
    seek_implicit_references($candidate, $message_details, $no_references,
$parents,
     $subjects);
  }

  return $parents;
}

sub seek_explicit_references {
  my ($candidate, $message_details, $no_references, $parents, $message_ids) =
@_;

  my $details = $message_details->{$candidate};
  return if $details->{'parent'};
  my $references = $details->{'References'};
  $references =~ s/^\s+//;
  $references =~ s/\s+$//;
  my @references = split /\s+/, $references;
  my $found = 0;
  while (my $reference = pop @references) {
    if (my $parent_sequence = $message_ids->{$reference}) {
      $details->{'parent'} = $parent_sequence;
      $parents->{$candidate} = $parent_sequence;
      $found = 1;
      last;
    }
  }
  push @$no_references, $candidate if not $found;
}

sub seek_implicit_references {
  my ($candidate, $message_details, $no_references, $parents, $subjects) = @_;
#  We won't go here.  I'm still feeling the pain, but this all is now working.
#  I used to call this function try_idiot_threading, but this is slightly more
clear.
#  I guess I should name it seek_subject_based_references()

Joseph


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: how to optimize a string search

Reply via email to