Ramprasad A Padmanabhan wrote: > I know this is more of an algorithm question but please bear with me. > > In my program I am checking wether a emailid exists in a list > I have in the complete_list a string like > $complete_string="<[EMAIL PROTECTED]> <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> ... > <[EMAIL PROTECTED]>"; > > # that is a string of emailids sorted > > Now I want to find out if another ID "<[EMAIL PROTECTED]>" exists in this list > > How is it possible to optimize the search given that the complete list > string is sorted in order
Don't optimize--at least not until you are sure your getting the information right. It is just the wrong emphasis. Here are a couple of approaches that have worked for me. The first, used in building an index, is predicated on the concept of saving all possible information. The second, used in loading a tree structure, takes as a design imperative that any node should focus only on one prime reference, in this case the post being replied to. The wealth of possibilities can be a recipe for inifnite loops when loading a tree, hence it's better to throw away some information than to follow potentially circular links. First approach: Setup: The only information available to this routine is the $message_info hash reference. This has the form: $message_info => { $sequence_number1 => { 'Message-ID' => $message_id, Subject => $subject ... } ... } The references item is a potentially multi-item string, and this routine handles this. $reference_index and $no_references are sent into the gather_References_items function to load their data structures. The latter, no_references, is an afterthought. sub create_References_index { my $message_info = $_[0]; my $reference_index = {}; my $no_references = []; gather_References_items($message_info, $reference_index, $no_references); open OUT, ">dbm/referenc.ndi" or die "Could not open file to write Reference index: $!"; foreach my $column_value (sort {uc $a cmp uc $b} keys %$reference_index) { my $index_row = column_index_line($reference_index, $column_value); print OUT "$index_row\n"; } close OUT or die "Could not close file after writing Reference index: $!"; create_no_reference_index($no_references); } sub gather_References_items { my ($message_info, $reference_index, $no_references) = @_; foreach my $msg_key (keys %$message_info) { my $data_field = $message_info->{$msg_key}->{'References'}; if (!$data_field) { push(@$no_references, $msg_key); } next if !$data_field or $data_field !~ /\S/; $data_field =~ s/^<(.*)>$/$1/; my @references = split />\s*</, $data_field; foreach my $reference (@references) { $reference_index->{$reference} = [] # create the anonymous target array if !defined($reference_index->{$reference}); # if it doesn't exist push @{$reference_index->{$reference}}, $msg_key; } } } The above very hadily preserves all sequence numbers of posts referencing any given Message-ID. The following is more focused on getting just one [hoefully the best, but more importantly cogent]. It is used in the active process of loading a threaded mail index. As I said, the focus here is on keeping the data lean and focused. That is why, instead of dealing with the whole list of References items, I just pop until I find the one that seems best suited, then get out, leaving extraneous data behind. I tried using different data structures to hold different kinds of relationships, and ended up fighting with the loading routine all day. This works: sub establish_parentage { my ($message_details, $subjects, $message_ids, $threads) = @_; my $no_references = []; my $candidate_roots = [sort {$a <=> $b} keys %$message_details]; my $parents = {}; my $temp = []; while (my $candidate = shift @$candidate_roots) { seek_explicit_references($candidate, $message_details, $no_references, $parents, $message_ids); } $candidate_roots = $no_references; $no_references = []; while (my $candidate = shift @$candidate_roots) { seek_implicit_references($candidate, $message_details, $no_references, $parents, $subjects); } return $parents; } sub seek_explicit_references { my ($candidate, $message_details, $no_references, $parents, $message_ids) = @_; my $details = $message_details->{$candidate}; return if $details->{'parent'}; my $references = $details->{'References'}; $references =~ s/^\s+//; $references =~ s/\s+$//; my @references = split /\s+/, $references; my $found = 0; while (my $reference = pop @references) { if (my $parent_sequence = $message_ids->{$reference}) { $details->{'parent'} = $parent_sequence; $parents->{$candidate} = $parent_sequence; $found = 1; last; } } push @$no_references, $candidate if not $found; } sub seek_implicit_references { my ($candidate, $message_details, $no_references, $parents, $subjects) = @_; # We won't go here. I'm still feeling the pain, but this all is now working. # I used to call this function try_idiot_threading, but this is slightly more clear. # I guess I should name it seek_subject_based_references() Joseph -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]