Re: Counting the number of times a string matches in another string

scott . e . robinson Thu, 13 Mar 2003 07:48:21 -0800

Okay, since I wasn't clear the first time, let me try again.  Sorry, I'm
not a professional programmer; I'm a true beginner.

SHORT VERSION:

What I want is Joseph's option 2b, if I understand him correctly.  Given a
string of the form

:M260:

I want to get a count of its occurrences in a single other string of the
form

:L000:W000:M260:B271:8:A:

(Incidentally, the first string, :M260:, came from a line like
:L121:M260:B250:L000:, but saying so probably only confused the issue.
Since I want to compare each colon-delimited "chunk" of the source string
to the target string, I assume I will need to loop through each chunk in
the source string, comparing each "chunk" to the target string.)
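A minimal sketch of that count, written in Python purely for illustration (the thread itself shows no code, and the name count_chunk is made up here). It splits the target on colons instead of searching for the substring ":M260:", on the assumption that two adjacent copies should both count: they share a colon, so a plain substring count would see only one of them.

```python
def count_chunk(chunk, target):
    """Count how many colon-delimited chunks of target equal chunk.

    Both arguments may carry leading/trailing colons, e.g. ':M260:'.
    """
    wanted = chunk.strip(":")
    # split(':') yields empty strings for the outer colons; skip them
    return sum(1 for piece in target.split(":") if piece and piece == wanted)


print(count_chunk(":M260:", ":L000:W000:M260:B271:8:A:"))           # 1
print(count_chunk(":M260:", ":L520:T400:M260:S000:M260:14:E214:"))  # 2
```

Looping over every chunk of the source string then just means calling this once per chunk and adding up the results.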

LONG VERSION:

I think John W. Krahn's post is very close to what I was actually asking.
John's solution showed me that the context of the problem may affect the
choice of solution, so let me explain the entire problem, in case someone
wants to look at the whole thing:

I am comparing a "source" set of oil well names to a "target" set of oil
well names to find the closest matches.  Each well name consists of two
parts: a name (like "McAffey" or "Idd El Shargi") and a number (like B-1).
I print out the 50 or so best-matching "target" names for each name in the
"source" list.

I split the well name into "chunks" based on whitespace and punctuation.  I
also remove leading zeroes.  If letters and numbers adjoin, I split those
into separate chunks as well.  I reduce each word in the well name to a
Soundexed string and chain them together in a single string delimited by
colons.  I divide the well number into its separate numeric and alpha parts
and include them in the string, but I don't Soundex them.  I end up with
data of the form

:L520:T400:C000:S000:L200:8:
:L520:T400:C000:S000:L200:8:
:L520:T400:C000:S000:L200:24:E214:
:L520:T400:C000:S000:M:24:E214:
:L520:T400:C000:S000:L200:14:E214:
:L520:T400:C000:S000:L200:14:E214:
:L520:M260:C000:S000:L200:14:E214:
:L520:T400:M260:S000:M260:14:E214:
:L520:T400:C000:S000:L200:14:E214:
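The preparation steps described above could look roughly like the following Python sketch. Several things here are assumptions, not details from the post: the name and the number arrive as two separate strings, the Soundex variant is the standard American one, and the helper names (soundex, chunks, strip_zeros, make_key) are invented for the example.

```python
import re

# Standard American Soundex letter groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of an alphabetic word."""
    word = word.upper()
    code = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":          # H and W do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def chunks(text):
    """Split into runs of letters and runs of digits; drop everything else."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def strip_zeros(chunk):
    """Remove leading zeroes from a numeric chunk; upper-case an alpha one."""
    if chunk.isdigit():
        return chunk.lstrip("0") or "0"
    return chunk.upper()


def make_key(name, number):
    """Build the colon-delimited key: Soundexed name words, raw number parts."""
    name_part = [soundex(c) if c.isalpha() else strip_zeros(c)
                 for c in chunks(name)]
    number_part = [strip_zeros(c) for c in chunks(number)]
    return ":" + ":".join(name_part + number_part) + ":"


print(make_key("Idd El Shargi", "B-01"))   # :I300:E400:S620:B:1:
```

The regular expression does two of the steps at once: anything that is not a letter or digit acts as a separator, and a letter run next to a digit run becomes two chunks.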

To compare one well name to another to see how well they match, I compare
each "chunk" from each name in the "source" list to each "chunk" in each
name in the "target" list in a doubly-nested loop, which is quite
expensive.  I accumulate a score which is just the number of matches.  Then
I sort by the score and print the top 100.
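For reference, the doubly-nested comparison and the ranking step might be sketched like this in Python. The function names and the top_n parameter are hypothetical; the post does not show the actual code.

```python
def split_key(key):
    """':L520:M260:' -> ['L520', 'M260'] (empty edge pieces dropped)."""
    return [c for c in key.split(":") if c]


def score(source_key, target_key):
    """Number of (source chunk, target chunk) pairs that are equal."""
    total = 0
    for s in split_key(source_key):          # outer loop: source chunks
        for t in split_key(target_key):      # inner loop: target chunks
            if s == t:
                total += 1
    return total


def best_matches(source_key, target_keys, top_n=100):
    """Return (score, target) pairs, best first, at most top_n of them."""
    scored = [(score(source_key, t), t) for t in target_keys]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


targets = [":L520:T400:C000:S000:L200:8:",
           ":L520:T400:M260:S000:M260:14:E214:"]
for points, target in best_matches(":L520:M260:C000:", targets):
    print(points, target)
# 3 :L520:T400:M260:S000:M260:14:E214:
# 2 :L520:T400:C000:S000:L200:8:
```

The cost is one comparison per pair of chunks for every pair of names, which is why it gets slow on long lists.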

I thought string comparison would eliminate the need to loop across the
"target" string and make the whole thing run much faster, but I need to
know how many times the source is found in the target.  That's what I
couldn't figure out how to do without looping.
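One way to avoid the inner loop, sketched here in Python as an assumption and not as anyone's posted solution, is to tally each target string's chunks once and then look each source chunk up in that tally. The helper names are invented for the example.

```python
from collections import Counter


def chunk_counts(key):
    """Tally the chunks of one colon-delimited key."""
    return Counter(c for c in key.split(":") if c)


def score(source_key, target_counts):
    """Sum, over the source chunks, how often each occurs in the target.

    A Counter returns 0 for chunks it has never seen, so no test is needed.
    """
    return sum(target_counts[c] for c in source_key.split(":") if c)


# Build the tally once per target, then reuse it for every source name.
target_counts = chunk_counts(":L520:T400:M260:S000:M260:14:E214:")
print(score(":M260:", target_counts))            # 2
print(score(":L520:M260:C000:", target_counts))  # 3
```

This gives the same number as comparing every pair of chunks, but each source chunk costs a single lookup regardless of how long the target is.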

I have also realized I need to do two separate comparison passes, one for
the well name part (the Soundexed chunks) and a second pass on the
high-scoring well names to compare the well number parts of them (the pure
alpha or pure numeric chunks).  I need to do this because the well number
"chunks" give a lot of artificially high scores, since simple strings like
"B" and "1" are pretty common in well numbers.

I think John Krahn's post is just about what I was asking for.  I just need
a way to process the Soundexed "chunks" in the first comparison, and only
the non-Soundexed (pure alpha or pure numeric) "chunks" in the second
comparison.
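The two passes could be sketched as below, again in Python and with made-up names. Two assumptions are worth flagging: a chunk is treated as Soundexed if it looks like one letter followed by three digits, and the second pass is used only to reorder the survivors of the first (name score first, number score as tie-breaker). The post does not say how the two scores should be combined.

```python
import re
from collections import Counter

SOUNDEX = re.compile(r"[A-Z][0-9]{3}")   # shape of a Soundex code, e.g. M260


def split_key(key):
    """Return (soundexed_chunks, other_chunks) for a colon-delimited key."""
    name, number = [], []
    for c in key.split(":"):
        if not c:
            continue
        (name if SOUNDEX.fullmatch(c) else number).append(c)
    return name, number


def overlap(source_chunks, target_chunks):
    """Total occurrences in target_chunks of each chunk in source_chunks."""
    counts = Counter(target_chunks)
    return sum(counts[c] for c in source_chunks)


def two_pass(source_key, target_keys, keep=50):
    s_name, s_num = split_key(source_key)

    def name_score(t):
        return overlap(s_name, split_key(t)[0])

    def number_score(t):
        return overlap(s_num, split_key(t)[1])

    # Pass 1: rank on the Soundexed chunks only and keep the best few.
    survivors = sorted(target_keys, key=name_score, reverse=True)[:keep]
    # Pass 2: among the survivors, break ties with the number chunks.
    return sorted(survivors, key=lambda t: (name_score(t), number_score(t)),
                  reverse=True)


targets = [":L520:T400:C000:24:", ":L520:T400:C000:8:", ":X000:8:"]
print(two_pass(":L520:T400:8:", targets, keep=2))
# [':L520:T400:C000:8:', ':L520:T400:C000:24:']
```

Because ":X000:8:" is dropped in the first pass, its match on the common chunk "8" never gets a chance to inflate its rank.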

Thanks,

Scott

Scott E. Robinson
SWAT Team
UTC Onsite User Support
RR-690 -- 281-654-5169
EMB-2813N -- 713-656-3629

                      "R. Joseph Newton"                                               

                      <[EMAIL PROTECTED]>
                      To:       [EMAIL PROTECTED]
                      cc:       Mark Anderson <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
                      Subject:  Re: COunting the number of times a string matches in another string
                      03/12/03 07:19 PM

[EMAIL PROTECTED] wrote:

> Thanks, Rob and Mark, but I'm pretty sure I'm trying to do something a
> little different from a count hash.  Each token in the candidate string
> needs to be compared separately to all the target strings, and then count
> the number of matches.  So take any token out of the first string --
> :M260: for example -- and count its matches against one of the target
> strings, like :L520:M260:C000:S000:L200:14:E214:.  I don't think a count
> hash does that??
>
> Thanks,
>
> Scott
>
> Scott E. Robinson
> SWAT Team
> UTC Onsite User Support
> RR-690 -- 281-654-5169
> EMB-2813N -- 713-656-3629

Hi Scott,

The first task of a programmer is to develop a clear specification for the
functionality desired.  It looks like you need to do some work here,
because the specification is somewhat ambiguous.  Right offhand, I can
think of two interpretations for the functionality you describe:

1)  You wish to get the count of occurrences, in each other line, of each
line in some basis line.  You indicate something about the first line.  Is
this line somehow distinct from the others, that it is used as a basis?

2)
  a) You wish to get the count, for each unique token in each line, of
occurrences in each other line.
  b) You wish to get the count, for each token in each line, of occurrences
in each other line.
Either of these would produce a lot more output, and require much more
processing.  Alternative b would also be redundant.
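The difference between alternatives 2a and 2b can be seen in a small Python sketch (illustrative only; the function names and the unique flag are hypothetical). With unique=True each distinct token is counted once, as in 2a; with unique=False a repeated token is reported once per repetition, as in 2b.

```python
def tokens(line):
    """Split a colon-delimited line into its non-empty tokens."""
    return [t for t in line.split(":") if t]


def count_per_token(source, target, unique=False):
    """Return (token, occurrences in target) for tokens of source."""
    src = tokens(source)
    if unique:
        src = list(dict.fromkeys(src))   # drop repeats, keep first-seen order
    tgt = tokens(target)
    return [(t, tgt.count(t)) for t in src]


source = ":L520:T400:M260:S000:M260:14:E214:"
target = ":L520:M260:C000:S000:L200:14:E214:"

print(count_per_token(source, target, unique=True))    # 2a: M260 listed once
print(count_per_token(source, target, unique=False))   # 2b: M260 listed twice
```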

What precisely are you trying to get?

Joseph

