Re: Retrieving partial field values

SGreen Wed, 05 Jan 2005 07:28:55 -0800

Stephen Moretti <[EMAIL PROTECTED]> wrote on 01/05/2005 07:22:18 AM:

> Kentucky Families wrote:
> 
> >Very new to MySQL and this list. Thanks in advance for your 
> patience with elementary questions. I'm in process of converting a 
> current web site to a mysql database. As this database will 
> eventually contain millions of records, I want to set it up properly.
> > 
> >The database will contain transcriptions of birth, marriage and 
> death records. For surname fields, I would like for my transcribers 
> to be able to enter multiple values for records that are difficult 
> to read. For instance, a surname that could be read Stotts or Statts
> or Stutts. If I use a VARCHAR or TINYEXT field to enter these values
> and I want to be able to retrieve all records where the surname 
> field contains the whole word Stotts, how would I enter these values:
> > 
> >stotts or statts or stutts
> >stotts,statts,stutts
> >other?
> > 
> >
> You need to do an IN query.
> 
> SELECT column,list,here
> FROM tblBMD
> WHERE Surname IN ('stotts','statts','stutts')
> 
> http://dev.mysql.com/doc/mysql/en/Comparison_Operators.html#IDX1235
> 
> Regards
> 
> Stephen
>


Stephen, 

I don't think you understood the question.  She wants to allow for 
multiple values within a single field, not to check several rows of a 
single-value field.  She explained why very well.

I can't make a solid recommendation but I would assume that a full-text 
index on your surname field is what you are after. If you separated each 
name possibility with a simple space, the FT indexer would pick up on the 
word breaks and index each name individually (assuming that the name is 
not on the "stop words" list and that it is at least 4 characters long). 

An only slightly more complex (but more normalized) design would use a 
separate table of surnames. Each surname has its own ID. Actually, you 
would need 2 new tables: a surnames table and a table to associate a 
surname to a record. 

The surnames table could look like:

CREATE TABLE surname (
        id int auto_increment
        , name varchar(150) not null
        , PRIMARY KEY (ID)
        , UNIQUE(name)
)

and the table that would link your surnames list to the "record" table 
(birth record, death record, etc.):

CREATE TABLE record_surname (
        record_id int not null
        , surname_id int not null
        , confidence float
        , PRIMARY KEY(record_id, surname_id)
)

This not only gives you the ability to store each surname individually, it 
helps with _partial_ name matches(!!), and it gives you the ability to 
assign a confidence level to what the guess is. For your example if a name 
looks like stotts, statts, or stutts the transcriptionist could weigh one 
as being more likely than the others. What you use to represent those 
levels are up to you (percents, scale 1-to-10, simple order,... it doesn't 
matter how you rank them but this lets you do it.)

It would be very fast to do a surname search and if not found, stop there. 
If you do find a surname (or list of surnames that match a pattern) you 
could easily take those ids and query the other tables to retrieve 
whatever other details you may have.

You may consider doing the same thing with "given names" (first names) as 
they could be just as hard to decipher from the older records. That way if 
you have Jeb or Job as a first name and Statt or Stott as a last name you 
could store all 4 combinations in a fairly compact and very searchable 
way.

Just a suggestion,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Re: Retrieving partial field values

Reply via email to