On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> rbt wrote:
> > Here's the scenario:
> >
> > You have many hundred gigabytes of data... possible even a terabyte or
> > two. Within this data, you have private, sensitive information (US
> > social security numbers) about your company's clients. Your company has
> > generated its own unique ID numbers to replace the social security numbers.
> >
> > Now, management would like the IT guys to go thru the old data and
> > replace as many SSNs with the new ID numbers as possible.
>
> This question is grossly OT; it's nothing at all to do with Python.
> However ....
>
> (0) Is this homework?

No, it is not.

> (1) What to do with an SSN that's not in the map?

Leave it be.

> (2) How will a user of the system tell the difference between "new ID
> numbers" and SSNs?

We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.

> (3) Has the company really been using the SSN as a customer ID instead
> of an account number, or have they been merely recording the SSN as a
> data item? Will the "new ID numbers" be used in communication with the
> clients? Will they be advised of the new numbers? How will you handle
> the inevitable cases where the advice doesn't get through?

My task is purely technical.

> (4) Under what circumstances will it not be possible to replace *ALL*
> the SSNs?

I do not understand this question.

> (5) For how long can the data be off-line while it's being transformed?

The data is on file servers that are unused on weekends and nights.

> >
> > You have a tab
> > delimited txt file that maps the SSNs to the new ID numbers. There are
> > 500,000 of these number pairs.
>
> And what is the source of the SSNs in this file??? Have they been
> extracted from the data? How?

That is irrelevant.

> > What is the most efficient way to
> > approach this? I have done small-scale find and replace programs before,
> > but the scale of this is larger than what I'm accustomed to.
> >
> > Any suggestions on how to approach this are much appreciated.
>
> A sensible answer will depend on how the data is structured:
>
> 1. If it's in a database with tables some of which have a column for
> SSN, then there's a solution involving SQL.
>
> 2. If it's in semi-free-text files where the SSNs are marked somehow:
>
> ---client header---
> surname: Doe first: John initial: Q SSN:123456789 blah blah
> or
> <ssn>123456789</ssn>
>
> then there's another solution which involves finding the markers ...
>
> 3. If it's really free text, like
> """
> File note: Today John Q.
> Doe telephoned to advise that his Social
> Security # is 123456789 not 987654321 (which is his wife's) and the soc
> sec numbers of his kids Bob & Carol are ....
> """
> then you might be in some difficulty ... google("TREC")
>
>
> AND however you do it, you need to be very aware of the possibility
> (especially with really free text) of changing some string of digits
> that's NOT an SSN.

That's possible, but I think not probable.

--
http://mail.python.org/mailman/listinfo/python-list
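
For case 2 in the quoted list (SSNs flagged by a marker such as "SSN:" or
"<ssn>"), the marker-driven replacement might look roughly like the sketch
below. It rests on several assumptions that the thread does not confirm:
SSNs are stored as nine unformatted digits, the two markers shown are the
only ones, and the map file has exactly two tab-separated columns. The names
load_map, replace_ssns and rewrite_stream, and the sample new ID "A0000001",
are made up for illustration. SSNs missing from the map are left as they are,
and digit strings without a marker are never touched.

```python
import csv
import io
import re

# Assumed markers: "SSN:" or "<ssn>", followed by exactly nine digits.
# (?!\d) rejects longer digit runs such as a 10-digit number.
SSN_PATTERN = re.compile(r"(SSN:|<ssn>)(\d{9})(?!\d)")


def load_map(lines):
    """Build {ssn: new_id} from tab-delimited 'ssn<TAB>new_id' lines."""
    reader = csv.reader(lines, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def replace_ssns(text, ssn_map):
    """Swap marked SSNs for their new IDs; unmapped SSNs stay unchanged."""
    def swap(match):
        marker, ssn = match.group(1), match.group(2)
        return marker + ssn_map.get(ssn, ssn)
    return SSN_PATTERN.sub(swap, text)


def rewrite_stream(src, dst, ssn_map):
    """Copy src to dst one line at a time so memory use stays small."""
    for line in src:
        dst.write(replace_ssns(line, ssn_map))


if __name__ == "__main__":
    # Hypothetical one-entry map; the real file would hold 500,000 pairs.
    mapping = load_map(io.StringIO("123456789\tA0000001\n"))
    sample = "surname: Doe first: John initial: Q SSN:123456789 blah blah\n"
    rewrite_stream(io.StringIO(sample), io.TextIOWrapper(io.BytesIO()), mapping)
    print(replace_ssns(sample, mapping), end="")
```

A dict of 500,000 pairs fits comfortably in memory, so each lookup is
constant time and the job is mostly limited by disk I/O. Writing to a new
file rather than editing in place leaves the original available for checking
before it is replaced.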