On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote:
> rbt wrote:
> > Here's the scenario:
> >
> > You have many hundreds of gigabytes of data... possibly even a
> > terabyte or two. Within this data, you have private, sensitive
> > information (US social security numbers) about your company's
> > clients. Your company has generated its own unique ID numbers to
> > replace the social security numbers.
> >
> > Now, management would like the IT guys to go through the old data
> > and replace as many SSNs with the new ID numbers as possible. You
> > have a tab-delimited txt file that maps the SSNs to the new ID
> > numbers. There are 500,000 of these number pairs. What is the most
> > efficient way to approach this? I have done small-scale
> > find-and-replace programs before, but the scale of this is larger
> > than what I'm accustomed to.
> >
> > Any suggestions on how to approach this are much appreciated.
>
> Is this huge amount of data to search/replace stored in an RDBMS, or
> in flat file(s) with markup (XML, CSV, ...)?
>
> --
> Gilles
The data is in files, mostly Word documents and Excel spreadsheets. The
SSN map I have is a plain text file with a format like this:

ssn-xx-xxxx	new-id-xxxx
ssn-xx-xxxx	new-id-xxxx
etc.

There are a bit more than 500K of these pairs.

Thank you,
rbt
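For the plain-text cases, a minimal sketch of one common approach is to
build a dict from the map file and make a single regex pass over each
file. The SSN pattern below assumes the standard 123-45-6789 form, and
the names ssn_map.txt and /data/archive are placeholders, not anything
from the thread. Binary .doc and .xls files are skipped here, since
byte-level replacement would corrupt them; those would need
application-level automation (e.g. COM scripting of Word and Excel on
Windows) or a format-aware library instead.

import csv
import os
import re

def load_map(map_path):
    """Load the tab-delimited SSN -> new-ID map into a dict.
    500K pairs fit comfortably in memory."""
    mapping = {}
    with open(map_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 2:
                mapping[row[0]] = row[1]
    return mapping

# Assumed SSN shape (123-45-6789); adjust if the data uses other forms.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def replace_ssns(text, mapping):
    # One scan per file: find anything SSN-shaped, swap it only if the
    # map knows it, and leave unmapped hits untouched rather than guess.
    return SSN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)

def process_tree(root, mapping):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable (.doc, .xls, ...): skip
            new_text = replace_ssns(text, mapping)
            if new_text != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(new_text)

if __name__ == "__main__":
    mapping = load_map("ssn_map.txt")        # placeholder map file name
    process_tree("/data/archive", mapping)   # placeholder root directory

The point of the single compiled pattern plus dict lookup is that each
file is scanned once no matter how many pairs are in the map; looping
over 500,000 separate search-and-replace passes per file would be
hopelessly slow at this scale.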