rbt wrote: > Here's the scenario: > > You have many hundred gigabytes of data... possible even a terabyte or > two. Within this data, you have private, sensitive information (US > social security numbers) about your company's clients. Your company has > generated its own unique ID numbers to replace the social security numbers. > > Now, management would like the IT guys to go thru the old data and > replace as many SSNs with the new ID numbers as possible. You have a tab > delimited txt file that maps the SSNs to the new ID numbers. There are > 500,000 of these number pairs. What is the most efficient way to > approach this? I have done small-scale find and replace programs before, > but the scale of this is larger than what I'm accustomed to. > > Any suggestions on how to approach this are much appreciated.
I'm just tossing this out. I don't really have much experience with the algorithm, but when searching for multiple targets, an Aho-Corasick automaton might be appropriate. http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ Good luck! -- Robert Kern [EMAIL PROTECTED] "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter -- http://mail.python.org/mailman/listinfo/python-list