On Fri, 2005-06-17 at 12:33 +1000, John Machin wrote:
> OK then, let's ignore the fact that the data is in a collection of Word
> & Excel files, and let's ignore the scale for the moment. Let's assume
> there are only 100 very plain text files to process, and only 1000 SSNs
> in your map, so it doesn't have to be very efficient.
>
> Can you please write a few lines of Python that would define your task
> -- assume you have a line of text from an input file, show how you would
> determine that it needed to be changed, and how you would change it.
The script is too long to post in its entirety. In short, I open the
files and do a binary read on them (in 1MB chunks, to keep memory usage
down). Each chunk goes into a variable, and that in turn into a list, to
which I then apply the following re:

    ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

    for chunk in whole_file:
        search = ss.findall(chunk)
        if search:
            validate(search)

The validate function makes sure the string found is indeed in the range
of a legitimate SSN. You may read about this range here:

http://www.ssa.gov/history/ssn/geocard.html

That is as far as I have gotten. And I hope you can tell that I have
placed some small amount of thought into the matter. I've tested the
find a lot and it is rather accurate in finding SSNs in files. I have
not yet started replacing anything. I've only posted here for advice
before beginning.

> >>(4) Under what circumstances will it not be possible to replace *ALL*
> >>the SSNs?
> >
> > I do not understand this question.
>
> Can you determine from the data, without reference to the map, that a
> particular string of characters is an SSN?

See above.

> If so, and it is not in the map, why can it not be *added* to the map
> with a new generated ID?

It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory? ;)

> >>And what is the source of the SSNs in this file??? Have they been
> >>extracted from the data? How?
> >
> > That is irrelevant.
>
> Quite the contrary. If they had been extracted from the data,

They have not. They are generated by a higher authority and then used by
lower authorities such as me.

Again though, I think this is irrelevant to the task at hand... I have a
map, I have access to the data and that is all I need to have, no? I do
appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?
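
For concreteness, the chunked scan described above could be sketched
roughly as follows. This is only a minimal sketch, not the actual
script: the names read_chunks, looks_valid and find_ssns are
hypothetical, the pattern is written as a bytes pattern on the
assumption that the chunks come from a binary read, and looks_valid is a
placeholder that only rejects all-zero fields (which are never issued).
It stands in for the real validate function and its full range check.

```python
import io
import re

# Bytes pattern, because the files are read in binary mode.
SSN_RE = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described in the post


def read_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield successive binary chunks from an open file object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def looks_valid(candidate):
    """Placeholder check (assumption): reject all-zero area, group or serial."""
    area, group, serial = candidate.split(b'-')
    return area != b'000' and group != b'00' and serial != b'0000'


def find_ssns(fileobj, chunk_size=CHUNK_SIZE):
    """Return every SSN-shaped string that passes the placeholder check."""
    found = []
    for chunk in read_chunks(fileobj, chunk_size):
        for candidate in SSN_RE.findall(chunk):
            if looks_valid(candidate):
                found.append(candidate)
    return found


if __name__ == "__main__":
    sample = io.BytesIO(b"id 123-45-6789, bad 000-12-3456\n")
    print(find_ssns(sample))
```

One caveat with scanning chunk by chunk in this way: a string that
straddles the boundary between two chunks is not seen whole by either
findall call, so it would be missed unless consecutive chunks are made
to overlap by a few bytes.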
-- http://mail.python.org/mailman/listinfo/python-list