Thomas Passin <li...@tompassin.net> writes:

> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>> Hi,
>>
>> I am using pandas to parse a file with the following structure:
>>
>>   Name        fileset    type         KB     quota     limit in_doubt  grace |   files quota limit in_doubt grace
>>   shortname   sharedhome USR    14097664 524288000 545259520        0   none |  107110     0     0        0  none
>>   gracedays   sharedhome USR   774858944 524288000 775946240        0 5 days | 1115717     0     0        0  none
>>   nametoolong sharedhome USR    27418496 524288000 545259520        0   none |   11581     0     0        0  none
>>
>> I was initially able to use
>>
>>     df = pandas.read_csv(file_name, delimiter=r"\s+")
>>
>> because all the values for 'grace' were 'none'.  Now, however,
>> non-"none" values have appeared and this fails.
>>
>> I can't use
>>
>>     pandas.read_fwf
>>
>> even with an explicit colspec, because names in the first column that
>> are too long for the column displace the rest of the data to the
>> right.
>>
>> The report which produces the file could in fact also generate a
>> properly delimited CSV file, but I have a lot of historical data in
>> the readable but poorly parsable format above that I need to deal
>> with.
>>
>> If I were doing something similar in the shell, I would just pipe the
>> file through sed or something to replace '5 days' with, say, '5_days'.
>> How could I achieve a similar sort of preprocessing in Python, ideally
>> without having to generate a lot of temporary files?
>
> This is really annoying, isn't it? A space-separated line with spaces
> in data entries. If the example you give is typical, I don't think
> there is a general solution. If you know there are only certain
> values like that, then you could do a search-and-replace for them in
> Python, just like the example you gave for '5 days'.
>
> If you know that the field that might contain entries with spaces is
> always the same one, e.g., the one just before the "|" marker, you
> could make use of that. But it could be tricky.
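[Editor's note: a minimal sketch of the search-and-replace idea above, assuming, as in the sample data, that the only multi-word value sits directly before the '|' marker and consists of a number followed by a unit word. The sample row is taken from the quoted report; the exact regex is an assumption.]

```python
import re

# Sample row from the quoted report, with a space inside the 'grace' field
line = ("gracedays sharedhome USR 774858944 524288000 775946240 0 "
        "5 days | 1115717 0 0 0 none")

# Join '<number> <word>' into '<number>_<word>' when it appears just
# before the '|' marker, so a plain split() yields a fixed field count.
fixed = re.sub(r"(\d+)\s+([A-Za-z]+)\s+\|", r"\1_\2 |", line)

print(fixed.split())  # '5 days' is now the single token '5_days'
```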
> I don't know how many files like this you will need to process, nor
> how many rows they might contain. If I were to tackle this job, I
> would probably do some quality checking first. Using this example
> file, figure out how many fields there are supposed to be. First,
> split the file into lines:
>
>     with open("filename") as f:
>         lines = f.readlines()
>
>     # Check the space-separated fields defined in the first row
>     fields = lines[0].split()
>     num_fields = len(fields)
>     print(num_fields)
>
>     # Find lines that have the wrong number of fields
>     bad_lines = []
>     for line in lines:
>         fields = line.split()
>         if len(fields) != num_fields:
>             bad_lines.append(line)
>
>     print(len(bad_lines))
>
>     # Inspect a sample
>     for line in bad_lines[:10]:
>         print(line)
>
> This will give you an idea of how many problem lines there are, and
> whether they can all be fixed by a simple replacement. If they can
> and this is the only file you need to handle, just fix it up and run
> it. I would replace the spaces with tabs or commas. Splitting a line
> on spaces (split()) takes care of the issue of having a variable
> number of spaces, so that's easy enough.
>
> If you will need to handle many files, and you can automate the fixes
> - possibly with a regular expression - then you should preprocess
> each file before giving it to pandas. Something like this:
>
>     def fix_line(line):
>         """Test a line for field errors and fix any errors found."""
>         # ....
>         return fixed
>
>     # For each file
>     with open("filename") as f:
>         lines = f.readlines()
>
>     fixed_lines = []
>     for line in lines:
>         fixed = fix_line(line)
>         fields = fixed.split()
>         tabified = '\t'.join(fields)  # Could be done by fix_line()
>         fixed_lines.append(tabified)
>
>     # Now use an io.StringIO buffer to feed the data to pandas
>     # From memory, some details may not be right
>     import io
>     f = io.StringIO('\n'.join(fixed_lines))
>
>     # Give f to pandas as if it were an external file
>     # ...
Thanks to both Gerard and Thomas for the pointer to StringIO.  I ended
up just reading the file line-by-line, using a regex to replace
'<n> <units> |' with '<n><units> |', and writing the new lines to a
StringIO, which I then passed to pandas.read_csv.  The wrapper
approach looks interesting, but it looks like I need to read up more
on context managers before adding that to my own code, otherwise I may
not understand it in a month's time.

Cheers,

Loris

-- 
This signature is currently under construction.
-- 
https://mail.python.org/mailman/listinfo/python-list