This seems to work. I’m inferring the | is present in each line that needs to be fixed.
import pandas import logging class Wrapper: """Wrap file to fix up data""" def __init__(self, filename): self.filename = filename def __enter__(self): self.fh = open(self.filename,'r') return self def __exit__(self, exc_type, exc_val, exc_tb): self.fh.close() def __iter__(self): """This is required by pandas for some reason, even though it doesn't seem to be called""" raise ValueError("Unsupported operation") def read(self, n: int): """Read data. Replace 'grace' before | if it has underscores in it""" try: data = self.fh.readline() ht = data.split('|', maxsplit=2) if len(ht) == 2: head,tail = ht hparts = head.split(maxsplit=7) assert len(hparts) == 8 if ' ' in hparts[7].strip(): hparts[7] = hparts[7].strip().replace(' ','_') fixed_data = f"{' '.join(hparts)} | {tail}" return fixed_data return data except: logging.exception("read") logging.basicConfig() with Wrapper('data.txt') as f: df = pandas.read_csv(f, delimiter=r"\s+") print(df) From: Python-list <python-list-bounces+gweatherby=uchc....@python.org> on behalf of Loris Bennett <loris.benn...@fu-berlin.de> Date: Wednesday, November 23, 2022 at 2:00 PM To: python-list@python.org <python-list@python.org> Subject: Preprocessing not quite fixed-width file before parsing *** Attention: This is an external email. Use caution responding, opening attachments or clicking on links. *** Hi, I am using pandas to parse a file with the following structure: Name fileset type KB quota limit in_doubt grace | files quota limit in_doubt grace shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none nametoolong sharedhome USR 27418496 524288000 545259520 0 none | 11581 0 0 0 none I was initially able to use df = pandas.read_csv(file_name, delimiter=r"\s+") because all the values for 'grace' were 'none'. Now, however, non-"none" values have appeared and this fails. I can't use pandas.read_fwf even with an explicit colspec, because the names in the first column which are too long for the column will displace the rest of the data to the right. The report which produces the file could in fact also generate a properly delimited CSV file, but I have a lot of historical data in the readable but poorly parsable format above that I need to deal with. If I were doing something similar in the shell, I would just pipe the file through sed or something to replace '5 days' with, say '5_days'. How could I achieve a similar sort of preprocessing in Python, ideally without having to generate a lot of temporary files? Cheers, Loris -- This signature is currently under constuction. -- https://urldefense.com/v3/__https://mail.python.org/mailman/listinfo/python-list__;!!Cn_UX_p3!hBypaGqqmBaUa_w_PNTK9VelYEJCChO6c7d8k1yz6N56806CJ0wtAfLhvj5UaWrGaccJTzKxrjQJCil9DJ470VZWO4fOfhk$<https://urldefense.com/v3/__https:/mail.python.org/mailman/listinfo/python-list__;!!Cn_UX_p3!hBypaGqqmBaUa_w_PNTK9VelYEJCChO6c7d8k1yz6N56806CJ0wtAfLhvj5UaWrGaccJTzKxrjQJCil9DJ470VZWO4fOfhk$> -- https://mail.python.org/mailman/listinfo/python-list