Re: parsing tab and newline delimited text

Tim Chase Tue, 03 Aug 2010 19:56:57 -0700

On 08/03/10 21:14, elsa wrote:

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.


So, an individual entry might have this form (in printed form):

Title    date   position   data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

My first question is whether the line contains actual newline/tab characters within the field data, or the string representation of the line. For one of the lines in question, what does


  print repr(line)

(or "print line.encode('hex')") produce? If the line has extra literal tabs, then you may be stuck; if the line has escaped text (a backslash followed by an "n" or "t", i.e. 2 characters) then it's pretty straightforward. Ideally, you'd see something like


  >>> print repr(line)
  'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
          ^tab        ^tab ^tab        ^backslash^

where the backslashes are literal.
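
As a rough sketch of that check (written in Python 3 syntax; the sample line and the helper name "describe" are made up for illustration, not taken from your file), you can count real tab characters versus two-character backslash sequences:

```python
def describe(line):
    """Count literal tabs vs. escaped (backslash + letter) sequences."""
    return {
        'literal_tabs': line.count('\t'),       # real tab characters
        'escaped_tabs': line.count('\\t'),      # backslash followed by "t"
        'escaped_newlines': line.count('\\n'),  # backslash followed by "n"
    }

# Hypothetical line: three real tabs, plus a backslash-"n" pair in the data
line = 'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
print(repr(line))
print(describe(line))
```

If literal_tabs comes out higher than the number of field separators you expect, the data field holds real tabs rather than escaped ones.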

If you know that it's only the last ("data") field that can contain such characters, you can at least cope with embedded tabs by limiting the number of splits to the first N:


  parts = line.split('\t', 3)
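
To illustrate the difference, here is a small sketch (Python 3 syntax; the sample line is invented, and it assumes exactly four fields with stray tabs appearing only in the last one):

```python
# Hypothetical line whose last field contains a literal tab
line = 'MyTitle\t2010-08-02\t42\tabc\tdef'

naive = line.split('\t')      # splits on every tab: 5 pieces
parts = line.split('\t', 3)   # at most 3 splits: 4 pieces

# The embedded tab stays inside the last field
title, date, position, data = parts
print(naive)
print(parts)
```

The second argument is the maximum number of splits, so everything after the third tab lands in "data" untouched.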

That doesn't solve the newline problem, though: your file's definition gives you no way to disambiguate a case like


filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

Would xxxx/yyyy/zzzz/wwww be a continuation of data1, or are they the items in the next row?
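
A quick sketch of why neither answer can be ruled out (Python 3 syntax; the variable names are just for illustration). Both readings produce well-formed 4-field rows, so the field count can't tell them apart:

```python
filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

# Reading 1: every newline ends a record -> two 4-field rows
two_rows = [row.split('\t', 3) for row in filedata.splitlines()]

# Reading 2: the first newline belongs to the data field -> one 4-field row
one_row = [filedata.rstrip('\n').split('\t', 3)]

print(two_rows)
print(one_row)
```

Telling the two apart needs something outside the format as you've described it, such as knowing what a valid title or date looks like.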


-tkc



