Thank you both for helping me out. I am still rather new to Python and so I'm probably trying to reinvent the wheel here.
When I try Paul's suggestion, I get an empty list:

    >>>tokens = line.strip().split()
    []

So I am not quite sure how to read the file line by line.

tokens = input.read().split() gets me all the information from the file.
tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like in the
example; however, how can I loop this over the entire document?

Also, when I try output.write(tokens), I get "TypeError: coercing to
Unicode: need string or buffer, list found". Any ideas?

On Oct 14, 4:25 pm, Paul Hankin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:
>
> > Hi all,
>
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
>
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
>
> > 200-763-1 71-73-8
> > nátrium-tiopentál C11H18N2O2S.Na to:
>
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> > but if I have a chemical like: kyselina močová
>
> > I get:
> > 200-720-7|69-93-2|kyselina|močová
> > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> > and then it is all off.
>
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> In the original file, is every chemical on a line of its own? I assume
> it is here.
>
> You might use a regexp (look at the re module), or I think here you
> can use the fact that only chemicals have spaces in them. Then, you
> can split each line on whitespace (like you're doing), and join back
> together all the words between the 3rd (ie index 2) and the last (ie
> index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
> the somewhat unusual python syntax for replacing a section of a list
> with another list.
>
> The approach you took involves reading the whole file, and building a
> list of all the chemicals which you don't seem to use: I've changed it
> to a per-line version and removed the big lists.
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
> input = codecs.open(path, 'r','utf8')
> output = codecs.open(path2, 'w', 'utf8')
>
> for line in input:
>     tokens = line.strip().split()
>     tokens[2:-1] = [u' '.join(tokens[2:-1])]
>     chemical = u'|'.join(tokens)
>     print chemical + u'\n'
>     output.write(chemical + u'\r\n')
>
> input.close()
> output.close()
>
> Obviously, this isn't tested because I don't have your chem_1_utf8.txt
> file.
>
> --
> Paul Hankin
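
For reference, here is a minimal sketch of how the per-line loop and the write step could fit together. It is written for Python 3 rather than the Python 2 used in the thread, and it assumes each record sits on a single line, as the quoted reply does. The names format_record and convert are made up for illustration, and skipping lines with fewer than four tokens is an assumed policy for blank or incomplete lines, not something taken from the thread. The main point is that write() wants a string, so the token list is joined before it is written.

```python
import io


def format_record(line):
    """Turn 'EINECS CAS name words... formula' into a pipe-separated string."""
    tokens = line.split()
    if len(tokens) < 4:
        # Blank or incomplete line: skip it (assumed policy for this sketch).
        return None
    # Slice assignment: replace everything between the CAS number and the
    # formula with a single item holding the space-joined chemical name.
    tokens[2:-1] = [" ".join(tokens[2:-1])]
    return "|".join(tokens)


def convert(infile, outfile):
    """Read infile one line at a time and write one formatted record per line."""
    for line in infile:
        record = format_record(line)
        if record is not None:
            # Write the joined string, not the token list.
            outfile.write(record + "\r\n")


# Demo with in-memory files standing in for the real input and output.
sample = io.StringIO(
    "200-720-7 69-93-2 kyselina močová C5H4N4O3\n"
    "\n"
    "200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na\n"
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

With real files in Python 3, the two StringIO objects could be replaced by open(path, encoding="utf-8") and open(path2, "w", encoding="utf-8", newline=""); passing newline="" keeps the explicit "\r\n" from being translated again on Windows.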