On Sep 8, 3:53 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Mart. wrote:
> > On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t...@ubisoft.com> wrote:
> >>>>> Hi,
> >>>>> I need to extract a string after a matching a regular expression. For
> >>>>> example I have the string...
> >>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>> and once I match "FTPHOST" I would like to extract
> >>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
> >>>>> problem, I had been trying to match the string using something like
> >>>>> this:
> >>>>> m = re.findall(r"FTPHOST", s)
> >>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
> >>>>> part. Perhaps I need to find the string and then split it? I had some
> >>>>> help with a similar problem, but now I don't seem to be able to
> >>>>> transfer that to this problem!
> >>>>> Thanks in advance for the help,
> >>>>> Martin
> >>>> No need for regex.
> >>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>> If "FTPHOST" in s:
> >>>>     return s[9:]
> >>>> Cheers,
> >>>> Drea
> >>> Sorry perhaps I didn't make it clear enough, so apologies. I only
> >>> presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
> >>> thought this easily encompassed the problem. The solution presented
> >>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
> >>> when I used this on the actual file I am trying to parse I realised it
> >>> is slightly more complicated as this also pulls out other information,
> >>> for example it prints
> >>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>> 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
> >>> 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> >>> etc. So I need to find a way to stop it before the \r
> >>> slicing the string wouldn't work in this scenario as I can envisage a
> >>> situation where the string lenght increases and I would prefer not to
> >>> keep having to change the string.
> >> If, as Terry suggested, you do have a tuple of strings and the first
> >> element has FTPHOST, then s[0].split(":")[1].strip() will work.
>
> > It is an email which contains information before and after the main
> > section I am interested in, namely...
>
> > FINISHED: 09/07/2009 08:42:31
>
> > MEDIATYPE: FtpPull
> > MEDIAFORMAT: FILEFORMAT
> > FTPHOST: e4ftl01u.ecs.nasa.gov
> > FTPDIR: /PullDir/0301872638CySfQB
> > Ftp Pull Download Links:
> >ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
> > Down load ZIP file of packaged order:
> >ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
> > FTPEXPR: 09/12/2009 08:42:31
> > MEDIA 1 of 1
> > MEDIAID:
>
> > I have been doing this to turn the email into a string
>
> > email = sys.argv[1]
> > f = open(email, 'r')
> > s = str(f.readlines())
>
> To me that seems a strange thing to do. You could just read the entire
> file as a string:
>
> f = open(email, 'r')
> s = f.read()
>
> > so FTPHOST isn't the first element, it is just part of a larger
> > string. When I turn the email into a string it looks like...
>
> > 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> > 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> > 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
> > \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
> > load ZIP file of packaged order:\r\n',
>
> > So not sure splitting it like you suggested works in this case.
> >
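
For reference, here is a minimal sketch of the f.read() suggestion quoted above, written for Python 3. The sample text stands in for the file contents and is copied from the excerpt in this thread; the helper name extract_field is hypothetical, not something from the original script. On the raw text, a character class that excludes \r and \n stops the match at the end of the line, so there is no need to match a literal backslash-r in the repr of a list.

```python
import re

# Stand-in for f.read() on the email (assumed content, taken from the
# excerpt quoted in this thread; note the \r\n line endings).
raw = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)


def extract_field(name, text):
    """Return the value after 'NAME:' on its own line, or None if absent.

    Hypothetical helper, for illustration only.
    """
    # ^ with re.MULTILINE anchors at the start of each line;
    # [^\r\n]* stops before the line ending, so \r\n is never captured.
    pattern = r"^%s:[ \t]*([^\r\n]*)" % re.escape(name)
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None


print(extract_field("FTPHOST", raw))  # e4ftl01u.ecs.nasa.gov
print(extract_field("FTPDIR", raw))   # /PullDir/0301872638CySfQB
```

Returning None for a missing label is just one choice here; raising an error may be preferable if the field is required.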
Within the file is a list of files, e.g.

TOTAL FILES: 2
FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
FILESIZE: 11028908

FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
FILESIZE: 18975

and what I want to do is get the FTP address from the file and collect
these files to pull down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
# Matches the .hdf.xml names; the .hdf name is derived from each match below.
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)

# s is the repr of a list, so line endings appear as a literal backslash + r.
ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):
    print i, ':', len(m)
    file1 = m[i][:-4] # remove xml bit.
    file2 = m[i]
    # Fetch each file by its full URL, not the bare directory URL.
    urllib.urlretrieve(url + '/' + file1, file1)
    urllib.urlretrieve(url + '/' + file2, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should work.
Any suggestions on how to improve this are appreciated.

Thanks.

-- 
http://mail.python.org/mailman/listinfo/python-list
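
One possible way to make the filename match less dependent on counting characters, offered as a sketch and not a tested replacement: key on the FILENAME: label and build one URL per file. This is Python 3. The function name build_download_urls and the sample text are assumptions for illustration, as is the layout of one label per line. Nothing is downloaded here.

```python
import re


def build_download_urls(text):
    """Return (filename, url) pairs parsed from the order email text.

    Hypothetical helper: assumes 'FTPHOST:', 'FTPDIR:' and 'FILENAME:'
    each start a line. Raises AttributeError if FTPHOST or FTPDIR is
    missing, because re.search then returns None.
    """
    host = re.search(r"^FTPHOST:[ \t]*(\S+)", text, re.MULTILINE).group(1)
    ftpdir = re.search(r"^FTPDIR:[ \t]*(\S+)", text, re.MULTILINE).group(1)
    # \S+ stops at the first whitespace, so the trailing \r\n is excluded.
    names = re.findall(r"^FILENAME:[ \t]*(\S+)", text, re.MULTILINE)
    base = "ftp://" + host + ftpdir.rstrip("/")
    return [(name, base + "/" + name) for name in names]


# Assumed sample, assembled from the excerpts in this thread.
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "FILESIZE: 11028908\r\n"
    "FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "FILESIZE: 18975\r\n"
)

for name, url in build_download_urls(sample):
    print(name, url)
```

In Python 3, each pair could then be passed to urllib.request.urlretrieve(url, name) to fetch the file. Because the names come from the FILENAME: lines, the .hdf and .hdf.xml entries are both picked up without slicing off the last four characters.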