On Sep 9, 4:58 pm, Al Fansome <al_fans...@hotmail.com> wrote:
> Mart. wrote:
> > On Sep 8, 4:33 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> >>Mart. wrote:
> >>> On Sep 8, 3:53 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> >>>>Mart. wrote:
> >>>>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t...@ubisoft.com> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>> I need to extract a string after a matching a regular expression.
> >>>>>>>>> For example I have the string...
> >>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>>>>> and once I match "FTPHOST" I would like to extract
> >>>>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to
> >>>>>>>>> the problem, I had been trying to match the string using something
> >>>>>>>>> like this:
> >>>>>>>>> m = re.findall(r"FTPHOST", s)
> >>>>>>>>> But I couldn't then work out how to return the
> >>>>>>>>> "e4ftl01u.ecs.nasa.gov" part. Perhaps I need to find the string and
> >>>>>>>>> then split it? I had some help with a similar problem, but now I
> >>>>>>>>> don't seem to be able to transfer that to this problem!
> >>>>>>>>> Thanks in advance for the help,
> >>>>>>>>> Martin
> >>>>>>>> No need for regex.
> >>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>>>> If "FTPHOST" in s:
> >>>>>>>>     return s[9:]
> >>>>>>>> Cheers,
> >>>>>>>> Drea
> >>>>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
> >>>>>>> presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
> >>>>>>> thought this easily encompassed the problem. The solution presented
> >>>>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1).
> >>>>>>> But when I used this on the actual file I am trying to parse I
> >>>>>>> realised it is slightly more complicated as this also pulls out
> >>>>>>> other information, for example it prints
> >>>>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>>>>>> 'Ftp Pull Download Links: \r\n',
> >>>>>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
> >>>>>>> 'Down load ZIP file of packaged order:\r\n',
> >>>>>>> etc. So I need to find a way to stop it before the \r
> >>>>>>> slicing the string wouldn't work in this scenario as I can envisage a
> >>>>>>> situation where the string lenght increases and I would prefer not to
> >>>>>>> keep having to change the string.
> >>>>>> If, as Terry suggested, you do have a tuple of strings and the first
> >>>>>> element has FTPHOST, then s[0].split(":")[1].strip() will work.
> >>>>> It is an email which contains information before and after the main
> >>>>> section I am interested in, namely...
> >>>>> FINISHED: 09/07/2009 08:42:31
> >>>>> MEDIATYPE: FtpPull
> >>>>> MEDIAFORMAT: FILEFORMAT
> >>>>> FTPHOST: e4ftl01u.ecs.nasa.gov
> >>>>> FTPDIR: /PullDir/0301872638CySfQB
> >>>>> Ftp Pull Download Links:
> >>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
> >>>>> Down load ZIP file of packaged order:
> >>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
> >>>>> FTPEXPR: 09/12/2009 08:42:31
> >>>>> MEDIA 1 of 1
> >>>>> MEDIAID:
> >>>>> I have been doing this to turn the email into a string
> >>>>> email = sys.argv[1]
> >>>>> f = open(email, 'r')
> >>>>> s = str(f.readlines())
> >>>> To me that seems a strange thing to do. You could just read the entire
> >>>> file as a string:
> >>>> f = open(email, 'r')
> >>>> s = f.read()
> >>>>> so FTPHOST isn't the first element, it is just part of a larger
> >>>>> string. When I turn the email into a string it looks like...
> >>>>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> >>>>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> >>>>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>>>> 'Ftp Pull Download Links: \r\n',
> >>>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
> >>>>> 'Down load ZIP file of packaged order:\r\n',
> >>>>> So not sure splitting it like you suggested works in this case.
> >>> Within the file are a list of files, e.g.
> >>> TOTAL FILES: 2
> >>> FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> >>> FILESIZE: 11028908
> >>> FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> >>> FILESIZE: 18975
> >>> and what i want to do is get the ftp address from the file and collect
> >>> these files to pull down from the web e.g.
> >>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> >>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> >>> Thus far I have
> >>> #!/usr/bin/env python
> >>> import sys
> >>> import re
> >>> import urllib
> >>> email = sys.argv[1]
> >>> f = open(email, 'r')
> >>> s = str(f.readlines())
> >>> m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
> >>> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
> >>> ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
> >>> url = 'ftp://' + ftphost + ftpdir
> >>> for i in xrange(len(m)):
> >>>     print i, ':', len(m)
> >>>     file1 = m[i][:-4] # remove xml bit.
> >>>     file2 = m[i]
> >>>     urllib.urlretrieve(url, file1)
> >>>     urllib.urlretrieve(url, file2)
> >>> which works, clearly my match for the MOD13A2* files isn't ideal I
> >>> guess, but they will always occupt those dimensions, so it should
> >>> work. Any suggestions on how to improve this are appreciated.
> >> Suppose the file contains your example text above.
> >> Using 'readlines' returns a list of the lines:
> >
> >> >>> f = open(email, 'r')
> >> >>> lines = f.readlines()
> >> >>> lines
> >> ['TOTAL FILES: 2\n', '\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
> >> 11028908\n', '\n', '\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
> >> 18975\n']
> >
> >> Using 'str' on that list then converts it to s string _representation_
> >> of that list:
> >
> >> >>> str(lines)
> >> "['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
> >> 11028908\\n', '\\n', '\\t\\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
> >> 18975\\n']"
> >
> >> That just parsing a lot more difficult.
> >
> >> It's much easier to just read the entire file as a single string and
> >> then parse that:
> >
> >> >>> f = open(email, 'r')
> >> >>> s = f.read()
> >> >>> s
> >> 'TOTAL FILES: 2\n\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
> >> 11028908\n\n\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
> >> >>> import re
> >> >>> re.findall(r"FILENAME: (.+)", s)
> >> ['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
> >> 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
> >
> > If I do it this way I can't seem to not extract the \r at the end of
> > the line.
> >
> > In [26]: m = re.search(r"FTPHOST: (.+)", s)
> >
> > In [27]: m.group(1)
> > Out[27]: 'e4ftl01u.ecs.nasa.gov\r'
> >
> > but if I insert \\r at the end as was previously suggested.
> >
> > In [28]: m = re.search(r"FTPHOST: (.+)\\r", s)
> >
> > In [29]: m.group(1)
> >
> > AttributeError: 'NoneType' object has no attribute 'group'
> >
> > Any thoughts?
> >
> > Thanks
>
> Just use \r at the end, not \\r. \r is the carriage return character,
> which ends the line. \\r becomes two characters, the character backslash
> "\", followed by the character "r".

Excellent, thanks. Sorry, I thought I had to escape it to access it. In case
it helps anyone, the script is as follows. Many thanks to all for the help.

#!/usr/bin/env python
import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = f.read()

# match the modis files...
m = re.findall(r"FILENAME: (.+)\r", s)

# get the ftp location
ftphost = re.search(r"FTPHOST: (.+)\r", s).group(1)
ftpdir = re.search(r"FTPDIR: (.+)\r", s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):
    print i, ':', len(m)  # counter
    modis_file = str(m[i])
    # fetch each file by its own URL rather than the bare directory URL
    urllib.urlretrieve(url + '/' + modis_file, modis_file)
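
For anyone reading this later, here is a minimal sketch of the same parsing in Python 3, which also shows why \\r failed and \r worked. It assumes the email uses \r\n line endings as in the excerpts quoted above. SAMPLE is a made-up stand-in for the real email, and field() and download_urls() are hypothetical helper names, not part of the script in this thread. It only builds the URLs and does not download anything.

```python
import re

# Stand-in for the email text, modelled on the quoted excerpts (CRLF endings).
SAMPLE = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)

# "." matches everything except "\n", so a bare (.+) drags the "\r" along.
greedy = re.search(r"FTPHOST: (.+)", SAMPLE).group(1)

# In a raw string, \\r means a literal backslash followed by "r".
# The text contains no backslash, so there is no match at all.
escaped = re.search(r"FTPHOST: (.+)\\r", SAMPLE)

# \r is the carriage return itself, so the capture stops just before it.
fixed = re.search(r"FTPHOST: (.+)\r", SAMPLE).group(1)


def field(name, text):
    """Return the value after 'name: ' on its line, or None if absent.

    Hypothetical helper: the [^\\r\\n] class stops at either line-ending
    character, so it works for both "\\r\\n" and "\\n" files.
    """
    match = re.search(re.escape(name) + r": ([^\r\n]*)", text)
    return match.group(1) if match else None


def download_urls(text):
    """Build one FTP URL per FILENAME entry (hypothetical helper)."""
    host = field("FTPHOST", text)
    directory = field("FTPDIR", text)
    if host is None or directory is None:
        raise ValueError("FTPHOST or FTPDIR not found")
    names = re.findall(r"FILENAME: ([^\r\n]+)", text)
    return ["ftp://" + host + directory + "/" + name for name in names]


if __name__ == "__main__":
    print(repr(greedy))
    print(escaped)
    print(repr(fixed))
    for url in download_urls(SAMPLE):
        print(url)
```

Excluding both \r and \n in the character class means the pattern does not need changing if the file is later saved with Unix line endings. In Python 3 the download call is urllib.request.urlretrieve(url, filename), which could be applied to each URL returned here.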