In article <[EMAIL PROTECTED]>, Dennis Lee Bieber <[EMAIL PROTECTED]> wrote:
<snip>
> Rather complicated description... A sample of the real/actual input
> /file/ would be useful.

Sorry, I didn't want to go on too long about the background, but I
guess more context would have helped. The data actually come from a web
page; I use a class based on SGMLParser to do the initial collection.
The items in the "names" list were originally "title" attributes of
anchor tags and are obtained with a "start_a" method, while "cells"
holds the contents of the <td> tags, obtained by a "handle_data" method
according to the state of a flag that's set to True by a "start_td"
method and to False by an "end_td". I don't care about anything else on
the page, so I didn't define most of the tag-specific methods
available.

<snip>
>     cellRoot = 10 * i + na    #where did na come from?
>                               #heck, where do names and cells
>                               #come from? Globals? Not recommended..

The variable "na" is the number of 'not applicable' items (headings and
whatnot) preceding the data I'm interested in.

I'm not clear on what makes an object global, other than appearing as
an operand of a "global" statement, which I don't use anywhere. But
"na" is assigned its value in the program body, not within any
function: does that make it global? Why is this not recommended? If I
wrap the assignment in a function, making "na" a local variable, how
can "extract_data" then access it?

The lists of data are attributes (?) of my SGMLParser class; in my
misguided attempt to pare irrelevant details from "extract_data" I
obfuscated this aspect. I have a "parse_page(url)" function that
returns an instance of the class, as "captured", and the lists in
question are actually called "captured.names" and "captured.cells". The
"parse_page(url)" function is called in the program body; does that
make its output global as well?

> use
>
>     def extract_data(names, na, cells):
>
> and
>
>     return <something>

What should it return? A Boolean indicating success or failure?
All the data I want should have been stored in the "found" dictionary
by the time the function finishes traversing the list of names.

> >     for k in ('time', 'score1', 'score2'):
> >         v = found[name][k]
> >         if v != "---" and v != "n/a": # skip non-numeric data
> >             v = ''.join(v.split(",")) # remove commas between 000s
> >             found[name][k] = float(v)
>
> I'd suggest splitting this into a short function, and invoking it in
> the preceding... say it is called "parsed"
>
>     "time" : parsed(cells[cellRoot + 5]),

Will do. I guess part of my problem is that being unsure of myself I'm
reluctant to attempt too much in a single complex statement, finding it
easier to take small and simple (but inefficient) steps. I'll have to
learn to consolidate things as I go.

> Did you check the library for time/date parsing/formatting
> operations?
>
> >>> import time
> >>> aTime = "03 Feb 2008 20:35:46 UTC" #DD Mth YYYY HH:MM:SS UTC
> >>> time.strptime(aTime, "%d %b %Y %H:%M:%S %Z")
> (2008, 2, 3, 20, 35, 46, 6, 34, 0)

I looked at the documentation for the "time" module, including
"strptime", but I didn't realize the "%b" directive would match the
month abbreviations I'm dealing with. It's described as "Locale's
abbreviated month name"; if someone were to run my program on, e.g., a
French system, wouldn't it try to find a match among "jan", "fév", ...,
"déc" (or whatever) and fail? Is there a way to declare a "locale" that
will override the user's settings? Are the locale-specific strings
documented anywhere? Can one assume them to be identical in all
English-speaking countries, at least?

Now it's pretty unlikely in this case that such an 'international
situation' will arise, but I didn't want to burn any bridges ...

I was also somewhat put off "strptime" on reading the caveat "Note:
This function relies entirely on the underlying platform's C library
for the date parsing, and some of these libraries are buggy.
There's nothing to be done about this short of a new, portable
implementation of strptime()." If it works, however, it'll be a lot
tidier than what I was doing. I'll make a point of testing it on its
own, with a variety of inputs.

> Note that the %Z is a problematic entry...
> ValueError: time data did not match format:  data=03 Feb 2008
> 20:35:46 PST  fmt=%d %b %Y %H:%M:%S %Z

All the times are UTC, so fortunately this is a non-issue for my
purposes of the moment. May I assume that leaving the zone out will
cause the time to be treated as UTC?

Thanks for your help, and for bearing with my elementary questions and
my fumbling about.

--
Odysseus
--
http://mail.python.org/mailman/listinfo/python-list
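
P.S. A minimal sketch of where the two suggestions above seem to lead,
assuming I've understood them. "parsed" is the helper name proposed in
the quoted reply; "utc_seconds" is a name made up here for
illustration. Matching the letters "UTC" literally in the format
(instead of using %Z) and converting with calendar.timegm() are my own
assumptions, not something from the thread; timegm() reads the parsed
fields as UTC regardless of the local zone.

```python
import calendar
import time


def parsed(text):
    """Return text as a float, or unchanged if it is a placeholder."""
    if text in ("---", "n/a"):           # skip non-numeric data
        return text
    return float(text.replace(",", ""))  # remove commas between 000s


def utc_seconds(stamp):
    """Convert 'DD Mth YYYY HH:MM:SS UTC' to seconds since the epoch.

    Hypothetical helper: the zone is matched as the literal text "UTC"
    rather than with %Z, and timegm() treats the fields as UTC.
    """
    fields = time.strptime(stamp, "%d %b %Y %H:%M:%S UTC")
    return calendar.timegm(fields)


print(parsed("1,234,567"))
print(parsed("n/a"))
print(utc_seconds("03 Feb 2008 20:35:46 UTC"))
```

Note that "%b" is still locale-dependent here, so this only sidesteps
the %Z problem, not the month-name question.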