Hi,

I have created a csv file that lists how often each word in the Internet Movie Database occurs with different star-ratings and in different genres. The input file looks something like this. Since movies can have multiple genres, there are three genre columns. (This is fake, simplified data.)

    ID    | Genre1  | Genre2   | Genre3    | Star-rating | Word | Count
    film1 | Drama   | Thriller | Western   | 1           | the  | 20
    film2 | Comedy  | Musical  | NA        | 2           | the  | 20
    film3 | Musical | History  | Biography | 1           | the  | 20
    film4 | Drama   | Thriller | Western   | 1           | the  | 10
    film5 | Drama   | Thriller | Western   | 9           | the  | 20

I can get the program to tell me how many occurrences of "the" there are in Thrillers (50), how many "the"'s in 1-stars (50), and how many 1-star drama "the"'s there are (30).

But I need to be able to expand beyond a particular word and ask: how many words total are in "Drama"? How many total words are in 1-star ratings? How many words are there in the whole corpus? On these all-word totals, I'm stumped.

What I've done so far: I used the shelve module to store my input csv in a database format. Here's how I get count information so far:

    def get_word_count(word, db, genre=None, rating=None):
        c = 0
        vals = db[word]
        for val in vals:
            if not genre and not rating:
                c += val['count']
            elif genre and not rating:
                if genre in val['genres']:
                    c += val['count']
            elif rating and not genre:
                if rating == val['rating']:
                    c += val['count']
            else:
                if rating == val['rating'] and genre in val['genres']:
                    c += val['count']
        return c

(I think there's something a little wrong with the rating stuff here, but this code generally works and produces the right counts.)

With get_word_count I can do stuff like this to figure out how many times "the" appears in a particular genre:

    vals = db[word]
    for val in vals:
        genre_ct_for_word = get_word_count(word, db, genre, rating=None)
    return genre_ct_for_word

I've tried to extend this thinking to get TOTAL genre/rating counts for all words, but it doesn't work. I get a TypeError saying that string indices must be integers.
I'm not sure how to overcome this.

    # Doesn't work:
    def get_full_rating_count(db, rating=None):
        full_rating_ct = 0
        vals = db
        for val in vals:
            if not rating:
                full_rating_ct += val['count']
            elif rating == val['rating']:
                # Um, I know this looks dumb, but in the other code it
                # seems to be necessary for things to work.
                if rating == val['rating']:
                    full_rating_ct += val['count']
        return full_rating_ct

Can anyone suggest how to do this?

Thanks!
Tyler

Background for the curious: What I really want to know is which words are over- or under-represented in different Genre x Rating categories. "The" should be flat, but something like "wow" should be over-represented in 1-star and 10-star ratings and under-represented in 5-star ratings. Something like "gross" may be over-represented in low-star ratings for romances, but if grossness is a good thing in horror movies, then we'll see "gross" over-represented in HIGH-star ratings for horror.

To figure out over-representation and under-representation I need to compare "observed" counts to "expected" counts. The expected counts are computed from probabilities, so I need to know how many words there are in the whole corpus, how many in each rating category, and how many in each genre category.
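
For illustration, the observed-vs-expected comparison described in the background could be sketched roughly as below. This assumes the three totals are already known and that "expected" means the word is spread across categories in proportion to category size. The function names and all numbers in the example are hypothetical, not taken from the real data.

```python
def expected_count(word_total, category_total, corpus_total):
    """Expected occurrences of a word in a category, assuming the word
    is distributed in proportion to the category's share of the corpus.

    word_total     -- occurrences of the word in the whole corpus
    category_total -- all word tokens in the category (e.g. 1-star Drama)
    corpus_total   -- all word tokens in the whole corpus
    """
    if corpus_total <= 0:
        raise ValueError("corpus_total must be positive")
    # float() keeps the division exact-ish under Python 2 as well.
    return word_total * category_total / float(corpus_total)


def representation_ratio(observed, word_total, category_total, corpus_total):
    """Observed / expected: above 1 suggests over-representation,
    below 1 under-representation. Assumes the expected count is nonzero."""
    expected = expected_count(word_total, category_total, corpus_total)
    return observed / expected


# Hypothetical numbers: a word seen 200 times in a 10,000-word corpus,
# and a category holding 1,000 of those words.
exp = expected_count(200, 1000, 10000)            # 20.0
ratio = representation_ratio(50, 200, 1000, 10000)  # 2.5
```

With these made-up figures the word would be expected 20 times in the category, so an observed count of 50 gives a ratio of 2.5, i.e. over-represented. A flat word like "the" should come out near 1 in every category.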