On Sat, 2 Oct 2010 06:31:42 am aenea...@priest.com wrote:
> Hi,
>
> I have created a csv file that lists how often each word in the
> Internet Movie Database occurs with different star-ratings and in
> different genres.

I would have thought that IMDB would probably have already made that
information available?

http://www.imdb.com/interfaces

> The input file looks something like this--since
> movies can have multiple genres, there are three genre rows. (This is
> fake, simplified data.)
[...]
> I can get the program to tell me how many occurrence of "the" there
> are in Thrillers (50), how many "the"'s in 1-stars (50), and how many
> 1-star drama "the"'s there are (30). But I need to be able to expand
> beyond a particular word and say "how many words total are in
> "Drama"? How many total words are in 1-star ratings? How many words
> are there in the whole corpus? On these all-word totals, I'm stumped.

The headings of your data look like this:

    ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count

and you want to map words to genres.

Can you tell us how big the CSV file is? Depending on its size, you may
need to use on-disk storage (perhaps shelve, as you're already doing)
but for illustration purposes I'll assume it all fits in memory and just
use regular dicts. I'm going to create a table that stores the counts
for each word versus the genre:

    Genre    | the | scary | silly | exciting | ...
    ------------------------------------------------
    Western  | 934 |     3 |     5 |      256 |
    Thriller | 899 |   145 |    84 |      732 |
    Comedy   | 523 |     1 |   672 |       47 |
    ...

To do this using dicts, I'm going to use a dict for genres:

    genre_table = {"Western": table_of_words, ...}

and each table_of_words will look like:

    {'the': 934, 'scary': 3, 'silly': 5, ...}

Let's start with a helper function and table to store the data.

    # Initialise the table.
    genres = {}

    def add_word(genre, word, count):
        genre = genre.title()  # force "gEnRe" to "Genre"
        word = word.lower()    # force "wOrD" to "word"
        count = int(count)
        row = genres.get(genre, {})
        n = row.get(word, 0)
        row[word] = n + count
        genres[genre] = row

We can simplify this code by using the confusingly named, but useful,
setdefault method of dicts:

    def add_word(genre, word, count):
        genre = genre.title()
        word = word.lower()
        count = int(count)
        row = genres.setdefault(genre, {})
        row[word] = row.get(word, 0) + count

Now let's process the CSV file. I haven't run this against your real
file, so treat it as a sketch. It uses csv.DictReader (Python 3 style)
and assumes a comma-separated file whose header row uses the column
names above; "words.csv" is just a placeholder name:

    import csv

    with open("words.csv", newline="") as f:
        for row in csv.DictReader(f):
            word = row["Word"]
            count = row["Count"]
            for column in ("Genre1", "Genre2", "Genre3"):
                genre = row[column].strip()
                # Assumption: films with fewer than three genres
                # leave the spare genre columns blank.
                if genre:
                    add_word(genre, word, count)

Now we can easily query our table for useful information:

    # list of unique words for the Western genre
    genres["Western"].keys()

    # count of unique words for the Romance genre
    len(genres["Romance"])

    # number of times "underdog" is used in Sports movies
    genres["Sport"]["underdog"]

    # total count of words for the Comedy genre
    sum(genres["Comedy"].values())

Do you want to do lookups efficiently the other way as well? It's easy
to add another table:

    Word  | Western | Thriller | ...
    ------------------------------------------------
    the   |     934 |      899 |
    scary |       3 |      145 |
    ...

Add a second global table:

    genres = {}
    words = {}

and modify the helper function:

    def add_word(genre, word, count):
        genre = genre.title()
        word = word.lower()
        count = int(count)
        # Add word to the genres table.
        row = genres.setdefault(genre, {})
        row[word] = row.get(word, 0) + count
        # And to the words table.
        row = words.setdefault(word, {})
        row[genre] = row.get(genre, 0) + count


-- 
Steven D'Aprano
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
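
Returning to the all-word totals asked about in the quoted message
(total words in "Drama", total words in 1-star ratings, total words in
the whole corpus), here is a minimal sketch of one way to get all three
in a single pass. It is not taken from the thread: the function name
totals and the sample rows are made up for illustration, and it assumes
a comma-separated file with the column headings shown earlier, read
with Python 3's csv.DictReader.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows (an assumption, not real IMDB data), using the
# column headings from the message above.
SAMPLE = """\
ID,Genre1,Genre2,Genre3,Star-rating,Word,Count
1,Drama,Thriller,,1,the,30
1,Drama,Thriller,,1,scary,4
2,Comedy,,,1,the,20
3,Drama,Comedy,Western,5,silly,6
"""


def totals(csv_file):
    """Return (per-genre totals, per-star totals, corpus total)."""
    by_genre = defaultdict(int)
    by_star = defaultdict(int)
    corpus = 0
    for row in csv.DictReader(csv_file):
        count = int(row["Count"])
        # Each row counts once towards its star rating and the corpus...
        by_star[row["Star-rating"].strip()] += count
        corpus += count
        # ...but once towards every genre the film is listed under.
        for column in ("Genre1", "Genre2", "Genre3"):
            genre = row[column].strip().title()
            if genre:  # skip blank genre columns
                by_genre[genre] += count
    return dict(by_genre), dict(by_star), corpus


by_genre, by_star, corpus = totals(io.StringIO(SAMPLE))
print(by_genre["Drama"])  # total words in Drama
print(by_star["1"])       # total words in 1-star ratings
print(corpus)             # total words in the whole corpus
```

One design point worth noting: the corpus total is accumulated per row
rather than by summing the genre totals, because a film listed under
several genres contributes its words to each of them, so adding the
genre totals together would count those words more than once.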