On Thursday, July 28, 2016 at 1:00:17 PM UTC+5:30, c...@zip.com.au wrote:
> On 27Jul2016 22:12, Arshpreet Singh <arsh...@gmail.com> wrote:
> >I am writing Imdb scrapper, and getting available list of titles from IMDB
> >website which provide txt file in very raw format, Here is the one part of
> >file(http://pastebin.com/fpMgBAjc) as the file provides tags like
> >Distribution Votes,Rank,Title I want to parse title names, I tried with
> >readlines() method but it returns only list which is quite heterogeneous,
> >is it possible that I can parse each value comes under title section?
>
> Just for etiquette: please just post text snippets like that inline in your
> text. Some people don't like fetching random URLs, and some of us are not
> always online when reading and replying to email. Either way, having the text
> in the message, especially when it is small, is preferable.
>
> To your question:
>
> Your sample text looks like this:
>
> New  Distribution  Votes  Rank  Title
>       0000000125  1680661   9.2  The Shawshank Redemption (1994)
>       0000000125  1149871   9.2  The Godfather (1972)
>       0000000124   786433   9.0  The Godfather: Part II (1974)
>       0000000124  1665643   8.9  The Dark Knight (2008)
>       0000000133   860145   8.9  Schindler's List (1993)
>       0000000133   444718   8.9  12 Angry Men (1957)
>       0000000123  1317267   8.9  Pulp Fiction (1994)
>       0000000124  1209275   8.9  The Lord of the Rings: The Return of the King (2003)
>       0000000123   500803   8.9  Il buono, il brutto, il cattivo (1966)
>       0000000133  1339500   8.8  Fight Club (1999)
>       0000000123  1232468   8.8  The Lord of the Rings: The Fellowship of the Ring (2001)
>       0000000223   832726   8.7  Star Wars: Episode V - The Empire Strikes Back (1980)
>       0000000233  1243066   8.7  Forrest Gump (1994)
>       0000000123  1459168   8.7  Inception (2010)
>       0000000223  1094504   8.7  The Lord of the Rings: The Two Towers (2002)
>       0000000232   676479   8.7  One Flew Over the Cuckoo's Nest (1975)
>       0000000232   724590   8.7  Goodfellas (1990)
>       0000000233  1211152   8.7  The Matrix (1999)
>
> Firstly, I would suggest you not use readlines(); it pulls all the text into
> memory. For small text like this it is OK, but some things can be arbitrarily
> large, so it is something to avoid if convenient. Normally you can just
> iterate over a file and get lines.
>
> You want "text under the Title." Looking at it, I would be inclined to say
> that the first line is a header and the rest consist of 4 columns: a number
> (distribution?), a vote count, a rank and the rest (title plus year).
>
> You can parse data like that like this (untested):
>
>     # presumes `fp` is reading from the text
>     for n, line in enumerate(fp):
>         if n == 0:
>             # heading, skip it
>             continue
>         # split on whitespace at most 3 times so the title stays whole
>         distnum, nvotes, rank, etc = line.split(None, 3)
>         ... do stuff with the various fields ...
>
> I hope that gets you going. If not, return with what code you have, what
> happened, and what you actually wanted to happen and we may help further.

Thanks, I was able to do it with the following:
https://github.com/alberanid/imdbpy/blob/master/bin/imdbpy2sql.py (it was very helpful)
python imdbpy2sql.py -d <.txt files downloaded from IMDB> -u sqlite:/where/to/save/db --sqlite-transactions
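
For anyone who only needs the titles and does not want a full database import, here is a minimal sketch of the line-by-line approach suggested in the quoted reply. It assumes the layout of the sample above: one heading line, then whitespace-separated distribution, votes and rank fields followed by the title. The names `Rating` and `parse_ratings` are made up for illustration, and the sketch does not try to handle the other sections of the real IMDb file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Rating(NamedTuple):
    distribution: str
    votes: int
    rank: float
    title: str  # title plus year, e.g. "The Godfather (1972)"


def parse_ratings(fp: TextIO) -> Iterator[Rating]:
    """Yield one Rating per data line, reading the file lazily."""
    for n, line in enumerate(fp):
        if n == 0:
            continue  # heading line, skip it
        # Split on whitespace at most 3 times so the title stays in one piece.
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # blank or truncated line
        distnum, nvotes, rank, title = fields
        yield Rating(distnum, int(nvotes), float(rank), title.strip())


# Small in-memory sample standing in for the downloaded text file.
sample = io.StringIO(
    "New  Distribution  Votes  Rank  Title\n"
    "      0000000125  1680661   9.2  The Shawshank Redemption (1994)\n"
    "      0000000125  1149871   9.2  The Godfather (1972)\n"
)
for rating in parse_ratings(sample):
    print(rating.rank, rating.title)
```

Because the function is a generator over the file object, it never holds more than one line in memory, which is the reason given above for avoiding readlines(). A line whose votes or rank field is not numeric would raise ValueError; whether to skip or report such lines is left open here.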