On Sun, Jun 12, 2011 at 9:53 PM, Kumar Mainali <kpmain...@gmail.com> wrote:
> I have a huge dataset containing millions of rows and several dozen columns
> in a tab delimited text file. I need to extract a small subset of rows and
> only three columns. One of the three columns has two word string with header
> “Scientific Name”. The other two columns carry numbers for Longitude and
> Latitude, as below.
>
> Sci Name    Longitude    Latitude    Column4
> Gen sp1     82.5         28.4        …
> Gen sp2     45.9         29.7        …
> Gen sp1     57.9         32.9        …
> …           …            …           …
>
> Of the many species listed under the column “Sci Name”, I am interested in
> only one species which will have multiple records interspersed in the
> millions of rows, and I will probably have to use filename.readline() to
> read the rows one at a time. How would I search for a particular species in
> the dataset and create a new dataset for the species with only the three
> columns?
>
> Next, I have to create such datasets for hundreds of species. All these
> species are listed in another text file. There must be a way to define an
> iterative function that looks at one species at a time in the list of
> species and creates separate dataset for each species. The huge dataset
> contains more species than those listed in the list of my interest.
>
> I very much appreciate any help. I am a beginner in Python. So, complete
> code would be more helpful
>

You could use the csv module, which has been in CPython since 2.3. Don't be
fooled by the name: it lets you change the delimiter and other aspects of the
format, making it appropriate for tab-separated values as well:

http://docs.python.org/release/3.2/library/csv.html
http://docs.python.org/release/2.7.2/library/csv.html
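
As a rough sketch of that approach (Python 3, not tried against your real
data), something along these lines might do. The function name, the file
names and the output naming scheme are made up for illustration. It assumes
the header row uses exactly "Sci Name", "Longitude" and "Latitude" as in your
sample, that the species list has one name per line, and that the matching
rows (three columns only) fit in memory:

```python
import csv
import os
from collections import defaultdict

# Assumed column headers, copied from the sample in the question.
NAME_COL, LON_COL, LAT_COL = "Sci Name", "Longitude", "Latitude"


def split_by_species(data_path, species_path, out_dir):
    """Write one tab-separated file per wanted species.

    Returns a dict mapping each species found to its number of rows.
    """
    # Assumption: the species list file holds one species name per line.
    with open(species_path, encoding="utf-8") as f:
        wanted = {line.strip() for line in f if line.strip()}

    # Single pass over the big file; the reader yields one row at a time,
    # and only three columns of the matching rows are kept.
    rows = defaultdict(list)
    with open(data_path, newline="", encoding="utf-8") as f:
        for rec in csv.DictReader(f, delimiter="\t"):
            name = rec[NAME_COL]
            if name in wanted:
                rows[name].append((name, rec[LON_COL], rec[LAT_COL]))

    # Hypothetical naming scheme: "Gen sp1" is written to "Gen_sp1.txt".
    os.makedirs(out_dir, exist_ok=True)
    for name, recs in rows.items():
        out_path = os.path.join(out_dir, name.replace(" ", "_") + ".txt")
        with open(out_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerow((NAME_COL, LON_COL, LAT_COL))
            writer.writerows(recs)

    return {name: len(recs) for name, recs in rows.items()}
```

You would call it as split_by_species("huge.txt", "species.txt", "out"),
with your own paths in place of those placeholders. Species in the list that
never occur in the data simply produce no file. Collecting the rows first and
writing at the end avoids keeping hundreds of output files open at once; if
the matches turn out too large for memory, appending to each file as you go
would be the alternative.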