On 6/13/2011 12:53 AM, Kumar Mainali wrote:
I have a huge dataset containing millions of rows and several dozen
columns in a tab delimited text file.  I need to extract a small subset
of rows and only three columns. One of the three columns holds a two-word
string under the header “Scientific Name”. The other two columns carry
numbers for Longitude and Latitude, as below.

Sci Name    Longitude    Latitude    Column4
Gen sp1     82.5         28.4        …
Gen sp2     45.9         29.7        …
Gen sp1     57.9         32.9        …
…           …            …           …

Of the many species listed under the column “Sci Name”, I am interested
in only one species which will have multiple records interspersed in the
millions of rows, and I will probably have to use filename.readline() to
read the rows one at a time. How would I search for a particular species
in the dataset and create a new dataset for the species with only the
three columns?

Next, I have to create such datasets for hundreds of species. All these
species are listed in another text file. There must be a way to define
an iterative function that looks at one species at a time in the list of
species and creates a separate dataset for each species. The huge dataset
contains more species than the ones in my list of interest.

Consider using a real database program with Sci_name indexed. Then you can extract the rows for any species as needed. You should only need separate files if you want to export them or more or less permanently split the database. You could try sqlite, which comes with Python, or one of the other free database programs.
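
A minimal sketch of that idea, using the sqlite3 and csv modules from the standard library. It assumes the column headers are exactly "Sci Name", "Longitude" and "Latitude" as in the sample above, that every row has numeric coordinates, and that the species list file has one name per line. The table name (records), the function names, and the output file naming are made up for illustration.

```python
import csv
import os
import sqlite3


def load_records(conn, tsv_path):
    """Stream the tab-delimited file into an indexed SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(sci_name TEXT, longitude REAL, latitude REAL)"
    )
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Generator: rows are pulled from the file lazily, so the whole
        # file never has to fit in memory. Only three columns are kept.
        rows = (
            (row["Sci Name"], float(row["Longitude"]), float(row["Latitude"]))
            for row in reader
        )
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    # The index is what makes per-species lookups fast afterwards.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sci_name ON records (sci_name)")
    conn.commit()


def rows_for_species(conn, name):
    """Return (sci_name, longitude, latitude) tuples for one species."""
    cursor = conn.execute(
        "SELECT sci_name, longitude, latitude FROM records "
        "WHERE sci_name = ? ORDER BY rowid",
        (name,),
    )
    return cursor.fetchall()


def export_species(conn, species_list_path, out_dir):
    """Write one tab-delimited file per listed species; return row counts."""
    counts = {}
    with open(species_list_path, encoding="utf-8") as handle:
        names = [line.strip() for line in handle if line.strip()]
    for name in names:
        rows = rows_for_species(conn, name)
        counts[name] = len(rows)
        if not rows:
            continue  # species in the list but absent from the data
        # Hypothetical naming scheme: "Gen sp1" -> "Gen_sp1.txt"
        out_path = os.path.join(out_dir, name.replace(" ", "_") + ".txt")
        with open(out_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerow(["Sci Name", "Longitude", "Latitude"])
            writer.writerows(rows)
    return counts
```

Usage would look something like conn = sqlite3.connect("species.db"), then load_records(conn, "huge.txt") once, then export_species(conn, "species.txt", "out") whenever files are needed; those file names are placeholders. If you only need the rows inside Python, rows_for_species alone is enough and no per-species files are required.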

--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list
