Update
I just got this task done.
I did not use Pandas, mainly because I wanted to work in SQL with a real
database instead of a CSV file.

I used sqliteBrowser to import the data from the CSV file.  It worked well
except that it guessed the data type wrong on some of the 150+ columns
(fields), and I had to go in and edit them manually.  It did this because
the data in the rows it sampled was inconsistent.  It didn't make a
mistake; the data it was given was garbage.

I was able to edit the fields easily enough, but it was slow because the
user interface is a bit clunky.  No biggie, just took longer than it might
have.  Still worked well though.
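
If I had wanted to script that step instead of fixing types in the GUI,
the same import could have been done with a few lines of Python, declaring
the column types up front so nothing has to be guessed.  A rough sketch,
with made-up file/table/column names standing in for the real 150+ columns:

import csv
import sqlite3

conn = sqlite3.connect("survey.db")          # placeholder database file
cur = conn.cursor()

# Declare the types explicitly instead of letting the importer guess them.
cur.execute("""CREATE TABLE IF NOT EXISTS raw_import (
                   site_id   TEXT,
                   reading_1 REAL,
                   reading_2 REAL)""")       # placeholder columns

with open("export.csv", newline="") as f:    # placeholder CSV file name
    reader = csv.reader(f)
    next(reader)                             # skip the header row
    cur.executemany("INSERT INTO raw_import VALUES (?, ?, ?)", reader)

conn.commit()
conn.close()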

At that point I had a big table that needed to be converted into actual
database records so that I could manipulate them properly with SQL.  My
table was 110K x 150 (rows x columns).  I needed to convert that to ~16.5M
individual records.

I tried doing this with SQL statements, and an SQL wizard could probably
do it, but I found it frustrating.  It wasn't that I couldn't convert the
individual cells into records; it was that the records needed some data
checking and manipulation before being inserted into a new table.

I turned to my trusty friend, Python, and found the sqlite3 module in the
standard library, which provides an interface to SQLite databases.  From
there things went really well.
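
The core of it was just a wide-to-long unpivot: read each row of the wide
table and write one record per value column.  A minimal sketch of what
that looked like (table, column, and file names here are placeholders, not
the real schema):

import sqlite3

conn = sqlite3.connect("survey.db")          # placeholder database file
conn.row_factory = sqlite3.Row               # lets rows be indexed by column name
cur = conn.cursor()

# Long-format destination table (placeholder names).
cur.execute("""CREATE TABLE IF NOT EXISTS measurements (
                   site_id TEXT,
                   field   TEXT,
                   value   REAL)""")

value_columns = ["reading_1", "reading_2"]   # stand-ins for the ~150 real fields

rows = cur.execute("SELECT * FROM raw_import").fetchall()
for row in rows:
    for column in value_columns:
        output = row[column]
        # ...data checking and manipulation goes here...
        cur.execute("INSERT INTO measurements VALUES (?, ?, ?)",
                    (row["site_id"], column, output))

conn.commit()
conn.close()

With ~16.5M inserts it pays to batch them with executemany() and commit
once at the end rather than per row.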

I encountered a small problem when I hit empty cells that were NULL in the
source table.  I needed to convert them to 0.0.   I found out that they
read as None in Python and so my fix was:

output = row[column]
if output is None:  # this is for a NULL field in the db
    output = 0.0

My hangup was that I tried comparing them (using ==) to "", '', "None",
NULL, Null, etc.  Finally I realized that Python reads them back as None,
and once I tested with "is None" it worked.
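
The same guard can also be written as a one-line conditional expression:

output = row[column] if row[column] is not None else 0.0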

We are using this data to make some large decisions.  It is great to have
such fantastic tools available on a moment's notice, for free, to be able
to do this work.  The supplier of the data was pushing to do the
manipulation for us for a hefty hourly fee.  I'm glad we didn't have to
resort to that.  Now
that we have good tools to do this work, we can mine this data any way we
want, using SQL and/or Python.

Linux rocks!  Fedora rocks!

Thanks for the quick reply late in the day on a Friday.  Having finished
this up, I don't have to worry about it for the rest of the weekend.