RE: finding rows in a large file (22 millions of rows)

Westgate, Jared Tue, 11 Feb 2003 08:18:04 -0800

Madhu Reddy wrote:
> We are trying to load data into Teradata [which is 
> data warehousing, stores terabytes of data, and which 
> is 10 times faster than any other database]


Data warehousing is always an exciting subject!  However, I'd be surprised to
see this kind of performance increase.  A major factor in database performance
is the database design.  Many database designers do not know how to build 
data warehouses because they are stuck on normal relational concepts.  Anyway, sorry 
to be off topic...  I just can't turn down a database debate! :)

> before loading data into Teradata, we need to do some
> massaging on data..basically eliminating..duplicate
> rows and invalid rows...

I don't know anything about the Teradata database system, but I know how I 
would do this on other systems: 

1. Load the data as it is into a temporary database
2. Do a select (or a report), returning unique (distinct) rows.  This same
   select could also filter out your invalid rows and massage data.
3. Load the result of the select into the final database.

If you are really looking to do this with Perl, I guess you could load the data
into a hash, sort it, and then print the unique values.  I have no idea how
long this would take to run, but the code would be fairly straightforward:

Just load the data into a hash, using each row (all of its columns joined
together) as a key.  Then sort the hash keys (this may take a little while).
Finally, write a loop that cycles through the sorted keys, comparing each one
to the key you last read.  If they are the same, don't print it to a file.
Otherwise, do print it to a file.  At this point you could also do some
formatting, etc.
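
Here is a rough sketch of the hash idea, not tested against real data.  It 
skips the sort-and-compare step: hash keys are unique, so asking "have I seen 
this row before?" is enough to drop duplicates, and the rows come out in 
their original order.  The name write_unique_rows, the pipe-delimited sample 
rows, and the validity check are all made up for illustration.  The hash 
holds every distinct row in memory, so with 22 million rows this may need a 
lot of RAM.

```perl
use strict;
use warnings;

# Hypothetical helper: copy rows from $in to $out, skipping any row that
# has already been written.  $is_valid is an optional code ref that
# returns false for rows that should be thrown away.
sub write_unique_rows {
    my ($in, $out, $is_valid) = @_;
    my %seen;        # row text => number of times encountered so far
    my $kept = 0;
    while (my $row = <$in>) {
        chomp $row;
        next if $is_valid && !$is_valid->($row);   # drop invalid rows
        next if $seen{$row}++;                      # drop repeated rows
        print {$out} "$row\n";
        $kept++;
    }
    return $kept;    # number of rows written
}

# Demo using in-memory filehandles; the sample rows are invented.
my $input = "1|alice\n2|bob\n1|alice\n|bad\n2|bob\n3|carol\n";
open my $in, '<', \$input or die "cannot open input: $!";
my $output = '';
open my $out, '>', \$output or die "cannot open output: $!";

# Assumed rule for the demo: a valid row starts with a numeric id column.
my $kept = write_unique_rows($in, $out, sub { $_[0] =~ /^\d+\|/ });
close $out;
close $in;
print $output;
```

For a real file you would open the input and output by name instead of 
using the in-memory strings.  If the distinct rows do not fit in memory, 
sorting the file first and comparing each line to the previous one, as 
described above, avoids the big hash.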

I guess it all just depends on which you are more comfortable with.

Hope this helps,

Jared

