-----Original Message-----
From: Kevin Bartz [mailto:[EMAIL PROTECTED]
Sent: Monday, August 09, 2004 10:37 AM
To: '[EMAIL PROTECTED]'
Subject: RE: [GENERAL] Out of swap space & memory
Thanks for your reply, Mike! Theoretically, I should need only six of the
columns, but as a means of verifying integrity, I would like to de-dup using
all the columns. For instance, if there are two rows identical everywhere but
some column outside the six, I would like to know about it so I can report
back to the data provider. Maybe there's some other way to do this kind of
check?

Kevin

-----Original Message-----
From: mike g [mailto:[EMAIL PROTECTED]
Sent: Friday, August 06, 2004 9:07 PM
To: Kevin Bartz
Subject: RE: [GENERAL] Out of swap space & memory

Hmmmm. Can you determine which rows are duplicates by examining one column,
or must you examine all the columns?

On Fri, 2004-08-06 at 22:56, Kevin Bartz wrote:
> Thanks for your reply, but how can I then solve the problem of duplicates?
> Using your example, if one copy of a duplicate lives between rows 1 and
> 1,000,000 and the other copy between rows 1,000,001 and 2,000,000,
> de-duping each chunk individually will still leave that duplicate showing
> up twice.
>
> Kevin
>
> -----Original Message-----
> From: mike g [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 06, 2004 8:52 PM
> To: Kevin Bartz
> Subject: RE: [GENERAL] Out of swap space & memory
>
> Can octanenights_raw be altered? If so, using the sequence feature in
> Postgres you could add a column and autopopulate it with a number starting
> at 1 on up to 36 million. Then your select queries could be select X from
> octanenights_raw where sequence_column_name > 1 and sequence_column_name
> < 1000000, etc.
>
> Mike
>
> On Fri, 2004-08-06 at 22:40, Kevin Bartz wrote:
> > Well, all I'm doing right now is using psql from the command line. A bit
> > unintelligent, I know, but I want to make sure things will work
> > appropriately before I dive into the world of query editors.
> >
> > Thanks for your suggestions. My data won't need to be updated regularly.
> > In this case, as in all others, loading each set of data is a one-shot
> > process, so I don't think I'll need to worry about truncating.
> >
> > I can't think of any way to break the problem into steps, since the
> > duplicates may well be scattered throughout the table. If I split
> > octanenights into octanenights1 and octanenights2 and then de-dup each
> > individually, I would have to first be sure that octanenights1 does not
> > share a duplicate with octanenights2, or that duplicate would appear in
> > the UNION ALLed version. Maybe I'm missing something?
> >
> > Thanks for your kind response.
> >
> > Kevin
> >
> > -----Original Message-----
> > From: mike g [mailto:[EMAIL PROTECTED]
> > Sent: Friday, August 06, 2004 8:19 PM
> > To: Kevin Bartz
> > Cc: 'Manfred Koizar'; [EMAIL PROTECTED]
> > Subject: Re: [GENERAL] Out of swap space & memory
> >
> > Ok,
> >
> > This is a long shot, but how are you executing your code? In, say, a
> > pgAdmin III SQL window with the statements below entered line after line?
> >
> > If so, I believe it will be treated as one transaction. With the default
> > settings Postgres would have to keep track of everything done to be able
> > to roll back all the changes if it failed. I would believe that would
> > force it to keep track of all 56 million rows combined in memory
> > (probably just the oid column - I am sure the other more experienced
> > PostgreSQL wizards can verify), but still that can take a lot of
> > resources.
> >
> > If by chance you are doing it in one sweep, try executing it in separate
> > steps so the commit can be executed.
> >
> > Hopefully you won't run out of resources then.
> >
> > Are you doing a drop / create, say, every night to update your data? If
> > so, perhaps using TRUNCATE octanenights might be more efficient.
> >
> > If you must drop a full table, perhaps a vacuum should be done
> > afterwards?
> >
> > Mike
> >
> > On Fri, 2004-08-06 at 21:32, Kevin Bartz wrote:
> > > Mike, thanks so much for your reply. I'm sorry for not showing you my
> > > SQL. I didn't show it because I couldn't manage to boil it down to
> > > something reproducible that everyone could try. But here's what it was:
> > >
> > > drop table octanenights;
> > > CREATE TABLE octanenights (member_id varchar(100), campaign_id
> > > varchar(100), catalog_type varchar(100), pushed int, delivered int,
> > > clicks int, opened int, month varchar(100), type1 int, type2 int,
> > > type3 int, type4 int, type5 int);
> > >
> > > copy octanenights from
> > > '/home/kevin/octanenights/proc/uberfile/uberfile1.txt' null as '';
> > > copy octanenights from
> > > '/home/kevin/octanenights/proc/uberfile/uberfile2.txt' null as '';
> > > copy octanenights from
> > > '/home/kevin/octanenights/proc/uberfile/uberfile3.txt' null as '';
> > >
> > > select * from octanenights limit 10;
> > > alter table octanenights rename to octanenights_raw;
> > >
> > > -- de-dup the table
> > > select member_id, campaign_id, catalog_type, pushed, delivered, clicks,
> > > opened, month, type1, type2, type3, type4, type5
> > > into octanenights
> > > from octanenights_raw
> > > group by member_id, campaign_id, catalog_type, pushed, delivered,
> > > clicks, opened, month, type1, type2, type3, type4, type5;
> > >
> > > Let me tell you a little about octanenights. It's a file of about
> > > 36,000,000 rows, each describing an e-mail sent. Unfortunately, there
> > > are duplicate records scattered throughout the table, which I do not
> > > care about. One might suggest that I could've used uniq from the
> > > command line for this, but the data were not sorted originally and the
> > > duplicate records may be scattered anywhere in the table. The objective
> > > in the final line is to de-dup the table and place the result in
> > > octanenights, leaving the original in octanenights_raw in case I ever
> > > need to refer back to it.
> > >
> > > MS SQL Server, with as much RAM and less clock speed, de-dups the table
> > > in about six minutes. The de-duped version has about 26,000,000 rows.
> > > The final line is where Postgres gobbles up all my swap and RAM and
> > > then conks out completely.
> > >
> > > Am I doing something wrong? Maybe there was a better way to approach
> > > this problem? I'd be open to suggestions of any kind, since I'm still
> > > very, very new to the world of optimizing Postgres.
> > >
> > > Kevin
> > >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Manfred Koizar
> > > Sent: Tuesday, August 03, 2004 3:04 AM
> > > To: Kevin Bartz
> > > Cc: [EMAIL PROTECTED]
> > > Subject: Re: [GENERAL] Out of swap space & memory
> > >
> > > On Mon, 2 Aug 2004 20:08:45 -0700, "Kevin Bartz"
> > > <[EMAIL PROTECTED]> wrote:
> > > >is there any way I can run this query?
> > >
> > > What query? You didn't show us your SQL.
> > >
> > > Servus
> > >  Manfred
> > >
> > > ---------------------------(end of broadcast)---------------------------
> > > TIP 3: if posting/reading through Usenet, please send an appropriate
> > >        subscribe-nomail command to [EMAIL PROTECTED] so that your
> > >        message can get through to the mailing list cleanly
> > >
> > > ---------------------------(end of broadcast)---------------------------
> > > TIP 9: the planner will ignore your desire to choose an index scan if
> > >        your joining column's datatypes do not match
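A rough sketch of one way to keep the de-dup query quoted above from
exhausting memory. None of this comes from the thread: the value is purely
illustrative, sort_mem is the PostgreSQL 7.4-era setting (renamed work_mem
in 8.0), and enable_hashagg only helps if the planner chose an in-memory
hash aggregate in the first place.

-- Cap the memory available to each sort/aggregate step (sort_mem is in KB).
SET sort_mem = 65536;
-- If an in-memory hash aggregate is what eats the RAM, disabling it makes
-- the planner fall back to a sort-based GROUP BY, which can spill to disk.
SET enable_hashagg = off;

SELECT member_id, campaign_id, catalog_type, pushed, delivered, clicks,
       opened, month, type1, type2, type3, type4, type5
INTO octanenights
FROM octanenights_raw
GROUP BY member_id, campaign_id, catalog_type, pushed, delivered, clicks,
         opened, month, type1, type2, type3, type4, type5;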
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
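As for the question at the top of the thread - finding rows that agree on
the six key columns but differ in some other column, so the data provider
can be notified - a rough sketch of one possible check follows. The thread
never says which six columns are the keys, so the split below (the first
six columns as keys, the rest as payload) is purely an assumption to
illustrate the shape of the query:

-- Groups of rows that match on the assumed six key columns but take more
-- than one value in at least one of the remaining columns.
SELECT member_id, campaign_id, catalog_type, pushed, delivered, clicks
FROM octanenights_raw
GROUP BY member_id, campaign_id, catalog_type, pushed, delivered, clicks
HAVING count(DISTINCT opened) > 1
    OR count(DISTINCT month)  > 1
    OR count(DISTINCT type1)  > 1
    OR count(DISTINCT type2)  > 1
    OR count(DISTINCT type3)  > 1
    OR count(DISTINCT type4)  > 1
    OR count(DISTINCT type5)  > 1;

Each group this returns is a set of rows that are identical on the assumed
key columns but not identical everywhere, which is exactly the anomaly worth
reporting back. Note that count(DISTINCT ...) ignores NULLs, so a
NULL-versus-value disagreement in a non-key column would slip past this
particular check.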