[PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Given that a massive UPDATE ... SET foo = bar || ' ' || baz on a 12 million
row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but the UPDATE is currently using only a little
memory and CPU (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT
earlier, it was using 70% of the memory).
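
For reference, roughly what I am setting per session (just a sketch; the
work_mem value is the one above, and the maintenance_work_mem figure is only an
example -- that is the setting index builds actually use):

SET work_mem = '2048MB';           -- applies per sort/hash operation, not per query
SET maintenance_work_mem = '1GB';  -- example value; used by CREATE INDEX and VACUUM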

The data is not particularly sensitive; if something happened and it rolled
back, that wouldn't be the end of the world, so I wonder if I can use
"dangerous" settings for WAL, checkpoints, etc.  There also aren't a lot of
concurrent hits on the DB, though there are a few.

I am loath to create a new table from a SELECT, since the indexes themselves
take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would
love to speed these up, mostly just for my own impatience...  






Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Merlin Moncure <...@gmail.com> writes:

> > I am loath to create a new table from a select, since the indexes
> > themselves take a really long time to build.
> 
> You are aware that updating the field for the entire table, especially
> if there is an index on it (or on any field being updated), will cause
> all your indexes to be rebuilt anyway?  When you update a record, it
> gets a new position in the table, and a new index entry pointing at that
> position.  Insert/select to a temp table + truncate + insert/select back
> is usually going to be faster and will save you the reindex/cluster.
> OTOH, if you have foreign keys it can be a headache.

Hmph.  I guess I will have to find a way to automate it, since there will be a
lot of times I want to do this. 
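
For my own notes, the pattern boils down to something like this (a sketch with
made-up names; the real thing would list out all ~100 columns):

-- insert/select to a temp table, truncate, then insert/select back.
BEGIN;
CREATE TEMP TABLE scratch AS
    SELECT id, bar, baz, bar || ' ' || baz AS foo
    FROM bigtable;
TRUNCATE bigtable;
INSERT INTO bigtable (id, bar, baz, foo)
    SELECT id, bar, baz, foo FROM scratch;
COMMIT;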

> > As the title alludes, I will also be doing GROUP BY's on the data, and would
> > love to speed these up, mostly just for my own impatience...
> 
> need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry with general guidelines for a DB that is
used only for batch analysis rather than transaction processing.  Like "put all
your temp tables on a different disk" or whatever.  I will post specifics later.




Re: [PERFORM] big joins not converging

2011-03-10 Thread fork
Steve Atkins <...@blighty.com> writes:

> 
> 
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
> 
> > Hi postgressers -
> > 
> > As part of my work with voter file data, I pretty regularly have to join
> > one large-ish (over 500k rows) table to another.  Sometimes this is via a
> > text field (countyname) + integer (voter id).  I've noticed sometimes this
> > converges and sometimes it doesn't, seemingly regardless of how I index
> > things.

By "converge" you mean "finish running" -- "converge" has a lot of other
overtones for us amateur math types.

Note that I think you are doing "record linkage", which is a stepchild academic
discipline in its own right these days.  It might bear some research.  There is
also a CDC matching program for text files, freely downloadable for Windows
(ack), if you hunt for it.

For now, my first thought is that you should try a few different matching
passes, maybe via PL/pgSQL functions, cascading the non-hits to the next step
in the process while shrinking your tables.  Uppercase everything and delete
all spaces, etc.  First use equality on all columns, which should be able to
use indices, and separate out those records.  Then try equality on just a few
columns.  Then try some super fuzzy regexes on a few columns.  Etc.
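
Something like this for the normalization and the first, exact pass (a sketch;
every table and column name here is invented):

-- Normalize once into working tables so plain "=" (and its indexes) can be used.
CREATE TEMP TABLE v_norm AS
    SELECT voter_id, upper(replace(countyname, ' ', '')) AS county_key,
           lastname, birthdate
    FROM voters;
CREATE INDEX v_norm_key ON v_norm (county_key, voter_id);
-- (build r_norm the same way from the other table)

-- Pass 1: strict equality on everything; these pairs are settled.
CREATE TEMP TABLE hits1 AS
    SELECT v.voter_id, r.record_id
    FROM v_norm v
    JOIN r_norm r ON r.county_key = v.county_key AND r.voter_id = v.voter_id;

-- Later passes run only on the rows that did not land in hits1, with looser tests.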

You will also have to give some thought to scoring a match: a perfect match
is a 1, but, say, name and birthday the same with everything else different
might be a .75, etc.
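
In SQL, the scoring might look something like this (the weights, and the
candidate_pairs table, are purely illustrative):

-- Score candidate pairs; exact name and birthday carry most of the weight.
SELECT c.voter_id, c.record_id,
       CASE WHEN v.lastname   = r.lastname   THEN 0.40 ELSE 0 END
     + CASE WHEN v.birthdate  = r.birthdate  THEN 0.35 ELSE 0 END
     + CASE WHEN v.county_key = r.county_key THEN 0.25 ELSE 0 END AS score
FROM candidate_pairs c
JOIN v_norm v ON v.voter_id  = c.voter_id
JOIN r_norm r ON r.record_id = c.record_id
ORDER BY score DESC;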

Also, soundex(), levenshtein(), and the other fuzzy string tools in the
fuzzystrmatch contrib module are your friends.  I want to write a version of
SAS's COMPGED for Postgres, but I haven't gotten around to it yet.
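
For example, with the fuzzystrmatch contrib module installed you already get:

-- soundex(), levenshtein(), and difference() all ship with fuzzystrmatch.
SELECT soundex('Ancona') = soundex('Ancuna')  AS same_soundex,    -- true
       levenshtein('Ancona', 'Ancuna')        AS edit_distance,   -- 1
       difference('Ancona', 'Ancuna')         AS soundex_score;   -- 0 to 4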




Re: [PERFORM] big joins not converging

2011-03-11 Thread fork
Dan Ancona <...@vizbang.com> writes:

> [T]his is a choice between developing some in-house capacity for this and
> sending people to various vendors, so we'll probably lean on the vendors
> for now, at least while we work on it.

I would try to do the record matching in house and see how far you get, even if
you are talking to vendors concurrently.  You might get lucky, and you will
learn a lot about your data and how much to expect and pay for vendor solutions.

I would: try building multi-column indices on both tables over the fields you
think identify the same rows, and match deterministically (if you have a key
like a social security number, use that; then do it again on full names).
Examine your data to see what hits, what misses, and what hits multiple times.
If you know there is a "good" and an "iffy" table, you can use a left outer
join; otherwise you need a full outer join.  Then put all the leftovers from
each side into new tables, and try again with something fuzzy.
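
In rough outline (all names invented; "good" and "iffy" stand in for your two
source tables):

-- Deterministic pass on a shared key.
CREATE INDEX good_match_idx ON good (ssn, fullname);
CREATE INDEX iffy_match_idx ON iffy (ssn, fullname);

-- The full outer join shows what hits, what misses on each side, and
-- (by counting per id) what hits multiple times.
SELECT g.id AS good_id, i.id AS iffy_id
FROM good g
FULL OUTER JOIN iffy i ON i.ssn = g.ssn AND i.fullname = g.fullname;

-- Leftovers from each side go into new tables for the fuzzy passes.
CREATE TABLE good_leftover AS
    SELECT g.* FROM good g
    WHERE NOT EXISTS (SELECT 1 FROM iffy i
                      WHERE i.ssn = g.ssn AND i.fullname = g.fullname);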

If you build the indices and use "=" and it is still slow, ask again here --
that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic  
> discipline!

Indeed.  Look for "blocking" and "editing" with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results.
I am right now matching building permits to assessor parcels ... I wish I were
using PG ...




Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-11 Thread fork
Marti Raudsepp <...@juffo.org> writes:

> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
> will improve write throughput significantly.

Sounds good.
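
For the record, that translates to these postgresql.conf lines (the values are
straight from the suggestion; defaults noted for comparison):

# Fewer, more spread-out checkpoints at the cost of longer crash recovery.
checkpoint_segments = 100            # default is 3
checkpoint_completion_target = 0.9   # default is 0.5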

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table, 

Like the following?  Will it rebuild the indexes in a sensible way?

BEGIN;
-- Build the new data once, outside the target table.
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
-- The new column needs a type; numeric is assumed here to match foo + bar.
ALTER TABLE foo ADD COLUMN newcol numeric;
-- This relies on tempfoo's column order matching foo's (newcol last in both);
-- foo's existing indexes stay defined and get their entries added during the INSERT.
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition
> and most indexes will fit in your cache. 

Is there a rule of thumb on the tradeoffs in a partitioned table?  About half
the time I will want to do GROUP BY's that use the partition column, but about
half the time I won't.  (I would use as the partition column whatever I would
most likely cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc). Often a query would be "SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc", but often it would be "GROUP BY state" or whatever with no
age5 component.
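
To make the question concrete, here is roughly what I imagine the
inheritance-based setup (the standard approach as of 9.0) looking like; the
names are mine, and the parent table is assumed to exist already:

-- One child table per age5 band, hanging off the parent table pums.
CREATE TABLE pums_age00 (CHECK (age5 = 0)) INHERITS (pums);
CREATE TABLE pums_age05 (CHECK (age5 = 5)) INHERITS (pums);
-- ... and so on, one child per band; rows get loaded into the children
-- directly (or routed through a trigger on the parent).

-- With constraint_exclusion = partition (the default), a query that filters
-- on age5 scans only the matching children, e.g.
--   SELECT ... FROM pums WHERE age5 = 5 GROUP BY race;
-- but a GROUP BY state with no age5 condition still scans every child.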

I know I can experiment ;), but it takes a while to load anything, and I would
rather stand on the shoulders of others.

Thanks so much for all your help!



