[PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread hans wulf
Hi,

I need an ANTI-JOIN (not exists SELECT something from table.../ left join table 
WHERE table.id IS NULL) on the same table. Acutally I have an index to serve 
the not exists question, but the query planner chooses to to a bitmap heap scan.

The table has 100 Mio rows, so doing a heap scan is messed up...

It would be really fast if Postgres could compare the to indicies. Does 
Postgres have to visit the table for this ANTI-JOIN?

I know the table has to be visitied at some point to serve the MVCC, but why so 
early? Can NOT ESISTS only be fixed by the table, because it could miss 
soemthing otherwise?

-- 
Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir
belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] big joins not converging

2011-03-11 Thread fork
Dan Ancona  vizbang.com> writes:

>  his is a choice between  
> developing some in-house capacity for this and sending people to  
> various vendors so we'll probably lean on the vendors for now, at  
> least while we work on it. 

I would try to do the record matching in house and see how far you get, even if
you are talking to vendors concurrently.  You might get lucky, and you will
learn a lot about your data and how much to expect and pay for vendor solutions.

I would: Try building multi column indices on both tables for what you think are
the same rows, and match deterministically (if you have a key like social
security, then do this again on full names).  Examine your data to see what
hits, what misses, what hits multiple.  If you know there is a "good" and an
"iffy" table, you can use a left outer, otherwise you need a full outer.   Then
put all your leftovers from each into new tables, and try again with something
fuzzy.

If you build the indices and use "=" and it is still slow, ask again here --
that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic  
> discipline!

Indeed.  Look for "blocking" and "editing" with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results.  I
am right now matching building permits to assessor parcels  I wish I was
using PG ...


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Maciek Sakrejda
> I know the table has to be visitied at some point to serve the MVCC, but why 
> so early? Can NOT ESISTS only be fixed by the table, because it could miss 
> soemthing otherwise?

Possibly because the index entries you're anti-joining against may
point to deleted tuples, so you would erroneously omit rows from the
join result if you skip the visibility check?

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Table partitioning problem

2011-03-11 Thread Robert Haas
On Thu, Mar 10, 2011 at 3:25 AM, Samba GUEYE  wrote:
> The measure insertion is successful but problems raise up when inserting in
> the simpleMeasure table because it can't find the foreign key inserted the
> measure table and do not look at the partitionned tables

Yes, that's how it works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-11 Thread fork
Marti Raudsepp  juffo.org> writes:

> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
> will improve write throughput significantly.

Sounds good.

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table, 

Like the following?  Will it rebuild the indexes in a sensical way?

BEGIN;
CREATE TABLE tempfoo as SELECT *, foo + bar AS newcol FROM bar;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol;
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition
> and most indexes will fit in your cache. 

Is there a rule of thumb on tradeoffs in a partitioned table?  About half the
time, I will want to do GROUP BY's that use the partition column, but about half
the time I won't.  (I would use the partition column whatever I am most likely
to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc). Often a query would be "SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc", but often it would be "GROUP BY state" or whatever with no
age5 component.

I know I can experiment ;), but it takes a while to load anything, and i would
rather stand on the shoulders.

Thanks so much for all your helps!




-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread hans wulf
Thanks for the answer.

so there's no way around this problem? A nice index bitmap merge thing would be 
super fast. Big table ANTI JOIN queries with only a few results expected, are 
totally broken, if this is true. 

This way the query breaks my neck. This is a massive downside of postgres which 
makes this kind of query impossible. Mysql gives you the answer in a few 
seconds :-(



> Possibly because the index entries you're anti-joining against may
> point to deleted tuples, so you would erroneously omit rows from the
> join result if you skip the visibility check?
> 
> ---
> Maciek Sakrejda | System Architect | Truviso
> 
> 1065 E. Hillsdale Blvd., Suite 215
> Foster City, CA 94404
> (650) 242-3500 Main
> www.truviso.com

-- 
Schon gehört? GMX hat einen genialen Phishing-Filter in die
Toolbar eingebaut! http://www.gmx.net/de/go/toolbar

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Kenneth Marshall
On Fri, Mar 11, 2011 at 06:54:39PM +0100, hans wulf wrote:
> Thanks for the answer.
> 
> so there's no way around this problem? A nice index bitmap merge thing would 
> be super fast. Big table ANTI JOIN queries with only a few results expected, 
> are totally broken, if this is true. 
> 
> This way the query breaks my neck. This is a massive downside of postgres 
> which makes this kind of query impossible. Mysql gives you the answer in a 
> few seconds :-(
> 
> 

Super! I am glad that MySQL can meet your needs. No software is
perfect and you should definitely chose based on your use-case.

Regards,
Ken

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Kevin Grittner
Kenneth Marshall  wrote:
> On Fri, Mar 11, 2011 at 06:54:39PM +0100, hans wulf wrote:
>> so there's no way around this problem? A nice index bitmap merge
>> thing would be super fast. Big table ANTI JOIN queries with only
>> a few results expected, are totally broken, if this is true. 
>> 
>> This way the query breaks my neck. This is a massive downside of
>> postgres which makes this kind of query impossible. Mysql gives
>> you the answer in a few seconds :-(
> 
> Super! I am glad that MySQL can meet your needs. No software is
> perfect and you should definitely chose based on your use-case.
 
Well, as far as I can see we haven't yet seen near enough
information to diagnose the issue, suggest alternative ways to write
the query which might perform better, or determine whether there's
an opportunity to improve the optimizer here.
 
Hans, please read this page and provide more detail:
 
http://wiki.postgresql.org/wiki/SlowQueryQuestions
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance