Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote: > John A Meinel wrote: > > >>Surely this isn't what you have. You have *no* loop here, and you have >>stuff like: >> AND >>(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); >> >>I'm guessing this isn't your last version of the function. >> >>As far as putting

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
John A Meinel wrote: > Surely this isn't what you have. You have *no* loop here, and you have > stuff like: > AND > (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); > > I'm guessing this isn't your last version of the function. > > As far as putting the CREATE TEMP TABLE inside th

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote: > Matthew Schumacher wrote: > >>Tom Lane wrote: >> >> >> >>>I don't really see why you think that this path is going to lead to >>>better performance than where you were before. Manipulation of the >>>temp table is never going to be free, and IN (sub-select) is always >>>

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Matthew Schumacher wrote: > Tom Lane wrote: > > >>I don't really see why you think that this path is going to lead to >>better performance than where you were before. Manipulation of the >>temp table is never going to be free, and IN (sub-select) is always >>inherently not fast, and NOT IN (sub-

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Tom Lane wrote: > I don't really see why you think that this path is going to lead to > better performance than where you were before. Manipulation of the > temp table is never going to be free, and IN (sub-select) is always > inherently not fast, and NOT IN (sub-select) is always inherently > aw

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Tom Lane
John A Meinel <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> I don't really see why you think that this path is going to lead to >> better performance than where you were before. > So for an IN (sub-select), does it actually pull all of the rows from > the other table, or is the planner smart en

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
John A Meinel wrote: > Matthew Schumacher wrote: > > I recommend that you drop and re-create the temp table. There is no > reason to have it around, considering you delete and re-add everything. > That means you never have to vacuum it, since it always only contains > the latest rows. Whenever I

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Tom Lane wrote: > Matthew Schumacher <[EMAIL PROTECTED]> writes: > >> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) >> LOOP >>_token := intokenary[i]; >>INSERT INTO bayes_token_tmp VALUES (_token); >> END LOOP; > > >> UPDATE >>bayes_token >> SET >>spam_count

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Tom Lane
Matthew Schumacher <[EMAIL PROTECTED]> writes: > for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) > LOOP > _token := intokenary[i]; > INSERT INTO bayes_token_tmp VALUES (_token); > END LOOP; > UPDATE > bayes_token > SET > spam_count = greatest_int(spam_co

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote: > Okay, > > Here is the status of the SA updates and a question: > > Michael got SA changed to pass an array of tokens to the proc so right > there we gained a ton of performance due to connections and transactions > being grouped into one per email instead of one per toke

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread PFC
What I really want to do is have the token array available as a record so that I can query against it, but not have it take up the resources of a real table. If I could copy from an array into a record then I can even get rid of the loop. Anyone have any thoughts on how to do this? You coul

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Okay, Here is the status of the SA updates and a question: Michael got SA changed to pass an array of tokens to the proc so right there we gained a ton of performance due to connections and transactions being grouped into one per email instead of one per token. Now I am working on making the pro

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Michael Parker
Jim C. Nasby wrote: >I'm not sure who's responsible for DBI::Pg (Josh?), but would it make >sense to add better support for bytea to DBI::Pg? ISTM there should be a >better way of doing this than adding gobs of \'s. > > It has support for binding a bytea parameter, but in this case we're trying

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Jim C. Nasby
On Mon, Aug 01, 2005 at 01:28:24PM -0800, Matthew Schumacher wrote: > PFC wrote: > > > > > >> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 1); > > > > > > Try adding more backslashes until it works (seems that you need > > or something). > > Don't DBI convert t

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Matthew Schumacher
PFC wrote: > > >> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 1); > > > Try adding more backslashes until it works (seems that you need > or something). > Don't DBI convert the language types to postgres quoted forms on its > own ? > Your right I am find

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread PFC
select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 1); Try adding more backslashes until it works (seems that you need or something). Don't DBI convert the language types to postgres quoted forms on its own ? ---(end of broadcast)-

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Matthew Schumacher
Tom Lane wrote: > > Revised insertion procedure: > > > CREATE or replace FUNCTION put_tokens (_id INTEGER, > _tokens BYTEA[], > _spam_count INTEGER, > _ham_count INTEGER, > _at

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-01 Thread Matthew Schumacher
Tom Lane wrote: > Michael Parker <[EMAIL PROTECTED]> writes: > >>sub bytea_esc { >> my ($str) = @_; >> my $buf = ""; >> foreach my $char (split(//,$str)) { >>if (ord($char) == 0) { $buf .= "000"; } >>elsif (ord($char) == 39) { $buf .= "047"; } >>elsif (ord($char) == 92) { $b

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Tom Lane
Michael Parker <[EMAIL PROTECTED]> writes: > sub bytea_esc { > my ($str) = @_; > my $buf = ""; > foreach my $char (split(//,$str)) { > if (ord($char) == 0) { $buf .= "000"; } > elsif (ord($char) == 39) { $buf .= "047"; } > elsif (ord($char) == 92) { $buf .= "134"; } >

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Tom Lane
Michael Parker <[EMAIL PROTECTED]> writes: > The next hurdle, and I've just posted to the DBD::Pg list, is > escaping/quoting the token strings. If you're trying to write a bytea[] literal, I think the most reliable way to write the individual bytes is nnn where nnn is *octal*. The id

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Michael Parker
Hi All, As a SpamAssassin developer, who by my own admission has real problem getting PostgreSQL to work well, I must thank everyone for their feedback on this issue. Believe me when I say what is in the tree now is a far cry from what used to be there, orders of magnitude faster for sure. I thi

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Andreas Pflug
Jim C. Nasby wrote: On Sun, Jul 31, 2005 at 08:51:06AM -0800, Matthew Schumacher wrote: Ok, here is the current plan. Change the spamassassin API to pass a hash of tokens into the storage module, pass the tokens to the proc as an array, start a transaction, load the tokens into a temp table us

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Jim C. Nasby
On Sun, Jul 31, 2005 at 08:51:06AM -0800, Matthew Schumacher wrote: > Ok, here is the current plan. > > Change the spamassassin API to pass a hash of tokens into the storage > module, pass the tokens to the proc as an array, start a transaction, > load the tokens into a temp table using copy, sele

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread Matthew Schumacher
Ok, here is the current plan. Change the spamassassin API to pass a hash of tokens into the storage module, pass the tokens to the proc as an array, start a transaction, load the tokens into a temp table using copy, select the tokens distinct into the token table for new tokens, update the token t

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread John Arbash Meinel
John Arbash Meinel wrote: >Matthew Schumacher wrote: > > > >>All it's doing is trying the update before the insert to get around the >>problem of not knowing which is needed. With only 2-3 of the queries >>implemented I'm already back to running about the same speed as the >>original SA proc th

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread John Arbash Meinel
Matthew Schumacher wrote: >Tom Lane wrote: > > > >>I looked into this a bit. It seems that the problem when you wrap the >>entire insertion series into one transaction is associated with the fact >>that the test does so many successive updates of the single row in >>bayes_vars. (VACUUM VERBOSE

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread John Arbash Meinel
Matthew Schumacher wrote: >All it's doing is trying the update before the insert to get around the >problem of not knowing which is needed. With only 2-3 of the queries >implemented I'm already back to running about the same speed as the >original SA proc that is going to ship with SA 3.1.0. > >A

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread Matthew Schumacher
Tom Lane wrote: > I looked into this a bit. It seems that the problem when you wrap the > entire insertion series into one transaction is associated with the fact > that the test does so many successive updates of the single row in > bayes_vars. (VACUUM VERBOSE at the end of the test shows it cl

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread Tom Lane
Matthew Schumacher <[EMAIL PROTECTED]> writes: > One thing that is still very puzzling to me is why this runs so much > slower when I put the data.sql in a transaction. Obviously transactions > are acting different when you call a proc a zillion times vs an insert > query. I looked into this a bi

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-30 Thread Karim Nassar
On Sat, 2005-07-30 at 00:46 -0800, Matthew Schumacher wrote: > I'll do some more testing on Monday, perhaps grouping even 200 tokens at > a time using your method will yield significant gains, but probably not > as dramatic as it does using my loading benchmark. In that case, some of the clauses

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-30 Thread Matthew Schumacher
Karim Nassar wrote: > > [EMAIL PROTECTED]:~/k-bayesBenchmark$ time ./test.pl > <-- snip db creation stuff --> > 17:18:44 -- START > 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records > 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into > bayes_token > 17:19:50 -- AFTER bayes_

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Karim Nassar
On Fri, 2005-07-29 at 09:47 -0700, Josh Berkus wrote: > Try changing: > wal_buffers = 256 > > and try Bruce's stop full_page_writes patch. > > > I guess we see the real culprit here. Anyone surprised it's the WAL? > > Nope. On high-end OLTP stuff, it's crucial that the WAL have its own > dedi

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Matthew Schumacher
Ok, here is where I'm at, I reduced the proc down to this: CREATE FUNCTION update_token (_id INTEGER, _token BYTEA, _spam_count INTEGER, _ham_count INTEGER, _atime INTEGER) RETU

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Tom Lane
Andrew McMillan <[EMAIL PROTECTED]> writes: > On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote: >> How often should this table be vacuumed, every 5 minutes? > I would be tempted to vacuum after each e-mail, in this case. Perhaps the bulk of the transient states should be done in a temp

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Matthew Schumacher
Andrew McMillan wrote: > > For the data in question (i.e. bayes scoring) it would seem that not > much would be lost if you did have to restore your data from a day old > backup, so perhaps fsync=false is OK for this particular application. > > Regards, > And

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Andrew McMillan
On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote: > > > > On my laptop this takes: > > > > real1m33.758s > > user0m4.285s > > sys 0m1.181s > > > > One interesting effect is the data in bayes_vars has a huge number of > > updates and needs vacuum _frequently_. After the ru

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread PFC
Also, this test goes a bit faster with sync turned off, if mysql isn't using sync that would be why it's so much faster. Anyone know what the default for mysql is? For InnoDB I think it's like Postgres (only slower) ; for MyISAM it's no fsync, no transactions, no crash tolerance of any ki

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Matthew Schumacher
Ok, Here is something new, when I take my data.sql file and add a begin and commit at the top and bottom, the benchmark is a LOT slower? My understanding is that it should be much faster because fsync isn't called until the commit instead of on every sql command. I must be missing something here

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Luke Lonergan
Tom, On 7/27/05 11:19 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote: > Matthew Schumacher <[EMAIL PROTECTED]> writes: >> After playing with various indexes and what not I simply am unable to >> make this procedure perform any better. Perhaps someone on the list can >> spot the bottleneck and reveal w

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread John Arbash Meinel
Josh Berkus wrote: >Dennis, > > > >> EXCEPTION WHEN unique_violation THEN >> >> > >I seem to remember that catching an exception in a PL/pgSQL procedure was a >large performance cost. It'd be better to do UPDATE ... IF NOT FOUND. > > > Actually, he was doing an implicit UPDATE IF N

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Matthew Schumacher
Andrew McMillan wrote: > On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote: > >>Ok, I finally got some test data together so that others can test >>without installing SA. >> >>The schema and test dataset is over at >>http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz >> >>I hav

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread Josh Berkus
Dennis, >       EXCEPTION WHEN unique_violation THEN I seem to remember that catching an exception in a PL/pgSQL procedure was a large performance cost. It'd be better to do UPDATE ... IF NOT FOUND. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(en

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Luke Lonergan
Alvaro, On 7/29/05 6:23 AM, "Alvaro Herrera" <[EMAIL PROTECTED]> wrote: > On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote: > >> I guess we see the real culprit here. Anyone surprised it's the WAL? > > So what? Are you planning to suggest people to turn fsync=false? That's not t

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Luke Lonergan
Tom, On 7/29/05 7:12 AM, "Tom Lane" <[EMAIL PROTECTED]> wrote: > "Luke Lonergan" <[EMAIL PROTECTED]> writes: >> I guess we see the real culprit here. Anyone surprised it's the WAL? > > You have not proved that at all. As Alvaro pointed out, fsync has impact on more than WAL, so good point. Int

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Josh Berkus
Luke, > work_mem = 131072 # min 64, size in KB Incidentally, this is much too high for an OLTP application, although I don't think this would have affected the test. > shared_buffers = 16000 # min 16, at least max_connections*2, 8KB > each checkpoint_segments = 128

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Tom Lane
"Luke Lonergan" <[EMAIL PROTECTED]> writes: > I guess we see the real culprit here. Anyone surprised it's the WAL? You have not proved that at all. I haven't had time to look at Matthew's problem, but someone upthread implied that it was doing a separate transaction for each word. If so, collap

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Alvaro Herrera
On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote: > I guess we see the real culprit here. Anyone surprised it's the WAL? So what? Are you planning to suggest people to turn fsync=false? I just had a person lose 3 days of data on some tables because of that, even when checkpoints w

Re: [PERFORM] Performance problems testing with Spamassassin

2005-07-29 Thread Luke Lonergan
PM To: Matthew Schumacher Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0 On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote: > > Ok, I finally got some test data together so that others can test > without installi

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Dennis Bjorklund
On Wed, 27 Jul 2005, Matthew Schumacher wrote: > Then they do this to insert the token: > > INSERT INTO bayes_token ( > id, > token, > spam_count, > ham_count, > atime > ) VALUES ( > ?, > ?, > ?, > ?, > ? > ) ON DUPLICATE KEY > UPDATE > spam_count = GREATEST(spam_count +

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Gavin Sherry
zOn Thu, 28 Jul 2005, Matthew Schumacher wrote: > Gavin Sherry wrote: > > > > > I had a look at your data -- thanks. > > > > I have a question though: put_token() is invoked 120596 times in your > > benchmark... for 616 messages. That's nearly 200 queries (not even > > counting the 1-8 (??) inside

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Andrew McMillan
On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote: > > Ok, I finally got some test data together so that others can test > without installing SA. > > The schema and test dataset is over at > http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz > > I have a pretty fast machine w

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Matthew Schumacher
Gavin Sherry wrote: > > I had a look at your data -- thanks. > > I have a question though: put_token() is invoked 120596 times in your > benchmark... for 616 messages. That's nearly 200 queries (not even > counting the 1-8 (??) inside the function itself) per message. Something > doesn't seem ri

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Gavin Sherry
On Thu, 28 Jul 2005, Matthew Schumacher wrote: > Karim Nassar wrote: > > On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote: > > > > > >>I put the rest of the schema up at > >>http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone > >>needs to see it too. > > > > > > Do you h

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-28 Thread Matthew Schumacher
Karim Nassar wrote: > On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote: > > >>I put the rest of the schema up at >>http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone >>needs to see it too. > > > Do you have sample data too? > Ok, I finally got some test data togeth

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0 Bayes module.

2005-07-28 Thread Merlin Moncure
> I'm not sure how much this has been discussed on the list, but wasn't > able to find anything relevant in the archives. > > The new Spamassassin is due out pretty soon. They are currently testing > 3.1.0pre4. One of the things I hope to get out of this release is bayes > word stats moved to a

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-27 Thread Karim Nassar
On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote: > I put the rest of the schema up at > http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone > needs to see it too. Do you have sample data too? -- Karim Nassar Collaborative Computing Lab of NAU Office: (928) 523 5868 -

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0 Bayes module.

2005-07-27 Thread Tom Lane
Matthew Schumacher <[EMAIL PROTECTED]> writes: > After playing with various indexes and what not I simply am unable to > make this procedure perform any better. Perhaps someone on the list can > spot the bottleneck and reveal why this procedure isn't performing that > well or ways to make it bette

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-27 Thread Matthew Schumacher
Josh Berkus wrote: > Matt, > > Well, it might be because we don't have a built-in GREATEST or LEAST prior to > 8.1. However, it's pretty darned easy to construct one. I was more talking about min() and max() but yea, I think you knew where I was going with it... > > Well, there's the genera

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0 Bayes module.

2005-07-27 Thread Josh Berkus
Matt, > UPDATE bayes_vars SET > $token_count_update > newest_token_age = GREATEST(newest_token_age, ?), > oldest_token_age = LEAST(oldest_token_age, ?) > WHERE id = ? > > > I think the reason why the procedure was written for postgres was > because of the greatest and least statements perf

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-27 Thread Matthew Schumacher
Josh Berkus wrote: > Matt, > > >>After playing with various indexes and what not I simply am unable to >>make this procedure perform any better. Perhaps someone on the list can >>spot the bottleneck and reveal why this procedure isn't performing that >>well or ways to make it better. > > > Wel

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0 Bayes module.

2005-07-27 Thread Josh Berkus
Matt, > After playing with various indexes and what not I simply am unable to > make this procedure perform any better.  Perhaps someone on the list can > spot the bottleneck and reveal why this procedure isn't performing that > well or ways to make it better. Well, my first thought is that this