John A Meinel wrote:
> Surely this isn't what you have. You have *no* loop here, and you have
> stuff like:
> AND
> (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
>
> I'm guessing this isn't your last version of the function.
>
> As far as putting the CREATE TEMP TABLE inside th
Tom Lane wrote:
> I don't really see why you think that this path is going to lead to
> better performance than where you were before. Manipulation of the
> temp table is never going to be free, and IN (sub-select) is always
> inherently not fast, and NOT IN (sub-select) is always inherently
> awful.
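For anyone hitting the same wall: the usual 8.0-era rewrite of a NOT IN
(sub-select) is an outer-join anti-join, which the planner handles far
better. A minimal sketch, with table and column names borrowed from the
thread (not necessarily what the final function used):

  -- tokens present in the temp table but missing from bayes_token;
  -- the outer join leaves b.token NULL exactly for the missing rows
  SELECT t.token
  FROM bayes_token_tmp t
  LEFT JOIN bayes_token b ON b.token = t.token
  WHERE b.token IS NULL;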
John A Meinel <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I don't really see why you think that this path is going to lead to
>> better performance than where you were before.
> So for an IN (sub-select), does it actually pull all of the rows from
> the other table, or is the planner smart enough
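The question is cut off here, but the planner's behavior is easy to check
directly; a quick sketch, table names assumed from the thread:

  -- shows whether the IN (sub-select) runs as a hashed subplan, a join,
  -- or a per-row subplan on this particular data and version
  EXPLAIN ANALYZE
  SELECT token
  FROM bayes_token
  WHERE token IN (SELECT token FROM bayes_token_tmp);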
John A Meinel wrote:
> Matthew Schumacher wrote:
>
> I recommend that you drop and re-create the temp table. There is no
> reason to have it around, considering you delete and re-add everything.
> That means you never have to vacuum it, since it always only contains
> the latest rows.
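A minimal sketch of that pattern; ON COMMIT DROP (supported in 8.0) gives
the drop-and-re-create behavior automatically:

  BEGIN;
  -- created fresh in every transaction and dropped at COMMIT, so the
  -- table never accumulates dead rows and never needs VACUUM
  CREATE TEMP TABLE bayes_token_tmp (token bytea) ON COMMIT DROP;
  -- ... load tokens, run the inserts/updates ...
  COMMIT;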
Whenever I
Tom Lane wrote:
> Matthew Schumacher <[EMAIL PROTECTED]> writes:
>
>> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>> LOOP
>>   _token := intokenary[i];
>>   INSERT INTO bayes_token_tmp VALUES (_token);
>> END LOOP;
>
>
>> UPDATE
>>bayes_token
>> SET
>>spam_count
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
> LOOP
> _token := intokenary[i];
> INSERT INTO bayes_token_tmp VALUES (_token);
> END LOOP;
> UPDATE
> bayes_token
> SET
> spam_count = greatest_int(spam_co
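The preview truncates the statement; a hedged reconstruction of what an
UPDATE of this shape plausibly looks like inside the function body (the
_-prefixed names are the function's parameters, greatest_int is the
hand-rolled helper discussed later in the thread, and the exact column
list is an assumption):

  UPDATE bayes_token
  SET spam_count = greatest_int(spam_count + _spam_count, 0),
      ham_count  = greatest_int(ham_count + _ham_count, 0),
      atime      = greatest_int(atime, _atime)
  WHERE id = _id
    AND token IN (SELECT token FROM bayes_token_tmp);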
What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table. If I could copy from an array into a record then I can
even get rid of the loop. Anyone have any thoughts on how to do this?
You coul
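The reply is cut off here. One era-appropriate answer, for reference:
unnest() did not exist until 8.4, but 8.0 does have generate_series, so
inside the function body the array can be turned into a row source without
any temp table or loop. A sketch, where intokenary is the array parameter:

  -- expands the array into one row per element
  SELECT intokenary[i] AS token
  FROM generate_series(array_lower(intokenary, 1),
                       array_upper(intokenary, 1)) AS s(i);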
Okay,
Here is the status of the SA updates and a question:
Michael got SA changed to pass an array of tokens to the proc so right
there we gained a ton of performance due to connections and transactions
being grouped into one per email instead of one per token.
Now I am working on making the pro
Jim C. Nasby wrote:
>I'm not sure who's responsible for DBI::Pg (Josh?), but would it make
>sense to add better support for bytea to DBI::Pg? ISTM there should be a
>better way of doing this than adding gobs of \'s.
>
>
It has support for binding a bytea parameter, but in this case we're
trying
PFC wrote:
>
>
>> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 1);
>
>
> Try adding more backslashes until it works (seems that you need
> \\\\ or something).
> Don't DBI convert the language types to postgres quoted forms on its
> own ?
>
You're right, I am finding
select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 1);
Try adding more backslashes until it works (seems that you need \\\\ or
something).
Don't DBI convert the language types to postgres quoted forms on its
own ?
Tom Lane wrote:
>
> Revised insertion procedure:
>
>
> CREATE or replace FUNCTION put_tokens (_id INTEGER,
>                                        _tokens BYTEA[],
>                                        _spam_count INTEGER,
>                                        _ham_count INTEGER,
>                                        _atime INTEGER)
Michael Parker <[EMAIL PROTECTED]> writes:
> sub bytea_esc {
>   my ($str) = @_;
>   my $buf = "";
>   foreach my $char (split(//,$str)) {
>     if    (ord($char) == 0)  { $buf .= "\\\\000"; }
>     elsif (ord($char) == 39) { $buf .= "\\\\047"; }
>     elsif (ord($char) == 92) { $buf .= "\\\\134"; }
>     else                     { $buf .= $char; }
>   }
>   return $buf;
> }
Michael Parker <[EMAIL PROTECTED]> writes:
> The next hurdle, and I've just posted to the DBD::Pg list, is
> escaping/quoting the token strings.
If you're trying to write a bytea[] literal, I think the most reliable
way to write the individual bytes is
	\\\\nnn
where nnn is *octal*. The id
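Worked through, that advice looks like the sketch below (pre-9.1
escape-string behavior assumed). Each backslash pair is consumed once by
the string-literal parser and once more by the array parser, so the bytea
input routine finally sees \nnn:

  -- the escapes reach the bytea input function as \101 \102 \103,
  -- i.e. the bytes 'A', 'B', 'C'
  SELECT '{"\\\\101\\\\102\\\\103"}'::bytea[];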
Hi All,
As a SpamAssassin developer who, by my own admission, has real problems
getting PostgreSQL to work well, I must thank everyone for their
feedback on this issue. Believe me when I say what is in the tree now
is a far cry from what used to be there, orders of magnitude faster
for sure. I thi
Ok, here is the current plan.
Change the spamassassin API to pass a hash of tokens into the storage
module, pass the tokens to the proc as an array, start a transaction,
load the tokens into a temp table using copy, select the tokens distinct
into the token table for new tokens, update the token t
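A hedged sketch of that plan as plain SQL; the real version is a PL/pgSQL
function taking the token array as a parameter, and the literal ids and
counts below stand in for its arguments:

  BEGIN;
  CREATE TEMP TABLE bayes_token_tmp (token bytea) ON COMMIT DROP;
  COPY bayes_token_tmp (token) FROM STDIN;  -- bulk-load one message's tokens

  -- insert only the genuinely new tokens
  INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
  SELECT DISTINCT 1, t.token, 0, 0, 0
  FROM bayes_token_tmp t
  WHERE NOT EXISTS (SELECT 1 FROM bayes_token b
                    WHERE b.id = 1 AND b.token = t.token);

  -- then bump counters for every token in the message
  UPDATE bayes_token
  SET spam_count = spam_count + 1
  WHERE id = 1
    AND token IN (SELECT token FROM bayes_token_tmp);
  COMMIT;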
Matthew Schumacher wrote:
>All it's doing is trying the update before the insert to get around the
>problem of not knowing which is needed. With only 2-3 of the queries
>implemented I'm already back to running about the same speed as the
>original SA proc that is going to ship with SA 3.1.0.
>
>A
Tom Lane wrote:
> I looked into this a bit. It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cl
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> One thing that is still very puzzling to me is why this runs so much
> slower when I put the data.sql in a transaction. Obviously transactions
> are acting different when you call a proc a zillion times vs an insert
> query.
I looked into this a bi
On Sat, 2005-07-30 at 00:46 -0800, Matthew Schumacher wrote:
> I'll do some more testing on Monday, perhaps grouping even 200 tokens at
> a time using your method will yield significant gains, but probably not
> as dramatic as it does using my loading benchmark.
In that case, some of the clauses
Karim Nassar wrote:
>
> [EMAIL PROTECTED]:~/k-bayesBenchmark$ time ./test.pl
> <-- snip db creation stuff -->
> 17:18:44 -- START
> 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
> 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
> 17:19:50 -- AFTER bayes_
On Fri, 2005-07-29 at 09:47 -0700, Josh Berkus wrote:
> Try changing:
> wal_buffers = 256
>
> and try Bruce's stop full_page_writes patch.
>
> > I guess we see the real culprit here. Anyone surprised it's the WAL?
>
> Nope. On high-end OLTP stuff, it's crucial that the WAL have its own
> dedi
Ok, here is where I'm at: I reduced the proc down to this:
CREATE FUNCTION update_token (_id INTEGER,
                              _token BYTEA,
                              _spam_count INTEGER,
                              _ham_count INTEGER,
                              _atime INTEGER)
RETU
Andrew McMillan <[EMAIL PROTECTED]> writes:
> On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote:
>> How often should this table be vacuumed, every 5 minutes?
> I would be tempted to vacuum after each e-mail, in this case.
Perhaps the bulk of the transient states should be done in a temp
Andrew McMillan wrote:
>
> For the data in question (i.e. bayes scoring) it would seem that not
> much would be lost if you did have to restore your data from a day old
> backup, so perhaps fsync=false is OK for this particular application.
>
> Regards,
> And
On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote:
> >
> > On my laptop this takes:
> >
> > real    1m33.758s
> > user    0m4.285s
> > sys     0m1.181s
> >
> > One interesting effect is the data in bayes_vars has a huge number of
> > updates and needs vacuum _frequently_. After the ru
Also, this test goes a bit faster with sync turned off, if mysql isn't
using sync that would be why it's so much faster. Anyone know what the
default for mysql is?
For InnoDB I think it's like Postgres (only slower) ; for MyISAM it's no
fsync, no transactions, no crash tolerance of any kind.
Ok,
Here is something new, when I take my data.sql file and add a begin and
commit at the top and bottom, the benchmark is a LOT slower?
My understanding is that it should be much faster because fsync isn't
called until the commit instead of on every sql command.
I must be missing something here
Tom,
On 7/27/05 11:19 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> Matthew Schumacher <[EMAIL PROTECTED]> writes:
>> After playing with various indexes and what not I simply am unable to
>> make this procedure perform any better. Perhaps someone on the list can
>> spot the bottleneck and reveal w
Josh Berkus wrote:
>Dennis,
>
>
>
>> EXCEPTION WHEN unique_violation THEN
>>
>>
>
>I seem to remember that catching an exception in a PL/pgSQL procedure was a
>large performance cost. It'd be better to do UPDATE ... IF NOT FOUND.
>
>
>
Actually, he was doing an implicit UPDATE IF NOT FOUND
Dennis,
> EXCEPTION WHEN unique_violation THEN
I seem to remember that catching an exception in a PL/pgSQL procedure was a
large performance cost. It'd be better to do UPDATE ... IF NOT FOUND.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
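A minimal sketch of the UPDATE ... IF NOT FOUND pattern Josh recommends,
with table and column names assumed from the thread:

  CREATE OR REPLACE FUNCTION upsert_token(_id integer, _token bytea,
                                          _spam integer) RETURNS void AS $$
  BEGIN
      -- try the cheap case first; FOUND reports whether a row was hit
      UPDATE bayes_token SET spam_count = spam_count + _spam
      WHERE id = _id AND token = _token;
      IF NOT FOUND THEN
          INSERT INTO bayes_token (id, token, spam_count)
          VALUES (_id, _token, _spam);
      END IF;
  END;
  $$ LANGUAGE plpgsql;

One design note: unlike the EXCEPTION form, this can race if two sessions
insert the same token at once, so it trades concurrency safety for speed.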
Alvaro,
On 7/29/05 6:23 AM, "Alvaro Herrera" <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote:
>
>> I guess we see the real culprit here. Anyone surprised it's the WAL?
>
> So what? Are you planning to suggest people to turn fsync=false?
That's not t
Tom,
On 7/29/05 7:12 AM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> "Luke Lonergan" <[EMAIL PROTECTED]> writes:
>> I guess we see the real culprit here. Anyone surprised it's the WAL?
>
> You have not proved that at all.
As Alvaro pointed out, fsync has impact on more than WAL, so good point.
Int
Luke,
> work_mem = 131072 # min 64, size in KB
Incidentally, this is much too high for an OLTP application, although I don't
think this would have affected the test.
> shared_buffers = 16000 # min 16, at least max_connections*2, 8KB each
> checkpoint_segments = 128
"Luke Lonergan" <[EMAIL PROTECTED]> writes:
> I guess we see the real culprit here. Anyone surprised it's the WAL?
You have not proved that at all.
I haven't had time to look at Matthew's problem, but someone upthread
implied that it was doing a separate transaction for each word. If so,
collap
On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote:
> I guess we see the real culprit here. Anyone surprised it's the WAL?
So what? Are you planning to suggest people to turn fsync=false?
I just had a person lose 3 days of data on some tables because of that,
even when checkpoints w
On Wed, 27 Jul 2005, Matthew Schumacher wrote:
> Then they do this to insert the token:
>
> INSERT INTO bayes_token (
> id,
> token,
> spam_count,
> ham_count,
> atime
> ) VALUES (
> ?,
> ?,
> ?,
> ?,
> ?
> ) ON DUPLICATE KEY
> UPDATE
> spam_count = GREATEST(spam_count +
On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote:
>
> Ok, I finally got some test data together so that others can test
> without installing SA.
>
> The schema and test dataset is over at
> http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
>
> I have a pretty fast machine w
Gavin Sherry wrote:
>
> I had a look at your data -- thanks.
>
> I have a question though: put_token() is invoked 120596 times in your
> benchmark... for 616 messages. That's nearly 200 queries (not even
> counting the 1-8 (??) inside the function itself) per message. Something
> doesn't seem right.
Karim Nassar wrote:
> On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote:
>
>
>>I put the rest of the schema up at
>>http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
>>needs to see it too.
>
>
> Do you have sample data too?
>
Ok, I finally got some test data together so that others can test
without installing SA.
> I'm not sure how much this has been discussed on the list, but wasn't
> able to find anything relevant in the archives.
>
> The new Spamassassin is due out pretty soon. They are currently testing
> 3.1.0pre4. One of the things I hope to get out of this release is bayes
> word stats moved to a
On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote:
> I put the rest of the schema up at
> http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
> needs to see it too.
Do you have sample data too?
--
Karim Nassar
Collaborative Computing Lab of NAU
Office: (928) 523 5868 -
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> After playing with various indexes and what not I simply am unable to
> make this procedure perform any better. Perhaps someone on the list can
> spot the bottleneck and reveal why this procedure isn't performing that
> well or ways to make it better.
Josh Berkus wrote:
> Matt,
>
> Well, it might be because we don't have a built-in GREATEST or LEAST prior to
> 8.1. However, it's pretty darned easy to construct one.
I was more talking about min() and max() but yea, I think you knew where
I was going with it...
>
> Well, there's the genera
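For reference, the "pretty darned easy" construction can look like this;
a sketch using the greatest_int/least_int names that appear elsewhere in
the thread:

  CREATE OR REPLACE FUNCTION greatest_int(integer, integer) RETURNS integer AS $$
      SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
  $$ LANGUAGE sql IMMUTABLE STRICT;

  CREATE OR REPLACE FUNCTION least_int(integer, integer) RETURNS integer AS $$
      SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;
  $$ LANGUAGE sql IMMUTABLE STRICT;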
Matt,
> UPDATE bayes_vars SET
> $token_count_update
> newest_token_age = GREATEST(newest_token_age, ?),
> oldest_token_age = LEAST(oldest_token_age, ?)
> WHERE id = ?
>
>
> I think the reason why the procedure was written for postgres was
> because of the greatest and least statements perf
Matt,
> After playing with various indexes and what not I simply am unable to
> make this procedure perform any better. Perhaps someone on the list can
> spot the bottleneck and reveal why this procedure isn't performing that
> well or ways to make it better.
Well, my first thought is that this