Re: [HACKERS] AIX compile ?? libcrypt HELP !!
Tom Lane wrote:
> Nathan Boeger <[EMAIL PROTECTED]> writes:
> > I am trying to compile an app (compiles on *BSD, Linux) but for some
> > reason I cant get it to compile on AIX 4.3.3.3. It seems not to like the
> > libcrypt ??
> >
> > ld: 0711-317 ERROR: Undefined symbol: __crypt_r
> > ld: 0711-317 ERROR: Undefined symbol: __setkey_r
> > ld: 0711-317 ERROR: Undefined symbol: __encrypt_r
>
> I'll bet AIX doesn't have the "reentrant" versions crypt_r(), setkey_r(),
> etc. Did you check its man page for the crypt functions?
>
> regards, tom lane

Well, no. I did, however, link against the ssl libcrypto
(-L/usr/local/ssl/lib -lcrypto) and it works... why?

Thank you for the response!

nathan
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 10/3/17, 5:59 PM, "Tom Lane" wrote: > I thought it would be a good idea to get this done before walking away > from the commitfest and letting all this info get swapped out of my > head. So I've reviewed and pushed this. Thanks! > I took out most of the infrastructure you'd put in for constructing > RangeVars for tables not explicitly named on the command line. It > was buggy (eg you can't assume a relcache entry will stick around) > and I don't believe it's necessary. I don't think that warnings > should be issued for any tables not explicitly named. > > In any case, though, the extent to which we should add more warning > or log output seems like a fit topic for a new thread and a separate > patch. Let's call this one done. I'll look into submitting that to the next commitfest. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Additional logging for VACUUM and ANALYZE
On 10/5/17, 12:29 AM, "Michael Paquier" wrote: > On Thu, Oct 5, 2017 at 1:23 AM, Bossart, Nathan wrote: >> Presently, there are a few edge cases in vacuum_rel() and analyze_rel() that >> I >> believe do not have sufficient logging. This was discussed a bit in the >> vacuum-multiple-relations thread [0], but it was ultimately decided that any >> logging changes should be proposed separately. > > I think that I agree with that, especially now with VACUUM allowing > multiple relations. The discussion then would be how much logging we > want. WARNING looks adapted per the discussions we had on the other > thread as manual VACUUMs can now involve much more relations, even > with partitioned tables. More opinions would be welcome. Thanks for taking a look. >> and a test that exercises a bit of this functionality. > > My take on those test would be to not include them. This is a lot just > to test two logging lines where the relation has been dropped. Yeah, I'm not terribly concerned about those tests. >> If this change were to be considered for back-patching, we would likely want >> to >> also apply Michael's RangeVar fix for partitioned tables to 10 [1]. Without >> this change, log messages for unspecified partitions will be emitted with the >> parent's RangeVar information. > > Well, that's assuming that we begin logging some information for > manual VACUUMs using the specified RangeVar, something that does not > happen at the top of upstream REL_10_STABLE, but can happen if we were > to include the patch you are proposing on this thread for > REL_10_STABLE. But the latter is not going to happen. Or did you patch > your version of v10 to do so in your stuff? For v10 the ship has > already sailed, so I think that it would be better to just let it go, > and rely on v11 which has added all the facility we wanted. I agree. I didn't mean to suggest that it should be back-patched, just that v10 has a small quirk that needs to be handled if others feel differently. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 11/1/17, 9:33 PM, "Peter Eisentraut" wrote: > On 9/25/17 14:04, Bossart, Nathan wrote: >> I'd like to use this thread to gauge community interest in adding this >> functionality to this module. If there is interest, I'll add it to the next >> commitfest. > > This thread has no patch, which is not really the intended use of the > commit fest, so I'm closing the commit fest entry for now. Sorry about that. I'll reopen it when I have a patch to share. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE
On 10/5/17, 11:53 PM, "Jing Wang" wrote: > The patch has been updated according to Nathan's comments. > Thanks Nathan's review. Thanks for the new versions of the patches. I apologize for the long delay for this new review. It looks like the no-pgdump patch needs a rebase at this point. I was able to apply the code portions with a 3-way merge, but the documentation changes still did not apply. I didn't have any problems applying the pgdump patch. + + + The name of a database or keyword current_database. When using + current_database,it means using the name of the connecting database. + + For commands that accept the CURRENT_USER and SESSION_USER keywords, the keywords are typically listed in the 'Synopsis' section. I think CURRENT_DATABASE should be no different. For example, the parameter type above could be "database_specification," and the following definition could be included at the bottom of the synopsis: where database_specification can be: object_name | CURRENT_DATABASE Then, in the parameters section, the CURRENT_DATABASE keyword would be defined: CURRENT_DATABASE Comment on the current database instead of an explicitly identified role. Also, it looks like only the COMMENT and SECURITY LABEL documentation is being updated in this patch set. However, this keyword seems applicable to many other commands, too (e.g. GRANT, ALTER DATABASE, ALTER ROLE, etc.). +static ObjectAddress +get_object_address_database(ObjectType objtype, DbSpec *object, bool missing_ok) +{ + char*dbname; + ObjectAddress address; + + dbname = get_dbspec_name(object); + + address.classId = DatabaseRelationId; + address.objectId = get_database_oid(dbname, missing_ok); + address.objectSubId = 0; + + return address; +} This helper function is only used once, and it seems simple enough to build the ObjectAddress in the switch statement. Also, instead of get_database_oid(), could we use get_dbspec_oid()? if (stmt->objtype == OBJECT_DATABASE) { - char *database = strVal((Value *) stmt->object); - - if (!OidIsValid(get_database_oid(database, true))) + if (!OidIsValid(get_dbspec_oid((DbSpec*)stmt->object, true))) { + char*dbname = NULL; + + dbname = get_dbspec_name((DbSpec*)stmt->object); + ereport(WARNING, (errcode(ERRCODE_UNDEFINED_DATABASE), -errmsg("database \"%s\" does not exist", database))); +errmsg("database \"%s\" does not exist", dbname))); return address; } } This section seems to assume that the DbSpec will be of type DBSPEC_CSTRING in the error handling. That should be safe for now, as you cannot drop the current database, but I would suggest adding assertions here to be sure. + dbname = get_dbspec_name((DbSpec*)stmt->dbspec); As a general note, casts are typically styled as "(DbSpec *) stmt" (with the spaces) in PostgreSQL. + case DBSPEC_CURRENT_DATABASE: + { + HeapTuple tuple; + Form_pg_database dbForm; Can we just declare "tuple" and "dbForm" at the beginning of get_dbspec_name() so we don't need the extra set of braces? + if (fout->remoteVersion >= 10) + appendPQExpBuffer(dbQry, "COMMENT ON DATABASE CURRENT_DATABASE IS "); + else + appendPQExpBuffer(dbQry, "COMMENT ON DATABASE %s IS ", fmtId(datname)); This feature would probably only be added to v11, so the version checks in the pgdump patch will need to be updated. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_upgrade to clusters with a different WAL segment size
Hi hackers, The thread for the recent change to allow setting the WAL segment size at initdb time [0] included a bit of discussion regarding pg_upgrade [1], where it was suggested that relaxing an error check (presumably in check_control_data()) might be enough to upgrade servers to a different WAL segment size. We've had success with our initial testing of upgrades to larger WAL segment sizes, including post-upgrade pgbench runs. Beyond adjusting check_control_data(), it looks like the 'pg_resetwal -l' call in copy_xact_xlog_xid() may need to be adjusted to ensure that the WAL starting address is set to a valid value. Allowing changes to the WAL segment size during pg_upgrade seems like a nice way to avoid needing a dump and load, so I would like to propose adding support for this. I'd be happy to submit patches for this in the next commitfest. Nathan [0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fc49e24fa69a15efacd5b8958115ed9c43c48f9a [1] https://www.postgresql.org/message-id/20160826043926.pwkz45ksxpmfn4g6%40alap3.anarazel.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
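To make the intended workflow concrete, the idea is roughly the following; the directories, binary paths, and the 64MB size are only examples, and today the second step presumably trips the segment-size comparison in check_control_data():

    # example only -- paths, versions, and sizes are placeholders
    initdb --wal-segsize=64 -D /data/new
    pg_upgrade -b /usr/pgsql-old/bin -B /usr/pgsql-new/bin \
               -d /data/old -D /data/new

The proposal is simply to let the second step succeed when the only mismatch is the WAL segment size, after fixing up the 'pg_resetwal -l' argument as noted above.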
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
Great, I’ll keep working on this patch and will update this thread with a more complete implementation. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/10/17, 8:10 PM, "Masahiko Sawada" wrote: > I agree to report per-table information. Especially In case of one of > tables specified failed during vacuuming, I think we should report at > least information of tables that is done successfully so far. +1 I believe you already get all per-table information when you do not specify a table name (“VACUUM VERBOSE;”), so I would expect to get this for free as long as we are building this into the existing ExecVacuum(…) and vacuum(…) functions. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/11/17, 6:32 PM, "Tom Lane" wrote: > You probably won't get much in the near term, because we're in > stabilize-the-release mode not develop-new-features mode. > Please add your patch to the pending commitfest > https://commitfest.postgresql.org/14/ > so that we remember to look at it when the time comes. Understood. I’ve added it to the commitfest. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/11/17, 7:20 PM, "Michael Paquier" wrote: > Browsing the code Thanks for taking a look. > It seems to me that you don't need those extra square brackets around > the column list. Same remark for vacuum.sgml. I’ll remove them. My intent was to ensure that we didn’t accidentally suggest that statements like “VACUUM , foo, bar;” were accepted, but the extra brackets don’t seem to fix that, and I don’t foresee much confusion anyway. > In short for all that, it is enough to have "[, ... ]" to document > that a list is accepted. That makes sense, I’ll fix it. > It seems to me that it would have been less invasive to loop through > vacuum() for each relation. Do you foresee advantages in allowing > vacuum() to handle multiple? I am not sure if is is worth complicating > the current logic more considering that you have as well so extra > logic to carry on option values. That was the approach I first prototyped. The main disadvantage that I found was that the command wouldn’t fail-fast if one of the tables or columns didn’t exist, and I thought that it might be frustrating to encounter such an error in the middle of vacuuming several large tables. It’s easy enough to change the logic to emit a warning and simply move on to the next table, but that seemed like it could be easily missed among the rest of the vacuum log statements (especially with the verbose option specified). What are your thoughts on this? In the spirit of simplifying things a bit, I do think it is possible to eliminate one of the new node types, since the fields for each are almost identical. > + * used for error messages. In the case that relid is valid, rels > + * must have exactly one element corresponding to the specified relid. > s/rels/relations/ as variable name? Agreed, that seems nicer. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
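To make the fail-fast concern concrete, the behavior I would like to keep is roughly this (table names invented for the example):

    -- the typo is caught up front, before any of the large tables are vacuumed
    VACUUM FREEZE VERBOSE big_table_1, big_table_2, big_tabel_3;
    ERROR:  relation "big_tabel_3" does not exist

With a warn-and-continue approach instead, the first two tables would be processed and the warning could easily scroll past in the verbose output.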
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/16/17, 11:21 PM, "Michael Paquier" wrote: > On Wed, May 17, 2017 at 7:56 AM, Bossart, Nathan wrote: >> I think this issue already exists, as this comment in get_rel_oids(…) seems >> to indicate: >> >> /* >> * Since we don't take a lock here, the relation might be gone, or the >> * RangeVar might no longer refer to the OID we look up here. In the >> * former case, VACUUM will do nothing; in the latter case, it will >> * process the OID we looked up here, rather than the new one. Neither >> * is ideal, but there's little practical alternative, since we're >> * going to commit this transaction and begin a new one between now >> * and then. >> */ >> relid = RangeVarGetRelid(vacrel, NoLock, false); >> >> With the patch applied, I believe this statement still holds true. So if >> the relation disappears before we get to vacuum_rel(…), we will simply skip >> it and move on to the next one. The vacuum_rel(…) code provides a WARNING >> in many cases (e.g. the user does not have privileges to VACUUM the table), >> but we seem to silently skip the table when it disappears before the call to >> vacuum_rel(…). > > Yes, that's the bits using try_relation_open() which returns NULL if > the relation is gone. I don't think that we want VACUUM to be noisy > about that when running it on a database. Agreed. >> If we added a WARNING to the effect of “skipping vacuum of — >> relation no longer exists” for this case, I think what you are suggesting >> would be satisfied. > > We would do no favor by reporting nothing to the user. Without any > information, the user triggering a manual vacuum may believe that the > relation has been vacuumed as it was listed but that would not be the > case. Agreed. Unfortunately, I think this is already possible when you specify a table to be vacuumed. >> However, ANALYZE has a slight caveat. While analyze_rel(…) silently skips >> the relation if it no longer exists like vacuum_rel(…) does, we do not >> pre-validate the columns list at all. So, in an ANALYZE statement with >> multiple tables and columns specified, it’ll only fail once we get to the >> undefined column. To fix this, we could add a check for the column lists >> near get_rel_oids(…) and adjust do_analyze_rel(…) to emit a WARNING and skip >> any columns that vanish in the meantime. >> >> Does this seem like a sane approach? >> >> 1. Emit WARNING when skipping if relation disappears before we get to it. >> 2. Early in vacuum(…), check that the specified columns exist. > > And issue an ERROR, right? Correct. This means that both the relations and columns specified would cause an ERROR if they do not exist and a WARNING if they disappear before we can actually process them. > + RelationAndColumns *relinfo = (RelationAndColumns *) > lfirst(relation); > + int per_table_opts = options | relinfo->options; /* > VACOPT_ANALYZE may be set per-table */ > + ListCell *oid; > I have just noticed this bit in your patch. So you can basically do > something like that: > VACUUM (ANALYZE) foo, (FULL) bar; > Do we really want to go down to this level of control? I would keep > things a bit more restricted as users may be confused by the different > lock levels involved by each operation, and make use of the same > options for all the relations listed. Opinions from others is welcome. I agree with you here, too. I stopped short of allowing customers to explicitly provide per-table options, so the example you provided wouldn’t work here. 
This is more applicable for something like the following: VACUUM (FREEZE, VERBOSE) foo, bar (a); In this case, the FREEZE and VERBOSE options are used for both tables. However, we have a column list specified for ‘bar’, and the ANALYZE option is implied when we specify a column list. So when we process ‘bar’, we need to apply the ANALYZE option, but we do not need it for ‘foo’. For now, that is all that this per-table options variable is used for. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/18/17, 6:12 PM, "Michael Paquier" wrote: > Fine for me as well. I would suggest to split the patch into two parts > to ease review then: > - Rework this error handling for one relation. > - The main patch. I’d be happy to do so, but I think part one would be pretty small, and almost all of the same code needs to be changed in the main patch anyway. I do not foresee a huge impact on review-ability either way. If others disagree, I can split it up. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/18/17, 8:03 PM, "Tom Lane" wrote: >”Bossart, Nathan" writes: >> On 5/18/17, 6:12 PM, "Michael Paquier" wrote: >>> Fine for me as well. I would suggest to split the patch into two parts >>> to ease review then: >>> - Rework this error handling for one relation. >>> - The main patch. >> >> I’d be happy to do so, but I think part one would be pretty small, and >> almost all of the same code needs to be changed in the main patch anyway. I >> do not foresee a huge impact on review-ability either way. If others >> disagree, I can split it up. > >Yeah, I'm dubious that that's really necessary. If the change proves >bigger than you're anticipating, maybe it's worth a two-step approach, >but I share your feeling that it probably isn’t. Just in case it was missed among the discussion, I’d like to point out that v5 of the patch includes the “ERROR if ANALYZE not specified” change. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 8/23/17, 11:59 PM, "Michael Paquier" wrote: > Robert, Amit and other folks working on extending the existing > partitioning facility would be in better position to answer that, but > I would think that we should have something as flexible as possible, > and storing a list of relation OID in each VacuumRelation makes it > harder to track the uniqueness of relations vacuumed. I agree that the > concept of a partition with multiple parents induces a lot of > problems. But the patch as proposed worries me as it complicates > vacuum() with a double loop: one for each relation vacuumed, and one > inside it with the list of OIDs. Modules calling vacuum() could also > use flexibility, being able to analyze a custom list of columns for > each relation has value as well. I understand your concern, and I'll look into this for v9. I think the majority of this change will go into get_rel_oids(...). Like you, I am also curious to what the partitioning folks think. > + * relations is a list of VacuumRelations to process. If it is NIL, all > + * relations in the database are processed. > Typo here, VacuumRelation is singular. I'll make this change in v9. I should also note that the dedupe_relations(...) function needs another small fix for column lists. Since the lack of a column list means that we should ANALYZE all columns, a duplicate table name with an empty column list should effectively null out any other specified columns. For example, "ANALYZE table (a, b), table;" currently dedupes to the equivalent of "ANALYZE table (a, b);" when it should dedupe to "ANALYZE table;". Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 8/30/17, 5:37 PM, "Michael Paquier" wrote: > Yeah... Each approach has its cost and its advantages. It may be > better to wait for more opinions, no many people have complained yet > that for example a list of columns using twice the same one fails. Sounds good to me. > +VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ class="PARAMETER">table_name ] [, ...] > I just noticed that... But regarding the docs, I think that you have > misplaced the position of "[, ...]", which should be inside the > table_name portion in the case of what I quote here, no? I think that's what I had initially, but it was changed somewhere along the line. It is a little more complicated for the versions that accept column lists. VACUUM ... ANALYZE [ [ table_name [ (column_name [, ...] ) ] ] [, ...] ] ISTM that we need the extra brackets here to clarify that the table and column list combination is what can be provided in a list. Does that make sense? Or do you think we can omit the outermost brackets here? > +VacuumRelation * > +makeVacuumRelation(RangeVar *relation, List *va_cols, Oid oid) > +{ > + VacuumRelation *vacrel = makeNode(VacuumRelation); > + vacrel->relation = relation; > + vacrel->va_cols = va_cols; > + vacrel->oid = oid; > + return vacrel; > +} > Perhaps in makefuncs.c instead of vacuum.c? That's usually the place > used for node constructions like that. Ah, yes. That is a much better place. I'll make this change. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/3/17, 11:46 PM, "Michael Paquier" wrote: > I did not consider first that the list portion also needed to be > modified, perhaps because I am not coding that myself... So now that > it is becoming more complicated what about just using AutovacMemCxt? > This would simplify the list of VacuumRelation entries and the > RangeVar creation as well, and honestly this is ugly and there are no > other similar patterns in the backend code: +1 > This would become way more readable by using makeRangeVar() and the > new makeVacuumRelation. As this is partly my fault that we are at this > state, I am fine as well to remove this burden from you, Nathan, and > fix that myself in a new version. And I don't want to step on your > toes either :) No worries, I can take care of it. I appreciate your patience with all of these reviews. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/4/17, 10:32 PM, "Simon Riggs" wrote: > ISTM there is no difference between > VACUUM a, b > and > VACUUM a; VACUUM b; > > If we want to keep the code simple we must surely consider whether the > patch has any utility. Yes, this is true, but I think the convenience factor is a bit understated with that example. For example, if you need to manually cleanup several tables for XID purposes, VACUUM FREEZE VERBOSE table1; VACUUM FREEZE VERBOSE table2; VACUUM FREEZE VERBOSE table3; VACUUM FREEZE VERBOSE table4; VACUUM FREEZE VERBOSE table5; becomes VACUUM FREEZE VERBOSE table1, table2, table3, table4, table5; I would consider even this to be a relatively modest example compared to the sorts of things users might do. In addition, I'd argue that this feels like a natural extension of the VACUUM command, one that I, like others much earlier in this thread, was surprised to learn wasn't supported. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/4/17, 8:16 PM, "Michael Paquier" wrote: > So vacuum_multiple_tables_v14.patch is good for a committer in my > opinion. With this patch, if the same relation is specified multiple > times, then it gets vacuum'ed that many times. Using the same column > multi-times results in an error as on HEAD, but that's not a new > problem with this patch. Thanks! > So I would tend to think that the same column specified multiple times > should cause an error, and that we could let VACUUM run work N times > on a relation if it is specified this much. This feels more natural, > at least to me, and it keeps the code simple. I think that is a reasonable approach. Another option I was thinking about was to de-duplicate only the individual column lists. This alternative approach might be a bit more user-friendly, but I am beginning to agree with you that perhaps we should not try to infer the intent of the user in these "duplicate" scenarios. I'll work on converting the existing de-duplication patch into something more like what you suggested. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/9/17, 7:28 AM, "Michael Paquier" wrote: > In the duplicate patch, it seems to me that you can save one lookup at > the list of VacuumRelation items by checking for column duplicates > after checking that all the columns are defined. If you put the > duplicate check before closing the relation you can also use the > schema name associated with the Relation. I did this so that the ERROR prioritization would be enforced across all relations. For example: VACUUM ANALYZE table1 (a, a), table2 (does_not_exist); If I combine the 'for' loops to save a lookup, this example behaves differently. Instead of an ERROR for the nonexistent column, you would hit an ERROR for the duplicate column in table1's list. However, I do not mind changing this. > + if (i == InvalidAttrNumber) > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_COLUMN), > +errmsg("column \"%s\" of relation \"%s\" does not exist", > + col, RelationGetRelationName(rel; > This could use the schema name unconditionally as you hold a Relation > here, using RelationGetNamespace(). Sure, I think this is a good idea. I'll make this change in the next version of the patch. > if (!onerel) > + { > + /* > +* If one of the relations specified by the user has disappeared > +* since we last looked it up, let them know so that they do not > +* think it was actually analyzed. > +*/ > + if (!IsAutoVacuumWorkerProcess() && relation) > + ereport(WARNING, > + (errmsg("skipping \"%s\" --- relation no longer exists", > + relation->relname))); > + > return; > + } > Hm. So if the relation with the defined OID is not found, then we'd > use the RangeVar, but this one may not be set here: > + /* > +* It is safe to leave everything except the OID empty here. > +* Since no tables were specified in the VacuumStmt, we know > +* we don't have any columns to keep track of. Also, we do > +* not need the RangeVar, because it is only used for error > +* messaging when specific relations are chosen. > +*/ > + rel_oid = HeapTupleGetOid(tuple); > + relinfo = makeVacuumRelation(NULL, NIL, rel_oid); > + vacrels_tmp = lappend(vacrels_tmp, relinfo); > So if the relation is analyzed but skipped, we would have no idea that > it actually got skipped because there are no reports about it. That's > not really user-friendly. I am wondering if we should not instead have > analyze_rel also enforce the presence of a RangeVar, and adding an > assert at the beginning of the function to undertline that, and also > do the same for vacuum(). It would make things also consistent with > vacuum() which now implies on HEAD that a RangeVar *must* be > specified. I agree that it is nice to see when relations are skipped, but I do not know if the WARNING messages would provide much value for this particular use case (i.e. 'VACUUM;'). If a user does not provide a list of tables to VACUUM, they might not care too much about WARNINGs for dropped tables. > Are there opinions about back-patching the patch checking for > duplicate columns? Stable branches now complain about an unhelpful > error message. I wouldn't mind drafting something up for the stable branches. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE
A few general comments. While this patch applies, I am still seeing some whitespace errors: comment_on_current_database_no_pgdump_v4.1.patch:488: trailing whitespace. ColId comment_on_current_database_no_pgdump_v4.1.patch:490: trailing whitespace. | CURRENT_DATABASE comment_on_current_database_no_pgdump_v4.1.patch:491: trailing whitespace. { comment_on_current_database_no_pgdump_v4.1.patch:501: trailing whitespace. ColId comment_on_current_database_no_pgdump_v4.1.patch:502: trailing whitespace. { warning: squelched 9 whitespace errors warning: 14 lines add whitespace errors. It looks like the 'dbname' test is currently failing when you run 'make check-world'. The Travis CI build log [1] shows the details of the failure. From a brief glance, I would guess that some of the queries are returning unexpected databases that are created in other tests. Also, I think this change will require updates to the documentation. + if (dbspecname->dbnametype == DBSPEC_CURRENT_DATABASE ) + dbname = get_database_name(MyDatabaseId); + else + dbname = dbspecname->dbname; This pattern is repeated in the patch several times. It looks like the end goal of these code blocks is either to get the database name or the database OID, so perhaps we should have get_dbspec_name() and get_dbspec_oid() helper functions (like get_rolespec_oid() for RoleSpec nodes). +static bool +_equalDatabaseSpec(const DBSpecName *a, const DBSpecName *b) +{ + COMPARE_SCALAR_FIELD(dbnametype); + COMPARE_STRING_FIELD(dbname); + COMPARE_LOCATION_FIELD(location); + + return true; +} There are some inconsistencies in the naming strategy. For example, this function is called _equalDatabaseSpec(), but the struct is DBSpecName. I would suggest calling it DatabaseSpec or DbSpec throughout the patch. Nathan [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/275747367 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
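To sketch what I mean by those helpers (untested, modeled on get_rolespec_oid() and using the DBSpecName fields from this version of the patch):

    /* untested sketch of the suggested helpers */
    static char *
    get_dbspec_name(DBSpecName *spec)
    {
        if (spec->dbnametype == DBSPEC_CURRENT_DATABASE)
            return get_database_name(MyDatabaseId);
        return spec->dbname;
    }

    static Oid
    get_dbspec_oid(DBSpecName *spec, bool missing_ok)
    {
        return get_database_oid(get_dbspec_name(spec), missing_ok);
    }

Each call site would then collapse to a single call instead of repeating the DBSPEC_CURRENT_DATABASE branch.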
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/20/17, 2:29 PM, "Tom Lane" wrote: > I started to look at this... Thanks for taking a look. > I think that the way this ought to work is you process the VacuumRelation > structs purely sequentially, each in its own transaction, so you don't > need leaps of faith about what to do if a relation changes between the > first time you look at it and when you actually come to process it. How might this work for VACUUM statements with no tables specified? It seems like we must either handle the case when the relation changes before it is processed or hold a lock for the duration of the vacuuming. > The idea of "fast failing" because some later VacuumRelation struct might > contain an error seems like useless complication, both in terms of the > implementation and the user-visible behavior. I agree that the patch might be simpler without this, but the user-visible behavior is the reason I had included it. In short, my goal was to avoid errors halfway through a long-running VACUUM statement because the user misspelled a relation/column name or the relation/column was dropped. It's true that the tests become mostly pointless if the relations or columns change before they are processed, but this seems like it would be a rarer issue in typical use cases. I'll continue to look into switching to a more sequential approach and eliminating the leaps of faith. > As for the other patch, I wonder if it might not be better to > silently ignore duplicate column names. But in either case, we don't > need a pre-check, IMO. I'd just leave it to the lookup loop in > do_analyze_rel to notice if it's looked up the same column number > twice. That would be more likely to lead to a back-patchable fix, > too. 0002 as it stands is useless as a back-patch basis, since it > proposes modifying code that doesn't even exist in the back branches. I'd be happy to write something up that is more feasibly back-patched. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rethinking autovacuum.c memory handling
On 9/23/17, 5:27 AM, "Michael Paquier" wrote: >On Sat, Sep 23, 2017 at 6:09 AM, Tom Lane wrote: >> I notice that autovacuum.c calls autovacuum_do_vac_analyze, and >> thereby vacuum(), in TopTransactionContext. This doesn't seem >> like a terribly great idea, because it doesn't correspond to what >> happens during a manually-invoked vacuum. > > Indeed, the inconsistency is not good here. > >> What I think we should do instead is invoke autovacuum_do_vac_analyze >> in the PortalContext that do_autovacuum has created, which we already >> have a mechanism to reset once per table processed in do_autovacuum. >> >> The attached patch does that, and also modifies perform_work_item() >> to use the same approach. Right now perform_work_item() has a >> copied-and-pasted MemoryContextResetAndDeleteChildren(PortalContext) >> call in its error recovery path, but that seems a bit out of place >> given that perform_work_item() isn't using PortalContext otherwise. > > I have spent some time looking at your patch and testing it. This > looks sane. A small comment that I have would be to add an assertion > at the top of perform_work_item to be sure that it is called in the > memory context of AutovacMemCxt. This looks reasonable to me as well. I haven't noticed any issues after a couple hours of pgbench with aggressive autovacuum settings, either. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
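For anyone following along, the shape of the change is roughly the following, as I read the patch (paraphrased, not the exact diff):

    /* paraphrase of the approach, not the exact committed code */
    MemoryContextSwitchTo(PortalContext);
    autovacuum_do_vac_analyze(tab, bstrategy);

    /* switch back to the long-lived context and reset the per-table
     * memory, as do_autovacuum already does once per table */
    MemoryContextSwitchTo(AutovacMemCxt);
    MemoryContextResetAndDeleteChildren(PortalContext);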
Re: [HACKERS] Rethinking autovacuum.c memory handling
On 9/23/17, 12:36 PM, "Tom Lane" wrote: >"Bossart, Nathan" writes: >> This looks reasonable to me as well. I haven't noticed any issues after >> a couple hours of pgbench with aggressive autovacuum settings, either. > > Thanks for looking. As I'm sure you realize, what motivated that was > not liking the switch into AutovacMemCxt that you'd added in > autovacuum_do_vac_analyze ... with this patch, we can drop that. Yup. I’ll go ahead and update that patch now. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix number skipping in to_number
On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote: > Ok I've made that change in the attached v3. I'm not sure as I'm on > en_US.UTF-8 locale too. Maybe something Windows specific? This patch applies against master (8485a25a), compiles, and passes a make check. I tested both on my mac laptop, and my linux server. If we want this patch, I'd say it's ready for committer. We may want (and I can't believe I'm saying this) more discussion as to exactly what the strategy for to_number() (and friends) is. Do we want to duplicate Oracle's functionality, or do we want a similar function to do similar things, without necessarily having a goal of identical behavior to oracle? For myself, I pretty much never use the to_date, to_number, or to_timestamp functions except when porting oracle code. I do use the to_char functions on occasion. If strftime were available, I probably wouldn't use them. I would commit this patch and update the TODO with a goal of making to_number as Oracle compatible as is reasonable. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/24/17, 10:12 PM, "Michael Paquier" wrote: > Attached is a proposal of patch. The patch seems reasonable to me, and I haven't encountered any issues in my tests, even after applying the vacuum-multiple-relations patch on top of it. +* Take a lock here for the relation lookup. If ANALYZE or VACUUM spawn +* multiple transactions, the lock taken here will be gone once the +* current transaction running commits, which could cause the relation +* to be gone, or the RangeVar might not refer to the OID looked up here. I think this could be slightly misleading. Perhaps it would be more accurate to say that the lock will be gone any time vacuum() creates a new transaction (either in vacuum_rel() or when use_own_xacts is true). Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Enhancements to passwordcheck
Hi hackers,

Currently, the passwordcheck module provides a few basic checks to strengthen passwords. However, any configuration must be done at compile time, and many common password requirements cannot be enforced without creating a custom version of this module. I think there are a number of useful parameters that could be added to enable common password restrictions, including the following list, which is based on some asks from our customers:

passwordcheck.min_password_length
passwordcheck.min_uppercase_letters
passwordcheck.min_lowercase_letters
passwordcheck.min_numbers
passwordcheck.min_special_chars
passwordcheck.superuser_can_bypass
passwordcheck.max_expiry_period
passwordcheck.force_new_password

I'd like to use this thread to gauge community interest in adding this functionality to this module. If there is interest, I'll add it to the next commitfest.

Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
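To give a feel for what the checks could look like, here is an untested sketch of the character-class counting the plain-text path of the check_password hook might do; the passwordcheck_* variables are the hypothetical GUCs backing the parameters proposed above:

    /* untested sketch; assumes the usual passwordcheck.c includes */
    static void
    check_character_classes(const char *password)
    {
        const char *p;
        int         upper = 0, lower = 0, digits = 0;

        for (p = password; *p; p++)
        {
            if (isupper((unsigned char) *p))
                upper++;
            else if (islower((unsigned char) *p))
                lower++;
            else if (isdigit((unsigned char) *p))
                digits++;
        }

        if ((int) strlen(password) < passwordcheck_min_length ||
            upper < passwordcheck_min_uppercase ||
            lower < passwordcheck_min_lowercase ||
            digits < passwordcheck_min_numbers)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PASSWORD),
                     errmsg("password does not meet the configured complexity requirements")));
    }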
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/25/17, 6:51 PM, "Michael Paquier" wrote: >> +* Take a lock here for the relation lookup. If ANALYZE or >> VACUUM spawn >> +* multiple transactions, the lock taken here will be gone >> once the >> +* current transaction running commits, which could cause >> the relation >> +* to be gone, or the RangeVar might not refer to the OID >> looked up here. >> >> I think this could be slightly misleading. Perhaps it would be more >> accurate to say that the lock will be gone any time vacuum() creates a new >> transaction (either in vacuum_rel() or when use_own_xacts is true). > > The comment of the proposed patch matches as much as possible what is > currently on HEAD, so I would still go with something close to that. Sure. This is just a minor point, and I could see the argument that your phrasing is more concise, anyway. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 9/25/17, 8:31 PM, "Michael Paquier" wrote: > Yes, I have developped a couple of years back a fork of passwordcheck > which had much similar enhancements, so getting something more modular > in upstream would be really welcome. Awesome. > On top of that I think that you should have a parameter that specifies > a string full of special characters. For example I have been using > things like !@#$%^&*()_+{}|<>?=. But you likely want to make that > modular, people have their own set of character characters, and it is > just something that could be compared with strchr(), still the default > should be meaningful. +1 >> passwordcheck.superuser_can_bypass > > Not sure that this one has much meaning. Superusers could easily > unload the module. True. I can leave it out for now. >> passwordcheck.force_new_password > > So this is to make sure that the new password is not the same as the > old one? This would be better with the last N passwords, still this > would require more facilities. I would discard this one as what you > are proposing here is not modular enough IMO, and needs a separate > feature set. Fair enough. I'll definitely start with a set of very non-controversial parameters, as this will likely require rewriting most of the module. >> I'd like to use this thread to gauge community interest in adding this >> functionality to this module. If there is interest, I'll add it to the next >> commitfest. > > I think that's a good idea. Great to see that you are contributing > back some stuff. Thanks for the detailed feedback. On 9/25/17, 8:37 PM, "Euler Taveira" wrote: >> passwordcheck.max_expiry_period >> > How would you enforce that? If the password expires, you can log in to > change the password. If you let him/her to get in and change the > password, you can't obligate the user to do that. You could send a > message to remember that the password will expire but you can't > enforce that (unless you make a change in the protocol). My idea was to tie into the existing password expiration functionality (VALID UNTIL in CREATE/ALTER ROLE statements). I don't think this would involve any changes to how password expiration works. Instead, it would just be a simple check to make sure the specified expiration date is not too far in the future. >> passwordcheck.force_new_password >> > Does it mean a password different from the old one? +1. It could be > different from the last 3 passwords but we don't store a password > history. Yes. As Michael pointed out, this might be better to do as a separate effort since we'll almost certainly need to introduce a way to store password history. One interesting design challenge will be how to handle pre-hashed passwords, since the number of checks we can do on those is pretty limited. I'm currently thinking of a parameter that can be used to block, allow, or force pre-hashed passwords. If we take that route, perhaps we will also need to ensure that PostgreSQL fails to start when invalid combinations are specified (e.g. pre-hashed passwords are forced and min_password_length is nonzero). Thoughts? Also, I imagine we'll want to keep the cracklib and "password cannot contain role name" features around, although perhaps these could become configurable like the rest of the options. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix number skipping in to_number
On Mon, Sep 25, 2017 at 07:52:19PM +0100, Oliver Ford wrote: > Thanks for your review. The issue is that Oracle throws errors on many > more input cases than Postgres does, so making it exactly like Oracle > could break a lot of existing users. E.g. to_number ('123,000', '999') > returns '123' on Postgres, but throws an error on Oracle. So making it > exactly Oracle-like could break existing users who might rely on the > current behavior. I wouldn't use to_number for anything other than oracle compatibility, and then just so I didn't have to wade through all the ported oracle code. I would use a regex or some such to clean up the number, and then cast the result. For an input string of '123,000' I might do a translate('123,000', ',', '')::integer or perhaps use regexp_replace(). Either way I would want a more positive decision as to what was valid or not, based on the input. > My view is that we shouldn't deliberately introduce errors in order to be > exactly like Oracle if we don't currently and there's a sane use case for > current behavior. Do you have any examples of results that are different > between Oracle and Postgres, and you think the Oracle result makes more > sense? Not really, other than I think an error report might make more sense. '123,000' doesn't really match the format of '999'. If anything it seems like we're guessing rather than validating input. It is surprising (to me at least) that to_char(to_number('123,000', '999'), '999') doesn't give us the original input (in the sense that identical formats don't preserve the original string). So I'm not sure the current behavior is a sane use case, but perhaps more people are using to_number() to get *some* numeric result, rather than for wanting it to be like oracle. I would generally prefer to throw an exception instead of getting a number I wasn't expecting, but I can see cases where that might not be the case. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
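For completeness, the kind of explicit cleanup I mean is ordinary SQL along these lines (nothing to do with the patch itself):

    -- strip everything that isn't a digit or decimal point, then cast
    SELECT regexp_replace('123,000', '[^0-9.]', '', 'g')::numeric;  -- 123000
    SELECT translate('123,000', ',', '')::integer;                  -- 123000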
Re: [HACKERS] Enhancements to passwordcheck
On 9/26/17, 2:38 AM, "Albe Laurenz" wrote: > Nathan Bossart wrote: >>>> passwordcheck.force_new_password >>>> >>> Does it mean a password different from the old one? +1. It could be >>> different from the last 3 passwords but we don't store a password >>> history. >> >> Yes. As Michael pointed out, this might be better to do as a separate >> effort since we'll almost certainly need to introduce a way to store >> password history. > > That increases the number of passwords stored on the server and > consequently the damage when that list is stolen. > Of course the old passwords are invalid, but if someone cracks them > they could still try them on other systems the person uses. > > I think we should accept such a risk only if the benefits are clear, and > my opinion has always been that if you forbid password reuse, people > tend to come up with password generation schemes that are no better > than the original passwords. Right. However, without this, they may not change the password at all, which would make the expiry functionality less effective. I suppose there's not a great way to guard against these sorts of password generation schemes without being able to judge the proposed password against the previous password, too. Perhaps the max_expiry_period parameter should be left out for now as well. >> One interesting design challenge will be how to handle pre-hashed >> passwords, since the number of checks we can do on those is pretty >> limited. I'm currently thinking of a parameter that can be used to >> block, allow, or force pre-hashed passwords. If we take that route, >> perhaps we will also need to ensure that PostgreSQL fails to start when >> invalid combinations are specified (e.g. pre-hashed passwords are forced >> and min_password_length is nonzero). Thoughts? > > As was pointed out in the original discussion > d960cb61b694cf459dcfb4b0128514c203937...@exadv11.host.magwien.gv.at > the weak point of "passwordcheck" is that it does not work very well > for encrypted passwords. > The only saving grace is that you can at least check against > username equals password. Thanks for linking the original thread. There are a lot of interesting points. I wonder if enhanced password checking in core or contrib might be received differently with the introduction of SCRAM authentication, since the weaknesses of MD5 were often cited. > Disabling pre-hashed passwords in order to allow better password > checks is a problem rather than a solution, because it exposes you > to password theft of the clear-text password. I think we shouldn't > go there. > > The overall opinion in the above thread was that if you *really* care > about security, you don't use database passwords, but external > authentication with a centralized identity management system. > > So I think it is fine to extend "passwordcheck", but we shouldn't > take it serious enough to reduce security elsewhere in order to > improve the module. I understand the points made here, but not allowing configurability here really hinders the module's ability to enforce much of anything. However, I did say I wanted to avoid controversial parameters, so I'll plan to count this one out as well. 
This leaves us with the following proposed parameters for now:

passwordcheck.min_password_length
passwordcheck.min_uppercase_letters
passwordcheck.min_lowercase_letters
passwordcheck.min_numbers
passwordcheck.min_special_chars
passwordcheck.special_chars
passwordcheck.allow_username_in_password
passwordcheck.use_cracklib

Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
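For illustration, wiring these up would then look something like this in postgresql.conf (values are examples, not recommendations):

    shared_preload_libraries = 'passwordcheck'
    passwordcheck.min_password_length = 10
    passwordcheck.min_uppercase_letters = 1
    passwordcheck.min_lowercase_letters = 1
    passwordcheck.min_numbers = 1
    passwordcheck.min_special_chars = 1
    passwordcheck.special_chars = '!@#$%^&*()_+{}|<>?='
    passwordcheck.allow_username_in_password = off
    passwordcheck.use_cracklib = on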
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/25/17, 12:42 AM, "Michael Paquier" wrote: > + if (!IsAutoVacuumWorkerProcess()) > + ereport(WARNING, > + (errmsg("skipping \"%s\" --- relation no longer exists", > + relation->relname))); > I like the use of WARNING here, but we could use as well a LOG to be > consistent when a lock obtention is skipped. It looks like the LOG statement is only emitted for autovacuum, so maybe we should keep this at WARNING for consistency with the permission checks below it. > +* going to commit this transaction and begin a new one between > now > +* and then. > +*/ > + relid = RangeVarGetRelid(relinfo->relation, NoLock, false); > We will likely have to wait that the matters discussed in > https://www.postgresql.org/message-id/25023.1506107...@sss.pgh.pa.us > are settled. Makes sense. > +VACUUM FULL vactst, vactst, vactst, vactst; > This is actually a waste of cycles. I'll clean this up in v22. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 9/27/17, 7:41 AM, "Peter Eisentraut" wrote: > On 9/25/17 23:10, Bossart, Nathan wrote: >> One interesting design challenge will be how to handle pre-hashed >> passwords, since the number of checks we can do on those is pretty >> limited. I'm currently thinking of a parameter that can be used to >> block, allow, or force pre-hashed passwords. > > Pre-hashed passwords are the normal case. You can't break that without > making this module a net loss in security. A strength in making this configurable is that you could use it to enforce that passwords are pre-hashed. But yes, blocking pre- hashed passwords could be undesirable given an untrusted network or server. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Use of RangeVar for partitioned tables in autovacuum
On 9/26/17, 9:28 PM, "Michael Paquier" wrote: > In conclusion, I think that the open item of $subject should be > removed from the list, and we should try to get the multi-VACUUM patch > in to cover any future problems. I'll do so if there are no > objections. If someone did want to add logging for vacuum_rel() and analyze_rel() in v10 after your patch was applied, wouldn't the NULL RangeVars force us to skip the new log statements for partitions? I think we might instead want to back-patch the VacuumRelation infrastructure so that we can appropriately log for partitions. However, I'm dubious that it is necessary to make such a big change so close to release for hypothetical log statements. So, in the end, I agree with you. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/28/17, 8:46 PM, "Michael Paquier" wrote: > On Fri, Sep 29, 2017 at 10:44 AM, Tom Lane wrote: >> Michael Paquier writes: >>> On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan >>> wrote: >>>> Alright, I've added logging for autovacuum in v23. I ended up needing to >>>> do a little restructuring to handle the case when the relation was skipped >>>> because the lock could not be obtained. While doing so, I became >>>> convinced that LOG was probably the right level for autovacuum logs. >> >>> OK, of course let's not change the existing log levels. This could be >>> always tuned later on depending on feedback from others. I can see >>> that guc.c also uses elevel == 0 for some logic, so we could rely on >>> that as you do. >> >> FWIW, I don't think this patch should be mucking with logging behavior >> at all; that's not within its headline charter, and I doubt many people >> are paying attention. I propose to commit it without that. If you feel >> hot about changing the logging behavior, you can resubmit that as a new >> patch in a new thread where it will get some visibility and debate on >> its own merits. > > Okay. I am fine with that as well. Sure, that seems reasonable to me. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/29/17, 11:18 AM, "Robert Haas" wrote: > I don't think I understand problem #2. I think the concern is about > reporting the proper relation name when VACUUM cascades from a > partitioned table to its children and then some kind of concurrent DDL > happens, but I don't see a clear explanation on the thread as to what > exactly the failure scenario is, and I didn't see a problem in some > simple tests I just ran. Furthermore, it sounds like this might get > fixed as part of committing the patch to allow VACUUM to mention > multiple tables, which Tom has indicated he will handle. Yes. It looks like v10 is safe, and the vacuum-multiple-relations patch should help prevent any future logging issues caused by this. Discussion here: http://postgr.es/m/CAB7nPqRX1465FP%2Bameysxxt63tCQDDW6KvaTPMfkSxaT1TFGfw%40mail.gmail.com Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] patch for geqo tweaks
I have two patches to make the geqo initialization and mutation slightly better. The first adjusts the mutation swaps to avoid having to re-pick ties. The second changes the initialization and shuffling algorithm for the gene array to use an in-place Fisher-Yates shuffling algorithm. Diffs against commit 49124613f134b04594b1a5c46368eb0a5db16d4b (i.e. tip of master as of when I re-made the diff). On my system the patches pass a ./configure; make; make check -- nw >From c7f3c7cc37f943481b2358149210789be3d39ee9 Mon Sep 17 00:00:00 2001 From: Nathan Wagner Date: Sun, 21 Sep 2014 09:30:01 + Subject: [PATCH 1/2] cleanup geqo_mutation.c Avoid a possible random number collision by choosing random ranges better. This will potentially change the output of the algorithm since it avoids extra calls to geqo_randint(). --- src/backend/optimizer/geqo/geqo_mutation.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/backend/optimizer/geqo/geqo_mutation.c b/src/backend/optimizer/geqo/geqo_mutation.c index 1a06d49..c78bd2c 100644 --- a/src/backend/optimizer/geqo/geqo_mutation.c +++ b/src/backend/optimizer/geqo/geqo_mutation.c @@ -43,20 +43,18 @@ geqo_mutation(PlannerInfo *root, Gene *tour, int num_gene) int num_swaps = geqo_randint(root, num_gene / 3, 0); Genetemp; - while (num_swaps > 0) { swap1 = geqo_randint(root, num_gene - 1, 0); - swap2 = geqo_randint(root, num_gene - 1, 0); + swap2 = geqo_randint(root, num_gene - 2, 0); - while (swap1 == swap2) - swap2 = geqo_randint(root, num_gene - 1, 0); + if (swap2 >= swap1) + swap2++; temp = tour[swap1]; tour[swap1] = tour[swap2]; tour[swap2] = temp; - num_swaps -= 1; } } -- 2.5.0 >From 47979bb844eda3821f08c1f00afc081ee0a3a260 Mon Sep 17 00:00:00 2001 From: Nathan Wagner Date: Sun, 21 Sep 2014 13:02:48 + Subject: [PATCH 2/2] rewrote array shuffler for geqo_recombination This simplifies the array shuffling by doing it inplace and eliminating a memory allocation and array initialization. The output is identical in the sense that it still randomizes an array, but the actual shuffled array will be different. --- src/backend/optimizer/geqo/geqo_recombination.c | 32 +++-- 1 file changed, 9 insertions(+), 23 deletions(-) diff --git a/src/backend/optimizer/geqo/geqo_recombination.c b/src/backend/optimizer/geqo/geqo_recombination.c index 652fadc..b327776 100644 --- a/src/backend/optimizer/geqo/geqo_recombination.c +++ b/src/backend/optimizer/geqo/geqo_recombination.c @@ -37,31 +37,17 @@ void init_tour(PlannerInfo *root, Gene *tour, int num_gene) { - Gene *tmp; - int remainder; - int next, - i; + int i, j; - /* Fill a temp array with the IDs of all not-yet-visited cities */ - tmp = (Gene *) palloc(num_gene * sizeof(Gene)); - - for (i = 0; i < num_gene; i++) - tmp[i] = (Gene) (i + 1); - - remainder = num_gene - 1; - - for (i = 0; i < num_gene; i++) - { - /* choose value between 0 and remainder inclusive */ - next = geqo_randint(root, remainder, 0); - /* output that element of the tmp array */ - tour[i] = tmp[next]; - /* and delete it */ - tmp[next] = tmp[remainder]; - remainder--; + /* shuffle tour in-place, uses Fisher-Yates inside-out algorithm */ + for (i=0; i < num_gene; i++) { + j = geqo_randint(root, i, 0); + /* we could skip the compare and just assign */ + if (i != j) { + tour[i] = tour[j]; + } + tour[j] = (Gene) i+1; } - - pfree(tmp); } /* alloc_city_table -- 2.5.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] No Issue Tracker - Say it Ain't So!]
I don't have the original message for this thread, so I arbitrarily picked a message to reply to. Since what has been asked for is a bug *tracker*, and we already have a bugs mailing list, I put together something. I downloaded the archives for pgsql-bugs, and fed them into a database. This part was easy, since I have already written a pg backed usenet server and had the code handy for storing and parsing out bits of rfc 2822 messages. It's dirt simple. If the system sees a message with 'Bug #(\d+)' in the subject line, it creates an entry in a bugs table with that bug number (if needed), and then marks the message as belonging to that bug. If there seems to be metadata about the bug in the format of the (unquoted) Bug reference: Logged by: Email address: PostgreSQL version: Operating system: Description: Details: it pulls that out and puts it in the bugs table. There's also an "open" boolean in the table, defaulting to true. The results can be found at https://granicus.if.org/pgbugs/ Ok. So now we have a bug tracker, but... Some open questions that I don't think have really been addressed, with my commentary interspersed:

1: Can a bug be more than "open" or "closed"? I think yes. At least we probably want to know why a bug is closed. Is it not a bug at all, not our bug, a duplicate submission, a duplicate of another bug, something we won't fix for some reason (e.g. a bug against version 7)?

2: Who can declare a bug closed? Ugh. I'm going to close some of them if it seems obvious to me that they should be closed. But what if it's not obvious? I could probably maintain it to some extent, but I don't know how much time that would actually take. Related to the next point, it probably makes sense to just close up front bugs that are marked against unsupported pg versions, or haven't had any activity for too long, perhaps two years. Just closing bugs with no mailing list activity for two years closes 5280 of 6376 bugs.

3: How far back should I actually import data from the bugs list? I have imported each archived month from December of 1998. It looks like the bug sequence was started at 1000 in December of 2003. Emails with no bug id in the subject line don't get associated with any bug; they're in the DB but not really findable.

4: What should I do with emails that don't reference a bug id but seem to be talking about a bug? I suggest we do nothing with them as far as the bug tracker is concerned. If people want to mark their message as pertaining to a bug, they can put that in the subject line. However, I don't think a bug id can be assigned via email, that is, I think you have to use a web form to create a bug report with a bug id. Presumably that could change if whoever runs the bug counter wants it to.

5: How can we use email to update the status of a bug? I suggest using email headers to do this. 'X-PGBug-Fixed: ' and the like. I assume here that everyone who might want to do such a thing uses an MUA that would allow this, and they know how.

6: Does there need to be any security on updating the status? Probably not. I don't think it's the sort of thing that would attract malicious adjustments. If I'm wrong, I'd need to rethink this. I realize I'm making security an afterthought, which makes my teeth itch, but I think layers of security would make it much less likely to be actually adopted.

Just to be clear, this is both a work in progress and a proof of concept. It's slow, it's ugly. I haven't created any indexes, written any css or javascript, or implemented any caching.
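In rough terms, the tables behind this look something like the following (a simplified sketch; the names and columns here are illustrative rather than an exact dump of the schema):

create table bugs (
    bugid       bigint primary key,   -- the NNNN from 'Bug #NNNN' in the subject
    open        boolean not null default true,
    status      text default 'new',
    logged_by   text,
    email       text,
    pg_version  text,
    os          text,
    description text
);

create table emails (
    msgid  text primary key,                -- Message-ID header
    bug    bigint references bugs (bugid),  -- null if no bug id in the subject
    sent   timestamptz,                     -- parsed Date header
    msg    text                             -- raw rfc 2822 message
);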
I may work on that before you read this though. Comments are welcome, and no, I don't really expect that this will be what gets adopted, mainly I wanted to show that we can probably just build something rather effective off our existing infrastructure, and I had Saturday free (as of this writing, I've got maybe 5 hours into it total, albeit with lots of code re-use from other projects). There are some obvious todo items, filtering and searching being the most salient. Some data import issues: March 10, 2003, bad Date header, looked like junk anyway, so I deleted it. Time zone offsets of --0400 and -0500 (at least), I took these as being -0400 (or whathaveyou). Date header of Sat, 31 May 2008 12:12:18 +1930, judging by the name on the email, this is probably posted from Venezuela, which switched to time zone -0430 in 2007, which could also be thought of as +1930 if you ignore the implied date change. And, by way of some statistics, since I've got the archives in a database: Emails: 43870 Bugs: 6376 Distinct 'From' headers: 10643 The bugs have 3.5 messages each on average, with 2 being the most common number, and 113 at the most, for bug 12990. 1284 bugs have only one message associated with them. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] No Issue Tracker - Say it Ain't So!]
On Sun, Oct 04, 2015 at 04:30:49PM -0700, Josh Berkus wrote: > That would be the key part, wouldn't it? Nice that you have [code to > store and parse email messages]. Yeah. It actually made most of the work pretty easy. It's available with a bunch of other code at https://pd.if.org/git/ if anyone wants it. I did find a bug in my header processing though, so I'll need to commit that fix. > We'd also want a way to link a bug fix to a commit, and probably a way > to give the bug a list of searchable keywords (and add to that list). I've been thinking of hooking it up to the fti machinery and providing a search box. I've never really used fti before, so this might be a good opportunity to learn it for real. > > it probably makes sense to just close up front bugs that are marked > > against unsupported pg versions, or haven't had any activity for too > > long, perhaps two years. > I'm reluctant to close all of those unexamined, since part of the > purpose of this is to find bugs which were never fixed. Probably we > should organize a posse to comb trhough all of the old bugs and > hand-close them. I'm doing some of that as I poke at the bugs pages. Perhaps it would make sense to have a closed status of 'stale' or the like (perhaps that's what you meant by 'timed out') which could be used to get bugs out of the main list but still be marked as 'not human reviewed'. These could then be reviewed by the posse. > Yeah, fixing this [email's without a bug id] would probably be tied to > the possible change to mailman. Unless someone already has a way to > get majordomo to append a bug ID. How are bug ids assigned now? From the evidence, I assume there is a sequence in a database that the web submission form queries to format a resulting email to the bugs mailing list. Do the mailing lists live on the same server? If so, perhaps it would be easy to assign a new bug id to a new thread on -bugs. But perhaps that's too aggressive in creating bugs. > > 5: How can we use email to update the status of a bug? > > > > I suggest using email headers to do this. 'X-PGBug-Fixed: > > ' and the like. I assume here that everyone who might > > want to do such a thing uses an MUA that would allow this, and they > > know how. > > I guess that depends on who we expect to use this, at least for > closing stuff. I could certainly support more than one mechanism. A web interface for those who would prefer such and emails would seem to be the basic requirements. > > 6: Does there need to be any security on updating the status? > > > > Probably not. I don't think it's the sort of thing that would > > attract malicious adjustments. If I'm wrong, I'd need to rethink > > this. I realize I'm making security an afterthought, which makes my > > teeth itch, but I think layers of security would make it much less > > likely to be actually adopted. > > I think there needs to be some kind of administrative access which > allows, for example, an issue to be closed so that it can't be > reopened. I guess technically we have that now since I'm the only one who can close or open a bug in the db I've set up :) Seriously though, I think it probably makes the most sense to tie the system in with the existing pg community accounts if it goes that far. > Anyway, I'm not convinced we want to reinvent this particular wheel, but > if we do, you've done a yeoman's job. Thank you. (Assuming I've interpreted the phrase correctly, I'm not familiar with it, and a web search was only semi-enlightening). 
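Mechanically, marking bugs stale is nothing more than a bulk update keyed off the newest message on each bug; a sketch of the sort of query involved (assuming a status column on the bugs table and a parsed Date on each stored message, which may not match the real schema exactly):

update bugs b
   set open = false,
       status = 'stale'
 where b.open
   and not exists (select 1
                     from emails e
                    where e.bug = b.bugid
                      and e.sent > now() - interval '2 years');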
-- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] bugs and bug tracking
So, in order to do some clean up and see how my pgbugs page (https://granicus.if.org/pgbugs/) might actually work, I've been going through bugs and marking their status. A lot of questions arise.

A lot of the reports aren't bugs at all, but requests for help. My guess is that the users either don't know where to ask or don't understand the difference between a bug and not knowing how to do what they want to do. Perhaps a more thorough explanation on the submission form would be useful.

What is the difference between a bug and a feature request? Ok, I know the difference, but do we want to treat them differently? For example, see bug 9457 (https://granicus.if.org/pgbugs/9457). As Pavel Stehule noted, this isn't a *bug* per se, but what should we do with it? Close it as 'not a bug'? I don't like this because it's not really the same as the other 'not a bug's. Create another 'closed' status of 'feature request'? Except that if we decide to implement the feature, in some sense it becomes a bug until we actually implement it. Create an 'open' status of 'feature request' to mark it until it is either implemented or rejected? At least then it's tracked. This last choice is my preference.

I conflate open bugs in the sense of 'not closed so we still need to do something with the bug even if it is just closing it' and open bugs in the sense of 'this seems to actually be a bug in postgres'. I'm not sure what terminology I should use.

I have lots of different types of 'not a bug'. Not a bug, the user should have posted to a different list. (e.g. 13602) Not a bug, probably user error, which is similar to the above. Not a bug, but a bug in the libraries we use (e.g. openssl, 10184) Not a bug, works as intended, i.e. the user didn't make a mistake, but had an incorrect notion of what it was supposed to do. (11596) Not a bug, but the user never got a reply. That is, I decided personally that this wasn't actually a bug. (13367) And bug 1000 is not a bug, system test. Do we care about the difference between any of these? I track them differently in my update script, but they all get the same status in the db.

Can a bug be 'fixed' if there's no actually identifiable commit that fixes the bug? See 13516, which Tom Lane claims was fixed in 9.1. A grep for 13516 of the git log for both master and REL9_1_STABLE doesn't turn up anything. I can't as yet figure out how to match up git commit messages to identify every branch in which a fix was applied. I could of course load all of the commit messages into a table and compare them that way.

Should I mark as "open" (i.e. not "new") any report that has a response? More than one response? That would at least distinguish between bug reports that at least someone, in theory, has taken a look at, and those that haven't been addressed at all.

I have created the following statuses:

Fixed - bug is presumably fixed

Unreproducible - we can't make the system demonstrate this error

Timed Out - the reporter was asked to provide more information and didn't respond for a while. I would probably suggest somewhere around a month for this. Should there be a 'waiting on feedback' to mark the 'pre timed out' phase?

Stale 5281 - the bug hasn't had any activity for >2 years, so just close it. If people want to troll through these to give them a better status, that would probably be good, but there's still a thousand open bugs newer than that.

Not Our Bug - looks like a bug, but it's not a bug in postgres. What exactly are our bugs? Just what you'd get out of the release tarballs or the git repo? Or is there more?

Not a Bug - see above discussion

Won't Fix - this is arguably a bug, but for some reason we're not going to fix it. Perhaps we intentionally violate the standard, or it's a bug against a version we don't support, or we're not going to backpatch it.

Open - this seems to be a bug, or at least we're talking about it and it's not where we want to close it. Note of course that "closing" a bug just means it will show up somewhere else in my tracker; obviously it doesn't affect the mailing list at all.

New - this hasn't been looked at enough for someone to change the status to something better.

I don't have a "reopened" status. I'm not sure what it means, other than it used to be closed, but someone changed it back to open. I don't immediately see why we would want to distinguish between this and a regular open bug, other than perhaps as a way of conflating status with priority. It's easy to make one though if people really want it. I probably have too many statuses already.

I will post later on my thoughts on how to control the system. Are people, in principle, willing to put magic incantations in their emails and commit messages? I'm not asking for a commitment to my tool here, I'm just exploring what the bounds of people's, and committers' in particular, willingness to adjust their workflow or message texts a bit to make it easier to automate bu
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 12:04:11PM -0500, Jaime Casanova wrote: > I like how this page is looking now, good work. Thank you. > Now, AFAIU from previous mails part of the reason to have a bug > tracker is to make easy to know when a bug was fixed. Which should be > interpreted as: which versions this bug affected? and which minor > versions on those branches fix the issue > > for example bug # 13636 was reported for 9.4.4 but it existed in older > branches so Tom fixed it in all active branches (ie: > http://www.postgresql.org/message-id/e1zfjgx-0005lu...@gemulon.postgresql.org). > is it possible (even if not yet implemented) to add that information? It is possible. I don't know yet how easy it will be to automate it for all the back patches. I think I can look for matching commit messages but I haven't written any code yet. It might be possible to infer this information after the fact by looking at where on which branches a commit fix was applied. > also i like that we can search on bugs but we can filter by version. > i'm just guessing that if someone looks for a bug he's probably > worrying about the specific version he is using. I'll probably get to adding filtering soon-ish. Maybe even today. I haven't decided if I want to do that on the server side, or add some javascript to let the user sort and filter whatever the page has already returned. I'm not really a web guy, so it takes me a while to figure out what to do. > having a bug tracker that allow us to not lose track of bugs is good, > having a bug tracker that allow us to have the whole information on a > bug is better, IMHO. I agree. It's just a matter of somehow automating the process. I'm under no illusions though that I have any control over things. I'm hoping that one or more of the committers will say "we'd like to do it this way" and I'll work with that. Personally, I'd like to see either '[Fixes #12345]' anywhere in a commit message, or 'Fixes: #12345' at the beginning of a line. But it has to come from them. Also, the version numbers are user reported and a bit of a mess. I don't think they could really be relied on as is for users trying to find out if a bug affects their version. Someone would have to update that information, and communicate that update to the tracker. The general concensus seems to be that magic phrases at the beginning of a line in a message body is the way to go. I don't necessarily agree, but any consistent system can be made to work. This obviously applies to any metadata, not just version numbers. > > A lot of the reports aren't bugs at all, but requests for help. My > > guess is that the users either don't know where to ask or don't > > understand the difference between a bug and not knowing how to do what > > they want to do. Perhaps a more thorough explaination on the submission > > form would be useful. > > > > the real problem here is that fill the bug report doesn't force you to > register in a ML, while asking for help in a ML will. and a lot of > people don't want to register in a ML, they just want a specific > question answered so i don't think any change in the form will avoid > that. True. Perhaps we could provide another form for other lists. Probably tilting at windmills here, but it would be nice if we could cut down on the amount of time taken up by "this isn't a bug, you should go ask down the hall". -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
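To make the machine-parsing point above concrete: once commit messages are loaded into a table, extracting a 'Fixes: #12345' style reference is only a query along these lines (the commits table and its columns are hypothetical, not anything that exists today):

select c.branch, c.commit_hash,
       (regexp_matches(c.message, 'Fixes:\s*#(\d+)'))[1]::int as bugid
  from commits c
 where c.message ~ 'Fixes:\s*#\d+';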
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 10:57:42AM -0700, Josh Berkus wrote: > On 10/06/2015 10:17 AM, Bruce Momjian wrote: > > Therefore, our current default behavior is to ignore user reports, > > unless someone takes an action to reply, record, or retain the email for > > later review. What a tracker does is to make the default user report be > > _retained_, meaning we have to take action to _not_ retain a user report > > as an open item. > > Well, we can determine how that's handled. There are bug trackers out > there that automatically archive unconfirmed bug reports after a certain > amount of time. I'd personally recommend it. > > Of course, that requires a bug tracker which can have an "unconfirmed" > status. This is essentially what I have done with the 'Stale' status. Though I have done at two years to be conservative, rather than 60 days, which I think is entirely more reasonable. > > Second, we have a mix of user reports. Some bug reports are not bugs > > and must be reclassified. In other cases, uses ask questions via > > non-tracked communicate channels, e.g. pgsql-general, but they are > > really bugs. So, to do this right, we need a way of marking tracked > > bugs as not bugs, and a way of adding bugs that were reported in a > > non-tracked manner. > > Yeah, I was wondering about that. I think I have suggested that there be a way to generate a bug id via email. Presumably someone could just copy that email address to make a not-tracked discussion get a bug id. If the system archived all the lists (not hard) it would be possible to pull the other emails from the thread into the bug (also not hard). As for marking as 'not-a-bug' this can easily be done via whatever mechanism might be used. Something along the lines of: Bug Status: not a bug would probably work. It's really whatever people are willing to actually do. > FWIW, when I talk about bugs which we lost track of, they're not > generally unconfirmed bug reports. Usually, it's stuff which a > contributor replied to, but the bug was low-impact, circumstantial, and > hard to reproduce, and came in during a busy period (like release time). > So I'd be perfectly OK with the idea that unconfirmed bugs hang around > in the system for 60 days, then automatically convert to "stale" status. My tracker would do this trivially if I changed the query to 60 days instead of two years. > Until we build up a team of volunteers for bug triage, we might have to > do that. > > Speaking of which ... this project is rich in skilled users who are > involved in the community but don't code. Bug triage is exactly the > kind of thing very part-time community supporters can do, if we make it > easy for them to do. I can make it easy. But that doesn't directly address two of the other points: 1: Do we need any system for who is allowed to triage bugs? 2: Should an equivalent email be sent to the list? Specifically with point number 2, suppose the above mechanism is used. When a triager marks a bug as (say) not a bug, should the system just update the database, or should it actually send a formatted email to the bugs list with the 'Bug Status: not a bug' line (among others, presumably)? I think it should send the email, but I can see how that could be construed as junk. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
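As a sketch of how a 'Bug Status:' directive could be pulled out of the stored messages (the emails table and body() helper follow the names the tracker's import code already uses; the exact phrase is whatever gets agreed on):

select e.bug,
       btrim((regexp_matches(body(e.msg), 'Bug Status:\s*([^\n]+)'))[1]) as new_status
  from emails e
 where e.bug is not null
   and body(e.msg) ~ 'Bug Status:';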
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 01:17:48PM -0400, Bruce Momjian wrote: > I do think we rushed to choose a tracker a little too quickly. Have we chosen one? > Let me explain, from a high level, what a new tracker will change in > our workflow. [snip] I won't quote your whole message, which I essentially agree with. Let me say that the questions I have brought up have several purposes. One, I think it's important to identify what exactly we're after. I hope my questions have helped put some light on that. Two, I think any attempt to tell the developers and committers that they need to change their workflow to adapt to some system is bound to fail, so I have asked: just what changes would you all be willing to actually *do*? Tom Lane is pretty good at noting a bug number in his commit messages, for example. Would he be willing to modify that slightly to make it easier to machine parse? Would you be willing to add a bug number to your commit messages? I'm not asking for guarantees. Actually I'm not really asking for anything, I'm just trying to figure out what the parameters of a solution might be. If the answer to that is "no, I'm not willing to change anything at all", that's fine, it just colors what might be done and how much automation I or someone else might be able to write. I think even with a bug tracker the default "ignore" behavior can still be done. In principle, we could even mark bugs as "unconfirmed" or "logged" or something right away and only mark them as new or open or something if they actually draw a reply. We could even require a reply from a committer if that was wanted. If I haven't made it clear by now, I am trying to write a system that requires the absolute minimal amount of change to the existing way of doing things. As I noted in my original email, I've put together a bug tracker, not a ticket system. If people would like to make some small adjustments to make it easier to automate a bug tracker, that would be great, but if not, that's fine too, it's no *worse* than what we already have. And if people really wanted a ticket system, it wouldn't be hard to enhance a tracker. > My point is that we have our current workflow not because we are > idiots, but because it fit our workflow and resources best. I am not > sure if we have succeeded because of our current non-retain mode, or > in spite of it. It might be time to switch to a default-retain mode, > especially since most other projects have that mode, but we should be > clear what we are getting into. I think having a bug tracker and retention vs non-retention are orthogonal questions. But I agree that knowing what might be involved is a good idea. I think perhaps one of the problems is that different people want different things, so it's probably going to be hard to make everyone happy. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bugs and bug tracking
I have added full text searching to my tracker. I only index the first 50 KB of each message. There's apparently a one MB limit on that anyway, which a few messages exceed. I figure anything important is probably in the first 50KB. I could be wrong. I could re-index fairly easily. It seems to work pretty well. I have changed the 'stale' status to 90 days. I have triaged most of the remaining bugs with a message count > 1. This is in a separate file in case some other system wants to collect the work. I am still working out exactly how to find multiple matching commits along different branches. I'm running git patch-id on every commit, but that will take a while. Later tonight I will implement some in-message-body update syntax. I'm also going to create (internally) a web page for the bugs that allows editing the status from that page. I will post the details on that once I've tested it a bit. I'll probably make a web page allowing updates as well. How often are the bugs mailing list archives updated? I've got a bunch of posts to the bugs list in my inbox that aren't in the archive download. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
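Once the patch-ids are collected, finding a fix that was applied to several branches should reduce to a grouping query, roughly as follows (assuming one row per commit per branch in a hypothetical commits table):

select patch_id,
       array_agg(distinct branch order by branch) as branches
  from commits
 group by patch_id
having count(distinct branch) > 1;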
Re: [HACKERS] bugs and bug tracking
On Wed, Oct 07, 2015 at 03:06:50PM -0400, Stephen Frost wrote: > * Nathan Wagner (nw...@hydaspes.if.org) wrote: > > I have added full text searching to my tracker. I only index the first > > 50 KB of each message. There's apparently a one MB limit on that > > anyway, which a few messages exceed. I figure anything important is > > probably in the first 50KB. I could be wrong. I could re-index fairly > > easily. It seems to work pretty well. > > Note that we have FTS for the -bugs, and all the other, mailing lists.. True, but that finds emails. The search I have finds bugs (well, bug reports anyway). Specifically, I have the following function: create or replace function bugvector(bugid bigint) returns tsvector language 'sql' as $$ select tsvagg( setweight(to_tsvector(substr(body(msg), 1, 50*1024)), 'D') || setweight(to_tsvector(header_value(msg, 'Subject')), 'C') ) from emails where bug = $1 $$ strict; which, as you can see, collects into one tsvector all the emails associated with that particular bug. So a search hit is for the whole bug. There's probably some search artifacts here. I suspect a bug with a long email thread will be ranked higher than a one with a short thread. Perhaps that's ok though. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
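A search against that is then the usual tsquery match, something like the following (the bugs table and its bugid column are names chosen for illustration):

select b.bugid,
       ts_rank(bugvector(b.bugid), q) as rank
  from bugs b,
       plainto_tsquery('vacuum full hangs') q
 where bugvector(b.bugid) @@ q
 order by rank desc
 limit 25;

In practice the tsvector is worth caching in a column (and GIN-indexing) rather than recomputing per row, since bugvector() re-aggregates all of a bug's messages on every call.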
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 20, 2015 at 10:39:55AM -0700, Joshua D. Drake wrote: > So where are we at on this? Well, I can't speak to where we are, but my system is up, running, and seems to work well. It even attracts a few visitors. I have been meaning to write a triage interface, but I have been stuck doing postgis work for an anthropology paper I am working on for the last week. I hope to be able to take a break from that tomorrow and through the weekend. I have read the various comments, and will also do some simple scanning for phrases in emails and commit messages indicating status changes, which I will reflect on the main web page. Once I write that code, I will email the list with what I have done. If people want to use it, or criticize it, they would be welcome to do so. If not, well, the interface is useful for me in any event, so I will probably maintain it for the foreseeable future. https://granicus.if.org/pgbugs/ for anyone who hasn't seen it and wants to take a look. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] fortnight interval support
On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be cool. Patch attached... :) % psql -p 5433 -d template1 -h localhost psql (9.4.5, server 9.6devel) WARNING: psql major version 9.4, server major version 9.6. Some psql features might not work. Type "help" for help. template1=# select current_date; date 2015-10-27 (1 row) template1=# select '1 fortnight'::interval; interval -- 14 days (1 row) template1=# select current_date + '1 fortnight'::interval; ?column? - 2015-11-10 00:00:00 (1 row) template1=# select current_date + '1.3 fortnight'::interval; ?column? - 2015-11-14 04:48:00 (1 row) template1=# select current_date + '1.3 fortnights'::interval; ?column? - 2015-11-14 04:48:00 (1 row) -- nw diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 926358e..2032fe0 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -186,6 +186,8 @@ static const datetkn deltatktbl[] = { {DDECADE, UNITS, DTK_DECADE}, /* "decade" relative */ {"decades", UNITS, DTK_DECADE}, /* "decades" relative */ {"decs", UNITS, DTK_DECADE},/* "decades" relative */ + {DFORTNIGHT, UNITS, DTK_FORTNIGHT}, /* "fortnights" relative */ + {"fortnights", UNITS, DTK_FORTNIGHT}, /* "fortnights" relative */ {"h", UNITS, DTK_HOUR}, /* "hour" relative */ {DHOUR, UNITS, DTK_HOUR}, /* "hour" relative */ {"hours", UNITS, DTK_HOUR}, /* "hours" relative */ @@ -3281,6 +3283,12 @@ DecodeInterval(char **field, int *ftype, int nf, int range, tmask = DTK_M(DAY); break; + case DTK_FORTNIGHT: + tm->tm_mday += val * 14; + AdjustFractDays(fval, tm, fsec, 14); + tmask = DTK_M(WEEK); + break; + case DTK_WEEK: tm->tm_mday += val * 7; AdjustFractDays(fval, tm, fsec, 7); diff --git a/src/include/utils/datetime.h b/src/include/utils/datetime.h index e9a1ece..3641292 100644 --- a/src/include/utils/datetime.h +++ b/src/include/utils/datetime.h @@ -52,6 +52,7 @@ struct tzEntry; #define DHOUR "hour" #define DDAY "day" #define DWEEK "week" +#define DFORTNIGHT "fortnight" #define DMONTH "month" #define DQUARTER "quarter" #define DYEAR "year" @@ -181,6 +182,7 @@ struct tzEntry; #define DTK_TZ_MINUTE 35 #define DTK_ISOYEAR36 #define DTK_ISODOW 37 +#define DTK_FORTNIGHT 38 /* diff --git a/src/test/regress/expected/interval.out b/src/test/regress/expected/interval.out index c873a99..7a72f2a 100644 --- a/src/test/regress/expected/interval.out +++ b/src/test/regress/expected/interval.out @@ -40,6 +40,12 @@ SELECT INTERVAL '1.5 weeks' AS "Ten days twelve hours"; 10 days 12:00:00 (1 row) +SELECT INTERVAL '1 fortnight' AS "Fourteen days"; + Fourteen days +--- + 14 days +(1 row) + SELECT INTERVAL '1.5 months' AS "One month 15 days"; One month 15 days --- diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql index 789c3de..285266a 100644 --- a/src/test/regress/sql/interval.sql +++ b/src/test/regress/sql/interval.sql @@ -12,6 +12,7 @@ SELECT INTERVAL '-08:00' AS "Eight hours"; SELECT INTERVAL '-1 +02:03' AS "22 hours ago..."; SELECT INTERVAL '-1 days +02:03' AS "22 hours ago..."; SELECT INTERVAL '1.5 weeks' AS "Ten days twelve hours"; +SELECT INTERVAL '1 fortnight' AS "Fourteen days"; SELECT INTERVAL '1.5 months' AS "One month 15 days"; SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years..."; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Tue, Oct 27, 2015 at 12:04:55PM -0500, Merlin Moncure wrote: > On Tue, Oct 27, 2015 at 8:52 AM, Nathan Wagner wrote: > > On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > >> Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be > >> cool. > > > > Patch attached... > This is very cool (you are 100% certain there are no performance > impacts on current cases, right?)! :-) It passed the regression test. It must be perfect :) -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Wed, Oct 28, 2015 at 08:17:25AM +1300, Gavin Flower wrote: > You trying to get PostgreSQL banned in France??? :-) > > When I was learning French many years ago, I was told that the French > consider their fortnight to be 15 days!!! What, it's a "fortnight", not a "quinzaine". You have no idea how hard it was to resist updating the patch... -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Tue, Oct 27, 2015 at 01:52:11PM +, Nathan Wagner wrote: > On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > > Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be > > cool. > > Patch attached... This isn't necessarily bad, but I observe that it would be difficult or impossible to add fortnight support to intervals as an extension rather than by modifying the scanner tables that the interval parser uses at original compile time. You might be able to override symbols with a C extension, but parts of the parser are static (in the C sense), so you'd need to override and duplicate a lot of the existing functions. Of course, a hookable interval parser is absurd in the first place. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add EXTRA_CFLAGS to configure
On Wed, Oct 28, 2015 at 02:42:19PM +0100, Robert Haas wrote: > On Wed, Oct 28, 2015 at 2:17 PM, Andres Freund wrote: > >> I use COPT for this purpose. > > > > Unless I miss something you can't just pass that to configure though, > > right? I.e. it has to be passed to each make invocation? > > What I do is: > > echo COPT=-Wall -Werror > src/Makefile.custom Make will pick up variables from the environment. So unless the makefile goes out of its way to circumvent it, you can just do COPT=-Werror export COPT and then run your usual configure/compile cycle. There's no specific need to modify the makefiles or pass extra arguments into make. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] September 2015 Commitfest
On Sat, Oct 31, 2015 at 08:03:59AM +0100, Robert Haas wrote: > +1. FWIW, I'm willing to review some patches for this CommitFest, but > if the committers have to do first-round review as well as > committer-review of every patch in the CommitFest, this is going to be > long, ugly, and painful. We need to have a substantial pool of > non-committers involved in the review process so that at least some of > the work gets spread out. As a non-committer, let me offer my thoughts. First, I would ask that every patch include a commit id that the patch was generated against. For example, I was looking at the "group command option for psql" patch. I created a branch off of master, but the patch doesn't apply cleanly. On further investigation, it looks like Adam Brightwell noted this on September 21, but the patch hasn't been updated. If I knew which commit id the patch was created against, I could create a branch from there and test the patch, but without it, I need to figure that out, which is more work, or I need to re-create the patch, which is also more work. Second, it would be convenient if there were a make target that would set up a test environment. Effectively do what 'make check' does, but don't run the tests and leave the database up. It should probably drop you into a shell that has the paths set up as well. Another target should be available to shut it down. So, what would be cool, and would make testing easier, is if I could do a 'git checkout -b feature abcdef' (or whatever the commit id and branch name you want to use are), and then from there: make; make check; make testrig; make testshutdown. These two changes would go a long way to making the process of actually doing a review easier. Back to the patch in question: Mr Brightwell noted that the patch doesn't apply against master. Should someone then mark the patch as waiting on author? Is failing to apply against master a 'waiting on author' cause? Is the answer different if the patch author has supplied a commit id that the patch was generated from? There was then some further discussion on the interface, and what to do with startup files, and nothing was really decided, and then the thread petered out. This potential reviewer is then left with the conclusion that this patch really can't be reviewed, and it's not clear whether it's even wanted as is anyway. So I move on to something else. There was a bunch of discussion by a bunch of committers, and no one updated the status of the patch in the commitfest, and there are no clear guidelines on what the status should be in this case. If I were needing to decide, I would say that the patch should either be marked as "Waiting on Author" on the grounds that the patch doesn't currently apply, or "Returned with feedback" on the grounds that there was unaddressed feedback on the details of the patch, and it was noted as a "proof of concept" when submitted anyway. But I'm unwilling to just change it to that without more clear definitions of the meaning of each status. A link to definitions and when the status should be changed would help. What is "ready for committer" anyway? It's clearly not "a committer will apply the patch", since things sit in that status without being committed. If I think the patch is good and should be applied, do I mark it as ready for committer, and then later a committer will also review the patch? If so, is doing anything other than the sanity checks, and possibly offering an opinion, on the patch even useful?
-- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] September 2015 Commitfest
On Sat, Oct 31, 2015 at 12:08:58PM -0400, Tom Lane wrote: > Nathan Wagner writes: > > Second, it would be convenient if there were a make target that would > > set up a test environment. Effectively do what the 'make check' does, > > but don't run the tests and leave the database up. It should probably > > drop you into a shell that has the paths set up as well. Another > > target should be available to shut it down. > > As far as that goes, I don't think it's really the makefiles' place to > establish a manual-testing convention. What I do, and what I think > most other longtimers do, is create test installations in nondefault > places. [snip description on how to set this up ] > You could imagine putting something into the standard makefiles > that did some subset of this, but I think it would be too rigid > to be useful. I think it would be very useful to just be able to tell the system "fire this up for me so I can test it". I don't think it needs to handle every possible testing scenario, just making it easier to leave up the test postmaster from make check would be very useful, at least to me. > As an example, what if you wanted to compare the behaviors of both > unmodified HEAD and your patched code? It's not very hard to set up > two temporary installations along the lines of the recipe I've just > given, but I can't see the makefiles handling that. They could pick up make or environment variables. We already do that for psql. Something like PGPORT=5495 PGPATH=~/pg95 make startit or some such. I'm not actually proposing this, I'm just noting how the makefiles could handle it fairly easily. All I'd really like is a way to leave the database used for 'make check' running so I can do any additional poking around by hand that I might want to do more easily. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Wed, Nov 04, 2015 at 12:51:52PM -0500, Tom Lane wrote: > Nathan Wagner writes: > > I have two patches to make the geqo initialization and mutation > > slightly better. > > > The first adjusts the mutation swaps to avoid having to re-pick > > ties. The second changes the initialization and shuffling algorithm > > for the gene array to use an in-place Fisher-Yates shuffling > > algorithm. > > I took a quick look at this. > > I'm not very impressed with the first patch: it might save a few > geqo_randint() calls, but it seems to do so at the price of making the > swap choices less random --- for instance it sure looks to me like the > last array element is now less likely to participate in swaps than > other elements. Unless you can prove that actually the swapping is > still unbiased, I'm inclined to reject this part. If I have understood the original code correctly, we need to select two different random integers between 0 and num_gene-1, inclusive. That happens to be num_gene possible results. Having chosen the first one, which I will call "swap1", we now only have num_gene-1 possible results, which need to range from either 0 to swap1-1 or from swap1+1 to num_gene-1, which is num_gene-1 possible results. I treat this as a single range from 0 to num_gene-2 and generate a number within that range, which I will call "swap2". If swap2 is between 0 and swap1-1, it is in the first range, and no adjustment is necessary. If it is greater than or equal to swap1, then it is in the second range. However the generated swap2 in the second range will be between swap1 and num_gene-2, whereas we need it to be between swap1+1 and num_gene-1, so I add one to swap2, adjusting the range to the needed range. It would be equivalent to set swap2 to num_gene-1 in that case, effectively remapping the first value to the last, but an increment was more intuitive to me. > As for the second part, I had to look up Fisher-Yates ;-) but after > having read Wikipedia's entry about it I think this is a good change. > The code's shorter and more efficient, and it should mathematically > provide an equally-unbiased initial shuffle. It could do with a > better comment, and I'd be inclined to handle the first element > outside the loop rather than uselessly computing geqo_randint(0,0), I could do that. It will make the code slightly harder to read. I wonder if it would be worth having geqo_randint() handle the special case instead. > Having said that, though, I believe that it's also probably a > *different* initial shuffle, which may well mean that GEQO gives > different plans in some cases. Yes, I would expect it to be different in the general case. I think my commit message noted that, but perhaps it could be more explicit. > That doesn't bother me as long as we only make the change in HEAD, but > does anyone want to complain? I don't see any need to backport either of these patches. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Wed, Nov 04, 2015 at 12:51:52PM -0500, Tom Lane wrote: > As for the second part, I had to look up Fisher-Yates ;-) but after > having read Wikipedia's entry about it I think this is a good change. > The code's shorter and more efficient, and it should mathematically > provide an equally-unbiased initial shuffle. It could do with a > better comment, and I'd be inclined to handle the first element > outside the loop rather than uselessly computing geqo_randint(0,0), > but those are trivial changes. I see you committed a modified version of my patch in commit 59464bd6f928ad0da30502cbe9b54baec9ca2c69. You changed the tour[0] to be hardcoded to 1, but it should be any of the possible gene numbers from 0 to remainder. If you want to pull the geqo_randint(0,0) out of the loop, it would be the last element, not the first (i.e. where remainder == 0). We might be able to just skip the last swap, and the loop could be for (i=0; i < num_gene-1; i++) { but I'd need to re-read the details of the Fisher-Yates algorithm to be sure. It may be that the last swap needs to happen for the shuffle to be fully random. In any case, tour[0] certainly shouldn't be hardcoded to 1. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 11:19:00AM -0500, Tom Lane wrote: > Nathan Wagner writes: > > I see you committed a modified version of my patch in commit > > 59464bd6f928ad0da30502cbe9b54baec9ca2c69. > > > You changed the tour[0] to be hardcoded to 1, but it should be any > > of the possible gene numbers from 0 to remainder. > > How so? The intent is to replace the first iteration of the > Fisher-Yates loop, not the old loop. That iteration will certainly > end by assigning 1 to tour[0], because it must choose j = i = 0. You are correct. I got confused between reading the original code, my patch, and your modified patch. I wonder why the algorithm bothers with the first iteration at all, in the case of an initialized array, it would just swap the first element with itself. I must be missing something. I'll need to do some more reading. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 11:45:38AM -0500, Tom Lane wrote: > However, really the whole argument is moot, because I notice that > geqo_mutation() is only called in the "#ifdef CX" code path, which > we don't use. I suppose someone could turn it on via a compiler define. > So there's little point in improving it. No, probably not. > (There's a fair amount of dead code in /geqo/, which I've never had > the energy to clean up, but maybe we should do that sometime. It > seems unlikely that anyone will ever be interested in experimenting > with the ifdef'ed-out code paths.) I also note that in src/backend/optimizer/path/allpaths.c there is a join_search_hook variable apparently intended for plugins (extensions?) to be able to control the search path optimizer. And the geqo code is AFAICT turned off by default anyway, so none of the code is used in probably the vast majority of systems, with standard_join_search() being called instead. Would it be worth either of removing at least the non-ERX portions of the geqo code, or removing the geqo code entirely (presumably with a deprecation cycle) and moving it to an extension? If there's any interest, I can work up a patch for either or both. There is only one test in the regression suite that turns on geqo that I could find. It's labeled "check for failure to generate a plan with multiple degenerate IN clauses" in join.sql. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 02:16:41PM -0500, Tom Lane wrote: > Uh, what? It's not by any means turned off by default. > > postgres=# select name,setting from pg_settings where name like '%geqo%'; > name | setting > -+- > geqo| on [snip] My day to make a fool of myself in public I guess. You're right of course. I can only plead distraction by having too many projects in mind at once and not focusing properly. Sorry for taking up your time on things I should have checked better. > I'm inclined to think that removing all the ifdefd-out-by-default logic > would be a fine thing to do, though. I'll work up a patch. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
> I actually don't know much about the I/O subsystem, but, no, WAL is > not separated from data. I believe $PGDATA is on a SAN, but I don't > know anything about its characteristics. The computer is here: http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm $PGDATA is on a 5 disk SATA software raid 5. Is there anything else that would be useful to know? ( Sorry, this isn't a subject that I'm very familiar with ) -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane wrote: > Alexander Korotkov writes: >> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch wrote: >>> I've attached a new version that includes the UINT64_FMT fix, some edits of >>> your newest comments, and a rerun of pgindent on the new files. I see no >>> other issues precluding commit, so I am marking the patch Ready for >>> Committer. >>> If I made any of the comments worse, please post another update. > >> Changes looks reasonable for me. Thanks! > > I am starting to look at this patch now. I'm wondering exactly why the > decision was made to continue storing btree-style statistics for arrays, > in addition to the new stuff. If I understand your suggestion, queries of the form SELECT * FROM rel WHERE ARRAY[ 1,2,3,4 ] <= x AND x <= ARRAY[ 1, 2, 3, 1000]; would no longer use an index. Is that correct? Are you suggesting removing MCV's in lieu of MCE's as well? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane wrote: > Nathan Boley writes: >> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane wrote: >>> I am starting to look at this patch now. I'm wondering exactly why the >>> decision was made to continue storing btree-style statistics for arrays, >>> in addition to the new stuff. > >> If I understand you're suggestion, queries of the form > >> SELECT * FROM rel >> WHERE ARRAY[ 1,2,3,4 ] <= x >> AND x <=ARRAY[ 1, 2, 3, 1000]; > >> would no longer use an index. Is that correct? > > No, just that we'd no longer have statistics relevant to that, and would > have to fall back on default selectivity assumptions. Which, currently, would mean queries of that form would typically use a table scan, right? > Do you think that > such applications are so common as to justify bloating pg_statistic for > everybody that uses arrays? I have no idea, but it seems like it will be a substantial regression for the people that are. What about MCV's? Will those be removed as well? Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
>> What about MCV's? Will those be removed as well? > > Sure. Those seem even less useful. Ya, this will destroy the performance of several queries without some heavy tweaking. Maybe this is bad design, but I've gotten in the habit of storing sequences as arrays and I commonly join on them. I looked through my code this morning, and I only have one 'range' query ( of the form described up-thread ), but there are tons of the form SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array; I can provide some examples if that would make my argument more compelling. Sorry to be difficult, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
[ sorry Tom, reply all this time... ] > What do you mean by "storing sequences as arrays"? So, a simple example is, for transcripts ( sequences of DNA that are turned into proteins ), we store each of the connected components as an array of the form: exon_type in [1,6] splice_type = [1,3] and then the array elements are [ exon_type, splice_type, exon_type ] ~ 99% of the elements are of the form [ [1,3], 1, [1,3] ], so I almost always get a hash or merge join ( correctly ) but for the rare junction types ( which are usually more interesting as well ) I correctly get nest loops with an index scan. > Can you demonstrate > that the existing stats are relevant at all to the query you're worried > about? Well, if we didn't have mcv's and just relied on ndistinct to estimate the '=' selectivities, either my low selectivity quals would use the index, or my high selectivity quals would use a table scan, either of which would be wrong. I guess I could wipe out the stats and get some real numbers tonight, but I can't see how the planner would be able to distinguish *without* mcv's... Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
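For reference, the per-column statistics being discussed here are the ones visible in pg_stats; a query along these lines (with made-up table and column names standing in for the real schema) shows the MCV entries the planner would rely on for such equality joins:

select n_distinct, most_common_vals, most_common_freqs
  from pg_stats
 where tablename = 'transcripts'
   and attname = 'junctions';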
Re: [HACKERS] proposal : cross-column stats
>> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > > Well, I'm not real familiar with contingency tables, but it seems like > you could end up needing to store a huge amount of data to get any > benefit out of it, in some cases. For example, in the United States, > there are over 40,000 postal codes, and some even larger number of > city names, and doesn't the number of entries go as O(m*n)? Not to mention that the model parameters will be impossible to estimate well. ( I've only scanned the thread, so sorry if I'm repeating something that's already been said ) My intuition is that storing the covariance structure will be unworkable both technically and statistically for all but the simplest of cases. That being said, I don't think the problem is a lack of ways to parameterize the covariance structure ( there are several papers on multi-dim histogram estimators, at least one of which talks about databases explicitly, not to mention approaches like CART[1] ), but a complete lack of infrastructure to do anything with the estimates. So keep working on it ;-) ( but try to make the framework general enough to allow better estimators ). I wonder if a good first step might be to focus on the AND and OR operators since they seem like the most common cases and union and intersection commute ( although it's important to remember that the estimate variances do NOT ). That is, what about estimating the selectivity of the condition WHERE X=A and Y=B by f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) + x_2*selectivity(X = A)*selectivity( Y = B ) + x_3 where x_{1,2,3} are parameters to be estimated from the data. Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple years back ). Best, Nathan [1] http://en.wikipedia.org/wiki/Classification_and_regression_tree [2] http://en.wikipedia.org/wiki/Copula_%28statistics%29 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
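A minimal sketch (Python, illustrative only) of the parameterized combination above; the helper names and the choice of least squares for fitting x_1, x_2, x_3 are assumptions, not something proposed in the thread:

    import numpy as np

    def fit_combination(sel_a, sel_b, sel_joint):
        # Fit f(A,B) = x1*(selA + selB) + x2*(selA*selB) + x3 to observed joint
        # selectivities by ordinary least squares.
        sel_a, sel_b, sel_joint = map(np.asarray, (sel_a, sel_b, sel_joint))
        design = np.column_stack([sel_a + sel_b, sel_a * sel_b, np.ones_like(sel_a)])
        coeffs, *_ = np.linalg.lstsq(design, sel_joint, rcond=None)
        return coeffs                     # x1, x2, x3

    def estimate_joint_selectivity(coeffs, sel_a, sel_b):
        x1, x2, x3 = coeffs
        return x1 * (sel_a + sel_b) + x2 * (sel_a * sel_b) + x3

How the training selectivities would be collected, and whether the fit should be per column pair or global, is exactly the missing-infrastructure question raised above.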
Re: [HACKERS] Why don't we accept exponential format for integers?
>>>> print int(1e+01) > 10 >>>> > That isn't building an integer: it is creating a float and casting to an int. try: int( 1e100 ) Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
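A quick check of the same point, in Python 3 syntax (illustrative only): 10 survives the round trip through float, but 10**100 does not, because the double has already rounded it.

    print(int(1e+01) == 10)         # True  -- 10 is exactly representable as a float
    print(int(1e100) == 10**100)    # False -- 1e100 is a float, rounded before the cast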
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra wrote: > On 19.1.2011 20:25, Robert Haas wrote: >> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra wrote: >>> Yes, I was aware of this problem (amount of memory consumed with lots of >>> updates), and I kind of hoped someone will come up with a reasonable >>> solution. >> >> As far as I can see, periodically sampling some or all of the table is >> really the only thing that has a chance of working OK. You could try >> to track changes only when they're not too big, but I think that's >> still going to have nasty performance consequences. > > Yes, sampling all the table is the only way to get reliable ndistinct > estimates. IMHO continuing to focus on worst case results is a dead end. Yes, for some distributions, ndistinct is very hard to estimate well. However, let us not forget that the current ndistinct estimator is very bad but the postgres planner is, on the whole, very good. As far as I can see this is due to two facts: 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) 2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executor can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table, we still usually need to scan the entire table. Again, in my *very* humble opinion, if the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, find that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But: ( and the literature seems to support me ) pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to "estimate" ndistinct - if we go that far why don't we just store the full distribution and be done with it? Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] estimating # of distinct values
>> If you think about it, it's a bit ridiculous to look at the whole table >> *just* to "estimate" ndistinct - if we go that far why dont we just >> store the full distribution and be done with it? > > - the best you could do is to average the > individual probabilities which gives ... well, 1/ndistinct. > That is certainly *not* the best you could do in every case. The mean is only the best estimate in L2, which is definitely not the case here. Consider a table with 10K values, 9,990 of which are 1 and the rest of which are 2, 3, ..., 10, versus a table that has the same 10 distinct values evenly distributed. For a simple equality query, in the first case, a bitmap scan might be best. In the second case, a sequential scan would always be best. This is precisely the point I was trying to make in my email: the loss function is very complicated. Improving the ndistinct estimator could significantly improve the estimates of ndistinct ( in the quadratic loss sense ) while only marginally improving the plans. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
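A toy illustration (Python, not from the thread) of the example above: two columns with the same ndistinct but very different shapes. Averaging the per-value frequencies gives 1/ndistinct for both, so that single number cannot separate the bitmap-scan case from the always-seq-scan case:

    from collections import Counter

    skewed  = [1] * 9991 + list(range(2, 11))    # 10K rows, one value dominates, 10 distinct
    uniform = list(range(1, 11)) * 1000          # 10K rows, 10 values evenly spread

    def avg_value_frequency(values):
        counts = Counter(values)
        n = float(len(values))
        # the per-value frequencies always sum to 1, so this is just 1/ndistinct
        return sum(c / n for c in counts.values()) / len(counts)

    print(avg_value_frequency(skewed), avg_value_frequency(uniform))   # 0.1 and 0.1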
Re: [HACKERS] estimating # of distinct values
> > Not true. You're missing the goal of this effort - to get ndistinct > estimate for combination of multiple columns. Which is usually > pathological in case of dependent columns. > Again, don't think about a single column (although even in that case > there are known fail cases). Think about multiple columns and the number > of distinct combinations. With attribute value independence assumption, > this is usually significantly underestimated. And that's a problem as it > often leads to an index scan instead of sequential scan (or something > like that). My point was that, in the case of single columns, we've done pretty well despite the poor ndistinct estimates. I suspect the same will be true in the case of multiple columns; good heuristics will trump minimax estimators. >> ( as I've advocated for before ) this means parameterizing the >> distribution of values ( ie, find that the values are roughly uniform >> ), maybe this means estimating the error of our statistics and using >> the most robust rather than the best plan, or maybe it means something >> totally different. But: ( and the literature seems to support me ) > > Which literature? I've read an awful lot of papers on this topic lately, > and I don't remember anything like that. If there's an interesting > paper, I'd like to read it. Yes, all the papers state estimating a > ndistinct is a particularly tricky task, but that's somehow expected? It is definitely expected - non-parametric minimax results are always *very* weak. The papers that you've cited seem to confirm this; bounding ndistinct estimation error is tricky in the general case ( even with very simple loss functions, which we do not have ). The same is true for other non-parametric methods. Think about kernel density estimation: how many data points do you need to estimate the density of a normal distribution well? What about if you *know* that the data is normal ( or even that it comes from a big family? ). > No, I'm not trying to do this just to improve the ndistinct estimate. > The goal is to improve the cardinality estimate in case of dependent > columns, and one of the papers depends on good ndistinct estimates. > > And actually it does not depend on ndistinct for the columns only, it > depends on ndistinct estimates for the combination of columns. So > improving the ndistinct estimates for columns is just a necessary first > step (and only if it works reasonably well, we can do the next step). I think that any approach which depends on precise estimates of ndistinct is not practical. I am very happy that you've spent so much time on this, and I'm sorry if my previous email came off as combative. My point was only that simple heuristics have served us well in the past and, before we go to the effort of new, complicated schemes, we should see how well similar heuristics work in the multiple column case. I am worried that if the initial plan is too complicated then nothing will happen and, even if something does happen, it will be tough to get it committed ( check the archives for cross column stat threads - there are a lot ). Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug wrote: > On Jan 20, 2011, at 02:41, Nathan Boley wrote: >>>> If you think about it, it's a bit ridiculous to look at the whole table >>>> *just* to "estimate" ndistinct - if we go that far why don't we just >>>> store the full distribution and be done with it? >>> >>> - the best you could do is to average the >>> individual probabilities which gives ... well, 1/ndistinct. >> >> That is certainly *not* the best you could do in every case. The mean >> is only the best estimate in L2, which is definitely not the case >> here. > > No, not in every case. But on average it comes out best, no? In the sense of minimizing the average plan cost over all values? Definitely not. Consider the trivial case where a bitmap scan is the cost of a sequential scan + epsilon. > >> Consider a table with 10K values, 9,990 of which are 1 and the rest of >> which are 2, 3, ..., 10, versus a table that has the same 10 distinct >> values evenly distributed. For a simple equality query, in the first >> case, a bitmap scan might be best. In the second case, a sequential >> scan would always be best. > > True. But selectivity estimates alone don't show the difference there. Yet the full distribution would - supporting my argument that even in cases where we don't know a specific value, the full distribution is more informative. > > Also, in my personal experience this isn't really the area we've got > problems now. The problem cases for me always were queries with a rather > large number of joins (7 to 10 or so) plus rather complex search > conditions. The join order, not the access strategy, then becomes the > deciding factor in plan performance. And the join order *is* driven purely > off the selectivity estimates, unlike the access strategy which even today > takes other factors into account (like clusteredness, I believe) > I think I'm losing you. Why would ndistinct be complete sufficient for estimating the optimal join order? >> This is precisely the point I was trying to make in my email: the loss >> function is very complicated. Improving the ndistinct estimator could >> significantly improve the estimates of ndistinct ( in the quadratic >> loss sense ) while only marginally improving the plans. > > > The point of this exercise isn't really to improve the ndistinct estimates > for single columns. Rather, we'd like to get ndistinct estimates for > groups of columns because that allows us to use the uniform bayesian > approach to multi-column selectivity estimation that Tomas found literature > on. Which on the whole seems like it *does* have a chance of greatly > improving the plans in some cases. We could, of course, estimate multi-column > ndistinct the same way we estimate single-column ndistincts, but quite a few > people raised concerns that this wouldn't work due to the large error our > current ndistinct estimates have. Sure. But estimating multi-column ndistinct is surely not the only way to approach this problem. The comment I made, which you objected to, was that it's silly to look at the whole table and then throw away all information *except* ndistinct. If we really think that looking at the whole table is the best approach, then shouldn't we be able to get better statistics? Is this really an open question? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ossp-uuid Contrib Patch
On Mon, Sep 10, 2012 at 04:23:00PM -0700, David E. Wheeler wrote: > Well given that OSSP seems to be abandon ware (no activity since July > 2008), it might be time to dump it in favor of something else. Perhaps this would be a good time to bring up my uuid code again. I've got a module for uuid support for postgres. Originally written for 8.4 but I've kept it updated. It's available at https://pd.if.org/git/uuid.git and for anonymous git cloning git clone https://pd.f.org/git/uuid The install directions are perhaps a bit lacking, but after the git checkout: cd uuid make cd postgres make sudo make install psql -c 'create extension pduuid' Installing to different schemas is supported also. The postgres/README file describes the supported features. Briefly, my code supports uuid generation of version 1, 3, 4, and 5 uuids, casting to and from numeric, bit string, and bytea values, and provides a uuid_recent() function to get the most recent uuid generated in that backend. The code is entirely in the public domain. At one point, it compiled on Windows and MacOS, but I haven't tested that recently. I don't think I have changed anything that would affect compiling on those platforms. IIRC, the only platform dependent code is in figuring out a local mac address and a sub-second timestamp. Comments or suggestions welcome. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Measure Theoretic Data Types in Postgresql
> However, by realizing that the bounds on the ranges have a linear ordering > one can speed this up to O(m) using windowing functions on common table > expressions. > > So what I am proposing is formalizing this optimization into a class of data > types, that will hide the implementation details. Could this not also be handled by extending merge join to work with an overlap operator? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython3
> On the basis of all of the foregoing, I don't think we can consider > this patch further for this CommitFest and will update > commitfest.postgresql.org accordingly. FWIW, I am very excited about this patch and would be happy to review it but have been very busy over the past month. If I can promise a review by Thursday morning could we keep it active? Hopefully, at the very least, I can provide some useful feedback and spawn some community interest. I am worried that there is a bit of a chicken and an egg problem with this patch. I code nearly exclusively in python and C, but I have often found pl/python to be very unwieldy. For this reason I often use pl/perl or pl/pgsql for problems that, outside of postgres, I would always use python. From the documentation, this patch seems like an enormous step in the right direction. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython3
> I think it would be great for you to review it... I doubt that will > cause it to get committed for 9.0, but my doubt is no reason for you > to hold off reviewing it. I assumed so, but the pretense of a chance will probably help to motivate me :-) I'll have something by Thursday, and then 'Returned with Feedback' will at least be factual. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG 9.0 and standard_conforming_strings
On Wed, 03 Feb 2010 14:41:13 -0500, Tom Lane wrote: > Indeed it is, which is one of the reasons to be cautious with changing > it. We've been telling people to move away from \' for a long time, > but actually flipping the switch that will make their apps insecure > is not something to do on the spur of the moment. AFAICT the switch was added in 8.2, and mentioned in the release notes dated 2006-12-05. The documentation for 8.2 says "The default is currently off, causing PostgreSQL to have its historical behavior of treating backslashes as escape characters. The default will change to on in a future release to improve compatibility with the standard." So people have had three years of warning, which I would hardly characterize as "spur of the moment". If you want the old behavior, change the setting to off. I think that a major release point is exactly the right time to do this; doing it at a minor release number would be much less reasonable. A question for those opposed to doing it now: how exactly do you propose to warn people, other than with the notice that it will be changed in a future release, which has already been there for the last three years? -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extended operator classes vs. type interfaces
> The advantage of specifying a + and a - in the type interface is that > the unit definition can then be specified as part of the type > declaration itself. So you can do: > > CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s'); > CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m'); > > All of the stuff about defining + and - is hidden from the user - it's > part of the type interface, which is pre-created. > The disadvantage is that it does not permit irregularly spaced units. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
> Robert explained why having more MCVs might be useful because we use > the frequency of the least common MCV as an upper bound on the > frequency of any value in the MCV. Where is that being used? The only non-MCV frequency estimate that I recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously changing the number of mcv's affects this by lowering n_ndistinct_rows, but it's always a pretty coarse estimate. > Binding the length of the MCV list to the size of the histogram is > arbitrary but so would any other value Wouldn't the best approach be to stop adding MCV's/histogram buckets when adding new ones doesn't decrease your prediction error 'substantially'? One very hacky threshold heuristic is to stop adding MCV's when a simple equality select ( SELECT col FROM table WHERE col = VALUE ) would switch the plan from an index to a sequential scan ( or vice versa, although with the current code this would never happen ). ie, if the non_mcv frequency estimate is 0.1% ( producing an index scan ), but adding the MCV gives us an estimate of 5% ( probably producing a seq scan ) then add that value as an mcv. More sophisticated variations might also consider plan changes to very suboptimal joins; even more sophisticated would be to stop when the MAX( (current plan - optimal plan) / optimal plan ) was below some threshold, say 20%, over a bunch of recently executed queries. A similar approach would work for histogram bins, except the queries of interest are inequality rather than equality selections. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
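A rough sketch (Python, purely illustrative; the cutoff value and helper names are made up) of the "stop when the plan would not change" heuristic described above:

    def choose_mcvs(value_freqs, ndistinct, seq_scan_cutoff=0.02):
        # value_freqs: (value, frequency) pairs sorted by descending frequency.
        # seq_scan_cutoff stands in for whatever selectivity flips the planner
        # from an index scan to a sequential scan.
        mcvs = []
        for value, freq in value_freqs:
            remaining = max(ndistinct - len(mcvs), 1)
            non_mcv_estimate = (1.0 - sum(f for _, f in mcvs)) / remaining
            # keep the value only if the default estimate would land the planner
            # on the wrong side of the index/seq-scan cutoff
            if (freq >= seq_scan_cutoff) != (non_mcv_estimate >= seq_scan_cutoff):
                mcvs.append((value, freq))
            else:
                break
        return mcvs

The more sophisticated variations mentioned above would replace the single cutoff with a comparison of estimated plan costs over a set of recently executed queries.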
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
> That one's used, too, but the other is used as an upper bound. > n_distinct tends to come out too small on large tables, so that > formula is prone to overestimation. Actually, both formulas are prone > to overestimation. > Right - thanks. > When this happens depends on the values of a whole boat-load of GUCs... Well, then doesn't the 'proper' number of buckets depend on a whole boat-load of GUCs... -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range Types: empty ranges
FWIW, a very informal survey of probabilists didn't yield any reason for trying to put an order on the empty set ( unless the metric was cardinality or other equivalence relation ). I think the problem here is that the idea of union and intersection forming a ring over sets is being conflated with the order relation. Clearly, combining the two notions can be inconsistent. However... >> > A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) >> >> But the basic range type isn't even closed under UNION. > > An excellent point. Allow me to move the target a little: > > WHERE A && B AND A && C > and: > WHERE A && (B INTERSECT C) > > That seems like a logically sound transformation, but if (B INTERSECT C) > is empty, it relies on the empty range for those two to be equivalent. > > Now, I agree that lack of closure on UNION exhibits many of the problems > that I am pointing out related to forbidding empty ranges. However, I'm > not sure if that means we should give up on either. This seems potentially very useful, because we can transform WHERE A && B AND A && C from a bitmap scan into WHERE A && (B INTERSECT C), a simple index scan. In the union case ( even if we had a type that supported disjoint intervals), I doubt we would ever make that transformation because the index will probably still be over connected intervals. So, +1 for keeping it how it is ( but maybe with a better error message ). Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: cross column correlation ...
> Personally, I think the first thing we ought to do is add a real, bona > fide planner hint to override the selectivity calculation manually, > maybe something like this: > > WHERE (x < 5 AND y = 1) SELECTIVITY (0.1); > If you're going to go that far, why not just collect statistics on that specific predicate? ie, ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1); Then it won't fall subject to all of the pitfalls that Tom outlines below. Selectivities are easy to estimate if we know the predicate. They only become hard when they have to work for every possible predicate. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539 I glanced through the code and played around with the feature and, in general, it looks pretty good. But I have some comments/questions. Design: First, it makes me uncomfortable that you are using the MCV and histogram slot kinds in a way that is very different from other data types. I realize that tsvector uses MCV in the same way that you do but: 1) I don't like that very much either. 2) TS vector is different in that equality ( in the btree sense ) doesn't make sense, whereas it does for arrays. Using the histogram slot for the array lengths is also very surprising to me. Why not just use a new STA_KIND? It's not like we are running out of room, and this will be the second 'container' type that splits the container and stores stats about the elements. Second, I'm fairly unclear about what the actual underlying statistical method is and what assumptions it makes. Before I can really review the method, I will probably need to see some documentation, as I say below. Do you have any tests/simulations that explore the estimation procedure that I can look at? When will it fail? In what cases does it improve estimation? Documentation: Seems a bit lacking - especially if you keep the non-standard usage of mcv/histograms. Also, it would be really nice to see some update to the row-estimation-examples page ( in chapter 55 ). The code comments are good in general, but there are not enough high level comments. It would be really nice to have a comment that gives an overview of what each of the following functions is supposed to do: mcelem_array_selec mcelem_array_contain_overlap_selec and especially mcelem_array_contained_selec Also, in compute_array_stats you reference compute_tsvector_stats for the algorithm, but I think that it would be smarter to either copy the relevant portions of the comment over or to reference a published document. If compute_tsvector_stats changes, I doubt that anyone will remember to update the comment. Finally, it would be really nice to see, explicitly, and in mathematical notation 1) The data that is being collected and summarized. ( the statistic ) 2) The estimate being derived from the statistic ( ie the selectivity ) i) Any assumptions being made ( ie independence of elements within an array ) For instance, the only note I could find that actually addressed the estimator and assumptions underneath it was +* mult * dist[i] / mcelem_dist[i] gives us probability probability +* of qual matching from assumption of independent element occurence +* with condition that length = i. and that's pretty cryptic. That should be expanded upon and placed more prominently. A couple nit picks: 1) In calc_distr you go to some lengths to avoid round off errors. Since it is certainly just the order of the estimate that matters, why not just perform the calculation in log space? 2) compute_array_stats is really long. Could you break it up? ( I know that the other analyze functions are the same way, but why continue in that vein? ) Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
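A tiny illustration (Python, not from the patch) of the log-space suggestion in nit-pick 1: a long product of small probabilities underflows, while the sum of their logs stays usable for comparing estimates.

    import math

    probs = [1e-5] * 80

    direct = 1.0
    for p in probs:
        direct *= p                               # underflows to 0.0 partway through

    log_space = sum(math.log(p) for p in probs)   # about -921, still meaningful

    print(direct, log_space)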
Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
> Actually, it is not the MCV slot that is used but MCELEM. It is a different slot. The pg_stats > view maps both into the same fields. Yes, you're right. I am sorry about the error. > Surely, non-standard use of histogram slot > should be avoided. Agreed. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: collect frequency statistics for arrays
> Looking now, I see that Alexander wasn't Cc'd on the review, so it's > possible he missed the message? > We've corresponded off list and have discussed my review at some length. Alex submitted an updated patch on Sep 22 to me personally ( although not to the list? Alex? ), with the promise of a new version with improved comments. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
> Rebased with head. FYI, I've added myself as the reviewer for the current commitfest. Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql line number reporting from stdin
On Sat, 26 Nov 2011 22:36:15 +0200, Peter Eisentraut wrote: There is a long-standing oddity in psql that running psql -f foo.sql returns error messages with file name and line number, like psql:foo.sql:1: ERROR: syntax error at or near "foo" but running psql < foo.sql does not. I suggest we change the latter to print psql:<stdin>:1: ERROR: syntax error at or near "foo" Other examples for the use of the spelling "<stdin>" in this context include gcc and slonik. Error messages printed in interactive mode will not be affected, of course. Patch attached. No issue with the change itself, but the docs claim that "the variant using the shell's input redirection is (in theory) guaranteed to yield exactly the same output you would have received had you entered everything by hand." -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: collect frequency statistics for arrays
>> Version of patch with few more comments and some fixes. > > Where are we on the idea of better statistics for arrays? I need to finish reviewing the patch: I'll try to send in something tomorrow night. So far it looks good. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] storage of sensor data with Fourier transforms
Hello! I have a potential situation where I will have a lot of sensor data coming in very often. (every second or so) The sensor data is from physics type measurements, and will normally follow a slowly changing pattern with sinusoidal disturbances. The overall shape of the data is more important than catching high frequency disturbances. I had the idea of taking the Fourier transform of the waveform and storing the waveform internally that way to reduce storage requirements. When I get some free cycles, I may try doing this. I would want it to be done in the database in such a way that the user can still join to a table using this internal storage scheme. Why am I mailing this list? I'd like to ask if anyone has heard of someone doing anything like this. I did a small search of the lists and internet but didn't come up with anything. I just don't want to re-invent the wheel. Thanks for your time, Nathan -- _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Got Mole problems? Call Avogadro at 6.02 x 10^23.
Re: [HACKERS] storage of sensor data with Fourier transforms
On 5/5/07, Tom Lane <[EMAIL PROTECTED]> wrote: "Nathan Buchanan" <[EMAIL PROTECTED]> writes: > I had the idea of taking the Fourier transform of the waveform and storing > the waveform internally that way to reduce storage requirements. Aside from what Steve said: The Fourier transform in itself doesn't reduce data size --- it's N points in, N points out. If you want to reduce storage requirements you have to resort to lossy compression, ie, deliberately throwing away some data. The transformed data might be more suitable for doing that (eg you can selectively discard high-frequency components) but do you really want to? Usually the point of storing measurements is so you can do unspecified analysis on them later. Applying lossy compression will restrict what you can (meaningfully) do later on. regards, tom lane Thanks for the replies. It seems I need to examine my plan more closely. Nathan
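For what it's worth, a rough numpy sketch (illustrative only) of the lossy scheme described in the reply: keep only the lowest-frequency coefficients, accept the loss of high-frequency detail, and reconstruct an approximation when the waveform is read back.

    import numpy as np

    def compress(samples, keep):
        return np.fft.rfft(samples)[:keep]          # drop everything above the first `keep` frequencies

    def reconstruct(kept_coeffs, n_samples):
        coeffs = np.zeros(n_samples // 2 + 1, dtype=complex)
        coeffs[:len(kept_coeffs)] = kept_coeffs
        return np.fft.irfft(coeffs, n_samples)

    t = np.linspace(0.0, 10.0, 1000)
    signal = np.sin(2 * np.pi * 0.5 * t) + 0.05 * np.random.randn(1000)
    approx = reconstruct(compress(signal, keep=20), 1000)   # 20 coefficients instead of 1000 samples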
Re: [HACKERS] Not ready for 8.3
On 5/15/07, Tatsuo Ishii <[EMAIL PROTECTED]> wrote: > Stefan Kaltenbrunner wrote: > >> They are not stable. The items should point to the archives, which are > >> supposedly more stable. (I had already fixed one item in PatchStatus > >> this morning). Really it would be much nicer to have links using the > >> Message-Id but I doubt that's at all doable. > > > > hrm - I see so is there a particular reason for that behaviour ? > > They're stable until Bruce removes something from the queue. When > something is removed, it's renumbered. > > It's how mhonarc works. It's the same with the archives - if we delete a > mail, they get renumbered. So we never should delete, we should just > blank out, but it has happened a couple of times. As I have proposed many times, why don't we add a message number to each subject line? For example like this: [HACKERS: 12345] Re: Not ready for 8.3 This way, we could always obtain a stable (logical) pointer, without relying on any particular archival infrastructure. This sounds like a good idea to me - though I'm just a lurker on the list. Nathan
Re: [HACKERS] Range types
>> Because intervals (mathematical not SQL) can be open or closed at each >> end point we need to know what the next an previous value would be at >> the specified granularity. And while you can do some operations without >> knowing this, there are many you can't. For instance you could not tell >> whether two [] or () ranges were adjacent, or be able to coalesce an >> array of ranges. > > This statement seems to me to demonstrate that you don't actually > understand the concept of open and closed ranges. It has nothing > whatsoever to do with assuming that the data type is discrete; > these concepts are perfectly well defined for the reals, for example. > What it is about is whether the inclusion conditions are "< bound" > or "<= bound". IMHO the first question is whether, for integers, [1,2] UNION [3,5] should be equal to [1,5]. In math this is certainly true, and defining 'next' seems like a reasonable way to establish this in postgres. The next question is whether, for floats, [1,3-FLT_EPSILON] UNION [3,5] should be [1,5]. And the next question is whether, for numeric(6,2), [1,2.99] UNION [3,5] should be [1,5]. FWIW, I would answer yes, no, yes to those three questions. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range types
> Right, I don't think strings are any more or less countable than integers. > (and yes, it's a bit OT). Well, if I remember my algebra, I think the issue is that integers are locally finite whereas strings are not ( assuming standard orderings, of course ). -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IN, BETWEEN, spec compliance, and odd operator names
> Yes, but always in relation to operator classes, so from BTrees opclass or > such, which I refered to as "the context of index searches", as I don't > really see any theorical need for opclass if it's not for indexing. IIRC analyze uses the btree op class to build the selectivity histogram. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
>>> Right now our >>> "histogram" values are really quantiles; the statistics_target T for a >>> column determines a number of quantiles we'll keep track of, and we >>> grab values from into an ordered list L so that approximately 1/T of >>> the entries in that column fall between values L[n] and L[n+1]. I'm >>> thinking that multicolumn statistics would instead divide the range of >>> each column up into T equally sized segments, >> >> Why would you not use the same histogram bin bounds derived for the >> scalar stats (along each axis of the matrix, of course)? This seems to >> me to be arbitrarily replacing something proven to work with something >> not proven. Also, the above forces you to invent a concept of "equally >> sized" ranges, which is going to be pretty bogus for a lot of datatypes. > > Because I'm trying to picture geometrically how this might work for > the two-column case, and hoping to extend that to more dimensions, and > am finding that picturing a quantile-based system like the one we have > now in multiple dimensions is difficult. I believe those are the same > difficulties Gregory Stark mentioned having in his first post in this > thread. But of course that's an excellent point, that what we do now > is proven. I'm not sure which problem will be harder to solve -- the > weird geometry or the "equally sized ranges" for data types where that > makes no sense. > Look at copulas. They are a completely general method of describing the dependence between two marginal distributions. It seems silly to rewrite the stats table in terms of joint distributions when we'll still need the marginals anyway. Also, it might be easier to think of the dimension reduction problem in that form. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
> I'm still working my way around the math, but copulas sound better > than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics against one another. Feel free to mail me off list if you have any math questions. I've previously thought that, at least in the 2D case, we could use image compression algorithms to compress the copula, but recently I've realized that this is a change point problem. In terms of compression, we want to decompose the copula into regions that are as homogeneous as possible. I'm not familiar with change point problems in multiple dimensions, but I'll try and ask someone who is, probably late next week. If you decide to go the copula route, I'd be happy to write the decomposition algorithm - or at least work on the theory. Finally, a couple points that I hadn't seen mentioned earlier that should probably be considered: 1) NULL's need to be treated specially - I suspect the assumption of NULL independence is worse than other independence assumptions. Maybe dealing with NULL dependence could be a half step towards full dependence calculations? 2) Do we want to fold the MCV's into the dependence histogram? That will cause problems in our copula approach but I'd hate to have to keep an N^d histogram dependence relation in addition to the copula. 3) For equality selectivity estimates, I believe the assumption that the ndistinct value distribution is uniform in the histogram will become worse as the dimension increases. I proposed keeping track of ndistinct per histogram bucket earlier in the marginal case, partially motivated by this exact scenario. Does that proposal make more sense in this case? If so we'd need to store two distributions - one of the counts and one of ndistinct. 4) How will this approach deal with histogram buckets that have scaling count sizes ( ie -0.4 )? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
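A toy sketch (Python, illustrative only) of the "plot of the order statistics against one another" view: rank each column, then bucket the rank pairs into a coarse grid. A roughly uniform grid means the columns look independent; the decomposition question above is how to summarize the non-uniform regions compactly.

    import numpy as np

    def empirical_copula_grid(x, y, bins=4):
        n = len(x)
        u = np.argsort(np.argsort(x)) / float(n)   # pseudo-observations (ranks scaled to [0, 1))
        v = np.argsort(np.argsort(y)) / float(n)
        grid, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
        return grid / n                             # cell frequencies; both marginals stay uniform

    x = np.random.normal(size=1000)
    y = x + 0.5 * np.random.normal(size=1000)       # two dependent "columns"
    print(empirical_copula_grid(x, y))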
Re: [HACKERS] Cross-column statistics revisited
> I still need to go through backend/utils/adt/selfuncs.c > to figure out exactly how we use the one-dimensional values. > Here's a page that helped me figure all this out. http://www.postgresql.org/docs/8.1/static/planner-stats-details.html >> >> 2) Do we want to fold the MCV's into the dependence histogram? That >> will cause problems in our copula approach but I'd hate to have to >> keep an N^d histogram dependence relation in addition to the copula. > > Yeah, if we're already trying to figure out how to compress copulae, > having also to compress MCV matrices seems painful and error-prone. > But I'm not sure why it would cause problems to keep them in the > copula -- is that just because we are most interested in the copula > modeling the parts of the distribution that are most sparsely > populated? > The problem I was thinking of is that we don't currently store the true marginal distribution. As it stands, histograms only include non mcv values. So we would either need to take the mcv's separately ( which would assume independence between mcv's and non-mcv values ) or store multiple histograms. >> 4) How will this approach deal with histogram buckets that have >> scaling count sizes ( ie -0.4 )? > > I'm not sure what you mean here. > That was more a note to myself, and should have been numbered 3.5. ndistinct estimates currently start to scale after a large enough row/ndistinct ratio. If we try to model ndistinct, we need to deal with scaling ndistinct counts somehow. But that's way off in the future, it was probably pointless to mention it. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Simple postgresql.conf wizard
Thanks for putting out pgtune - it's a sorely needed tool. I had a chance to look over the source code and have a few comments, mostly about Python-specific coding conventions. - The Windows-specific try block ( line 16 ) raises a ValueError vs ImportError on my Debian machine. Maybe it would be more appropriate to explicitly test platform.system()=="Windows"? - from ctypes import * ( 18 ) makes the block difficult to read and pollutes the namespace. - on line 45, the try block should probably catch exceptions derived from Exception ( to avoid catching SystemExit and KeyboardInterrupt errors ). ie, except Exception: return None. Also, printing out the exception in debug mode would probably be a good idea ( ie except Exception, e: print e\ return None ) - all classes ( 58, 135, 205 ) are 'old-style' classes. I don't see any reason to use classic classes ( unless Python 2.1 is a support goal? ) To make classes 'new style' classes ( http://www.python.org/doc/2.5.2/ref/node33.html ) they should inherit from object. i.e. class PGConfigLine(object): - The doc strings ( 59, 136, 206 ) don't follow standard conventions, described here http://www.python.org/dev/peps/pep-0257/. - Functions also support doc strings ( 342, 351, etc. ) - Tuple unpacking doesn't require the tuple boundaries ( 446 and others ). ie, options, args = ReadOptions() works. This is more of a style comment about the 'Java-ish interface' ( line 112 ), feel free to ignore it. Overloading __str__ and __repr__ is the Python way to return string representations of an object. ie, instead of toString use __str__ and then ( on 197 ) print l or print str(l) instead of print l.toString(). For the other methods ( getValue, getLineNumber, isSetting ), I'm assuming you didn't call the attributes directly because you didn't want them to be accidentally overwritten. Have you considered the use of properties ( http://www.python.org/download/releases/2.2.3/descrintro/#property )? Also, it would be clearer to mark attributes as private ( i.e. _name or __name, _readable, _lineNumber, _setsParameter ) if you don't want them to be accessed directly. Hope my comments are useful! Thanks again for writing this. -Nathan P.S. I'd be happy to officially review this if it gets to that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
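A small sketch of the style being suggested (the attribute and method names here are illustrative; the real PGConfigLine has different fields): a new-style class with __str__ instead of toString() and a read-only property instead of a getter.

    class PGConfigLine(object):
        def __init__(self, text, line_number):
            self._text = text
            self._line_number = line_number

        @property
        def line_number(self):
            return self._line_number

        def __str__(self):
            return "%d: %s" % (self._line_number, self._text)

    line = PGConfigLine("shared_buffers = 128MB", 115)
    print(line)              # instead of print line.toString()
    print(line.line_number)  # instead of line.getLineNumber()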
Re: [HACKERS] benchmarking the query planner
> One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could be used for rather accurate > cardinality estimation of "between" cases, while keeping number of entries > in "MCV" list small. > I suggested this earlier ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ). If there's interest, I'd be happy to clean up my patch and submit it for 8.5 -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
>> What is the specific difference between what you are talking about and >> what scalarineqsel already implements? > > Hmm... Nothing new. Feel sorry for bothering you. I did not realize > histograms are implemented. > Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket ( which we don't track ). IMHO, the whole idea of increasing mcv's seems like a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? That gives way better coverage of the distribution. If a value occurs too often to fit in a single bucket, we put it in an mcv slot anyway. *That* should be the mechanism by which the number of mcv's increases. I guess this is a bit off topic for the middle of a commit fest though. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
Thanks for the response. >> Well, ISTM there is a profound difference. For scalarineqsel we care >> about the total number of values in a bucket. For eqsel we care about >> the total number of *distinct* values in each bucket > > Really? > Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets. >> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use >> the limited storage in pg_statistic to try and estimate the >> selectivity for ranges of values rather than a single value? > > MCVs are useful for questions that are not related to ranges of values > --- an example of something highly useful we do with them is to try to > estimate the fraction of a column that satisfies a LIKE or regex > pattern. > Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used. > In fact, as I was just pointing out to Bruce, we can compute them and > do useful things with them for datatypes that don't have a defined sort > order and so the whole concept of "range" is meaningless. > Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size. > Now, if your point is that it'd be more flexible to not tie the MCV list > length to the histogram length, you're right. No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively. ie, there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc. Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily. I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
>> Isn't a selectivity estimate of x = v as ( the number of values in v's >> histogram bucket ) / ( number of distinct values in v's histogram >> bucket ) pretty rational? Thats currently what we do for non-mcv >> values, except that we look at ndistinct over the whole table instead >> of individual histogram buckets. > > But the histogram buckets are (meant to be) equal-population, so it > should come out the same either way. Why is it the same either way? The histogram buckets have equal population in the number of total values, not the number of distinct values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be [0,2) and [2,+oo), but they would have 2 and 5 distinct values respectively. > The removal of MCVs from the > population will knock any possible variance in ndistinct down to the > point where I seriously doubt that this could offer a win. Well, if the number of MCV's is large enough then it certainly won't matter. But isn't that pointlessly inefficient for most distributions? I provided some empirical evidence of this in an earlier post ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ) for normal distributions. > An even > bigger problem is that this requires estimation of ndistinct among > fractions of the population, which will be proportionally less accurate > than the overall estimate. Accurate ndistinct estimation is *hard*. > Agreed, but I'm not convinced that the overall error rate will go up for our current estimator. Ndistinct estimation is hardest for populations with a large ndistinct count variation. If the assumptions underlying this are founded ( namely that ndistinct distributions are related to the ordering relation in a predictable way ) then ndistinct estimation should be easier because the ndistinct distribution will be more uniform. But that's certainly something that should be tested on real data. >> now, if there are 100 histogram buckets then any values that occupy >> more than 1% of the table will be mcv's regardless - why force a value >> to be an mcv if it only occupies 0.1% of the table? > > Didn't you just contradict yourself? The cutoff would be 1% not 0.1%. No, I'm saying that if the number of histogram buckets is 100, then even if the mcv list is set to length 2, every value that occupies 1% or more of the table will be an mcv. However, if the size is fixed to 100 I think that it will be common to see values with relative frequencies much lower than 1% near the bottom of the list. > In any case there's already a heuristic to cut off the MCV list at some > shorter length (ie, more than 1% in this example) if it seems not > worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD) > in analyze.c. Given an increase in the default stats target, limits like the 25% are exactly what I think there should be more of. Can anyone suggest a good data set to test this sort of question on? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
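A toy version (Python, illustrative only) of that example: the two equal-population buckets hold very different numbers of distinct values, so a per-bucket rows/ndistinct estimate differs from the whole-table figure.

    values = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
    buckets = [[v for v in values if v < 2], [v for v in values if v >= 2]]

    for b in buckets:
        rows, ndistinct = len(b), len(set(b))
        print(rows, ndistinct, rows / float(ndistinct))   # 5 2 2.5   and then   5 5 1.0

    # versus a whole-table estimate of 10 rows / 7 distinct values ~= 1.43 for every value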
Re: [HACKERS] hist boundary duplicates bug in head and 8.3
>> For heavy tailed distributions, it is possible for analyze to >> duplicate histogram boundaries. > > I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values * that have a frequency of at least 1/K where K is the intended * number of histogram bins; such values might otherwise cause us to * emit duplicate histogram bin boundaries. */ If this is expected, I'm also not sure what the use of maxmincount in analyze is... Thanks for the response, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers