Re: [HACKERS] AIX compile ?? libcrypt HELP !!
Tom Lane wrote:
> Nathan Boeger <[EMAIL PROTECTED]> writes:
> > I am trying to compile an app (compiles on *BSD, Linux) but for some
> > reason I cant get it to compile on AIX 4.3.3.3. It seems not to like the
> > libcrypt ??
> >
> > ld: 0711-317 ERROR: Undefined symbol: __crypt_r
> > ld: 0711-317 ERROR: Undefined symbol: __setkey_r
> > ld: 0711-317 ERROR: Undefined symbol: __encrypt_r
>
> I'll bet AIX doesn't have the "reentrant" versions crypt_r(), setkey_r(),
> etc. Did you check its man page for the crypt functions?
>
> regards, tom lane

Well, no. I did, however, link against the ssl libcrypto
(-L/usr/local/ssl/lib -lcrypto) and it works... why?

Thank you for the response!

nathan
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 10/3/17, 5:59 PM, "Tom Lane" wrote: > I thought it would be a good idea to get this done before walking away > from the commitfest and letting all this info get swapped out of my > head. So I've reviewed and pushed this. Thanks! > I took out most of the infrastructure you'd put in for constructing > RangeVars for tables not explicitly named on the command line. It > was buggy (eg you can't assume a relcache entry will stick around) > and I don't believe it's necessary. I don't think that warnings > should be issued for any tables not explicitly named. > > In any case, though, the extent to which we should add more warning > or log output seems like a fit topic for a new thread and a separate > patch. Let's call this one done. I'll look into submitting that to the next commitfest. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Additional logging for VACUUM and ANALYZE
On 10/5/17, 12:29 AM, "Michael Paquier" wrote: > On Thu, Oct 5, 2017 at 1:23 AM, Bossart, Nathan wrote: >> Presently, there are a few edge cases in vacuum_rel() and analyze_rel() that >> I >> believe do not have sufficient logging. This was discussed a bit in the >> vacuum-multiple-relations thread [0], but it was ultimately decided that any >> logging changes should be proposed separately. > > I think that I agree with that, especially now with VACUUM allowing > multiple relations. The discussion then would be how much logging we > want. WARNING looks adapted per the discussions we had on the other > thread as manual VACUUMs can now involve much more relations, even > with partitioned tables. More opinions would be welcome. Thanks for taking a look. >> and a test that exercises a bit of this functionality. > > My take on those test would be to not include them. This is a lot just > to test two logging lines where the relation has been dropped. Yeah, I'm not terribly concerned about those tests. >> If this change were to be considered for back-patching, we would likely want >> to >> also apply Michael's RangeVar fix for partitioned tables to 10 [1]. Without >> this change, log messages for unspecified partitions will be emitted with the >> parent's RangeVar information. > > Well, that's assuming that we begin logging some information for > manual VACUUMs using the specified RangeVar, something that does not > happen at the top of upstream REL_10_STABLE, but can happen if we were > to include the patch you are proposing on this thread for > REL_10_STABLE. But the latter is not going to happen. Or did you patch > your version of v10 to do so in your stuff? For v10 the ship has > already sailed, so I think that it would be better to just let it go, > and rely on v11 which has added all the facility we wanted. I agree. I didn't mean to suggest that it should be back-patched, just that v10 has a small quirk that needs to be handled if others feel differently. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 11/1/17, 9:33 PM, "Peter Eisentraut" wrote: > On 9/25/17 14:04, Bossart, Nathan wrote: >> I'd like to use this thread to gauge community interest in adding this >> functionality to this module. If there is interest, I'll add it to the next >> commitfest. > > This thread has no patch, which is not really the intended use of the > commit fest, so I'm closing the commit fest entry for now. Sorry about that. I'll reopen it when I have a patch to share. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE
On 10/5/17, 11:53 PM, "Jing Wang" wrote: > The patch has been updated according to Nathan's comments. > Thanks Nathan's review. Thanks for the new versions of the patches. I apologize for the long delay for this new review. It looks like the no-pgdump patch needs a rebase at this point. I was able to apply the code portions with a 3-way merge, but the documentation changes still did not apply. I didn't have any problems applying the pgdump patch. + + + The name of a database or keyword current_database. When using + current_database,it means using the name of the connecting database. + + For commands that accept the CURRENT_USER and SESSION_USER keywords, the keywords are typically listed in the 'Synopsis' section. I think CURRENT_DATABASE should be no different. For example, the parameter type above could be "database_specification," and the following definition could be included at the bottom of the synopsis: where database_specification can be: object_name | CURRENT_DATABASE Then, in the parameters section, the CURRENT_DATABASE keyword would be defined: CURRENT_DATABASE Comment on the current database instead of an explicitly identified role. Also, it looks like only the COMMENT and SECURITY LABEL documentation is being updated in this patch set. However, this keyword seems applicable to many other commands, too (e.g. GRANT, ALTER DATABASE, ALTER ROLE, etc.). +static ObjectAddress +get_object_address_database(ObjectType objtype, DbSpec *object, bool missing_ok) +{ + char*dbname; + ObjectAddress address; + + dbname = get_dbspec_name(object); + + address.classId = DatabaseRelationId; + address.objectId = get_database_oid(dbname, missing_ok); + address.objectSubId = 0; + + return address; +} This helper function is only used once, and it seems simple enough to build the ObjectAddress in the switch statement. Also, instead of get_database_oid(), could we use get_dbspec_oid()? if (stmt->objtype == OBJECT_DATABASE) { - char *database = strVal((Value *) stmt->object); - - if (!OidIsValid(get_database_oid(database, true))) + if (!OidIsValid(get_dbspec_oid((DbSpec*)stmt->object, true))) { + char*dbname = NULL; + + dbname = get_dbspec_name((DbSpec*)stmt->object); + ereport(WARNING, (errcode(ERRCODE_UNDEFINED_DATABASE), -errmsg("database \"%s\" does not exist", database))); +errmsg("database \"%s\" does not exist", dbname))); return address; } } This section seems to assume that the DbSpec will be of type DBSPEC_CSTRING in the error handling. That should be safe for now, as you cannot drop the current database, but I would suggest adding assertions here to be sure. + dbname = get_dbspec_name((DbSpec*)stmt->dbspec); As a general note, casts are typically styled as "(DbSpec *) stmt" (with the spaces) in PostgreSQL. + case DBSPEC_CURRENT_DATABASE: + { + HeapTuple tuple; + Form_pg_database dbForm; Can we just declare "tuple" and "dbForm" at the beginning of get_dbspec_name() so we don't need the extra set of braces? + if (fout->remoteVersion >= 10) + appendPQExpBuffer(dbQry, "COMMENT ON DATABASE CURRENT_DATABASE IS "); + else + appendPQExpBuffer(dbQry, "COMMENT ON DATABASE %s IS ", fmtId(datname)); This feature would probably only be added to v11, so the version checks in the pgdump patch will need to be updated. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_upgrade to clusters with a different WAL segment size
Hi hackers, The thread for the recent change to allow setting the WAL segment size at initdb time [0] included a bit of discussion regarding pg_upgrade [1], where it was suggested that relaxing an error check (presumably in check_control_data()) might be enough to upgrade servers to a different WAL segment size. We've had success with our initial testing of upgrades to larger WAL segment sizes, including post-upgrade pgbench runs. Beyond adjusting check_control_data(), it looks like the 'pg_resetwal -l' call in copy_xact_xlog_xid() may need to be adjusted to ensure that the WAL starting address is set to a valid value. Allowing changes to the WAL segment size during pg_upgrade seems like a nice way to avoid needing a dump and load, so I would like to propose adding support for this. I'd be happy to submit patches for this in the next commitfest. Nathan [0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fc49e24fa69a15efacd5b8958115ed9c43c48f9a [1] https://www.postgresql.org/message-id/20160826043926.pwkz45ksxpmfn4g6%40alap3.anarazel.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
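To make the intended workflow concrete, the idea is roughly the following; the directories, binary paths, and the 64MB size are only examples, and today the second step presumably trips the segment-size comparison in check_control_data():

    # example only -- paths, versions, and sizes are placeholders
    initdb --wal-segsize=64 -D /data/new
    pg_upgrade -b /usr/pgsql-old/bin -B /usr/pgsql-new/bin \
               -d /data/old -D /data/new

The proposal is simply to let the second step succeed when the only mismatch is the WAL segment size, after fixing up the 'pg_resetwal -l' argument as noted above.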
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
Great, I’ll keep working on this patch and will update this thread with a more complete implementation. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/10/17, 8:10 PM, "Masahiko Sawada" wrote: > I agree to report per-table information. Especially In case of one of > tables specified failed during vacuuming, I think we should report at > least information of tables that is done successfully so far. +1 I believe you already get all per-table information when you do not specify a table name (“VACUUM VERBOSE;”), so I would expect to get this for free as long as we are building this into the existing ExecVacuum(…) and vacuum(…) functions. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/11/17, 6:32 PM, "Tom Lane" wrote: > You probably won't get much in the near term, because we're in > stabilize-the-release mode not develop-new-features mode. > Please add your patch to the pending commitfest > https://commitfest.postgresql.org/14/ > so that we remember to look at it when the time comes. Understood. I’ve added it to the commitfest. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/11/17, 7:20 PM, "Michael Paquier" wrote: > Browsing the code Thanks for taking a look. > It seems to me that you don't need those extra square brackets around > the column list. Same remark for vacuum.sgml. I’ll remove them. My intent was to ensure that we didn’t accidentally suggest that statements like “VACUUM , foo, bar;” were accepted, but the extra brackets don’t seem to fix that, and I don’t foresee much confusion anyway. > In short for all that, it is enough to have "[, ... ]" to document > that a list is accepted. That makes sense, I’ll fix it. > It seems to me that it would have been less invasive to loop through > vacuum() for each relation. Do you foresee advantages in allowing > vacuum() to handle multiple? I am not sure if is is worth complicating > the current logic more considering that you have as well so extra > logic to carry on option values. That was the approach I first prototyped. The main disadvantage that I found was that the command wouldn’t fail-fast if one of the tables or columns didn’t exist, and I thought that it might be frustrating to encounter such an error in the middle of vacuuming several large tables. It’s easy enough to change the logic to emit a warning and simply move on to the next table, but that seemed like it could be easily missed among the rest of the vacuum log statements (especially with the verbose option specified). What are your thoughts on this? In the spirit of simplifying things a bit, I do think it is possible to eliminate one of the new node types, since the fields for each are almost identical. > + * used for error messages. In the case that relid is valid, rels > + * must have exactly one element corresponding to the specified relid. > s/rels/relations/ as variable name? Agreed, that seems nicer. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
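To make the fail-fast concern concrete, the behavior I would like to keep is roughly this (table names invented for the example):

    -- the typo is caught up front, before any of the large tables are vacuumed
    VACUUM FREEZE VERBOSE big_table_1, big_table_2, big_tabel_3;
    ERROR:  relation "big_tabel_3" does not exist

With a warn-and-continue approach instead, the first two tables would be processed and the warning could easily scroll past in the verbose output.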
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/16/17, 11:21 PM, "Michael Paquier" wrote: > On Wed, May 17, 2017 at 7:56 AM, Bossart, Nathan wrote: >> I think this issue already exists, as this comment in get_rel_oids(…) seems >> to indicate: >> >> /* >> * Since we don't take a lock here, the relation might be gone, or the >> * RangeVar might no longer refer to the OID we look up here. In the >> * former case, VACUUM will do nothing; in the latter case, it will >> * process the OID we looked up here, rather than the new one. Neither >> * is ideal, but there's little practical alternative, since we're >> * going to commit this transaction and begin a new one between now >> * and then. >> */ >> relid = RangeVarGetRelid(vacrel, NoLock, false); >> >> With the patch applied, I believe this statement still holds true. So if >> the relation disappears before we get to vacuum_rel(…), we will simply skip >> it and move on to the next one. The vacuum_rel(…) code provides a WARNING >> in many cases (e.g. the user does not have privileges to VACUUM the table), >> but we seem to silently skip the table when it disappears before the call to >> vacuum_rel(…). > > Yes, that's the bits using try_relation_open() which returns NULL if > the relation is gone. I don't think that we want VACUUM to be noisy > about that when running it on a database. Agreed. >> If we added a WARNING to the effect of “skipping vacuum of — >> relation no longer exists” for this case, I think what you are suggesting >> would be satisfied. > > We would do no favor by reporting nothing to the user. Without any > information, the user triggering a manual vacuum may believe that the > relation has been vacuumed as it was listed but that would not be the > case. Agreed. Unfortunately, I think this is already possible when you specify a table to be vacuumed. >> However, ANALYZE has a slight caveat. While analyze_rel(…) silently skips >> the relation if it no longer exists like vacuum_rel(…) does, we do not >> pre-validate the columns list at all. So, in an ANALYZE statement with >> multiple tables and columns specified, it’ll only fail once we get to the >> undefined column. To fix this, we could add a check for the column lists >> near get_rel_oids(…) and adjust do_analyze_rel(…) to emit a WARNING and skip >> any columns that vanish in the meantime. >> >> Does this seem like a sane approach? >> >> 1. Emit WARNING when skipping if relation disappears before we get to it. >> 2. Early in vacuum(…), check that the specified columns exist. > > And issue an ERROR, right? Correct. This means that both the relations and columns specified would cause an ERROR if they do not exist and a WARNING if they disappear before we can actually process them. > + RelationAndColumns *relinfo = (RelationAndColumns *) > lfirst(relation); > + int per_table_opts = options | relinfo->options; /* > VACOPT_ANALYZE may be set per-table */ > + ListCell *oid; > I have just noticed this bit in your patch. So you can basically do > something like that: > VACUUM (ANALYZE) foo, (FULL) bar; > Do we really want to go down to this level of control? I would keep > things a bit more restricted as users may be confused by the different > lock levels involved by each operation, and make use of the same > options for all the relations listed. Opinions from others is welcome. I agree with you here, too. I stopped short of allowing customers to explicitly provide per-table options, so the example you provided wouldn’t work here. 
This is more applicable for something like the following: VACUUM (FREEZE, VERBOSE) foo, bar (a); In this case, the FREEZE and VERBOSE options are used for both tables. However, we have a column list specified for ‘bar’, and the ANALYZE option is implied when we specify a column list. So when we process ‘bar’, we need to apply the ANALYZE option, but we do not need it for ‘foo’. For now, that is all that this per-table options variable is used for. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/18/17, 6:12 PM, "Michael Paquier" wrote: > Fine for me as well. I would suggest to split the patch into two parts > to ease review then: > - Rework this error handling for one relation. > - The main patch. I’d be happy to do so, but I think part one would be pretty small, and almost all of the same code needs to be changed in the main patch anyway. I do not foresee a huge impact on review-ability either way. If others disagree, I can split it up. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 5/18/17, 8:03 PM, "Tom Lane" wrote: >”Bossart, Nathan" writes: >> On 5/18/17, 6:12 PM, "Michael Paquier" wrote: >>> Fine for me as well. I would suggest to split the patch into two parts >>> to ease review then: >>> - Rework this error handling for one relation. >>> - The main patch. >> >> I’d be happy to do so, but I think part one would be pretty small, and >> almost all of the same code needs to be changed in the main patch anyway. I >> do not foresee a huge impact on review-ability either way. If others >> disagree, I can split it up. > >Yeah, I'm dubious that that's really necessary. If the change proves >bigger than you're anticipating, maybe it's worth a two-step approach, >but I share your feeling that it probably isn’t. Just in case it was missed among the discussion, I’d like to point out that v5 of the patch includes the “ERROR if ANALYZE not specified” change. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 8/23/17, 11:59 PM, "Michael Paquier" wrote: > Robert, Amit and other folks working on extending the existing > partitioning facility would be in better position to answer that, but > I would think that we should have something as flexible as possible, > and storing a list of relation OID in each VacuumRelation makes it > harder to track the uniqueness of relations vacuumed. I agree that the > concept of a partition with multiple parents induces a lot of > problems. But the patch as proposed worries me as it complicates > vacuum() with a double loop: one for each relation vacuumed, and one > inside it with the list of OIDs. Modules calling vacuum() could also > use flexibility, being able to analyze a custom list of columns for > each relation has value as well. I understand your concern, and I'll look into this for v9. I think the majority of this change will go into get_rel_oids(...). Like you, I am also curious to what the partitioning folks think. > + * relations is a list of VacuumRelations to process. If it is NIL, all > + * relations in the database are processed. > Typo here, VacuumRelation is singular. I'll make this change in v9. I should also note that the dedupe_relations(...) function needs another small fix for column lists. Since the lack of a column list means that we should ANALYZE all columns, a duplicate table name with an empty column list should effectively null out any other specified columns. For example, "ANALYZE table (a, b), table;" currently dedupes to the equivalent of "ANALYZE table (a, b);" when it should dedupe to "ANALYZE table;". Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 8/30/17, 5:37 PM, "Michael Paquier" wrote: > Yeah... Each approach has its cost and its advantages. It may be > better to wait for more opinions, no many people have complained yet > that for example a list of columns using twice the same one fails. Sounds good to me. > +VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ class="PARAMETER">table_name ] [, ...] > I just noticed that... But regarding the docs, I think that you have > misplaced the position of "[, ...]", which should be inside the > table_name portion in the case of what I quote here, no? I think that's what I had initially, but it was changed somewhere along the line. It is a little more complicated for the versions that accept column lists. VACUUM ... ANALYZE [ [ table_name [ (column_name [, ...] ) ] ] [, ...] ] ISTM that we need the extra brackets here to clarify that the table and column list combination is what can be provided in a list. Does that make sense? Or do you think we can omit the outermost brackets here? > +VacuumRelation * > +makeVacuumRelation(RangeVar *relation, List *va_cols, Oid oid) > +{ > + VacuumRelation *vacrel = makeNode(VacuumRelation); > + vacrel->relation = relation; > + vacrel->va_cols = va_cols; > + vacrel->oid = oid; > + return vacrel; > +} > Perhaps in makefuncs.c instead of vacuum.c? That's usually the place > used for node constructions like that. Ah, yes. That is a much better place. I'll make this change. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/3/17, 11:46 PM, "Michael Paquier" wrote: > I did not consider first that the list portion also needed to be > modified, perhaps because I am not coding that myself... So now that > it is becoming more complicated what about just using AutovacMemCxt? > This would simplify the list of VacuumRelation entries and the > RangeVar creation as well, and honestly this is ugly and there are no > other similar patterns in the backend code: +1 > This would become way more readable by using makeRangeVar() and the > new makeVacuumRelation. As this is partly my fault that we are at this > state, I am fine as well to remove this burden from you, Nathan, and > fix that myself in a new version. And I don't want to step on your > toes either :) No worries, I can take care of it. I appreciate your patience with all of these reviews. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/4/17, 10:32 PM, "Simon Riggs" wrote: > ISTM there is no difference between > VACUUM a, b > and > VACUUM a; VACUUM b; > > If we want to keep the code simple we must surely consider whether the > patch has any utility. Yes, this is true, but I think the convenience factor is a bit understated with that example. For example, if you need to manually cleanup several tables for XID purposes, VACUUM FREEZE VERBOSE table1; VACUUM FREEZE VERBOSE table2; VACUUM FREEZE VERBOSE table3; VACUUM FREEZE VERBOSE table4; VACUUM FREEZE VERBOSE table5; becomes VACUUM FREEZE VERBOSE table1, table2, table3, table4, table5; I would consider even this to be a relatively modest example compared to the sorts of things users might do. In addition, I'd argue that this feels like a natural extension of the VACUUM command, one that I, like others much earlier in this thread, was surprised to learn wasn't supported. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/4/17, 8:16 PM, "Michael Paquier" wrote: > So vacuum_multiple_tables_v14.patch is good for a committer in my > opinion. With this patch, if the same relation is specified multiple > times, then it gets vacuum'ed that many times. Using the same column > multi-times results in an error as on HEAD, but that's not a new > problem with this patch. Thanks! > So I would tend to think that the same column specified multiple times > should cause an error, and that we could let VACUUM run work N times > on a relation if it is specified this much. This feels more natural, > at least to me, and it keeps the code simple. I think that is a reasonable approach. Another option I was thinking about was to de-duplicate only the individual column lists. This alternative approach might be a bit more user-friendly, but I am beginning to agree with you that perhaps we should not try to infer the intent of the user in these "duplicate" scenarios. I'll work on converting the existing de-duplication patch into something more like what you suggested. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/9/17, 7:28 AM, "Michael Paquier" wrote: > In the duplicate patch, it seems to me that you can save one lookup at > the list of VacuumRelation items by checking for column duplicates > after checking that all the columns are defined. If you put the > duplicate check before closing the relation you can also use the > schema name associated with the Relation. I did this so that the ERROR prioritization would be enforced across all relations. For example: VACUUM ANALYZE table1 (a, a), table2 (does_not_exist); If I combine the 'for' loops to save a lookup, this example behaves differently. Instead of an ERROR for the nonexistent column, you would hit an ERROR for the duplicate column in table1's list. However, I do not mind changing this. > + if (i == InvalidAttrNumber) > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_COLUMN), > +errmsg("column \"%s\" of relation \"%s\" does not exist", > + col, RelationGetRelationName(rel; > This could use the schema name unconditionally as you hold a Relation > here, using RelationGetNamespace(). Sure, I think this is a good idea. I'll make this change in the next version of the patch. > if (!onerel) > + { > + /* > +* If one of the relations specified by the user has disappeared > +* since we last looked it up, let them know so that they do not > +* think it was actually analyzed. > +*/ > + if (!IsAutoVacuumWorkerProcess() && relation) > + ereport(WARNING, > + (errmsg("skipping \"%s\" --- relation no longer exists", > + relation->relname))); > + > return; > + } > Hm. So if the relation with the defined OID is not found, then we'd > use the RangeVar, but this one may not be set here: > + /* > +* It is safe to leave everything except the OID empty here. > +* Since no tables were specified in the VacuumStmt, we know > +* we don't have any columns to keep track of. Also, we do > +* not need the RangeVar, because it is only used for error > +* messaging when specific relations are chosen. > +*/ > + rel_oid = HeapTupleGetOid(tuple); > + relinfo = makeVacuumRelation(NULL, NIL, rel_oid); > + vacrels_tmp = lappend(vacrels_tmp, relinfo); > So if the relation is analyzed but skipped, we would have no idea that > it actually got skipped because there are no reports about it. That's > not really user-friendly. I am wondering if we should not instead have > analyze_rel also enforce the presence of a RangeVar, and adding an > assert at the beginning of the function to undertline that, and also > do the same for vacuum(). It would make things also consistent with > vacuum() which now implies on HEAD that a RangeVar *must* be > specified. I agree that it is nice to see when relations are skipped, but I do not know if the WARNING messages would provide much value for this particular use case (i.e. 'VACUUM;'). If a user does not provide a list of tables to VACUUM, they might not care too much about WARNINGs for dropped tables. > Are there opinions about back-patching the patch checking for > duplicate columns? Stable branches now complain about an unhelpful > error message. I wouldn't mind drafting something up for the stable branches. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE
A few general comments. While this patch applies, I am still seeing some whitespace errors: comment_on_current_database_no_pgdump_v4.1.patch:488: trailing whitespace. ColId comment_on_current_database_no_pgdump_v4.1.patch:490: trailing whitespace. | CURRENT_DATABASE comment_on_current_database_no_pgdump_v4.1.patch:491: trailing whitespace. { comment_on_current_database_no_pgdump_v4.1.patch:501: trailing whitespace. ColId comment_on_current_database_no_pgdump_v4.1.patch:502: trailing whitespace. { warning: squelched 9 whitespace errors warning: 14 lines add whitespace errors. It looks like the 'dbname' test is currently failing when you run 'make check-world'. The Travis CI build log [1] shows the details of the failure. From a brief glance, I would guess that some of the queries are returning unexpected databases that are created in other tests. Also, I think this change will require updates to the documentation. + if (dbspecname->dbnametype == DBSPEC_CURRENT_DATABASE ) + dbname = get_database_name(MyDatabaseId); + else + dbname = dbspecname->dbname; This pattern is repeated in the patch several times. It looks like the end goal of these code blocks is either to get the database name or the database OID, so perhaps we should have get_dbspec_name() and get_dbspec_oid() helper functions (like get_rolespec_oid() for RoleSpec nodes). +static bool +_equalDatabaseSpec(const DBSpecName *a, const DBSpecName *b) +{ + COMPARE_SCALAR_FIELD(dbnametype); + COMPARE_STRING_FIELD(dbname); + COMPARE_LOCATION_FIELD(location); + + return true; +} There are some inconsistencies in the naming strategy. For example, this function is called _equalDatabaseSpec(), but the struct is DBSpecName. I would suggest calling it DatabaseSpec or DbSpec throughout the patch. Nathan [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/275747367 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
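To sketch what I mean by those helpers (untested, modeled on get_rolespec_oid() and using the DBSpecName fields from this version of the patch):

    /* untested sketch of the suggested helpers */
    static char *
    get_dbspec_name(DBSpecName *spec)
    {
        if (spec->dbnametype == DBSPEC_CURRENT_DATABASE)
            return get_database_name(MyDatabaseId);
        return spec->dbname;
    }

    static Oid
    get_dbspec_oid(DBSpecName *spec, bool missing_ok)
    {
        return get_database_oid(get_dbspec_name(spec), missing_ok);
    }

Each call site would then collapse to a single call instead of repeating the DBSPEC_CURRENT_DATABASE branch.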
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/20/17, 2:29 PM, "Tom Lane" wrote: > I started to look at this... Thanks for taking a look. > I think that the way this ought to work is you process the VacuumRelation > structs purely sequentially, each in its own transaction, so you don't > need leaps of faith about what to do if a relation changes between the > first time you look at it and when you actually come to process it. How might this work for VACUUM statements with no tables specified? It seems like we must either handle the case when the relation changes before it is processed or hold a lock for the duration of the vacuuming. > The idea of "fast failing" because some later VacuumRelation struct might > contain an error seems like useless complication, both in terms of the > implementation and the user-visible behavior. I agree that the patch might be simpler without this, but the user-visible behavior is the reason I had included it. In short, my goal was to avoid errors halfway through a long-running VACUUM statement because the user misspelled a relation/column name or the relation/column was dropped. It's true that the tests become mostly pointless if the relations or columns change before they are processed, but this seems like it would be a rarer issue in typical use cases. I'll continue to look into switching to a more sequential approach and eliminating the leaps of faith. > As for the other patch, I wonder if it might not be better to > silently ignore duplicate column names. But in either case, we don't > need a pre-check, IMO. I'd just leave it to the lookup loop in > do_analyze_rel to notice if it's looked up the same column number > twice. That would be more likely to lead to a back-patchable fix, > too. 0002 as it stands is useless as a back-patch basis, since it > proposes modifying code that doesn't even exist in the back branches. I'd be happy to write something up that is more feasibly back-patched. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rethinking autovacuum.c memory handling
On 9/23/17, 5:27 AM, "Michael Paquier" wrote: >On Sat, Sep 23, 2017 at 6:09 AM, Tom Lane wrote: >> I notice that autovacuum.c calls autovacuum_do_vac_analyze, and >> thereby vacuum(), in TopTransactionContext. This doesn't seem >> like a terribly great idea, because it doesn't correspond to what >> happens during a manually-invoked vacuum. > > Indeed, the inconsistency is not good here. > >> What I think we should do instead is invoke autovacuum_do_vac_analyze >> in the PortalContext that do_autovacuum has created, which we already >> have a mechanism to reset once per table processed in do_autovacuum. >> >> The attached patch does that, and also modifies perform_work_item() >> to use the same approach. Right now perform_work_item() has a >> copied-and-pasted MemoryContextResetAndDeleteChildren(PortalContext) >> call in its error recovery path, but that seems a bit out of place >> given that perform_work_item() isn't using PortalContext otherwise. > > I have spent some time looking at your patch and testing it. This > looks sane. A small comment that I have would be to add an assertion > at the top of perform_work_item to be sure that it is called in the > memory context of AutovacMemCxt. This looks reasonable to me as well. I haven't noticed any issues after a couple hours of pgbench with aggressive autovacuum settings, either. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
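For anyone following along, the shape of the change is roughly the following, as I read the patch (paraphrased, not the exact diff):

    /* paraphrase of the approach, not the exact committed code */
    MemoryContextSwitchTo(PortalContext);
    autovacuum_do_vac_analyze(tab, bstrategy);

    /* switch back to the long-lived context and reset the per-table
     * memory, as do_autovacuum already does once per table */
    MemoryContextSwitchTo(AutovacMemCxt);
    MemoryContextResetAndDeleteChildren(PortalContext);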
Re: [HACKERS] Rethinking autovacuum.c memory handling
On 9/23/17, 12:36 PM, "Tom Lane" wrote: >"Bossart, Nathan" writes: >> This looks reasonable to me as well. I haven't noticed any issues after >> a couple hours of pgbench with aggressive autovacuum settings, either. > > Thanks for looking. As I'm sure you realize, what motivated that was > not liking the switch into AutovacMemCxt that you'd added in > autovacuum_do_vac_analyze ... with this patch, we can drop that. Yup. I’ll go ahead and update that patch now. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix number skipping in to_number
On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote: > Ok I've made that change in the attached v3. I'm not sure as I'm on > en_US.UTF-8 locale too. Maybe something Windows specific? This patch applies against master (8485a25a), compiles, and passes a make check. I tested both on my mac laptop, and my linux server. If we want this patch, I'd say it's ready for committer. We may want (and I can't believe I'm saying this) more discussion as to exactly what the strategy for to_number() (and friends) is. Do we want to duplicate Oracle's functionality, or do we want a similar function to do similar things, without necessarily having a goal of identical behavior to oracle? For myself, I pretty much never use the to_date, to_number, or to_timestamp functions except when porting oracle code. I do use the to_char functions on occasion. If strftime were available, I probably wouldn't use them. I would commit this patch and update the TODO with a goal of making to_number as Oracle compatible as is reasonable. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/24/17, 10:12 PM, "Michael Paquier" wrote: > Attached is a proposal of patch. The patch seems reasonable to me, and I haven't encountered any issues in my tests, even after applying the vacuum-multiple-relations patch on top of it. +* Take a lock here for the relation lookup. If ANALYZE or VACUUM spawn +* multiple transactions, the lock taken here will be gone once the +* current transaction running commits, which could cause the relation +* to be gone, or the RangeVar might not refer to the OID looked up here. I think this could be slightly misleading. Perhaps it would be more accurate to say that the lock will be gone any time vacuum() creates a new transaction (either in vacuum_rel() or when use_own_xacts is true). Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Enhancements to passwordcheck
Hi hackers,

Currently, the passwordcheck module provides a few basic checks to strengthen passwords. However, any configuration must be done at compile time, and many common password requirements cannot be enforced without creating a custom version of this module. I think there are a number of useful parameters that could be added to enable common password restrictions, including the following list, which is based on some asks from our customers:

passwordcheck.min_password_length
passwordcheck.min_uppercase_letters
passwordcheck.min_lowercase_letters
passwordcheck.min_numbers
passwordcheck.min_special_chars
passwordcheck.superuser_can_bypass
passwordcheck.max_expiry_period
passwordcheck.force_new_password

I'd like to use this thread to gauge community interest in adding this functionality to this module. If there is interest, I'll add it to the next commitfest.

Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
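To give a feel for what the checks could look like, here is an untested sketch of the character-class counting the plain-text path of the check_password hook might do; the passwordcheck_* variables are the hypothetical GUCs backing the parameters proposed above:

    /* untested sketch; assumes the usual passwordcheck.c includes */
    static void
    check_character_classes(const char *password)
    {
        const char *p;
        int         upper = 0, lower = 0, digits = 0;

        for (p = password; *p; p++)
        {
            if (isupper((unsigned char) *p))
                upper++;
            else if (islower((unsigned char) *p))
                lower++;
            else if (isdigit((unsigned char) *p))
                digits++;
        }

        if ((int) strlen(password) < passwordcheck_min_length ||
            upper < passwordcheck_min_uppercase ||
            lower < passwordcheck_min_lowercase ||
            digits < passwordcheck_min_numbers)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PASSWORD),
                     errmsg("password does not meet the configured complexity requirements")));
    }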
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/25/17, 6:51 PM, "Michael Paquier" wrote: >> +* Take a lock here for the relation lookup. If ANALYZE or >> VACUUM spawn >> +* multiple transactions, the lock taken here will be gone >> once the >> +* current transaction running commits, which could cause >> the relation >> +* to be gone, or the RangeVar might not refer to the OID >> looked up here. >> >> I think this could be slightly misleading. Perhaps it would be more >> accurate to say that the lock will be gone any time vacuum() creates a new >> transaction (either in vacuum_rel() or when use_own_xacts is true). > > The comment of the proposed patch matches as much as possible what is > currently on HEAD, so I would still go with something close to that. Sure. This is just a minor point, and I could see the argument that your phrasing is more concise, anyway. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 9/25/17, 8:31 PM, "Michael Paquier" wrote: > Yes, I have developped a couple of years back a fork of passwordcheck > which had much similar enhancements, so getting something more modular > in upstream would be really welcome. Awesome. > On top of that I think that you should have a parameter that specifies > a string full of special characters. For example I have been using > things like !@#$%^&*()_+{}|<>?=. But you likely want to make that > modular, people have their own set of character characters, and it is > just something that could be compared with strchr(), still the default > should be meaningful. +1 >> passwordcheck.superuser_can_bypass > > Not sure that this one has much meaning. Superusers could easily > unload the module. True. I can leave it out for now. >> passwordcheck.force_new_password > > So this is to make sure that the new password is not the same as the > old one? This would be better with the last N passwords, still this > would require more facilities. I would discard this one as what you > are proposing here is not modular enough IMO, and needs a separate > feature set. Fair enough. I'll definitely start with a set of very non-controversial parameters, as this will likely require rewriting most of the module. >> I'd like to use this thread to gauge community interest in adding this >> functionality to this module. If there is interest, I'll add it to the next >> commitfest. > > I think that's a good idea. Great to see that you are contributing > back some stuff. Thanks for the detailed feedback. On 9/25/17, 8:37 PM, "Euler Taveira" wrote: >> passwordcheck.max_expiry_period >> > How would you enforce that? If the password expires, you can log in to > change the password. If you let him/her to get in and change the > password, you can't obligate the user to do that. You could send a > message to remember that the password will expire but you can't > enforce that (unless you make a change in the protocol). My idea was to tie into the existing password expiration functionality (VALID UNTIL in CREATE/ALTER ROLE statements). I don't think this would involve any changes to how password expiration works. Instead, it would just be a simple check to make sure the specified expiration date is not too far in the future. >> passwordcheck.force_new_password >> > Does it mean a password different from the old one? +1. It could be > different from the last 3 passwords but we don't store a password > history. Yes. As Michael pointed out, this might be better to do as a separate effort since we'll almost certainly need to introduce a way to store password history. One interesting design challenge will be how to handle pre-hashed passwords, since the number of checks we can do on those is pretty limited. I'm currently thinking of a parameter that can be used to block, allow, or force pre-hashed passwords. If we take that route, perhaps we will also need to ensure that PostgreSQL fails to start when invalid combinations are specified (e.g. pre-hashed passwords are forced and min_password_length is nonzero). Thoughts? Also, I imagine we'll want to keep the cracklib and "password cannot contain role name" features around, although perhaps these could become configurable like the rest of the options. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix number skipping in to_number
On Mon, Sep 25, 2017 at 07:52:19PM +0100, Oliver Ford wrote: > Thanks for your review. The issue is that Oracle throws errors on many > more input cases than Postgres does, so making it exactly like Oracle > could break a lot of existing users. E.g. to_number ('123,000', '999') > returns '123' on Postgres, but throws an error on Oracle. So making it > exactly Oracle-like could break existing users who might rely on the > current behavior. I wouldn't use to_number for anything other than oracle compatibility, and then just so I didn't have to wade through all the ported oracle code. I would use a regex or some such to clean up the number, and then cast the result. For an input string of '123,000' I might do a translate('123,000', ',', '')::integer or perhaps use regexp_replace(). Either way I would want a more positive decision as to what was valid or not, based on the input. > My view is that we shouldn't deliberately introduce errors in order to be > exactly like Oracle if we don't currently and there's a sane use case for > current behavior. Do you have any examples of results that are different > between Oracle and Postgres, and you think the Oracle result makes more > sense? Not really, other than I think an error report might make more sense. '123,000' doesn't really match the format of '999'. If anything it seems like we're guessing rather than validating input. It is surprising (to me at least) that to_char(to_number('123,000', '999'), '999') doesn't give us the original input (in the sense that identical formats don't preserve the original string). So I'm not sure the current behavior is a sane use case, but perhaps more people are using to_number() to get *some* numeric result, rather than for wanting it to be like oracle. I would generally prefer to throw an exception instead of getting a number I wasn't expecting, but I can see cases where that might not be the case. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
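For completeness, the kind of explicit cleanup I mean is ordinary SQL along these lines (nothing to do with the patch itself):

    -- strip everything that isn't a digit or decimal point, then cast
    SELECT regexp_replace('123,000', '[^0-9.]', '', 'g')::numeric;  -- 123000
    SELECT translate('123,000', ',', '')::integer;                  -- 123000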
Re: [HACKERS] Enhancements to passwordcheck
On 9/26/17, 2:38 AM, "Albe Laurenz" wrote: > Nathan Bossart wrote: >>>> passwordcheck.force_new_password >>>> >>> Does it mean a password different from the old one? +1. It could be >>> different from the last 3 passwords but we don't store a password >>> history. >> >> Yes. As Michael pointed out, this might be better to do as a separate >> effort since we'll almost certainly need to introduce a way to store >> password history. > > That increases the number of passwords stored on the server and > consequently the damage when that list is stolen. > Of course the old passwords are invalid, but if someone cracks them > they could still try them on other systems the person uses. > > I think we should accept such a risk only if the benefits are clear, and > my opinion has always been that if you forbid password reuse, people > tend to come up with password generation schemes that are no better > than the original passwords. Right. However, without this, they may not change the password at all, which would make the expiry functionality less effective. I suppose there's not a great way to guard against these sorts of password generation schemes without being able to judge the proposed password against the previous password, too. Perhaps the max_expiry_period parameter should be left out for now as well. >> One interesting design challenge will be how to handle pre-hashed >> passwords, since the number of checks we can do on those is pretty >> limited. I'm currently thinking of a parameter that can be used to >> block, allow, or force pre-hashed passwords. If we take that route, >> perhaps we will also need to ensure that PostgreSQL fails to start when >> invalid combinations are specified (e.g. pre-hashed passwords are forced >> and min_password_length is nonzero). Thoughts? > > As was pointed out in the original discussion > d960cb61b694cf459dcfb4b0128514c203937...@exadv11.host.magwien.gv.at > the weak point of "passwordcheck" is that it does not work very well > for encrypted passwords. > The only saving grace is that you can at least check against > username equals password. Thanks for linking the original thread. There are a lot of interesting points. I wonder if enhanced password checking in core or contrib might be received differently with the introduction of SCRAM authentication, since the weaknesses of MD5 were often cited. > Disabling pre-hashed passwords in order to allow better password > checks is a problem rather than a solution, because it exposes you > to password theft of the clear-text password. I think we shouldn't > go there. > > The overall opinion in the above thread was that if you *really* care > about security, you don't use database passwords, but external > authentication with a centralized identity management system. > > So I think it is fine to extend "passwordcheck", but we shouldn't > take it serious enough to reduce security elsewhere in order to > improve the module. I understand the points made here, but not allowing configurability here really hinders the module's ability to enforce much of anything. However, I did say I wanted to avoid controversial parameters, so I'll plan to count this one out as well. 
This leaves us with the following proposed parameters for now:

passwordcheck.min_password_length
passwordcheck.min_uppercase_letters
passwordcheck.min_lowercase_letters
passwordcheck.min_numbers
passwordcheck.min_special_chars
passwordcheck.special_chars
passwordcheck.allow_username_in_password
passwordcheck.use_cracklib

Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
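For illustration, wiring these up would then look something like this in postgresql.conf (values are examples, not recommendations):

    shared_preload_libraries = 'passwordcheck'
    passwordcheck.min_password_length = 10
    passwordcheck.min_uppercase_letters = 1
    passwordcheck.min_lowercase_letters = 1
    passwordcheck.min_numbers = 1
    passwordcheck.min_special_chars = 1
    passwordcheck.special_chars = '!@#$%^&*()_+{}|<>?='
    passwordcheck.allow_username_in_password = off
    passwordcheck.use_cracklib = on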
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/25/17, 12:42 AM, "Michael Paquier" wrote: > + if (!IsAutoVacuumWorkerProcess()) > + ereport(WARNING, > + (errmsg("skipping \"%s\" --- relation no longer exists", > + relation->relname))); > I like the use of WARNING here, but we could use as well a LOG to be > consistent when a lock obtention is skipped. It looks like the LOG statement is only emitted for autovacuum, so maybe we should keep this at WARNING for consistency with the permission checks below it. > +* going to commit this transaction and begin a new one between > now > +* and then. > +*/ > + relid = RangeVarGetRelid(relinfo->relation, NoLock, false); > We will likely have to wait that the matters discussed in > https://www.postgresql.org/message-id/25023.1506107...@sss.pgh.pa.us > are settled. Makes sense. > +VACUUM FULL vactst, vactst, vactst, vactst; > This is actually a waste of cycles. I'll clean this up in v22. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enhancements to passwordcheck
On 9/27/17, 7:41 AM, "Peter Eisentraut" wrote: > On 9/25/17 23:10, Bossart, Nathan wrote: >> One interesting design challenge will be how to handle pre-hashed >> passwords, since the number of checks we can do on those is pretty >> limited. I'm currently thinking of a parameter that can be used to >> block, allow, or force pre-hashed passwords. > > Pre-hashed passwords are the normal case. You can't break that without > making this module a net loss in security. A strength in making this configurable is that you could use it to enforce that passwords are pre-hashed. But yes, blocking pre- hashed passwords could be undesirable given an untrusted network or server. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Use of RangeVar for partitioned tables in autovacuum
On 9/26/17, 9:28 PM, "Michael Paquier" wrote: > In conclusion, I think that the open item of $subject should be > removed from the list, and we should try to get the multi-VACUUM patch > in to cover any future problems. I'll do so if there are no > objections. If someone did want to add logging for vacuum_rel() and analyze_rel() in v10 after your patch was applied, wouldn't the NULL RangeVars force us to skip the new log statements for partitions? I think we might instead want to back-patch the VacuumRelation infrastructure so that we can appropriately log for partitions. However, I'm dubious that it is necessary to make such a big change so close to release for hypothetical log statements. So, in the end, I agree with you. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands
On 9/28/17, 8:46 PM, "Michael Paquier" wrote: > On Fri, Sep 29, 2017 at 10:44 AM, Tom Lane wrote: >> Michael Paquier writes: >>> On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan >>> wrote: >>>> Alright, I've added logging for autovacuum in v23. I ended up needing to >>>> do a little restructuring to handle the case when the relation was skipped >>>> because the lock could not be obtained. While doing so, I became >>>> convinced that LOG was probably the right level for autovacuum logs. >> >>> OK, of course let's not change the existing log levels. This could be >>> always tuned later on depending on feedback from others. I can see >>> that guc.c also uses elevel == 0 for some logic, so we could rely on >>> that as you do. >> >> FWIW, I don't think this patch should be mucking with logging behavior >> at all; that's not within its headline charter, and I doubt many people >> are paying attention. I propose to commit it without that. If you feel >> hot about changing the logging behavior, you can resubmit that as a new >> patch in a new thread where it will get some visibility and debate on >> its own merits. > > Okay. I am fine with that as well. Sure, that seems reasonable to me. Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 9/29/17, 11:18 AM, "Robert Haas" wrote: > I don't think I understand problem #2. I think the concern is about > reporting the proper relation name when VACUUM cascades from a > partitioned table to its children and then some kind of concurrent DDL > happens, but I don't see a clear explanation on the thread as to what > exactly the failure scenario is, and I didn't see a problem in some > simple tests I just ran. Furthermore, it sounds like this might get > fixed as part of committing the patch to allow VACUUM to mention > multiple tables, which Tom has indicated he will handle. Yes. It looks like v10 is safe, and the vacuum-multiple-relations patch should help prevent any future logging issues caused by this. Discussion here: http://postgr.es/m/CAB7nPqRX1465FP%2Bameysxxt63tCQDDW6KvaTPMfkSxaT1TFGfw%40mail.gmail.com Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] patch for geqo tweaks
I have two patches to make the geqo initialization and mutation slightly better. The first adjusts the mutation swaps to avoid having to re-pick ties. The second changes the initialization and shuffling algorithm for the gene array to use an in-place Fisher-Yates shuffling algorithm. Diffs against commit 49124613f134b04594b1a5c46368eb0a5db16d4b (i.e. tip of master as of when I re-made the diff). On my system the patches pass a ./configure; make; make check -- nw >From c7f3c7cc37f943481b2358149210789be3d39ee9 Mon Sep 17 00:00:00 2001 From: Nathan Wagner Date: Sun, 21 Sep 2014 09:30:01 + Subject: [PATCH 1/2] cleanup geqo_mutation.c Avoid a possible random number collision by choosing random ranges better. This will potentially change the output of the algorithm since it avoids extra calls to geqo_randint(). --- src/backend/optimizer/geqo/geqo_mutation.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/backend/optimizer/geqo/geqo_mutation.c b/src/backend/optimizer/geqo/geqo_mutation.c index 1a06d49..c78bd2c 100644 --- a/src/backend/optimizer/geqo/geqo_mutation.c +++ b/src/backend/optimizer/geqo/geqo_mutation.c @@ -43,20 +43,18 @@ geqo_mutation(PlannerInfo *root, Gene *tour, int num_gene) int num_swaps = geqo_randint(root, num_gene / 3, 0); Genetemp; - while (num_swaps > 0) { swap1 = geqo_randint(root, num_gene - 1, 0); - swap2 = geqo_randint(root, num_gene - 1, 0); + swap2 = geqo_randint(root, num_gene - 2, 0); - while (swap1 == swap2) - swap2 = geqo_randint(root, num_gene - 1, 0); + if (swap2 >= swap1) + swap2++; temp = tour[swap1]; tour[swap1] = tour[swap2]; tour[swap2] = temp; - num_swaps -= 1; } } -- 2.5.0 >From 47979bb844eda3821f08c1f00afc081ee0a3a260 Mon Sep 17 00:00:00 2001 From: Nathan Wagner Date: Sun, 21 Sep 2014 13:02:48 + Subject: [PATCH 2/2] rewrote array shuffler for geqo_recombination This simplifies the array shuffling by doing it inplace and eliminating a memory allocation and array initialization. The output is identical in the sense that it still randomizes an array, but the actual shuffled array will be different. --- src/backend/optimizer/geqo/geqo_recombination.c | 32 +++-- 1 file changed, 9 insertions(+), 23 deletions(-) diff --git a/src/backend/optimizer/geqo/geqo_recombination.c b/src/backend/optimizer/geqo/geqo_recombination.c index 652fadc..b327776 100644 --- a/src/backend/optimizer/geqo/geqo_recombination.c +++ b/src/backend/optimizer/geqo/geqo_recombination.c @@ -37,31 +37,17 @@ void init_tour(PlannerInfo *root, Gene *tour, int num_gene) { - Gene *tmp; - int remainder; - int next, - i; + int i, j; - /* Fill a temp array with the IDs of all not-yet-visited cities */ - tmp = (Gene *) palloc(num_gene * sizeof(Gene)); - - for (i = 0; i < num_gene; i++) - tmp[i] = (Gene) (i + 1); - - remainder = num_gene - 1; - - for (i = 0; i < num_gene; i++) - { - /* choose value between 0 and remainder inclusive */ - next = geqo_randint(root, remainder, 0); - /* output that element of the tmp array */ - tour[i] = tmp[next]; - /* and delete it */ - tmp[next] = tmp[remainder]; - remainder--; + /* shuffle tour in-place, uses Fisher-Yates inside-out algorithm */ + for (i=0; i < num_gene; i++) { + j = geqo_randint(root, i, 0); + /* we could skip the compare and just assign */ + if (i != j) { + tour[i] = tour[j]; + } + tour[j] = (Gene) i+1; } - - pfree(tmp); } /* alloc_city_table -- 2.5.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] No Issue Tracker - Say it Ain't So!]
I don't have the original message for this thread, so I arbitrarily picked a message to reply to. Since what has been asked for is a bug *tracker*, and we already have a bugs mailing list, I put together something. I downloaded the archives for pgsql-bugs, and fed them into a database. This part was easy, since I have already written a pg backed usenet server and had the code handy for storing and parsing out bits of rfc 2822 messages. It's dirt simple. If the system sees a message with 'Bug #(\d+)' in the subject line, it creates an entry in a bugs table with that bug number (if needed), and then marks the message as belonging to that bug. If there seems to be metadata about the bug in the format of the (unquoted) Bug reference: Logged by: Email address: PostgreSQL version: Operating system: Description: Details: it pulls that out and puts it in the bugs table. There's also an "open" boolean in the table, defaulting to true. The results can be found at https://granicus.if.org/pgbugs/ Ok. So now we have a bug tracker, but... Some open questions that I don't think have really been addressed, with my commentary interspersed:

1: Can a bug be more than "open" or "closed"? I think yes. At least we probably want to know why a bug is closed. Is it not a bug at all, not our bug, a duplicate submission, a duplicate of another bug, something we won't fix for some reason (e.g. a bug against version 7)?

2: Who can declare a bug closed? Ugh. I'm going to close some of them if it seems obvious to me that they should be closed. But what if it's not obvious? I could probably maintain it to some extent, but I don't know how much time that would actually take. Related to the next point, it probably makes sense to just close up front bugs that are marked against unsupported pg versions, or haven't had any activity for too long, perhaps two years. Just closing bugs with no mailing list activity for two years closes 5280 of 6376 bugs.

3: How far back should I actually import data from the bugs list? I have imported each archived month from December of 1998. It looks like the bug sequence was started at 1000 in December of 2003. Emails with no bug id in the subject line don't get associated with any bug; they're in the DB but not really findable.

4: What should I do with emails that don't reference a bug id but seem to be talking about a bug? I suggest we do nothing with them as far as the bug tracker is concerned. If people want to mark their message as pertaining to a bug, they can put that in the subject line. However, I don't think a bug id can be assigned via email, that is, I think you have to use a web form to create a bug report with a bug id. Presumably that could change if whoever runs the bug counter wants it to.

5: How can we use email to update the status of a bug? I suggest using email headers to do this. 'X-PGBug-Fixed: ' and the like. I assume here that everyone who might want to do such a thing uses an MUA that would allow this, and they know how.

6: Does there need to be any security on updating the status? Probably not. I don't think it's the sort of thing that would attract malicious adjustments. If I'm wrong, I'd need to rethink this. I realize I'm making security an afterthought, which makes my teeth itch, but I think layers of security would make it much less likely to be actually adopted.

Just to be clear, this is both a work in progress and a proof of concept. It's slow, it's ugly. I haven't created any indexes, written any css or javascript, or implemented any caching.
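In rough terms, the tables behind this look something like the following (a simplified sketch; the names and columns here are illustrative rather than an exact dump of the schema):

create table bugs (
    bugid       bigint primary key,   -- the NNNN from 'Bug #NNNN' in the subject
    open        boolean not null default true,
    status      text default 'new',
    logged_by   text,
    email       text,
    pg_version  text,
    os          text,
    description text
);

create table emails (
    msgid  text primary key,                -- Message-ID header
    bug    bigint references bugs (bugid),  -- null if no bug id in the subject
    sent   timestamptz,                     -- parsed Date header
    msg    text                             -- raw rfc 2822 message
);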
I may work on that before you read this though. Comments are welcome, and no, I don't really expect that this will be what gets adopted, mainly I wanted to show that we can probably just build something rather effective off our existing infrastructure, and I had Saturday free (as of this writing, I've got maybe 5 hours into it total, albeit with lots of code re-use from other projects). There are some obvious todo items, filtering and searching being the most salient. Some data import issues: March 10, 2003, bad Date header, looked like junk anyway, so I deleted it. Time zone offsets of --0400 and -0500 (at least), I took these as being -0400 (or whathaveyou). Date header of Sat, 31 May 2008 12:12:18 +1930, judging by the name on the email, this is probably posted from Venezuela, which switched to time zone -0430 in 2007, which could also be thought of as +1930 if you ignore the implied date change. And, by way of some statistics, since I've got the archives in a database: Emails: 43870 Bugs: 6376 Distinct 'From' headers: 10643 The bugs have 3.5 messages each on average, with 2 being the most common number, and 113 at the most, for bug 12990. 1284 bugs have only one message associated with them. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] No Issue Tracker - Say it Ain't So!]
On Sun, Oct 04, 2015 at 04:30:49PM -0700, Josh Berkus wrote: > That would be the key part, wouldn't it? Nice that you have [code to > store and parse email messages]. Yeah. It actually made most of the work pretty easy. It's available with a bunch of other code at https://pd.if.org/git/ if anyone wants it. I did find a bug in my header processing though, so I'll need to commit that fix. > We'd also want a way to link a bug fix to a commit, and probably a way > to give the bug a list of searchable keywords (and add to that list). I've been thinking of hooking it up to the fti machinery and providing a search box. I've never really used fti before, so this might be a good opportunity to learn it for real. > > it probably makes sense to just close up front bugs that are marked > > against unsupported pg versions, or haven't had any activity for too > > long, perhaps two years. > I'm reluctant to close all of those unexamined, since part of the > purpose of this is to find bugs which were never fixed. Probably we > should organize a posse to comb trhough all of the old bugs and > hand-close them. I'm doing some of that as I poke at the bugs pages. Perhaps it would make sense to have a closed status of 'stale' or the like (perhaps that's what you meant by 'timed out') which could be used to get bugs out of the main list but still be marked as 'not human reviewed'. These could then be reviewed by the posse. > Yeah, fixing this [email's without a bug id] would probably be tied to > the possible change to mailman. Unless someone already has a way to > get majordomo to append a bug ID. How are bug ids assigned now? From the evidence, I assume there is a sequence in a database that the web submission form queries to format a resulting email to the bugs mailing list. Do the mailing lists live on the same server? If so, perhaps it would be easy to assign a new bug id to a new thread on -bugs. But perhaps that's too aggressive in creating bugs. > > 5: How can we use email to update the status of a bug? > > > > I suggest using email headers to do this. 'X-PGBug-Fixed: > > ' and the like. I assume here that everyone who might > > want to do such a thing uses an MUA that would allow this, and they > > know how. > > I guess that depends on who we expect to use this, at least for > closing stuff. I could certainly support more than one mechanism. A web interface for those who would prefer such and emails would seem to be the basic requirements. > > 6: Does there need to be any security on updating the status? > > > > Probably not. I don't think it's the sort of thing that would > > attract malicious adjustments. If I'm wrong, I'd need to rethink > > this. I realize I'm making security an afterthought, which makes my > > teeth itch, but I think layers of security would make it much less > > likely to be actually adopted. > > I think there needs to be some kind of administrative access which > allows, for example, an issue to be closed so that it can't be > reopened. I guess technically we have that now since I'm the only one who can close or open a bug in the db I've set up :) Seriously though, I think it probably makes the most sense to tie the system in with the existing pg community accounts if it goes that far. > Anyway, I'm not convinced we want to reinvent this particular wheel, but > if we do, you've done a yeoman's job. Thank you. (Assuming I've interpreted the phrase correctly, I'm not familiar with it, and a web search was only semi-enlightening). 
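Mechanically, marking bugs stale is nothing more than a bulk update keyed off the newest message on each bug; a sketch of the sort of query involved (assuming a status column on the bugs table and a parsed Date on each stored message, which may not match the real schema exactly):

update bugs b
   set open = false,
       status = 'stale'
 where b.open
   and not exists (select 1
                     from emails e
                    where e.bug = b.bugid
                      and e.sent > now() - interval '2 years');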
-- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] bugs and bug tracking
So, in order to do some clean up and see how my pgbugs page (https://granicus.if.org/pgbugs/) might actually work, I've been going through bugs and marking their status. A lot of questions arise.

A lot of the reports aren't bugs at all, but requests for help. My guess is that the users either don't know where to ask or don't understand the difference between a bug and not knowing how to do what they want to do. Perhaps a more thorough explanation on the submission form would be useful.

What is the difference between a bug and a feature request? Ok, I know the difference, but do we want to treat them differently? For example, see bug 9457 (https://granicus.if.org/pgbugs/9457). As Pavel Stehule noted, this isn't a *bug* per se, but what should we do with it? Close it as 'not a bug'? I don't like this because it's not really the same as the other 'not a bug's. Create another 'closed' status of 'feature request'? Except that if we decide to implement the feature, in some sense it becomes a bug until we actually implement it. Create an 'open' status of 'feature request' to mark it until it is either implemented or rejected? At least then it's tracked. This last choice is my preference.

I conflate open bugs in the sense of 'not closed so we still need to do something with the bug even if it is just closing it' and open bugs in the sense of 'this seems to actually be a bug in postgres'. I'm not sure what terminology I should use.

I have lots of different types of 'not a bug'. Not a bug, the user should have posted to a different list. (e.g. 13602) Not a bug, probably user error, which is similar to the above. Not a bug, but a bug in the libraries we use (e.g. openssl, 10184) Not a bug, works as intended, i.e. the user didn't make a mistake, but had an incorrect notion of what it was supposed to do. (11596) Not a bug, but the user never got a reply. That is, I decided personally that this wasn't actually a bug. (13367) And bug 1000 is not a bug, system test. Do we care about the difference between any of these? I track them differently in my update script, but they all get the same status in the db.

Can a bug be 'fixed' if there's no actually identifiable commit that fixes the bug? See 13516, which Tom Lane claims was fixed in 9.1. A grep for 13516 of the git log for both master and REL9_1_STABLE doesn't turn up anything. I can't as yet figure out how to match up git commit messages to identify every branch in which a fix was applied. I could of course load all of the commit messages into a table and compare them that way.

Should I mark as "open" (i.e. not "new") any report that has a response? More than one response? That would at least distinguish between bug reports that at least someone, in theory, has taken a look at, and those that haven't been addressed at all.

I have created the following statuses:

Fixed - bug is presumably fixed

Unreproducible - we can't make the system demonstrate this error

Timed Out - the reporter was asked to provide more information and didn't respond for a while. I would probably suggest somewhere around a month for this. Should there be a 'waiting on feedback' to mark the 'pre timed out' phase?

Stale 5281 - the bug hasn't had any activity for >2 years, so just close it. If people want to troll through these to give them a better status, that would probably be good, but there's still a thousand open bugs newer than that.

Not Our Bug - looks like a bug, but it's not a bug in postgres. What exactly are our bugs? Just what you'd get out of the release tarballs or the git repo? Or is there more?

Not a Bug - see above discussion

Won't Fix - this is arguably a bug, but for some reason we're not going to fix it. Perhaps we intentionally violate the standard, or it's a bug against a version we don't support, or we're not going to backpatch it.

Open - this seems to be a bug, or at least we're talking about it and it's not where we want to close it. Note of course that "closing" a bug just means it will show up somewhere else in my tracker; obviously it doesn't affect the mailing list at all.

New - this hasn't been looked at enough for someone to change the status to something better.

I don't have a "reopened" status. I'm not sure what it means, other than it used to be closed, but someone changed it back to open. I don't immediately see why we would want to distinguish between this and a regular open bug, other than perhaps as a way of conflating status with priority. It's easy to make one though if people really want it. I probably have too many statuses already.

I will post later on my thoughts on how to control the system. Are people, in principle, willing to put magic incantations in their emails and commit messages? I'm not asking for a commitment to my tool here, I'm just exploring what the bounds of people's, and committers' in particular, willingness to adjust their workflow or message texts a bit to make it easier to automate bu
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 12:04:11PM -0500, Jaime Casanova wrote: > I like how this page is looking now, good work. Thank you. > Now, AFAIU from previous mails part of the reason to have a bug > tracker is to make easy to know when a bug was fixed. Which should be > interpreted as: which versions this bug affected? and which minor > versions on those branches fix the issue > > for example bug # 13636 was reported for 9.4.4 but it existed in older > branches so Tom fixed it in all active branches (ie: > http://www.postgresql.org/message-id/e1zfjgx-0005lu...@gemulon.postgresql.org). > is it possible (even if not yet implemented) to add that information? It is possible. I don't know yet how easy it will be to automate it for all the back patches. I think I can look for matching commit messages but I haven't written any code yet. It might be possible to infer this information after the fact by looking at where on which branches a commit fix was applied. > also i like that we can search on bugs but we can filter by version. > i'm just guessing that if someone looks for a bug he's probably > worrying about the specific version he is using. I'll probably get to adding filtering soon-ish. Maybe even today. I haven't decided if I want to do that on the server side, or add some javascript to let the user sort and filter whatever the page has already returned. I'm not really a web guy, so it takes me a while to figure out what to do. > having a bug tracker that allow us to not lose track of bugs is good, > having a bug tracker that allow us to have the whole information on a > bug is better, IMHO. I agree. It's just a matter of somehow automating the process. I'm under no illusions though that I have any control over things. I'm hoping that one or more of the committers will say "we'd like to do it this way" and I'll work with that. Personally, I'd like to see either '[Fixes #12345]' anywhere in a commit message, or 'Fixes: #12345' at the beginning of a line. But it has to come from them. Also, the version numbers are user reported and a bit of a mess. I don't think they could really be relied on as is for users trying to find out if a bug affects their version. Someone would have to update that information, and communicate that update to the tracker. The general concensus seems to be that magic phrases at the beginning of a line in a message body is the way to go. I don't necessarily agree, but any consistent system can be made to work. This obviously applies to any metadata, not just version numbers. > > A lot of the reports aren't bugs at all, but requests for help. My > > guess is that the users either don't know where to ask or don't > > understand the difference between a bug and not knowing how to do what > > they want to do. Perhaps a more thorough explaination on the submission > > form would be useful. > > > > the real problem here is that fill the bug report doesn't force you to > register in a ML, while asking for help in a ML will. and a lot of > people don't want to register in a ML, they just want a specific > question answered so i don't think any change in the form will avoid > that. True. Perhaps we could provide another form for other lists. Probably tilting at windmills here, but it would be nice if we could cut down on the amount of time taken up by "this isn't a bug, you should go ask down the hall". -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
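To make the machine-parsing point above concrete: once commit messages are loaded into a table, extracting a 'Fixes: #12345' style reference is only a query along these lines (the commits table and its columns are hypothetical, not anything that exists today):

select c.branch, c.commit_hash,
       (regexp_matches(c.message, 'Fixes:\s*#(\d+)'))[1]::int as bugid
  from commits c
 where c.message ~ 'Fixes:\s*#\d+';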
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 10:57:42AM -0700, Josh Berkus wrote: > On 10/06/2015 10:17 AM, Bruce Momjian wrote: > > Therefore, our current default behavior is to ignore user reports, > > unless someone takes an action to reply, record, or retain the email for > > later review. What a tracker does is to make the default user report be > > _retained_, meaning we have to take action to _not_ retain a user report > > as an open item. > > Well, we can determine how that's handled. There are bug trackers out > there that automatically archive unconfirmed bug reports after a certain > amount of time. I'd personally recommend it. > > Of course, that requires a bug tracker which can have an "unconfirmed" > status. This is essentially what I have done with the 'Stale' status. Though I have done at two years to be conservative, rather than 60 days, which I think is entirely more reasonable. > > Second, we have a mix of user reports. Some bug reports are not bugs > > and must be reclassified. In other cases, uses ask questions via > > non-tracked communicate channels, e.g. pgsql-general, but they are > > really bugs. So, to do this right, we need a way of marking tracked > > bugs as not bugs, and a way of adding bugs that were reported in a > > non-tracked manner. > > Yeah, I was wondering about that. I think I have suggested that there be a way to generate a bug id via email. Presumably someone could just copy that email address to make a not-tracked discussion get a bug id. If the system archived all the lists (not hard) it would be possible to pull the other emails from the thread into the bug (also not hard). As for marking as 'not-a-bug' this can easily be done via whatever mechanism might be used. Something along the lines of: Bug Status: not a bug would probably work. It's really whatever people are willing to actually do. > FWIW, when I talk about bugs which we lost track of, they're not > generally unconfirmed bug reports. Usually, it's stuff which a > contributor replied to, but the bug was low-impact, circumstantial, and > hard to reproduce, and came in during a busy period (like release time). > So I'd be perfectly OK with the idea that unconfirmed bugs hang around > in the system for 60 days, then automatically convert to "stale" status. My tracker would do this trivially if I changed the query to 60 days instead of two years. > Until we build up a team of volunteers for bug triage, we might have to > do that. > > Speaking of which ... this project is rich in skilled users who are > involved in the community but don't code. Bug triage is exactly the > kind of thing very part-time community supporters can do, if we make it > easy for them to do. I can make it easy. But that doesn't directly address two of the other points: 1: Do we need any system for who is allowed to triage bugs? 2: Should an equivalent email be sent to the list? Specifically with point number 2, suppose the above mechanism is used. When a triager marks a bug as (say) not a bug, should the system just update the database, or should it actually send a formatted email to the bugs list with the 'Bug Status: not a bug' line (among others, presumably)? I think it should send the email, but I can see how that could be construed as junk. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
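As a sketch of how a 'Bug Status:' directive could be pulled out of the stored messages (the emails table and body() helper follow the names the tracker's import code already uses; the exact phrase is whatever gets agreed on):

select e.bug,
       btrim((regexp_matches(body(e.msg), 'Bug Status:\s*([^\n]+)'))[1]) as new_status
  from emails e
 where e.bug is not null
   and body(e.msg) ~ 'Bug Status:';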
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 06, 2015 at 01:17:48PM -0400, Bruce Momjian wrote: > I do think we rushed to choose a tracker a little too quickly. Have we chosen one? > Let me explain, from a high level, what a new tracker will change in > our workflow. [snip] I won't quote your whole message, which I essentially agree with. Let me say that the questions I have brought up have several purposes. One, I think it's important to identify what exactly we're after. I hope my questions have helped put some light on that. Two, I think any attempt to tell the developers and committers that they need to change their workflow to adapt to some system is bound to fail, so I have asked: just what changes would you all be willing to actually *do*? Tom Lane is pretty good at noting a bug number in his commit messages, for example. Would he be willing to modify that slightly to make it easier to machine parse? Would you be willing to add a bug number to your commit messages? I'm not asking for guarantees. Actually I'm not really asking for anything, I'm just trying to figure out what the parameters of a solution might be. If the answer to that is "no, I'm not willing to change anything at all", that's fine, it just colors what might be done and how much automation I or someone else might be able to write. I think even with a bug tracker the default "ignore" behavior can still be done. In principle, we could even mark bugs as "unconfirmed" or "logged" or something right away and only mark them as new or open or something if they actually draw a reply. We could even require a reply from a committer if that was wanted. If I haven't made it clear by now, I am trying to write a system that requires the absolute minimal amount of change to the existing way of doing things. As I noted in my original email, I've put together a bug tracker, not a ticket system. If people would like to make some small adjustments to make it easier to automate a bug tracker, that would be great, but if not, that's fine too, it's no *worse* than what we already have. And if people really wanted a ticket system, it wouldn't be hard to enhance a tracker. > My point is that we have our current workflow not because we are > idiots, but because it fit our workflow and resources best. I am not > sure if we have succeeded because of our current non-retain mode, or > in spite of it. It might be time to switch to a default-retain mode, > especially since most other projects have that mode, but we should be > clear what we are getting into. I think having a bug tracker and retention vs non-retention are orthogonal questions. But I agree that knowing what might be involved is a good idea. I think perhaps one of the problems is that different people want different things, so it's probably going to be hard to make everyone happy. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bugs and bug tracking
I have added full text searching to my tracker. I only index the first 50 KB of each message. There's apparently a one MB limit on that anyway, which a few messages exceed. I figure anything important is probably in the first 50KB. I could be wrong. I could re-index fairly easily. It seems to work pretty well. I have changed the 'stale' status to 90 days. I have triaged most of the remaining bugs with a message count > 1. This is in a separate file in case some other system wants to collect the work. I am still working out exactly how to find multiple matching commits along different branches. I'm running git patch-id on every commit, but that will take a while. Later tonight I will implement some in-message-body update syntax. I'm also going to create (internally) a web page for the bugs that allows editing the status from that page. I will post the details on that once I've tested it a bit. I'll probably make a web page allowing updates as well. How often are the bugs mailing list archives updated? I've got a bunch of posts to the bugs list in my inbox that aren't in the archive download. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
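Once the patch-ids are collected, finding a fix that was applied to several branches should reduce to a grouping query, roughly as follows (assuming one row per commit per branch in a hypothetical commits table):

select patch_id,
       array_agg(distinct branch order by branch) as branches
  from commits
 group by patch_id
having count(distinct branch) > 1;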
Re: [HACKERS] bugs and bug tracking
On Wed, Oct 07, 2015 at 03:06:50PM -0400, Stephen Frost wrote: > * Nathan Wagner (nw...@hydaspes.if.org) wrote: > > I have added full text searching to my tracker. I only index the first > > 50 KB of each message. There's apparently a one MB limit on that > > anyway, which a few messages exceed. I figure anything important is > > probably in the first 50KB. I could be wrong. I could re-index fairly > > easily. It seems to work pretty well. > > Note that we have FTS for the -bugs, and all the other, mailing lists.. True, but that finds emails. The search I have finds bugs (well, bug reports anyway). Specifically, I have the following function: create or replace function bugvector(bugid bigint) returns tsvector language 'sql' as $$ select tsvagg( setweight(to_tsvector(substr(body(msg), 1, 50*1024)), 'D') || setweight(to_tsvector(header_value(msg, 'Subject')), 'C') ) from emails where bug = $1 $$ strict; which, as you can see, collects into one tsvector all the emails associated with that particular bug. So a search hit is for the whole bug. There's probably some search artifacts here. I suspect a bug with a long email thread will be ranked higher than a one with a short thread. Perhaps that's ok though. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
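A search against that is then the usual tsquery match, something like the following (the bugs table and its bugid column are names chosen for illustration):

select b.bugid,
       ts_rank(bugvector(b.bugid), q) as rank
  from bugs b,
       plainto_tsquery('vacuum full hangs') q
 where bugvector(b.bugid) @@ q
 order by rank desc
 limit 25;

In practice the tsvector is worth caching in a column (and GIN-indexing) rather than recomputing per row, since bugvector() re-aggregates all of a bug's messages on every call.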
Re: [HACKERS] bugs and bug tracking
On Tue, Oct 20, 2015 at 10:39:55AM -0700, Joshua D. Drake wrote: > So where are we at on this? Well, I can't speak to where we are, but my system is up, running, and seems to work well. It even attracts a few visitors. I have been meaning to write a triage interface, but I have been stuck doing postgis work for an anthropology paper I am working on for the last week. I hope to be able to take a break from that tomorrow and through the weekend. I have read the various comments, and will also do some simple scanning for phrases in emails and commit messages indicating status changes, which I will reflect on the main web page. Once I write that code, I will email the list with what I have done. If people want to use it, or criticize it, they would be welcome to do so. If not, well, the interface is useful for me in any event, so I will probably maintain it for the foreseeable future. https://granicus.if.org/pgbugs/ for anyone who hasn't seen it and wants to take a look. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] fortnight interval support
On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be cool. Patch attached... :) % psql -p 5433 -d template1 -h localhost psql (9.4.5, server 9.6devel) WARNING: psql major version 9.4, server major version 9.6. Some psql features might not work. Type "help" for help. template1=# select current_date; date 2015-10-27 (1 row) template1=# select '1 fortnight'::interval; interval -- 14 days (1 row) template1=# select current_date + '1 fortnight'::interval; ?column? - 2015-11-10 00:00:00 (1 row) template1=# select current_date + '1.3 fortnight'::interval; ?column? - 2015-11-14 04:48:00 (1 row) template1=# select current_date + '1.3 fortnights'::interval; ?column? - 2015-11-14 04:48:00 (1 row) -- nw diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 926358e..2032fe0 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -186,6 +186,8 @@ static const datetkn deltatktbl[] = { {DDECADE, UNITS, DTK_DECADE}, /* "decade" relative */ {"decades", UNITS, DTK_DECADE}, /* "decades" relative */ {"decs", UNITS, DTK_DECADE},/* "decades" relative */ + {DFORTNIGHT, UNITS, DTK_FORTNIGHT}, /* "fortnights" relative */ + {"fortnights", UNITS, DTK_FORTNIGHT}, /* "fortnights" relative */ {"h", UNITS, DTK_HOUR}, /* "hour" relative */ {DHOUR, UNITS, DTK_HOUR}, /* "hour" relative */ {"hours", UNITS, DTK_HOUR}, /* "hours" relative */ @@ -3281,6 +3283,12 @@ DecodeInterval(char **field, int *ftype, int nf, int range, tmask = DTK_M(DAY); break; + case DTK_FORTNIGHT: + tm->tm_mday += val * 14; + AdjustFractDays(fval, tm, fsec, 14); + tmask = DTK_M(WEEK); + break; + case DTK_WEEK: tm->tm_mday += val * 7; AdjustFractDays(fval, tm, fsec, 7); diff --git a/src/include/utils/datetime.h b/src/include/utils/datetime.h index e9a1ece..3641292 100644 --- a/src/include/utils/datetime.h +++ b/src/include/utils/datetime.h @@ -52,6 +52,7 @@ struct tzEntry; #define DHOUR "hour" #define DDAY "day" #define DWEEK "week" +#define DFORTNIGHT "fortnight" #define DMONTH "month" #define DQUARTER "quarter" #define DYEAR "year" @@ -181,6 +182,7 @@ struct tzEntry; #define DTK_TZ_MINUTE 35 #define DTK_ISOYEAR36 #define DTK_ISODOW 37 +#define DTK_FORTNIGHT 38 /* diff --git a/src/test/regress/expected/interval.out b/src/test/regress/expected/interval.out index c873a99..7a72f2a 100644 --- a/src/test/regress/expected/interval.out +++ b/src/test/regress/expected/interval.out @@ -40,6 +40,12 @@ SELECT INTERVAL '1.5 weeks' AS "Ten days twelve hours"; 10 days 12:00:00 (1 row) +SELECT INTERVAL '1 fortnight' AS "Fourteen days"; + Fourteen days +--- + 14 days +(1 row) + SELECT INTERVAL '1.5 months' AS "One month 15 days"; One month 15 days --- diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql index 789c3de..285266a 100644 --- a/src/test/regress/sql/interval.sql +++ b/src/test/regress/sql/interval.sql @@ -12,6 +12,7 @@ SELECT INTERVAL '-08:00' AS "Eight hours"; SELECT INTERVAL '-1 +02:03' AS "22 hours ago..."; SELECT INTERVAL '-1 days +02:03' AS "22 hours ago..."; SELECT INTERVAL '1.5 weeks' AS "Ten days twelve hours"; +SELECT INTERVAL '1 fortnight' AS "Fourteen days"; SELECT INTERVAL '1.5 months' AS "One month 15 days"; SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years..."; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Tue, Oct 27, 2015 at 12:04:55PM -0500, Merlin Moncure wrote: > On Tue, Oct 27, 2015 at 8:52 AM, Nathan Wagner wrote: > > On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > >> Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be > >> cool. > > > > Patch attached... > This is very cool (you are 100% certain there are no performance > impacts on current cases, right?)! :-) It passed the regression test. It must be perfect :) -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Wed, Oct 28, 2015 at 08:17:25AM +1300, Gavin Flower wrote: > You trying to get PostgreSQL banned in France??? :-) > > When I was learning French many years ago, I was told that the French > consider their fortnight to be 15 days!!! What, it's a "fortnight", not a "quinzaine". You have no idea how hard it was to resist updating the patch... -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fortnight interval support
On Tue, Oct 27, 2015 at 01:52:11PM +, Nathan Wagner wrote: > On Mon, Oct 26, 2015 at 01:58:52PM -0400, Robert Haas wrote: > > Aw, you're no fun. select '1 fortnight'::interval => '14 days' would be > > cool. > > Patch attached... This isn't necessarily bad, but I observe that it would be difficult or impossible to add fortnight support to intervals as an extension rather than by modifying the scanner tables that the interval parser uses at original compile time. You might be able to override symbols with a C extension, but parts of the parser are static (in the C sense), so you'd need to override and duplicate a lot of the existing functions. Of course, a hookable interval parser is absurd in the first place. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add EXTRA_CFLAGS to configure
On Wed, Oct 28, 2015 at 02:42:19PM +0100, Robert Haas wrote: > On Wed, Oct 28, 2015 at 2:17 PM, Andres Freund wrote: > >> I use COPT for this purpose. > > > > Unless I miss something you can't just pass that to configure though, > > right? I.e. it has to be passed to each make invocation? > > What I do is: > > echo COPT=-Wall -Werror > src/Makefile.custom Make will pick up variables from the environment. So unless the makefile goes out of its way to circumvent it, you can just do COPT=-Werror export COPT and then run your usual configure/compile cycle. There's no specific need to modify the makefiles or pass extra arguments into make. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] September 2015 Commitfest
On Sat, Oct 31, 2015 at 08:03:59AM +0100, Robert Haas wrote: > +1. FWIW, I'm willing to review some patches for this CommitFest, but > if the committers have to do first-round review as well as > committer-review of every patch in the CommitFest, this is going to be > long, ugly, and painful. We need to have a substantial pool of > non-committers involved in the review process so that at least some of > the work gets spread out. As a non-committer, let me offer my thoughts. First, I would ask that every patch include a commit id that the patch was generated against. For example, I was looking at the "group command option for psql" patch. I created a branch off of master, but the patch doesn't apply cleanly. On further investigation, it looks like Adam Brightwell noted this on September 21, but the patch hasn't been updated. If I knew which commit id the patch was created against, I could create a branch from there and test the patch, but without it, I need to figure that out, which is more work, or I need to re-create the patch, which is also more work. Second, it would be convenient if there were a make target that would set up a test environment. Effectively do what 'make check' does, but don't run the tests and leave the database up. It should probably drop you into a shell that has the paths set up as well. Another target should be available to shut it down. So, what would be cool, and would make testing easier, is if I could do a 'git checkout -b feature abcdef' (or whatever the commit id and branch name you want to use are), and then from there: make; make check; make testrig; make testshutdown. These two changes would go a long way to making the process of actually doing a review easier. Back to the patch in question: Mr Brightwell noted that the patch doesn't apply against master. Should someone then mark the patch as waiting on author? Is failing to apply against master a 'waiting on author' cause? Is the answer different if the patch author has supplied a commit id that the patch was generated from? There was then some further discussion on the interface, and what to do with startup files, and nothing was really decided, and then the thread petered out. This potential reviewer is then left with the conclusion that this patch really can't be reviewed, and it's not clear whether it's even wanted as is anyway. So I move on to something else. There was a bunch of discussion by a bunch of committers, and no one updated the status of the patch in the commitfest, and there are no clear guidelines on what the status should be in this case. If I were needing to decide, I would say that the patch should either be marked as "Waiting on Author" on the grounds that the patch doesn't currently apply, or "Returned with feedback" on the grounds that there was unaddressed feedback on the details of the patch, and it was noted as a "proof of concept" when submitted anyway. But I'm unwilling to just change it to that without more clear definitions of the meaning of each status. A link to definitions and when the status should be changed would help. What is "ready for committer" anyway? It's clearly not "a committer will apply the patch", since things sit in that status without being committed. If I think the patch is good and should be applied, do I mark it as ready for committer, and then later a committer will also review the patch? If so, is doing anything other than the sanity checks, and possibly offering an opinion, on the patch even useful?
-- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] September 2015 Commitfest
On Sat, Oct 31, 2015 at 12:08:58PM -0400, Tom Lane wrote: > Nathan Wagner writes: > > Second, it would be convenient if there were a make target that would > > set up a test environment. Effectively do what the 'make check' does, > > but don't run the tests and leave the database up. It should probably > > drop you into a shell that has the paths set up as well. Another > > target should be available to shut it down. > > As far as that goes, I don't think it's really the makefiles' place to > establish a manual-testing convention. What I do, and what I think > most other longtimers do, is create test installations in nondefault > places. [snip description on how to set this up ] > You could imagine putting something into the standard makefiles > that did some subset of this, but I think it would be too rigid > to be useful. I think it would be very useful to just be able to tell the system "fire this up for me so I can test it". I don't think it needs to handle every possible testing scenario, just making it easier to leave up the test postmaster from make check would be very useful, at least to me. > As an example, what if you wanted to compare the behaviors of both > unmodified HEAD and your patched code? It's not very hard to set up > two temporary installations along the lines of the recipe I've just > given, but I can't see the makefiles handling that. They could pick up make or environment variables. We already do that for psql. Something like PGPORT=5495 PGPATH=~/pg95 make startit or some such. I'm not actually proposing this, I'm just noting how the makefiles could handle it fairly easily. All I'd really like is a way to leave the database used for 'make check' running so I can do any additional poking around by hand that I might want to do more easily. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Wed, Nov 04, 2015 at 12:51:52PM -0500, Tom Lane wrote: > Nathan Wagner writes: > > I have two patches to make the geqo initialization and mutation > > slightly better. > > > The first adjusts the mutation swaps to avoid having to re-pick > > ties. The second changes the initialization and shuffling algorithm > > for the gene array to use an in-place Fisher-Yates shuffling > > algorithm. > > I took a quick look at this. > > I'm not very impressed with the first patch: it might save a few > geqo_randint() calls, but it seems to do so at the price of making the > swap choices less random --- for instance it sure looks to me like the > last array element is now less likely to participate in swaps than > other elements. Unless you can prove that actually the swapping is > still unbiased, I'm inclined to reject this part. If I have understood the original code correctly, we need to select two different random integers between 0 and num_gene-1, inclusive. That happens to be num_gene possible results. Having chosen the first one, which I will call "swap1", we now only have num_gene-1 possible results, which need to range from either 0 to swap1-1 or from swap1+1 to num_gene-1, which is num_gene-1 possible results. I treat this as a single range from 0 to num_gene-2 and generate a number within that range, which I will call "swap2". If swap2 is between 0 and swap1-1, it is in the first range, and no adjustment is necessary. If it is greater than or equal to swap1, then it is in the second range. However the generated swap2 in the second range will be between swap1 and num_gene-2, whereas we need it to be between swap1+1 and num_gene-1, so I add one to swap2, adjusting the range to the needed range. It would be equivalent to set swap2 to num_gene-1 in that case, effectively remapping the first value to the last, but an increment was more intuitive to me. > As for the second part, I had to look up Fisher-Yates ;-) but after > having read Wikipedia's entry about it I think this is a good change. > The code's shorter and more efficient, and it should mathematically > provide an equally-unbiased initial shuffle. It could do with a > better comment, and I'd be inclined to handle the first element > outside the loop rather than uselessly computing geqo_randint(0,0), I could do that. It will make the code slightly harder to read. I wonder if it would be worth having geqo_randint() handle the special case instead. > Having said that, though, I believe that it's also probably a > *different* initial shuffle, which may well mean that GEQO gives > different plans in some cases. Yes, I would expect it to be different in the general case. I think my commit message noted that, but perhaps it could be more explicit. > That doesn't bother me as long as we only make the change in HEAD, but > does anyone want to complain? I don't see any need to backport either of these patches. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Wed, Nov 04, 2015 at 12:51:52PM -0500, Tom Lane wrote: > As for the second part, I had to look up Fisher-Yates ;-) but after > having read Wikipedia's entry about it I think this is a good change. > The code's shorter and more efficient, and it should mathematically > provide an equally-unbiased initial shuffle. It could do with a > better comment, and I'd be inclined to handle the first element > outside the loop rather than uselessly computing geqo_randint(0,0), > but those are trivial changes. I see you committed a modified version of my patch in commit 59464bd6f928ad0da30502cbe9b54baec9ca2c69. You changed the tour[0] to be hardcoded to 1, but it should be any of the possible gene numbers from 0 to remainder. If you want to pull the geqo_randint(0,0) out of the loop, it would be the last element, not the first (i.e. where remainder == 0). We might be able to just skip the last swap, and the loop could be for (i=0; i < num_gene-1; i++) { but I'd need to re-read the details of the Fisher-Yates algorithm to be sure. It may be that the last swap needs to happen for the shuffle to be fully random. In any case, tour[0] certainly shouldn't be hardcoded to 1. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 11:19:00AM -0500, Tom Lane wrote: > Nathan Wagner writes: > > I see you committed a modified version of my patch in commit > > 59464bd6f928ad0da30502cbe9b54baec9ca2c69. > > > You changed the tour[0] to be hardcoded to 1, but it should be any > > of the possible gene numbers from 0 to remainder. > > How so? The intent is to replace the first iteration of the > Fisher-Yates loop, not the old loop. That iteration will certainly > end by assigning 1 to tour[0], because it must choose j = i = 0. You are correct. I got confused between reading the original code, my patch, and your modified patch. I wonder why the algorithm bothers with the first iteration at all, in the case of an initialized array, it would just swap the first element with itself. I must be missing something. I'll need to do some more reading. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 11:45:38AM -0500, Tom Lane wrote: > However, really the whole argument is moot, because I notice that > geqo_mutation() is only called in the "#ifdef CX" code path, which > we don't use. I suppose someone could turn it on via a compiler define. > So there's little point in improving it. No, probably not. > (There's a fair amount of dead code in /geqo/, which I've never had > the energy to clean up, but maybe we should do that sometime. It > seems unlikely that anyone will ever be interested in experimenting > with the ifdef'ed-out code paths.) I also note that in src/backend/optimizer/path/allpaths.c there is a join_search_hook variable apparently intended for plugins (extensions?) to be able to control the search path optimizer. And the geqo code is AFAICT turned off by default anyway, so none of the code is used in probably the vast majority of systems, with standard_join_search() being called instead. Would it be worth either of removing at least the non-ERX portions of the geqo code, or removing the geqo code entirely (presumably with a deprecation cycle) and moving it to an extension? If there's any interest, I can work up a patch for either or both. There is only one test in the regression suite that turns on geqo that I could find. It's labeled "check for failure to generate a plan with multiple degenerate IN clauses" in join.sql. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for geqo tweaks
On Fri, Nov 06, 2015 at 02:16:41PM -0500, Tom Lane wrote: > Uh, what? It's not by any means turned off by default. > > postgres=# select name,setting from pg_settings where name like '%geqo%'; > name | setting > -+- > geqo| on [snip] My day to make a fool of myself in public I guess. You're right of course. I can only plead distraction by having too many projects in mind at once and not focusing properly. Sorry for taking up your time on things I should have checked better. > I'm inclined to think that removing all the ifdefd-out-by-default logic > would be a fine thing to do, though. I'll work up a patch. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
> I actually don't know much about the I/O subsystem, but, no, WAL is > not separated from data. I believe $PGDATA is on a SAN, but I don't > know anything about its characteristics. The computer is here: http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm $PGDATA is on a 5 disk SATA software raid 5. Is there anything else that would be useful to know? ( Sorry, this isn't a subject that I'm very familiar with ) -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane wrote: > Alexander Korotkov writes: >> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch wrote: >>> I've attached a new version that includes the UINT64_FMT fix, some edits of >>> your newest comments, and a rerun of pgindent on the new files. I see no >>> other issues precluding commit, so I am marking the patch Ready for >>> Committer. >>> If I made any of the comments worse, please post another update. > >> Changes looks reasonable for me. Thanks! > > I am starting to look at this patch now. I'm wondering exactly why the > decision was made to continue storing btree-style statistics for arrays, > in addition to the new stuff. If I understand your suggestion, queries of the form SELECT * FROM rel WHERE ARRAY[ 1,2,3,4 ] <= x AND x <= ARRAY[ 1, 2, 3, 1000]; would no longer use an index. Is that correct? Are you suggesting removing MCV's in lieu of MCE's as well? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane wrote: > Nathan Boley writes: >> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane wrote: >>> I am starting to look at this patch now. I'm wondering exactly why the >>> decision was made to continue storing btree-style statistics for arrays, >>> in addition to the new stuff. > >> If I understand you're suggestion, queries of the form > >> SELECT * FROM rel >> WHERE ARRAY[ 1,2,3,4 ] <= x >> AND x <=ARRAY[ 1, 2, 3, 1000]; > >> would no longer use an index. Is that correct? > > No, just that we'd no longer have statistics relevant to that, and would > have to fall back on default selectivity assumptions. Which, currently, would mean queries of that form would typically use a table scan, right? > Do you think that > such applications are so common as to justify bloating pg_statistic for > everybody that uses arrays? I have no idea, but it seems like it will be a substantial regression for the people that are. What about MCV's? Will those be removed as well? Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
>> What about MCV's? Will those be removed as well? > > Sure. Those seem even less useful. Ya, this will destroy the performance of several queries without some heavy tweaking. Maybe this is bad design, but I've gotten in the habit of storing sequences as arrays and I commonly join on them. I looked through my code this morning, and I only have one 'range' query ( of the form described up-thread ), but there are tons of the form SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array; I can provide some examples if that would make my argument more compelling. Sorry to be difficult, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
[ sorry Tom, reply all this time... ] > What do you mean by "storing sequences as arrays"? So, a simple example is, for transcripts ( sequences of DNA that are turned into proteins ), we store each of the connected components as an array of the form: exon_type in [1,6] splice_type = [1,3] and then the array elements are [ exon_type, splice_type, exon_type ] ~ 99% of the elements are of the form [ [1,3], 1, [1,3] ], so I almost always get a hash or merge join ( correctly ) but for the rare junction types ( which are usually more interesting as well ) I correctly get nest loops with an index scan. > Can you demonstrate > that the existing stats are relevant at all to the query you're worried > about? Well, if we didn't have mcv's and just relied on ndistinct to estimate the '=' selectivities, either my low selectivity quals would use the index, or my high selectivity quals would use a table scan, either of which would be wrong. I guess I could wipe out the stats and get some real numbers tonight, but I can't see how the planner would be able to distinguish *without* mcv's... Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
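For reference, the per-column statistics being discussed here are the ones visible in pg_stats; a query along these lines (with made-up table and column names standing in for the real schema) shows the MCV entries the planner would rely on for such equality joins:

select n_distinct, most_common_vals, most_common_freqs
  from pg_stats
 where tablename = 'transcripts'
   and attname = 'junctions';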
Re: [HACKERS] proposal : cross-column stats
>> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > > Well, I'm not real familiar with contingency tables, but it seems like > you could end up needing to store a huge amount of data to get any > benefit out of it, in some cases. For example, in the United States, > there are over 40,000 postal codes, and some even larger number of > city names, and doesn't the number of entries go as O(m*n)? Not to mention that the model parameters will be impossible to estimate well. ( I've only scanned the thread, so sorry if I'm repeating something that's already been said ) My intuition is that storing the covariance structure will be unworkable both technically and statistically for all but the simplest of cases. That being said, I don't think the problem is a lack of ways to parameterize the covariance structure ( there are several papers on multi-dim histogram estimators, at least one of which talks about databases explicitly, not to mention approaches like CART[1] ), but a complete lack of infrastructure to do anything with the estimates. So keep working on it ;-) ( but try to make the framework general enough to allow better estimators ). I wonder if a good first step might be to focus on the AND and OR operators since they seem like the most common cases and union and intersection commute ( although it's important to remember that the estimate variances do NOT ). That is, what about estimating the selectivity of the condition WHERE X=A and Y=B by f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) + x_2*selectivity(X = A)*selectivity( Y = B ) + x_3 where x_{1,2,3} are parameters to be estimated from the data. Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple years back ). Best, Nathan [1] http://en.wikipedia.org/wiki/Classification_and_regression_tree [2] http://en.wikipedia.org/wiki/Copula_%28statistics%29 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
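A minimal sketch (Python, illustrative only) of the parameterized combination above; the helper names and the choice of least squares for fitting x_1, x_2, x_3 are assumptions, not something proposed in the thread:

    import numpy as np

    def fit_combination(sel_a, sel_b, sel_joint):
        # Fit f(A,B) = x1*(selA + selB) + x2*(selA*selB) + x3 to observed joint
        # selectivities by ordinary least squares.
        sel_a, sel_b, sel_joint = map(np.asarray, (sel_a, sel_b, sel_joint))
        design = np.column_stack([sel_a + sel_b, sel_a * sel_b, np.ones_like(sel_a)])
        coeffs, *_ = np.linalg.lstsq(design, sel_joint, rcond=None)
        return coeffs                     # x1, x2, x3

    def estimate_joint_selectivity(coeffs, sel_a, sel_b):
        x1, x2, x3 = coeffs
        return x1 * (sel_a + sel_b) + x2 * (sel_a * sel_b) + x3

How the training selectivities would be collected, and whether the fit should be per column pair or global, is exactly the missing-infrastructure question raised above.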
Re: [HACKERS] Why don't we accept exponential format for integers?
>>>> print int(1e+01) > 10 >>>> > That isn't building an integer: it is creating a float and casting to an int. try: int( 1e100 ) Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
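A quick check of the same point, in Python 3 syntax (illustrative only): 10 survives the round trip through float, but 10**100 does not, because the double has already rounded it.

    print(int(1e+01) == 10)         # True  -- 10 is exactly representable as a float
    print(int(1e100) == 10**100)    # False -- 1e100 is a float, rounded before the cast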
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra wrote: > On 19.1.2011 20:25, Robert Haas wrote: >> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra wrote: >>> Yes, I was aware of this problem (amount of memory consumed with lots of >>> updates), and I kind of hoped someone will come up with a reasonable >>> solution. >> >> As far as I can see, periodically sampling some or all of the table is >> really the only thing that has a chance of working OK. You could try >> to track changes only when they're not too big, but I think that's >> still going to have nasty performance consequences. > > Yes, sampling all the table is the only way to get reliable ndistinct > estimates. IMHO continuing to focus on worst case results is a dead end. Yes, for some distributions, ndistinct is very hard to estimate well. However, let us not forget that the current ndistinct estimator is very bad but the postgres planner is, on the whole, very good. As far as I can see this is due to two facts: 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) 2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executor can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table, we still usually need to scan the entire table. Again, in my *very* humble opinion, if the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, find that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But: ( and the literature seems to support me ) pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to "estimate" ndistinct - if we go that far why don't we just store the full distribution and be done with it? Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] estimating # of distinct values
>> If you think about it, it's a bit ridiculous to look at the whole table >> *just* to "estimate" ndistinct - if we go that far why dont we just >> store the full distribution and be done with it? > > - the best you could do is to average the > individual probabilities which gives ... well, 1/ndistinct. > That is certainly *not* the best you could do in every case. The mean is only the best estimate in L2, which is definitely not the case here. Consider a table with 10K values, 9,990 of which are 1 and the rest of which are 2, 3, ..., 10, versus a table that has the same 10 distinct values evenly distributed. For a simple equality query, in the first case, a bitmap scan might be best. In the second case, a sequential scan would always be best. This is precisely the point I was trying to make in my email: the loss function is very complicated. Improving the ndistinct estimator could significantly improve the estimates of ndistinct ( in the quadratic loss sense ) while only marginally improving the plans. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
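A toy illustration (Python, not from the thread) of the example above: two columns with the same ndistinct but very different shapes. Averaging the per-value frequencies gives 1/ndistinct for both, so that single number cannot separate the bitmap-scan case from the always-seq-scan case:

    from collections import Counter

    skewed  = [1] * 9991 + list(range(2, 11))    # 10K rows, one value dominates, 10 distinct
    uniform = list(range(1, 11)) * 1000          # 10K rows, 10 values evenly spread

    def avg_value_frequency(values):
        counts = Counter(values)
        n = float(len(values))
        # the per-value frequencies always sum to 1, so this is just 1/ndistinct
        return sum(c / n for c in counts.values()) / len(counts)

    print(avg_value_frequency(skewed), avg_value_frequency(uniform))   # 0.1 and 0.1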
Re: [HACKERS] estimating # of distinct values
> > Not true. You're missing the goal of this effort - to get ndistinct > estimate for combination of multiple columns. Which is usually > pathological in case of dependent columns. > Again, don't think about a single column (although even in that case > there are known fail cases). Think about multiple columns and the number > of distinct combinations. With attribute value independence assumption, > this is usually significantly underestimated. And that's a problem as it > often leads to an index scan instead of sequential scan (or something > like that). My point was that, in the case of single columns, we've done pretty well despite the poor ndistinct estimates. I suspect the same will be true in the case of multiple columns; good heuristics will trump minimax estimators. >> ( as I've advocated for before ) this means parameterizing the >> distribution of values ( ie, find that the values are roughly uniform >> ), maybe this means estimating the error of our statistics and using >> the most robust rather than the best plan, or maybe it means something >> totally different. But: ( and the literature seems to support me ) > > Which literature? I've read an awful lot of papers on this topic lately, > and I don't remember anything like that. If there's an interesting > paper, I'd like to read it. Yes, all the papers state estimating a > ndistinct is a particularly tricky task, but that's somehow expected? It is definitely expected - non-parametric minimax results are always *very* weak. The papers that you've cited seem to confirm this; bounding ndistinct estimation error is tricky in the general case ( even with very simple loss functions, which we do not have ). The same is true for other non-parametric methods. Think about kernel density estimation: how many data points do you need to estimate the density of a normal distribution well? What about if you *know* that the data is normal ( or even that it comes from a big family? ). > No, I'm not trying to do this just to improve the ndistinct estimate. > The goal is to improve the cardinality estimate in case of dependent > columns, and one of the papers depends on good ndistinct estimates. > > And actually it does not depend on ndistinct for the columns only, it > depends on ndistinct estimates for the combination of columns. So > improving the ndistinct estimates for columns is just a necessary first > step (and only if it works reasonably well, we can do the next step). I think that any approach which depends on precise estimates of ndistinct is not practical. I am very happy that you've spent so much time on this, and I'm sorry if my previous email came off as combative. My point was only that simple heuristics have served us well in the past and, before we go to the effort of new, complicated schemes, we should see how well similar heuristics work in the multiple column case. I am worried that if the initial plan is too complicated then nothing will happen and, even if something does happen, it will be tough to get it committed ( check the archives for cross column stat threads - there are a lot ). Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug wrote: > On Jan 20, 2011, at 02:41, Nathan Boley wrote: >>>> If you think about it, it's a bit ridiculous to look at the whole table >>>> *just* to "estimate" ndistinct - if we go that far why don't we just >>>> store the full distribution and be done with it? >>> >>> - the best you could do is to average the >>> individual probabilities which gives ... well, 1/ndistinct. >> >> That is certainly *not* the best you could do in every case. The mean >> is only the best estimate in L2, which is definitely not the case >> here. > > No, not in every case. But on average it comes out best, no? In the sense of minimizing the average plan cost over all values? Definitely not. Consider the trivial case where a bitmap scan is the cost of a sequential scan + epsilon. > >> Consider a table with 10K values, 9,990 of which are 1 and the rest of >> which are 2, 3, ..., 10, versus a table that has the same 10 distinct >> values evenly distributed. For a simple equality query, in the first >> case, a bitmap scan might be best. In the second case, a sequential >> scan would always be best. > > True. But selectivity estimates alone don't show the difference there. Yet the full distribution would - supporting my argument that even in cases where we don't know a specific value, the full distribution is more informative. > > Also, in my personal experience this isn't really the area we've got > problems now. The problem cases for me always were queries with a rather > large number of joins (7 to 10 or so) plus rather complex search > conditions. The join order, not the access strategy, then becomes the > deciding factor in plan performance. And the join order *is* driven purely > off the selectivity estimates, unlike the access strategy which even today > takes other factors into account (like clusteredness, I believe) > I think I'm losing you. Why would ndistinct be complete sufficient for estimating the optimal join order? >> This is precisely the point I was trying to make in my email: the loss >> function is very complicated. Improving the ndistinct estimator could >> significantly improve the estimates of ndistinct ( in the quadratic >> loss sense ) while only marginally improving the plans. > > > The point of this exercise isn't really to improve the ndistinct estimates > for single columns. Rather, we'd like to get ndistinct estimates for > groups of columns because that allows us to use the uniform bayesian > approach to multi-column selectivity estimation that Tomas found literature > on. Which on the whole seems like it *does* have a chance of greatly > improving the plans in some cases. We could, of course, estimate multi-column > ndistinct the same way we estimate single-column ndistincts, but quite a few > people raised concerns that this wouldn't work due to the large error our > current ndistinct estimates have. Sure. But estimating multi-column ndistinct is surely not the only way to approach this problem. The comment I made, which you objected to, was that it's silly to look at the whole table and then throw away all information *except* ndistinct. If we really think that looking at the whole table is the best approach, then shouldn't we be able to get better statistics? Is this really an open question? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ossp-uuid Contrib Patch
On Mon, Sep 10, 2012 at 04:23:00PM -0700, David E. Wheeler wrote: > Well given that OSSP seems to be abandon ware (no activity since July > 2008), it might be time to dump it in favor of something else. Perhaps this would be a good time to bring up my uuid code again. I've got a module for uuid support for postgres. Originally written for 8.4 but I've kept it updated. It's available at https://pd.if.org/git/uuid.git and for anonymous git cloning git clone https://pd.f.org/git/uuid The install directions are perhaps a bit lacking, but after the git checkout: cd uuid make cd postgres make sudo make install psql -c 'create extension pduuid' Installing to different schemas is supported also. The postgres/README file describes the supported features. Briefly, my code supports uuid generation of version 1, 3, 4, and 5 uuids, casting to and from numeric, bit string, and bytea values, and provides a uuid_recent() function to get the most recent uuid generated in that backend. The code is entirely in the public domain. At one point, it compiled on Windows and MacOS, but I haven't tested that recently. I don't think I have changed anything that would affect compiling on those platforms. IIRC, the only platform dependent code is in figuring out a local mac address and a sub-second timestamp. Comments or suggestions welcome. -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Measure Theoretic Data Types in Postgresql
> However, by realizing that the bounds on the ranges have a linear ordering > one can speed this up to O(m) using windowing functions on common table > expressions. > > So what I am proposing is formalizing this optimization into a class of data > types, that will hide the implementation details. Could this not also be handled by extending merge join to work with an overlap operator? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython3
> On the basis of all of the foregoing, I don't think we can consider > this patch further for this CommitFest and will update > commitfest.postgresql.org accordingly. FWIW, I am very excited about this patch and would be happy to review it but have been very busy over the past month. If I can promise a review by Thursday morning could we keep it active? Hopefully, at the very least, I can provide some useful feedback and spawn some community interest. I am worried that there is a bit of a chicken and an egg problem with this patch. I code nearly exclusively in python and C, but I have often found pl/python to be very unwieldy. For this reason I often use pl/perl or pl/pgsql for problems that, outside of postgres, I would always use python. From the documentation, this patch seems like an enormous step in the right direction. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython3
> I think it would be great for you to review it... I doubt that will > cause it to get committed for 9.0, but my doubt is no reason for you > to hold off reviewing it. I assumed so, but the pretense of a chance will probably help to motivate me :-) I'll have something by Thursday, and then 'Returned with Feedback' will at least be factual. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG 9.0 and standard_conforming_strings
On Wed, 03 Feb 2010 14:41:13 -0500, Tom Lane wrote: > Indeed it is, which is one of the reasons to be cautious with changing > it. We've been telling people to move away from \' for a long time, > but actually flipping the switch that will make their apps insecure > is not something to do on the spur of the moment. AFAICT the switch was added in 8.2, and mentioned in the release notes dated 2006-12-05. The documentation for 8.2 says "The default is currently off, causing PostgreSQL to have its historical behavior of treating backslashes as escape characters. The default will change to on in a future release to improve compatibility with the standard." So people have had three years of warning, which I would hardly characterize as "spur of the moment". If you want the old behavior, change the setting to off. I think that a major release point is exactly the right time to do this; doing it at a minor release number would be much less reasonable. A question for those opposed to doing it now: how exactly do you propose to warn people, other than with the notice that it will be changed in a future release, which has already been there for the last three years? -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extended operator classes vs. type interfaces
> The advantage of specifying a + and a - in the type interface is that > the unit definition can then be specified as part of the type > declaration itself. So you can do: > > CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s'); > CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m'); > > All of the stuff about defining + and - is hidden from the user - it's > part of the type interface, which is pre-created. > The disadvantage is that it does not permit irregularly spaced units. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
> Robert explained why having more MCVs might be useful because we use > the frequency of the least common MCV as an upper bound on the > frequency of any value in the MCV. Where is that being used? The only non-MCV frequency estimate that I recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously changing the number of mcv's affects this by lowering n_ndistinct_rows, but it's always a pretty coarse estimate. > Binding the length of the MCV list to the size of the histogram is > arbitrary but so would any other value Wouldn't the best approach be to stop adding MCV's/histogram buckets when adding new ones doesn't decrease your prediction error 'substantially'? One very hacky threshold heuristic is to stop adding MCV's when a simple equality select ( SELECT col FROM table WHERE col = VALUE ) would switch the plan from an index to a sequential scan ( or vice versa, although with the current code this would never happen ). ie, if the non_mcv frequency estimate is 0.1% ( producing an index scan ), but adding the MCV gives us an estimate of 5% ( probably producing a seq scan ) then add that value as an mcv. More sophisticated variations might also consider plan changes to very suboptimal joins; even more sophisticated would be to stop when the MAX( (current plan - optimal plan) / optimal plan ) was below some threshold, say 20%, over a bunch of recently executed queries. A similar approach would work for histogram bins, except the queries of interest are inequality rather than equality selections. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
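A rough sketch (Python, purely illustrative; the cutoff value and helper names are made up) of the "stop when the plan would not change" heuristic described above:

    def choose_mcvs(value_freqs, ndistinct, seq_scan_cutoff=0.02):
        # value_freqs: (value, frequency) pairs sorted by descending frequency.
        # seq_scan_cutoff stands in for whatever selectivity flips the planner
        # from an index scan to a sequential scan.
        mcvs = []
        for value, freq in value_freqs:
            remaining = max(ndistinct - len(mcvs), 1)
            non_mcv_estimate = (1.0 - sum(f for _, f in mcvs)) / remaining
            # keep the value only if the default estimate would land the planner
            # on the wrong side of the index/seq-scan cutoff
            if (freq >= seq_scan_cutoff) != (non_mcv_estimate >= seq_scan_cutoff):
                mcvs.append((value, freq))
            else:
                break
        return mcvs

The more sophisticated variations mentioned above would replace the single cutoff with a comparison of estimated plan costs over a set of recently executed queries.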
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
> That one's used, too, but the other is used as an upper bound. > n_distinct tends to come out too small on large tables, so that > formula is prone to overestimation. Actually, both formulas are prone > to overestimation. > Right - thanks. > When this happens depends on the values of a whole boat-load of GUCs... Well, then doesn't the 'proper' number of buckets depend on a whole boat-load of GUCs... -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range Types: empty ranges
FWIW, a very informal survey of probabilists didn't yield any reason for trying to put an order on the empty set ( unless the metric was cardinality or other equivalence relation ). I think the problem here is that the idea of union and intersection forming a ring over sets is being conflated with the order relation. Clearly, combining the two notions can be inconsistent. However... >> > A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) >> >> But the basic range type isn't even closed under UNION. > > An excellent point. Allow me to move the target a little: > > WHERE A && B AND A && C > and: > WHERE A && (B INTERSECT C) > > That seems like a logically sound transformation, but if (B INTERSECT C) > is empty, it relies on the empty range for those two to be equivalent. > > Now, I agree that lack of closure on UNION exhibits many of the problems > that I am pointing out related to forbidding empty ranges. However, I'm > not sure if that means we should give up on either. This seems potentially very useful, because we can transform WHERE A && B AND A && C from a bitmap scan into WHERE A && (B INTERSECT C), a simple index scan. In the union case ( even if we had a type that supported disjoint intervals), I doubt we would ever make that transformation because the index will probably still be over connected intervals. So, +1 for keeping it how it is ( but maybe with a better error message ). Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: cross column correlation ...
> Personally, I think the first thing we ought to do is add a real, bona > fide planner hint to override the selectivity calculation manually, > maybe something like this: > > WHERE (x < 5 AND y = 1) SELECTIVITY (0.1); > If you're going to go that far, why not just collect statistics on that specific predicate? ie, ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1); Then it won't fall subject to all of the pitfalls that Tom outlines below. Selectivities are easy to estimate if we know the predicate. They only become hard when they have to work for every possible predicate. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539 I glanced through the code and played around with the feature and, in general, it looks pretty good. But I have some comments/questions. Design: First, it makes me uncomfortable that you are using the MCV and histogram slot kinds in a way that is very different from other data types. I realize that tsvector uses MCV in the same way that you do but: 1) I don't like that very much either. 2) TS vector is different in that equality ( in the btree sense ) doesn't make sense, whereas it does for arrays. Using the histogram slot for the array lengths is also very surprising to me. Why not just use a new STA_KIND? It's not like we are running out of room, and this will be the second 'container' type that splits the container and stores stats about the elements. Second, I'm fairly unclear about what the actual underlying statistical method is and what assumptions it makes. Before I can really review the method, I will probably need to see some documentation, as I say below. Do you have any tests/simulations that explore the estimation procedure that I can look at? When will it fail? In what cases does it improve estimation? Documentation: Seems a bit lacking - especially if you keep the non-standard usage of mcv/histograms. Also, it would be really nice to see some update to the row-estimation-examples page ( in chapter 55 ). The code comments are good in general, but there are not enough high level comments. It would be really nice to have a comment that gives an overview of what each of the following functions is supposed to do: mcelem_array_selec mcelem_array_contain_overlap_selec and especially mcelem_array_contained_selec Also, in compute_array_stats you reference compute_tsvector_stats for the algorithm, but I think that it would be smarter to either copy the relevant portions of the comment over or to reference a published document. If compute_tsvector_stats changes, I doubt that anyone will remember to update the comment. Finally, it would be really nice to see, explicitly, and in mathematical notation 1) The data that is being collected and summarized. ( the statistic ) 2) The estimate being derived from the statistic ( ie the selectivity ) i) Any assumptions being made ( ie independence of elements within an array ) For instance, the only note I could find that actually addressed the estimator and assumptions underneath it was +* mult * dist[i] / mcelem_dist[i] gives us probability probability +* of qual matching from assumption of independent element occurence +* with condition that length = i. and that's pretty cryptic. That should be expanded upon and placed more prominently. A couple nit picks: 1) In calc_distr you go to some lengths to avoid round off errors. Since it is certainly just the order of the estimate that matters, why not just perform the calculation in log space? 2) compute_array_stats is really long. Could you break it up? ( I know that the other analyze functions are the same way, but why continue in that vein? ) Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
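A tiny illustration (Python, not from the patch) of the log-space suggestion in nit-pick 1: a long product of small probabilities underflows, while the sum of their logs stays usable for comparing estimates.

    import math

    probs = [1e-5] * 80

    direct = 1.0
    for p in probs:
        direct *= p                               # underflows to 0.0 partway through

    log_space = sum(math.log(p) for p in probs)   # about -921, still meaningful

    print(direct, log_space)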
Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
> Actually, it is not the MCV slot that is used but MCELEM. It is a different slot. The pg_stats > view maps both into the same fields. Yes, you're right. I am sorry about the error. > Surely, non-standard use of histogram slot > should be avoided. Agreed. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: collect frequency statistics for arrays
> Looking now, I see that Alexander wasn't Cc'd on the review, so it's > possible he missed the message? > We've corresponded off list and have discussed my review at some length. Alex submitted an updated patch on Sep 22 to me personally ( although not to the list? Alex? ), with the promise of a new version with improved comments. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
> Rebased with head. FYI, I've added myself as the reviewer for the current commitfest. Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql line number reporting from stdin
On Sat, 26 Nov 2011 22:36:15 +0200, Peter Eisentraut wrote: There is a long-standing oddity in psql that running psql -f foo.sql returns error messages with file name and line number, like psql:foo.sql:1: ERROR: syntax error at or near "foo" but running psql < foo.sql does not. I suggest we change the latter to print psql:<stdin>:1: ERROR: syntax error at or near "foo" Other examples for the use of the spelling "<stdin>" in this context include gcc and slonik. Error messages printed in interactive mode will not be affected, of course. Patch attached. No issue with the change itself, but the docs claim that "the variant using the shell's input redirection is (in theory) guaranteed to yield exactly the same output you would have received had you entered everything by hand." -- nw -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: collect frequency statistics for arrays
>> Version of patch with few more comments and some fixes. > > Where are we on the idea of better statistics for arrays? I need to finish reviewing the patch: I'll try to send in something tomorrow night. So far it looks good. Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] storage of sensor data with Fourier transforms
Hello! I have a potential situation where I will have a lot of sensor data coming in very often. (every second or so) The sensor data is from physics type measurements, and will normally follow a slowly changing pattern with sinusoidal disturbances. The overall shape of the data is more important than catching high frequency disturbances. I had the idea of taking the Fourier transform of the waveform and storing the waveform internally that way to reduce storage requirements. When I get some free cycles, I may try doing this. I would want it to be done in the database in such a way that the user can still join to a table using this internal storage scheme. Why am I mailing this list? I'd like to ask if anyone has heard of someone doing anything like this. I did a small search of the lists and internet but didn't come up with anything. I just don't want to re-invent the wheel. Thanks for your time, Nathan -- _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Got Mole problems? Call Avogadro at 6.02 x 10^23.
Re: [HACKERS] storage of sensor data with Fourier transforms
On 5/5/07, Tom Lane <[EMAIL PROTECTED]> wrote: "Nathan Buchanan" <[EMAIL PROTECTED]> writes: > I had the idea of taking the Fourier transform of the waveform and storing > the waveform internally that way to reduce storage requirements. Aside from what Steve said: The Fourier transform in itself doesn't reduce data size --- it's N points in, N points out. If you want to reduce storage requirements you have to resort to lossy compression, ie, deliberately throwing away some data. The transformed data might be more suitable for doing that (eg you can selectively discard high-frequency components) but do you really want to? Usually the point of storing measurements is so you can do unspecified analysis on them later. Applying lossy compression will restrict what you can (meaningfully) do later on. regards, tom lane Thanks for the replies. It seems I need to examine my plan more closely. Nathan
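For what it's worth, a rough numpy sketch (illustrative only) of the lossy scheme described in the reply: keep only the lowest-frequency coefficients, accept the loss of high-frequency detail, and reconstruct an approximation when the waveform is read back.

    import numpy as np

    def compress(samples, keep):
        return np.fft.rfft(samples)[:keep]          # drop everything above the first `keep` frequencies

    def reconstruct(kept_coeffs, n_samples):
        coeffs = np.zeros(n_samples // 2 + 1, dtype=complex)
        coeffs[:len(kept_coeffs)] = kept_coeffs
        return np.fft.irfft(coeffs, n_samples)

    t = np.linspace(0.0, 10.0, 1000)
    signal = np.sin(2 * np.pi * 0.5 * t) + 0.05 * np.random.randn(1000)
    approx = reconstruct(compress(signal, keep=20), 1000)   # 20 coefficients instead of 1000 samples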
Re: [HACKERS] Not ready for 8.3
On 5/15/07, Tatsuo Ishii <[EMAIL PROTECTED]> wrote: > Stefan Kaltenbrunner wrote: > >> They are not stable. The items should point to the archives, which are > >> supposedly more stable. (I had already fixed one item in PatchStatus > >> this morning). Really it would be much nicer to have links using the > >> Message-Id but I doubt that's at all doable. > > > > hrm - I see so is there a particular reason for that behaviour ? > > They're stable until Bruce removes something from the queue. When > something is removed, it's renumbered. > > It's how mhonarc works. It's the same with the archives - if we delete a > mail, they get renumbered. So we never should delete, we should just > blank out, but it has happened a couple of times. As I have proposed many times, why don't we add a message number to each subject line? For example like this: [HACKERS: 12345] Re: Not ready for 8.3 This way, we could always obtain a stable (logical) pointer, without relying on any particular archival infrastructure. This sounds like a good idea to me - though I'm just a lurker on the list. Nathan
Re: [HACKERS] Range types
>> Because intervals (mathematical not SQL) can be open or closed at each >> end point we need to know what the next an previous value would be at >> the specified granularity. And while you can do some operations without >> knowing this, there are many you can't. For instance you could not tell >> whether two [] or () ranges were adjacent, or be able to coalesce an >> array of ranges. > > This statement seems to me to demonstrate that you don't actually > understand the concept of open and closed ranges. It has nothing > whatsoever to do with assuming that the data type is discrete; > these concepts are perfectly well defined for the reals, for example. > What it is about is whether the inclusion conditions are "< bound" > or "<= bound". IMHO the first question is whether, for integers, [1,2] UNION [3,5] should be equal to [1,5]. In math this is certainly true, and defining 'next' seems like a reasonable way to establish this in postgres. The next question is whether, for floats, [1,3-FLT_EPSILON] UNION [3,5] should be [1,5]. And the next question is whether, for numeric(6,2), [1,2.99] UNION [3,5] should be [1,5]. FWIW, I would answer yes, no, yes to those three questions. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range types
> Right, I don't think strings are any more or less countable than integers. > (and yes, it's a bit OT). Well, if I remember my algebra, I think the issue is that integers are locally finite whereas strings are not ( assuming standard orderings, of course ). -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IN, BETWEEN, spec compliance, and odd operator names
> Yes, but always in relation to operator classes, so from BTrees opclass or > such, which I refered to as "the context of index searches", as I don't > really see any theorical need for opclass if it's not for indexing. IIRC analyze uses the btree op class to build the selectivity histogram. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
>>> Right now our >>> "histogram" values are really quantiles; the statistics_target T for a >>> column determines a number of quantiles we'll keep track of, and we >>> grab values from into an ordered list L so that approximately 1/T of >>> the entries in that column fall between values L[n] and L[n+1]. I'm >>> thinking that multicolumn statistics would instead divide the range of >>> each column up into T equally sized segments, >> >> Why would you not use the same histogram bin bounds derived for the >> scalar stats (along each axis of the matrix, of course)? This seems to >> me to be arbitrarily replacing something proven to work with something >> not proven. Also, the above forces you to invent a concept of "equally >> sized" ranges, which is going to be pretty bogus for a lot of datatypes. > > Because I'm trying to picture geometrically how this might work for > the two-column case, and hoping to extend that to more dimensions, and > am finding that picturing a quantile-based system like the one we have > now in multiple dimensions is difficult. I believe those are the same > difficulties Gregory Stark mentioned having in his first post in this > thread. But of course that's an excellent point, that what we do now > is proven. I'm not sure which problem will be harder to solve -- the > weird geometry or the "equally sized ranges" for data types where that > makes no sense. > Look at copulas. They are a completely general method of describing the dependence between two marginal distributions. It seems silly to rewrite the stats table in terms of joint distributions when we'll still need the marginals anyway. Also, it might be easier to think of the dimension reduction problem in that form. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cross-column statistics revisited
> I'm still working my way around the math, but copulas sound better > than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics against one another. Feel free to mail me off list if you have any math questions. I've previously thought that, at least in the 2D case, we could use image compression algorithms to compress the copula, but recently I've realized that this is a change point problem. In terms of compression, we want to decompose the copula into regions that are as homogeneous as possible. I'm not familiar with change point problems in multiple dimensions, but I'll try and ask someone who is, probably late next week. If you decide to go the copula route, I'd be happy to write the decomposition algorithm - or at least work on the theory. Finally, a couple points that I hadn't seen mentioned earlier that should probably be considered: 1) NULL's need to be treated specially - I suspect the assumption of NULL independence is worse than other independence assumptions. Maybe dealing with NULL dependence could be a half step towards full dependence calculations? 2) Do we want to fold the MCV's into the dependence histogram? That will cause problems in our copula approach but I'd hate to have to keep an N^d histogram dependence relation in addition to the copula. 3) For equality selectivity estimates, I believe the assumption that the ndistinct value distribution is uniform in the histogram will become worse as the dimension increases. I proposed keeping track of ndistinct per histogram bucket earlier in the marginal case, partially motivated by this exact scenario. Does that proposal make more sense in this case? If so we'd need to store two distributions - one of the counts and one of ndistinct. 4) How will this approach deal with histogram buckets that have scaling count sizes ( ie -0.4 )? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
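A toy sketch (Python, illustrative only) of the "plot of the order statistics against one another" view: rank each column, then bucket the rank pairs into a coarse grid. A roughly uniform grid means the columns look independent; the decomposition question above is how to summarize the non-uniform regions compactly.

    import numpy as np

    def empirical_copula_grid(x, y, bins=4):
        n = len(x)
        u = np.argsort(np.argsort(x)) / float(n)   # pseudo-observations (ranks scaled to [0, 1))
        v = np.argsort(np.argsort(y)) / float(n)
        grid, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
        return grid / n                             # cell frequencies; both marginals stay uniform

    x = np.random.normal(size=1000)
    y = x + 0.5 * np.random.normal(size=1000)       # two dependent "columns"
    print(empirical_copula_grid(x, y))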
Re: [HACKERS] Cross-column statistics revisited
> I still need to go through backend/utils/adt/selfuncs.c > to figure out exactly how we use the one-dimensional values. > Here's a page that helped me figure all this out. http://www.postgresql.org/docs/8.1/static/planner-stats-details.html >> >> 2) Do we want to fold the MCV's into the dependence histogram? That >> will cause problems in our copula approach but I'd hate to have to >> keep an N^d histogram dependence relation in addition to the copula. > > Yeah, if we're already trying to figure out how to compress copulae, > having also to compress MCV matrices seems painful and error-prone. > But I'm not sure why it would cause problems to keep them in the > copula -- is that just because we are most interested in the copula > modeling the parts of the distribution that are most sparsely > populated? > The problem I was thinking of is that we don't currently store the true marginal distribution. As it stands, histograms only include non mcv values. So we would either need to take the mcv's separately ( which would assume independence between mcv's and non-mcv values ) or store multiple histograms. >> 4) How will this approach deal with histogram buckets that have >> scaling count sizes ( ie -0.4 )? > > I'm not sure what you mean here. > That was more a note to myself, and should have been numbered 3.5. ndistinct estimates currently start to scale after a large enough row/ndistinct ratio. If we try to model ndistinct, we need to deal with scaling ndistinct counts somehow. But that's way off in the future, it was probably pointless to mention it. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Simple postgresql.conf wizard
Thanks for putting out pgtune - it's a sorely needed tool. I had a chance to look over the source code and have a few comments, mostly about Python-specific coding conventions. - The Windows-specific try block ( line 16 ) raises a ValueError vs ImportError on my Debian machine. Maybe it would be more appropriate to explicitly test platform.system()=="Windows"? - from ctypes import * ( 18 ) makes the block difficult to read and pollutes the namespace. - on line 45, the try block should probably catch exceptions derived from Exception ( to avoid catching SystemExit and KeyboardInterrupt errors ). ie, except Exception: return None. Also, printing out the exception in debug mode would probably be a good idea ( ie except Exception, e: print e\ return None ) - all classes ( 58, 135, 205 ) are 'old-style' classes. I don't see any reason to use classic classes ( unless Python 2.1 is a support goal? ) To make classes 'new style' classes ( http://www.python.org/doc/2.5.2/ref/node33.html ) they should inherit from object. i.e. class PGConfigLine(object): - The doc strings ( 59, 136, 206 ) don't follow standard conventions, described here http://www.python.org/dev/peps/pep-0257/. - Functions also support doc strings ( 342, 351, etc. ) - Tuple unpacking doesn't require the tuple boundaries ( 446 and others ). ie, options, args = ReadOptions() works. This is more of a style comment about the 'Java-ish interface' ( line 112 ), feel free to ignore it. Overloading __str__ and __repr__ is the Python way to return string representations of an object. ie, instead of toString use __str__ and then ( on 197 ) print l or print str(l) instead of print l.toString(). For the other methods ( getValue, getLineNumber, isSetting ), I'm assuming you didn't call the attributes directly because you didn't want them to be accidentally overwritten. Have you considered the use of properties ( http://www.python.org/download/releases/2.2.3/descrintro/#property )? Also, it would be clearer to mark attributes as private ( i.e. _name or __name, _readable, _lineNumber, _setsParameter ) if you don't want them to be accessed directly. Hope my comments are useful! Thanks again for writing this. -Nathan P.S. I'd be happy to officially review this if it gets to that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
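A small sketch of the style being suggested (the attribute and method names here are illustrative; the real PGConfigLine has different fields): a new-style class with __str__ instead of toString() and a read-only property instead of a getter.

    class PGConfigLine(object):
        def __init__(self, text, line_number):
            self._text = text
            self._line_number = line_number

        @property
        def line_number(self):
            return self._line_number

        def __str__(self):
            return "%d: %s" % (self._line_number, self._text)

    line = PGConfigLine("shared_buffers = 128MB", 115)
    print(line)              # instead of print line.toString()
    print(line.line_number)  # instead of line.getLineNumber()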
Re: [HACKERS] benchmarking the query planner
> One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could be used for rather accurate > cardinality estimation of "between" cases, while keeping number of entries > in "MCV" list small. > I suggested this earlier ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ). If there's interest, I'd be happy to clean up my patch and submit it for 8.5 -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
>> What is the specific difference between what you are talking about and >> what scalarineqsel already implements? > > Hmm... Nothing new. Feel sorry for bothering you. I did not realize > histograms are implemented. > Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket ( which we don't track ). IMHO, the whole idea of increasing mcv's seems like a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? That gives way better coverage of the distribution. If a value occurs too often to fit in a single bucket, we put it in an mcv slot anyway. *That* should be the mechanism by which the number of mcv's increases. I guess this is a bit off topic for the middle of a commit fest though. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
Thanks for the response. >> Well, ISTM there is a profound difference. For scalarineqsel we care >> about the total number of values in a bucket. For eqsel we care about >> the total number of *distinct* values in each bucket > > Really? > Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets. >> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use >> the limited storage in pg_statistic to try and estimate the >> selectivity for ranges of values rather than a single value? > > MCVs are useful for questions that are not related to ranges of values > --- an example of something highly useful we do with them is to try to > estimate the fraction of a column that satisfies a LIKE or regex > pattern. > Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used. > In fact, as I was just pointing out to Bruce, we can compute them and > do useful things with them for datatypes that don't have a defined sort > order and so the whole concept of "range" is meaningless. > Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size. > Now, if your point is that it'd be more flexible to not tie the MCV list > length to the histogram length, you're right. No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively. ie, there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc. Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily. I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point. -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] benchmarking the query planner
>> Isn't a selectivity estimate of x = v as ( the number of values in v's >> histogram bucket ) / ( number of distinct values in v's histogram >> bucket ) pretty rational? Thats currently what we do for non-mcv >> values, except that we look at ndistinct over the whole table instead >> of individual histogram buckets. > > But the histogram buckets are (meant to be) equal-population, so it > should come out the same either way. Why is it the same either way? The histogram buckets have equal population in the number of total values, not the number of distinct values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be [0,2) and [2,+oo), but they would have 2 and 5 distinct values respectively. > The removal of MCVs from the > population will knock any possible variance in ndistinct down to the > point where I seriously doubt that this could offer a win. Well, if the number of MCV's is large enough then it certainly won't matter. But isn't that pointlessly inefficient for most distributions? I provided some empirical evidence of this in an earlier post ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ) for normal distributions. > An even > bigger problem is that this requires estimation of ndistinct among > fractions of the population, which will be proportionally less accurate > than the overall estimate. Accurate ndistinct estimation is *hard*. > Agreed, but I'm not convinced that the overall error rate will go up for our current estimator. Ndistinct estimation is hardest for populations with a large ndistinct count variation. If the assumptions underlying this are founded ( namely that ndistinct distributions are related to the ordering relation in a predictable way ) then ndistinct estimation should be easier because the ndistinct distribution will be more uniform. But that's certainly something that should be tested on real data. >> now, if there are 100 histogram buckets then any values that occupy >> more than 1% of the table will be mcv's regardless - why force a value >> to be an mcv if it only occupies 0.1% of the table? > > Didn't you just contradict yourself? The cutoff would be 1% not 0.1%. No, I'm saying that if the number of histogram buckets is 100, then even if the mcv list is set to length 2, every value that occupies 1% or more of the table will be an mcv. However, if the size is fixed to 100 I think that it will be common to see values with relative frequencies much lower than 1% near the bottom of the list. > In any case there's already a heuristic to cut off the MCV list at some > shorter length (ie, more than 1% in this example) if it seems not > worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD) > in analyze.c. Given an increase in the default stats target, limits like the 25% are exactly what I think there should be more of. Can anyone suggest a good data set to test this sort of question on? -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
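A toy version (Python, illustrative only) of that example: the two equal-population buckets hold very different numbers of distinct values, so a per-bucket rows/ndistinct estimate differs from the whole-table figure.

    values = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
    buckets = [[v for v in values if v < 2], [v for v in values if v >= 2]]

    for b in buckets:
        rows, ndistinct = len(b), len(set(b))
        print(rows, ndistinct, rows / float(ndistinct))   # 5 2 2.5   and then   5 5 1.0

    # versus a whole-table estimate of 10 rows / 7 distinct values ~= 1.43 for every value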
Re: [HACKERS] hist boundary duplicates bug in head and 8.3
>> For heavy tailed distributions, it is possible for analyze to >> duplicate histogram boundaries. > > I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values * that have a frequency of at least 1/K where K is the intended * number of histogram bins; such values might otherwise cause us to * emit duplicate histogram bin boundaries. */ If this is expected, I'm also not sure what the use of maxmincount in analyze is... Thanks for the response, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers