First of all, I'd like to give a name to the thing that a backend listens to. The code talks about "listening to a relation", but as the comments there point out, the name doesn't actually identify a relation. I'm going to call it a topic from now on.

I'd like to point out that we don't really have a notification queue, but a notification flag for each topic. We're not constrained by the number of notifications, but by the number of topics.

If we're going to implement NOTIFY 'msg', it might make sense to change the semantics so that we never lose a notification, but that's another discussion.

I've been thinking about the options for the shmem data structure. Replacing pg_listener in the straightforward way would give us an array:

struct {
  char topicname[NAMEDATALEN];
  int listenerpid;  /* PID of the listening backend */
  int notification; /* nonzero when a notification is pending */
} listenertable[max_subscriptions];

Where max_subscriptions is the maximum number of active LISTENs.
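
To make that concrete, here's a minimal sketch of what NOTIFY would do against this array (the function name is mine; locking is omitted, and I'm assuming the usual SIGUSR2-based signalling):

#include <signal.h>
#include <string.h>

void
notify_topic(const char *topic)
{
  int i;

  for (i = 0; i < max_subscriptions; i++)
  {
    if (strcmp(listenertable[i].topicname, topic) == 0)
    {
      listenertable[i].notification = 1;            /* set the flag */
      kill(listenertable[i].listenerpid, SIGUSR2);  /* wake the listener */
    }
  }
}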

If we're ready to take the performance hit, we can exploit the fact that it's harmless to signal extra backends, as long as those backends can figure out that it was a false alarm. In fact, we can always signal all backends.

Exploiting that, we could have:

struct {
  char topicname[NAMEDATALEN];
  int subscriptions; /* number of active subscriptions for this topic */
  int notification_counter; /* increase by 1 on each NOTIFY */
} listenertable[max_topics];

NOTIFY increases the notification_counter by one, and signals all backends. Every backend keeps a private list of (topicname, notification_counter) pairs for the topics it's subscribed to, in addition to the shared memory table. The signal handler compares the notification_counter in the private list and in the listenertable. If they don't match, notify the client. If they match, it was a false alarm.
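
In code, the handler might look something like this (a sketch; the private list is a plain array here, send_notify_to_client is a made-up stand-in for the real protocol call, and locking is omitted):

typedef struct {
  char topicname[NAMEDATALEN];
  int last_counter; /* shared counter value we last acted on */
} PrivateSub;

static PrivateSub private_subs[MAX_PRIVATE_SUBS]; /* hypothetical limit */
static int n_private_subs;

void
handle_notify_signal(void)
{
  int i, j;

  for (i = 0; i < n_private_subs; i++)
  {
    for (j = 0; j < max_topics; j++)
    {
      if (strcmp(listenertable[j].topicname,
                 private_subs[i].topicname) != 0)
        continue;

      if (listenertable[j].notification_counter !=
          private_subs[i].last_counter)
      {
        /* counters differ: a real notification, pass it on */
        send_notify_to_client(private_subs[i].topicname);
        private_subs[i].last_counter =
          listenertable[j].notification_counter;
      }
      /* counters match: false alarm, nothing to do */
      break;
    }
  }
}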

The shmem requirement of this is
  max_topics * (NAMEDATALEN + sizeof(int) * 2)

If we're not ready to take the performance hit, we can add the list of backends to shmem:

struct {
  char topicname[NAMEDATALEN];
  int subscriptions; /* number of active subscriptions for this topic */
  int notification_counter; /* increase by 1 on each NOTIFY */
  int subscribers[max_backends]; /* the backends subscribed to this topic */
} listenertable[max_topics];

and only signal those backends that are in the subscribers array.

The shmem requirement of this is
  max_topics * (NAMEDATALEN + sizeof(int) * (2 + max_backends))
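
A sketch of the targeted signalling (again my names; I'm assuming subscribers[] holds backend PIDs, with 0 marking a free slot):

void
notify_topic_targeted(const char *topic)
{
  int i, j;

  for (i = 0; i < max_topics; i++)
  {
    if (strcmp(listenertable[i].topicname, topic) != 0)
      continue;

    listenertable[i].notification_counter++;

    for (j = 0; j < max_backends; j++)
    {
      if (listenertable[i].subscribers[j] != 0)
        kill(listenertable[i].subscribers[j], SIGUSR2);
    }
    break;
  }
}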


We can also do a tradeoff between shmem usage and unnecessary signals:

struct {
  char topicname[NAMEDATALEN];
  int subscriptions; /* number of active subscriptions for this topic */
  int notification_counter; /* increase by 1 on each NOTIFY */
  int subscriber_cache[cache_size]; /* a cached subset of the subscribers */
} listenertable[max_topics];

Where cache_size can be anything between 0 and max_backends. If the cache gets full, NOTIFY signals all backends; otherwise, only those that are in the cache.

The shmem requirement of this is
  max_topics * (NAMEDATALEN + sizeof(int) * (2 + cache_size))
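
One way NOTIFY could detect the overflow without extra bookkeeping (assuming 0 marks a free cache slot): if there are fewer cached PIDs than subscriptions, some subscriber didn't fit. Sketch:

void
notify_topic_cached(const char *topic)
{
  int i, j, cached;

  for (i = 0; i < max_topics; i++)
  {
    if (strcmp(listenertable[i].topicname, topic) != 0)
      continue;

    listenertable[i].notification_counter++;

    cached = 0;
    for (j = 0; j < cache_size; j++)
      if (listenertable[i].subscriber_cache[j] != 0)
        cached++;

    if (cached < listenertable[i].subscriptions)
      signal_all_backends();    /* cache overflowed; made-up helper */
    else
      for (j = 0; j < cache_size; j++)
        if (listenertable[i].subscriber_cache[j] != 0)
          kill(listenertable[i].subscriber_cache[j], SIGUSR2);
    break;
  }
}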


If max_topics is large, a hash table should be used instead of an array of structs.
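
For example, something along these lines, using the existing ShmemInitHash machinery (the entry struct name is mine):

HASHCTL info;
HTAB *listener_hash;

MemSet(&info, 0, sizeof(info));
info.keysize = NAMEDATALEN;                  /* topicname is the key */
info.entrysize = sizeof(ListenerTableEntry); /* one of the structs above */

listener_hash = ShmemInitHash("Listener table",
                              max_topics, max_topics,
                              &info, HASH_ELEM);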

Now that I think of it, using the notification_counter, we *can* guarantee that no notification is lost. The signal handler just needs to notify the client (shmem notification_counter) - (private notification_counter) times.
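
That's only a couple of lines changed in the handler sketched above (ignoring counter wraparound):

int missed = listenertable[j].notification_counter
           - private_subs[i].last_counter;

while (missed-- > 0)
  send_notify_to_client(private_subs[i].topicname);

private_subs[i].last_counter = listenertable[j].notification_counter;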


- Heikki

On Thu, 6 Oct 2005, Neil Conway wrote:

Applications that frequently use LISTEN/NOTIFY can suffer from
performance problems because of the MVCC bloat created by frequent
insertions into pg_listener. A solution to this has been suggested in
the past: rewrite LISTEN/NOTIFY to use shared memory rather than system
catalogs.

The problem is that there is a static amount of shared memory and a
potentially unbounded number of notifications, so we can run out of
memory. There are two ways to solve this: we can do as sinval does and
clear the shared memory queue, then effectively issue a NOTIFY ALL that
awakens all listeners. I don't like this behaviour: it seems ugly to
expose an implementation detail (static sizing of shared memory) to
applications. While a lot of applications are only using LISTEN/NOTIFY
for cache invalidation (and so spurious notifications are just a
performance hit), this behaviour still seems unfortunate to me. Using
NOTIFY ALL also makes NOTIFY 'msg' far less useful, which is a feature
several users have asked for in the past.

I think it would be better to either fail the NOTIFY when there is not
enough shared memory to add a new notification to the queue, or have the
NOTIFY block until shared memory does become available (applications
could of course implement the latter on top of the former by using
savepoints and a loop, either on the client-side or in PL/PgSQL). I
guess we could add an option to NOTIFY to specify how to handle
failures.

A related question is when to add the notification to the shared memory
queue. We don't want the notification to fire until the NOTIFY's
transaction commits, so one alternative would be to delay appending to
the queue until transaction-commit time. However, that would mean we
wouldn't notice NOTIFY failure until the end of the transaction, or else
that we would block waiting for free space during the transaction-commit
process. I think it would be better to add an entry to shared memory
during the NOTIFY itself, and stamp that entry with the NOTIFY's
toplevel XID. Other backends can read the notification immediately
(and once all the backends have seen it, the notification can be removed
from the queue). Each backend can use the XID to determine when to
"fire" the notification (and if the notifying backend rolls back, they
can just discard the notification). This scheme is more expensive when
the notifying transaction rolls back, but I don't think that is the
common case.
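
A sketch of what such a queue entry might look like (the struct and field
names are made up):

typedef struct {
  TransactionId xid;     /* toplevel XID of the notifying backend */
  char topicname[NAMEDATALEN];
  int backends_to_read;  /* backends that haven't seen it yet */
} NotificationEntry;

A reader fires the notification once xid is known to have committed,
discards it if xid aborted; when backends_to_read reaches zero the entry
can be removed from the queue.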

Comments? (I'm still thinking about how to organize the shared memory
queue, and whether any of the sinval stuff can be reused...)

-Neil


