Safe operations?

2018-08-12 Thread Samuel Williams
I wish the documentation included performance details, e.g. whether an
operation is O(N) or O(1) relative to the number of rows.

I found renaming a table was okay.

How about renaming a column? Is it O(1) or proportional to the amount of
data?

Is there any documentation about this?

Thanks
Samuel


Re: Safe operations?

2018-08-13 Thread Samuel Williams
Thanks everyone for your prompt help. It sounds like a rename operation is
almost never an issue unless you literally have millions of indexes. Thanks
for all the follow-on questions and answers; it was most helpful and
interesting to learn a bit more about PG internals.

On Mon, 13 Aug 2018 at 12:01, Tom Lane  wrote:

> Tim Cross  writes:
> > On Mon, 13 Aug 2018 at 12:23, Olivier Gautherot 
> >> On Sun, Aug 12, 2018 at 11:06 PM, Tim Cross 
> wrote:
> >>> Just wondering - what about the case when the column being renamed is
> >>> also referenced in an index or check constraint?
>
> >> Tim, as far as I know, names are only an attribute tagged to an OID.
> >> Internal relations are through these OIDs, not names, so renaming a
> >> column is really one-shot. Names are mainly a more convenient way of
> >> referring to objects.
>
> > Thanks Olivier, that is what I suspected and your explanation fits with
> > my mental model. I had assumed table/column names are a convenience for
> > humans and that the system would use OIDs etc. for internal references.
>
> Right, catalog internal references are all via OIDs or column numbers,
> so that the only thing the system thinks it needs to do is update the
> "name" field in a single catalog row.  (A problem with this is that
> user-defined function bodies are stored as text; so you may well have
> to run around and fix your functions by hand.  But that doesn't
> contribute to the cost of the RENAME operation per se.)
>
> Getting back to Samuel's original question, the reason we don't try
> to document performance issues like this is that there are just too
> many moving parts.  Yeah, the update of the catalog row should be
> more or less O(1), and then the required updates of the catalog's
> indexes will be more or less O(log N) (N being the number of rows
> in that catalog).  But in practice the main constraint is often the
> need to obtain locks on the relevant database objects, and that's
> really hard to give a prediction for.
>
> regards, tom lane
>
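
To illustrate the catalog-only nature of the rename, here is a minimal sketch
(the table and column names are hypothetical): renaming a column rewrites the
attname field of a single pg_attribute row, so the cost does not depend on how
many rows the table holds.

ALTER TABLE accounts RENAME COLUMN user_name TO username;

-- Only one catalog row has changed; the table data is untouched:
SELECT attnum, attname
FROM pg_attribute
WHERE attrelid = 'accounts'::regclass AND attnum > 0;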


libpq pipelining

2020-06-26 Thread Samuel Williams
Hello,

Using the asynchronous interface of libpq, is it possible to pipeline
multiple queries?

i.e.

PQsendQuery(query1)
PQsendQuery(query2)

followed by

query1_results = PQgetResult(...)
query2_results = PQgetResult(...)

I tried it but got an "another command is already in progress" error.

So, maybe it's not supported, or maybe I'm doing something wrong.

Thanks
Samuel




Re: libpq pipelining

2020-06-26 Thread Samuel Williams
Thanks David,

You are correct.

I was giving an example of what I was hoping to achieve, not what I
expected to work with the current interface.

I found some discussion in the past relating to batch processing which
appears to support some kind of pipelining:

https://2ndquadrant.github.io/postgres/libpq-batch-mode.html

However, it seems to have been abandoned.

Kind regards,
Samuel

On Sat, 27 Jun 2020 at 16:15, David G. Johnston
 wrote:
>
> On Friday, June 26, 2020, Samuel Williams  
> wrote:
>>
>> Hello,
>>
>> Using the asynchronous interface of libpq, is it possible to pipeline
>> multiple queries?
>>
>> i.e.
>>
>> PQsendQuery(query1)
>> PQsendQuery(query2)
>>
>> followed by
>>
>> query1_results = PQgetResult(...)
>> query2_results = PQgetResult(...)
>>
>> I tried it but got an "another command is already in progress" error.
>
>
> The documentation seems to leave zero ambiguity:
>
>  After successfully calling PQsendQuery, call PQgetResult one or more times 
> to obtain the results. PQsendQuery cannot be called again (on the same 
> connection) until PQgetResult has returned a null pointer, indicating that 
> the command is done.
>
> David J.
>
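
To make the documented pattern concrete, here is a minimal sketch (connection
setup and error handling kept to a minimum; the query text is a placeholder):
one PQsendQuery, then drain PQgetResult until it returns NULL before issuing
the next query on the same connection.

#include <libpq-fe.h>
#include <stdio.h>

/* Run one query asynchronously and drain all of its results. */
static void run_query(PGconn *conn, const char *sql)
{
    if (!PQsendQuery(conn, sql)) {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }

    PGresult *result;
    while ((result = PQgetResult(conn)) != NULL) {
        /* Inspect PQresultStatus(result) and the rows here. */
        PQclear(result);
    }

    /* Only now may PQsendQuery be called again on this connection. */
}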




Re: libpq pipelining

2020-06-26 Thread Samuel Williams
> What about, as it says, sending multiple statements in a single sendQuery and 
> then polling for multiple results?

I tried this, and even in single row streaming mode, I found that
there are cases where the results would not be streamed until all the
queries were sent.

From the user's point of view, they may generate a loop sending multiple
queries without caring about the results, so pipeline/batch processing is
ideal to avoid a round trip per loop iteration; if database access is slow,
this can be a significant source of latency.

Kind regards,
Samuel




Re: libpq pipelining

2020-06-27 Thread Samuel Williams
Here is a short example:

https://gist.github.com/ioquatix/2f08f78699418f65971035785c80cf18

It makes 10 queries in one "PQsendQuery" and sets single row mode. But
all the results come back at once as shown by the timestamps.

Next I'm planning to investigate streaming large recordsets to see if
it works better/incrementally.
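
For reference, a minimal sketch of that approach (hypothetical statements,
assumes an open PGconn *conn; whether single-row mode stays in effect for
every statement in the string may vary, so the loop handles both shapes):

#include <libpq-fe.h>
#include <stdio.h>

static void run_batch(PGconn *conn)
{
    if (!PQsendQuery(conn, "SELECT 1; SELECT 2; SELECT 3;") ||
        !PQsetSingleRowMode(conn))
        return;

    PGresult *result;
    while ((result = PQgetResult(conn)) != NULL) {
        ExecStatusType status = PQresultStatus(result);

        if (status == PGRES_SINGLE_TUPLE) {
            /* One streamed row: PQgetvalue(result, 0, 0), ... */
        } else if (status == PGRES_TUPLES_OK) {
            /* End of one statement's result set (or the full set, if
             * single-row mode did not apply to this statement). */
        } else if (status != PGRES_COMMAND_OK) {
            fprintf(stderr, "%s", PQerrorMessage(conn));
        }

        PQclear(result);
    }
}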




Re: libpq pipelining

2020-06-29 Thread Samuel Williams
I think libmariadb has a nicer interface for this.

Essentially what you do is send your query, and then read a result set
(one result set per query), and then you stream individual rows using:

mysql_fetch_row_start
mysql_fetch_row_cont

Those methods don't seem to have an equivalent in libpq - you can use
PQgetResult but it buffers all the rows. Using single row mode results
in many results for each query (seems like a big overhead). The
difference between this and MariaDB is that mysql_fetch_row still
operates within one logical set of results, but single row mode breaks
the single logical set of results into lots of individual results.

Maybe the statement about efficiency is incorrect, but it would be
nice if you could incrementally stream a single result set more
easily.


On Sun, 28 Jun 2020 at 02:40, Tom Lane  wrote:
>
> Samuel Williams  writes:
> > Here is a short example:
> > https://gist.github.com/ioquatix/2f08f78699418f65971035785c80cf18
> > It makes 10 queries in one "PQsendQuery" and sets single row mode. But
> > all the results come back at once as shown by the timestamps.
>
> That looks to be less about what libpq will do than what the Ruby
> interface code will do.
>
> The volume of return data may also be an issue.  I don't think the
> backend will flush data out to the client except when it (a) reaches
> an idle state or (b) fills the output buffer.  Ten occurrences of
> a short query result aren't gonna be enough for (b) --- from memory,
> that buffer is probably 8KB.
>
> regards, tom lane




Re: libpq pipelining

2020-06-29 Thread Samuel Williams
Tom, I'm implementing a small abstraction layer for event-driven
result streaming on PostgreSQL and MariaDB for Ruby, and I'll endeavor
to report back with some numbers once I have enough of it working to
benchmark something meaningful.

Thanks for your patience and help.

Kind regards,
Samuel

On Tue, 30 Jun 2020 at 02:06, Tom Lane  wrote:
>
> Samuel Williams  writes:
> > Those methods don't seem to have an equivalent in libpq - you can use
> > PQgetResult but it buffers all the rows. Using single row mode results
> > in many results for each query (seems like a big overhead).
>
> Have you got any actual evidence for that?  Sure, the overhead is
> more than zero, but does it mean anything in comparison to the other
> costs of data transmission?
>
> > Maybe the statement about efficiency is incorrect, but it would be
> > nice if you could incrementally stream a single result set more
> > easily.
>
> More easily than what?  If we did not construct a PGresult then we would
> need some other abstraction for access to the returned row, dealing with
> error cases, etc etc.  That would mean a lot of very duplicative API code
> in libpq, and a painful bunch of adjustments in client code.
>
> regards, tom lane




Re: libpq pipelining

2020-07-06 Thread Samuel Williams
Hi,

Here are some initial numbers.

DB::Client
Warming up --
 db-postgres   281.000  i/100ms
  db-mariadb   399.000  i/100ms
  mysql2   533.000  i/100ms
  pg   591.000  i/100ms
Calculating -
 db-postgres  2.725k (± 1.8%) i/s - 13.769k in   5.053750s
  db-mariadb  3.990k (± 2.4%) i/s - 19.950k in   5.002453s
  mysql2  5.153k (± 4.7%) i/s - 26.117k in   5.079570s
  pg  5.772k (± 4.4%) i/s - 28.959k in   5.027423s

Comparison:
  pg: 5771.7 i/s
  mysql2: 5152.8 i/s - 1.12x  (± 0.00) slower
  db-mariadb: 3990.3 i/s - 1.45x  (± 0.00) slower
 db-postgres: 2725.5 i/s - 2.12x  (± 0.00) slower

The db-* gems are event driven. However, that is of less interest right now.

This benchmark creates a table, inserts 1000 (or some other fixed number of)
rows, and then selects them all back out.

What I noticed is that `PQgetvalue` and `PQgetisnull` are not particularly
efficient, at least via FFI: extracting a result requires rows * columns * 2
FFI calls.

libmariadb provides `mysql_fetch_row`, which returns a `char **` per row and
therefore requires only one FFI call per row.

Does a similar method exist for libpq? e.g. `PGgetrow(index) ->
char**` (array of strings, one for each column, may be nil to indicate
null).
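
For comparison, here is roughly what extracting a whole result looks like with
the existing per-cell accessors (a minimal sketch; this is where the rows *
columns * 2 calls across the FFI come from):

#include <libpq-fe.h>
#include <stdio.h>

/* Print every cell of a result: one PQgetisnull plus one PQgetvalue call
 * per cell. */
static void dump_result(const PGresult *result)
{
    int rows = PQntuples(result);
    int cols = PQnfields(result);

    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            if (PQgetisnull(result, row, col))
                printf("NULL\t");
            else
                printf("%s\t", PQgetvalue(result, row, col));
        }
        printf("\n");
    }
}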

Kind regards,
Samuel

On Tue, 30 Jun 2020 at 12:50, Samuel Williams
 wrote:
>
> Tom, I'm implementing a small abstraction layer for event-driven
> result streaming on PostgreSQL and MariaDB for Ruby, and I'll endeavor
> to report back with some numbers once I have enough of it working to
> benchmark something meaningful.
>
> Thanks for your patience and help.
>
> Kind regards,
> Samuel
>
> On Tue, 30 Jun 2020 at 02:06, Tom Lane  wrote:
> >
> > Samuel Williams  writes:
> > > Those methods don't seem to have an equivalent in libpq - you can use
> > > PQgetResult but it buffers all the rows. Using single row mode results
> > > in many results for each query (seems like a big overhead).
> >
> > Have you got any actual evidence for that?  Sure, the overhead is
> > more than zero, but does it mean anything in comparison to the other
> > costs of data transmission?
> >
> > > Maybe the statement about efficiency is incorrect, but it would be
> > > nice if you could incrementally stream a single result set more
> > > easily.
> >
> > More easily than what?  If we did not construct a PGresult then we would
> > need some other abstraction for access to the returned row, dealing with
> > error cases, etc etc.  That would mean a lot of very duplicative API code
> > in libpq, and a painful bunch of adjustments in client code.
> >
> > regards, tom lane




Partial index on JSON column

2019-02-19 Thread Samuel Williams
Hello

I have a table with ~3 billion events.

Of these, there is a small subset of events which match the following partial index:

CREATE INDEX index_user_event_for_suggestion_notification ON public.user_event
USING btree ((((parameters ->> 'suggestion_id'::text))::integer), what)
WHERE ((parameters ->> 'suggestion_id'::text) IS NOT NULL);

When I do this query:

EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
'suggestion_id'::text)::integer = 26) AND what = 'suggestion_notification';

Aggregate  (cost=7115959.48..7115959.49 rows=1 width=8)
  ->  Bitmap Heap Scan on user_event  (cost=37360.24..7115907.56 rows=20771 width=0)
        Recheck Cond: ((what)::text = 'suggestion_notification'::text)
        Filter: (((parameters ->> 'suggestion_id'::text))::integer = 26)
        ->  Bitmap Index Scan on index_user_event_for_clustering  (cost=0.00..37355.05 rows=4154273 width=0)
              Index Cond: ((what)::text = 'suggestion_notification'::text)

It's slow. I need to explicitly add the NULL constraint:

EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
'suggestion_id'::text)::integer = 26) AND ((parameters ->> 'suggestion_id'::
text) IS NOT NULL) AND what = 'suggestion_notification';

Aggregate  (cost=38871.48..38871.49 rows=1 width=8)
  ->  Index Scan using index_user_event_for_suggestion_notification on user_event  (cost=0.42..38819.81 rows=20668 width=0)
        Index Cond: ((((parameters ->> 'suggestion_id'::text))::integer = 26) AND ((what)::text = 'suggestion_notification'::text))

I feel like the null constraint should be implicit.

That being said:

- Is my partial index wrong? Should I write it differently so the optimiser
knows this?
- Is my query wrong? How can I make the most use of this index without
being explicit?
- Any other suggestions for how I can do this?

Thanks
Samuel


Re: Partial index on JSON column

2019-02-19 Thread Samuel Williams
Thanks for the quick reply Tom,

I will try your advice.

The reason I used ::integer for the INDEX is that I assumed it would be more
efficient in both space and performance.

In the JSONB field, it is actually an integer, i.e. {"location_age": 1,
"suggestion_id": 26}

So, now that I think about it, maybe the way I'm using ::text is wrong.

Any further advice is most appreciated.

Kind regards,
Samuel

On Wed, 20 Feb 2019 at 10:14, Tom Lane  wrote:

> Samuel Williams  writes:
> > When I do this query:
>
> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> > 'suggestion_id'::text)::integer = 26) AND what =
> 'suggestion_notification';
>
> > It's slow. I need to explicitly add the NULL constraint:
>
> Try it like
>
> EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>
> I don't think we assume that CoerceViaIO is strict, and without that
> the deduction that the value couldn't be null doesn't hold.  In any
> case you're better off without the runtime type conversion: that
> isn't doing much for you except raising the odds of getting an error.
>
> regards, tom lane
>


Re: Partial index on JSON column

2019-02-19 Thread Samuel Williams
So, I checked my original code, and I didn't add `::text`, it was actually
added by Postgres:

EXPLAIN for: SELECT "user_event".* FROM "user_event" WHERE
"user_event"."what" = $1 AND (((parameters ->> 'suggestion_id')::integer) =
119 AND ((parameters ->> 'suggestion_id') IS NOT NULL)) [["what",
"suggestion_notification"]]
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using index_user_event_for_suggestion_notification on user_event  (cost=0.42..77193.95 rows=20669 width=138)
   Index Cond: ((((parameters ->> 'suggestion_id'::text))::integer = 119) AND ((what)::text = 'suggestion_notification'::text))
(2 rows)

Is there some way to directly use the integer value in the index with
minimal type coercions?

Thanks
Samuel

On Wed, 20 Feb 2019 at 10:24, Samuel Williams <
space.ship.travel...@gmail.com> wrote:

> Thanks for the quick reply Tom,
>
> I will try your advice.
>
> The reason I used ::integer for the INDEX is that I assumed it would be
> more efficient in both space and performance.
>
> In the JSONB field, it is actually an integer, i.e. {"location_age": 1,
> "suggestion_id": 26}
>
> So, now that I think about it, maybe the way I'm using ::text is wrong.
>
> Any further advice is most appreciated.
>
> Kind regards,
> Samuel
>
> On Wed, 20 Feb 2019 at 10:14, Tom Lane  wrote:
>
>> Samuel Williams  writes:
>> > When I do this query:
>>
>> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
>> > 'suggestion_id'::text)::integer = 26) AND what =
>> 'suggestion_notification';
>>
>> > It's slow. I need to explicitly add the NULL constraint:
>>
>> Try it like
>>
>> EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
>> 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>>
>> I don't think we assume that CoerceViaIO is strict, and without that
>> the deduction that the value couldn't be null doesn't hold.  In any
>> case you're better off without the runtime type conversion: that
>> isn't doing much for you except raising the odds of getting an error.
>>
>> regards, tom lane
>>
>


Re: Partial index on JSON column

2019-02-19 Thread Samuel Williams
Thanks Tom, I did solve the problem by adding the null constraint for now;
it's a quick solution, and I look forward to this case being handled
automatically in the future.

On Wed, 20 Feb 2019 at 12:17, Tom Lane  wrote:

> I wrote:
> > Try it like
>
> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> > 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>
> > I don't think we assume that CoerceViaIO is strict, and without that
> > the deduction that the value couldn't be null doesn't hold.
>
> Hmm ... on closer inspection, we do assume that CoerceViaIO is strict,
> but not everyplace knows that, so there's something that could be
> improved here.  In the meantime, try it as stated above.
>
> regards, tom lane
>


libpq read/write

2019-03-30 Thread Samuel Williams
I've been doing some profiling and I was surprised to see that libpq uses
epoll when handling what essentially amounts to blocking reads/writes.

https://github.com/postgres/postgres/blob/fc22b6623b6b3bab3cb057ccd282c2bfad1a0b30/src/backend/libpq/pqcomm.c#L207-L227

https://github.com/postgres/postgres/blob/97c39498e5ca9208d3de5a443a2282923619bf91/src/backend/libpq/be-secure.c#L163-L215

I was just wondering why it needed to be so complicated?

What's wrong with just using read/write when blocking semantics are
desired? You can still restart them if they are interrupted (might not be
desired default behaviour). The problem is, it's hard for me to measure
"blocking" time when libpq uses epoll rather than blocking read/write.

I guess I'm not expecting to rewrite it or anything, just wondering why
it's designed that way.

Kind regards,
Samuel


Re: libpq read/write

2019-03-30 Thread Samuel Williams
Tom, you seem to know everything related to Postgres, so thanks for your
time and answers. I'm blown away by your dedication and knowledge.

Regarding PQisBusy, and similar, even for "non-blocking" behaviour, you are
essentially expecting the user to have their own event loop, and only
invoke the relevant libpq functions when I/O is actually possible, right?

e.g. in many cases, you'd set the socket to be non-blocking, and then just
return to the user "I want to read more data".

What's actually stopping the implementation from calling read/write directly? In
the blocking case, that's correct behaviour. In the non-blocking case, I
don't see how it's helpful to use epoll, because you should just return to
the user right away.

Thanks
Samuel

On Sun, 31 Mar 2019 at 03:17, Tom Lane  wrote:

> Samuel Williams  writes:
> > I've been doing some profiling and I was surprised to see that libpq uses
> > epoll when handling what essentially amounts to blocking reads/writes.
>
> Yup.
>
> > I was just wondering why it needed to be so complicated?
>
> So that we can also support nonblocking behavior (cf PQisBusy).
>
> If the library were being written from scratch today, I doubt anybody
> would bother with that; it'd make more sense for an application to
> use a separate thread for the database interaction, if there were
> other things it needed to pay attention to concurrently.  But it is
> what it is.
>
> regards, tom lane
>
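
For completeness, a minimal sketch of the nonblocking pattern this refers to
(assumes an open PGconn *conn with a query already sent via PQsendQuery;
error handling kept minimal): the application owns the event loop, waits on
PQsocket, and only calls into libpq when the socket is readable.

#include <libpq-fe.h>
#include <sys/select.h>

/* Drain the results of an already-sent query without blocking inside libpq:
 * wait for readability ourselves, then let libpq consume whatever has
 * arrived and hand back any complete results. */
static void drain_results(PGconn *conn)
{
    int sock = PQsocket(conn);

    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, NULL, NULL, NULL) < 0)
            return;

        if (!PQconsumeInput(conn))      /* read newly arrived data */
            return;

        while (!PQisBusy(conn)) {       /* a result is ready without blocking */
            PGresult *result = PQgetResult(conn);
            if (result == NULL)
                return;                 /* the query is complete */
            /* process the result ... */
            PQclear(result);
        }
    }
}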