Safe operations?
I wish the documentation would include performance details, e.g. whether an operation is O(N) or O(1) relative to the number of rows.

I found renaming a table was okay. How about renaming a column? Is it O(1) or proportional to the amount of data? Is there any documentation about this?

Thanks
Samuel
Re: Safe operations?
Thanks everyone for your prompt help. It sounds like a rename operation is almost never an issue unless you literally had millions of indexes.

Thanks for all the follow-on questions and answers; it was most helpful and interesting to learn a bit more about PG internals.

On Mon, 13 Aug 2018 at 12:01, Tom Lane wrote:
> Tim Cross writes:
> > On Mon, 13 Aug 2018 at 12:23, Olivier Gautherot wrote:
> >> On Sun, Aug 12, 2018 at 11:06 PM, Tim Cross wrote:
> >>> Just wondering - what about the case when the column being renamed is
> >>> also referenced in an index or check constraint?
>
> >> Tim, as far as I know, names are only an attribute tagged to an OID.
> >> Internal relations are through these OIDs, not names, so renaming a
> >> column is really one-shot. Names are mainly a more convenient way of
> >> referring to objects.
>
> > Thanks Olivier, that is what I suspected and your explanation fits with
> > my mental model. I had assumed table/column names are a convenience for
> > humans and that the system would use OIDs etc. for internal references.
>
> Right, catalog internal references are all via OIDs or column numbers,
> so that the only thing the system thinks it needs to do is update the
> "name" field in a single catalog row. (A problem with this is that
> user-defined function bodies are stored as text; so you may well have
> to run around and fix your functions by hand. But that doesn't
> contribute to the cost of the RENAME operation per se.)
>
> Getting back to Samuel's original question, the reason we don't try
> to document performance issues like this is that there are just too
> many moving parts. Yeah, the update of the catalog row should be
> more or less O(1), and then the required updates of the catalog's
> indexes will be more or less O(log N) (N being the number of rows
> in that catalog). But in practice the main constraint is often the
> need to obtain locks on the relevant database objects, and that's
> really hard to give a prediction for.
>
> regards, tom lane
libpq pipelining
Hello,

Using the asynchronous interface of libpq, is it possible to pipeline multiple queries?

i.e.

PQsendQuery(query1)
PQsendQuery(query2)

followed by

query1_results = PQgetResult(...)
query2_results = PQgetResult(...)

I tried it but got an "another command is already in progress" error. So maybe it's not supported, or maybe I'm doing something wrong.

Thanks
Samuel
Re: libpq pipelining
Thanks David,

You are correct. I was giving an example of what I was hoping to achieve, not what I expected to work with the current interface.

I found some discussion in the past relating to batch processing which appears to support some kind of pipelining:

https://2ndquadrant.github.io/postgres/libpq-batch-mode.html

However, it seems to be abandoned.

Kind regards,
Samuel

On Sat, 27 Jun 2020 at 16:15, David G. Johnston wrote:
>
> On Friday, June 26, 2020, Samuel Williams wrote:
>>
>> Hello,
>>
>> Using the asynchronous interface of libpq, is it possible to pipeline
>> multiple queries?
>>
>> i.e.
>>
>> PQsendQuery(query1)
>> PQsendQuery(query2)
>>
>> followed by
>>
>> query1_results = PQgetResult(...)
>> query2_results = PQgetResult(...)
>>
>> I tried it but got "another command is already in progress" error.
>
> The documentation seems to leave zero ambiguity:
>
> After successfully calling PQsendQuery, call PQgetResult one or more times
> to obtain the results. PQsendQuery cannot be called again (on the same
> connection) until PQgetResult has returned a null pointer, indicating that
> the command is done.
>
> David J.
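A minimal sketch of that documented pattern in C - drain PQgetResult to NULL before issuing the next PQsendQuery on the same connection - using a placeholder connection string and throwaway queries:

/* Sketch only: drain PQgetResult until it returns NULL before the next
 * PQsendQuery on the same connection. DSN and queries are placeholders. */
#include <stdio.h>
#include <libpq-fe.h>

static void run_query(PGconn *conn, const char *sql)
{
    if (!PQsendQuery(conn, sql)) {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }
    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL) {   /* must drain to NULL */
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%d rows\n", PQntuples(res));
        PQclear(res);
    }
}

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");    /* placeholder DSN */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }
    run_query(conn, "SELECT 1");   /* query1 */
    run_query(conn, "SELECT 2");   /* query2: only after query1 is fully drained */
    PQfinish(conn);
    return 0;
}

Pipelining in the sense of the original question would mean issuing the second PQsendQuery before the first result is drained, which is exactly what this interface forbids.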
Re: libpq pipelining
> What about, as it says, sending multiple statements in a single sendQuery and
> then polling for multiple results?

I tried this, and even in single-row streaming mode, I found that there are cases where the results would not be streamed until all the queries were sent.

From the user's point of view, they may generate a loop sending multiple queries without caring about the results, so pipeline/batch processing is ideal to avoid a round trip per loop iteration; if database access is slow, this can be a significant source of latency.

Kind regards,
Samuel
Re: libpq pipelining
Here is a short example:

https://gist.github.com/ioquatix/2f08f78699418f65971035785c80cf18

It makes 10 queries in one "PQsendQuery" and sets single row mode, but all the results come back at once, as shown by the timestamps.

Next I'm planning to investigate streaming large recordsets to see if it works better/incrementally.
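For comparison, a minimal sketch of the single-row-mode consumption loop on the libpq side (placeholder query, connection handling omitted); in this mode each row arrives as its own PGresult with status PGRES_SINGLE_TUPLE, followed by a zero-row PGRES_TUPLES_OK result that marks the end of the set:

/* Sketch only: consume one query's result set in single-row mode. */
#include <stdio.h>
#include <libpq-fe.h>

void stream_rows(PGconn *conn)
{
    /* Placeholder query returning several rows. */
    if (!PQsendQuery(conn, "SELECT generate_series(1, 10)")) {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }
    if (!PQsetSingleRowMode(conn))
        fprintf(stderr, "could not enter single-row mode\n");

    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL) {
        switch (PQresultStatus(res)) {
        case PGRES_SINGLE_TUPLE:
            /* One PGresult per row in single-row mode. */
            printf("row: %s\n", PQgetvalue(res, 0, 0));
            break;
        case PGRES_TUPLES_OK:
            /* Zero-row result marking the end of the result set. */
            printf("-- query done --\n");
            break;
        default:
            fprintf(stderr, "error: %s", PQresultErrorMessage(res));
            break;
        }
        PQclear(res);
    }
}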
Re: libpq pipelining
I think libmariadb has a nicer interface for this.

Essentially what you do is send your query, and then read a result set (one result set per query), and then you stream individual rows using:

mysql_fetch_row_start
mysql_fetch_row_cont

Those methods don't seem to have an equivalent in libpq - you can use PQgetResult but it buffers all the rows. Using single row mode results in many results for each query (seems like a big overhead).

The difference between this and MariaDB is that mysql_fetch_row still operates within one logical set of results, but single row mode breaks the single logical set of results into lots of individual results.

Maybe the statement about efficiency is incorrect, but it would be nice if you could incrementally stream a single result set more easily.

On Sun, 28 Jun 2020 at 02:40, Tom Lane wrote:
>
> Samuel Williams writes:
> > Here is a short example:
> > https://gist.github.com/ioquatix/2f08f78699418f65971035785c80cf18
> > It makes 10 queries in one "PQsendQuery" and sets single row mode. But
> > all the results come back at once as shown by the timestamps.
>
> That looks to be less about what libpq will do than what the Ruby
> interface code will do.
>
> The volume of return data may also be an issue. I don't think the
> backend will flush data out to the client except when it (a) reaches
> an idle state or (b) fills the output buffer. Ten occurrences of
> a short query result aren't gonna be enough for (b) --- from memory,
> that buffer is probably 8KB.
>
> regards, tom lane
Re: libpq pipelining
Tom, I'm implementing a small abstraction layer for event-driven result streaming on PostgreSQL and MariaDB for Ruby, and I'll endeavor to report back with some numbers once I have enough of it working to benchmark something meaningful.

Thanks for your patience and help.

Kind regards,
Samuel

On Tue, 30 Jun 2020 at 02:06, Tom Lane wrote:
>
> Samuel Williams writes:
> > Those methods don't seem to have an equivalent in libpq - you can use
> > PQgetResult but it buffers all the rows. Using single row mode results
> > in many results for each query (seems like a big overhead).
>
> Have you got any actual evidence for that? Sure, the overhead is
> more than zero, but does it mean anything in comparison to the other
> costs of data transmission?
>
> > Maybe the statement about efficiency is incorrect, but it would be
> > nice if you could incrementally stream a single result set more
> > easily.
>
> More easily than what? If we did not construct a PGresult then we would
> need some other abstraction for access to the returned row, dealing with
> error cases, etc etc. That would mean a lot of very duplicative API code
> in libpq, and a painful bunch of adjustments in client code.
>
> regards, tom lane
Re: libpq pipelining
Hi,

Here are some initial numbers.

DB::Client
Warming up --------------------------------------
         db-postgres   281.000  i/100ms
          db-mariadb   399.000  i/100ms
              mysql2   533.000  i/100ms
                  pg   591.000  i/100ms
Calculating -------------------------------------
         db-postgres      2.725k (± 1.8%) i/s -     13.769k in   5.053750s
          db-mariadb      3.990k (± 2.4%) i/s -     19.950k in   5.002453s
              mysql2      5.153k (± 4.7%) i/s -     26.117k in   5.079570s
                  pg      5.772k (± 4.4%) i/s -     28.959k in   5.027423s

Comparison:
                  pg:     5771.7 i/s
              mysql2:     5152.8 i/s - 1.12x  (± 0.00) slower
          db-mariadb:     3990.3 i/s - 1.45x  (± 0.00) slower
         db-postgres:     2725.5 i/s - 2.12x  (± 0.00) slower

The db-* gems are event driven. However, that is of less interest right now.

This benchmark creates a table, inserts 1000 (or some other fixed number of) rows, and then selects them all back out.

What I noticed is that `PQgetvalue` and `PQgetisnull` are not particularly efficient, at least via FFI: reading a result requires rows * columns * 2 FFI calls. libmariadb provides `mysql_fetch_row`, which returns a `char **` per row and so requires only one FFI call per row.

Does a similar method exist for libpq? e.g. `PGgetrow(index) -> char**` (an array of strings, one for each column, which may be nil to indicate null).

Kind regards,
Samuel

On Tue, 30 Jun 2020 at 12:50, Samuel Williams wrote:
>
> Tom, I'm implementing a small abstraction layer for event-driven
> result streaming on PostgreSQL and MariaDB for Ruby, and I'll endeavor
> to report back with some numbers once I have enough of it working to
> benchmark something meaningful.
>
> Thanks for your patience and help.
>
> Kind regards,
> Samuel
>
> On Tue, 30 Jun 2020 at 02:06, Tom Lane wrote:
> >
> > Samuel Williams writes:
> > > Those methods don't seem to have an equivalent in libpq - you can use
> > > PQgetResult but it buffers all the rows. Using single row mode results
> > > in many results for each query (seems like a big overhead).
> >
> > Have you got any actual evidence for that? Sure, the overhead is
> > more than zero, but does it mean anything in comparison to the other
> > costs of data transmission?
> >
> > > Maybe the statement about efficiency is incorrect, but it would be
> > > nice if you could incrementally stream a single result set more
> > > easily.
> >
> > More easily than what? If we did not construct a PGresult then we would
> > need some other abstraction for access to the returned row, dealing with
> > error cases, etc etc. That would mean a lot of very duplicative API code
> > in libpq, and a painful bunch of adjustments in client code.
> >
> > regards, tom lane
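For context, the access pattern being described - one PQgetisnull plus one PQgetvalue call per cell - looks roughly like this in C (a sketch of the general pattern, not the gem's actual code):

/* Sketch: each cell costs one PQgetisnull call plus one PQgetvalue call,
 * i.e. rows * columns * 2 calls in total when driven through FFI. */
#include <stdio.h>
#include <libpq-fe.h>

void print_result(const PGresult *res)
{
    int rows = PQntuples(res);
    int cols = PQnfields(res);

    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            if (PQgetisnull(res, row, col))                 /* call 1 per cell */
                printf("NULL\t");
            else
                printf("%s\t", PQgetvalue(res, row, col));  /* call 2 per cell */
        }
        printf("\n");
    }
}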
Partial index on JSON column
Hello,

I have a table with ~3 billion events. Of this, there is a small subset of events which match the following partial index:

CREATE INDEX index_user_event_for_suggestion_notification ON public.user_event USING btree ((((parameters ->> 'suggestion_id'::text))::integer), what) WHERE ((parameters ->> 'suggestion_id'::text) IS NOT NULL);

When I do this query:

EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->> 'suggestion_id'::text)::integer = 26) AND what = 'suggestion_notification';

 Aggregate  (cost=7115959.48..7115959.49 rows=1 width=8)
   ->  Bitmap Heap Scan on user_event  (cost=37360.24..7115907.56 rows=20771 width=0)
         Recheck Cond: ((what)::text = 'suggestion_notification'::text)
         Filter: (((parameters ->> 'suggestion_id'::text))::integer = 26)
         ->  Bitmap Index Scan on index_user_event_for_clustering  (cost=0.00..37355.05 rows=4154273 width=0)
               Index Cond: ((what)::text = 'suggestion_notification'::text)

it's slow. I need to explicitly add the NULL constraint:

EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->> 'suggestion_id'::text)::integer = 26) AND ((parameters ->> 'suggestion_id'::text) IS NOT NULL) AND what = 'suggestion_notification';

 Aggregate  (cost=38871.48..38871.49 rows=1 width=8)
   ->  Index Scan using index_user_event_for_suggestion_notification on user_event  (cost=0.42..38819.81 rows=20668 width=0)
         Index Cond: ((((parameters ->> 'suggestion_id'::text))::integer = 26) AND ((what)::text = 'suggestion_notification'::text))

I feel like the NULL constraint should be implicit. That being said:

- Is my partial index wrong? Should I write it differently so the optimiser knows this?
- Is my query wrong? How can I make the most use of this index without being explicit?
- Any other suggestions for how I can do this?

Thanks
Samuel
Re: Partial index on JSON column
Thanks for the quick reply Tom, I will try your advice.

The reason why I used ::integer for the index is because I assumed it would be more efficient both in space and performance.

In the JSONB field, it is actually an integer, i.e. {"location_age": 1, "suggestion_id": 26}

So, now that I think about it, maybe the way I'm using ::text is wrong.

Any further advice is most appreciated.

Kind regards,
Samuel

On Wed, 20 Feb 2019 at 10:14, Tom Lane wrote:
> Samuel Williams writes:
> > When I do this query:
>
> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> > 'suggestion_id'::text)::integer = 26) AND what = 'suggestion_notification';
>
> > It's slow. I need to explicitly add the NULL constraint:
>
> Try it like
>
> EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>
> I don't think we assume that CoerceViaIO is strict, and without that
> the deduction that the value couldn't be null doesn't hold. In any
> case you're better off without the runtime type conversion: that
> isn't doing much for you except raising the odds of getting an error.
>
> regards, tom lane
Re: Partial index on JSON column
So, I checked my original code, and I didn't add `::text`; it was actually added by Postgres:

EXPLAIN for: SELECT "user_event".* FROM "user_event" WHERE "user_event"."what" = $1 AND (((parameters ->> 'suggestion_id')::integer) = 119 AND ((parameters ->> 'suggestion_id') IS NOT NULL)) [["what", "suggestion_notification"]]

                                                    QUERY PLAN
 Index Scan using index_user_event_for_suggestion_notification on user_event  (cost=0.42..77193.95 rows=20669 width=138)
   Index Cond: ((((parameters ->> 'suggestion_id'::text))::integer = 119) AND ((what)::text = 'suggestion_notification'::text))
(2 rows)

Is there some way to directly use the integer value in the index with minimal type coercions?

Thanks
Samuel

On Wed, 20 Feb 2019 at 10:24, Samuel Williams <space.ship.travel...@gmail.com> wrote:
> Thanks for the quick reply Tom,
>
> I will try your advice.
>
> The reason why I used ::integer for the index is because I assumed it
> would be more efficient both in space and performance.
>
> In the JSONB field, it is actually an integer, i.e. {"location_age": 1,
> "suggestion_id": 26}
>
> So, now that I think about it, maybe the way I'm using ::text is wrong.
>
> Any further advice is most appreciated.
>
> Kind regards,
> Samuel
>
> On Wed, 20 Feb 2019 at 10:14, Tom Lane wrote:
>
>> Samuel Williams writes:
>> > When I do this query:
>>
>> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
>> > 'suggestion_id'::text)::integer = 26) AND what = 'suggestion_notification';
>>
>> > It's slow. I need to explicitly add the NULL constraint:
>>
>> Try it like
>>
>> EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
>> 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>>
>> I don't think we assume that CoerceViaIO is strict, and without that
>> the deduction that the value couldn't be null doesn't hold. In any
>> case you're better off without the runtime type conversion: that
>> isn't doing much for you except raising the odds of getting an error.
>>
>> regards, tom lane
Re: Partial index on JSON column
Thanks Tom, I did solve the problem by adding the null constraint for now; it's a quick solution, and I look forward to the future where this case is handled appropriately.

On Wed, 20 Feb 2019 at 12:17, Tom Lane wrote:
> I wrote:
> > Try it like
>
> > EXPLAIN SELECT COUNT(*) FROM "user_event" WHERE ((parameters ->>
> > 'suggestion_id'::text) = '26') AND what = 'suggestion_notification';
>
> > I don't think we assume that CoerceViaIO is strict, and without that
> > the deduction that the value couldn't be null doesn't hold.
>
> Hmm ... on closer inspection, we do assume that CoerceViaIO is strict,
> but not everyplace knows that, so there's something that could be
> improved here. In the meantime, try it as stated above.
>
> regards, tom lane
libpq read/write
I've been doing some profiling and I was surprised to see that libpq uses epoll when handling what essentially amounts to blocking reads/writes.

https://github.com/postgres/postgres/blob/fc22b6623b6b3bab3cb057ccd282c2bfad1a0b30/src/backend/libpq/pqcomm.c#L207-L227
https://github.com/postgres/postgres/blob/97c39498e5ca9208d3de5a443a2282923619bf91/src/backend/libpq/be-secure.c#L163-L215

I was just wondering why it needed to be so complicated? What's wrong with just using read/write when blocking semantics are desired? You can still restart them if they are interrupted (which might not be the desired default behaviour).

The problem is, it's hard for me to measure "blocking" time when libpq uses epoll rather than blocking read/write.

I guess I'm not expecting to rewrite it or anything, just wondering why it's designed that way.

Kind regards,
Samuel
Re: libpq read/write
Tom, you seem to know everything related to Postgres, so thanks for your time and answers. I'm blown away by your dedication and knowledge.

Regarding PQisBusy and similar: even for "non-blocking" behaviour, you are essentially expecting the user to have their own event loop, and only invoke the relevant libpq functions when I/O is actually possible, right? e.g. in many cases, you'd set the socket to be non-blocking, and then just return to the user "I want to read more data".

What's actually stopping the implementation calling read/write directly? In the blocking case, that's correct behaviour. In the non-blocking case, I don't see how it's helpful to use epoll, because you should just return to the user right away.

Thanks
Samuel

On Sun, 31 Mar 2019 at 03:17, Tom Lane wrote:
> Samuel Williams writes:
> > I've been doing some profiling and I was surprised to see that libpq uses
> > epoll when handling what essentially amounts to blocking reads/writes.
>
> Yup.
>
> > I was just wondering why it needed to be so complicated?
>
> So that we can also support nonblocking behavior (cf PQisBusy).
>
> If the library were being written from scratch today, I doubt anybody
> would bother with that; it'd make more sense for an application to
> use a separate thread for the database interaction, if there were
> other things it needed to pay attention to concurrently. But it is
> what it is.
>
> regards, tom lane
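For reference, the non-blocking usage Tom alludes to with PQisBusy is driven by the caller's own event loop. A rough sketch, assuming a PQsendQuery has already been issued and using poll() merely as a stand-in for whatever event loop the application actually uses:

/* Sketch only: wait for readability in the caller's loop, feed bytes to
 * libpq with PQconsumeInput, and call PQgetResult only once PQisBusy says
 * it will not block. Error handling is minimal. */
#include <poll.h>
#include <stdio.h>
#include <libpq-fe.h>

PGresult *get_result_nonblocking(PGconn *conn)
{
    struct pollfd pfd = { .fd = PQsocket(conn), .events = POLLIN };

    while (PQisBusy(conn)) {
        /* Hand control back to "the event loop": wait for the socket. */
        if (poll(&pfd, 1, -1) < 0) {
            perror("poll");
            return NULL;
        }
        if (!PQconsumeInput(conn)) {      /* read whatever has arrived */
            fprintf(stderr, "%s", PQerrorMessage(conn));
            return NULL;
        }
    }
    return PQgetResult(conn);             /* will not block at this point */
}

The point of this shape is that libpq itself never has to wait: PQconsumeInput only reads data that has already arrived, and PQgetResult is only called once PQisBusy reports that a complete result is available.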