Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-08-14 Thread David Rowley
On Thu, 25 Jul 2019 at 05:49, Tom Lane  wrote:
> On the whole, I don't especially like this approach, because of the
> confusion between peak lock count and end-of-xact lock count.  That
> seems way too likely to cause problems.

Thanks for having a look at this.  I've not addressed the points
you've mentioned due to what you mention above.  The only way I can
think of so far to resolve that would be to add something to track
peak lock usage.  The best I can think of to do that, short of adding
something to dynahash.c, is to check how many locks are held each time
we obtain a lock, and if that count is higher than the previous time
we checked, update the maximum locks held (probably in a global
variable).  That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.
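
As a purely illustrative, standalone sketch of that idea (the names and
structure are made up, not actual PostgreSQL code), the high-water mark
would be bumped on every acquisition, which is exactly the per-lock
overhead in question:

#include <stdio.h>

/* Hypothetical sketch of the "track peak lock usage" idea described above;
 * plain globals stand in for the local lock table and the proposed global
 * variable. */
static long locks_held = 0;
static long peak_locks_held = 0;

static void
acquire_lock(void)
{
    locks_held++;                       /* stand-in for adding a local lock table entry */
    if (locks_held > peak_locks_held)   /* the extra comparison on every acquisition */
        peak_locks_held = locks_held;
}

static void
release_all_locks(void)
{
    locks_held = 0;                     /* end of transaction */
}

int
main(void)
{
    for (int i = 0; i < 5000; i++)
        acquire_lock();
    release_all_locks();

    /* peak_locks_held could then drive a decision to rebuild or shrink the
     * local lock table at end of transaction. */
    printf("peak locks held: %ld\n", peak_locks_held);
    return 0;
}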

I've not tested what Andres mentioned about simplehash instead of
dynahash yet. I did a quick scan of simplehash and it looked like
SH_START_ITERATE would suffer the same problems as dynahash's
hash_seq_search(), albeit perhaps with some more efficient memory
lookups, i.e. it still has to skip over empty buckets, which might be
costly in a bloated table.

For now, I'm out of ideas. If anyone else feels like suggesting
something or picking this up, feel free.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: progress report for ANALYZE

2019-08-14 Thread Etsuro Fujita
Hi,

On Tue, Aug 13, 2019 at 11:01 PM Alvaro Herrera
 wrote:
> On the subject of FDW support: I did look into supporting that before
> submitting this.  I think it's not academically difficult: just have the
> FDW's acquire_sample_rows callback invoke the update_param functions
> once in a while.  Sadly, in practical terms it looks like postgres_fdw
> is quite stupid about ANALYZE (it scans the whole table??) so doing
> something that's actually useful may not be so easy.  At least, we know
> the total relation size and maybe we can add the ctid column to the
> cursor in postgresAcquireSampleRowsFunc so that we have a current block
> number to report (being careful about synchronized seqscans).

I don't follow this thread fully, so I might miss something, but I
don't think that's fully applicable, because foreign tables managed by
postgres_fdw can be eg, views on the remote side.

> I do wonder why doesn't postgres_fdw use TABLESAMPLE.

Yeah, that's really what I'm thinking for PG13; but I think we would
still need to scan the whole table in some cases (eg, when the foreign
table is a view on the remote side), because the TABLESAMPLE clause can
only be applied to regular tables and materialized views.

> I did not look at other FDWs at all, mind.

IIUC, oracle_fdw already uses the SAMPLE BLOCK clause for that.  Right?

Best regards,
Etsuro Fujita




Re: use of the term "verifier" with SCRAM

2019-08-14 Thread Heikki Linnakangas

On 14/08/2019 08:59, Peter Eisentraut wrote:

I'm confused by how the code uses the term "verifier" in relation to SCRAM.

ISTM that the code uses the term as meaning whatever is or would be
stored in pg_authid.rolpassword.

I don't see this usage supported in the RFCs.  In RFC 5802,

 verifier = "v=" base64
 ;; base-64 encoded ServerSignature.

where

 ServerSignature := HMAC(ServerKey, AuthMessage)
 ServerKey   := HMAC(SaltedPassword, "Server Key")
 AuthMessage := client-first-message-bare + "," +
server-first-message + "," +
client-final-message-without-proof

whereas what is stored in rolpassword is

 SCRAM-SHA-256$<iteration count>:<salt>$<StoredKey>:<ServerKey>

where

 StoredKey   := H(ClientKey)
 ClientKey   := HMAC(SaltedPassword, "Client Key")

So while these are all related, I don't think it's accurate to call what
is in rolpassword a SCRAM "verifier".


Huh, you're right.


RFC 5803 is titled "Lightweight Directory Access Protocol (LDAP) Schema
for Storing Salted Challenge Response Authentication Mechanism (SCRAM)
Secrets".  Following that, I think calling the contents of rolpassword a
"secret" or a "stored secret" would be better.


RFC 5802 uses the term "Authentication information". See section "2.1 
Terminology":


   o  Authentication information: Information used to verify an identity
  claimed by a SCRAM client.  The authentication information for a
  SCRAM identity consists of salt, iteration count, "StoredKey" and
  "ServerKey" (as defined in the algorithm overview) for each
  supported cryptographic hash function.

But I agree that "secret", as used in RFC 5803, is better.
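
To make the relationship concrete, here is a rough standalone sketch
(assuming OpenSSL's libcrypto and made-up salt/iteration inputs; not
PostgreSQL code) of the quantities above: it derives SaltedPassword via
PBKDF2/HMAC-SHA-256 and then StoredKey and ServerKey, which together with
the salt and iteration count are what rolpassword holds, whereas the RFC
5802 "verifier" (ServerSignature) would additionally need the per-session
AuthMessage and so cannot be precomputed and stored:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/sha.h>

int
main(void)
{
    const char *password = "secret";
    const unsigned char salt[] = "0123456789abcdef";   /* real entries use a random, base64-encoded salt */
    int iterations = 4096;
    unsigned char salted[SHA256_DIGEST_LENGTH];
    unsigned char client_key[SHA256_DIGEST_LENGTH];
    unsigned char stored_key[SHA256_DIGEST_LENGTH];
    unsigned char server_key[SHA256_DIGEST_LENGTH];
    unsigned int len;

    /* SaltedPassword := Hi(password, salt, i), i.e. PBKDF2 with HMAC-SHA-256 */
    PKCS5_PBKDF2_HMAC(password, (int) strlen(password),
                      salt, (int) sizeof(salt) - 1, iterations,
                      EVP_sha256(), (int) sizeof(salted), salted);

    /* ClientKey := HMAC(SaltedPassword, "Client Key"); StoredKey := H(ClientKey) */
    HMAC(EVP_sha256(), salted, sizeof(salted),
         (const unsigned char *) "Client Key", strlen("Client Key"),
         client_key, &len);
    SHA256(client_key, sizeof(client_key), stored_key);

    /* ServerKey := HMAC(SaltedPassword, "Server Key") */
    HMAC(EVP_sha256(), salted, sizeof(salted),
         (const unsigned char *) "Server Key", strlen("Server Key"),
         server_key, &len);

    /* rolpassword stores iteration count, salt, StoredKey and ServerKey;
     * ServerSignature (the RFC "verifier") needs AuthMessage from a live
     * exchange, so it is not derivable here. */
    printf("StoredKey[0]=%02x ServerKey[0]=%02x\n", stored_key[0], server_key[0]);
    return 0;
}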

- Heikki




Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
>
> Hi,
>

> - I think there's two fairly fundamental, and related, problems with
>   the sequence outlined above:
>
>   - We can't search for target buffers to store undo data, while holding
> the "table" page content locked.
>
> The easy way to solve would be to require sites performing UNDO
> logging to acquire victim pages before even acquiring other content
> locks. Perhaps the better approach could be for the undo layer to
> hold onto a number of clean buffers, and to keep the last page in an
> already written to undo log pinned.
>
>   - We can't search for victim buffers for further undo data while
> already holding other undo pages content locked. Doing so means that
> while we're e.g. doing IO to clean out the new page, old undo data
> on the previous page can't be read.
>
> This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> pinning and locking buffers, it'd need to be split into two
> operations. One that acquires buffers and pins them, and one that
> locks them.  I think it's quite possible that the locking operation
> could just be delayed until InsertPreparedUndo().  But if we solve
> the above problem, most of this might already be solved.

Basically, that means
- the caller should call PrepareUndoInsert before acquiring the table
page content lock, right? Because PrepareUndoInsert just computes
the size, allocates the space and pins+locks the buffers, and for pinning
the buffers we must compute the size and allocate the space using the undo
storage layer.
- So basically, if we delay the lock till InsertPreparedUndo and call
PrepareUndoInsert before acquiring the table page content lock, this
problem is solved?

Although I haven't yet analyzed the AM-specific part, i.e. whether it's
always possible to call PrepareUndoInsert (basically getting all
the undo records ready) before taking the page content lock.  But I am sure
that won't be very difficult.

> - To me the current split between the packed and unpacked UNDO record
>   formats makes very little sense, the reasoning behind having them is
>   poorly if at all documented, results in extremely verbose code, and
>   isn't extensible.
>
>   When preparing to insert an undo record the in-buffer size is computed
>   with UndoRecordHeaderSize() (needs to know about all optional data)
>   from within PrepareUndoInsert() (which has a bunch of
>   additional knowledge about the record format). Then during insertion
>   InsertPreparedUndo(), first copies the UnpackedUndoRecord into an
>   UndoPackContext (again needing ...), and then, via InsertUndoData(),
>   copies that in small increments into the respective buffers (again
>   needing knowledge about the complete record format, two copies
>   even). Beside the code duplication, that also means the memory copies
>   are very inefficient, because they're all done in tiny increments,
>   multiple times.
>
>   When reading undo it's similar: UnpackUndoData(), again in small
>   chunks, reads the buffer data into an UndoPackContext (another full
>   copy of the unpacked record format). But then FinishUnpackUndo()
>   *again* copies all that data, into an actual UnpackedUndoRecord
>   (again, with a copy of the record format, albeit slightly different
>   looking).
>
>   I'm not convinced by Heikki's argument that we shouldn't have
>   structure within undo records. In my opinion that is a significant
>   weakness of how WAL was initially designed, and even after Heikki's
>   work, still is a problem.  But this isn't the right design either.
>
>   Even if were to stay with the current fixed record format, I think
>   the current code needs a substantial redesign:
>
>   - I think 'packing' during insertion needs to serialize into a char*
> allocation during PrepareUndoInsert

ok

> computing the size in parallel
> (or perhaps in InsertPreparedUndo, but probably not).  The size of
> the record should never be split across record boundaries
> (i.e. we'll leave that space unused if we otherwise would need to
> split the size).
I think before UndoRecordAllocate we need to detect whether the size of
the record would start on the last byte of the page, and if so allocate
one extra byte for the undo record.  Or always allocate one extra byte
for the undo record to handle this case.  And in FinalizeUndoAdvance,
only pass the size we have actually consumed.

> The actual insertion would be a maximally sized
> memcpy() (so we'd do as many memcpys as the buffer fits in, rather than
> one for each sub-type of a record).
>
> That allows to remove most of the duplicated knowledge of the record
> format, and makes insertions faster (by doing only large memcpys
> while holding exclusive content locks).
Right.
>
>   - When reading an undo record, the whole stage of UnpackUndoData()
> reading data into the UndoPackContext 

Re: Zedstore - compressed in-core columnar storage

2019-08-14 Thread Ashutosh Sharma
Hi Ashwin,

I tried playing around with the zedstore code a bit today and there
are a couple of questions that came to my mind.

1) Can zedstore tables be vacuumed? If yes, does VACUUM on a zedstore
table set the VM bits associated with it?

2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?

Further, I tried creating a zedstore table with a btree index on one of
its columns and loaded around 50 lakh (5 million) records into the table.
When the indexed column was scanned (with the enable_seqscan flag set to
off), it went for an IndexOnlyScan, and that took around 15-20 times longer
than an IndexOnlyScan on a heap table would, just because an IndexOnlyScan
in zedstore always goes to the heap as the visibility check fails.
However, the seqscan on the zedstore table is quite a bit faster than a
seqscan on a heap table because the time taken for I/O is much less in
the case of zedstore.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Tue, Jul 2, 2019 at 12:45 AM Ashwin Agrawal  wrote:
>
> On Thu, May 30, 2019 at 8:07 AM DEV_OPS  wrote:
>>
>>
>> it's really cool and very good progress,
>>
>> I'm interested in whether SIMD/JIT will be supported
>
>
> That's something outside of the Zedstore work directly, at least for now. The
> intent is to work with the current executor code, or enhance it only wherever
> needed. If the current executor code supports something, that would work for
> Zedstore. But any other enhancements to the executor will be a separate
> undertaking.




Re: accounting for memory used for BufFile during hash joins

2019-08-14 Thread Hubert Zhang
On Fri, Jul 12, 2019 at 1:16 AM Robert Haas  wrote:

> On Mon, May 6, 2019 at 9:49 PM Thomas Munro 
> wrote:
> > Stepping back a bit, I think there is something fishy about the way we
> > detect extreme skew.  Is that a factor in this case?  Right now we
> > wait until we have a batch that gets split into child batches
> > containing exactly 0% and 100% of the tuples before we give up.
> > Previously I had thought of that as merely a waste of time, but
> > clearly it's also a waste of unmetered memory.  Oops.
> >
> > I think our extreme skew detector should go off sooner, because
> > otherwise if you have N nicely distributed unique keys and also M
> > duplicates of one bad egg key that'll never fit in memory, we keep
> > repartitioning until none of the N keys fall into the batch containing
> > the key for the M duplicates before we give up!  You can use
> > balls-into-bins maths to figure out the number, but I think that means
> > we expect to keep splitting until we have N * some_constant batches,
> > and that's just silly and liable to create massive numbers of
> > partitions proportional to N, even though we're trying to solve a
> > problem with M.  In another thread I suggested we should stop when
> > (say) 95% of the tuples go to one child batch.  I'm not sure how you
> > pick the number.
>
> Another thing that is fishy about this is that we can't split a batch
> or a bucket without splitting them all.  Let's say that nbatches *
> nbuckets = 16 million. One bucket in one batch contains 90% of the
> tuples. Splitting *that* bucket might be a good idea if only 5% of the
> tuples end up moving, perhaps even if only 1% end up moving. But, if
> you have to double the total number of batches to get that benefit,
> it's a lot less compelling, because now you have to rescan the outer
> side more times.

> It seems to me that a good chunk of what's being proposed right now
> basically ignores the fact that we're not really responding to the
> skew in a very effective way.  Thomas wants to stop splitting all the
> buckets when splitting one of the buckets produces only a very small
> benefit rather than when it produces no benefit at all, but he's not
> asking why we're splitting all of the buckets in the first place.
> Tomas wants to slice the array of batches because there are so many of
> them, but why are there so many? As he said himself, "it gets to that
> many batches because some of the values are very common and we don't
> disable the growth earlier."  Realistically, I don't see how there can
> be so many batches that we can't even fit the metadata about the
> matches into memory unless we're unnecessarily creating a lot of
> little tiny batches that we don't really need.
>
>
+1 on Robert's suggestion. It's worth finding the root cause of the batch
explosion problem.
As Robert pointed out, "we can't split a batch without splitting them all".
In fact, the hybrid hash join algorithm should only split the overflowing
batch and avoid splitting the small batches that could be processed in
memory. The planner should calculate an initial batch number that ensures
the average-sized batch can be processed in memory given different data
distributions. The executor should split skewed batches one batch at a
time.

I will first show an example to help understand the batch explosion problem.
Suppose we are going to join R and S, and the planner calculates the initial
nbatch as 4.
In the first batch run, during the HJ_BUILD_HASHTABLE state we scan R, build
an in-memory hash table for batch 1, and spill the other tuples of R into
different batch files (R2-R4). During the HJ_NEED_NEW_OUTER and HJ_SCAN_BUCKET
states, we do two things: 1. if a tuple in S belongs to the current batch,
match it with the in-memory R1 and emit the result to the parent plan node;
2. if a tuple in S doesn't belong to the current batch, spill it to one of the
batch files S2-S4.  As a result, after the first batch run we get
6 disk files: batch2(R2,S2), batch3(R3,S3), batch4(R4,S4)

Now we enter the HJ_NEED_NEW_BATCH state and begin to process R2 and S2.
Suppose the second batch R2 is skewed and needs the batch number increased to 8.
When building the in-memory hash table for R2, we also split some tuples in R2
into spill file R6. (Based on our hash function, tuples belonging to R2 will
not be shuffled to any batch except R6.) After R2's hash table is built, we
begin to probe the tuples in S2. Since the batch number has changed from 4 to 8,
some tuples in S2 now belong to S6 and we spill them to disk file S6.
For the other tuples in S2, we match them with R2 and output the result to the
parent plan node. After the second batch is processed, we have these
disk files: batch3(R3,S3), batch4(R4,S4), batch6(R6,S6)
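
To see why the tuples of batch 2 can only end up in batch 2 or batch 6,
here is a small standalone sketch using a simplified batch assignment in
the spirit of ExecHashGetBucketAndBatch() in nodeHash.c (the real code
differs in detail): doubling a power-of-two nbatch consults just one more
hash bit, which also explains why the non-skewed batches described next
still get needlessly re-split.

#include <stdio.h>
#include <stdint.h>

/* Simplified batch assignment: low hash bits pick the bucket, the next
 * bits pick the batch, nbatch is a power of two.  Illustrative only. */
static int
batch_for(uint32_t hashvalue, int log2_nbuckets, int nbatch)
{
    return (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
}

int
main(void)
{
    int moved_to_6 = 0, stayed_in_2 = 0, elsewhere = 0;

    for (uint32_t h = 0; h < 65536; h++)
    {
        if (batch_for(h, 10, 4) != 2)
            continue;                       /* only look at tuples of batch 2 (R2/S2) */
        switch (batch_for(h, 10, 8))
        {
            case 2: stayed_in_2++; break;
            case 6: moved_to_6++; break;    /* 2 + old nbatch of 4 */
            default: elsewhere++; break;
        }
    }
    /* elsewhere is always 0: splitting batch 2 only ever produces batch 6 */
    printf("stayed in 2: %d, moved to 6: %d, elsewhere: %d\n",
           stayed_in_2, moved_to_6, elsewhere);
    return 0;
}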

Next, we begin to process R3 and S3. The third batch R3 is not skewed, but
our hash function depends on the batch number, which is now 8. So we have
to split some tuples in R3 to disk file R7, *which is not necessary*. When
probing S3, we also need to spill some tuples in S3 into S7, *which is not
necessary either*. Since R3 could be loade

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-14 Thread Antonin Houska
Bruce Momjian  wrote:

> On Sat, Aug 10, 2019 at 08:06:17AM -0400, Bruce Momjian wrote:
> > So, I just had an idea: if we use separate encryption keys for
> > heap/index and for WAL --- we already know we will have an offline tool
> > that can rotate the passphrase or encryption keys.  If we allow the
> > encryption keys to be rotated independently, we can create a standby,
> > and immediately rotate its heap/index encryption key.  We can then start
> > streaming replication.  When we promote the standby to primary, we can
> > then shut it down and rotate the WAL encryption key --- the new primary
> > would then have no shared keys with the old primary.
> 
> To help move this forward, I created a new wiki TDE section titled "TODO
> for Full-Cluster Encryption" and marked some unresolved items with
> question marks:
> 
>   
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption
> 
> I have also updated some of the other text to match conclusions we have
> made.
> 
> I know some of the items are done, but if we have agreement on moving
> forward, I can help with some of the missing code.  This looks doable
> for PG 13 if we start soon.

I can work on it right away but don't know where to start.

First, I think we should use a code repository to integrate [1] and [2]
instead of sending diffs back and forth. That would force us to resolve
conflicts soon and help to avoid duplicate work. The diffs would be created
only when we need to post the next patch version to pgsql-hackers for review,
otherwise the discussions of details can take place elsewhere.

A separate branch can be created for the Full-Cluster Encryption at some point
- there are probably multiple branching strategies.

The most difficult problem I see now regarding the collaboration is agreement
on the key management user interface. The Full-Cluster Encryption feature [1]
should not add configuration variables or even tools that the next, more
sophisticated version [2] deprecates immediately. Part of the problem is that
[2] puts all (key management related) interaction of postgres with the
environment into an external library. As I pointed out in my response to [2],
this will not work for frontend applications (e.g. pg_waldump). I think the
key management UI for [2] needs to be designed first even if PG 13 should
adopt only [1].

At least it should be clear how [2] will retrieve the master key because [1]
should not do it in a different way. (The GUC cluster_passphrase_command
mentioned in [3] seems viable, although I think [1] uses an approach which is
more convenient if the passphrase should be read from the console.) Rotation of
the master key is another thing that both versions of the feature should do in
the same way. And of course, the frontend applications need a consistent
approach too.

I'm not too happy to start another (potentially long) discussion in this
already huge thread, but I think the UI stuff belongs to the -hackers list
rather than to an offline discussion.


[1] https://commitfest.postgresql.org/23/2104/

[2] 
https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO%3D8N%3Dnc2xVZPB0d9e-VjJ%3DYaRnw%40mail.gmail.com

[3]
https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: use valgrind --leak-check=yes to detect memory leak

2019-08-14 Thread Tom Lane
Alex  writes:
> I want to use valgrind to detect memory leak issues.  Then I ran into 2
> problems I want to confirm here.

> 1.  In https://wiki.postgresql.org/wiki/Valgrind,   the wiki indicates to
> use '--leak-check=no'

That's just a sample configuration.

> 2.  In https://github.com/afiskon/pgscripts/blob/master/valgrind.sh#L55,
> it uses 'leak-check=no' as well, with the words "No point to check for memory
> leaks, Valgrind doesn't understand MemoryContexts and stuff"

That info is many years out-of-date.  You can do it with USE_VALGRIND,
and sometimes that's helpful, but ...

> Q2:   do we check for memory leaks in new commits, or can we ignore them
> given that we use memory contexts carefully?  If we want to check for memory
> leaks in new commits, how do we handle the existing memory leaks?

Generally, the philosophy in PG is to not bother with freeing data
explicitly if letting it be reclaimed at context deletion is good enough.
Sometimes that's not good enough, but it is in most places, and for that
reason plain valgrind leak checking is of limited use.

valgrind can be pretty helpful if you're trying to identify the origin
of a serious leak --- the kind that accumulates memory wastage
repetitively over a query, for example.  But what you have to do is
look for big leaks and ignore all the minor "leaks".

regards, tom lane




Re: Auxiliary Processes and MyAuxProc

2019-08-14 Thread Yuli Khodorkovskiy
On Mon, Feb 25, 2019 at 5:25 PM Mike Palmiotto
 wrote:
>
> On Mon, Feb 25, 2019 at 1:41 PM Mike Palmiotto
>  wrote:
> >
> > 
> > >
> > > If memory serves, StartChildProcess already was an attempt to unify
> > > the treatment of postmaster children.  It's possible that another
> > > round of unification would be productive, but I think you'll find
> > > that there are random small differences in requirements that'd
> > > make it messy.
> >
> > It kind of seemed like it, but I noticed the small differences in
> > requirements, which made me a bit hesitant. I'll go ahead and see what
> > I can do and submit the patch for consideration.
>
> I'm considering changing StartChildProcess to take a struct with data
> for forking/execing each different process. Each different backend
> type would build up the struct and then pass it on to
> StartChildProcess, which would handle each separately. This would
> ensure that the fork type is set prior to InitPostmasterChild and
> would provide us with the information necessary to do what we need in
> the InitPostmasterChild_hook.
>
> Attached is a patch to fork_process.h which shows roughly what I'm
> thinking. Does this seem somewhat sane as a first step?
>

All,

Mike and I have written two patches to solve the issues
discussed in this thread.

The first patch centralizes the startup of workers and extends
worker identification that was introduced by AuxProcType. The worker
id can then be leveraged by extensions for identification of each
process.

The second patch adds a new hook that allows extensions to modify a worker
process' metadata in backends.

These patches should make future worker implementation more
straightforward as there is now one function to call that sets up
each worker. There is also code cleanup and removal of startup
redundancies.

Please let me know your thoughts. I look forward to your feedback.

Thanks,

Yuli


Add-a-hook-to-allow-extensions-to-set-worker-metadat.patch
Description: Binary data


Refactor-child-process-startup.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-13 13:53:59 +0530, Dilip Kumar wrote:
> On Tue, Jul 30, 2019 at 1:32 PM Andres Freund  wrote:
> > > + /* Loop until we have fetched all the buffers in which we need to write. */
> > > + while (size > 0)
> > > + {
> > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL);
> > > + xact_info->idx_undo_buffers[index++] = bufidx;
> > > + size -= (BLCKSZ - starting_byte);
> > > + starting_byte = UndoLogBlockHeaderSize;
> > > + cur_blk++;
> > > + }
> >
> > So, this locks a very large number of undo buffers at the same time, do
> > I see that correctly?  What guarantees that there are no deadlocks due
> > to multiple buffers locked at the same time (I guess the order inside
> > the log)? What guarantees that this is a small enough number that we can
> > even lock all of them at the same time?
> 
> I think we are locking them in the block order and that should avoid
> the deadlock.  I have explained in the comments.

Sorry for harping on this so much: But please, please, *always* document
things like this *immediately*. This is among the most crucial things to
document. There shouldn't need to be a reviewer prodding you to do so
many months after the code has been written. For one you've likely
forgotten details by then, but more importantly dependencies on the
locking scheme will have crept into further places - if it's not well
thought through that can be hard to undo. And it wastes reviewer /
reader bandwidth.


> > Why do we need to lock all of them at the same time? That's not clear to
> > me.
> 
> Because this is called outside the critical section, so we keep all the
> buffers that we want to update inside the critical section locked for a
> single WAL record.

I don't understand this explanation. What does keeping the buffers
locked have to do with the critical section? As explained in a later
email, I think the current approach is not acceptable - but even without
those issues, I don't see why we couldn't just lock the buffers at a
later stage?




> > > + for (i = 0; i < context->nprepared_undo_buffer; i++)
> > > + {
> >
> > How large do we expect this to get at most?
> >
> In BeginUndoRecordInsert we are computing it
> 
> + /* Compute number of buffers. */
> + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO;

Since nprepared is variable, that doesn't really answer the question.



Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-13 17:05:27 +0530, Dilip Kumar wrote:
> On Mon, Aug 5, 2019 at 11:59 PM Andres Freund  wrote:
> > (as I was out of context due to dealing with bugs, I've switched to
> > looking at the current zheap/undoprocessing branch.)
> >
> > On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> > > +/*
> > > + * Insert a previously-prepared undo records.
> > > + *
> > > + * This function will write the actual undo record into the buffers which are
> > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  This
> > > + * step should be performed inside a critical section.
> > > + */
> >
> > Again, I think it's not ok to just assume you can lock an essentially
> > unbounded number of buffers. This seems almost guaranteed to result in
> > deadlocks. And there's limits on how many lwlocks one can hold etc.
> 
> I think for controlling that we need to put a limit on max prepared
> undo?  I am not sure of any other way of limiting the number of buffers,
> because we must lock all the buffers in which we are going to insert
> the undo record under one WAL-logged operation.

I heard that a number of times. But I still don't know why that'd
actually be true. Why would it not be sufficient to just lock the buffer
currently being written to, rather than all buffers? It'd require a bit
of care updating the official current "logical end" of a log, but
otherwise ought to not be particularly hard? Only one backend can extend
the log after all, and until the log is externally visibly extended,
nobody can read or write those buffers, no?


> >
> > As far as I can tell there's simply no deadlock avoidance scheme in use
> > here *at all*? I must be missing something.
> 
> We are always locking buffers in block order, so I am not sure how it
> can deadlock.  Am I missing something?

Do we really in all circumstances? Note that we update the transinfo
(and also progress) from earlier in the log. But my main point is more
that there's no documented deadlock avoidance scheme. Which imo means
there's none, because nobody will know to maintain it.



> > > + /*
> > > +  * During recovery, there might be some blocks which are already
> > > +  * deleted due to some discard command so we can just skip
> > > +  * inserting into those blocks.
> > > +  */
> > > + if (!BufferIsValid(buffer))
> > > + {
> > > + Assert(InRecovery);
> > > +
> > > + /*
> > > +  * Skip actual writing just update the context so that we have
> > > +  * write offset for inserting into next blocks.
> > > +  */
> > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > > + break;
> > > + }
> >
> > How exactly can this happen?
> 
> Suppose you insert one record for a transaction which is split across
> blocks 1 and 2.  Now, before these blocks actually go to disk, the
> transaction commits and becomes all-visible, and the undo logs are
> discarded.  It's possible that block 1 is completely discarded but
> block 2 is not, because it might have undo for the next transaction.
> Now, during recovery (FPW is off), if block 1 is missing but block 2 is
> there, we need to skip inserting undo for block 1 as it does not exist.

Hm. I'm quite doubtful this is a good idea. How will this not force us
to emit a lot more expensive durable operations while writing undo?
And doesn't this reduce error detection quite remarkably?

Thomas, Robert?





> > > + /* Read the undo record. */
> > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);
> > > +
> > > + /* Release the discard lock after fetching the record. */
> > > + if (!InHotStandby)
> > > + LWLockRelease(&slot->discard_lock);
> > > + }
> > > + else
> > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);
> >
> >
> > And then we do none of this in !one_page mode.
> UndoBulkFetchRecord is always called for an aborted transaction, so
> its undo can never get discarded concurrently; so ideally we don't
> need to check for discard.

That's an undocumented assumption. Why would anybody reading the
interface know that?



> > > +static uint16
> > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> > > +  UndoLogCategory category)
> > > +{
> > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp);
> > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp);
> > > + Buff

Re: POC: Cleaning up orphaned files using undo logs

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> > - I think there's two fairly fundamental, and related, problems with
> >   the sequence outlined above:
> >
> >   - We can't search for target buffers to store undo data, while holding
> > the "table" page content locked.
> >
> > The easy way to solve would be to require sites performing UNDO
> > logging to acquire victim pages before even acquiring other content
> > locks. Perhaps the better approach could be for the undo layer to
> > hold onto a number of clean buffers, and to keep the last page in an
> > already written to undo log pinned.
> >
> >   - We can't search for victim buffers for further undo data while
> > already holding other undo pages content locked. Doing so means that
> > while we're e.g. doing IO to clean out the new page, old undo data
> > on the previous page can't be read.
> >
> > This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> > pinning and locking buffers, it'd need to be split into two
> > operations. One that acquires buffers and pins them, and one that
> > locks them.  I think it's quite possible that the locking operation
> > could just be delayed until InsertPreparedUndo().  But if we solve
> > the above problem, most of this might already be solved.
> 
> Basically, that means
> - the caller should call PrepareUndoInsert before acquiring the table
> page content lock, right? Because PrepareUndoInsert just computes
> the size, allocates the space and pins+locks the buffers, and for pinning
> the buffers we must compute the size and allocate the space using the undo
> storage layer.

I don't think we can normally pin the undo buffers properly at that
stage. Without knowing the correct contents of the table page - which we
can't know without holding some form of lock preventing modifications -
we can't know how big our undo records are going to be. And we can't
just have buffers that don't exist on disk in shared memory, and we
don't want to allocate undo that we then don't need. So I think what
we'd have to do at that stage, is to "pre-allocate" buffers for the
maximum amount of UNDO needed, but mark the associated bufferdesc as not
yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
would not be set.

So at the start of a function that will need to insert undo we'd need to
pre-reserve the maximum number of buffers we could potentially
need. That reservation stage would

a) pin the page with the current end of the undo
b) if needed pin the page of older undo that we need to update (e.g. to
   update the next pointer)
c) perform clock sweep etc to acquire (find or create) enough clean buffers to
   hold the maximum amount of undo needed. These buffers would be marked
   as !BM_TAG_VALID | BUF_REFCOUNT_ONE.

I assume that we'd make a) cheap by keeping it pinned for undo logs that
a backend is actively attached to. b) should only be needed once in a
transaction, so it's not too bad. c) we'd probably need to amortize
across multiple undo insertions, by keeping the unused buffers pinned
until the end of the transaction.

I assume that having the infrastructure for c) might also make some code
already in postgres easier. There's obviously some issues around
guaranteeing that the maximum number of such buffers isn't high.


> - So basically, if we delay the lock till InsertPreparedUndo and call
> PrepareUndoInsert before acquiring the table page content lock, this
> problem is solved?
> 
> Although I haven't yet analyzed the AM-specific part, i.e. whether it's
> always possible to call PrepareUndoInsert (basically getting all
> the undo records ready) before taking the page content lock.  But I am sure
> that won't be very difficult.

I think that is somewhere between not possible, and so expensive in a
lot of cases that we'd not want to do it anyway. You'd at least have to
first acquire a content lock on the page, mark the target tuple as
locked, then unlock the page, reserve undo, lock the table page,
actually update it.


> >   - When reading an undo record, the whole stage of UnpackUndoData()
> > reading data into the UndoPackContext is omitted, reading directly
> > into the UnpackedUndoRecord. That removes one further copy of the
> > record format.
> So we will read member by member into the UnpackedUndoRecord?  Because in
> the context we have at least a few headers packed and we can memcpy one
> header at a time, like UndoRecordHeader, UndoRecordBlock.

Well, right now you then copy them again later, so not much is gained by
that (although that later copy can happen without the content lock
held). As I think I suggested before, I suspect that the best way would
be to just memcpy() the data from the page(s) into an appropriately
sized buffer with the content lock held, and then perform unpacking
directly into UnpackedUndoRecord. Especially with the bulk API that will
avoid having to do m

Re: Zedstore - compressed in-core columnar storage

2019-08-14 Thread Ashwin Agrawal
 On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma 
wrote:

> Hi Ashwin,
>
> I tried playing around with the zedstore code a bit today and there
> are a couple of questions that came to my mind.
>

Great! Thank You.


>
> 1) Can zedstore tables be vacuumed? If yes, does VACUUM on a zedstore
> table set the VM bits associated with it?
>

Zedstore tables can be vacuumed. On vacuum, minimal work is performed
compared to heap: the full table is not scanned.  Only the UNDO log is
truncated/discarded based on RecentGlobalXmin. Plus, only the TidTree or
meta column is scanned to find dead tuples, and index entries are cleaned
for them based on the same.

Currently, zedstore does not use the VM at all, so it doesn't touch it
during any operation.

> 2) Is there a chance that IndexOnlyScan would ever be required for
> zedstore tables considering the design approach taken for it?
>

We have not given much thought to IndexOnlyScans so far. But I think
an IndexOnlyScan definitely would be beneficial for zedstore as
well. Even for normal index scans, fetching as many columns as
possible from the index itself and only getting the rest of the required
columns from the table would be good for zedstore. It would help to further
cut down IO. Ideally, for visibility checking only the TidTree needs to be
scanned and visibility checked with the same, so the cost of checking
is much lower compared to heap (if the VM can't be consulted), but it is
still a cost. Also, with vacuum, if the UNDO log gets trimmed, the
visibility checks are pretty cheap. Still, given all that, having a VM-type
thing to optimize this further would help.


> Further, I tried creating a zedstore table with a btree index on one of
> its columns and loaded around 50 lakh (5 million) records into the table.
> When the indexed column was scanned (with the enable_seqscan flag set to
> off), it went for an IndexOnlyScan, and that took around 15-20 times longer
> than an IndexOnlyScan on a heap table would, just because an IndexOnlyScan
> in zedstore always goes to the heap as the visibility check fails.
> However, the seqscan on the zedstore table is quite a bit faster than a
> seqscan on a heap table because the time taken for I/O is much less in
> the case of zedstore.
>

Thanks for reporting, we will look into it. Should be able to optimize
it. Given no VM exists, IndexOnlyScans currently for zedstore behave
more or less like IndexScans. Planner picks IndexOnlyScans for
zedstore, mostly due to off values for reltuples, relpages, and
relallvisible.

We have been focused on implementing and optimizing the AM pieces. So,
not much work has been done on planner estimates and tuning yet. The
first step for that, getting the needed columns in the planner
instead of the executor, is proposed in [1]. Once that bakes, we will use
it to perform more planner estimates and the like. Also, analyze
needs work to properly reflect reltuples and relpages to influence the
planner correctly.


[1]
https://www.postgresql.org/message-id/CAAKRu_ZQ0Jy7LfZDCY0JdxChdpja9rf-S8Y5%2BU4vX7cYJd62dA%40mail.gmail.com


Re: Feature: Use DNS SRV records for connecting

2019-08-14 Thread Andres Freund
Hi,

On 2019-08-13 10:43:07 -0400, Tom Lane wrote:
> How would we get at that data without writing our own DNS client?
> (AFAIK, our existing DNS interactions are all handled by getnameinfo()
> or other library-supplied functions.)

> Maybe that'd be worth doing, but it sounds like a lot of work and a
> lot of new code to maintain, relative to the value of the feature.

It might have enough independent advantages to make it worthwhile
though.

Right now our non-blocking interfaces aren't actually non-blocking in a
number of cases, due to name resolution being blocking. While that's
documented, it imo means that our users need to use a non-blocking DNS
library if they need a non-blocking PQconnectPoll() - it's imo not that
practical to just use IPs in most cases.

We also don't have particularly good control over the order of hostnames
returned by getaddrinfo, which makes it harder to implement reliable
round-robin etc.

Greetings,

Andres Freund




Re: proposal: make NOTIFY list de-duplication optional

2019-08-14 Thread Tom Lane
I wrote:
> I think that we'd probably be better off fixing the root performance issue
> than adding semantic complexity to bypass it. ...
> Accordingly, I looked into making a hash table when there are more than
> a small number of notifications pending, and attached is a lightly-tested
> version of that.  This seems to be more or less similar speed to the
> existing code for up to 100 or so distinct notifies, but it soon pulls
> away above that.

I noticed that the cfbot was unhappy with this, because it (intentionally)
changes the results of the async-notify isolation tests I added awhile
ago.  So here's an updated version that adjusts that test, and also
changes the NOTIFY documentation to remove the old weasel wording about
whether we de-dup or not.

> I also noticed that as things stand, it costs us two or three pallocs to
> construct a Notification event.  It wouldn't be terribly hard to reduce
> that to one palloc, and maybe it'd be worthwhile if we're thinking that
> transactions with many many notifies are a case worth optimizing.
> But I didn't do that here; it seems like a separable patch.

I also did that, attached as the second patch below.  This way ends up
requiring us to palloc the Notification event and then pfree it again,
if it turns out to be a dup.  Despite that, it's faster than the first
patch alone, and also faster than HEAD in every case I tried.  Not
much faster, if there's not a lot of dups, but as far as I can find
there isn't any case where it loses compared to HEAD.  Even with
subtransactions, where in principle the time to merge subtransaction
event lists into the parent transaction ought to cost us.  You can't
get that to matter unless the subtransaction had a lot of distinct
events, and then HEAD hits its O(N^2) behavior inside the subxact.

So I can't really see any reason not to commit these.

That leaves the question of whether we want to continue pursuing
the proposed feature for user control of de-duping.  I'd tend
to vote against, because it seems like semantic complexity we
don't need.  While the idea sounds straightforward, I think it
isn't so much when you start to think hard about how notifies
issued with and without "collapse" ought to interact.
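
For reference, here is a rough standalone sketch of the de-duplication
strategy the patch's comments describe (an order-preserving list, a linear
duplicate check while it is short, and a hash table built lazily past a
threshold).  The names, sizes and threshold below are made up for
illustration; this is not the patch's code, which uses dynahash and
MIN_HASHABLE_NOTIFIES.

#include <stdio.h>
#include <string.h>

#define HASH_THRESHOLD 4        /* MIN_HASHABLE_NOTIFIES plays this role; 4 here to exercise both paths */
#define TABLE_SIZE 1024         /* fixed size keeps the sketch short */

static const char *events[4096];                /* events in order of first occurrence */
static int nevents = 0;
static const char *table[TABLE_SIZE];           /* open addressing, NULL = empty */
static int table_built = 0;

static unsigned
hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

static int
table_lookup_or_insert(const char *ev)          /* returns 1 if ev was already present */
{
    unsigned i = hash_str(ev) % TABLE_SIZE;
    while (table[i] != NULL)
    {
        if (strcmp(table[i], ev) == 0)
            return 1;
        i = (i + 1) % TABLE_SIZE;
    }
    table[i] = ev;
    return 0;
}

static void
add_event(const char *ev)
{
    if (!table_built)
    {
        for (int i = 0; i < nevents; i++)       /* cheap linear probe while the list is small */
            if (strcmp(events[i], ev) == 0)
                return;
    }
    else if (table_lookup_or_insert(ev))
        return;

    events[nevents++] = ev;

    if (!table_built && nevents >= HASH_THRESHOLD)
    {
        for (int i = 0; i < nevents; i++)       /* build the table lazily, once */
            table_lookup_or_insert(events[i]);
        table_built = 1;
    }
}

int
main(void)
{
    const char *input[] = {"a", "b", "a", "c", "b", "d"};
    for (int i = 0; i < 6; i++)
        add_event(input[i]);
    for (int i = 0; i < nevents; i++)
        printf("%s ", events[i]);               /* prints: a b c d */
    printf("\n");
    return 0;
}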

regards, tom lane

diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index e0e125a..d7dcbea 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -94,9 +94,9 @@ NOTIFY channel [ , 
 
   
-   If the same channel name is signaled multiple times from the same
-   transaction with identical payload strings, the
-   database server can decide to deliver a single notification only.
+   If the same channel name is signaled multiple times with identical
+   payload strings within the same transaction, only one instance of the
+   notification event is delivered to listeners.
On the other hand, notifications with distinct payload strings will
always be delivered as distinct notifications. Similarly, notifications from
different transactions will never get folded into one notification.
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 6e9c580..3f5f054 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -135,6 +135,7 @@
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
+#include "utils/hashutils.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/snapmgr.h"
@@ -323,14 +324,25 @@ static List *upperPendingActions = NIL; /* list of upper-xact lists */
 
 /*
  * State for outbound notifies consists of a list of all channels+payloads
- * NOTIFYed in the current transaction. We do not actually perform a NOTIFY
- * until and unless the transaction commits.  pendingNotifies is NIL if no
- * NOTIFYs have been done in the current transaction.
+ * NOTIFYed in the current transaction.  We do not actually perform a NOTIFY
+ * until and unless the transaction commits.  pendingNotifies is NULL if no
+ * NOTIFYs have been done in the current (sub) transaction.
+ *
+ * We discard duplicate notify events issued in the same transaction.
+ * Hence, in addition to the list proper (which we need to track the order
+ * of the events, since we guarantee to deliver them in order), we build a
+ * hash table which we can probe to detect duplicates.  Since building the
+ * hash table is somewhat expensive, we do so only once we have at least
+ * MIN_HASHABLE_NOTIFIES events queued in the current (sub) transaction;
+ * before that we just scan the events linearly.
  *
  * The list is kept in CurTransactionContext.  In subtransactions, each
  * subtransaction has its own list in its own CurTransactionContext, but
- * successful subtransactions attach their lists to their parent's list.
- * Failed subtransactions simply discard their lists.
+ * successful subtransactions add their entries to their parent's list.
+ * Failed subtransactions simply discard their l

Re: "long" type is not appropriate for counting tuples

2019-08-14 Thread Andres Freund
Hi,

On 2019-05-22 16:40:28 +0200, Peter Eisentraut wrote:
> On 2019-04-29 19:52, Andres Freund wrote:
> > Hm. It appears that gettext supports expanding PRId64 PRIu64 etc in
> > translated strings.
> 
> That won't work in non-GNU gettext.

Which of those do we actually support? We already depend on
bind_textdomain_codeset, which afaict wasn't present in older gettext
implementations.

- Andres




Re: getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-08-14 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Jul-22, Alvaro Herrera wrote:
>> After looking at the code some more, I think calling the new function in
>> the Prep phase is correct.  The attached patch is pretty much final form
>> for this bugfix.  I decided to unwrap a couple of error messages (I did
>> get bitten while grepping because of this), and reordered one of the new
>> Identity command cases in ATPrepCmd since it appeared in inconsistent
>> order in that one place of four.

> Pushed to all three branches.

This is still listed as a live issue in

https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Live_issues

Should that be closed now?

regards, tom lane




Re: Removing unneeded downlink field from nbtree stack struct

2019-08-14 Thread Peter Geoghegan
On Mon, Aug 12, 2019 at 9:43 AM Anastasia Lubennikova
 wrote:
> The refactoring is clear, so I set Ready for committer status.
> I have just a couple of notes about comments:
>
> 1) I think that it's worth adding an explanation of the case when we use
> the right sibling to this comment:
> +* stack to work back up to the parent page.  We use the child block
> +* number (or possibly the block number of a page to its right)

That appears over _bt_getstackbuf().

> 2) It took me quite some time to understand why the page deletion case
> doesn't need a lock.
> I propose to add something like "For more see comments for
> _bt_lock_branch_parent()" to this line:

I ended up removing the reference to page deletion here (actually, I
removed the general discussion about the need to keep the child page
locked). This seemed like something that was really up to the callers.

Pushed a version with that change. Thanks for the review!

-- 
Peter Geoghegan




Re: getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-08-14 Thread Alvaro Herrera
On 2019-Aug-14, Tom Lane wrote:

> Alvaro Herrera  writes:
> > On 2019-Jul-22, Alvaro Herrera wrote:
> >> After looking at the code some more, I think calling the new function in
> >> the Prep phase is correct.  The attached patch is pretty much final form
> >> for this bugfix.  I decided to unwrap a couple of error messages (I did
> >> get bitten while grepping because of this), and reordered one of the new
> >> Identity command cases in ATPrepCmd since it appeared in inconsistent
> >> order in that one place of four.
> 
> > Pushed to all three branches.
> 
> This is still listed as a live issue in
> 
> https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Live_issues
> 
> Should that be closed now?

Yep, done, thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-14 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Perhaps we could put some of these details into the Notes section of the
>> ALTER SYSTEM ref page.  But I wonder how much of this is needed at all.

> I'd be alright with that too, but I'd be just as fine with even a README
> or something that we feel other hackers and external tool developers
> would be likely to find.  I agree that all of this isn't something that
> your run-of-the-mill DBA needs to know, but they are things that I'm
> sure external tool authors will care about (including myself, David S,
> probably the other backup/restore tool maintainers, and at least the
> author of pg_conftool, presumably).

In hopes of moving this along, I've pushed Ian's last code change,
as there seems to be no real argument about that anymore.

As for the doc changes, how about the attached revision of what
I wrote previously?  It gives some passing mention to what ALTER
SYSTEM will do, without belaboring it or going into things that
are really implementation details.

As an example of the sort of implementation detail that I *don't*
want to document, I invite you to experiment with the difference
between
ALTER SYSTEM SET TimeZone = 'America/New_York';
ALTER SYSTEM SET "TimeZone" = 'America/New_York';

regards, tom lane

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cdc30fa..e99482d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -153,6 +153,8 @@ shared_buffers = 128MB
  identifiers or numbers must be single-quoted.  To embed a single
  quote in a parameter value, write either two quotes (preferred)
  or backslash-quote.
+ If the file contains multiple entries for the same parameter,
+ all but the last one are ignored.
 
 
 
@@ -185,18 +187,29 @@ shared_buffers = 128MB
  In addition to postgresql.conf,
  a PostgreSQL data directory contains a file
  postgresql.auto.conf,
- which has the same format as postgresql.conf but should
- never be edited manually.  This file holds settings provided through
- the ALTER SYSTEM command.  This file is automatically
- read whenever postgresql.conf is, and its settings take
- effect in the same way.  Settings in postgresql.auto.conf
- override those in postgresql.conf.
+ which has the same format as postgresql.conf but
+ is intended to be edited automatically not manually.  This file holds
+ settings provided through the ALTER SYSTEM command.
+ This file is read whenever postgresql.conf is,
+ and its settings take effect in the same way.  Settings
+ in postgresql.auto.conf override those
+ in postgresql.conf.
+
+
+
+ External tools may also
+ modify postgresql.auto.conf.  It is not
+ recommended to do this while the server is running, since a
+ concurrent ALTER SYSTEM command could overwrite
+ such changes.  Such tools might simply append new settings to the end,
+ or they might choose to remove duplicate settings and/or comments
+ (as ALTER SYSTEM will).
 
 
 
  The system view
  pg_file_settings
- can be helpful for pre-testing changes to the configuration file, or for
+ can be helpful for pre-testing changes to the configuration files, or for
  diagnosing problems if a SIGHUP signal did not have the
  desired effects.
 


Improve Assert output

2019-08-14 Thread Peter Eisentraut
This kind of output is usually not helpful:

TRAP: BadArgument("((context) != ((void *)0) && (const
Node*)((context)))->type) == T_AllocSetContext) || const
Node*)((context)))->type) == T_SlabContext) || const
Node*)((context)))->type) == T_GenerationContext)))", File:
"../../../../src/include/utils/memutils.h", Line: 129)

What we probably want is something like:

TRAP: BadArgument("MemoryContextIsValid(context)", File:
"../../../../src/include/utils/memutils.h", Line: 129)

The problem is that the way the Assert macros are written they
macro-expand the arguments before stringifying them.  The attached patch
fixes that.  This requires both replacing CppAsString by plain "#" and
not redirecting Assert() to Trap().
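
A tiny standalone example of the preprocessor behavior involved: the #
operator does not expand its operand, but an argument that merely passes
through an intermediate macro level (as Assert's condition passes through
Trap) gets expanded first.  The macro names below are made up for the
illustration.

#include <stdio.h>

#define STRINGIFY_DIRECT(x)   #x                    /* # suppresses expansion of x */
#define STRINGIFY_INDIRECT(x) STRINGIFY_DIRECT(x)   /* extra macro level expands x first */

#define IS_VALID(p) ((p) != NULL)                   /* stand-in for MemoryContextIsValid() */

int
main(void)
{
    puts(STRINGIFY_DIRECT(IS_VALID(ctx)));      /* prints: IS_VALID(ctx) */
    puts(STRINGIFY_INDIRECT(IS_VALID(ctx)));    /* prints the expansion, e.g. ((ctx) != ((void *)0)) */
    return 0;
}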

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7db60e612e1d26c7a1859948594c28dc3b54353e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 14 Aug 2019 22:23:34 +0200
Subject: [PATCH] Improve Assert output

If an assertion expression contained a macro, the failed assertion
message would print the expanded macro, which is usually unhelpful and
confusing.  Restructure the Assert macros to not expand any macros
when constructing the failure message.
---
 src/include/c.h | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index 2a082afab1..f461628a24 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -755,7 +755,7 @@ typedef NameData *Name;
 #define Trap(condition, errorType) \
do { \
if (condition) \
-   ExceptionalCondition(CppAsString(condition), (errorType), \
+   ExceptionalCondition(#condition, (errorType), \
 __FILE__, __LINE__); \
} while (0)
 
@@ -768,20 +768,34 @@ typedef NameData *Name;
  */
 #define TrapMacro(condition, errorType) \
((bool) (! (condition) || \
-(ExceptionalCondition(CppAsString(condition), (errorType), \
+(ExceptionalCondition(#condition, (errorType), \
   __FILE__, __LINE__), 0)))
 
 #define Assert(condition) \
-   Trap(!(condition), "FailedAssertion")
+   do { \
+   if (!(condition)) \
+   ExceptionalCondition(#condition, "FailedAssertion", \
+__FILE__, __LINE__); \
+   } while (0)
 
 #define AssertMacro(condition) \
-   ((void) TrapMacro(!(condition), "FailedAssertion"))
+   ((void) ((condition) || \
+(ExceptionalCondition(#condition, "FailedAssertion", \
+  __FILE__, __LINE__), 0)))
 
 #define AssertArg(condition) \
-   Trap(!(condition), "BadArgument")
+   do { \
+   if (!(condition)) \
+   ExceptionalCondition(#condition, "BadArgument", \
+__FILE__, __LINE__); \
+   } while (0)
 
 #define AssertState(condition) \
-   Trap(!(condition), "BadState")
+   do { \
+   if (!(condition)) \
+   ExceptionalCondition(#condition, "BadState", \
+__FILE__, __LINE__); \
+   } while (0)
 
 /*
  * Check that `ptr' is `bndr' aligned.
-- 
2.22.0



Re: Improve Assert output

2019-08-14 Thread Tom Lane
Peter Eisentraut  writes:
> This kind of output is usually not helpful:
> TRAP: BadArgument("((context) != ((void *)0) && (const
> Node*)((context)))->type) == T_AllocSetContext) || const
> Node*)((context)))->type) == T_SlabContext) || const
> Node*)((context)))->type) == T_GenerationContext)))", File:
> "../../../../src/include/utils/memutils.h", Line: 129)

> What we probably want is something like:

> TRAP: BadArgument("MemoryContextIsValid(context)", File:
> "../../../../src/include/utils/memutils.h", Line: 129)

+1, that would be a big improvement.  The other thing that this
is fixing is that the existing output for Assert et al shows
the *inverted* condition, which I for one always found confusing.

I didn't try to test the patch, but it passes eyeball examination.

regards, tom lane




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-14 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Perhaps we could put some of these details into the Notes section of the
> >> ALTER SYSTEM ref page.  But I wonder how much of this is needed at all.
> 
> > I'd be alright with that too, but I'd be just as fine with even a README
> > or something that we feel other hackers and external tool developers
> > would be likely to find.  I agree that all of this isn't something that
> > your run-of-the-mill DBA needs to know, but they are things that I'm
> > sure external tool authors will care about (including myself, David S,
> > probably the other backup/restore tool maintainers, and at least the
> > author of pg_conftool, presumably).
> 
> In hopes of moving this along, I've pushed Ian's last code change,
> as there seems to be no real argument about that anymore.
> 
> As for the doc changes, how about the attached revision of what
> I wrote previously?  It gives some passing mention to what ALTER
> SYSTEM will do, without belaboring it or going into things that
> are really implementation details.

It's certainly better than what we have now.

> As an example of the sort of implementation detail that I *don't*
> want to document, I invite you to experiment with the difference
> between
>   ALTER SYSTEM SET TimeZone = 'America/New_York';
>   ALTER SYSTEM SET "TimeZone" = 'America/New_York';

Implementation details and file formats / acceptable transformations
are naturally different things: a given implementation may sort things
one way or another, but if there's no requirement that the file be sorted,
then that's just fine and can be an implementation detail, possibly based
around how duplicates are dealt with.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Change ereport level for QueuePartitionConstraintValidation

2019-08-14 Thread Alvaro Herrera
On 2019-Jul-23, David Rowley wrote:

> I don't know my way around the tap tests that well, but I started to
> look at this and ended up a bit stuck on where the test should be
> located.  I see src/test/modules/brin has some brin related tests, so
> I thought that src/test/modules/alter_table might be the spot, but
> after looking at src/test/README I see it mentions that only tests
> that are themselves an extension should be located within:
> 
> modules/
>   Extensions used only or mainly for test purposes, generally not suitable
>   for installing in production databases
> 
> There are a few others in the same situation as brin; commit_ts,
> snapshot_too_old, unsafe_tests.   I see unsafe_tests does mention the
> lack of module in the README file.

The readme in src/test/modules says "extensions or libraries", and I see
no reason to think that a TAP test would be totally out of place there.
I think the alter_table/ subdir is a perfect place.

Sergei, can we enlist you to submit a patch for this?  Namely reduce the
log level to DEBUG1 and add a TAP test in src/test/modules/alter_table/
that verifies that the message is or isn't emitted, as appropriate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: progress report for ANALYZE

2019-08-14 Thread Alvaro Herrera
On 2019-Aug-14, Etsuro Fujita wrote:

> On Tue, Aug 13, 2019 at 11:01 PM Alvaro Herrera
>  wrote:
> > On the subject of FDW support: I did look into supporting that before
> > submitting this.  I think it's not academically difficult: just have the
> > FDW's acquire_sample_rows callback invoke the update_param functions
> > once in a while.  Sadly, in practical terms it looks like postgres_fdw
> > is quite stupid about ANALYZE (it scans the whole table??) so doing
> > something that's actually useful may not be so easy.  At least, we know
> > the total relation size and maybe we can add the ctid column to the
> > cursor in postgresAcquireSampleRowsFunc so that we have a current block
> > number to report (being careful about synchronized seqscans).
> 
> I don't follow this thread fully, so I might miss something, but I
> don't think that's fully applicable, because foreign tables managed by
> postgres_fdw can be eg, views on the remote side.

Oh, hmm, well I guess that covers the tables and matviews then ... I'm
not sure there's a good way to cover foreign tables that are views or
other stuff.  Maybe that'll have to do.  But at least it covers more
cases than none.
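
To make the acquire_sample_rows idea above a bit more concrete, here is a
hedged sketch; the parameter index name is invented for illustration, and
the real column layout is whatever the patch ends up defining:

#include "postgres.h"
#include "pgstat.h"

/* assumed index, for illustration only */
#define PROGRESS_ANALYZE_SAMPLE_ROWS_FETCHED    3

/*
 * Something an FDW's acquire_sample_rows callback could call every few
 * thousand remote rows to keep pg_stat_progress_analyze moving.
 */
static void
report_remote_sample_progress(int64 rows_fetched)
{
    if (rows_fetched % 1000 == 0)
        pgstat_progress_update_param(PROGRESS_ANALYZE_SAMPLE_ROWS_FETCHED,
                                     rows_fetched);
}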

> > I do wonder why doesn't postgres_fdw use TABLESAMPLE.
> 
> Yeah, that's really what I'm thinking for PG13; but I think we would
> still need to scan the whole table in some cases (eg, when the foreign
> table is a view on the remote side), because the TABLESAMPLE clause can
> only be applied to regular tables and materialized views.

Sure.

> > I did not look at other FDWs at all, mind.
> 
> IIUC, oracle_fdw already uses the SAMPLE BLOCK clause for that.  Right?

Yeah, it does that; I checked oracle_fdw specifically this morning.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Don't like getObjectDescription results for pg_amop/pg_amproc

2019-08-14 Thread Tom Lane
I notice that for a pg_amop object, getObjectDescription does this:

/*--
   translator: %d is the operator strategy (a number), the
   first two %s's are data type names, the third %s is the
   description of the operator family, and the last %s is the
   textual form of the operator with arguments.  */
appendStringInfo(&buffer, _("operator %d (%s, %s) of %s: %s"),
 amopForm->amopstrategy,
 format_type_be(amopForm->amoplefttype),
 format_type_be(amopForm->amoprighttype),
 opfam.data,
 format_operator(amopForm->amopopr));

This might seem all right in isolation, but it produces completely horrid
results as soon as you plug it into some larger message.  For example,

contrib_regression=# alter operator family gin__int_ops using gin drop operator 8 (integer[],integer[]);
ERROR:  cannot drop operator 8 (integer[], integer[]) of operator family gin__int_ops for access method gin: <@(integer[],integer[]) because operator class gin__int_ops for access method gin requires it
HINT:  You can drop operator class gin__int_ops for access method gin instead.

The colon seems like it ought to introduce a subsidiary sentence, but
it does not, and the reader is led off into the weeds trying to figure
out what connects to what.

I follow the point of trying to show the actual operator name, but
we gotta work harder on the presentation.  Perhaps this would work:

appendStringInfo(&buffer, _("operator %d (%s, %s) (that is, %s) of %s"),
 amopForm->amopstrategy,
 format_type_be(amopForm->amoplefttype),
 format_type_be(amopForm->amoprighttype),
 format_operator(amopForm->amopopr),
 opfam.data);

leading to

ERROR:  cannot drop operator 8 (integer[], integer[]) (that is, <@(integer[],integer[])) of operator family gin__int_ops for access method gin because operator class gin__int_ops for access method gin requires it

Likewise for pg_amproc entries, of course.

Or maybe we're just being too ambitious here and we should discard some of
this information.  I'm not really sure that the format_operator result
can be included without complete loss of intelligibility.

Thoughts?  I'm particularly unclear on how any of this might translate
into other languages, though I doubt that the current text is giving
good guidance to translators.

regards, tom lane




Re: Add "password_protocol" connection parameter to libpq

2019-08-14 Thread Jeff Davis
On Wed, 2019-08-14 at 11:38 +0900, Michael Paquier wrote:
> What I got in mind was a comma-separated list of authorized protocols
> which can be specified as a connection parameter, which extends to
> all
> the types of AUTH_REQ requests that libpq can understand, plus an
> extra for channel binding.  I also liked the idea mentioned upthread
> of "any" to be an alias to authorize everything, which should be the
> default.  So you basically get at that:
> auth_protocol = {any,password,md5,scram-sha-256,scram-sha-256-plus,krb5,gss,sspi}

What about something corresponding to the auth methods "trust" and
"cert", where no authentication request is sent? That's a funny case,
because the server trusts the client; but that doesn't imply that the
client trusts the server.

This is another reason I don't really like the list. It's impossible to
make it cleanly map to the auth methods, and there are a few ways it
could be confusing to the users.

Given that we all pretty much agree on the need for the separate
channel_binding param, the question is whether we want to (a) address
additional use cases with specific parameters that also justify
themselves; or (b) have a generic list that is supposed to solve many
future use cases.

I vote (a). With (b), the generic list is likely to cause more
confusion, ugliness, and clients that break needlessly in the future.
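
For illustration, a client under option (a) would only ever need something
like the sketch below; channel_binding is the parameter being proposed in
this thread (not an existing libpq option), and the host/database names are
made up:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* refuse to proceed unless the server proves key possession via
     * channel binding (proposed parameter, name not final) */
    PGconn     *conn = PQconnectdb("host=db.example.com dbname=app "
                                   "user=alice channel_binding=require");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    PQfinish(conn);
    return EXIT_SUCCESS;
}

Under option (b) the same intent would presumably be spelled as something
like auth_protocol=scram-sha-256-plus, which is exactly where the mapping
questions above come in.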

Regards,
Jeff Davis






Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-14 Thread Bruce Momjian
On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> I can work on it right away but don't know where to start.

I think the big open question is whether there will be acceptance of an
all-cluster encryption feature.  I guess if no one objects, we can move
forward.

> First, I think we should use a code repository to integrate [1] and [2]
> instead of sending diffs back and forth. That would force us to resolve
> conflicts soon and help to avoid duplicate work. The diffs would be created
> only when we need to post the next patch version to pgsql-hackers for review,
> otherwise the discussions of details can take place elsewhere.

Well, we can do that, or just follow the TODO list and apply items as we
complete them.  We have found that doing everything in one big patch is
just too hard to review and get accepted.

> The most difficult problem I see now regarding the collaboration is agreement
> on the key management user interface. The Full-Cluster Encryption feature [1]
> should not add configuration variables or even tools that the next, more
> sophisticated version [2] deprecates immediately. Part of the problem is that

Yes, the all-cluster encryption feature has _no_ SQL-level API to
control it, just a GUC variable that you can use SHOW to see the
encryption mode.

> [2] puts all (key management related) interaction of postgres with the
> environment into an external library. As I pointed out in my response to [2],
> this will not work for frontend applications (e.g. pg_waldump). I think the
> key management UI for [2] needs to be designed first even if PG 13 should
> adopt only [1].

I think there are several directions we can go after all-cluster
encryption, and it does matter because we would want minimal API
breakage.  The options are:

1)  Allow per-table encryption control to limit encryption overhead,
though all of WAL still needs to be encrypted;  we could add a
per-record encryption flag to WAL records to avoid that.

2)  Allow user-controlled keys, which are always unlocked, and encrypt
WAL with one key

3)  Encrypt only the user-data portion of pages with user-controlled
keys.  FREEZE and crash recovery work since only the user data is
encrypted.  WAL is not encrypted, except for the user-data portion

I think once we implement all-cluster encryption, there will be little
value to #1 unless we find that page encryption is a big performance
hit, which I think is unlikely based on performance tests so far.

I don't think #2 has much value since the keys have to always be
unlocked to allow freeze and crash recovery.

I don't think #3 is viable since there is too much information leakage,
particularly for indexes because the tid is visible.

Now, if someone says they still want 2 & 3, which has happened many
times, explain how these issues can be reasonably addressed.

I frankly think we will implement all-cluster encryption, and nothing
else.  I think the next big encryption feature after that will be
client-side encryption support, which can be done now but is complex; 
it needs to be easier.

> At least it should be clear how [2] will retrieve the master key because [1]
> should not do it in a differnt way. (The GUC cluster_passphrase_command
> mentioned in [3] seems viable, although I think [1] uses approach which is
> more convenient if the passphrase should be read from console.) Rotation of
> the master key is another thing that both versions of the feature should do in
> the same way. And of course, the frontend applications need a consistent approach
> too.

I don't see the value of an external library for key storage.

> I'm not too happy to start another (potentially long) discussion in this
> already huge thread, but I think the UI stuff belongs to the -hackers list
> rather than to an offline discussion.

I think the big question is whether we do anything, or just decide we
can't agree and stop.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: progress report for ANALYZE

2019-08-14 Thread Tatsuro Yamada

Hi Alvaro,

On 2019/08/13 23:01, Alvaro Herrera wrote:

Hello,

On 2019-Jul-03, Tatsuro Yamada wrote:


My ex-colleague Vinayak created the same kind of patch in 2017 [1], and he
couldn't get it committed for several reasons, one being that the
patch couldn't handle analyzing foreign tables. Therefore, I wonder
whether your patch is able to do that or not.



[1] ANALYZE command progress checker
https://www.postgresql.org/message-id/flat/968b4eda-2417-8b7b-d468-71643cf088b6%40openscg.com#574488592fcc9708c38fa44b0dae9006


So just now I went to check the old thread (which I should have
searched for before writing my own implementation!).  It seems clear
that many things are pretty similar in both patches, but I think for the
most part they are similar just because the underlying infrastructure
imposes a certain design already, rather than there being any actual
copying.  (To be perfectly clear: I didn't even know that this patch
existed and I didn't grab any code from there to produce my v1.)



I know your patch was not based on Vinayak's old patch because
the coding style is different between his and yours.

 

However, you've now modified the patch from what I submitted and I'm
wondering if you've taken any inspiration from Vinayak's old patch.  If
so, it seems fair to credit him as co-author in the commit message.  It
would be good to get his input on the current patch, though.



Yeah, I'd be happy if you added his name as a co-author, because I used
the documentation included in his old patch as a reference. :)

 

I have not looked at the current version of the patch yet, but I intend
to do so during the upcoming commitfest.

Thanks for moving this forward!



Thanks!
Getting the patch committed in PG 13 would make me happy because progress
reporting features are important for DBAs. :)



On the subject of FDW support: I did look into supporting that before
submitting this.  I think it's not academically difficult: just have the
FDW's acquire_sample_rows callback invoke the update_param functions
once in a while.  Sadly, in practical terms it looks like postgres_fdw



Actually, I've changed my mind.
Even if there is no FDW support, progress reporting for ANALYZE is still useful.
Therefore, FDW support would be preferable but not required for committing
the patch, I believe. :)


Thanks,
Tatsuro Yamada







Re: Don't like getObjectDescription results for pg_amop/pg_amproc

2019-08-14 Thread Alexander Korotkov
On Thu, Aug 15, 2019 at 2:08 AM Tom Lane  wrote:
> Or maybe we're just being too ambitious here and we should discard some of
> this information.  I'm not really sure that the format_operator result
> can be included without complete loss of intelligibility.
>
> Thoughts?  I'm particularly unclear on how any of this might translate
> into other languages, though I doubt that the current text is giving
> good guidance to translators.

Can the left and right types of pg_amop differ from those of pg_operator?
 It probably could for domains, any* types or something.  But for
builtin opclasses they always match.

# select * from pg_amop amop join pg_operator op on op.oid =
amop.amopopr where amop.amoplefttype != op.oprleft or
amop.amoprighttype != op.oprright;
(0 rows)

Could we discard one pair of types from output?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] CLUSTER command progress monitor

2019-08-14 Thread Tatsuro Yamada

Hi Michael, Alvaro and Robert!

On 2019/08/14 11:52, Michael Paquier wrote:

On Wed, Aug 14, 2019 at 11:38:01AM +0900, Tatsuro Yamada wrote:

On 2019/08/13 14:40, Tatsuro Yamada wrote:

On 2019/08/02 3:43, Alvaro Herrera wrote:

Hmm, I'm trying this out now and I don't see the index_rebuild_count
ever go up.  I think it's because the indexes are built using parallel
index build ... or maybe it was the table AM changes that moved things
around, not sure.  There's a period at the end when the CLUSTER command
keeps working, but it's gone from pg_stat_progress_cluster.


Thanks for your report.
I'll investigate it. :)


I did "git bisect" and found the commit:

  03f9e5cba0ee1633af4abe734504df50af46fbd8
  Report progress of REINDEX operations


I am adding an open item for this one.
--
Michael



Okay, I checked it on the wiki.

  https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
  - index_rebuild_count in CLUSTER reporting never increments

To be clear, 03f9e5cb broke CLUSTER progress reporting; I investigated a
little more and will share my ideas for fixing the problem.

* Call stack

cluster_rel
  pgstat_progress_start_command(CLUSTER) *A1
  rebuild_relation
finish_heap_swap
  reindex_relation
reindex_index
  pgstat_progress_start_command(CREATE_INDEX) *B1
  pgstat_progress_end_command() *B2
pgstat_progress_update_param(INDEX_REBUILD_COUNT, i) <- failed :(
  pgstat_progress_end_command() *A2

  Note
These are sets:
  A1 and A2,
  B1 and B2



* Ideas to fix
  There are three options, I guess.

  1. Call pgstat_progress_start_command(CLUSTER) again
 before pgstat_progress_update_param(INDEX_REBUILD_COUNT, i).

  2. Add "save and restore" functions for the following two
 variables of MyBeentry in pgstat.c.
- st_progress_command
- st_progress_command_target

  3. Use Hash or List to store multiple values for the two
 variables in pgstat.c.



I tried 1 and it showed index_rebuild_count, but it also showed the
"initializing" phase again and the other columns were empty. So it is
not suitable as a fix for the problem. :(
I'm going to try 2 and 3, but unfortunately I can't get enough
time to do that after PGConf.Asia 2019.
If we select 3, it affects the following progress reporting
features: VACUUM, CLUSTER, CREATE_INDEX and ANALYZE. But that's okay,
I suppose. Any comments are welcome! :)
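
For what it's worth, below is a rough sketch of what idea 2 could look
like in pgstat.c.  The function names pgstat_progress_save_state() and
pgstat_progress_restore_state() are assumptions for illustration, not an
existing API.

/*
 * Rough sketch only: save and restore the command-level progress state so
 * that a nested command (REINDEX inside CLUSTER) does not wipe it out.
 */
static ProgressCommandType save_progress_command = PROGRESS_COMMAND_INVALID;
static Oid  save_progress_command_target = InvalidOid;

void
pgstat_progress_save_state(void)        /* assumed name */
{
    volatile PgBackendStatus *beentry = MyBEEntry;

    if (!beentry)
        return;

    save_progress_command = beentry->st_progress_command;
    save_progress_command_target = beentry->st_progress_command_target;
}

void
pgstat_progress_restore_state(void)     /* assumed name */
{
    volatile PgBackendStatus *beentry = MyBEEntry;

    if (!beentry)
        return;

    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    beentry->st_progress_command = save_progress_command;
    beentry->st_progress_command_target = save_progress_command_target;
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}

The idea would be for reindex_index() to save the caller's state before its
own pgstat_progress_start_command(CREATE_INDEX) and restore it after
pgstat_progress_end_command(), so that the later INDEX_REBUILD_COUNT update
still finds the CLUSTER entry (B1/B2 in the call stack above).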


Thanks,
Tatsuro Yamada






Re: Why is infinite_recurse test suddenly failing?

2019-08-14 Thread Tom Lane
Thomas Munro  writes:
> Here's another crash like that.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38
> 2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
> was terminated by signal 11: Segmentation fault
> 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
> running: select infinite_recurse();

It occurred to me to scrape the buildfarm database for these failures,
and what I got was

   sysname    |    branch     |      snapshot       |      stage      | data                                                                                                           |   architecture
--------------+---------------+---------------------+-----------------+----------------------------------------------------------------------------------------------------------------+------------------
 demoiselle   | HEAD          | 2019-04-27 14:55:52 | pg_upgradeCheck | 2019-04-27 15:00:42.736 UTC [1457:66] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le (POWER9)
 buri         | HEAD          | 2019-04-27 23:54:46 | Check           | 2019-04-28 00:01:49.794 UTC [3041:66] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le (POWER9)
 takin        | HEAD          | 2019-05-01 08:16:48 | pg_upgradeCheck | 2019-05-01 08:23:27.159 UTC [32303:59] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le
 bonito       | HEAD          | 2019-05-01 23:05:36 | Check           | 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER9)
 shoveler     | HEAD          | 2019-05-10 14:04:34 | Check           | 2019-05-10 14:11:26.833 UTC [13456:73] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER8)
 demoiselle   | HEAD          | 2019-05-19 14:22:23 | pg_upgradeCheck | 2019-05-19 14:26:17.002 UTC [23275:80] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER9)
 vulpes       | HEAD          | 2019-06-15 09:16:45 | pg_upgradeCheck | 2019-06-15 09:22:22.268 UTC [4885:77] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le
 ayu          | HEAD          | 2019-06-19 22:13:23 | pg_upgradeCheck | 2019-06-19 22:18:16.805 UTC [2708:71] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le (POWER8)
 quokka       | HEAD          | 2019-07-10 14:20:13 | pg_upgradeCheck | 2019-07-10 15:24:06.102 BST [5d25f4fb.2644:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 cavefish     | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le (POWER9)
 pintail      | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le (POWER9)
 bonito       | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER9)
 takin        | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le
 quokka       | HEAD          | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 elasmobranch | HEAD          | 2019-08-01 03:13:38 | Check           | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER9)
 buri         | HEAD          | 2019-08-02 00:10:23 | Check           | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le (POWER9)
 urocryon     | HEAD          | 2019-08-02 05:43:46 | Check           | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL:  Failed process was running: select infinite_recurse();         | ppc64le
 batfish      | HEAD          | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL:  Failed process was running: select infinite_recurse();        | ppc64le
 buri         | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL:  Failed process was running: select infinite_recurse();          | ppc64le (POWER9)
 quokka       | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 quokka       | HEAD          | 2019-08-08

Extension development

2019-08-14 Thread Yonatan Misgan
Hello, I am trying to develop a calendar extension for PostgreSQL, but I am
having difficulty getting the day, month and year from the PostgreSQL source
code: when I read the source code, it uses DateADT as a data type, and
DateADT holds only the total number of days. So how can I get the day, month
or year only? For example, the code below is the PostgreSQL source code that
returns the current date.
/*
 * GetSQLCurrentDate -- implements CURRENT_DATE
 */
DateADT
GetSQLCurrentDate(void)
{
    TimestampTz ts;
    struct pg_tm tt,
               *tm = &tt;
    fsec_t      fsec;
    int         tz;

    ts = GetCurrentTransactionStartTimestamp();

    if (timestamp2tm(ts, &tz, tm, &fsec, NULL, NULL) != 0)
        ereport(ERROR,
                (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                 errmsg("timestamp out of range")));

    return date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) - POSTGRES_EPOCH_JDATE;
}
From this source code, how can I get only the year, to convert it to my own
calendar's year?  I need this because the Ethiopian calendar differs totally
from GC in terms of day, month and year.
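
For what it's worth, a minimal sketch of the decomposition step, assuming
this runs as server-side C code inside your extension (j2date() is the
inverse of the date2j() call used above):

#include "postgres.h"
#include "utils/date.h"
#include "utils/datetime.h"

/*
 * Split a DateADT (days counted from 2000-01-01) back into its Gregorian
 * year, month and day, which can then be converted to the Ethiopian calendar.
 */
static void
gregorian_parts_from_date(DateADT date, int *year, int *month, int *day)
{
    /* add the Postgres epoch back to get an absolute Julian day number */
    j2date(date + POSTGRES_EPOCH_JDATE, year, month, day);
}

Calling this on the value returned by GetSQLCurrentDate() gives today's
Gregorian year, month and day, which the extension can then map to the
Ethiopian date.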


Regards,

Yonathan Misgan
Assistant Lecturer, @ Debre Tabor University
Faculty of Technology
Department of Computer Science
Studying MSc in Computer Science (in Data and Web Engineering)
@ Addis Ababa University
E-mail: yona...@dtu.edu.et
yonathanmisga...@gmail.com
Tel:   (+251)-911180185 (mob)