Re: DSA failed to allocate memory
> So it's OK for a segment to be in a bin that suggests that it has more
> consecutive free pages than it really does. But it's NOT ok for a
> segment to be in a bin that suggests it has fewer consecutive pages
> than it really does. If dsa_free() is putting things back into the
> wrong place, that's what we need to fix.

I'm trying to move segments into the appropriate bins in dsa_free(). In
0001-Re-bin-segment-when-dsa-memory-is-freed.patch, I extract the re-binning
logic into a separate function called rebin_segment and call it to move the
segment to the appropriate bin when DSA memory is freed. Otherwise, because a
segment with enough contiguous pages can be sitting in a bin that is too
small, a suitable segment may not be found when allocating memory.

For testing, I port the test_dsa patch from [1] and add an OOM case that
allocates memory until allocation fails, frees everything, allocates again,
and compares the amount of memory allocated before and after.

Any thoughts?

[1] https://www.postgresql.org/message-id/CAEepm%3D3U7%2BRo7%3DECeQuAZoeFXs8iDVX56NXGCV7z3%3D%2BH%2BWd0Sw%40mail.gmail.com

Attachments:
0001-port-test_dsa.patch
0001-Re-bin-segment-when-dsa-memory-is-freed.patch
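To give an idea of the shape without opening the patch, the helper could look
roughly like this. This is a sketch only, leaning on the existing dsa.c
internals (unlink_segment(), get_segment_by_index(),
contiguous_pages_to_segment_bin(), fpm_largest()); the attached patch is the
authoritative version and may differ in detail.

```c
/* context: src/backend/utils/mmgr/dsa.c (sketch, not the attached patch) */

/*
 * Move a segment to the bin that matches its current largest run of free
 * pages.  Caller must hold the area lock.
 */
static void
rebin_segment(dsa_area *area, dsa_segment_map *segment_map)
{
	size_t		new_bin;
	dsa_segment_index segment_index;

	/* Which bin matches the segment's current largest free run? */
	new_bin = contiguous_pages_to_segment_bin(fpm_largest(segment_map->fpm));
	if (segment_map->header->bin == new_bin)
		return;					/* already in the right bin */

	/* Remove the segment from the bin it is currently linked into. */
	unlink_segment(area, segment_map);

	/* Push it onto the front of its new bin. */
	segment_index = segment_map - &area->segment_maps[0];
	segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
	segment_map->header->next = area->control->segment_bins[new_bin];
	segment_map->header->bin = new_bin;
	area->control->segment_bins[new_bin] = segment_index;
	if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
	{
		dsa_segment_map *next;

		next = get_segment_by_index(area, segment_map->header->next);
		next->header->prev = segment_index;
	}
}

/*
 * dsa_free() would then call rebin_segment(area, segment_map) after the
 * freed pages have been returned to the segment's free page manager.
 */
```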
Problem with synchronous replication
Hi,

I recently discovered two possible bugs in synchronous replication.

1. SyncRepCleanupAtProcExit may delete an element that has already been deleted.

SyncRepCleanupAtProcExit first checks whether the queue is detached and, if it
is not, acquires SyncRepLock and deletes the element. If the element has
already been deleted by a walsender, it gets deleted a second time and
SHMQueueDelete crashes with a segmentation fault. IMO, like SyncRepCancelWait,
we should acquire SyncRepLock first and then check whether the queue is
detached.

2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport processes an interrupt.

In SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate
the wait with a suitable warning, as follows:

a. set QueryCancelPending to false
b. ereport outputs a warning
c. SyncRepCancelWait deletes the element from the queue

If another cancel interrupt arrives while we are outputting the warning in
step b, errfinish calls CHECK_FOR_INTERRUPTS, which may raise an ERROR such as
"canceling autovacuum task", and the process then jumps to the sigsetjmp
block. Unfortunately, step c is skipped and the element that SyncRepCancelWait
should have deleted remains in the queue.

The easiest way to fix this is to swap the order of steps b and c. On the
other hand, letting the sigsetjmp block clean up the queue may also be a good
choice. What do you think? A sketch of both changes follows below; the
attached patch is the actual proposal. Any feedback is greatly appreciated.

Best regards,
--
Dongming Liu

Attachment:
0001-Fix-two-bugs-for-synchronous-replication.patch
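To make the two proposals concrete, here is a rough sketch of the shape I have
in mind. It is abbreviated and not the patch itself: the second part is a
fragment from SyncRepWaitForLSN's wait loop, and the warning text there is
quoted from memory.

```c
/* context: src/backend/replication/syncrep.c (sketch, not the attached patch) */

/*
 * 1. Take SyncRepLock before testing whether we are still on the queue, so a
 *    concurrent walsender cannot remove us between the check and the delete.
 */
void
SyncRepCleanupAtProcExit(void)
{
	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
		SHMQueueDelete(&(MyProc->syncRepLinks));

	LWLockRelease(SyncRepLock);
}

/*
 * 2. In SyncRepWaitForLSN's wait loop: leave the queue before reporting, so a
 *    nested CHECK_FOR_INTERRUPTS inside errfinish cannot longjmp past the
 *    cleanup.
 */
		if (QueryCancelPending)
		{
			QueryCancelPending = false;
			SyncRepCancelWait();	/* step c now happens before step b */
			ereport(WARNING,
					(errmsg("canceling wait for synchronous replication due to user request"),
					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
			break;
		}
```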
Re: Problem with synchronous replication
Can someone help me to confirm that these two problems are bugs? If they are
bugs, please help review the patch or provide better fix suggestions. Thanks.

Best regards,
--
Dongming Liu

------------------------------------------------------------------
From: LIU Dongming
Sent At: 2019 Oct. 25 (Fri.) 15:18
To: pgsql-hackers
Subject: Problem with synchronous replication

> Hi,
>
> I recently discovered two possible bugs in synchronous replication.
>
> 1. SyncRepCleanupAtProcExit may delete an element that has already been
> deleted.
>
> SyncRepCleanupAtProcExit first checks whether the queue is detached and, if
> it is not, acquires SyncRepLock and deletes the element. If the element has
> already been deleted by a walsender, it gets deleted a second time and
> SHMQueueDelete crashes with a segmentation fault. IMO, like
> SyncRepCancelWait, we should acquire SyncRepLock first and then check
> whether the queue is detached.
>
> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport processes an
> interrupt.
>
> In SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate
> the wait with a suitable warning, as follows:
>
> a. set QueryCancelPending to false
> b. ereport outputs a warning
> c. SyncRepCancelWait deletes the element from the queue
>
> If another cancel interrupt arrives while we are outputting the warning in
> step b, errfinish calls CHECK_FOR_INTERRUPTS, which may raise an ERROR such
> as "canceling autovacuum task", and the process then jumps to the sigsetjmp
> block. Unfortunately, step c is skipped and the element that
> SyncRepCancelWait should have deleted remains in the queue.
>
> The easiest way to fix this is to swap the order of steps b and c. On the
> other hand, letting the sigsetjmp block clean up the queue may also be a
> good choice.
>
> What do you think? The patch is attached; any feedback is greatly
> appreciated.
>
> Best regards,
> --
> Dongming Liu
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 9:12 AM Michael Paquier wrote:
> On Tue, Oct 29, 2019 at 01:40:41PM +0800, Dongming Liu wrote:
> > Can someone help me to confirm that these two problems are bugs?
> > If they are bugs, please help review the patch or provide better fix
> > suggestions.
>
> There is no need to send periodic pings. Sometimes it takes time to
> get replies as time is an important resource that is always limited.

Thank you for your reply. I also realized my mistake, thank you for
correcting me.

> I can see that Horiguchi-san has already provided some feedback, and I
> am looking now at your suggestions.

Thanks again.
Re: DSA failed to allocate memory
On Fri, Mar 18, 2022 at 3:30 PM Dongming Liu wrote:
>> So it's OK for a segment to be in a bin that suggests that it has more
>> consecutive free pages than it really does. But it's NOT ok for a
>> segment to be in a bin that suggests it has fewer consecutive pages
>> than it really does. If dsa_free() is putting things back into the
>> wrong place, that's what we need to fix.
>
> I'm trying to move segments into the appropriate bins in dsa_free(). In
> 0001-Re-bin-segment-when-dsa-memory-is-freed.patch, I extract the
> re-binning logic into a separate function called rebin_segment and call it
> to move the segment to the appropriate bin when DSA memory is freed.
> Otherwise, because a segment with enough contiguous pages can be sitting in
> a bin that is too small, a suitable segment may not be found when
> allocating memory.
>
> For testing, I port the test_dsa patch from [1] and add an OOM case that
> allocates memory until allocation fails, frees everything, allocates again,
> and compares the amount of memory allocated before and after.
>
> Any thoughts?
>
> [1] https://www.postgresql.org/message-id/CAEepm%3D3U7%2BRo7%3DECeQuAZoeFXs8iDVX56NXGCV7z3%3D%2BH%2BWd0Sw%40mail.gmail.com

Fix rebin_segment not working on in-place DSA.

--
Best Regards,
Dongming

Attachments:
0001-Re-bin-segment-when-dsa-memory-is-freed.patch
0002-port-test_dsa.patch
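For anyone unfamiliar with the in-place variant: an in-place area keeps its
control data and first segment in caller-provided memory rather than in a DSM
segment created by dsa.c. A minimal sketch of how one is created follows; the
tranche id and sizes are placeholders, not code taken from the patch.

```c
#include "postgres.h"
#include "storage/shmem.h"
#include "utils/dsa.h"

/* placeholder tranche id; a real caller registers its own */
#define MY_DSA_TRANCHE_ID 1234

static dsa_area *
create_test_area_in_place(void)
{
	size_t		size = dsa_minimum_size() + 64 * 1024;
	void	   *place = ShmemAlloc(size);

	/*
	 * Segment 0 of such an area is not backed by any DSM segment; only
	 * segments added later are, which is what makes the in-place case worth
	 * covering when moving segments between bins.
	 */
	return dsa_create_in_place(place, size, MY_DSA_TRANCHE_ID, NULL);
}
```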
Re: DSA failed to allocate memory
On Mon, Mar 28, 2022 at 3:53 PM Thomas Munro wrote:
> Hi Dongming,
>
> Thanks for the report, and for working on the fix. Can you please
> create a commitfest entry (if you haven't already)? I plan to look at
> this soon, after the code freeze.

I created a commitfest entry: https://commitfest.postgresql.org/38/3607/.
Thanks for your review.

> Are you proposing that the test_dsa module should be added to the
> tree? If so, some trivial observations: "#ifndef
> HAVE_INT64_TIMESTAMP" isn't needed anymore (see commit b6aa17e0, which
> is in all supported branches), the year should be updated, and we use
> size_t instead of Size in new code.

Yes, I think test_dsa is very helpful and necessary for developing DSA-related
features. I have removed the HAVE_INT64_TIMESTAMP-related code. Most of the
code for test_dsa comes from your patch [1], and I added some test cases.

In addition, I added a few OOM test cases that allocate a fixed size of memory
until allocation fails, run that sequence twice, and compare the amount of
memory allocated each time. These cases fail on the current master branch.

[1] https://www.postgresql.org/message-id/CAEepm%3D3U7%2BRo7%3DECeQuAZoeFXs8iDVX56NXGCV7z3%3D%2BH%2BWd0Sw%40mail.gmail.com

--
Best Regards,
Dongming

Attachments:
0001-Re-bin-segment-when-dsa-memory-is-freed-v2.patch
0002-port-test_dsa-v2.patch
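For reference, the OOM cases have roughly this shape. This is an illustrative
sketch only: TEST_CHUNK_SIZE, TEST_MAX_CHUNKS and the function names are made
up here, and the attached test module is what actually runs.

```c
#include "postgres.h"
#include "utils/dsa.h"

#define TEST_CHUNK_SIZE 4096
#define TEST_MAX_CHUNKS 1024

/* Allocate fixed-size chunks until the size-limited area is exhausted. */
static int
fill_until_oom(dsa_area *area, dsa_pointer *chunks)
{
	int			n = 0;

	while (n < TEST_MAX_CHUNKS)
	{
		dsa_pointer p = dsa_allocate_extended(area, TEST_CHUNK_SIZE,
											  DSA_ALLOC_NO_OOM);

		if (!DsaPointerIsValid(p))
			break;				/* area is full */
		chunks[n++] = p;
	}
	return n;
}

static void
test_alloc_free_alloc(int tranche_id)
{
	dsa_area   *area = dsa_create(tranche_id);
	dsa_pointer chunks[TEST_MAX_CHUNKS];
	int			first,
				second;

	dsa_set_size_limit(area, 1024 * 1024);	/* 1 MB, as in the report */

	/* First pass: fill the area, then free everything. */
	first = fill_until_oom(area, chunks);
	for (int i = 0; i < first; i++)
		dsa_free(area, chunks[i]);

	/* Second pass: the same number of allocations should succeed. */
	second = fill_until_oom(area, chunks);
	for (int i = 0; i < second; i++)
		dsa_free(area, chunks[i]);

	if (first != second)
		elog(ERROR, "allocated %d chunks on first pass, %d after freeing",
			 first, second);

	dsa_detach(area);
}
```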
DSA failed to allocate memory
Hi,

I am using dynamic shared memory areas (DSA) to manage some variable-length
shared memory, and I've found that in some cases allocation fails even though
there are enough contiguous pages. The steps to reproduce are as follows:

1. create a dsa area with a 1MB DSM segment
2. set its size limit to 1MB
3. allocate 4KB of memory repeatedly until allocation fails
4. free all memory allocated in step 3
5. repeat step 3 and step 4

The first time I run step 3, 240 4KB allocations succeed. But after freeing
everything and allocating again, no memory can be allocated at all, even
though there are 252 contiguous free pages. IMO, this should not be expected
to happen, right?

The call stack is as follows:

#0 get_best_segment (area=0x200cc70, npages=16) at dsa.c:1972
#1 0x00b46b36 in ensure_active_superblock (area=0x200cc70, pool=0x7fa7b51f46f0, size_class=33) at dsa.c:1666
#2 0x00b46555 in alloc_object (area=0x200cc70, size_class=33) at dsa.c:1460
#3 0x00b44f05 in dsa_allocate_extended (area=0x200cc70, size=4096, flags=2) at dsa.c:795

I read the relevant code and found that get_best_segment re-bins the segment
into bin 4 during the first run of step 3. But when everything is freed and
step 3 runs again, get_best_segment searches from the first bin that *might*
have enough contiguous pages, as computed by contiguous_pages_to_segment_bin();
for a superblock of 16 pages, contiguous_pages_to_segment_bin() returns 5. So
the second time, get_best_segment searches bins 5 to 16, but the suitable
segment has been re-binned into bin 4, which is never checked. Finally,
get_best_segment returns NULL and dsa_allocate_extended returns an invalid
dsa pointer.

Maybe we can use one of the following methods to fix it:
1. re-bin the segment into the suitable bin when dsa_free is called
2. make get_best_segment search all bins

I wrote a simple test, attached, to reproduce it. Any thoughts?

--
Best Regards
Dongming (https://www.aliyun.com/)

Attachment:
0001-add-dsa-test-module.patch
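To spell the arithmetic out, here is a small standalone illustration of the
bin numbers involved. It is my own toy reimplementation of the bin
computation, not the dsa.c code; the numbers 252, 16, 5 and 4 are the ones
from the scenario above.

```c
#include <stdio.h>

/*
 * A segment whose longest free run is n pages belongs in bin
 * floor(log2(n)) + 1 (bin 0 for n == 0), and a request for n contiguous
 * pages starts searching at that same bin, because smaller bins cannot
 * guarantee n contiguous pages.
 */
static unsigned
pages_to_bin(unsigned n)
{
	unsigned	bin = 0;

	while (n > 0)
	{
		n >>= 1;
		bin++;
	}
	return bin;
}

int
main(void)
{
	/*
	 * During the first OOM pass the segment's longest free run was somewhere
	 * in the 8..15 page range, so it was filed under bin 4.
	 */
	printf("bin while nearly full: %u\n", pages_to_bin(15));	/* 4 */

	/*
	 * After freeing, the longest run is 252 pages again, which corresponds
	 * to bin 8, but nothing moves the segment out of bin 4.
	 */
	printf("bin it should be in:   %u\n", pages_to_bin(252));	/* 8 */

	/*
	 * A 16-page superblock request starts scanning at bin 5, so the segment
	 * sitting in bin 4 is never considered and the allocation fails.
	 */
	printf("first bin searched:    %u\n", pages_to_bin(16));	/* 5 */
	return 0;
}
```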