> On Jun 25, 2015, at 2:50 PM, Marcus Gelderie <[email protected]> wrote:
> 
> Hey all,
> 
> answers in text below...
> 
> TLDR: I can remove the QKERSIZE field, as I believe that it does not affect the
> RLIMIT accounting (details [=code] below). Question is: Should it?
> 
> Before I provide another patch, I would appreciate another round of feedback, 
> though.
> 
> So…
> On Thu, Jun 25, 2015 at 09:23:33AM +0200, Michael Kerrisk (man-pages) wrote:
>> On 25 June 2015 at 07:47, Davidlohr Bueso <[email protected]> wrote:
>>> Good catch, and a nice opportunity to make the mq manpage more specific
>>> wrt to queue sizes.
>>> 
>>> [...]
>>> 
> ACK, I can write a patch for the manpage once we figure out what to do
> here.
>>>> Reporting the size of the message queue in kernel has its merits, but
>>>> doing so in the QSIZE field of the pseudo file corresponding to the
>>>> queue is a breaking change, as mentioned above. This patch therefore
>>>> returns the QSIZE field to its original meaning. At the same time,
>>>> it introduces a new field QKERSIZE that reflects the size of the queue
>>>> in kernel (user data + kernel data).
>>> 
>>> Hmmm I'm not sure about this. What are the specific benefits of having
>>> QKERSIZE? We don't export in-kernel data like this in any other ipc
>>> (posix or sysv) mechanism, afaik. Plus, we do not compromise kernel data
>>> structures like this, as we would break userspace if later we change
>>> posix_msg_tree_node. So NAK to this.
>>> 
>>> I would just remove the extra
>>> +       info->qsize += sizeof(struct posix_msg_tree_node);
>>> 
>>> bits from d6629859b36 (along with -stable v3.5), plus a patch updating
>>> the manpage that this field only reflects user data.
>> 
> This can be done if the RLIMIT accounting is not done against that
> value (more below).
> 
>> I've been hoping that Doug would jump into this discussion…

Sorry to have been quiet so far.  PTO interfered.

>> If I recall/understand Doug correctly (see
>> http://thread.gmane.org/gmane.linux.man/7050/focus=1797645 ), his
>> rationale for the QSIZE change was that it then revealed a value that
>> was closer to what was being used to account against the
>> RLIMIT_MSGQUEUE resource limit. (Even with these changes, the QSIZE
>> value was not 100% accurate for accounting against RLIMIT_MSGQUEUE,
>> since some pieces of kernel overhead were still not being accounted
>> for. Nevertheless, it's much closer than the old (pre 3.5) QSIZE for
>> some corner cases.) Thus, Marcus's rationale for preserving this info
>> as QKERSIZE.

Yes.  A bit more rationale was that the change to use an rbtree for priorities 
made the worst case scenario for queue size grow quite a lot.  This is 
especially true when dealing with queues with a very small message size.  If we 
didn’t account for some of this growth in size, we opened ourselves up to a 
DDoS type attack where a malicious user could create lots of queues with very 
small message sizes and lots of messages and then just create enough messages 
of different priorities to force allocating all of these structs.  Since they 
would be allowed to create 1 byte messages in the queue, and could easily 
create a 1024 element queue, that would be a 1k accounting for user data that 
in worst case scenario used up something like 96k of kernel memory.  With the 
default RLIMIT for msg queues being something like 800K, let’s just assume one 
user id can create 800 * 96k of kernel memory allocations, so 72MB of kernel 
memory used per user id.  Given a few user ids, it can actually be significant. 
And all the while the sysadmin is scratching his head going “where the hell 
is all my memory going?” because there is no clear indication that a 1 byte 
message in the POSIX mqueue subsystem is consuming 96 bytes or so of kernel 
memory.

So, just to be clear, my rationale here is “Breaking expectations is bad, but 
leaving a possible DDoS style exploit is worse”.  That’s why I changed the 
accounting.  I had one or two users complain that they had to administratively 
adjust the RLIMIT_MSGQUEUE setting on their servers to compensate, but that’s 
as it should be.  Now they *know* where their memory is being used instead of 
having a lurking memory black hole in their system.  For those users that use 
large message sizes, they didn’t even notice the change, it was just lost in 
the noise.

> 
> Actually, looking at the code, it seems to me that the QKERSIZE
> (a.k.a. the current qsize field) is actually not used for the purpose of
> accounting at all. From ipc/mqueue.c (line 281 ff, comments added by me):
> 
>               /* worst case estimate for the rbtree in memory */
> 
>               mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
>                       min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
>                       sizeof(struct posix_msg_tree_node);
> 
>               /* worst case estimate for the user data in the queue */
> 
>               mq_bytes = mq_treesize + (info->attr.mq_maxmsg *
>                                         info->attr.mq_msgsize);
> 
>               spin_lock(&mq_lock);
> 
>                   /* actual accounting; u->mq_bytes is the total
>                    * accounted for this user's other message queues
>                    * (computed in the same way, i.e. worst-case
>                    * estimates)
>                    */
> 
>               if (u->mq_bytes + mq_bytes < u->mq_bytes ||
>                   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
>                       [....]
>                       ret = -EMFILE;
>                       [.....]
> 
> So, I think that if the RLIMIT accounting is supposed to use the actual
> size (not the worst case), then we have a bug. I can write a patch, but
> that should be separate from this... meaning I'll turn this into a patchset.
> 
> However, we should decide what we want first: If I read the manpage
> (mq_overview), I am inclined to think that the RLIMIT should use the
> actual amount of data occupied by the queue (or at least the user data).
> But it does not necessarily rule out accounting against a worst-case
> estimate (to me at least).

You have to use worst case.  You don’t have a choice.  The reason is that the 
check is only performed at queue creation time (and the user API expects this), 
so once the queue is created and the user is allowed to use it, failure due to 
RLIMIT issues is no longer allowed.  We therefore must use whatever worst case 
scenario we wish to use when we test the size during creation and then we have 
to be OK with the variance we might get from that during real world usage.  My 
tests/maths were intended to be “close enough” that they couldn’t be fooled 
with small message sizes and large queue sizes, but not intended to be perfect.

>> Now whether QKERSIZE is actually useful or used by anyone is another
>> question. As far as I know, there was no user request that drove the
>> change. But Doug can perhaps say something to this. QSIZE should I
>> think definitely be fixed (reverted to pre-3.5 behavior). I'm agnostic
>> about QKERSIZE.

Here’s what I would prefer to see happen:

1) Keep the maths as they are.  We need to account worst case because apps 
don’t expect run time checks/failures.
2) Keep accounting for kernel data size as part of the queue size, adding 
both together before checking against the rlimit for the user.
3) Create a new mq_create call that takes a new mq_attr struct.  This struct 
should add a field for max_priorities.  For existing apps, the current default 
of 32,768 priorities can be used.  For other apps, let them select their 
requested number of priorities.  By selecting lower numbers of priorities, that 
worst case scenario gets much better (as long as their queue size is larger 
than their number of priorities).

Or, as an alternative:

1) Change the maths to account for the msg struct, but ignore the rbtree 
structs.  This is still different from when I made my change in that the maths 
used to only account for a msg struct *.  This would put the overall total 
change somewhere in between real memory usage and memory queue size.  For 
instance, if you had a 1k queue of 1 byte messages, here’s how it lays out:

Pre-3.5 math: 1k * (1byte + (sizeof *)) = 5k or 9k (32/64bit arch)
Pre-3.5 reality: 1k * (1byte + (sizeof *) + sizeof (struct msg_msg) + 
unaccounted stuff) = 50ishk

Post-3.5 math: 1k * (1byte + sizeof (struct msg_msg) + sizeof (struct 
msg_node)) = 90ishk
Post-3.5 reality: 90ishk

Proposed math: 1k * (1byte + sizeof (struct msg_msg)) = 50ishk
Proposed reality: 90ishk

The reason I suggest this is that the real world usage I’ve seen does not use 
lots of priorities.  So, even though the *worst case* reality is 90k, the 
common case will be in the 50k range like the math.  Given that we are less 
than a 2 to 1 ratio of worst case to math, a DDoS is unlikely to go 
undetected. The only real problem here is that if you have an admin who 
wants to create lots of queues and use as much RAM as possible, and they use 
lots of priorities, then they might get confused when the calculated sizes 
of their mqueues fill RAM faster than expected and they actually run out of 
RAM.

--
Doug Ledford <[email protected]>
        GPG Key ID: 0E572FDD




