On 12/14/2015 04:26 PM, Atsushi Kumagai wrote:
Think about this: with a huge memory, most of the pages will be filtered, and
we have 5 buffers.

page1       page2      page3     page4     page5      page6       page7 .....
[buffer1]   [2]        [3]       [4]       [5]
unfiltered  filtered   filtered  filtered  filtered   unfiltered  filtered

Since a filtered page still takes a buffer, page6 can't be compressed
at the same time as page1.
That's why it prevents parallel compression.

Thanks for your explanation, I understand.
This is just an issue of the current implementation; there is no
reason to keep this restriction.

Further, according to Chao's benchmark, there is a big performance
degradation even when the number of threads is 1. (58s vs 240s)
The current implementation seems to have some problems, we should
solve them.


If "-d 31" is specified, on the one hand we can't save time by compressing
in parallel, and on the other hand "--num-threads" introduces some extra
work. So it is obvious that there will be some performance degradation.

Sure, there must be some overhead due to that "extra work" (e.g. exclusive
locking), but "--num-threads=1 is 4 times slower than --num-threads=0" still
sounds too slow; the degradation is too big to be called "some extra work".

Both --num-threads=0 and --num-threads=1 are serial processing,
so the "buffer fairness issue" above cannot be the cause of this degradation.
What do you think causes it?


I can't reproduce that result at the moment, so I can't investigate
further right now. I guess it may be caused by the underlying pthread
implementation. I reviewed the test results for patch v2 and found that
the results differ considerably between machines.

Unluckily, I also can't reproduce such a big degradation.
According to Chao's verification, this issue seems different from
the "too many page faults" issue that we already solved.
I have no ideas, but at least I want to confirm whether this issue
is avoidable or not.

It seems that I can get almost the same result as Chao on the PRIMEQUEST 1800E.

###################################
- System: PRIMERGY RX300 S6
- CPU: Intel(R) Xeon(R) CPU x5660
- memory: 16GB
###################################
************ makedumpfile -l -d 7 ******************
         num-threads     core-data=0     core-data=256
         0               10              144
         4               5               110
         8               5               111
         12              6               111

************ makedumpfile -l -d 31 ******************
         num-threads     core-data=0     core-data=256
         0               0               0
         4               2               2
         8               2               3
         12              2               3

###################################
- System: PRIMEQUEST 1800E
- CPU: Intel(R) Xeon(R) CPU E7540
- memory: 32GB
###################################
************ makedumpfile -l -d 7 ******************
         num-threads     core-data=0     core-data=256
         0               34              270
         4               63              154
         8               64              131
         12              65              159

************ makedumpfile -l -d 31 ******************
         num-threads     core-data=0     core-data=256
         0               2               1
         4               48              48
         8               48              49
         12              49              50

I'm not sure whether such a big performance degradation is really a problem.
But I think that as long as it works as expected in the other cases, this is
not a problem (or at least not one that needs to be fixed), since some
performance degradation exists in theory.

Or the current implementation could be replaced by a new algorithm.
For example:
we can add an array to record whether each page is filtered or not,
so that only unfiltered pages take a buffer.

We should discuss how to implement the new mechanism; I'll come back to this later.

But I'm not sure if it is worth it.
Since "-l -d 31" is already fast enough, the new algorithm can't help much there.

Basically, the faster the better. There is no obvious target time.
If there is room for improvement, we should do it.


Maybe we can improve the performance of "-c -d 31" in some cases.

Yes, the buffer is used for -c, -l and -p, not only for -l.
It would be useful to improve that.

BTW, we can easily get the theoretical performance by using the "--split".

Are you sure? You persuaded me in the thread below:

   http://lists.infradead.org/pipermail/kexec/2015-June/013881.html

--num-threads is orthogonal to --split; it's better to use both
options, since they address different bottlenecks.
That's why I decided to merge your multi-thread feature.

However, what you said sounds like --split is a superset of --num-threads.
Do you no longer need the multi-thread feature?


I just mean the performance.
There is no doubt that we will use multi-threading with --split in the future.

But as we all know, threads and processes have some characteristics in common.
In makedumpfile, if we use "--split core1 core2 core3 core4" and
"--num-threads 4" separately, the elapsed time should not differ by much.

Since the logic of "--split" is simpler, if we can't improve the performance
of "-l -d 31" with "--split", we don't have much chance of doing it with
"--num-threads" either.

I just mean that.
Of course, --split is not a superset of --num-threads.

--
Thanks
Zhou



_______________________________________________
kexec mailing list
[email protected]
http://lists.infradead.org/mailman/listinfo/kexec
