Thank you, Aurelien!

Aha, "vader btl", that is new to me!
I thought Vader was that man dressed in black in Star Wars,
Obi-Wan Kenobi's nemesis.
That was a while ago, my kids were children,
and Alec Guinness younger than Harrison Ford is today.
Oh, how nostalgic code developers can get when it comes
to naming things ...

If I am using "vader", it is totally inadvertent.
There was no such thing in Open MPI 1.6 and earlier.

Now that you mention it, I can see lots of it in the 1.8.3
ompi_info output.
In addition, my stderr files show messages like this:

imb.e38352:[1,5]<stddiag>:[node13:16334] mca: bml: Not using sm btl to [[59987,1],26] on node node13 because vader btl has higher exclusivity (65536 > 65535)

So, you are right, "vader" is taking over and knocking off "sm" (and openib and everybody else).
Darn Vader!
Probably knem is going down the tubes along with sm, right?
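
For anyone who wants to see at a glance which BTL is displacing which in a large stderr file, messages like the one above can be tallied with a small shell function. This is only a sketch: the function name is made up, and it assumes the message wording is exactly as in the line quoted above.

```shell
# Hypothetical helper: tally "Not using X btl ... because Y btl has higher
# exclusivity" messages. Reads Open MPI stderr text on stdin and prints one
# line per (skipped, winner) pair: "<skipped> <winner> <count>".
summarize_btl_exclusivity() {
    sed -n 's/.*Not using \([A-Za-z0-9_]*\) btl to .* because \([A-Za-z0-9_]*\) btl has higher exclusivity.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

It would be run as "summarize_btl_exclusivity < my_job.stderr", where my_job.stderr is a placeholder for the saved stderr file.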

I was used to sm, openib, self and tcp BTLs.
I normally just do "btl = ^tcp" in the MCA parameters file,
to stick to sm, openib, and self.

That worked fine in 1.6.5 (and earlier), and knem worked
flawlessly there.
The same settings in 1.8.3 don't bring up the knem functionality.
So, this seems to be yet another change in 1.8.3 that I need to learn.

Can you or some other list subscriber elaborate a bit about
this 'vader' btl?
The Open MPI FAQ doesn't have anything about it.
What is it after all?
Does it play the same role as "sm", i.e., an intra-node btl?
Considering the name, is "vader" good or bad?
Or better: In which circumstances is "vader" good and when is it bad?
Should I give in to the dark side of the force and keep "vader"
turned on, or should I just do something like
"btl = ^tcp,^vader" ?
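
A note on the syntax of that last line: as far as I know, Open MPI recognizes the caret only as a prefix of the whole comma-separated list, where it turns every listed component into an exclusion. If that is right, the setting would be written "btl = ^tcp,vader". The sketch below rewrites a list into that form; the helper name is hypothetical and it is not part of Open MPI.

```shell
# Hypothetical helper (not an Open MPI tool): keep only the leading caret
# of a btl exclusion list, e.g. "^tcp,^vader" -> "^tcp,vader".
# Lists that do not start with a caret are printed unchanged.
normalize_btl_exclusion() {
    case "$1" in
        '^'*) printf '^%s\n' "$(printf '%s' "${1#'^'}" | sed 's/,\^/,/g')" ;;
        *)    printf '%s\n' "$1" ;;
    esac
}

# In the MCA parameters file the result would appear as:
#   btl = ^tcp,vader
```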

I am on CentOS 6.5, stock kernel 2.6.32, no 3.1, no Linux CMA,
so I believe I need knem for now.

I tried '-mca btl_base_verbose 30' but no knem information came out.

Many thanks,
Gus Correa

On 10/16/2014 04:40 PM, Aurélien Bouteiller wrote:
Are you sure you are not using the vader BTL?

Setting mca_btl_base_verbose and/or sm_verbose should spit
out some knem initialization info.

The Linux CMA mechanism (which ships with most 3.1x Linux kernels)
has similar features, and is also supported in sm.

Aurelien
--
           ~~~ Aurélien Bouteiller, Ph.D. ~~~
              ~ Research Scientist @ ICL ~
The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375       fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/




On 16 Oct 2014, at 16:35, Gus Correa <g...@ldeo.columbia.edu> wrote:

Dear Open MPI developers

Well, I just can't keep my promises for too long ...
So, here I am pestering you again, although this time
it is not a request for more documentation.
Hopefully it is something more legit.

I am having trouble using knem with Open MPI 1.8.3,
and need your help.

I configured Open MPI 1.8.3 with knem.
I had done the same with some builds of Open MPI 1.6.5 before.

When I build and launch the Intel MPI benchmarks (IMB)
with Open MPI 1.6.5,
'cat /dev/knem'
starts showing non-zero-and-growing statistics right away.

However, when I build and launch IMB with Open MPI 1.8.3,
/dev/knem shows only zeros,
no statistics growing, nothing.
Knem just seems to be completely asleep.

So, my conclusion is that somehow knem is not working with OMPI 1.8.3,
at least not for me.
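
The check described above (looking at /dev/knem before and after a run) can be made a little more systematic by saving two snapshots and comparing them. This is a minimal sketch: the function name and file names are made up, and it only tests whether the two snapshots differ, without assuming anything about the format of the statistics.

```shell
# Hypothetical helper: succeed (exit 0) only if the two saved /dev/knem
# snapshots differ, i.e. at least one counter moved during the run.
knem_counters_moved() {
    knem_rc=0
    cmp -s "$1" "$2" || knem_rc=$?
    # cmp exits 0 when identical, 1 when different, >1 on error.
    [ "$knem_rc" -eq 1 ]
}

# Intended use on a compute node (mpirun arguments are placeholders):
#   cat /dev/knem > knem.before
#   mpirun ... ./IMB-MPI1
#   cat /dev/knem > knem.after
#   knem_counters_moved knem.before knem.after && echo "knem was used"
```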

***

The runtime environment related to knem is set up the
same way on both OMPI releases.
I tried setting it up both on the command line:

-mca btl_sm_eager_limit 32768 -mca btl_sm_knem_dma_min 1048576

and on the MCA parameter file:

btl_sm_use_knem = 1
btl_sm_eager_limit = 32768
btl_sm_knem_dma_min = 1048576

and the behavior is the same (i.e., knem is active in 1.6.5,
but doesn't seem to be used by 1.8.3, as indicated by the
/dev/knem statistics.)

***

When I 'grep -i knem config.log', both 1.6.5 and 1.8.3 builds show:

#define OMPI_BTL_SM_HAVE_KNEM 1

suggesting that both configurations picked up knem correctly.

On the other hand, when I do 'ompi_info --all --all |grep knem',
OMPI 1.6.5 shows "btl_sm_have_knem_support":

'MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source: default value)
Whether this component supports the knem Linux kernel module or not'

By contrast, in OMPI 1.8.3 ompi_info doesn't show this particular item 
("btl_sm_have_knem_support"),
although the *other* 'btl sm knem' items are there,
namely "btl_sm_use_knem", "btl_sm_knem_dma_min", "btl_sm_knem_max_simultaneous".

I am scratching my head to understand why a parameter with such a
suggestive name ("btl_sm_have_knem_support"),
so similar to the OMPI_BTL_SM_HAVE_KNEM cpp macro,
somehow vanished from ompi_info in OMPI 1.8.3.
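
The comparison above was done by eye; it can also be done mechanically on saved ompi_info output. The sketch below assumes, as in the line quoted above, that parameter names appear in double quotes. The function and file names are made up.

```shell
# Hypothetical helpers: list knem-related parameter names found in a saved
# "ompi_info --all --all" output, and report those present in the first
# file but absent from the second.
knem_params() {
    { grep -o '"[a-z_]*knem[a-z_]*"' "$1" || true; } | tr -d '"' | sort -u
}

missing_knem_params() {
    _old=$(mktemp)
    _new=$(mktemp)
    knem_params "$1" > "$_old"
    knem_params "$2" > "$_new"
    comm -23 "$_old" "$_new"   # names only in the first file
    rm -f "$_old" "$_new"
}

# Intended use (paths are placeholders for the two installations):
#   /path/to/ompi-1.6.5/bin/ompi_info --all --all > info-1.6.5.txt
#   /path/to/ompi-1.8.3/bin/ompi_info --all --all > info-1.8.3.txt
#   missing_knem_params info-1.6.5.txt info-1.8.3.txt
```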

***

Questions:

- Am I doing something totally wrong,
perhaps with the knem runtime environment?

- Was knem somehow phased out in 1.8.3?

- Could there be a bad interaction with other runtime parameters that
somehow is knocking out knem in 1.8.3?
(FYI, besides knem, I'm just excluding the tcp btl, binding to core, and 
reporting the bindings, which is exactly what I do on 1.6.5,
although the runtime parameter syntax has changed.)

- Is knem inadvertently not being activated at runtime in OMPI 1.8.3?
(i.e. a bug)

- Is there a way to increase verbosity to detect if knem is being
used by OMPI?
That would certainly help to check what is going on.
I tried '-mca btl_base_verbose 30' but there was no trace of knem
in stderr/stdout of either 1.6.5 or 1.8.3.
So, the evidence I have that knem is
active in 1.6.5 but not in 1.8.3 comes only from the statistics in
/dev/knem.

***


Thank you,
Gus Correa

***

PS - As an aside, I also have some questions on the knem setup,
which I mostly copied from the knem web site
(hopefully Brice Goglin is listening ...):

- Is 32768 in 'btl_sm_eager_limit 32768' a good number,
or should it be larger/smaller/something else?
[OK, I know I should benchmark it, but exploring the whole parameter
space takes long, so why not ask?]

- Is it worth using 'btl_sm_knem_dma_min 1048576'?
[I think I read somewhere that this dma engine offload
is an Intel thing, not AMD.]

- How about btl_sm_knem_max_simultaneous?
That one is not mentioned on the knem web site.
Should I leave it at the default of zero or set it to 1? 2? 4? Something else?


Thanks again,
Gus Correa
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/10/25511.php




