Hi Brian,

On 7/22/24 6:00 PM, Brian Hutchinson wrote:
Hi Kienan,

Took a while to gather your grocery list but I think I have most of it below ;)

thanks for all the extra info. Replies inline below, but I'll cut a lot of the long output for readability.

tl;dr the environment continues to be weird, but my present suspicion is that something in either compilation, the linking of your app (eg. with ld when producing the executable), or some post linking stripping might be causing issues.

I will stop digging into further hypotheticals on my side as there is no reproducer for both the environment and the application. If you ever end up with a minimal reproducer that you can share, I'd be more than happy to examine it.


I may have not been clear.  Most of the application components are statically linked but I think there are some that are built as shared objects (.so's) so that's what I was referring to.  I know that lttng-ust is dynamically linked ... I think the lttng-ust docs say this is only option but also makes reference to the fact static linking was once possible (in some versions of the documentation) but not supported anymore (I probably have the docs memorized by now ha, ha ... I've looked at many, many versions of them).

Just for full disclosure my ldd looks like:

        linux-vdso.so.1 (0x0000ffffab196000)
        libfcgi.so.0 => /usr/lib/libfcgi.so.0 (0x0000ffffa57f0000)
       liblttng-ust.so.1 => /usr/lib/liblttng-ust.so.1 (0x0000ffffa5750000)
        libxml2.so.2 => /usr/lib/libxml2.so.2 (0x0000ffffa55d0000)
        librt.so.1 => /lib/librt.so.1 (0x0000ffffa55b0000)
        libm.so.6 => /lib/libm.so.6 (0x0000ffffa5510000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x0000ffffa52f0000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x0000ffffa52c0000)
        libc.so.6 => /lib/libc.so.6 (0x0000ffffa5110000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffab15d000)
       liblttng-ust-common.so.1 => /usr/local/lib/liblttng-ust-common.so.1 (0x0000ffffa50e0000)        liblttng-ust-tracepoint.so.1 => /usr/local/lib/liblttng-ust-tracepoint.so.1 (0x0000ffffa50a0000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x0000ffffa5080000)
        libz.so.1 => /lib/libz.so.1 (0x0000ffffa5050000)



I find it very suspicious that `liblttng-ust.so.1` is in `/usr/lib`, while the other lttng-ust libraries are being loaded from `/usr/local/lib`.

This information also matches the statedump and the LD_DEBUG info from later on.

Could you verify some of the following information:

1. In your build root for lttng-ust, enumerate all the liblttng*so files. For each shared object, run `file $libname` and record the value of the BuildID hash.

2. In your deployed environment, enumerate all the liblttng*so files in /usr/local/lib and record the value of the BuildID hash.

3. In your deployed environment, enumerate all the liblttng*so files in /usr/lib and record the value of the BuildID hash.

The BuildID will vary from one object to the next, but the deployed environment BuildIDs should match the BuildIDs of the corresponding library from the build root.

In particular my interest is the `liblttng-ust.so.1` library that is loaded from `/usr/lib` in your deployed environment, and if there are any in `/usr/local/lib`.

In the working hello example, does it also end up linking against `/usr/lib/liblttng-ust.so.1`?

    Further along in stepping back:

       - Does make check for lttng-ust pass in your environment?


So I'm glad you mentioned this.  I didn't try to run make check because #1 have never seen anything in the documentation about it and what it does and #2 It said that it depended on Perl and since this is an embedded system, I really didn't want to pull it and its dependencies in.

Nevertheless, I ran it on my native target build I setup to see what would happen.  I didn't notice it complaining about perl so then I checked my manifest and perl is in fact in our image (I guess some other recipe pulled it in and I never paid attention to that):

Making check in include
make[1]: Entering directory '/home/iadmin/lttng-ust/lttng-ust-2.13.8/include'
...
>

       - Does make check for lttng-tools pass in your environment?

I'm still using lttng-tools built by yocto so I don't know how to run make check in that situation.  From devtool maybe???  I don't know, I'd have to look into that one.   I never upgraded lttng-tools either, only upgraded lttng-ust and built from source tar to get the fix you mentioned for TLS stuff.

The poky version of the tools recipe seems to have some stuff relating to ptest, which seems to be a yocto tool for package testing. There might be a run to run that?

https://git.yoctoproject.org/poky/plain/meta/recipes-kernel/lttng/lttng-tools_2.13.13.bb
https://wiki.yoctoproject.org/wiki/Ptest

Otherwise I would get the lttng-tools source for your version, bootstrap, configure, build, and run the tests in tree.

The bulk of regression testing for lttng-ust, -modules, and -tools happens during `make check` for lttng-tools.


       - Is this reproducible in non-yocto environments or on other
    architectures with the same project?

Hmm, that's a good question.  I don't have any other embedded targets to try this on that are a different arch.  I think I heard at one point early on a heavily stubbed out version ran on a PC but I'd have to look into that.

       - Does running the traced application with `LTTNG_UST_DEBUG=1` yield
    more information?


So far I've just been trying to get the app to run without even starting lttng just to get past segfault issue.  Here is what it looks like with debug turned on without starting lttng or a lttng-sessiond:

liblttng_ust_tracepoint[599/599]: Your compiler treats weak symbols with hidden visibility for integer objects as SAME address between compile units part of the same module. (in check_weak_hidden() at liblttng_ust[599/601]: Waiting for local apps sessiond (in wait_for_sessiond() at lttng-ust-comm.c:1758) liblttng_ust[599/600]: Info: sessiond not accepting connections to global apps socket (in ust_listener_thread() at lttng-ust-comm.c:1884)
Segmentation fault

       - I'd also run lttng-sessiond with the environment variable
    `LTTNG_UST_DEBUG=1` set and `-vvv --verbose-consumer` in the program
    arguments and capture for stdout & stderr into a file for analysis.
       - Verify if the main program is dynamically linked with `ldd`
       - Verify which libraries are loaded at runtime and which calls are
    shimmed with `LD_DEBUG=libs,binding`


I provided some of this above.
The output of LD_DEBUG=libs,bindings is attached as file ld_debug-libs-bindings.txt

Thanks. For your own debugging, I would recommend to run the hello world example with the same environment variables in order to have comparison point to a working example.

From my observations in my own environment, for the hello example (and presumably your app if it's set up in the same way) I would expect something similar to the following

```
  3624895:     initialize program: ./hello
   3624895:
3624895: binding file ./hello [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `dlopen' [GLIBC_2.34] 3624895: opening file=/x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]; direct_opencount=3
   3624895:
3624895: binding file ./hello [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `dlsym' [GLIBC_2.34] 3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tp_rcu_re
ad_lock'
3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tp_rcu_re
ad_unlock'
3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tp_rcu_de
reference_sym'
3624895: binding file x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tracepoin
t_module_register'
3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tracepoin
t_module_unregister'
3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tp_disabl
e_destructors'
3624895: binding file /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0] to /x/lttng/master/usr/lib/liblttng-ust-tracepoint.so.1 [0]: normal symbol `lttng_ust_tp_get_de
structors_state'
3624895: binding file ./hello [0] to /x/lttng/master/usr/lib/liblttng-ust.so.1 [0]: normal symbol `lttng_ust_probe_register'
   3624895:
   3624895:     transferring control: ./hello
   3624895:
```

The bit that is missing from your program is the binding of `lttng_ust_probe_register`. Here, it's not like the symbol isn't resolved at all: it's found earlier and if it didn't exist at all the error would probably be relatively clear like:

```
./doc/examples/hello/hello: symbol lookup error: /x/lttng/master/usr/lib/liblttng-ust.so.1: undefined symbol: lttng_ust_probe_register
```

I'd like you to run another test with your app. This should indicate that all symbols should be immediately resolved during dynamic loading rather than performing lazy symbol resolution. The goal of this test is to separate clearly the symbol resolution from the call that happens in the program's initialization. You can also run it with the hello demo to compare the runs.

```
LTTNG_UST_DEBUG=1 LD_DEBUG=all LD_BIND_NOW=1 ./my_app
```


       - Review in detail which gcc commands are executed to produce the
    tracepoint provider and link it to the main executable.

Exactly like hello world example.

Okay. I see your program also has at least some portions in C++, which is different than the hello world example.

Does your build process use particular optimizations or strip parts from the resulting object files?

When you're building the hello world example, does it get compiled in the same buildroot as your app, or is it done differently?

Do you use a custom linker script(s)?





So the embedded target doesn't have babletrace so I archived the session and transferred it over to my dev PC and used tracecompas to view the lttng-session.

You can see that in the attached file my_session.csv

This is good, the statedump is happening properly.


    ```

     >
     > Thanks!  I'm running out of things to try.


For the next steps I would start disassembling "my_app" or checking in gdb which bad address is being fed into strnlen.

Sections of interest (`objdump -sD my_app` _and_ `readelf -W -a my_app`):

- lttng_ust_tracepoints_ptrs
- lttng_ust_tracepoints_strings

Symbols of interest (`objdump -sD my_app` _and `nm my_app`):

- lttng_ust_tracepoint_hello_world___my_first_tracepoint
- lttng_ust_tracepoints_ptrs
- __start_lttng_ust_tracepoints_ptrs
- __stop_lttng_ust_tracepoints_ptrs

thanks,
kienan

    Ultimately, given the bespoke environment, build steps, and application
    it's tough to diagnose a lot of things that we would go over with a
    fine
    tooth comb: seeing how the TPPs are built and linked, seeing how the
    application is built and linked, analyzing what's happening at runtime,
    having a coredump that allows us to see the variables, etc.

    If you can provide a minimal reproducer, it's possible to dig further
    into it.

    Otherwise, to look in more detail at your specific project would be
    covered under a service contract (for which we can sign NDAs, etc. as
    needed). Feel free to reach out the sa...@efficios.com
    <mailto:sa...@efficios.com> to organise that.


I understand but so far schedules & budget/priorities haven't aligned so once again I've been scheduled to look at this issue again since we last tried it on older Dunfell OS with older versions of lttng and it looks like we have the same problem as before.

Thanks for your help!

Regards,

Brian


     >
     > I don't fully understand the implications of some of this
    documentation,
     > but I have learned enough to know LD_PRELOAD means nothing if
    everything
     > is static built.  Our application is heavily multi threaded, uses
    fork,
     > clone and who knows if it's doing double closes or not ... so
    I've tried
     > these LD_PRELOAD "helpers" or whatever they are, before and
    didn't get a
     > different result, but now know it's only for shared library
    object.  So
     > now I will experiment with making our tpp a shared object and try
     > LD_PRELOAD of fork, fd and pthread helpers and starting my app
    that way.

    LD_PRELOAD has an effect if the application you are running is
    dynamically linked (note: the TPP can be statically linked inside an
    dynamically linked executable, and the preloads are still useful and/or
    needed in that case).

    The function of the helpers are to ensure that various system calls
    don't clobber things in use by lttng-ust. E.g., close() is shimmed by
    liblttng-ust-fd.so so that FDs that lttng-ust uses aren't suddenly shut
    by the program.

    If you want to see it in action, you could try an experiment with a
    slightly modified hello-static-lib application, with instructions in
    the
    top comment:
    https://gist.github.com/kienanstewart/299eb6a511d92d458569b210e4a418a8 
<https://gist.github.com/kienanstewart/299eb6a511d92d458569b210e4a418a8>

    The effect of the preloads such as liblttng-ust-fd.so are
    independent of
    how the TPP is built and linked to the application or library.

    thanks,
    kienan

     >
     > This is what I'm referring to above from the lttng docs:
     >
     >
     >           Use LTTng-UST with daemons
>  <https://lttng.org/docs/v2.13/#doc-using-lttng-ust-with-daemons
    <https://lttng.org/docs/v2.13/#doc-using-lttng-ust-with-daemons>>
     >
     > If your instrumented application calls fork(2)
     > <https://man7.org/linux/man-pages/man2/fork.2.html
    <https://man7.org/linux/man-pages/man2/fork.2.html>>, clone(2)
     > <https://man7.org/linux/man-pages/man2/clone.2.html
    <https://man7.org/linux/man-pages/man2/clone.2.html>>, or BSD’s
    rfork(2)
     >
    <http://www.freebsd.org/cgi/man.cgi?query=rfork&sektion=2&manpath=FreeBSD+4.10-RELEASE 
<http://www.freebsd.org/cgi/man.cgi?query=rfork&sektion=2&manpath=FreeBSD+4.10-RELEASE>>, without a 
following exec(3) <https://man7.org/linux/man-pages/man3/exec.3.html 
<https://man7.org/linux/man-pages/man3/exec.3.html>>-family system call, you must preload the 
|liblttng-ust-fork.so| shared object when you start the application.
     >
     > LD_PRELOAD=liblttng-ust-fork.so ./my-app
     >
     > If your tracepoint provider package is a shared library which you
    also
     > preload, you must put both shared objects in |LD_PRELOAD|:
     >
     > LD_PRELOAD=liblttng-ust-fork.so:/path/to/tp.so ./my-app
     >
     >
     >           Use LTTng-UST with applications which close file
    descriptors
     >           that don’t belong to them
     >           <https://lttng.org/docs/v2.13/#doc-liblttng-ust-fd
    <https://lttng.org/docs/v2.13/#doc-liblttng-ust-fd>>
     >
     > Since 2.9
     >
     > If your instrumented application closes one or more file descriptors
     > which it did not open itself, you must preload the
    |liblttng-ust-fd.so|
     > shared object when you start the application:
     >
     > LD_PRELOAD=liblttng-ust-fd.so ./my-app
     >
     > Typical use cases include closing all the file descriptors after
    fork(2)
     > <https://man7.org/linux/man-pages/man2/fork.2.html
    <https://man7.org/linux/man-pages/man2/fork.2.html>> or rfork(2)
     >
    <http://www.freebsd.org/cgi/man.cgi?query=rfork&sektion=2&manpath=FreeBSD+4.10-RELEASE 
<http://www.freebsd.org/cgi/man.cgi?query=rfork&sektion=2&manpath=FreeBSD+4.10-RELEASE>> 
and buggy applications doing “double closes”.
     >
     >
     > Thanks,
     >
     > Brian
     >
     >

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

Reply via email to