We have what should be a simple task: we’re on CentOS 7, and we want to deploy a Go binary that will have USDT tracepoints. USDT is an attractive option for debugging: it allows applications to define tracepoints with more stability and semantic meaning than ad-hoc methods like dynamic uprobes.
Usage of USDT tracepoints tends to have a different focus from other monitoring techniques like logging, Prometheus, OpenTracing, and so on. Those might identify a general issue such as a poor latency metric; you’d then use USDT probes to dig further into the problem on a production system, to identify precisely what’s happening at a particular endpoint, for example.
USDT in Go
The normal model for USDT involves placing the trace points at specific places in the binary: they are statically defined and built, but dynamically enabled. This is typically done via the DTRACE_PROBE() family of macros.
The only (?) USDT facility for Go is salp, which uses libstapsdt under the hood. This library dynamically creates probes at runtime, even though Go is a compiled language. Yes, this is dynamic static dynamic tracing.
We’re going to use salpdemo in our experiment. It has two USDT probes, p1 and p2, that we’d like to be able to trace dynamically, using bcc-tools' handy trace wrapper. CentOS 7 doesn’t appear to have the more recent USDT support in perf probe.
Setting up a Docker container for dynamic tracing
For a few different reasons, we’d like to be able to trace from inside the container itself. This has security implications, given what’s implemented today; but bear in mind we’re on CentOS 7, so even if a finer-grained solution exists now, there’s a good chance it wouldn’t work here. In reality, we would probably use an ad-hoc debugging sidecar container, but we’re just going to use the one container here.
First, we’re going to deploy the container with ansible for convenience:
$ cat hosts
localhost ansible_connection=local
$ cat playbook.yml
---
- hosts: localhost
  become: yes
  tasks:
  - docker_container:
      name: usdt_test
      image: centos:7
      state: started
      command: sleep infinity
      network_mode: bridge
      ulimits:
      - memlock:8192000:8192000
      capabilities:
      - sys_admin
      volumes:
      - /sys/kernel/debug:/sys/kernel/debug
$ ansible-playbook -i hosts ./playbook.yml
Note that we’re using sleep infinity here to keep our container running so we can play around.
We need the sys_admin capability to be able to program the probes, and the BPF compiler needs the locked memory limit bumping. We also need to mount /sys/kernel/debug read-write (!) in order to be able to write to /sys/kernel/debug/tracing/uprobe_events.
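If you’re not using ansible, a roughly equivalent docker run invocation (my own sketch of the same settings, rather than something from the original setup) would be:
$ docker run -d --name usdt_test \
    --cap-add SYS_ADMIN \
    --ulimit memlock=8192000:8192000 \
    -v /sys/kernel/debug:/sys/kernel/debug \
    centos:7 sleep infinity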
Now let’s install everything we need to be able to trace these probes:
$ docker exec -it usdt_test yum -y install \
kernel-devel-$(uname -r) kernel-$(uname -r) bcc-tools
Yes, it’s a lot, but unavoidable. You can, in theory, use mounted volumes for the kernel sources, as described here; however, the read-only mounts break packaging inside the container, so we’re not doing that here.
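If bcc later complains about missing kernel headers, one thing worth double-checking (my suggestion, not a step from the original walkthrough) is that the kernel-devel tree for the running kernel actually landed inside the container:
$ docker exec usdt_test ls -d /usr/src/kernels/$(uname -r)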
Tracing the probes in the container
The above was a big hammer, but we should be good to go, right? Let’s start up the demo binary:
$ docker cp ~/salpdemo usdt_test:/root/
$ docker exec -it usdt_test bash
[root@8ccf34663dd2 /]# ~/salpdemo &
[1] 18166
List the go probes in this demo with
sudo tplist -vp "$(pgrep salpdemo)" "salp-demo*"
Trace this process with
sudo trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
or
sudo trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
We can indeed list the probes:
[root@8ccf34663dd2 /]# /usr/share/bcc/tools/tplist -vp $(pgrep salpdemo) | head
salp-demo:p1 [sema 0x0]
1 location(s)
3 argument(s)
salp-demo:p2 [sema 0x0]
1 location(s)
2 argument(s)
libc:setjmp [sema 0x0]
...
So let’s try the suggested trace invocation:
# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__tmp_salp_demo_I8qitQ_so_0x270_18166_bcc_18175/id): Invalid argument
Failed to attach BPF to uprobe
Huh. This doesn’t seem to be a permissions issue, since we got EINVAL.
In addition, running from the host has the same problem.
I haven’t proved it, but I think our basic issue here is that CentOS 7 is missing this kernel fix: tracing/uprobe: Add support for overlayfs.
I spent way too long trying to work around this by placing the binary somewhere other than overlayfs, before I finally dug a little bit more into how libstapsdt actually works, and figured out the problem.
Working around overlayfs and libstapsdt
To build probes dynamically at runtime, libstapsdt does something slightly crazy: it generates a temporary ELF shared library at runtime that contains the USDT probes, and uses dlopen() to bring it into the running binary.
Let’s have a look:
[root@8ccf34663dd2 /]# grep salp-demo /proc/$(pgrep salpdemo)/maps
7fa9373b5000-7fa9373b6000 r-xp 00000000 fd:10 1506373 /tmp/salp-demo-I8qitQ.so
7fa9373b6000-7fa9375b5000 ---p 00001000 fd:10 1506373 /tmp/salp-demo-I8qitQ.so
7fa9375b5000-7fa9375b6000 rwxp 00000000 fd:10 1506373 /tmp/salp-demo-I8qitQ.so
The process has mapped in this temporary file, named after the provider. It’s on /tmp, hence an overlay2 filesystem, which explains why moving the salpdemo binary itself around made no difference.
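If you want to confirm that the probes really live in this generated object (an optional sanity check, not part of the original steps), tplist can read them straight from the library path used in this particular run:
[root@8ccf34663dd2 /]# /usr/share/bcc/tools/tplist -v -l /tmp/salp-demo-I8qitQ.so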
So maybe we can be more specific?
[root@8ccf34663dd2 /]# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u:/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__tmp_salp_demo_I8qitQ_so_0x270_18166_bcc_18188/id): Invalid argument
Failed to attach BPF to uprobe
Still not there yet. The above bug means that it still can’t find the uprobe given the binary image path. What we really need is the host path of this file. We can get this from Docker:
$ docker inspect usdt_test | json -a GraphDriver.Data.MergedDir
/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/merged
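As an aside, if the json CLI isn’t available, docker inspect’s built-in Go template reaches the same field:
$ docker inspect -f '{{ .GraphDriver.Data.MergedDir }}' usdt_test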
This is not good, as obviously we can’t reach this path from inside the container. Hey, at least we can run it on the host though.
$ sudo /usr/share/bcc/tools/trace 'u:/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/merged/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
Event name (p__data_docker_overlay2_77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce_merged_tmp_salp_demo_I8qitQ_so_0x270) is too long for buffer
Failed to attach BPF to uprobe
SIGH. Luckily, though:
$ sudo /usr/share/bcc/tools/trace 'u:/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/diff/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
PID TID COMM FUNC -
19862 19864 salpdemo p1 i=64 err=An error: 64
19862 19864 salpdemo p1 i=66 err=An error: 66
It worked! But it’s not so great: we wanted to be able to trace inside a container. If we mounted /data/docker itself inside the container, we could do that, but it’s still incredibly awkward.
Using tmpfs?
Instead, can we get the generated file onto a different filesystem type? libstapsdt hard-codes /tmp, which limits our options.
Let’s start again with /tmp inside the container on tmpfs:
$ tail -1 playbook.yml
tmpfs: /tmp:exec
We need to force on the exec mount flag here: otherwise, we can’t dlopen() the generated file. Yes, not great for security again.
$ docker exec -it usdt_test bash
# ~/salpdemo &
...
[root@1f56af6e7bee /]# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
PID TID COMM FUNC -
Well, we’re sort of there. It started up, but we never get any output. Worse, we get the same if we try this on the host now! I don’t know what the issue is here.
Using a volume?
Let’s try a volume mount instead:
$ tail -3 playbook.yml
volumes:
- /sys/kernel/debug:/sys/kernel/debug
- /tmp/tmp.usdt_test:/tmp
If we run trace on the host now, we can just use u::p1:
$ sudo /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
PID TID COMM FUNC -
6864 6866 salpdemo p2 j=120 flag=1
...
But we still need a bit of a tweak inside our container:
# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3'
PID TID COMM FUNC -
<no output>
[root@d72b822cab0f /]# cat /proc/$(pgrep salpdemo | head -n1)/maps | grep /tmp/salp-demo*.so | awk '{print $6}' | head -n1
/tmp/salp-demo-6kcugm.so
[root@d72b822cab0f /]# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u:/tmp/salp-demo-6kcugm.so:p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3'
PID TID COMM FUNC -
11593 11595 salpdemo p1 i=-17 err=`An error: -17` date=`Thu, 06 Aug 2020 13:12:57 +0000`
...
I don’t have any clear idea why the explicit library path is required inside the container context, but at least, finally, we managed to trace those USDT probes!