The meme is real, but I think this particular case is sort of interesting,
because it turned out, ultimately, to not be due to DNS configuration, but an
honest-to-goodness bug in glibc.
As previously
mentioned, I
heavily rely on email-oauth2-proxy for my work
email. Every now and then, I’d see a failure like this:
Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known
This always coincided with a change in my network, but - and this is the issue -
the app never recovered. Even though other processes - even Python ones - could
happily resolve outlook.office365.com
- this long-running daemon remained
stuck, until it was restarted.
A bug in the proxy?
My first suspect here was this bit of code:
1761 def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
1762 # connect to whichever resolved IPv4 or IPv6 address is returned first by the system
1763 for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
1764 super().create_socket(a[0], socket.SOCK_STREAM)
1765 return
We’re looping across the gai results, but returning after the first one, and
there’s no attempt to account for the first address result being unreachable,
but later ones being fine.
Makes no sense, right? My guess was that somehow getaddrinfo()
was returning
IPv6 results first in this list, as at the time, the IPv6 configuration on the
host was a little wonky. Perhaps I needed to tweak gai.conf?
However, while this was a proxy bug,
it was not the cause of my issue.
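For what it’s worth, the robust pattern here - essentially what Python’s own socket.create_connection() does - is to try each returned address in turn until one connects. A rough sketch of the idea (not the proxy’s actual code):
import socket

def connect_first_working(host, port):
    # Try every address returned by getaddrinfo(), not just the first one.
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    if last_error is None:
        last_error = OSError("getaddrinfo() returned no results")
    raise last_error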
DNS caching?
Perhaps, then, this is a local DNS cache issue? Other processes work OK, even
Python test programs, so it didn’t seem likely to be the system-level resolver
caching stale results. Python itself doesn’t seem to cache
results.
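A quick throwaway check from a separate Python process - something along these lines - resolved fine every time, so the system-wide resolver state itself was healthy:
import socket
# Resolve the same name the stuck proxy was failing on.
print(socket.getaddrinfo("outlook.office365.com", 993, type=socket.SOCK_STREAM))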
This case triggered (sometimes) when my VPN connection died. The openconnect
vpnc
script had correctly updated /etc/resolv.conf
back
to the original configuration, and as there’s no caching in the way, the
overall system state looked correct. But somehow, this process still had wonky
DNS?
A live reproduction
I was not going to get any further until I had a live reproduction and
the spare time to investigate it before restarting the proxy.
Once the running proxy was in this state, the failure could be triggered easily by waking up fetchmail, which made it much easier to investigate what was happening each time.
So what was the proxy doing on line 1763 above? Here’s an strace
snippet:
[pid 1552] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid 1552] setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid 1552] connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("ELIDED")}, 16) = 0
[pid 1552] poll([{fd=7, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid 1552] sendto(7, "\250\227\1 \0\1\0\0\0\0\0\1\7outlook\toffice365\3c"..., 50, MSG_NOSIGNAL, NULL, 0) = 50
[pid 1552] poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLERR}])
[pid 1552] close(7) = 0
As we might expect, we’re opening a socket, connecting over UDP to port 53,
and sending out a request to the DNS server.
This indicated the proximal issue: the DNS server IP address was wrong - the DNS servers used were still the ones originally set up by openconnect. The process wasn’t incorrectly caching DNS results - it was caching the DNS servers. Forever.
Nameserver configuration itself is not something that applications typically
control, so the next question was - how does this work normally? When I update
/etc/resolv.conf, or use any of the thousand other ways to configure name resolution in modern Linux systems, what makes getaddrinfo() continue to work, normally?
/etc/resolv.conf and glibc
So, how does glibc
account for changes in resolver configuration?
The contents of the /etc/resolv.conf
file are the canonical location for
DNS server addresses for processes (like Python ones) using the standard glibc
resolver. Logically then, there must be a way for updates to the file to affect
running processes.
In glibc, such configuration is represented by struct resolv_context. This is lazily initialized via __resolv_context_get()->maybe_init(), which looks like this:
68 /* Initialize *RESP if RES_INIT is not yet set in RESP->options, or if
69 res_init in some other thread requested re-initializing. */
70 static __attribute__ ((warn_unused_result)) bool
71 maybe_init (struct resolv_context *ctx, bool preinit)
72 {
73 struct __res_state *resp = ctx->resp;
74 if (resp->options & RES_INIT)
75 {
76 if (resp->options & RES_NORELOAD)
77 /* Configuration reloading was explicitly disabled. */
78 return true;
79
80 /* If there is no associated resolv_conf object despite the
81 initialization, something modified *ctx->resp. Do not
82 override those changes. */
83 if (ctx->conf != NULL && replicated_configuration_matches (ctx))
84 {
85 struct resolv_conf *current = __resolv_conf_get_current ();
86 if (current == NULL)
87 return false;
88
89 /* Check if the configuration changed. */
90 if (current != ctx->conf)
...
Let’s take a look at __resolv_conf_get_current():
123 struct resolv_conf *
124 __resolv_conf_get_current (void)
125 {
126 struct file_change_detection initial;
127 if (!__file_change_detection_for_path (&initial, _PATH_RESCONF))
128 return NULL;
129
130 struct resolv_conf_global *global_copy = get_locked_global ();
131 if (global_copy == NULL)
132 return NULL;
133 struct resolv_conf *conf;
134 if (global_copy->conf_current != NULL
135 && __file_is_unchanged (&initial, &global_copy->file_resolve_conf))
This is the file change detection code we’re looking for: _PATH_RESCONF is /etc/resolv.conf, and __file_is_unchanged() compares the cached values of things like the file mtime and so on against the one on disk.
If it has in fact changed, then maybe_init()
is supposed to go down the
“reload configuration” path.
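The same idea sketched in Python terms, just to make the mechanism concrete (not exactly the fields glibc compares, but this is the shape of it):
import os

def file_changed(path, cached):
    # Compare a cached stat snapshot (mtime, size, inode, ...) against the
    # file currently on disk.
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size, st.st_ino, st.st_dev) != cached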
Now, in my case, this wasn’t happening. And the reason for this is line 83
above: the replicated_configuration_matches()
call.
Resolution options
We already briefly mentioned gai.conf. There is also, as the resolver.3 man page says, this interface:
The resolver routines use configuration and state information
contained in a __res_state structure (either passed as the statep
argument, or in the global variable _res, in the case of the
older nonreentrant functions). The only field of this structure
that is normally manipulated by the user is the options field.
So an application can dynamically alter options too, outside of whatever
static configuration there is. And (I think) that’s why we have the
replicated_configuration_matches()
check:
static bool
replicated_configuration_matches (const struct resolv_context *ctx)
{
return ctx->resp->options == ctx->conf->options
&& ctx->resp->retrans == ctx->conf->retrans
&& ctx->resp->retry == ctx->conf->retry
&& ctx->resp->ndots == ctx->conf->ndots;
}
The idea being, if the application has explicitly diverged its options, it
doesn’t want them to be reverted just because the static configuration changed.
Our Python application isn’t changing anything here, so this should still
work as expected.
In fact, though, we find that it’s returning false: the dynamic configuration has somehow acquired the extra options RES_SNGLKUP and RES_SNGLKUPREOP.
We’re now very close to the source of the problem!
A hack that bites
So what could possibly set these flags? Turns out the send_dg()
function
does:
999 {
1000 /* There are quite a few broken name servers out
1001 there which don't handle two outstanding
1002 requests from the same source. There are also
1003 broken firewall settings. If we time out after
1004 having received one answer switch to the mode
1005 where we send the second request only once we
1006 have received the first answer. */
1007 if (!single_request)
1008 {
1009 statp->options |= RES_SNGLKUP;
1010 single_request = true;
1011 *gotsomewhere = save_gotsomewhere;
1012 goto retry;
1013 }
1014 else if (!single_request_reopen)
1015 {
1016 statp->options |= RES_SNGLKUPREOP;
1017 single_request_reopen = true;
1018 *gotsomewhere = save_gotsomewhere;
1019 __res_iclose (statp, false);
1020 goto retry_reopen;
1021 }
Now, I don’t believe the relevant nameservers have such a bug. Rather, what
seems to be happening is that when the VPN connection drops, making the servers
inaccessible, we hit this path. And these flags are treated by maybe_init()
as
if the client application set them, and has thus diverged from the static
configuration. As the application itself has no control over these options being
set like this, this seemed like a real glibc bug.
The fix
I originally reported this to the list back in
March;
I was not confident in my analysis but the maintainers confirmed the
issue. More recently,
they fixed
it.
The actual fix was pretty simple: apply the workaround flags to statp->_flags instead, so they don’t affect the logic in maybe_init().
Thanks DJ Delorie!
I intentionally have mouse support disabled in vim, as I find it unergonomic not being able to select text the same way as in any other terminal screen.
However, this has an annoying problem for a libvte / gnome-terminal user: on switching to an “alternate screen” application like vim that has mouse support disabled, the terminal “helpfully” maps scroll wheel events to arrow up/down events.
This is possibly fine, except I use the scroll wheel click as middle-button
paste, and I’m constantly accidentally pasting something in the wrong place as a
result.
This is unfixable from within vim, since it only sees normal arrow key presses (not ScrollWheelUp and so on).
However, you can turn this off in libvte with the magic escape sequence:
echo -ne '\e[?1007l'
Also known as XTERM_ALTBUF_SCROLL. This is mentioned in passing in this ticket.
Documentation in general is - at best - sparse, but you can always go to the
source.
As I mentioned in my last
post, I’ve been
experimenting with replacing davmail
with Simon Robinson’s super-cool
email-oauth2-proxy, and
hooking fetchmail
and mutt
up to it. As before, here’s a specific rundown of
how I configured O365 access using this.
Configuration
We need some small tweaks to the shipped configuration file. It’s used for both
permanent configuration and acquired tokens, but the static part looks something
like this:
[[email protected]]
permission_url = https://login.microsoftonline.com/common/oauth2/v2.0/authorize
token_url = https://login.microsoftonline.com/common/oauth2/v2.0/token
oauth2_scope = https://outlook.office365.com/IMAP.AccessAsUser.All https://outlook.office365.com/POP.AccessAsUser.All https://outlook.office365.com/SMTP.Send offline_access
redirect_uri = https://login.microsoftonline.com/common/oauth2/nativeclient
client_id = facd6cff-a294-4415-b59f-c5b01937d7bd
client_secret =
We’re re-using davmail’s client_id again.
Updated 2023-10-10: emailproxy
now supports a proper headless mode, as
discussed below.
Updated 2022-11-22: you also want to set delete_account_token_on_password_error to False: otherwise, a typo will delete the tokens, and you’ll need to re-authenticate from scratch.
We’ll configure fetchmail
as follows:
poll localhost protocol IMAP port 1993
auth password username "[email protected]"
is localuser here
keep
sslmode none
mda "/usr/bin/procmail -d %T"
folders INBOX
and mutt
like this:
set smtp_url = "smtp://[email protected]@localhost:1587/"
unset smtp_pass
set ssl_starttls=no
set ssl_force_tls=no
When you first connect, you will get a GUI pop-up and you need to interact with
the tray menu to follow the authorization flow. After that, the proxy will
refresh tokens as necessary.
Running in systemd
Here’s the service file I use, slightly modified from the upstream’s README:
$ cat /etc/systemd/system/emailproxy.service
[Unit]
Description=Email OAuth 2.0 Proxy
[Service]
ExecStart=/usr/bin/python3 /home/localuser/src/email-oauth2-proxy/emailproxy.py --external-auth --no-gui --config-file /home/localuser/src/email-oauth2-proxy/my.config
Restart=always
User=joebloggs
Group=joebloggs
[Install]
WantedBy=multi-user.target
Headless operation
Typically, only initial authorizations require the GUI, so you could easily do
the initial dance then use the above systemd service.
Even better, with current versions of email-oauth2-proxy, you can operate in
an entirely headless manner! With the above --external-auth
and --no-gui
options, the proxy will prompt on stdin
with a URL you can copy into your
browser; pasting the response URL back in will authorize the proxy, and store
the necessary access and refresh tokens in the config file you specify.
For example:
$ sudo systemctl stop emailproxy
$ python3 ./emailproxy.py --external-auth --no-gui --config-file /home/localuser/src/email-oauth2-proxy/my.config
# Now connect from mutt or fetchmail.
Authorisation request received for [email protected] (external auth mode)
Email OAuth 2.0 Proxy No-GUI external auth mode: please authorise a request for account [email protected]
Please visit the following URL to authenticate account [email protected]: https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...
Copy+paste or press [↵ Return] to visit the following URL and authenticate account [email protected]: https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...
then paste here the full post-authentication URL from the browser's address bar (it should start with https://login.microsoftonline.com/common/oauth2/nativeclient):
# Paste the updated URL bar contents from your browser in response:
https://login.microsoftonline.com/common/oauth2/nativeclient?code=...
SMTP (localhost:1587; [email protected]) [ Successfully authenticated SMTP connection - releasing session ]
^C
$ sudo systemctl start emailproxy
Obviously, you’ll need to do this interactively from the terminal, then restart
in daemon mode.
email-oauth2-proxy
If you find the above details useful, consider donating to support Simon’s sterling work on email-oauth2-proxy.
I previously described accessing Office365 email (and in particular its oauth2 flow) via davmail, allowing me to continue using fetchmail, procmail and mutt. As davmail is java, it’s a pain to have around, so I thought I’d give some details on how to do this more directly in fetchmail, as all the available docs I found were a little vague, and it’s quite easy to screw up.
As it happens, I came across a generally better
solution shortly after writing
this post, on which more later.
Fetchmail 7
Unfortunately there is little interest in releasing a Fetchmail version with
oauth2
support - the maintainer is taking a political
stance on not
integrating it - so you’ll need to check out the next
branch from git:
cd ~/src/
git clone -b next [email protected]:fetchmail/fetchmail.git fetchmail-next
cd fetchmail-next
./autogen.sh && ./configure --prefix=/opt/fetchmail7 && make && sudo make install
I used the branch as of 43c18a54 Merge branch 'legacy_6x' into next. Given that the maintainer warns us they might remove oauth2 support, you might need this exact hash…
Generate a token
We need to go through the usual flow for getting an initial token. There’s a
helper script for this, but first we need a config file:
[email protected]
client_id=facd6cff-a294-4415-b59f-c5b01937d7bd
client_secret=
refresh_token_file=/home/localuser/.fetchmail-refresh
access_token_file=/home/localuser/.fetchmail-token
imap_server=outlook.office365.com
smtp_server=outlook.office365.com
scope=https://outlook.office365.com/IMAP.AccessAsUser.All https://outlook.office365.com/POP.AccessAsUser.All https://outlook.office365.com/SMTP.Send offline_access
auth_url=https://login.microsoftonline.com/common/oauth2/v2.0/authorize
token_url=https://login.microsoftonline.com/common/oauth2/v2.0/token
redirect_uri=https://login.microsoftonline.com/common/oauth2/nativeclient
Replace [email protected] and localuser in the above, and put it at ~/.fetchmail.oauth2.cfg. It’s rare to find somebody mention this, but O365 does not need a client_secret, and we’re just going to borrow davmail’s client_id - it’s not a secret in any way, and trying to get your own is a royal pain. Also, if you see a reference to tenant_id anywhere, ignore it - common is what we need here.
Run the flow:
$ # This doesn't get installed...
$ chmod +x ~/src/fetchmail-next/contrib/fetchmail-oauth2.py
$ # Sigh.
$ sed -i 's+/usr/bin/python+/usr/bin/python3+' ~/src/fetchmail-next/contrib/fetchmail-oauth2.py
$ ~/src/fetchmail-next/contrib/fetchmail-oauth2.py -c ~/.fetchmail.oauth2.cfg --obtain_refresh_token_file
To authorize token, visit this url and follow the directions:
https://login.microsoftonline.com/common/oauth2/v2.0/authorize?...
Enter verification code:
Unlike davmail, this needs just the code, not the full returned URL, so you’ll
need to be careful to dig out just the code from the response URL (watch out for
any session_state
parameter at the end!).
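If you’d rather not pick it out by eye, a couple of lines of Python will extract just the code from the pasted URL:
from urllib.parse import parse_qs, urlparse

# Print only the "code" query parameter, ignoring session_state and friends.
url = input("Paste the full post-authentication URL: ")
print(parse_qs(urlparse(url).query)["code"][0])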
This will give you an access token that will last for around an hour.
Fetchmail configuration
Now we need an oauthbearer
.fetchmailrc
like this:
set daemon 60
set no bouncemail
poll outlook.office365.com protocol IMAP port 993
auth oauthbearer username "[email protected]"
passwordfile "/home/localuser/.fetchmail-token"
is localuser here
keep
sslmode wrapped sslcertck
folders INBOX
mda "/usr/bin/procmail -d %T"
Replace [email protected] and localuser.
At this point, hopefully starting /opt/fetchmail7/bin/fetchmail
will work!
Refresh tokens
As per the OAUTH2
README,
fetchmail
itself does not take care of refreshing the token, so you need
something like this in your crontab:
*/2 * * * * $HOME/src/fetchmail-next/contrib/fetchmail-oauth2.py -c $HOME/.fetchmail.oauth2.cfg --auto_refresh
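Under the hood, --auto_refresh is just the standard OAuth2 refresh-token grant: a POST to token_url exchanging the stored refresh token for a new access token. A minimal sketch of that exchange (the helper script additionally handles reading and writing the token files, expiry checks, and so on):
import json
import urllib.parse
import urllib.request

# Values as per ~/.fetchmail.oauth2.cfg; no client_secret is needed for O365.
TOKEN_URL = "https://login.microsoftonline.com/common/oauth2/v2.0/token"
CLIENT_ID = "facd6cff-a294-4415-b59f-c5b01937d7bd"

def refresh_access_token(refresh_token):
    data = urllib.parse.urlencode({
        "client_id": CLIENT_ID,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    }).encode()
    with urllib.request.urlopen(TOKEN_URL, data) as resp:
        return json.load(resp)["access_token"]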
I have the misfortune of maintaining some things using CMake. One major
annoyance is that __FILE__
is an absolute path, and that can’t be changed in
CMake itself. Like most CMake annoyances, you can find a discussion online from
about 15 years ago, but no sign of an actual fix.
Instead, you need a hack: this - I think - is the simplest one.
First, in our top-level CMakeLists.txt, we’ll define this helper function:
function(add_srcs NAME)
set(${NAME} ${ARGN} PARENT_SCOPE)
foreach(f IN LISTS ARGN)
file(RELATIVE_PATH b ${CMAKE_SOURCE_DIR}
${CMAKE_CURRENT_SOURCE_DIR}/${f})
set_source_files_properties(${f} PROPERTIES COMPILE_DEFINITIONS
"__FILE__=\"${b}\"")
endforeach()
endfunction()
This will take each of the arguments, convert each file into a path relative to
the top-level directory, then re-define __FILE__
on a per-source-file basis.
We also set()
a variable for our parent scope to use.
We’ll also need -Wno-builtin-macro-redefined.
Then, in each child CMakeLists.txt, we will do something like:
add_srcs(MYCODE_SRCS mycode.c mycode.h)
add_library(mycode ${MYCODE_SRCS})
add_srcs(CODE2_SRCS code2.c code2.h)
add_library(code2 ${CODE2_SRCS})
I thought it might be interesting, at least to myself, to write up how I
approach code reviews. My history in tech is one where close code review was
emphasized and well-respected: most appreciated that a detailed review was not
only worth the reviewer’s time, but mutually beneficial, and often a great
learning experience for everyone. So my tiny mind can’t process ideas like
post-commit reviews, which seem to be predicated on the idea that testing is some kind of replacement for code review. To me, the kind of thing that tests can cover is only a very small part of what code review is useful for.
I’ve code-reviewed entire projects totalling many thousands of lines, and
single-character changes, but I’m usually following at least some of the below
patterns either way:
Understand the context
First thing I read is the commit message. Of course, this should be in normal
form, but I’m really making sure I can understand
what the change is, based solely upon the commit message. Without looking at the
code, can I, as a casual observer, understand what’s changed? Is the commit
title specific enough? Do the commit message’s contents describe not just
what changed, but why (as usual, the GNU standard is an exemplar of what not
to do here)? Is it clear? Does the message needlessly have things that belong on
another tracking system (target gate)?
I will read any associated ticket for its context - especially keeping an eye
out for anything that doesn’t seem to fit with the PR’s changes. This could be a
missing case, or a fundamental misunderstanding of what the real underlying
problem is. If there is any design doc mentioned (and they should be mentioned!)
I’ll also read that and diff
its contents against what actually got
implemented.
I’m looking mainly for disparities between what everyone agreed we should do,
and what is actually happening in the code, but especially for missing things;
it’s very easy to accidentally skip somebody’s drive-by comment, but that could
turn out to be crucial to the implementation.
I also consider if this change makes sense on its own, and if it could be split
up: this is often a matter of appetite (and personally I find the Linux kernel
approach often goes a little too far), but a patch series with one logical change per commit is often much easier to review. It should hopefully go without saying that each individual commit in the series should still pass CI, but unfortunately that’s painful to do with github PRs, at least.
Get an overview
Next I start looking at the actual code changes: often with one tab per
file,
I’m trying to understand how the changes fit together: who calls what, what new
intra-code dependencies there are, what the possible impact of the changes could
be.
I might well look back in git history for each of these files to understand why
the old code is like it is: this is also often very useful in identifying
potential issues with the new changes.
Depending on the change, I will often check out a local copy, and use git grep, ctags, etc. to help me understand how everything fits together.
My focus at this level is often on interfaces: does a new method have a suitable
name? Is it at the right level of abstraction? What is the ownership of the
relevant objects? Are there any layering violations?
Are there any external dependencies we need to worry about? Equally if anyone is
depending on us, are we providing well-written interfaces? Are they designed
with care and attention to versioning, information hiding, and all the usual API
concerns? Is this going to wear well after it’s been in production for years?
I’m also bearing in mind other ongoing work: if there’s a project underway that
is directly relevant to this specific change, I might ask for some accommodation
that will make the eventual merge of both easier. Equally if there’s a general
desire to take a particular technical direction, I might complain if something
is taking a different tack.
It’s a rare code review where I don’t have to pause to go research something:
systemd
service definition semantics, syscall error modes, how selinux
roles
work etc. As I said above, great learning experience!
Are there potential performance concerns with the change: lots of unnecessary
I/O, potential for big-O issues, needless overhead etc? What are the expected
limits of the objects being handled?
What if there are bugs with this change: is the error handling suitable? Is
there a sufficient level of logging, exception details, etc. to identify in the
field what went wrong? Is there unnecessary noise? How would a stressed-out SRE
deal with this in production?
Have any necessary unit/component tests been updated or added? Do they actually
test something useful?
I almost never build or test changes I’m reviewing: that’s a job for the
submitter and your CI infrastructure. The only exception is if I’m struggling to
understand something, and running the tests would help.
Detailed review
I’m now going to go line-by-line through the whole patch, leaving comments where
necessary. Sometimes I’ll reach something I don’t understand, and leave a
“FIXME” for myself: if, after reading the whole change, I still don’t understand
what’s happening, this will often re-formulate itself into a question for the
submitter, along with a request for expanded code comments, but usually I can
just delete these later.
If I find major - architectural level - issues with what I’m looking at, that’s
often a good prompt to take the discussion elsewhere, perhaps to a Zoom call or
design document discussion. Doing design review inside a PR is not fun for
anyone.
I’ve noticed a tendency to “review the diffs”: the idea that only the changed
lines are relevant to the review - that tools expand 10 lines at a time is a
symptom of this. This is very wrong-headed in my opinion, and I often find myself reading around in the rest of the code to make sure I can properly review what has changed.
Comb for nits
Everyone has a different appetite for code review nits: generally, I will always
point out actual typos (often just once, if it’s repeated, expecting the
submitter to apply my comment to all instances). If I have a substantive
comment, I might also suggest some style-level improvements. I never expect
someone to make these sort of changes for existing code: the general idea is to
leave the code in a slightly better place than it was, not re-write whole files
for cosmetic nits.
Often these stylistic nits are marked “optional”: if the submitter feels like
it, they could change it, but it’s no big deal if not.
I’ll very often have style comments on things like:
- unnecessary comments that just describe the code
- missing comments
- variable, function naming
- function size and decomposition
- local customs
Many of these things can be a matter of opinion, so I try to bear in mind other
ways of thinking, up to a point. I’m never going to be happy seeing a ton of
CamelCase and Hungarian notation in code that doesn’t have it already.
Iterate
I haven’t yet found a code review tool that’s ideal at iteration: gerrit is
still pretty hopeless at tracking outstanding comment state. PRs in github are
better at displaying this, but have the fatal flaw that any history rewrite
means all context is lost.
Regardless, when I get a new version of the changes, I’ll often review both the
incremental diff and the whole change, checking that:
- my review comments have been acted upon and the fixes look good
- the change as a whole still makes sense:
- is the commit message still correct?
- are there now unnecessary changes, due to follow-on fixes?
- do the updates go far enough?
Long review cycles for a PR can be grueling, both for reviewers and the PR
owner. But in my opinion it’s almost always worth the effort, especially for
complex changes: this code is
probably going to live a lot longer than you’d think, and be maintained by
people other than you.
Even worse, it’s often work not given the respect it’s due: PR owners can see it
as a combative process, and management can see it as overhead. I don’t really
know what to do about that.
Pay it forward!
It’s still common for a systems library to be written in the default lingua
franca, C, although Rust is encroaching, for good reasons.
However, when it comes to testing, things get tedious quickly: writing unit or
component tests in C is a slow, bug-prone exercise. With
libvfio-user, after fixing too many
bugs that were due to the test rather than the test subject, I decided it would
be worth looking at alternative approaches. The aim was to reduce the time it
takes to develop unit/component tests.
Up until this point, we’d been using
ctest, along with
cmocka when we needed to mock out certain functions (such
as socket handling). Leaving aside my strong feelings on these tools, this was
rather unsatisfactory: libvfio-user
effectively implements a (UNIX) socket
server, but we weren’t actually testing round-trip interactions for the most
part. In terms of code coverage, very little useful could be done via this unit
testing approach, but the “sample” client/server was tedious to work with for
testing purposes.
Python-based testing
After a quick proof of concept, it became clear that using Python would be a
great choice to cover most of our testing needs. libvfio-user
doesn’t ship
with any client bindings, and, given that the main clients are
qemu,
cloud-hypervisor and
SPDK, Python bindings would be of dubious utility.
As a result, we decided against “proper” Python bindings, auto-generated or
otherwise, in favour of a small and simple approach. In particular, by using the
terrible magic of ctypes, we could easily set up both client and server test
cases that fully represent how the library works in real life.
So, instead of auto-generated bindings, we write - by hand - simple, thin layers of type wrappers:
class vfio_irq_info(Structure):
_pack_ = 1
_fields_ = [
("argsz", c.c_uint32),
("flags", c.c_uint32),
("index", c.c_uint32),
("count", c.c_uint32),
]
small harness routines for socket handling
…
def connect_client(ctx):
sock = connect_sock()
json = b'{ "capabilities": { "max_msg_fds": 8 } }'
# struct vfio_user_version
payload = struct.pack("HH%dsc" % len(json), LIBVFIO_USER_MAJOR,
LIBVFIO_USER_MINOR, json, b'\0')
hdr = vfio_user_header(VFIO_USER_VERSION, size=len(payload))
sock.send(hdr + payload)
...
… interacting with the
library on the server
side
…
def get_pci_header(ctx):
ptr = lib.vfu_pci_get_config_space(ctx)
return c.cast(ptr, c.POINTER(vfu_pci_hdr_t)).contents
… and so on. Writing this by hand might seem immensely tedious, but in practice,
as it’s pretty much all boilerplate, it’s very quick to write and modify, and
easily understandable; something that can rarely be said for any kind of
auto-generated code.
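The other piece of the thin layer is loading the library itself and declaring the handful of prototypes we call directly. Roughly like this - the exact path and setup in the test harness differ, but this is the gist:
import ctypes as c

# Load the freshly-built shared library (the path here is illustrative).
lib = c.CDLL("./build/libvfio-user.so", use_errno=True)

# Declare argument and return types so ctypes checks our calls.
lib.vfu_pci_get_config_space.argtypes = (c.c_void_p,)
lib.vfu_pci_get_config_space.restype = c.c_void_p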
Client/server interactions
Another observation was that, for the purposes of these tests, we really didn’t
need a client process and a server process: in fact, we don’t even need more
than one thread of execution. If we make each test round-robin between acting as
the client, then acting as the server, it becomes trivial to follow the
control flow, and understanding logs, debugging, etc. is much easier. This is
illustrated by the
msg()
helper:
def msg(ctx, sock, cmd, payload=bytearray(), expect_reply_errno=0, fds=None,
rsp=True, expect_run_ctx_errno=None):
"""
Round trip a request and reply to the server. vfu_run_ctx will be
called once for the server to process the incoming message,
@expect_run_ctx_errrno checks the return value of vfu_run_ctx. If a
response is not expected then @rsp must be set to False, otherwise this
function will block indefinitely.
"""
# FIXME if expect_run_ctx_errno == errno.EBUSY then shouldn't it implied
# that rsp == False?
hdr = vfio_user_header(cmd, size=len(payload))
if fds:
sock.sendmsg([hdr + payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
struct.pack("I" * len(fds), *fds))])
else:
sock.send(hdr + payload)
ret = vfu_run_ctx(ctx, expect_errno=expect_run_ctx_errno)
if expect_run_ctx_errno is None:
assert ret >= 0, os.strerror(c.get_errno())
if not rsp:
return
return get_reply(sock, expect=expect_reply_errno)
We are operating as the client when we do the sendmsg(); the server then processes that message via vfu_run_ctx(), before we “become” the client again and receive the response via get_reply().
We can then implement an individual test like this:
def test_dma_region_too_big():
global ctx, sock
payload = vfio_user_dma_map(argsz=len(vfio_user_dma_map()),
flags=(VFIO_USER_F_DMA_REGION_READ |
VFIO_USER_F_DMA_REGION_WRITE),
offset=0, addr=0x10000, size=MAX_DMA_SIZE + 4096)
msg(ctx, sock, VFIO_USER_DMA_MAP, payload, expect_reply_errno=errno.ENOSPC)
which we can run via make pytest
:
...
___________________________ test_dma_region_too_big ____________________________
----------------------------- Captured stdout call -----------------------------
DEBUG: quiescing device
DEBUG: device quiesced immediately
DEBUG: adding DMA region [0x10000, 0x80000011000) offset=0 flags=0x3
ERROR: DMA region size 8796093026304 > max 8796093022208
ERROR: failed to add DMA region [0x10000, 0x80000011000) offset=0 flags=0x3: No space left on device
ERROR: msg0x62: cmd 2 failed: No space left on device
...
This is many times easier to write and test than trying to do this in C, whether
as a client/server, or attempting to use mocking. And we can be reasonably
confident that the test is meaningful, as we are really executing all of the
library’s message handling.
With a little bit of tweaking, we can also use standard C-based tools like valgrind and gcov. Code coverage is simple: after defeating the mini-boss of cmake, we can run make gcov and get code-coverage results for all C code invoked via the Python tests - it just works!
Running Python tests with valgrind
was a little harder: for leak detection, we
need to make sure the tests clean up after themselves explicitly. But Python
itself also has a lot of valgrind
noise. Eventually we found that this valgrind
invocation worked well:
PYTHONMALLOC=malloc \
valgrind \
--suppressions=$(CURDIR)/test/py/valgrind.supp \
--quiet \
--track-origins=yes \
--errors-for-leak-kinds=definite \
--show-leak-kinds=definite \
--leak-check=full \
--error-exitcode=1 \
$(PYTESTCMD)
We need to force Python to use the system allocator, and add a number of
suppressions for internal Python valgrind
complaints - I was unable to find a
working standard suppression file for Python, so had to construct this myself
based upon the Python versions in our CI infrastructure.
Unfortunately, at least on our test systems, ASAN was completely incompatible,
so we couldn’t directly run that for the Python tests.
Summary
The approach I’ve described here has worked really well for us: it no longer
feels immensely tedious to add tests along with library changes, which can only
help improve overall code quality. The tests are quick to run and modify, and for the most part it’s easy to understand what they are actually doing.
There have been a few occasions where ctypes
has been difficult to work with -
for me the documentation is particularly sparse, and callbacks from the C
library into Python are distinctly non-obvious - but we’ve so far always
managed to battle through, and twist it to our needs.
Doing things this way has a few other drawbacks: it’s not clear, for example, how we might test
intermittent allocation failures, or other failure injection scenarios. It’s
also not really suitable for any kind of performance or scalability testing.
I’m curious if others have taken a similar approach, and what their experiences
might be.
Stefan Hajnoczi recently posted about clean commit
history.
It’s a controversial viewpoint that not everyone agrees with - there is a
sizable population in favour of “never rewrite history”. For me, though, the
points he makes there are totally correct: each commit should be a logical
change, main (née master) should stay green, and CI should pass at every single point in main’s history. More than just CI though: regardless of
whether it passes CI, the main
branch should be of good quality at all times,
if you want to avoid the Quality Death
Spiral.
Unfortunately, Github pull requests make this model a little difficult for a few
reasons:
You can’t ever rebase a PR undergoing review
It’s important that a non-draft PR is never rebased, or re-written in any way.
Why? Well, aside from making it difficult for a reviewer to see what’s changed
since last looking, if you rebase, the commits previously on the PR disappear
off into reflog
hyperspace.
The View Changes
button on review comments is attached to that particular
commit hash, which is no longer in the history for that branch, and you get the
dreaded:
We went looking everywhere, but couldn’t find those commits.
Note that if your PR is still a draft, you’re fine to edit the history whichever
way you like: in fact, it’s often useful for review purposes to have multiple
commits even at the start of a PR review before you move it from draft. Up to you.
The only other safe time to rebase is on final approach. At that point,
presuming you are keeping to the “single main
commit per PR” approach (see
below), you’ll be wanting to squash the entire branch history into a single
commit to main. For this, I usually use
prr: it’s handy for picking up Reviewed-by
automatically, and merging commit comments together for final editing.
Github CI only runs on branch tips
You probably don’t want to have a PR where you’re going to merge more than
one commit into main. This is because CI only runs on the top-level commit: if
an ancestor commit breaks the build, you’ll never know. Stefan mentions using
git rebase --exec
for checking commits in a stack, which indeed works great,
but unless you’re running exactly the same CI that’s running under Github
Actions, you can’t rely on it.
If that’s the case, what if you have one or more changes that depend on another?
This is where “stacked PRs” come in, and they’re a bit of a pain…
Stacked PRs are cumbersome
Gerrit has a really useful model for
reviewing stacks of changes: instead of the full history, each “patchset”
corresponds to the single logical change Stefan talks about above. Every time
you push to Gerrit, you’re supposed to have collapsed and rebased additional
changes into single commits corresponding to each Gerrit CR. The model has some
disadvantages as well (in particular, it’s a bit of a pain to keep a full
history locally), but the Gerrit review UI doesn’t suffer from the rebasing
issues Github does.
Presuming - as there is no CI available - gerrithub is a
non-starter, the only option available on Github is to use multiple PRs. This is
better than it used to be, but is still a little painful.
Essentially, a stacked PR is one that’s opened not against the main branch, but against another PR branch. Say we have changes A and B, where B is dependent on A. You’d create a local branch with A, then push it to Github and open a PR. You’d have another local branch with A and B, then push that branch to Github and open a separate PR.
Now we need to make the B PR be based against the A PR. You can do this via the web UI by clicking Edit, though there is an annoying bug here: it doesn’t reset the title and description. You can use gh pr create --base ... to avoid this problem.
Now, in the second PR, you’ll just see the commit for B. Each PR can be
reviewed separately, and each PR gets its own CI run.
You also might want to merge additional changes up the stack. Let’s say that you have commit A2 on the A PR, that you want in PR B and C: the best - if rather tedious - way to do this, is to merge A into B, then B into C. That’s a lot of merge commits, but remember we’re squashing every PR before merging it to main.
You’ll see on the web recommendations to “merge downwards”: you wait for commit approval for the whole stack, then merge the top PR (B) into the PR underneath it (A), and so on, until you merge to main.
I don’t think that’s necessary these days. Instead, when you have approval for the base PR - and logically, it makes sense that it’s reviewed first - you can merge it to main. Github will then offer to delete the PR branch. If you do this, the stacked PR gets automatically reset such that its merge base is now main!
There is an annoying thing here though: because of that squash during the merge to main, git - and Github - needs you to merge main back into the commit history of the PR that just changed bases. If you already merged the parent PR, you can always do git merge -Xours master to fix this, since there shouldn’t be any actual diff difference between the PR branch diffs as a whole, and what was merged to master. Or, if you didn’t merge in the parent PR, you’ll need a normal git merge master.
Another bug (as far as I’m concerned) is that if you ask for review on a stacked
PR, it doesn’t get tagged with “Review required”, since, technically, you could
merge the PR into its parent without approval. And there is no “Review
requested” tag.
I would love all this to have some tooling: something that lets me do
everything on my local stacked branches, automate merges up, keep track of
dependencies, and update the branches in Github. But I haven’t been able to
find anything that can do it.
[2022-05-12 update]: I just came across spr, which is so far proving excellent in solving some of these problems. I love it!
For reasons, I now need to interact with Office365 mail and calendar. It should
go without saying that the Outlook webapp is almost painfully unusable (there
really is no key binding for “next unread email”). Thus began the quest to get
mutt interacting with the O365 server. This was a rather painful process: so
much of the information on the web refers to earlier authentication schemes,
Microsoft-special protocols, things that don’t support 2FA, dead Linux software,
useless Linux software, etc.
After many false starts, I eventually found a working solution that allows mutt
usage (and my treasured key binding for “mark current thread as read and move to
the next unread email”). That solution is
davmail. Yes, it’s on sourceforge, and yes,
it’s Java, but it works perfectly.
It’s not very well-documented, but you can run it in headless mode and still
do the interactive OAuth2 flow needed with modern O365. Your settings should
include:
davmail.mode=O365Manual
davmail.url=https://outlook.office365.com/EWS/Exchange.asmx
When davmail
starts, it will ask you to visit a URL and paste the resulting URL
back - this contains the necessary OAuth2 tokens it needs. No need for any GUI!
Once davmail
is running, your .fetchmailrc
can be:
poll localhost protocol IMAP port 1143
auth password username "[email protected]"
is localuser here
sslmode none
keep
mda "/usr/bin/procmail -d %T"
folders INBOX,etc,etc
Note that since davmail
is running locally, there’s not really any need for SSL,
though you can set that up if you like.
When you start fetchmail, enter your password, and that will initiate the auth
flow against the davmail
instance. Note that you’re not storing passwords
anywhere, unlike the old-style app password approach you might have used
previously on gmail and the like.
I don’t need to send mail often, so I have mutt
set up like this:
set smtp_url= "smtp://[email protected]@localhost:1025/"
unset smtp_pass
set ssl_starttls=no
set ssl_force_tls=no
Having to enter my password each time is not a big deal for me.
Equally I have my calendar app set up to pull over caldav from davmail. Works
great. I’d love to be able to pull my O365 calendar into Google Calendar, but
apparently Google and Microsoft are unable - or more likely unwilling - to
make this work in any meaningful way.
I’m pretty sure it’s possible to adapt Google’s OAuth2 scripts to directly use
fetchmail
with O365’s modern auth stuff, but I’m not sure I have the energy to
figure it out - and as far as I can find, nobody else has?
I’m apparently old-school enough to find gmail
and co painfully inefficient for
handling significant amounts of mail. I still find procmail
+mutt
hard to beat. One thing mutt
can’t do, however, is filter threads
automatically - there’s no “mute” facility like gmail
has; threads have to be processed manually.
Equally, procmail itself has no threading facilities or understanding of Message-Id or References.
Matching email threads
It can be done, though, with some cheesy awk:
#!/bin/bash
#
# If a mail message has a References: value found in the refs file, then
# add the requested header.
#
# Usage:
#
# cat mail_msgs | match-thread.sh ~/.mail.refs.muted "Muted: true"
#
ref_file="$1"
header="$2"
mail=/tmp/match-thread.mail.$$
cat - >$mail
newrefs="$(cat $mail | formail -x references -x message-id | tr -d '\n')"
touch $ref_file
cat $ref_file | awk -v newrefs="$newrefs" '
BEGIN {
found = 0;
split(newrefs, tmp);
for (i in tmp) {
refs[tmp[i]]++;
}
}
# Each thread will have one line in the ref file, with
# space-separated references. So we just need to look for any
# reference from the mail.
{
for (ref in refs) {
if (index($0, ref) != 0) {
found = 1;
exit(0);
}
}
}
END {
exit(found ? 0 : 1);
}
'
if [[ $? = 0 ]]; then
cat $mail | formail -i "$header"
else
cat $mail
fi
rm $mail
Essentially, we record all the References
in the thread we’re trying
to act on. Then we can trigger the above to see if the new mail is part
of the thread of interest.
(This seems like the sort of thing formail
could do, given its -D
option has a message ID cache, but I haven’t even bothered to take a
look at how hard that would be…)
procmail usage
In .procmailrc, we’d use this like so:
:0 Wfh: formail.lock
| $HOME/src/procmail-thread/match-thread.sh $HOME/.refs.muted "Procmail-Muted: true"
:0 Wfh: formail.lock
| $HOME/src/procmail-thread/match-thread.sh $HOME/.refs.watched "Procmail-Watched: true"
This will add the given header if we find any of the email’s
References
values in our “database”.
Then, we can do what we like with the mails, like deliver them as
already-read, carbon copy them to the inbox, etc.:
:0
* Procmail-Muted: true
{
SWITCHRC=$HOME/.procmailrc.markread
}
:0
* Procmail-Watched: true
{
:0 c:
$DEFAULT
SWITCHRC=$HOME/.procmailrc.markread
}
:0
$DEST/
mutt usage
To actually watch or mute a thread, we add a couple of mutt
macros:
macro index,pager "M" "|~/src/procmail-thread/add-thread.sh ~/.refs.muted<return>"
macro index,pager "W" "|~/src/procmail-thread/add-thread.sh ~/.refs.watched<return>"
The add-thread.sh
script is similar to the above, but populates the
refs file with all message IDs found in the given email.
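The actual script is another small shell-plus-formail wrapper, but the logic amounts to something like this (a Python rendering of the idea, not the script itself):
import email
import sys

# Append the mail's Message-Id and References values to the refs file,
# as one space-separated line per thread.
ref_file = sys.argv[1]
msg = email.message_from_file(sys.stdin)
ids = (msg.get("Message-Id", "") + " " + msg.get("References", "")).split()
with open(ref_file, "a") as f:
    f.write(" ".join(ids) + "\n")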
I put all this in a git repo.