Race Conditions in glibc (Debian-based containers)
Some of you may have run into DNS issues when using a Debian-based container.
This thread is a place to discuss:
- Race conditions or other issues found in glibc (technical details)
- Different approaches to mitigation (including using Alpine)
- Reasons for avoiding Debian in the first place
Per my research:
From inside a container on Cycle I ran `tcpdump -i any port 53 -vvv`. This gave me the following interesting information:
- Every DNS query for `someother.domain.com` resulted in both A and AAAA requests being sent in parallel.
- The resolver returned correct responses (CNAME + A/AAAA records).
- Despite this, the container still saw intermittent failures.
So at this point I knew that the internal resolver was working correctly and that the failure was happening inside the container's DNS client logic.
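To put a number on "intermittent", a loop like the one below can be run inside the container while tcpdump is capturing. This is only a rough sketch: `probe` is a name I made up, port 443 and the attempt count are arbitrary, and the `resolver` parameter exists purely so the function can be exercised without a network.

```python
import socket
from collections import Counter


def probe(host, attempts=100, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and tally successes and failures.

    `probe` and its parameters are illustrative, not part of any library.
    """
    outcomes = Counter()
    for _ in range(attempts):
        try:
            # No address family is given, so both IPv4 and IPv6 results
            # are requested, matching the A + AAAA pair seen in tcpdump.
            resolver(host, 443, type=socket.SOCK_STREAM)
            outcomes["ok"] += 1
        except socket.gaierror as exc:
            # exc.errno carries the EAI_* code (e.g. EAI_AGAIN, EAI_NONAME).
            outcomes[f"gaierror {exc.errno}"] += 1
    return outcomes


# Example (performs real lookups):
#   print(probe("someother.domain.com"))
```

Running the same script in a Debian-based image and an Alpine-based image against the same name should make any difference in failure rate visible.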
So I dove deeper into some research on glibc, and specifically `getaddrinfo()` since it handles DNS resolution, and found that:
- It does in fact send A and AAAA queries simultaneously.
- If the AAAA response returns first (with NXDOMAIN, SERVFAIL, or an empty answer), glibc may prematurely fail the entire resolution, even if a valid A record arrives milliseconds later.
The second part, where it fails prematurely, seems to be the major issue.
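For anyone who has to stay on a Debian-based image, one application-level workaround is to ask only for IPv4 addresses and retry on transient failures. The sketch below assumes the target is reachable over IPv4. `resolve_ipv4`, the retry count, and the delay are my own placeholders, and the `resolver` parameter is only there for testing. As far as I understand, passing `AF_INET` means the resolver has no reason to issue the AAAA query at all.

```python
import socket
import time


def resolve_ipv4(host, port, retries=3, delay=0.05, resolver=socket.getaddrinfo):
    """Return IPv4 address strings for `host`, retrying on resolver errors.

    Hypothetical helper: the name, defaults, and backoff are assumptions.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            infos = resolver(
                host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
            )
            # Each entry is (family, type, proto, canonname, sockaddr);
            # for AF_INET the sockaddr is an (ip, port) tuple.
            return [info[4][0] for info in infos]
        except socket.gaierror as exc:
            last_error = exc
            if attempt < retries - 1:
                # Simple linear backoff before the next attempt.
                time.sleep(delay * (attempt + 1))
    raise last_error
```

A system-level alternative that may be worth testing is `options single-request` (or `single-request-reopen`) in `/etc/resolv.conf`, which the resolv.conf man page describes as making glibc perform the IPv4 and IPv6 lookups sequentially instead of in parallel. I have not verified that it resolves this specific failure.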
Luckily, musl libc, the resolver on Alpine, handles the same lookups more predictably, which has so far eliminated any occurrence of this error. So if you're in a position to use Alpine, it's more reliable here (and generally more secure).
Looking forward to hearing some insights and opinions here!