Skip to content

linux: improve openat and openat2 handling#939

Open
cyphar wants to merge 3 commits into
RsyncProject:masterfrom
cyphar:linux-openat-openat2
Open

linux: improve openat and openat2 handling#939
cyphar wants to merge 3 commits into
RsyncProject:masterfrom
cyphar:linux-openat-openat2

Conversation

@cyphar
Copy link
Copy Markdown

@cyphar cyphar commented Jun 1, 2026

When looking at the usage of openat and openat2 on Linux, I noticed two bugs:

  • openat(2) famously does not return -EINVAL for unknown flag bits on Linux, so the fallback code to use O_RESOLVE_BENEATH if defined would result in all open operations with the "secure resolver" becoming insecure. This would require a fairly convoluted build environment to trigger, so this isn't exploitable but if it were triggered it would be a security bug.
  • The usage of openat2(2) did not handle -EAGAIN at all, which could lead to spurious errors. openat2(2) can return -EAGAIN when trying to walk into .. during a scoped lookup like RESOLVE_BENEATH and there was a rename or mount operation anywhere else on the system. The reason for this behaviour is a little esoteric but in short I implemented it this way so that userspace could choose their retry policy and (more importantly) the kernel could avoid potentially-unbounded retries that could lead to a DoS. The man page does mention this but it is easy to miss:

    EAGAIN how.resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could not ensure that a ".." component didn't escape (due to a race condition or potential attack). The caller may choose to retry the openat2() call.
    It is possible for some of the concerns raised in other threads about instability with 3.8.3 were caused by this, but nobody has posted strace logs so it's hard to know for sure, and I haven't had time to test this myself.


I also took a quick look at the path_has_dotdot_component usage and I'm not really convinced that it is actually doing anything useful. The comments seem to be Claude-authored so I'm not sure reading them is super helpful, but the upshot of:

	/* Reject any path with a literal ".." component (bare "..",
	 * "../foo", "foo/..", "foo/../bar", "subdir/.."). The previous
	 * substring-based check caught only "../" prefix and "/../"
	 * substring; bare ".." and trailing "/.." escape on the per-
	 * component walk fallback used by NetBSD/OpenBSD/Solaris/Cygwin
	 * and pre-5.6 Linux. RESOLVE_BENEATH on Linux/FreeBSD/macOS
	 * catches some of these in-kernel with EXDEV, but the front
	 * door must reject them consistently with EINVAL across all
	 * platforms so callers can rely on the validation. */

appears to be that .. is problematic and RESOLVE_BENEATH catches "some of these". This is is incorrect -- .. cannot escape the starting point with RESOLVE_BENEATH (and O_RESOLVE_BENEATH as it was modeled after RESOLVE_BENEATH), regardless of where you put it in the path. However, you can walk into .. during a lookup without issue if it remains within the starting root.

Reading the rest of the code, it seems this was actually done because the fallback userspace resolver cannot handle .. and so the decision was to ban it for everything for "consistency" but this doesn't really make sense -- symlinks can contain .. so there will always be an inconsistency between the resolver that rejects all symlinks and those that can handle them. It seems like it would be more helpful to most users to permit .. for RESOLVE_BENEATH and only block it in the resolver.

My final patch does this, but I can drop it if you prefer.

For what it's worth, there actually are safe ways of handling .. in userspace. libpathrs (shameless plug!) does this on Linux using /proc/self/fd/... to check the logical path against the real path in a very solid way, and Go's os.Root has a brute-force approach that in principle works on any platform by restarting the lookup each time they hit .. and re-applying the logical path walked so far (I plan to adapt the approach for libpathrs at some point so it can be cross-platform as well).

I suspect this will also help alleviate some of the issues people have had with regressions on v3.8.3.

cyphar added 3 commits June 1, 2026 11:25
On Linux, openat(2) famously does not return an error if you pass it
invalid flag bits. This was something I fixed in openat2(2) but for
backward compatibility reasons, openat(2) will always silently ignore
unknown O_* flags (which makes adding new ones quite painful).

As a result, the fallback path added by commits e1c5f0e
("t_chmod_secure: probe kernel RESOLVE_BENEATH at runtime; drop test
skip") and commit 7f60ec0 ("syscall: also use O_RESOLVE_BENEATH on
FreeBSD and MacOS") would always return sucess and could cause spurious
test failures so we can just remove it. O_RESOLVE_BENEATH will never get
added to Linux for this reason, and anyone doing so out-of-tree is just
creating a massive foot-gun so there is no need to entertain such broken
kernels.

Fixes: 7f60ec0 ("syscall: also use O_RESOLVE_BENEATH on FreeBSD and MacOS")
Fixes: e1c5f0e ("t_chmod_secure: probe kernel RESOLVE_BENEATH at runtime; drop test skip")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
In order to avoid in-kernel DoS due to unbounded retries, openat2(2)
will return -EAGAIN when trying to walk through ".." if there is any
racing rename or mount on the entire system.

Note that this applies regardless of whether the rename was on the same
filesytem or mount was in the same mount namespace as the process doing
openat(2) -- as a result, calling openat2(2) on even a modestly busy
system will result in spurious -EAGAIN errors every once in a while and
it is necessary to implement a retry loop for it. (Libraries such as
libpathrs and heavy users of openat2(2) like runc all do this, and in
our testing we found that ~256 iterations is enough to provide
resilience even on incredibly rename-heavy machines.)

Fixes: 4fa7156 ("syscall: use openat2(RESOLVE_BENEATH) on Linux for secure_relative_open")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
This code was added in commit 30656c5 ("syscall: add
symlink-race-safe do_*_at() wrappers and harden secure_relative_open")
but the justification given is really not particularly convincing.

The stated justification is that RESOLVE_BENEATH does not block some
".." escapes, which is categorically false (if it were true, that would
be a Linux kernel bug and the kernel actually has an extra safety check
at the end of lookup that would render this kind of breakout practically
impossible).

Reading the code, the actual reason appears to be to provide some
consistency between the RESOLVE_BENEATH and fallback per-component
O_NOFOLLOW lookups. However, the O_NOFOLLOW resolver can never have full
parity with the RESOLVE_BENEATH resolvers because RESOLVE_BENEATH permit
symlinks and symlinks can contain ".." components. It seems more prudent
to just allow ".." for modern systems and reject it for older ones. If
truly necessary, there are even ways to support ".." in a cross-platform
way so this could even be seen as a "not-implemented-yet" error rather
than a permanent API contract.

For the really-old systems case (with no O_NOFOLLOW or AT_DIRFD) then
just as we have to accept symlinks (which might contain "..") we might
as well accept bare ".." components. Rejecting ".." for that case is
just pure security theatre.

Fixes: 30656c5 ("syscall: add symlink-race-safe do_*_at() wrappers and harden secure_relative_open")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
@cyphar cyphar force-pushed the linux-openat-openat2 branch from 062c12e to 3d80c98 Compare June 1, 2026 01:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant