Description
Both lockless queue implementations inside std
for Once
and RwLock
suffer from the ABA problem, resulting in unsoundness.
In both cases, the enqueue operation roughly looks like the following:
struct Node {
next: *mut Node,
}
struct Queue {
state: AtomicPtr<Node>,
}
let mut state = queue.state.load();
loop {
node.next = to_node(state); // Some pointer masking
let next = to_state(node); // Some bittagging
match state.compare_exchange(state, next) {
Ok(_) => break,
Err(new) => state = next,
}
}
This code is problematic, because it allows the following to occur:
- Thread 1: Append node 1 to the queue.
- Thread 2: Load the state (a pointer to node 1) and store it as the
next
pointer of node 2. - Thread 1: Remove node 1 from the queue and deallocate it/modify it.
- Thread 1: Create node 3 at the same address as node 1 and append it to the queue.
- Thread 2: Perform the CAS. Because the state has the same bits, this succeeds.
Now, any traversal operation on the queue that attempts to dereference the next
pointer of node 2 exhibits UB, as the pointer, while having the right address, only has provenance to node 1, which is no longer valid, but may not access node 3.
This is of little practical concern, as it is extremely unlikely that LLVM will perform a problematic optimization; but nevertheless the code is unsound, both under stacked-borrows and LLVM's semantics.
This is the cause of #121626, where this exact situation led to a (correct) UB report from miri.
@rustbot label +T-libs +I-unsound
Possible solutions
The only solution I see at the moment is to use fuzzy provenance and use from_exposed_provenance
inside the traversal operation to make miri/LLVM guess the right provenance.
Activity
the8472 commentedon Mar 3, 2024
Since this is mac-specific why not use a DWCAS? Afaik they don't have any hardware that doesn't have those.
joboet commentedon Mar 3, 2024
It's not mac-specific, though.
Once
gets used on all platforms except thefutex
ones, I believe even Windows.RwLock
gets used on NetBSD and other UNIXes as well, and I'd like to use it to get rid of the unsound SGX lock implementations. So we need to find something portable here.Also, this is a provenance-only problem, so I'd like to avoid anything that makes the machine code less efficient, because in terms of what is actually generated, the current implementation is perfectly bug-free. It "just" violates the memory model.
the8472 commentedon Mar 3, 2024
And using the return value from the CAS (the
Ok(_)
) and working on that instead ofstate
doesn't work either?joboet commentedon Mar 3, 2024
No, because the old pointer is stored inside the node, and we can't retroactively update the pointer, because a traversal operation could access the old one immediately after the
compare_exchange
.saethlin commentedon Mar 3, 2024
I don't think LLVM implements expose semantics in any sense, so I suspect doing the expose dance here wouldn't change LLVM's view of the code. (Happy to be corrected on this)
joboet commentedon Mar 3, 2024
LLVM's pointer aliasing rules mention that the pointer resulting from
inttoptr
is based on "all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value". IIUC, that means that because the traversing thread reads the current pointer fromstate
and uses it to load the integer that is then passed tointtoptr
, the resulting pointer will also be based on the current pointer, making the code sound.RalfJung commentedon Mar 3, 2024
That is assuming that we don't have optimizations that make full use of the provenance model. Or have you inspected the generated assembly to be sure?
This looks extremely related to rust-lang/unsafe-code-guidelines#480 to me. I wonder if the CAS semantics I proposed here would help.
joboet commentedon Mar 3, 2024
I inspected the assembly of a simplified version (I removed the
Thread
handle andpark
operations). Also, theOnce
implementation has existed for ~9 years, with no issue encountered.saethlin commentedon Mar 3, 2024
The age of an implementation is not an indicator of its soundness with respect to provenance rules. If anything, perhaps the correlation runs the other direction.
I'm willing to review codegen tests for suspicious things like this, if someone can find a way to write them that's both effective and doesn't make them break often. I don't think we usually check in codegen tests for structures as complex as this.
26 remaining items