-
Notifications
You must be signed in to change notification settings - Fork 651
[FIX] CNIEnv concurrency issue #4202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] CNIEnv concurrency issue #4202
Conversation
b5744e6
to
87fa976
Compare
Damn it. |
87fa976
to
a061618
Compare
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
a061618
to
38f07ca
Compare
@AkihiroSuda I am not exactly in love with any of this... at least it should fix #3556 which has been increasingly a PITA. Touching CNIEnv gives me the creeps, as a mistake there would be very consequential for users, but I still feel we have to act on this. Suggesting we merge this after 2.1. |
err = os.MkdirAll(cniNetconfPath, 0o700) | ||
if err != nil { | ||
return err | ||
} | ||
lock, err := lockutil.Lock(filepath.Join(cniNetconfPath, ".nerdctl.lock")) | ||
lock, err := lockutil.Lock(filepath.Join(cniNetconfPath, ".cni-concurrency.lock")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.nerdctl.lock
is used to lock cni conf file manipulation. This lock here is a global locking mechanism meant to prevent cni plugins from acting concurrently.
This lock is enforced for almost the entirety of the ocihook flow.
It was previously using the same lock, as we were not locking the file operations in cnienv that are used in oci-hooks.
Since we do now, it can no longer use the same lock.
The note around line 100 explains some of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
This is (one of) the remaining issues in #3556
TL;DR: like in other places, we have been assuming that a filesystem is ACID and/or safe to use concurrently.
This has been solved generally by introducing the
Store
abstraction, providing a (somewhat (*)) ACID storage solution, and also ad-hoc fixed a bunch of CNIEnv methods.CNIEnv should be rewritten generally, as it is baroque at this point, but this is a lot more work.
For now, this PR just isolates all methods touching the filesystem in an ad-hoc store file - hopefully to a similar result.
This should address some of the issues we have been seeing (eg: missing .conflist file).
(*) atomic writes are not guaranteed in case of operating system crash, and data is not guaranteed to be saved as we do not sync (yet).