Replace FixedBitSet with SortedVecSet in Access #18955
Conversation
It’d be good to write a benchmark to verify that it does improve performance. A bitset should be able to stream set operations for 64 indices at a time, which is probably slower in many cases, but is still pretty fast. We may want some future work with a hybrid or new structure that’s better than both, but we’ll want benchmarks to know for sure.
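For context on that claim, here is a minimal sketch (illustrative only, not Bevy's or `fixedbitset`'s actual code) of why a bitset can stream set operations, ORing one 64-bit word, i.e. 64 indices, per iteration:

```rust
/// Word-at-a-time union of two bit sets stored as `u64` words.
/// Each loop iteration merges 64 indices at once.
fn union_in_place(dst: &mut Vec<u64>, src: &[u64]) {
    // Grow `dst` so every word of `src` has a counterpart.
    if dst.len() < src.len() {
        dst.resize(src.len(), 0);
    }
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= s; // 64 bits merged per OR
    }
}

fn main() {
    let mut a = vec![0b1010_u64]; // indices {1, 3}
    union_in_place(&mut a, &[0b0110_u64]); // indices {1, 2}
    assert_eq!(a[0], 0b1110); // indices {1, 2, 3}
}
```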
I've done the benchmarking and I'm open to suggestions for improvements. I know that Flecs has reserved components (or entities, since components and entities share the same id space), and we could potentially go for a hybrid approach as @therealbnut suggested, where, as an example, the first 256 component ids are reserved using bit sets and the others are stored in sorted vecs. I'm sceptical about doing that in this PR, however, as it would require reserved components first (and maybe components as entities). Although now that I'm thinking about it, it might be possible.
Sorry for the confusion: I actually meant for you to add benchmark code, for reproducibility. It turns out the original PR #14385 also has a lot of similar benchmarking suggestions from Cart.
The performance testing you have run is interesting, though: it seems to indicate that performance is much worse until you're dealing with large numbers of components, which I guess confirms the original hypothesis that high bits or very sparse sets will cause issues with the current code. The current approach in this PR causes issues for lower numbers, though. I don't know if this is the best approach, but for the hybrid I was imagining something conceptually more like a hierarchical bitset. This crate is the sort of thing I was thinking of; I haven't used it, but it may help explain: https://crates.io/crates/hibitset
It'd be interesting to see what the break-even point is for this approach. It might make sense to investigate a hybrid technique, where the first 512 or so component ids are stored in a bitset and the rest in a sorted vec. Conceptually though, I agree that Bevy should more gracefully handle cases with high component counts. It's not hard to encounter these performance issues, especially with dynamic components.
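To make the hybrid idea concrete, here is a rough sketch; the `HybridSet` name, the 512 cutoff, and the API are all hypothetical, and it assumes the `fixedbitset` crate:

```rust
use fixedbitset::FixedBitSet;

/// Hypothetical hybrid set: dense low indices live in a bitset,
/// sparse high indices in a sorted vec. Not this PR's implementation.
struct HybridSet {
    /// Membership for indices below `DENSE_LIMIT`.
    dense: FixedBitSet,
    /// Indices at or above `DENSE_LIMIT`, kept sorted for binary search.
    sparse: Vec<usize>,
}

/// Assumed cutoff between the dense and sparse halves.
const DENSE_LIMIT: usize = 512;

impl HybridSet {
    fn new() -> Self {
        Self {
            dense: FixedBitSet::with_capacity(DENSE_LIMIT),
            sparse: Vec::new(),
        }
    }

    fn insert(&mut self, index: usize) {
        if index < DENSE_LIMIT {
            self.dense.insert(index);
        } else if let Err(pos) = self.sparse.binary_search(&index) {
            // Not present yet; insert at the position that keeps order.
            self.sparse.insert(pos, index);
        }
    }

    fn contains(&self, index: usize) -> bool {
        if index < DENSE_LIMIT {
            self.dense.contains(index)
        } else {
            self.sparse.binary_search(&index).is_ok()
        }
    }
}
```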
I wonder whether we could make that work using a simple …
```diff
@@ -111,7 +74,7 @@ impl<T: SparseSetIndex> Clone for Access<T> {
         self.archetypal.clone_from(&source.archetypal);
     }
 }
```
```rust
/*
```
Did you mean to delete these lines instead?
```rust
resource_read_and_writes: FixedBitSet::new(),
resource_writes: FixedBitSet::new(),
archetypal: FixedBitSet::new(),
component_read_and_writes: SortedVecSet::<ACCESS_SMALL_VEC_SIZE>::new(),
```
I think you can leave off the `::<ACCESS_SMALL_VEC_SIZE>` in most places and let type inference figure it out.
```diff
- component_read_and_writes: SortedVecSet::<ACCESS_SMALL_VEC_SIZE>::new(),
+ component_read_and_writes: SortedVecSet::new(),
```
```diff
  impl<T: SparseSetIndex> From<Vec<T>> for AccessConflicts {
      fn from(value: Vec<T>) -> Self {
-         Self::Individual(value.iter().map(T::sparse_set_index).collect())
+         let mut conflicts = SortedVecSet::new();
```
Would it make sense to impl `FromIterator` for `SortedVecSet` so we can keep using `collect` here?
This is also O(N^2), since the `insert` is O(N). You could resolve that by collecting into a `Vec` and calling `SortedVecSet::from_vec`.
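A sketch of how the `FromIterator` suggestion could look, using a simplified stand-in for the PR's type (the element type `usize` and the inline capacity of 8 are assumptions):

```rust
use smallvec::SmallVec;

/// Simplified stand-in for the PR's `SortedVecSet`: a sorted,
/// deduplicated small-vector of indices.
struct SortedVecSet(SmallVec<[usize; 8]>);

impl SortedVecSet {
    /// Sort and dedup once: O(n log n) overall, instead of the
    /// O(N^2) of repeated mid-vector inserts.
    fn from_vec(mut v: Vec<usize>) -> Self {
        v.sort_unstable();
        v.dedup();
        Self(SmallVec::from_vec(v))
    }
}

impl FromIterator<usize> for SortedVecSet {
    fn from_iter<I: IntoIterator<Item = usize>>(iter: I) -> Self {
        // Collect first, then sort + dedup, so `collect()` keeps
        // working at call sites like the `From<Vec<T>>` impl above.
        Self::from_vec(iter.into_iter().collect())
    }
}
```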
```rust
impl<const N: usize> Default for SortedVecSet<N> {
```
Could this be a `#[derive(Default)]`?
```rust
match self.0[i].cmp(&other.0[j]) {
    Ordering::Less => i += 1,
    Ordering::Greater => {
        self.0.insert(i, other.0[j]);
```
I noted this on the previous version, but this is an O(N^2) algorithm. I sketched out an O(N) algorithm in that comment that might work.
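For reference, the linear-time version is presumably an ordinary sorted merge into a fresh buffer rather than repeated mid-vector inserts; a sketch (not necessarily the reviewer's exact proposal):

```rust
use std::cmp::Ordering;

/// Union of two sorted, deduplicated slices in O(len_a + len_b).
/// Avoids the O(N^2) behavior of `insert`, which shifts the tail
/// of the vector on every insertion.
fn union_sorted(a: &[usize], b: &[usize]) -> Vec<usize> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => { out.push(a[i]); i += 1; }
            Ordering::Greater => { out.push(b[j]); j += 1; }
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```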
@Trashtalk217 @chescock a proof of concept of the …
For the record, I'm broadly on board with this but don't have the bandwidth to properly review. Ping me / label this PR when y'all are convinced that this is ready.
Third time's the charm: adapted from #16784
Objective
The access bitsets are currently efficient because `ComponentId`s are dense: they are incremented from 0 and should remain small. To handle large numbers of components more efficiently, we should switch to a solution that scales with the number of components a query matches against, instead of the total number of components in the world. This is also a more intuitive performance profile.
Solution
We replace `FixedBitSet`s with sorted vectors. The vectors should remain relatively small, since queries usually don't involve hundreds of components.
These allow us to do union, difference, and intersection operations in O(n), where n is the number of components the query matches against, instead of O(N), where N is the total number of components.
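To illustrate the complexity claim, an intersection over two sorted inputs advances a cursor through each input exactly once, so the cost depends only on the sizes of the two sets (a sketch over plain slices; the PR's type is SmallVec-backed):

```rust
use std::cmp::Ordering;

/// Intersection of two sorted, deduplicated slices in O(n + m);
/// the cost scales with the inputs, not with the total number of
/// components registered in the world.
fn intersect_sorted(a: &[usize], b: &[usize]) -> Vec<usize> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```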
Benchmarking
Using

```
cargo run --profile stress-test --example many_components <entities> <components> <systems>
```

I got the following results. The numbers for pr and main are average FPS; the percentages next to them are the standard deviations expressed as a percentage of the average. The change column is `(pr - main) / main`, expressed as a percentage; a negative percentage means a loss of performance compared to main.
The numbers are quite dramatic (making me doubt some of the tests I've done). Assuming the tests are good enough, however, we see a large difference between main and this PR. Before testing, I predicted that this implementation would perform better with a larger number of components. I was correct, but it takes a very large number of components before the effect is noticeable. Meanwhile, the performance loss at a low number of components is horrific, and it also scales poorly with added systems.
It should be noted that `many_components` is specifically designed to stress test this part of the code, so the change will have less of an impact on normal applications.