Replace FixedBitSet with SortedVecSet in Access #18955

Open · wants to merge 2 commits into main

Conversation

Contributor

@Trashtalk217 Trashtalk217 commented Apr 26, 2025

Third time's the charm: adapted from #16784

Objective

The access bitsets are currently efficient because ComponentIds are dense: they are allocated incrementally from 0 and should remain small. To handle large numbers of components more efficiently, we should switch to a solution that scales with the number of components a query matches against, rather than the total number of components in the world. This is also a more intuitive performance profile.

Solution

We replace FixedBitSets with sorted vectors. The vectors should remain relatively small since queries usually don't involve hundreds of components.
These allow us to do union, difference, and intersection operations in O(n), where n is the number of components the query matches against, instead of O(N), where N is the total number of components in the world.
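To illustrate the idea (this is a minimal sketch, not the PR's actual `SortedVecSet` code; the function name is illustrative): set operations on two sorted vectors can be done with a single linear merge-style pass.

```rust
use std::cmp::Ordering;

// Intersection of two sorted, deduplicated slices in O(n + m):
// advance whichever cursor points at the smaller element, and
// emit an element only when both cursors agree.
fn intersection(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

The cost depends only on the lengths of the two vectors, i.e. on how many components the accesses actually mention, never on the highest ComponentId in the world.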

Benchmarking

Using `cargo run --profile stress-test --example many_components <entities> <components> <systems>`, I got the following results:

| entities | components | systems | a-c ids | pr | main | change |
|---------:|-----------:|--------:|--------:|---:|-----:|-------:|
| 50000 | 100 | 100 | 240029 | 64.1 (4.4%) | 148.0 (12.0%) | -56.7% |
| 50000 | 100 | 1600 | 240017 | 0.81 (0.4%) | 8.99 (8.5%) | -91.0% |
| 50000 | 2000 | 100 | 246550 | 3834.2 (5.5%) | 2427.2 (11.1%) | 58.0% |
| 50000 | 2000 | 1600 | 246733 | 29.8 (3.5%) | 141.0 (9.2%) | -78.9% |
| 1000000 | 100 | 100 | 4576063 | 0.22 (0.6%) | 8.31 (2.7%) | -97.3% |
| 500000 | 100 | 400 | 2313963 | 0.034 (0.5%) | 2.92 (6.0%) | -98.8% |
| 500000 | 2000 | 100 | 2445538 | 282.6 (8.8%) | 187.3 (6.0%) | 50.9% |
| 500000 | 2000 | 400 | 2445738 | 6.46 (5.9%) | 47.5 (7.8%) | -86.2% |

The numbers for pr and main are average FPS. The percentages next to them are the standard deviations expressed as a percentage of the average. The change column is (pr - main) / main, expressed as a percentage; a negative percentage means a loss of performance compared to main.

Interpretation

The numbers are dramatic enough to make me doubt some of the tests I've done. Assuming the tests are good enough, however, we see a large difference between main and this PR. Before testing, I predicted that this implementation would perform better with a larger number of components. I was correct here, but it takes a very large number of components before this effect is noticeable. Meanwhile, the performance loss at a low number of components is horrific, and it also scales poorly with added systems.

It should be noted that many_components is specifically designed to stress test this part of the code, and it will have less of an impact on normal applications.

Contributor

@therealbnut therealbnut left a comment


It’d be good to write a benchmark to verify that it does improve performance. A bitset should be able to stream set operations for 64 indices at a time, which is probably slower in many cases, but is still pretty fast. We may want some future work with a hybrid or new structure that’s better than both, but we’ll want benchmarks to know for sure.
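The "stream set operations for 64 indices at a time" point can be illustrated with a minimal sketch (illustrative only, not FixedBitSet's actual implementation):

```rust
// Union of two bitsets stored as slices of u64 words. Each `|` combines
// 64 set memberships in one instruction, so the cost is proportional to
// (highest index / 64), regardless of how many bits are actually set.
// Assumes both slices have the same length for simplicity.
fn union_words(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b.iter()).map(|(x, y)| x | y).collect()
}
```

That per-word streaming is why a bitset can stay competitive even when it nominally does O(N) work over all ids.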

@Trashtalk217
Contributor Author

I've done the benchmarking and I'm open to suggestions for improvements. I know that Flecs has reserved components (or entities, since components and entities share the same id space), and we could potentially go for a hybrid approach as @therealbnut suggested where, for example, the first 256 component ids are reserved using bit sets and the rest are stored via sorted vecs.

I'm sceptical about doing that in this PR, however, as it would require reserved components first (and maybe components as entities).

Although now I'm thinking about it, it might be possible.

@Trashtalk217 Trashtalk217 added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Apr 27, 2025
@therealbnut
Contributor

therealbnut commented Apr 28, 2025

Sorry for the confusion, I actually meant for you to add benchmark code, for reproducibility. It turns out the original PR #14385 also has a lot of similar benchmarking suggestions from Cart.

@therealbnut
Contributor

The performance testing you have run is interesting, though: it seems to indicate that performance is much worse until you're dealing with large numbers, which I guess confirms the original hypothesis that high bits or very sparse sets cause issues with the current code. Although the current approach in this PR also causes issues at lower numbers.

I don't know if this is the best approach, but for the hybrid I was imagining something more conceptually like Map<HighBits, Set<LowBits>>, where Map is something sorted to allow for cheap intersections (BTreeMap, SortedVec<(Key, Value)>, etc.), and Set is something with packed bits (BitSet, etc.). That way for several thousand contiguous indices near zero Map will only have a single entry and will just have a small indirection cost on top of what we have on main.

This is a crate with the sort of thing I was thinking, I haven't used it, but it may help explain: https://crates.io/crates/hibitset
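A minimal sketch of the Map<HighBits, Set<LowBits>> idea described above, using a BTreeMap keyed on the high bits and a u64 word for the low 6 bits (names and representation are illustrative, not a proposed API):

```rust
use std::collections::BTreeMap;

// Hybrid set: indices are split into (index >> 6, index & 63).
// For a few thousand contiguous ids near zero, the map stays tiny
// (ids 0..64 share one entry), approximating a plain bitset with
// only a small indirection cost; sparse high ids add one entry each.
struct HybridSet(BTreeMap<u32, u64>);

impl HybridSet {
    fn new() -> Self {
        HybridSet(BTreeMap::new())
    }

    fn insert(&mut self, index: u32) {
        *self.0.entry(index >> 6).or_insert(0) |= 1 << (index & 63);
    }

    fn contains(&self, index: u32) -> bool {
        self.0
            .get(&(index >> 6))
            .map_or(false, |word| word & (1 << (index & 63)) != 0)
    }
}
```

Intersections would walk both sorted maps in one pass and AND the words of matching keys, which keeps the cheap-intersection property of the sorted representations.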

@bushrat011899
Contributor

It'd be interesting to see what the break-even point is for this approach. It might make sense to investigate a hybrid technique, where the first 512 or so ComponentIds are stored in the existing bitset, and anything higher than that gets stored in the vec-set. That 512 is just a guess at the break-even point.

Conceptually though, I agree that Bevy should more gracefully handle cases with high component counts. It's not hard to encounter these performance issues, especially with dynamic components.

@chescock
Contributor

I don't know if this is the best approach, but for the hybrid I was imagining something more conceptually like Map<HighBits, Set<LowBits>>, where Map is something sorted to allow for cheap intersections (BTreeMap, SortedVec<(Key, Value)>, etc.), and Set is something with packed bits (BitSet, etc.). That way for several thousand contiguous indices near zero Map will only have a single entry and will just have a small indirection cost on top of what we have on main.

I wonder whether we could make that work using a simple usize bitmask for Set<LowBits>, and the same SortedVecSet as this PR for the Map. usize would only pack 32 or 64 indexes at a time, but it would limit the worst case increase in size to 2x, both compared to FixedBitSet and to SortedVecSet!
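A sketch of what that variant's intersection could look like, with the set stored as a sorted vec of (high_bits, word) pairs (illustrative names, not the PR's API):

```rust
use std::cmp::Ordering;

// Intersect two sorted vecs of (high_bits, u64 word) entries in one pass.
// Matching keys AND their words; an entry is kept only if any bit survives.
// Worst case this stores one pair per set bit, a 2x size overhead compared
// to a bare sorted vec of indices.
fn intersect_words(a: &[(u32, u64)], b: &[(u32, u64)]) -> Vec<(u32, u64)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let word = a[i].1 & b[j].1;
                if word != 0 {
                    out.push((a[i].0, word));
                }
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```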

@@ -111,7 +74,7 @@ impl<T: SparseSetIndex> Clone for Access<T> {
self.archetypal.clone_from(&source.archetypal);
}
}

/*
Contributor


Did you mean to delete these lines instead?

resource_read_and_writes: FixedBitSet::new(),
resource_writes: FixedBitSet::new(),
archetypal: FixedBitSet::new(),
component_read_and_writes: SortedVecSet::<ACCESS_SMALL_VEC_SIZE>::new(),
Contributor


I think you can leave off the ::<ACCESS_SMALL_VEC_SIZE> in most places and let type inference figure it out.

Suggested change

```diff
-component_read_and_writes: SortedVecSet::<ACCESS_SMALL_VEC_SIZE>::new(),
+component_read_and_writes: SortedVecSet::new(),
```

}
}

impl<T: SparseSetIndex> From<Vec<T>> for AccessConflicts {
fn from(value: Vec<T>) -> Self {
Self::Individual(value.iter().map(T::sparse_set_index).collect())
let mut conflicts = SortedVecSet::new();
Contributor


Would it make sense to impl FromIterator for SortedVecSet so we can keep using collect here?

This is also O(N^2), since the insert is O(N). You could resolve that by collecting into a Vec and calling SortedVecSet::from_vec.
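A sketch of the suggested FromIterator approach (a stand-in `SortedVecSet` type, not the PR's actual one): collect into a Vec, then sort and dedup once, which is O(N log N) instead of O(N^2) repeated sorted insertion.

```rust
// Stand-in for the PR's SortedVecSet (const-generic parameter omitted
// for brevity): a sorted, deduplicated Vec.
struct SortedVecSet(Vec<usize>);

impl FromIterator<usize> for SortedVecSet {
    fn from_iter<I: IntoIterator<Item = usize>>(iter: I) -> Self {
        // Collect unsorted, then establish the invariant in one pass:
        // sort_unstable is O(N log N); dedup removes adjacent duplicates.
        let mut v: Vec<usize> = iter.into_iter().collect();
        v.sort_unstable();
        v.dedup();
        SortedVecSet(v)
    }
}
```

With such an impl, the `value.iter().map(T::sparse_set_index).collect()` call site could stay exactly as it is on main.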

}
}

impl<const N: usize> Default for SortedVecSet<N> {
Contributor


Could this be a #[derive(Default)]?

match self.0[i].cmp(&other.0[j]) {
Ordering::Less => i += 1,
Ordering::Greater => {
self.0.insert(i, other.0[j]);
Contributor


I noted this on the previous version, but this is an O(N^2) algorithm. I sketched out an O(N) algorithm in that comment that might work.
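The O(N^2) comes from calling `Vec::insert` inside the merge loop: each insert shifts the tail of the vector, which is itself O(N). An O(N) union instead merges into a fresh output vector (a sketch of the general technique, not the exact algorithm from the earlier review comment):

```rust
use std::cmp::Ordering;

// Union of two sorted, deduplicated slices in O(n + m): every element is
// pushed onto the end of `out` exactly once, so no mid-vector shifting.
fn union_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

The result can then replace `self.0` wholesale, trading a temporary allocation for the quadratic shifting.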

@therealbnut
Contributor

therealbnut commented Apr 29, 2025

@Trashtalk217 @chescock a proof of concept of the SortedVecMap<usize> collection type, if you're interested in trying it: rust playground.

@alice-i-cecile
Member

For the record, I'm broadly on board with this but don't have the bandwidth to properly review. Ping me / label this PR when y'all are convinced that this is ready.
