Skip to content

boot: bootutil: swap-move: Allow trailer area not be a multiple of the sector size #2247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

taltenbach
Copy link
Contributor

@taltenbach taltenbach commented Mar 30, 2025

When using swap-move, the area allocated to the trailer has currently to be a multiple of the sector size. This is not a big deal on devices having small sectors, but on devices having a large sector size, such most STM32 MCUs, this typically means losing a significant amount of flash memory. For example, on an STM32F4 device with 128-KiB sectors, the trailer is usually only a few hundreds of bytes long but a whole 128 KiB has still to be reserved for storing the trailer and can't therefore be used for firmware data.

The idea of this PR is to relax the need of having a sector-aligned trailer area, enabling therefore to use all the bytes not needed by the trailer, in the first sector holding trailer data, to store firmware data.

To illustrate the solution, let's imagine a device having 4096-byte sectors and a primary slot composed of N sectors. Let's assume the trailer needs 4224 bytes. Before this MR, two entire sectors would have to be allocated to the trailer, so the layout would have been something like:

         PRIMARY                               SECONDARY
|          ...          |              |          ...          |
+-----------------------+              +-----------------------+
| Firmware (4096 bytes) | Sector N-3   | Firmware (4096 bytes) | Sector N-3
|           .           |              |           .           |
+-----------------------+              +-----------------------+
|  Padding (4096 bytes) | Sector N-2   |  Trailer (4096 bytes) | Sector N-2
|           .           |              |           .           |
+-----------------------+              +-----------------------+
|  Trailer (4096 bytes) | Sector N-1   |  Trailer (4096 bytes) | Sector N-1
|                       |              |           .           |
+-----------------------+              +-----------------------+
|  Trailer (4096 bytes) | Sector N
|           .           |
+-----------------------+

With this MR, only the exact amount of bytes actually needed by the trailer have to be allocated to the trailer, so we have 3968 additional bytes available to store the firmware image:

         PRIMARY                               SECONDARY
|          ...          |              |          ...          |
+-----------------------+              +-----------------------+
| Firmware (4096 bytes) | Sector N-3   | Firmware (4096 bytes) | Sector N-3
|           .           |              |           .           |
+-----------------------+              +-----------------------+
| Firmware (3968 bytes) | Sector N-2   | Firmware (3968 bytes) | Sector N-2
|  Padding (128 bytes)  |              |  Trailer (128 bytes)  |
+-----------------------+              +-----------------------+
|  Padding (3968 bytes) | Sector N-1   |  Trailer (4096 bytes) | Sector N-1
|  Trailer (128 bytes)  |              |           .           |
+-----------------------+              +-----------------------+
|  Trailer (4096 bytes) | Sector N
|           .           |
+-----------------------+

At the beginning of an upgrade or revert process, all the sectors in the primary slot are "moved up", ending up with:

         PRIMARY                               SECONDARY
|          ...          |              |          ...          |
+-----------------------+              +-----------------------+
| Firmware (4096 bytes) | Sector N-3   | Firmware (4096 bytes) | Sector N-3
|           .           |              |           .           |
+-----------------------+              +-----------------------+
| Firmware (4096 bytes) | Sector N-2   | Firmware (3968 bytes) | Sector N-2
|           .           |              |  Trailer (128 bytes)  |
+-----------------------+              +-----------------------+
| Firmware (3968 bytes) | Sector N-1   |  Trailer (4096 bytes) | Sector N-1
|  Trailer (128 bytes)  |              |           .           |
+-----------------------+              +-----------------------+
|  Trailer (4096 bytes) | Sector N
|           .           |
+-----------------------+

To main ideas of the solution implemented by this PR is:

  • To copy only the part used by the firmware image when moving up the last sector (sector N-2 in the example), to avoid overwriting the trailer data.
  • To erase the secondary slot's trailer when swapping the last sector holding firmware data in the secondary slot, to ensure no firmware data will be lost when erasing the secondary trailer. Previously the erasure was performed at the beginning of the move process, just after the erasure of the primary trailer, by simplicity.
  • To avoid the need of rewriting the secondary slot's trailer when starting a revert process.

The latter is the most difficult part. Indeed, at the beginning of the revert process, the secondary trailer was rewritten to make the revert look like a permanent upgrade in case an unfortunate reset occurs during the rewriting of the primary trailer. This was a clever trick, but is unfortunately no longer possible if the trailers are not stored in dedicated sectors. The solution proposed by this PR is to write a "fallback trailer" composed only of a magic number in the primary slot, during the upgrade process, at the end of the sector that is just before the first sector containing part of the slot's trailer. This area is available at the end of the upgrade process since it is part of the padding area, used to "move the sectors up". The maximum image size is adjusted, by adding at most 8 bytes of padding, to ensure the area is always large enough to write the magic number. In case a reset occurs while rewriting the primary trailer during a revert process, the presence of the fallback trailer can be used to determine a revert was in progress.

If the improvement proposed by this MR is merged, I will try to port those changes to the swap-offset strategy, which should be quite easy, at least in theory.

Resolves #2052

@nordicjm
Copy link
Collaborator

The problem with this is that it introduces the inability to clear flags e.g. if a firmware update is marked for confirmation or test, once a flag is set it cannot be cleared without potentially bricking the device if power is lost during erase/re-write of that sector

@taltenbach
Copy link
Contributor Author

taltenbach commented Mar 31, 2025

The problem with this is that it introduces the inability to clear flags e.g. if a firmware update is marked for confirmation or test, once a flag is set it cannot be cleared without potentially bricking the device if power is lost during erase/re-write of that sector

@nordicjm Have you a specific scenario in mind where there would be an issue?

  • Before an upgrade the user can erase the secondary trailer before downloading an upgrade and then write it to make the upgrade pending without risking bricking the device.

  • During an upgrade or revert process, the primary and secondary trailers are rewritten but the logic makes sure the process is resumed gracefully on boot in case of a power loss, and that without risking losing part of the firmware image.

  • After an upgrade, the user might have to set the image-ok flag in the primary trailer (which is possible without erasing the trailer) but never has to clear a flag in that trailer, unless I'm missing something.

Also note that the swap-scratch strategy already supports having firmware and trailer data in the same sector and this doesn't seem to cause any issue, so I don't think this should be a problem for swap-move.

@nordicjm
Copy link
Collaborator

nordicjm commented Apr 1, 2025

User uploads image, they mark it for test/swap (which writes the flag to the primary image, not the secondary), then they delete the secondary image without rebooting, the update image is gone, but the flag is permanently set to test/confirm

@taltenbach
Copy link
Contributor Author

taltenbach commented Apr 1, 2025

User uploads image, they mark it for test/swap (which writes the flag to the primary image, not the secondary), then they delete the secondary image without rebooting, the update image is gone, but the flag is permanently set to test/confirm

@nordicjm Perhaps you're talking about a specific MCUboot feature I'm not aware of, but the usual flow is:

  1. Erasing the secondary slot.
  2. Writing the update image to the secondary slot.
  3. Marking the update image as pending, which writes the secondary slot's trailer. To do a permanent upgrade the image-ok flag is set in the secondary slot, otherwise only the magic is written. Nothing is written to the primary slot, please look at the implementation of boot_set_pending_multi, in particular:
    rc = flash_area_open(FLASH_AREA_IMAGE_SECONDARY(image_index), &fap);
    if (rc != 0) {
    return BOOT_EFLASH;
    }
    rc = boot_set_next(fap, false, !(permanent == 0));
  4. At this point, if the user wants to delete the update image before rebooting for some reason, they can do that without any problem and no update will be triggered. There is also no risk of bricking the device, even in the case of a power loss, because erasing only one bit of the image of course prevents the upgrade from occurring because of the hash/signature check, which is performed before attempting an upgrade.

This routine will now also be useful for the swap-move strategy. The
trailer size is removed from the list of parameters as it can be easily
obtained using the 'state' parameter and it will be more convenient, for
the swap-move strategy, not to have to compute it manually before
calling 'boot_get_first_trailer_sector'.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
The swap-move strategy is currently assuming the trailer is stored in
dedicated sector, meaning the size of the area allocated to the trailer
must be a multiple of the sector size. This commit relaxes this
assumption when moving sectors up in the primary slot by allowing the
last sector containing firmware data to also hold part of the trailer
(or the whole trailer if it is small enough).

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
When using the swap-move strategy, the secondary trailer was erased just
after the primary trailer, when moving the sectors up in the primary
slot. This was working because it was assumed the trailer is stored in
dedicated sectors, so no sector could contain both trailer and firmware
data. However, if this assumption is relaxed, it means it is no more
possible to erase the secondary trailer before the last sector holding
firmware data has been copied to the primary slot, since that sector
might also contain trailer data. So the erasure of the secondary trailer
has to occur at the time the last sector containing firmware data is
swapped.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
When using the swap-move strategy, at the very beginning of the revert
process, the secondary trailer was rewritten to make the revert look
like a permanent upgrade in case an unfortunate reset occurs when
rewriting the primary trailer. This was possible because the assumption
was that no sector contained both part of the firmware and part of the
trailer.

To relax this assumption, it is necessary to avoid having to rewrite the
secondary trailer at the start of the revert process, since that could
also erase firmware data. The solution chosen is to write, during the
upgrade process, a "fallback trailer" at the end of the last sector that
is not containing part of the slot's trailer. The fallback trailer only
contains the magic field. Depending on the size of the trailer, the
maximum firmware image size has to be reduced a bit (by at most 8 bytes)
to ensure the fallback trailer can be written. During a revert, if a
reboot occurs while the primary trailer is rewritten, the presence of
the fallback trailer will make possible to determine that a revert was
in progress and has to be resumed.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
This check is not useful because it is already done during the
validation of the image and was also making sure there was no firmware
data in the first trailer sector, which is no more needed.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
This commit updates the computation of the maximum firmware image size
for the swap-move strategy to allow the first sector containing part of
the trailer to also hold firmware data. This means it is no more
necessary to allocate a sector-aligned area for the trailer when using
swap-move, the area to allocate is now simply equal to the maximum
trailer size. The space required for writing the fallback trailer is
also taken into account.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
For swap-move and swap-offset strategies, the size of the images used
for running the tests was hardcoded and was smaller than the maximum
possible size. This was done this way because the logic used to compute
the maximum image size was not taking into account the padding that is
needed when using those strategies. This commit fixes the issue, making
now possible to run the tests with images having the maximum possible
size.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
When using the swap-move strategy, the simulator now doesn't compute
anymore the maximum image size so the area allocated to the trailer has
a size that is a multiple of the sector size. Indeed, the swap-move
strategy now supports having part of the firmware in the first sector
holding trailer data, allowing larger images.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
When using the swap-move strategy, the area allocated to the trailer no
more need to have a size that is a multiple of the sector size, since
the area not allocated to the trailer in the first sector holding
trailer data can now be used to store firmware data.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
The swap-move strategy now support having firmware data in the first
sector holding trailer data. This commit adds a release note snippet
regarding this change.

Signed-off-by: Thomas Altenbach <thomas.altenbach@legrand.com>
@taltenbach taltenbach force-pushed the feature/swap-move-unaligned-trailer branch from 5b4c3e6 to 2bc56dc Compare April 20, 2025 16:35
@taltenbach
Copy link
Contributor Author

taltenbach commented Apr 20, 2025

The solution that was initially implemented to avoid the need of rewriting the secondary slot's trailer when starting a revert process was simple but had the drawback of requiring the trailer in the secondary slot to be valid after the completion of an upgrade. This meant a few changes in the swap tables were required to handle this swap-move-specific state, but also that it was no longer possible for a user to set again the secondary image as pending after an upgrade or revert process, since the secondary trailer wasn't erased anymore and cannot be rewritten without risking to lose part of the firmware. I'm not sure this is a very typical use case, but that might still be useful to some users, e.g. to perform a rollback after an update has been confirmed.

So, I wasn't fully satisfied with that solution. I updated this MR with another solution, which is bit more complex but doesn't require anymore the secondary trailer to be valid after an upgrade process. Instead, a "fallback trailer" is written in the primary slot, in a sector that doesn't contain any bit of the primary slot's trailer to ensure it is not erased when the primary trailer is erased. This fallback trailer is used to determine a revert process was in progress in case a reboot occurs while rewriting the primary trailer. I updated the PR description accordingly and the details can be found in the commit messages.

Copy link
Member

@d3zd3z d3zd3z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what is going on here, and really hope we can get this to a point where this will allow swap-move and swap-offset to be used on these 3 large sector devices.

However, I have removed the constraint in the simulator that disables the simulator tests on the first STM device, and even a basic upgrade fails. I haven't investigated further.

So, first thing to emphasize: work on mcuboot needs to always be made to work in the simulator. I would highly recommend that it be done first in the simulator, and only after that to test it on targets. The simulator will run through hundreds of thousands of scenarios that even if you had time to run them on a real device, would wear the flash out before a single run even finished.

I'll post a simple patch to test.

Also note that the mbedTLS update needs to be brought in for the simulator tests to pass on some compiler toolchains. I think this might be a clang warning issue, as it doesn't seem to affect CI.

@d3zd3z
Copy link
Member

d3zd3z commented Apr 22, 2025

This will enable the sim to run these tests on the stm32f4 device:

diff --git a/sim/src/image.rs b/sim/src/image.rs
index 09214c50..ed72d2e5 100644
--- a/sim/src/image.rs
+++ b/sim/src/image.rs
@@ -451,7 +451,7 @@ impl ImagesBuilder {
 
                 let mut flash = SimMultiFlash::new();
                 flash.insert(dev_id, dev);
-                (flash, Rc::new(areadesc), &[Caps::SwapUsingMove, Caps::SwapUsingOffset])
+                (flash, Rc::new(areadesc), &[])
             }
             DeviceName::Stm32f4SpiFlash => {
                 // STM style internal flash and external SPI flash.

@d3zd3z
Copy link
Member

d3zd3z commented Apr 22, 2025

To run this on the simulator, cd sim, and then run something like this:

cargo test \
    --features "swap-move" \
    -- --nocapture --test-threads 1

The features sets what will be tested, in this case just swap move, "sig-ecdsa-mbedtls swap-move" for example will test with ecdsa signatures, and swap move.

You can also add arguments at the end to restrict the particular test you want to run.

@taltenbach
Copy link
Contributor Author

taltenbach commented Apr 22, 2025

@d3zd3z Thanks for your feedback! I'm doing nearly all my testing with the simulator, which is a lot easier and faster than setting up a real device, in addition to be more comprehensive as you said.

Your patch isn't sufficient to run swap-move with the STM32F4 device defined in the simulator, as the slots contain sectors with different sizes (which is of course not supported by swap-move, even with my changes). Also, when using swap-move or swap-offset, the simulator currently assumes all of the sectors of the flash memory have the same size, e.g. here:

mcuboot/sim/src/image.rs

Lines 1810 to 1812 in 6cbea0a

} else if Caps::SwapUsingOffset.present() || Caps::SwapUsingMove.present() {
let sector_size = dev.sector_iter().next().unwrap().size as u32;
align_up(c::boot_trailer_sz(dev.align() as u32), sector_size) as usize

So to be able to run swap-move on the STM32F4 device, you need:

diff --git a/sim/src/image.rs b/sim/src/image.rs
index 09214c50..b00c7e96 100644
--- a/sim/src/image.rs
+++ b/sim/src/image.rs
@@ -437,9 +437,9 @@ impl ImagesBuilder {
                // The flash layout as described is not present in any real STM32F4 device, but it
                // serves to exercise support for sectors of varying sizes inside a single slot,
                // as long as they are compatible in both slots and all fit in the scratch.
-                let dev = SimFlash::new(vec![16 * 1024, 16 * 1024, 16 * 1024, 16 * 1024, 64 * 1024,
-                                        32 * 1024, 32 * 1024, 64 * 1024,
-                                        32 * 1024, 32 * 1024, 64 * 1024,
+                let dev = SimFlash::new(vec![64 * 1024, 64 * 1024,
+                                        64 * 1024, 64 * 1024,
+                                        64 * 1024, 64 * 1024,
                                        128 * 1024],
                                        align as usize, erased_val);
                let dev_id = 0;
@@ -451,7 +451,7 @@ impl ImagesBuilder {

                let mut flash = SimMultiFlash::new();
                flash.insert(dev_id, dev);
-                (flash, Rc::new(areadesc), &[Caps::SwapUsingMove, Caps::SwapUsingOffset])
+                (flash, Rc::new(areadesc), &[])
            }
            DeviceName::Stm32f4SpiFlash => {
                // STM style internal flash and external SPI flash.

With those changes, nearly all the tests are OK and in particular the perm_with_fails and revert_with_fails tests. However, I get an error in status_write_fails_with_reset when validate-primary-slot is selected:

core-0fe4d46a1f7a85be: ../../boot/bootutil/src/loader.c:1820: boot_swap_image: Assertion `rc == 0' failed.
error: test failed, to rerun pass `-p bootsim --test core`

Not sure exactly what's going on here, I don't know the purpose of status_write_fails_with_reset. I will investigate in the following days :)

@taltenbach
Copy link
Contributor Author

taltenbach commented Apr 22, 2025

So I've be able to look a bit further. If my understanding is correct, status_write_fails_with_reset simulates an upgrade with errors occurring while writing to the status area, meaning if the upgrade process is interrupted, the status read by MCUboot at next boot won't be valid. When run on the simulated STM32F4 device, with two 64-KiB sectors in each slot, the initial state is:

         PRIMARY                               SECONDARY
|          ...          |              |          ...          |
+-----------------------+              +-----------------------+
|   Image N (~63 KiB)   |   Sector 0   |  Image N+1 (~63 KiB)  |
|    Padding (~1 KiB)   |              |    Padding (~1 KiB)   |
+-----------------------+              +-----------------------+
|    Padding (~63 KiB)  |   Sector 1   |    Padding (~63 KiB)  |
|     Erased (~1 KiB)   |              |    Trailer (~1 KiB)   |
+-----------------------+              +-----------------------+

The status_write_fails_with_reset runs the upgrade and interrupts it arbitrarily in the middle of the process. At this point, the slots are being swapped and the sector 0 of the primary slot has just been erased and has been partly rewritten with the sector 0 of the secondary slot:

         PRIMARY                               SECONDARY
|          ...          |              |          ...          |
+-----------------------+              +-----------------------+
|  Image N+1 (~30 KiB)  |   Sector 0   |  Image N+1 (~63 KiB)  |
|   Erased (~34 KiB)    |              |    Padding (~1 KiB)   |
+-----------------------+              +-----------------------+
|    Image N (~63 KiB)  |   Sector 1   |   Padding (~63 KiB)   |
|    Trailer (~1 KiB)   |              |    Trailer (~1 KiB)   |
+-----------------------+              +-----------------------+

At next boot, MCUboot tries to resume the upgrade but since the boot status has not been properly written, MCUboot thinks the upgrade hasn't started yet so tries to compute the size of the primary image:

hdr = boot_img_hdr(state, BOOT_PRIMARY_SLOT);
if (hdr->ih_magic == IMAGE_MAGIC) {
rc = boot_read_image_size(state, BOOT_PRIMARY_SLOT, &copy_size);
assert(rc == 0);
}

Of course, this fails because the TLV area of the (new) primary image hasn't been written yet, so boot_read_image_size returns an error and we end up in the assert at the next line. This assert isn't caught by the simulator (the simulator only catches ASSERT, not assert), so the simulator crashes.

From my point of view, this is not an MCUboot issue, but more a simulator-related issue. Indeed, from what I understand, when MCUBOOT_VALIDATE_PRIMARY_SLOT is enabled (and only in that case), the test case tries to resume the upgrade, as described above, and then expect at most one assertion to be thrown:

mcuboot/sim/src/image.rs

Lines 1212 to 1220 in 6cbea0a

// This might throw no asserts, for large sector devices, where
// a single failure writing is indistinguishable from no failure,
// or throw a single assert for small sector devices that fail
// multiple times...
if asserts > 1 {
warn!("Expected single assert validating the primary slot, \
more detected {}", asserts);
fails += 1;
}

So it is expected that either:

  • The swap completes without any error. This happens if the last status word is properly written.
  • The swap completes but the validation of the primary slot fails. This can happen if the last status hasn't been properly written. In that case, the test expects an assert to be triggered. It seems at the time the test was written an ASSERT(0); was performed when the primary slot validation fails. However, this is no more the case.

But in fact, there is another valid case that isn't handled by this test case: if no status word has been properly written, MCUboot thinks the upgrade hasn't started. In that case, there can be zero, one or multiple asserts depending on the current state of the slots (the primary and secondary headers might be valid or not, the size of the TLV area might be found or not, the crypto keys might be found in the TLV area or not, ...).

@d3zd3z Is my understanding correct? If so, does it really make sense not to expect any assert, except one coming from the primary slot validation, when trying to resume the upgrade in status_write_fails_with_reset with a partially valid status? As a user, the only thing I would expect from MCUboot in that situation is not to boot the corrupted image, which can happen either because the bootloader crashes in an assert or because the primary slot validation fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Swap move should only need one extra sector
3 participants