RTIO: Add Error-handling for Multishot Items with Blocking Flag #93543

Draft. ubieda wants to merge 4 commits into main.

ubieda
Member

@ubieda ubieda commented Jul 23, 2025

Description

This PR introduces an option to add error-handling capabilities to RTIO multi-shot submissions (e.g., sensor streaming). When an error occurs, the ongoing submission is blocked, giving the client the chance to either unblock it (resume as is) or cancel it and start over with the IODEV configuration and submissions.

This PR includes:

  • Code changes to core RTIO (API + executor) to support the functionality.
  • A test case that locks in the behavior.
  • Default handling in the Sensor API when a blocked submission occurs.
  • An RTIO bug fix that this change depends on, preventing the CQE semaphore from being bypassed (otherwise unrelated to the feature).
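
As a rough illustration of the blocking behavior described above, here is a toy C model. It is not the RTIO API: every `toy_` name and flag value is made up for this sketch, and it only assumes that a failed multi-shot item is parked until the client either unblocks or cancels it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; names and values are illustrative only. */
#define TOY_SQE_MULTISHOT         (1u << 0)
#define TOY_SQE_MULTISHOT_BLOCKED (1u << 1)
#define TOY_SQE_CANCELED          (1u << 2)

struct toy_sqe {
	unsigned int flags;
	int executions; /* how many times the item has actually run */
};

/* Run the item once with the given result. Returns true if it ran. */
static bool toy_execute(struct toy_sqe *sqe, int result)
{
	assert(sqe != NULL);

	if (sqe->flags & (TOY_SQE_MULTISHOT_BLOCKED | TOY_SQE_CANCELED)) {
		return false; /* parked or cancelled: not re-executed */
	}

	sqe->executions++;

	/* A failing multi-shot item is parked instead of being re-run. */
	if (result < 0 && (sqe->flags & TOY_SQE_MULTISHOT)) {
		sqe->flags |= TOY_SQE_MULTISHOT_BLOCKED;
	}

	return true;
}

/* Client decision 1: resume the submission as it is. */
static void toy_unblock(struct toy_sqe *sqe)
{
	assert(sqe != NULL);
	sqe->flags &= ~TOY_SQE_MULTISHOT_BLOCKED;
}

/* Client decision 2: give up on the submission and start over. */
static void toy_cancel(struct toy_sqe *sqe)
{
	assert(sqe != NULL);
	sqe->flags |= TOY_SQE_CANCELED;
}
```

In this model the executor never clears the blocked bit itself; only the client does, which is the point of the proposal.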

Note

Marked as DNM until #93544 lands

ubieda added 4 commits July 22, 2025 19:40
IODEV items running multi-shot submissions are now marked as blocked
when an error occurs, with the objective of enabling the app to handle
these errors, either by unblocking the submission, cancelling it, or
other application-specific actions.

Signed-off-by: Luis Ubieda <[email protected]>
Testcase demonstrates how a multi-shot submission stops re-executing
once it fails, and how the user can unblock it to resume its execution.

Signed-off-by: Luis Ubieda <[email protected]>
Add basic error-handling code to sensor API by cancelling items once a
multi-shot submission is blocked. This prevents starving the RTIO
client with failed submissions. The user then has the option to
resubmit the multi-shot request.

Signed-off-by: Luis Ubieda <[email protected]>
Otherwise, calls to rtio_cqe_consume_block will bypass the semaphore
and be held back in a Z_SPIN_DELAY(1) indefinitely.

Signed-off-by: Luis Ubieda <[email protected]>
@ubieda ubieda added the DNM This PR should not be merged (Do Not Merge) label Jul 23, 2025
@bjarki-andreasen
Contributor

bjarki-andreasen commented Jul 23, 2025

I wonder if it may be simpler for the user to check the result of and manually resubmit the SQE, given I believe an SQE is cancelled if there is an error.

RTIO_SQE_MULTISHOT_BLOCKED effectively dequeues the SQE until the user allows it to be re-enqueued. Is this not essentially the same as the SQE being cancelled on error (dequeued), and the user resubmitting the SQE later?
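
To make the alternative concrete, a minimal toy sketch of "fail, dequeue, let the client resubmit" follows. The `toy_` names are hypothetical and do not correspond to RTIO functions; the sketch only assumes that a failure produces a failed completion and removes the item from the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_stream {
	bool queued;     /* true while the executor keeps re-running the item */
	int last_result; /* result carried by the most recent completion */
	int submissions; /* how many times the client has (re)submitted */
};

/* Client side: enqueue the multi-shot item. */
static void toy_submit(struct toy_stream *s)
{
	assert(s != NULL);
	s->queued = true;
	s->submissions++;
}

/* Executor side: report a result; a failure dequeues the item. */
static void toy_complete(struct toy_stream *s, int result)
{
	assert(s != NULL);

	if (!s->queued) {
		return; /* nothing queued, nothing to complete */
	}

	s->last_result = result;
	if (result < 0) {
		s->queued = false;
	}
}

/* Client side: check the last completion and resubmit after a failure.
 * Returns true if the item was resubmitted.
 */
static bool toy_client_poll(struct toy_stream *s)
{
	assert(s != NULL);

	if (s->last_result < 0 && !s->queued) {
		s->last_result = 0; /* failed completion consumed */
		toy_submit(s);
		return true;
	}

	return false;
}
```

Here no extra flag is needed; the failed completion is the only signal, and recovery means building the submission again.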

@ubieda
Member Author

ubieda commented Jul 23, 2025

AFAIK, cancelled items cannot be un-cancelled (as in, re-use the submission), so the client needs to re-create the SQE submission from the top.

However, I see a couple of pros of having a separate flag instead of reusing the existing one:

  • I see value in keeping the authority to cancel items with the clients only, mainly for troubleshooting purposes.
  • I also like that the flag cannot be confused with other one-shot items that have been cancelled (e.g., in a scenario of a single RTIO client working with multiple IODEVs).

At the end of the day, what we want is to let the user know a multi-shot failed, and it's on them to handle it in order to recover. I'm game for discussing what the best way to do it is, if it's not this one.

@bjarki-andreasen
Contributor

I thought chained SQEs were automatically cancelled if any SQE fails. I would imagine the same behavior for a multishot SQE (basically an infinite SQE chain); maybe this is not the case?
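
For reference, the chain behavior assumed in this comment can be pictured with a small toy model. This is a sketch of the assumption only, not a description of what the RTIO executor does; `toy_run_chain` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Run a chain of items in order. results_in[i] is what item i would return
 * if it were executed; results_out[i] receives the completion result.
 * Once one item fails, every later item completes with -ECANCELED without
 * being executed. Returns the number of items that actually executed.
 */
static size_t toy_run_chain(const int *results_in, int *results_out, size_t len)
{
	size_t executed = 0;
	int failed = 0;

	assert(results_in != NULL && results_out != NULL);

	for (size_t i = 0; i < len; i++) {
		if (failed) {
			results_out[i] = -ECANCELED;
			continue;
		}

		results_out[i] = results_in[i];
		executed++;

		if (results_in[i] < 0) {
			failed = 1;
		}
	}

	return executed;
}
```

Under this view a multishot item is a chain with no fixed length, so the first failure would end it with a failed completion rather than parking it.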

@teburd
Contributor

teburd commented Jul 23, 2025

Multi-shot is a bit odd here, but I’m with Bjarki: my thinking is that if a multi-shot fails, a failed completion should be the result and it should not be automatically resubmitted.
