We want to further simplify the Fenix+KokkosResilience process. For now the focus is on implementing the global recovery approach, while keeping future localized recovery flows in mind.
Here's the current basic flow for MiniMD:
    /**** Preinit ****/
    // Application does initialization that doesn't depend on MPI
    MPI_Init()

    MPI_Comm res_comm;
    Fenix_Init(&res_comm);

    If initial_rank:
        /**** Init ****/
        // No failures have happened. Application does some MPI-dependent init
        kr_ctx = KokkosResilience::make_context()
    Elif recovered_rank:
        /**** Recovery Re-Init ****/
        // Rank died and this is a replacement spare rank.
        // Re-do the MPI-dependent init, possibly with some alterations
        kr_ctx = KokkosResilience::make_context()
    Elif survivor_rank:
        /**** Survivor Re-Init ****/
        // These ranks need to help the recovered ranks re-init, and swap to the new resilient communicator
        kr_ctx.reset(res_comm)

    for i:
        kr_ctx.checkpoint(i, {
            // Application work
        });

    Fenix_Finalize();
    MPI_Finalize()
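To make the three-way branch above easier to discuss, here is a minimal, MPI-free sketch of just the role dispatch. Everything in it is a hypothetical stand-in and not the Fenix or KokkosResilience API: `Role` models the role a rank has after `Fenix_Init` returns, `Context` models `kr_ctx`, and `comm_generation` is an invented integer standing in for "which resilient communicator the context is bound to". The only point it illustrates is that initial and recovered ranks build a fresh context, while survivors keep theirs and only rebind it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the role a rank has after Fenix_Init returns.
enum class Role { Initial, Recovered, Survivor };

// Hypothetical stand-in for kr_ctx. It only tracks which communicator
// "generation" it is bound to; a real context holds far more state.
struct Context {
    int comm_generation = 0;
    void reset(int new_generation) { comm_generation = new_generation; }
};

// Per-rank state used by this sketch only.
struct RankState {
    bool has_context = false;
    Context ctx;
    std::vector<std::string> log;  // which init path ran, in order
};

// Mirrors the If / Elif / Elif chain in the flow above.
void enter_resilient_region(RankState& state, Role role, int comm_generation) {
    switch (role) {
    case Role::Initial:
        // No failures yet: MPI-dependent init, then a fresh context.
        state.log.push_back("init");
        state.ctx = Context();
        state.ctx.reset(comm_generation);
        state.has_context = true;
        break;
    case Role::Recovered:
        // Replacement spare rank: redo init, then a fresh context.
        state.log.push_back("recovery re-init");
        state.ctx = Context();
        state.ctx.reset(comm_generation);
        state.has_context = true;
        break;
    case Role::Survivor:
        // Survivors must already own a context; they only rebind it.
        assert(state.has_context);
        state.log.push_back("survivor re-init");
        state.ctx.reset(comm_generation);
        break;
    }
}
```

In this sketch the `Initial` and `Recovered` branches differ only in the log entry, which mirrors how both call `make_context()` in the flow above. That duplication may be one of the places worth simplifying.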
I'll leave some thoughts on directions to go as comments.