Stronger Fenix Integration #75

@Matthew-Whitlock

We want to simplify more of the Fenix+KokkosResilience process, focusing for now on implementing the global recovery approach but keeping future localized recovery flows in mind.

Here's the current basic flow for MiniMD:

/**** Preinit ****/
// Application does initialization that doesn't depend on MPI

MPI_Init()

MPI_Comm res_comm;
Fenix_Init(&res_comm);

If initial_rank:
  /**** Init ****/
  // No failures have happened. Application does some MPI-dependent init
  kr_ctx = KokkosResilience::make_context()
Elif recovered_rank:
  /**** Recovery Re-Init ****/
  // Rank died and this is a replacement spare rank.
  // Re-do the MPI-dependent init, possibly with some alterations
  kr_ctx = KokkosResilience::make_context()
Elif survivor_rank:
  /**** Survivor Re-Init ****/
  // These ranks need to help the recovered ranks re-init, and swap to the new resilient communicator
  kr_ctx.reset(res_comm)


for i:
  kr_ctx.checkpoint(i, {
    //Application work
  });

Fenix_Finalize();
MPI_Finalize()
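
To make the three-way branch after Fenix_Init concrete, here is a minimal C++ sketch of just the role dispatch. It is a self-contained model, not real integration code: `Role`, `Context`, and `after_fenix_init` are hypothetical stand-ins that I made up for illustration, and none of them are the actual Fenix or KokkosResilience APIs. The communicator is modelled as a plain `int`.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins, NOT the real Fenix or KokkosResilience types.
enum class Role { Initial, Recovered, Survivor };

struct Context {
    int comm;    // stand-in for the resilient communicator handle
    int resets;  // how many times this context was re-bound to a new communicator

    explicit Context(int c) : comm(c), resets(0) {}

    // Mirrors kr_ctx.reset(res_comm) in the flow above: keep the context,
    // swap only the communicator it uses.
    void reset(int new_comm) {
        comm = new_comm;
        ++resets;
    }
};

// Runs the branch that follows Fenix_Init in the flow above and returns
// the name of the phase that was executed.
std::string after_fenix_init(Role role, int res_comm, std::unique_ptr<Context>& ctx) {
    switch (role) {
    case Role::Initial:
        // No failure has happened: do MPI-dependent init, build a fresh context.
        ctx = std::make_unique<Context>(res_comm);
        return "init";
    case Role::Recovered:
        // Replacement spare rank: it holds no prior state, so it redoes the
        // MPI-dependent init and also builds a fresh context.
        ctx = std::make_unique<Context>(res_comm);
        return "recovery re-init";
    case Role::Survivor:
        // Survivors keep their existing context and only swap communicators.
        assert(ctx && "a survivor must already hold a context");
        ctx->reset(res_comm);
        return "survivor re-init";
    }
    return "unreachable";
}
```

The point of the model is the asymmetry: initial and recovered ranks both construct a context, while survivors only re-bind the one they already have. That asymmetry is the part of the flow the application currently has to spell out by hand.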

I'll leave some thoughts on directions to go as comments.
