Particle Loop#

Introduction#

In NESO-Particles (NP) a “Particle Loop” is a loop over all particles in a collection of particles. For each particle in the iteration set a user-provided kernel is executed. This kernel may access the data stored on each particle (ParticleDats) as well as global data. For each data structure accessed by the kernel the user must provide an access descriptor, which states exactly how the data will be accessed, e.g. read-only. The kernel must be written such that the result of the loop is independent of the execution order, i.e. “parallel and unsequenced” in C++ terminology. The abstraction follows the particle loop abstraction introduced in [SAUNDERS2018].

Advection Example#

In the code listings below we present and describe an example ParticleLoop. This example assumes that a ParticleGroup has been created with real-valued particle properties “P” and “V”, each with at least two components.

Listing 1 Example of a ParticleLoop which performs a simple advection operation.#
inline void advection_example(
    // Input ParticleGroup - we will loop over all particles in this
    // ParticleGroup.
    ParticleGroupSharedPtr particle_group
) {
  
  // These constants are captured by value into the kernel lambda.
  const int ndim = 2;
  const REAL dt = 0.001;
  
  auto loop = particle_loop(
    "advection_example", // Optional name for the loop - for profiling.

    particle_group,      // Iteration set is defined as all particles in this
                         // ParticleGroup.

    // This lambda defines the kernel to be executed for
    // all particles in the iteration set.
    // The [=] captures the ndim and dt variables by value. A [&] would capture
    // these variables by reference.
    [=](auto P, auto V){ // These parameters have a one-to-one correspondence
                         // with the loop arguments that follow the kernel in
                         // the call to particle_loop.
      
      // Loop over the P particle property and update the values with values
      // read from the V property.
      for(int dimx=0 ; dimx<ndim ; dimx++){
        P[dimx] += dt * V[dimx];
      }
    },
    // The remaining arguments passed to the particle loop are the link between
    // the kernel parameters and the data structures which the loop accesses.
    // Each argument is passed as a combination of an access descriptor and a
    // data structure. 
    
    // In this example the data structure is a ParticleDat and it is referenced
    // by the symbol for the ParticleDat in the ParticleGroup. Here we indicate
    // that the "P" ParticleDat is accessed in a write mode.
    Access::write(Sym<REAL>("P")),
    // Here we pass the V ParticleDat to the particle loop and indicate that
    // these particle properties are accessed in a read-only mode in the
    // particle loop.
    Access::read(Sym<REAL>("V"))
  );
  
  // Execute the ParticleLoop. A ParticleLoop can be executed multiple times
  // without reconstruction.
  loop->execute();

  return; 
}

The particle loop can be executed asynchronously via calls to “submit” and “wait”.

Listing 2 Duplicate of the advection ParticleLoop example with the comments removed and asynchronous execution.#
inline void advection_example_no_comments(
    ParticleGroupSharedPtr particle_group
) {
  
  const int ndim = 2;
  const REAL dt = 0.001;
  
  auto loop = particle_loop(
    "advection_example",
    particle_group,
    [=](auto P, auto V){
      for(int dimx=0 ; dimx<ndim ; dimx++){
        // .at is an alternative access method
        P.at(dimx) += dt * V.at(dimx);
      }
    },
    Access::write(Sym<REAL>("P")),
    Access::read(Sym<REAL>("V"))
  );

  // Launch the particle loop.
  loop->submit();

  // Wait for execution of the particle loop to complete.
  loop->wait();
  return; 
}
Table 1 ParticleDat<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::ParticleDat::Read<T> | Components accessed for the particle via .at or subscript, which returns a const reference. |
| Write | Access::ParticleDat::Write<T> | Components accessed for the particle via .at or subscript, which returns a modifiable reference. |

Additional Data Structures#

In the advection example we described a particle loop which only accessed particle data. Particle loops may also access additional data structures which are not tied to a particular particle:

- LocalArray: an array type which is local to the MPI rank on which it is defined, i.e. no MPI communication occurs. It can be considered similar to std::vector.
- GlobalArray: an array type where a copy of the array is stored on each MPI rank and access to the array is collective over the MPI communicator on which the array is defined.
- CellDatConst: a matrix type with a fixed size (the “Const” part of the name) and data type for each cell in a mesh.
- CellDat: a matrix type with a fixed number of columns and a variable number of rows per cell.
- SymVector: due to SYCL language restrictions the number of kernel arguments must be fixed at compilation time; SymVector applies indirection to allow a “vector” of particle properties to be passed at runtime.
- ParticleLoopIndex: provides access to indexing information for the particle within the particle loop kernel.

These additional data structures must be passed to a particle loop with a commutative access mode. In practice only commutative access modes are defined for these data structures, and attempting to pass one to a particle loop with an inappropriate access descriptor results in a compile-time error.

LocalArray#

The LocalArray type is local to each MPI rank; no communication between MPI ranks is performed by the particle loop. A local array can be accessed in particle loops with the access modes in Table 2. The following example additionally assumes an integer-valued particle property “ID” and that “V” has at least three components.

Listing 3 Particle loop example where a local array is incremented by each particle and another local array is read by each particle.#
inline void local_array_example(
    ParticleGroupSharedPtr particle_group
) {
  
  // Create a new LocalArray on the same SYCLTarget as the particle group.
  auto local_array_add = std::make_shared<LocalArray<REAL>>(
    particle_group->sycl_target, // Compute target to use.
    2,                           // Number of elements in the array.
    0                            // Initial value for elements.
  );
  
  // Create a local array using initial values from a std::vector.
  std::vector<REAL> d0 = {1.0, 2.0, 3.0};
  auto local_array_read = std::make_shared<LocalArray<REAL>>(
    particle_group->sycl_target, // Compute device.
    d0                           // Initial values and size definition.
  );
  
  auto loop = particle_loop(
    "local_array_example",
    particle_group,
    // LA_ADD is the parameter for the LocalArray with "add" access and LA_READ
    // is the parameter for the LocalArray with read-only access.
    [=](auto ID, auto V, auto LA_ADD, auto LA_READ){
      
      // Increment the first component of LA_ADD by 1.
      LA_ADD.fetch_add(0, 1);
      // Increment the second component of LA_ADD with the entry in ID[0].
      LA_ADD.fetch_add(1, ID[0]);
      
      // Read from LA_READ and assign the values to the V particle component.
      V[0] = LA_READ.at(0);
      V[1] = LA_READ.at(1);
      V[2] = LA_READ.at(2);
    },
    // Particle property access descriptors.
    Access::read(Sym<INT>("ID")),
    Access::write(Sym<REAL>("V")),
    // LocalArray access descriptors.
    Access::add(local_array_add),
    Access::read(local_array_read)
  );

  // Execute the loop.
  loop->execute();
  
  // Get the contents of the local array in a std::vector.
  auto vec0 = local_array_add->get();

  return; 
}
Table 2 LocalArray<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::LocalArray::Read<T> | Components accessed for each array element via .at or subscript, which returns a const reference. |
| Write | Access::LocalArray::Write<T> | Components accessed for each array element via .at or subscript, which returns a modifiable reference. It is the responsibility of the user to avoid write conflicts and race conditions between loop iterations. |
| Add | Access::LocalArray::Add<T> | fetch_add(index, value) atomically increments the element referenced by the index with the passed value and returns the previous value stored at that index. Users may assume that the same memory region is accessed by all invocations of the kernel; hence adding 1 for each particle yields a unique value for each participating call (a total ordering). |
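
The value returned by fetch_add can be used to give each particle that performs an add a unique slot in an output array, e.g. to compact values from a subset of particles. The following is a minimal sketch of this pattern, not taken from the NP examples: it reuses the integer particle property “ID” from Listing 3, the selection condition and the fixed output capacity of 1024 are purely illustrative, and the function name is hypothetical.

inline void fetch_add_compaction_sketch(
    ParticleGroupSharedPtr particle_group
) {
  // Single element counter initialised to zero.
  auto counter = std::make_shared<LocalArray<INT>>(
      particle_group->sycl_target, 1, 0);
  // Zero initialised output array. We assume 1024 is an upper bound on the
  // number of matching particles on this rank.
  auto matches = std::make_shared<LocalArray<INT>>(
      particle_group->sycl_target, 1024, 0);

  particle_loop(
    "fetch_add_compaction_sketch",
    particle_group,
    [=](auto ID, auto V, auto COUNT, auto MATCHES){
      // Illustrative condition: particles moving in the -x direction.
      if (V.at(0) < 0.0) {
        // fetch_add returns the previous value, hence each particle that
        // reaches this point obtains a unique slot index.
        const INT slot = COUNT.fetch_add(0, 1);
        // The slot is unique to this particle so the add is conflict free:
        // it adds ID[0] to a zero-initialised element.
        MATCHES.fetch_add(slot, ID.at(0));
      }
    },
    Access::read(Sym<INT>("ID")),
    Access::read(Sym<REAL>("V")),
    Access::add(counter),
    Access::add(matches)
  )->execute();

  // counter->get().at(0) now holds the number of matches and the first that
  // many entries of matches->get() hold their IDs, in no particular order.
}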

NDLocalArray#

NDLocalArray is an N-dimensional local array type templated on the data type (T) and the number of dimensions (N). Otherwise its behaviour is intended to be analogous to LocalArray. The indices of the NDLocalArray are linearised such that the first index is the slowest varying and the last index is the fastest varying, e.g. for a two-dimensional array with shape (N0, N1) the element with index (i, j) is stored at linear index i * N1 + j.

Listing 4 Particle loop example where an N-dimensional local array is incremented by each particle and another ND local array is read by each particle.#
inline void nd_local_array_example(
    ParticleGroupSharedPtr particle_group
) {
  
  // Create a new 2D NDLocalArray to accumulate values into.
  auto local_array_add = std::make_shared<NDLocalArray<INT, 2>>(
    particle_group->sycl_target, // Compute target to use.
    2,                           // Number of elements in the array for 
                                 // dimension 0.
    1                            // Number of elements in the array for 
                                 // dimension 1.
  );
  // Fill the array with the same value for all elements.
  local_array_add->fill(0);
  
  // Create a 2D NDLocalArray for read access.
  auto local_array_read = std::make_shared<NDLocalArray<REAL, 2>>(
    particle_group->sycl_target, // Compute device.
    1,                           // Number of elements in dimension 0. 
    3                            // Number of elements in dimension 1. 
  );
  
  // Get the contents of an NDLocalArray on the host.
  // Note uninitialised data here.
  auto v = local_array_read->get();
  // The data in the array is linearised such that the indices run from slowest 
  // to fastest.
  v.at(0) = 1.0; // Element at index (0,0).
  v.at(1) = 2.0; // Element at index (0,1).
  v.at(2) = 3.0; // Element at index (0,2).
  
  // Copy the host std::vector back into the NDLocalArray
  local_array_read->set(v);
  
  auto loop = particle_loop(
    "local_array_example",
    particle_group,
    // LA_ADD is the parameter for the NDLocalArray with "add" access and 
    // LA_READ is the parameter for the NDLocalArray with read-only access.
    [=](auto ID, auto V, auto LA_ADD, auto LA_READ){
      
      // Increment the (0, 0) component of LA_ADD by 1.
      LA_ADD.fetch_add(0, 0, 1);
      // Increment the (1, 0) component of LA_ADD with the entry in ID[0].
      LA_ADD.fetch_add(1, 0, ID[0]);
      
      // Read from LA_READ and assign the values to the V particle component.
      V[0] = LA_READ.at(0, 0);
      V[1] = LA_READ.at(0, 1);
      V[2] = LA_READ.at(0, 2);
    },
    // Particle property access descriptors.
    Access::read(Sym<INT>("ID")),
    Access::write(Sym<REAL>("V")),
    // NDLocalArray access descriptors.
    Access::add(local_array_add),
    Access::read(local_array_read)
  );

  // Execute the loop.
  loop->execute();
  
  // Get the contents of the local array in a std::vector.
  auto vec0 = local_array_add->get();
}
Table 3 NDLocalArray<T, N> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::NDLocalArray::Read<T, N> | Components accessed for each array element via .at(i_0, ..., i_{N-1}), which returns a const reference. |
| Write | Access::NDLocalArray::Write<T, N> | Components accessed for each array element via .at(i_0, ..., i_{N-1}), which returns a modifiable reference. It is the responsibility of the user to avoid write conflicts and race conditions between loop iterations. |
| Add | Access::NDLocalArray::Add<T, N> | fetch_add(i_0, ..., i_{N-1}, value) atomically increments the element referenced by the index with the passed value and returns the previous value stored at that index. Users may assume that the same memory region is accessed by all invocations of the kernel; hence adding 1 for each particle yields a unique value for each participating call (a total ordering). |

GlobalArray#

Unlike the local array, a global array is intended to hold identical values across all MPI ranks. The global array can be accessed in “read” and “add” access modes. When accessed with the “add” access mode the particle loop must be executed collectively across all MPI ranks; on completion of the loop the local entries of the global array are combined across ranks (an Allreduce).

Listing 5 Particle loop example where a global array is incremented by each particle.#
inline void global_array_example(
    ParticleGroupSharedPtr particle_group
) {
  
  // Create a new GlobalArray on the same SYCLTarget as the particle group.
  auto global_array = std::make_shared<GlobalArray<REAL>>(
    particle_group->sycl_target, // Compute target to use.
    1,                           // Number of elements in the array.
    0                            // Initial value for elements.
  );
  
  auto loop = particle_loop(
    "global_array_example",
    particle_group,
    [=](auto V, auto GA){
      // Kinetic energy of this particle.
      const REAL kinetic_energy = 
        0.5 * (V[0] * V[0] + V[1] * V[1]);
      // Increment the first component of the global array with the
      // contribution from this particle.
      GA.add(0, kinetic_energy);
    },
    // Particle property access descriptor.
    Access::read(Sym<REAL>("V")),
    // GlobalArray access descriptor.
    Access::add(global_array)
  );
  
  // Execute the loop. This must be called collectively on the MPI communicator
  // of the SYCLTarget as the add operation is collective.
  loop->execute();

  // Get the contents of the global array in a std::vector. The first element
  // will contain the total kinetic energy summed over all particles on all
  // MPI ranks.
  auto vec0 = global_array->get();

  return; 
}
Table 4 GlobalArray<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::GlobalArray::Read<T> | Components accessed for each array element via .at or subscript, which returns a const reference. |
| Add | Access::GlobalArray::Add<T> | add(index, value) atomically increments the element referenced by the index with the passed value. On completion of loop execution the values are all-reduced across all MPI ranks. |
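
Listing 5 demonstrated the “add” access mode; the “read” access mode can be used to consume the reduced value in a subsequent particle loop. Below is a minimal sketch of this two-pass pattern: the real-valued particle property “W” that receives the value and the function name are illustrative assumptions, not part of the NP examples.

inline void global_array_read_sketch(
    ParticleGroupSharedPtr particle_group,
    std::shared_ptr<GlobalArray<REAL>> global_array
) {
  particle_loop(
    "global_array_read_sketch",
    particle_group,
    [=](auto GA, auto W){
      // GA.at(0) holds the all-reduced value computed by a previous loop,
      // e.g. the total kinetic energy from Listing 5.
      W[0] = GA.at(0);
    },
    Access::read(global_array),
    Access::write(Sym<REAL>("W"))
  )->execute();
}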

CellDatConst and CellDat#

The CellDatConst data structure stores a fixed-size matrix per mesh cell. When accessed from a particle loop, the access descriptors expose the matrix which corresponds to the cell in which the particle resides.

Listing 6 Particle loop example where a CellDatConst is accessed by the ParticleLoop. CellDat is accessed in a ParticleLoop in an identical manner.#
inline void cell_dat_const_example(
    ParticleGroupSharedPtr particle_group
) {
  
  // Get the number of cells on this MPI rank.
  const int cell_count = particle_group->domain->mesh->get_cell_count();

  // Create a 3x1 matrix for cell_count cells.
  auto g1 = std::make_shared<CellDatConst<REAL>>(
      particle_group->sycl_target, cell_count, 3, 1);
  
  // For each cell get the current matrix and zero the values then write back
  // to the data structure.
  for (int cx = 0; cx < cell_count; cx++) {
    auto cell_data = g1->get_cell(cx);
    cell_data->at(0, 0) = 0.0;
    cell_data->at(1, 0) = 0.0;
    cell_data->at(2, 0) = 0.0;
    g1->set_cell(cx, cell_data);
  }

  // Alternatively all entries in all cells of the CellDatConst may be filled
  // with a value.
  g1->fill(0.0);
  
  auto loop = particle_loop(
    "cell_dat_const_example",
    particle_group,
    [=](auto V, auto CDC){
      // Increment the matrix in each cell with the velocities of particles in
      // that cell.
      CDC.fetch_add(0, 0, V[0]);
      CDC.fetch_add(1, 0, V[1]);
      CDC.fetch_add(2, 0, V[2]);
    },
    },
    // Particle property access descriptor.
    Access::read(Sym<REAL>("V")),
    // CellDatConst access descriptor.
    Access::add(g1)
  );
  
  // Execute the loop.
  loop->execute();

  // After loop execution the 3x1 matrix in each cell will contain the sum of
  // the particle velocities in each cell.
  for (int cx = 0; cx < cell_count; cx++) {
    auto cell_data = g1->get_cell(cx);
    // use cell data
  }

  return; 
}
Table 5 CellDatConst<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::CellDatConst::Read<T> | Components accessed for each array element via .at(row, col) or subscript (linearised index), which returns a const reference. |
| Add | Access::CellDatConst::Add<T> | fetch_add(row, col, value) atomically increments the element referenced by the index with the passed value and returns the previous value stored at that index. Users may assume that the same memory region is accessed by all invocations of the kernel; hence adding 1 for each particle yields a unique value for each participating call (a total ordering). |
| Min | Access::CellDatConst::Min<T> | fetch_min(row, col, value) atomically computes the minimum of the element referenced by the index and the passed value. Returns the previous value stored at that index. |
| Max | Access::CellDatConst::Max<T> | fetch_max(row, col, value) atomically computes the maximum of the element referenced by the index and the passed value. Returns the previous value stored at that index. |
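
The “min” and “max” access modes allow per-cell extrema to be computed in the same fashion as the per-cell sums of Listing 6. The following sketch records the per-cell minimum and maximum of the first velocity component. It assumes that Access::min and Access::max are the access-descriptor helpers corresponding to the Min and Max rows above, by analogy with Access::add, and it uses std::numeric_limits (from <limits>) for the initial fill values.

inline void cell_dat_const_min_max_sketch(
    ParticleGroupSharedPtr particle_group
) {
  const int cell_count = particle_group->domain->mesh->get_cell_count();

  // A 1x1 matrix per cell for the running minimum and maximum.
  auto v_min = std::make_shared<CellDatConst<REAL>>(
      particle_group->sycl_target, cell_count, 1, 1);
  auto v_max = std::make_shared<CellDatConst<REAL>>(
      particle_group->sycl_target, cell_count, 1, 1);
  // Initialise with extreme values such that the first particle in each
  // cell always wins the comparison.
  v_min->fill(std::numeric_limits<REAL>::max());
  v_max->fill(std::numeric_limits<REAL>::lowest());

  particle_loop(
    "cell_dat_const_min_max_sketch",
    particle_group,
    [=](auto V, auto VMIN, auto VMAX){
      VMIN.fetch_min(0, 0, V.at(0));
      VMAX.fetch_max(0, 0, V.at(0));
    },
    Access::read(Sym<REAL>("V")),
    Access::min(v_min),
    Access::max(v_max)
  )->execute();

  // Cells which contain no particles retain the initial fill values.
}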

Table 6 CellDat<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::CellDat::Read<T> | Components accessed for each array element via .at(row, col), which returns a const reference. |
| Add | Access::CellDat::Add<T> | fetch_add(row, col, value) atomically increments the element referenced by the index with the passed value and returns the previous value stored at that index. Users may assume that the same memory region is accessed by all invocations of the kernel; hence adding 1 for each particle yields a unique value for each participating call (a total ordering). |
| Write | Access::CellDat::Write<T> | Write via the modifiable reference returned by .at(row, col). It is the responsibility of the user to avoid write conflicts and race conditions between loop iterations. |

SymVector#

SymVector enables passing a set of ParticleDats to a ParticleLoop when the number of ParticleDats is only known at runtime. Within the kernel, the ParticleDats are indexed by their position in the SymVector at its construction.

Listing 7 Particle loop example where a SymVector is accessed by the ParticleLoop.#
inline void sym_vector_example(
    ParticleGroupSharedPtr particle_group
) {
  
  // These constants are captured by value into the kernel lambda.
  const int ndim = 2;
  const REAL dt = 0.001;
  
  auto loop = particle_loop(
    "sym_vector_example",
    particle_group,
    [=](auto dats_vector){
      for(int dimx=0 ; dimx<ndim ; dimx++){
        // P has index 0 in dats_vector as it is first in the sym_vector.
        // V has index 1 in dats_vector as it is second.
        dats_vector.at(0, dimx) += dt * dats_vector.at(1, dimx);
      }     
    },
    // We state that all ParticleDats in the SymVector are write access.
    Access::write(
      // Helper function to create a SymVector.
      sym_vector<REAL>(
        particle_group,
        // This argument may also be a std::vector of Syms.
        {Sym<REAL>("P"), Sym<REAL>("V")}
      )
    )
  );

  // Execute the loop.
  loop->execute();
  return; 
}
Table 7 SymVector<T> Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::SymVector::Read<T> | Components accessed for each array element via .at or subscript, which returns a const reference. |
| Write | Access::SymVector::Write<T> | Components accessed for each array element via .at or subscript, which returns a modifiable reference. |

ParticleLoopIndex#

Some data structures are indexed by the cell and layer of the particle. The ParticleLoopIndex is a data structure that can be read in a ParticleLoop to provide this information. This data structure is read-only.

Listing 8 Particle loop example where a ParticleLoopIndex is accessed by the ParticleLoop.#
inline void advection_example_loop_index(
    ParticleGroupSharedPtr particle_group
) {
  
  auto loop = particle_loop(
    "loop_index_example",
    particle_group,
    [=](auto index){
      // Dummy output variable we store the indices in for this example.
      [[maybe_unused]] INT tmp;
      // The cell containing the particle.
      tmp = index.cell; 
      // The row (layer) containing the particle.
      tmp = index.layer; 
      // The linear index of the particle on the calling MPI rank.
      // This index is in [0, particle_group->get_npart_local()).
      tmp = index.get_local_linear_index(); 
      // The linear index of the particle in the current ParticleLoop. 
      // This index is in [0, <size of ParticleLoop iteration set>).
      tmp = index.get_loop_linear_index(); 
    },
    // Note the extra {} that creates an instance of the type.
    Access::read(ParticleLoopIndex{})
  );

  loop->execute();
  return; 
}
Table 8 ParticleLoopIndex Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::ParticleLoopIndex::Read | Indices accessed via .cell, .layer, .get_local_linear_index and .get_loop_linear_index. |

KernelRNG#

There are potentially many methods for accessing RNG values in a ParticleLoop, and each of these methods may require a specialised device implementation.

HostPerParticleBlockRNG is an abstract base class that encapsulates an approach where RNG values are sampled on the host and stored in a temporary device buffer prior to particle loop execution. HostPerParticleBlockRNG allows any host function which takes no arguments and returns a single value to be used as the sampling function.

In the following example we use a host normal distribution as a source of random velocity values in a ParticleLoop.

Listing 9 Particle loop example where a ParticleLoop accesses random values for each particle.#
inline void particle_loop_rng(
    // Input ParticleGroup - we will loop over all particles in this
    // ParticleGroup.
    ParticleGroupSharedPtr particle_group
) {
  
  // Number of RNG values required per particle in the ParticleLoop.
  const int num_components = 3;
  
  // Create a RNG on the host with the required distribution.
  std::mt19937 rng_state(52234234);
  std::normal_distribution<REAL> rng_dist(0, 1.0);
  auto rng_lambda = [&]() -> REAL { return rng_dist(rng_state); };
  
  // Create a HostPerParticleBlockRNG instance from the required distribution.
  auto rng_kernel = host_per_particle_block_rng<REAL>(rng_lambda, num_components);
  
  // Create a ParticleLoop which samples the distribution num_components times
  // per particle.
  auto loop = particle_loop(
    "rng_example", 
    particle_group,
    [=](auto INDEX, auto DIST, auto V){
      for(int dimx=0 ; dimx<num_components ; dimx++){
        V.at(dimx) = DIST.at(INDEX, dimx);
      }
    },
    // A ParticleLoopIndex facilitates access to the RNG values.
    Access::read(ParticleLoopIndex{}),
    // A KernelRNG may only be accessed in Read mode.
    Access::read(rng_kernel),
    // Output particle property.
    Access::write(Sym<REAL>("V"))
  );
  
  loop->execute();

  return; 
}
Table 9 KernelRNG Access#

| Access Descriptors | Kernel Types | Notes |
| --- | --- | --- |
| Read | Access::KernelRNG::Read<T> | Values accessed via .at(particle_index, component), where particle_index is a ParticleLoopIndex. |

[SAUNDERS2018]

W. R. Saunders, J. Grant and E. H. Müller, “A domain specific language for performance portable molecular dynamics algorithms”, Computer Physics Communications 224 (2018) 119–135. arXiv:1704.03329.