Parallel Galaxy Evolution Simulation
  • We implemented a galaxy evolution simulator using two algorithms: the Barnes-Hut algorithm and a Morton-code algorithm. We then parallelized these sequential algorithms using CUDA and evaluated performance on the latedays cluster, which has an NVIDIA Tesla K40 GPU.

  • Sequential and CUDA Barnes-Hut algorithm
  • The key data structure for the sequential Barnes-Hut algorithm is the QuadTree, a tree in which every internal node has exactly four children. We use the QuadTree to partition a two-dimensional space by recursively subdividing it into four quadrants. The QuadTree is implemented differently in the sequential and CUDA versions: the sequential approach uses a pointer-based tree layout, while the CUDA approach pre-allocates an array in GPU memory that holds an array-based tree layout.
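The difference between the two layouts can be sketched as follows. This is a minimal illustration of an array-based quadtree node, with hypothetical field names, not the project's actual code: children are stored as indices into one pre-allocated contiguous buffer instead of as pointers, which is what makes the layout GPU-friendly.

```cpp
#include <vector>

// Hypothetical array-based quadtree node: children are indices into a
// pre-allocated node array rather than pointers, so the whole tree lives
// in one contiguous buffer that can be copied to GPU memory.
struct QuadNode {
    int children[4];   // indices of the four quadrant children, -1 if empty
    float mass;        // total mass of the bodies in this region
    float cm_x, cm_y;  // center of mass of the region
};

// Pre-allocate the node pool once; "allocating" a node afterwards is just
// bumping a counter into this array.
std::vector<QuadNode> pool(1 << 20, QuadNode{{-1, -1, -1, -1}, 0.0f, 0.0f, 0.0f});
```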

  • The QuadTree is the core of the program. In the sequential approach, during the construction phase, each new node traverses the existing QuadTree, finds its appropriate position in the tree, and is inserted there. During the force-calculation phase, the acceleration of each object can be computed by simply invoking the tree's apply_to_objects method:
void QuadTree::apply_to_objects(std::vector<Object> &objects, GS_FLOAT dt) {
    for (size_t i = 0; i < objects.size(); i++) {
        Object *object = &objects[i];
        // Approximate the acceleration from the tree, then update velocity.
        Point2D acc = get_force_on_object(object);
        Point2D dv = point2d_multiply(acc, dt);
        object->speed = point2d_add(object->speed, dv);
    }
}
  • For the CUDA version, we have several key operations on the array-based tree, each implemented as a kernel. The TreeBuildingKernel fills the internal tree nodes in the pre-allocated array. The ForceCalculationKernel computes the acceleration of every particle on each iteration. The IntegrationKernel uses the accelerations computed by the previous kernels to update the positions and velocities of the particles. There are two other kernels, SummarizationKernel and SortKernel, which summarize internal-node information and sort the particles, respectively.
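As a sketch of the last stage, the per-particle work of the IntegrationKernel amounts to a simple velocity-then-position update. The plain C++ below shows that logic sequentially; field and function names are illustrative, not the project's actual kernel, and in CUDA each thread would run the loop body for one particle index.

```cpp
#include <cstddef>

// Illustrative particle record; the acceleration fields are assumed to have
// been filled in by the force-calculation stage.
struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
    float ax, ay;  // acceleration from the force-calculation stage
};

// Sequential equivalent of the per-thread body of an integration kernel:
// each CUDA thread would execute one iteration of this loop.
void integrate(Particle *p, std::size_t n, float dt) {
    for (std::size_t i = 0; i < n; i++) {
        p[i].vx += p[i].ax * dt;
        p[i].vy += p[i].ay * dt;
        p[i].x  += p[i].vx * dt;
        p[i].y  += p[i].vy * dt;
    }
}
```

This stage touches each particle exactly once with no cross-particle dependence, which is why it maps so cleanly onto one thread per particle.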
  • Generally speaking, the Barnes-Hut algorithm is an approximation method for computing the forces on each particle. The algorithm should only modify the positions and speeds of particles, so the input to the algorithm and the output from it are the same set of particles, with only positions and speeds changed.
  • The force-calculation phase is the most computationally expensive (over 90% of run time). Since the result for each particle can be computed independently, this is the code we should try to parallelize.

  • As noted above, the algorithm mainly consists of three kernels (or three stages): tree construction, force calculation, and position and velocity update. In the CUDA implementation, force calculation depends on the completed construction of the QuadTree, and the position and velocity update depends on the completed force calculation. Each stage is itself data-parallel, but it is not feasible to parallelize across stages: incorrect results may be produced if force calculation begins before the tree construction stage completes. Sequential tree construction and sequential force calculation have potentially very poor data locality, an inherent drawback of a pointer-based tree layout, and they are not amenable to SIMD execution because the work diverges at each step. The position and velocity update stage, however, has good data locality and is amenable to SIMD execution.
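The stage ordering can be enforced on the host simply by issuing the stages in sequence; kernels launched into the same CUDA stream execute in issue order, so each stage acts as a barrier for the next. A host-side sketch, with stub functions standing in for the real kernel launches:

```cpp
#include <cassert>

// Stubs standing in for the real kernel launches; in CUDA these would be
// calls like TreeBuildingKernel<<<...>>>() issued into one stream, which
// guarantees they run strictly in order.
static int stage = 0;
void build_tree()       { assert(stage == 0); stage = 1; }
void compute_forces()   { assert(stage == 1); stage = 2; }  // needs the tree
void integrate_bodies() { assert(stage == 2); stage = 0; }  // needs the forces

// One simulation step: the three stages must run in this order, but each
// stage is internally data-parallel across particles.
void step() {
    build_tree();
    compute_forces();
    integrate_bodies();
}
```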

  • Sequential and CUDA Morton Code algorithm
  • First, a brief introduction to the Morton-code algorithm. The basic idea is a hash function from a particle's position to its Morton code (the hash value). Particles are then sorted according to their Morton codes, so particles with “closer” positions sit “closer” in the array, as shown in Figure 4. A level mask is then used to filter particles (new_mask = morton_code & level_mask): particles with equal “new_mask” values are grouped together into cells. A leaf threshold (say, 16) is used to decide whether a group of particles will be used for further tree construction. Tree construction does not complete until the maximum depth is reached or all particles have been assigned to leaves. In conclusion, “close” particles are finally grouped into a cell whose properties {mass, speed, position} represent the group of particles.
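The encoding step itself is bit interleaving. The sketch below shows the general technique for 2D Morton codes, assuming 16-bit quantized coordinates; it is illustrative and not necessarily the project's exact mortan2D implementation.

```cpp
#include <cstdint>

// Spread the low 16 bits of v so each bit lands in an even position:
// bit i of v moves to bit 2*i of the result.
static uint32_t spread_bits(uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Interleave x (even bits) and y (odd bits) into one 32-bit Morton code.
// Nearby (x, y) points tend to get nearby codes, which is what makes
// sorting by code place spatial neighbors next to each other.
uint32_t morton2D(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```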
  • The first key data structure for our project is the class MortonTreeObject. Its mcode member stores the Morton code generated from the object’s position in the rectangular space model.
class MortonTreeObject : public Object {
public:
    unsigned long mcode;    // Morton code of the object's position
    int parent;             // index of the cell that contains this object
    MortonTreeObject(Point2D pos, Point2D speed, GS_FLOAT mass, RectangleD rect)
            : Object(pos, speed, mass) {
        // Map the position into the unit square, then compute its Morton code.
        Point2D p = point2d_unit_square(pos, rect);
        mcode = mortan2D(p.x, p.y, rect);
        parent = -1;
    }
};
  • The second key data structure is the class MortonTree. Its mortonObjects vector stores all the objects within the MortonTree model.
class MortonTree {
        // all objects in the tree, kept sorted by Morton code
        std::vector<MortonTreeObject*> mortonObjects;
};
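Once the objects are in place, grouping by spatial proximity reduces to a sort on the codes. A simplified sketch of that step, using a standalone struct rather than the project's MortonTreeObject:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for MortonTreeObject: only the Morton code matters
// for the ordering.
struct Obj { uint32_t mcode; };

// Sort so that objects whose positions are close in 2D end up adjacent in
// the array, which lets a level mask carve out contiguous cells.
void sort_by_mcode(std::vector<Obj> &objs) {
    std::sort(objs.begin(), objs.end(),
              [](const Obj &a, const Obj &b) { return a.mcode < b.mcode; });
}
```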