Use the MinoTauro User's Guide provided by the teacher to solve the tasks in this lab.

numsection Connecting to MinoTauro

task: Connect to the MinoTauro supercomputer using the Secure Shell (ssh) tool.
task: Create a Hello World program in C, compile it, and run it on your login node.

numsection Submitting a Job

There are two supported methods for submitting jobs. The first is to use mnsubmit, a wrapper maintained by the Operations Team at BSC that provides a standard syntax regardless of the underlying batch system. The other is to use the SLURM sbatch directives directly; this second option is recommended for advanced users only.
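
Hint: A minimal SLURM job script for the serial Hello World might look like the sketch below (the job name, file names and time limit are placeholders; check the MinoTauro User's Guide for the directives accepted on the machine). Submit it with sbatch job.sh; the mnsubmit wrapper takes an equivalent job file written with its own directives and is submitted with mnsubmit job.cmd.

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH -D .
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

./hello_world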

task: Submit your "Hello World" program using the SLURM sbatch directives.
task: Submit your "Hello World" program using the mnsubmit wrapper.

numsection MPI Hello World

task: Compile and run your MPI "Hello World" program created during Lab 3 using the MinoTauro batch system (mnsubmit); an example mnsubmit job file is sketched below.
task: Compile and run your MPI "Trapezoidal Rule" program created during Lab 3 using the MinoTauro batch system (mnsubmit).
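
Hint: A job file for mnsubmit uses "# @" directives instead of "#SBATCH" ones. The sketch below shows a possible 4-task MPI run; the directive names are the ones commonly used by the BSC wrapper, so check the MinoTauro User's Guide for the exact list, and treat the job name, file names and limits as placeholders.

#!/bin/bash
# @ job_name = hello_mpi
# @ initialdir = .
# @ output = hello_mpi_%j.out
# @ error = hello_mpi_%j.err
# @ total_tasks = 4
# @ wall_clock_limit = 00:02:00

srun ./hello_mpi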

numsection Taking Time

task: Using the gettimeofday function, obtain the MPI execution time of the program that estimates Pi using the Trapezoidal Rule, for 2 and 4 processors with n = 16777216. Compare the results with those obtained on MareNostrum and justify the difference.
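
Hint: One way to take the time is to wrap the parallel section between two gettimeofday calls and print the elapsed seconds on rank 0. A sketch of this (the helper function and the variable names rank, n and size are illustrative, and the middle fragment goes inside your main):

#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

/* Wall-clock time in seconds, built on gettimeofday */
double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1.0e6;
}

/* Inside main, around the trapezoidal-rule computation: */
MPI_Barrier(MPI_COMM_WORLD);          /* make all ranks start together */
double t_start = wall_time();

/* ... local trapezoidal rule + MPI_Reduce of the partial sums ... */

MPI_Barrier(MPI_COMM_WORLD);
double t_end = wall_time();
if (rank == 0)
    printf("n = %d, p = %d, elapsed = %f s\n", n, size, t_end - t_start);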

numsection MPI Matrix-vector product

task: Compute the total parallel execution time tpar(n, p) = σ(n) + φ(n)/p + κ(n, p) of the MPI parallel code, using one processor per node. Populate the following table (a measurement sketch is given after the table). Comment on the results obtained and compare them with those previously obtained on MareNostrum.
Nodes   n = 131072   n = 262144   n = 524288
1
2
4
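
Hint: tpar(n, p) can be estimated as the maximum elapsed time over all ranks. A sketch using MPI_Wtime (the gettimeofday helper above works equally well; the variable names are illustrative and the fragment goes inside your main):

/* Per-rank elapsed time around the matrix-vector product */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();

/* ... distribute the data, compute the local rows, gather the result ... */

double local_elapsed = MPI_Wtime() - t0;

/* tpar is the slowest rank's time */
double tpar;
MPI_Reduce(&local_elapsed, &tpar, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("n = %d, p = %d, tpar = %f s\n", n, size, tpar);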

Hint: Example job file (2 Nodes)

#!/bin/bash
#SBATCH --job-name=cuda_k80
#SBATCH -D .
#SBATCH --output=k80_%j.out
#SBATCH --error=k80_%j.err
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --gres gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --constraint=k80
#SBATCH --time=00:02:00

...

Hint: Add this to run a job using the reservation queue

#SBATCH --reservation=YOUR_RESERVATION

Hint: Add this to run a job using the debug queue

#SBATCH --partition=debug
#SBATCH --qos=debug

numsection CUDA Hello World

Hint: Ask for an interactive node using:

mnsh -k -g 1 
task: Create and run a CUDA Hello World on a CPU of an interactive node of MinoTauro.
task: Create and run a Hello World on a GPU of an interactive node of MinoTauro (a minimal sketch follows).
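
Hint: A minimal sketch of the GPU version (compile with nvcc, e.g. nvcc hello.cu -o hello; the kernel name and launch configuration are only an example):

#include <stdio.h>

/* Kernel executed on the GPU: each thread prints its index */
__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main(void)
{
    printf("Hello World from CPU!\n");   /* runs on the host */

    helloFromGPU<<<1, 8>>>();            /* one block of 8 threads */
    cudaDeviceSynchronize();             /* wait for the kernel (and its printf) to finish */

    return 0;
}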

numsection Data movement between host and device

task: Based on the example "A simple kernel to add two integers" presented in the theory class, create a code that adds two arrays. You can use the following code, which performs the array summation on the CPU, and modify it to perform the summation on the GPU.
#include <stdlib.h>
#include <time.h>
void sumArraysOnHost(float *A, float *B, float *C, const int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
}
void initialData(float *ip, int size)
{
    // generate different seed for random number
    time_t t;
    srand((unsigned) time(&t));

    int i;
    for (i = 0; i < size; i++)
    {
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
    }
    return;
}

int main(int argc, char **argv)
{
    int nElem = 1024;
    size_t nBytes = nElem * sizeof(float);

    float *h_A, *h_B, *h_C;
    h_A = (float *)malloc(nBytes);
    h_B = (float *)malloc(nBytes);
    h_C = (float *)malloc(nBytes);

    initialData(h_A, nElem);
    initialData(h_B, nElem);

    sumArraysOnHost(h_A, h_B, h_C, nElem);

    free(h_A);
    free(h_B);
    free(h_C);

    return(0);
}

Some help:

//Memory allocation:
float *d_A, *d_B, *d_C;
cudaMalloc((float**)&d_A, nBytes);
cudaMalloc((float**)&d_B, nBytes);
cudaMalloc((float**)&d_C, nBytes);

// Transfer the data from the CPU memory to the GPU global memory
cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);

// with the parameter cudaMemcpyHostToDevice specifying the transfer direction.

//Copy the result from the GPU memory back to the host:
cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);

// Release the memory used on the GPU
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
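
Hint: The help above covers only the memory management; the kernel itself and its launch are still missing. A possible sketch, in the spirit of the class example (the kernel name and block size are illustrative, and gpuRef would be a host buffer allocated like h_C that receives the result copied back from the device so it can be checked against the CPU result):

/* Kernel: one thread per element */
__global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

/* Launch: enough blocks of 256 threads to cover nElem elements */
int threadsPerBlock = 256;
int blocksPerGrid = (nElem + threadsPerBlock - 1) / threadsPerBlock;
sumArraysOnGPU<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, nElem);
cudaDeviceSynchronize();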

numsection Organizing Threads

task: Create and run a CUDA program with a thread hierarchy structured as a 2D grid containing 2D blocks, which displays the dimensionality of a thread block and grid from both the host side and the device side (a sketch follows).
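
Hint: A sketch of such a program (the sizes of the grid and the block are only an example):

#include <stdio.h>

/* Each thread prints the dimensionality it sees on the device side */
__global__ void checkIndex(void)
{
    printf("threadIdx:(%d, %d) blockIdx:(%d, %d) blockDim:(%d, %d) gridDim:(%d, %d)\n",
           threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
           blockDim.x, blockDim.y, gridDim.x, gridDim.y);
}

int main(void)
{
    int nElem = 16;

    dim3 block(4, 2);                                 /* 2D block: 4 x 2 threads */
    dim3 grid((nElem + block.x - 1) / block.x, 2);    /* 2D grid of blocks */

    /* Dimensionality seen from the host side */
    printf("grid: (%d, %d)  block: (%d, %d)\n", grid.x, grid.y, block.x, block.y);

    /* Dimensionality seen from the device side */
    checkIndex<<<grid, block>>>();
    cudaDeviceSynchronize();

    return 0;
}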

numsection Submitting jobs on MinoTauro

task: Use the mnsubmit batch system to run a Hello World on a CPU or a GPU.
task: Use the mnsubmit batch system to run the matrix multiplication case study presented in the theory class.

numsection Timing the kernel

task: Using gettimeofday, measure the execution time of the previous matrix multiplication example.
task: Use the NVIDIA profiler nvprof to measure the previous matrix multiplication example.
task: Use CUDA events to measure the previous matrix multiplication example (a sketch of this approach follows this list).
task: Compare the results obtained in the previous three tasks.
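
Hint: A sketch of the CUDA-event measurement (the kernel name and launch configuration are placeholders for your matrix multiplication code). For nvprof no change to the code is needed: running something like nvprof ./matmul reports the kernel time directly. Note that a gettimeofday measurement around the launch only makes sense if it also encloses a cudaDeviceSynchronize(), since kernel launches are asynchronous.

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        /* mark the start on the default stream */
matrixMul<<<grid, block>>>(d_A, d_B, d_C, n);  /* the kernel being measured (placeholder) */
cudaEventRecord(stop);                         /* mark the end */

cudaEventSynchronize(stop);                    /* wait until the stop event has been reached */
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);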

numsection Evaluating scalability

task: Build a testbed based on the previous matrix multiplication example to evaluate the scalability of the algorithm on MinoTauro. Consider different values for the matrix size, the block size, and the grid size (a possible parameterization is sketched below). Discuss your experimental design with the teacher during the lab session before starting the executions.
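
Hint: One way to organize the testbed is to take the matrix size and the block side from the command line, so the same binary can be submitted with different parameters; a sketch (all names are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Matrix size and block side taken from the command line, with defaults */
    int n     = (argc > 1) ? atoi(argv[1]) : 1024;
    int bsize = (argc > 2) ? atoi(argv[2]) : 16;

    dim3 block(bsize, bsize);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    /* ... allocate, initialize, launch and time the matrix multiplication kernel ... */

    printf("n = %d, block = %dx%d, grid = %dx%d\n",
           n, bsize, bsize, grid.x, grid.y);
    return 0;
}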