GPUTejas Overview
GPUTejas is a highly configurable simulator that can seamlessly simulate advanced GPU architectures such as Tesla, Fermi, and Kepler. A CUDA executable is given as an input to the simulator. We first use an instrumented version of Ocelot to run the executable and generate a set of trace files. These trace files primarily contain information regarding the instructions being executed, including the instruction type, the instruction pointer (IP), and the corresponding PTX instruction. Note that PTX (Parallel Thread Execution) is an intermediate device language, which has to be converted into device-specific binary code for native execution. If an instruction is a load or a store, the trace additionally contains the list of memory addresses it accesses. We also embed some metadata along with every trace file. The metadata lists the number of kernels, the grid size of each kernel, and the number of blocks present in each kernel.
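To make the contents of a trace concrete, the following is a minimal sketch of what a single decoded trace record could look like. The field names and types here are our own assumptions for illustration; they do not reflect GPUTejas's actual on-disk format.

    import java.util.List;

    // Illustrative classes of traced instructions (names are hypothetical).
    enum InstructionType { ALU, LOAD, STORE, BRANCH, SFU }

    // One decoded entry of a trace file.
    final class TraceRecord {
        final InstructionType type;  // class of the executed instruction
        final long ip;               // instruction pointer
        final String ptx;            // textual PTX instruction
        final List<Long> addresses;  // memory addresses (loads/stores only)

        TraceRecord(InstructionType type, long ip, String ptx,
                    List<Long> addresses) {
            this.type = type;
            this.ip = ip;
            this.ptx = ptx;
            this.addresses = addresses;
        }
    }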
These traces undergo a second pass that reduces the size of the files. We observed that all the blocks of a kernel contain the same set of instructions; thus, the same information was being saved in the trace files for every block. We therefore store the information regarding these instructions separately in a hash file. Our post-processing scripts subsequently generate new trace files that contain only the instruction pointers of the instructions. These instruction pointers map to the actual instructions in the hash file. Note that these instructions are translated to specific instruction classes before being stored in the hash file, which further reduces the space occupied by the generated traces.
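A minimal sketch of this deduplication idea appears below: each distinct instruction is stored once, keyed by its IP, and the per-block trace files then contain only the stream of IPs, each resolved against this table at simulation time. The class name and the use of strings for the decoded instruction class are our assumptions, not the simulator's actual data layout.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical hash-file abstraction: one entry per distinct instruction.
    final class InstructionTable {
        // Maps an instruction pointer to its decoded instruction class.
        private final Map<Long, String> table = new LinkedHashMap<>();

        // Store each distinct instruction once, no matter how many blocks
        // execute it; putIfAbsent silently ignores duplicates.
        void put(long ip, String decodedClass) {
            table.putIfAbsent(ip, decodedClass);
        }

        String lookup(long ip) {
            return table.get(ip);
        }
    }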
The trace files generated in the second pass are read by the waiting Java simulation threads. These Java-based simulator threads model the GPU and the memory system. They are responsible for generating the timing information and detailed execution statistics for each unit in the GPU and the memory system.
The NVIDIA Tesla GPU contains a set of TPCs (Texture Processing Clusters), where each TPC contains a texture cache and two SMs (Streaming Multiprocessors). Each SM contains 8 cores (Stream Processors (SPs)), instruction and constant caches, shared memory, and two special function units, which can perform integer, FP, and transcendental operations. A typical CUDA computation is divided into a set of kernels (function calls to the GPU). Each kernel conceptually consists of a large set of computations. The computations are arranged as a grid of blocks, where each block contains a set of threads. NVIDIA GPUs define the notion of a warp, which is a set of threads (typically in the same block) that are supposed to execute in SIMD fashion. We simulate warps, blocks, grids, and kernels in GPUTejas.
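To make this hierarchy concrete, the sketch below decomposes a kernel launch into warps, assuming NVIDIA's standard warp size of 32 threads. The class and field names are illustrative, not GPUTejas's actual data structures.

    final class KernelLaunch {
        static final int WARP_SIZE = 32;  // NVIDIA's standard warp size

        final int gridBlocks;       // number of blocks in the grid
        final int threadsPerBlock;  // number of threads per block

        KernelLaunch(int gridBlocks, int threadsPerBlock) {
            this.gridBlocks = gridBlocks;
            this.threadsPerBlock = threadsPerBlock;
        }

        // Threads are grouped 32 at a time; the last warp of a block may
        // be partially full.
        int warpsPerBlock() {
            return (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;
        }

        int totalWarps() {
            return gridBlocks * warpsPerBlock();
        }
    }

For example, a grid of 128 blocks with 100 threads each yields ceil(100/32) = 4 warps per block, i.e., 512 warps in total.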
We parallelize the simulation by allocating a set of SMs to each thread. In our simulation, each SM has its own local clock for maintaining the timing of instructions. Memory instructions are passed to the memory system, which supports private SM caches, instruction caches, constant caches, shared memory, local memory, and global memory. The important point to note here is that different Java threads do not operate in lock step. This can potentially create issues in the memory system, where we need to model both causality (load-store ordering) and contention. We adopt novel solutions to model these issues.
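One plausible way to realize per-SM local clocks is sketched below: each SM timestamps its memory requests with its own clock value, so the shared memory system can use the timestamps to order loads and stores and to model contention. This is a sketch under our own assumptions, not GPUTejas's actual mechanism.

    import java.util.concurrent.ConcurrentLinkedQueue;

    // A timestamped memory request (hypothetical structure).
    final class MemoryRequest {
        final int smId;        // SM that issued the request
        final long localTime;  // issuing SM's local clock value
        final long address;    // target memory address
        final boolean isStore;

        MemoryRequest(int smId, long localTime, long address,
                      boolean isStore) {
            this.smId = smId;
            this.localTime = localTime;
            this.address = address;
            this.isStore = isStore;
        }
    }

    final class SM {
        private final int id;
        private long clock = 0;  // local cycles; SMs do not run in lock step
        private final ConcurrentLinkedQueue<MemoryRequest> memorySystem;

        SM(int id, ConcurrentLinkedQueue<MemoryRequest> memorySystem) {
            this.id = id;
            this.memorySystem = memorySystem;
        }

        // Advance the local clock and hand a timestamped request to the
        // shared memory system.
        void issueMemoryInstruction(long address, boolean isStore,
                                    int latency) {
            clock += latency;
            memorySystem.add(new MemoryRequest(id, clock, address, isStore));
        }
    }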