Yuyang Li, li.yuyang(at)wustl.edu (A paper written under the guidance of Prof. Raj Jain)
Simulation uses a set of techniques to imitate the operation of a real-world system. It is built on a model of the real-world object. In computer engineering, simulation is one of the three main methods for analyzing computer systems. It requires less time than measuring the real system and can be carried out with computer programs. Gem5 and ESESC are two prominent tools for simulating computer systems. In this paper, their features are studied, validated, and then summarized.
In computer system analysis, the three main methods are modeling, simulation, and measurement. Among these, simulation can be performed at any stage, does not require a large amount of time, and can be done in a computer language. [Jain91]
Since simulation technology first appeared, its techniques and tools have developed rapidly. Different techniques have been invented to satisfy various demands. A comparison is therefore necessary to identify the differences among them and the kind of work each one is suited to.
Simulation is the imitation of the operation of a real-world process or system over time. [Banks01] It is based on a model that represents the key features, behaviors, and functions of the real-world object. The model represents the system itself, while the simulation represents the operation of the system. Simulation is widely used in computer engineering and electrical engineering.
A computer architecture simulator, also known as an architectural simulator, is software that models computer hardware and predicts its performance. The modeled target is flexible: it can be a microprocessor alone or a full system including processor, memory, and I/O devices. Architectural simulation serves the following purposes: evaluating designs without building the real hardware, accessing computer devices that do not yet exist, generating detailed performance data, and quick debugging. A typical example is the multicore computer system, which demands full-system simulation because building and debugging the real hardware can be very difficult and time-consuming. Moreover, with the help of simulation, software development can start before the hardware is ready, which also helps validate the design of the hardware. [Joloboff09]
An emulator can be hardware or software, and allows one system to behave like another. The system running the emulator is called the host, and the system being emulated is called the guest. The emulator allows the host system to run software that was designed for the guest system.
The word "emulator" was coined in 1963 [Emerson95]. At that time it referred only to simulation assisted by microcode or hardware, while emulation performed purely in software was still called "simulation".
Gem5 is a simulation platform for system-level computer architecture and processor microarchitecture. It integrates interchangeable CPU models, a GPU model, a memory system, and multiple instruction set architectures, and it offers full-system capability, multi-system capability, and power modeling. It has since developed into many branches.
A two-factor experiment is used to measure the accuracy of the gem5 simulator. The method is to run the same program on a real hardware system and on the system simulated by gem5, collect the output data, and calculate the differences. The Snowball SKY-S9500-ULP-C01 development kit is chosen as the reference hardware. It is built around the ST-Ericsson A9500, a dual-core ARM Cortex-A9 processor, and runs the Linux kernel. The gem5 model is likewise configured as a dual-core ARM Cortex-A9 processor at 1 GHz running the Linux kernel. The ALP Media, SPLASH-2, and STREAM benchmarks were chosen as the workload because they exercise and assess the performance of multicore architectures. The first benchmark suite includes speech recognition, face recognition, ray tracing, MPEG-2 encoding, and MPEG-2 decoding. In the experiment, the last two applications, the MPEG-2 encoder and decoder, are selected as the tested workloads. From the second suite, eight programs are used: five complete applications (Barnes, FMM, Ocean, Radiosity, and Water-Spatial) and three kernels (FFT, LU, and Radix). Finally, STREAM is a simple program that measures sustainable memory bandwidth and the corresponding computation rate. The descriptions are presented in Table 1.
Table 1. Benchmark Set Description
The measured execution times are shown in Table 2.
Table 2. Execution Time of The Adopted Benchmarks
The results show that the mismatch rate is between 0.47% and 17.94% and varies across workloads. The Radix application is studied further because it has the largest mismatch rate. Different numbers of keys were tested to study the relationship between the number of keys and the mismatch rate. Table 3 shows the results.
Table 3. Execution Time (ET) of Radix Sort Kernel
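The text does not state how the mismatch rate is computed. A minimal sketch, assuming it is the absolute difference between the simulated and the measured execution time, relative to the measured time, is given below. The function names and the dictionary layout are hypothetical and serve only as illustration.

```python
def mismatch_rate(real_time, simulated_time):
    """Relative mismatch in percent, taking the real hardware as reference.

    Assumption: mismatch = |simulated - real| / real * 100.
    """
    return abs(simulated_time - real_time) / real_time * 100.0


def worst_benchmark(times):
    """Return the benchmark name with the largest mismatch rate.

    `times` maps a benchmark name to a (real_time, simulated_time) pair.
    """
    return max(times, key=lambda name: mismatch_rate(*times[name]))
```

Applied to the execution times of a benchmark set, `worst_benchmark` picks the workload that deserves closer study, which is the role Radix plays here.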
Fig.1 shows that the mismatch rate increases with the number of keys. This is because the volume of communication between cores grows as the number of keys increases, which causes more cache misses.
Fig.1 Radix Sort Execution Time Behavior
For the STREAM benchmark, Table 4 shows that the results from the two systems are very close.
Table 4. Memory Bandwidth When Executing Stream Benchmark
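To illustrate what a STREAM-style bandwidth figure means, the sketch below times the Triad kernel, a[i] = b[i] + q * c[i], and divides the bytes moved by the elapsed time. It is only an illustration in Python: the real benchmark is written in C and Fortran, Python lists do not store raw 8-byte doubles, and the function name and the 24-bytes-per-element accounting (two reads and one write of a double) are assumptions made here for the example.

```python
import time


def stream_triad(n, scalar=3.0):
    """Run one Triad pass over n elements and estimate bandwidth in MB/s."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    bytes_moved = 3 * 8 * n  # assumed: read b, read c, write a, 8 bytes each
    return a, bytes_moved / elapsed / 1e6
```

Running the same kernel on the real board and inside the simulated system, and comparing the two bandwidth values, is the kind of comparison Table 4 reports.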
pd-gem5 is developed for simulating parallel/distributed computer systems. [Alian16] Each simulation host runs one or more gem5 processes, each of which simulates a full-system node or a network switch. The structure of a system simulated in pd-gem5 is depicted in Fig.2:
Fig.2 The Structure of a Four-node Computer System Simulation in pd-gem5
For networking, each packet is generated by a simulated NIC and forwarded to a port of the simulated network switch through a TCP socket. The switch then routes the packet to the simulated NIC of the destination node.
The NIC latency model of the traditional gem5 simulation is
$$ \text{latency} = l_{NIC} + \frac{S}{b_{NIC}} $$
where S is the packet size, which is smaller than the maximum transmission unit, and l_NIC and b_NIC stand for the NIC's fixed latency and maximum bandwidth. In pd-gem5, this model is enhanced to capture more precisely the non-linear latency effects caused by diverse packet sizes. Barrier synchronization is implemented to synchronize the simulated nodes in pd-gem5: every simulated node is synchronized at the end of each simulated time quantum, whose length is fixed.
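The quantum-based barrier synchronization can be sketched with threads standing in for simulated nodes. This is not pd-gem5 code: the function name, the use of Python's `threading.Barrier`, and the log structure are assumptions chosen to show the idea that no node may start the next quantum until every node has finished the current one.

```python
import threading


def run_nodes(num_nodes, quantum_ns, total_ns):
    """Advance num_nodes simulated clocks in lockstep, one quantum at a time."""
    barrier = threading.Barrier(num_nodes)
    local_time = [0] * num_nodes
    log = []  # (simulated time reached, node id), in the order it happened
    lock = threading.Lock()

    def node(node_id):
        while local_time[node_id] < total_ns:
            # Process the local events of one quantum (omitted in this sketch).
            local_time[node_id] += quantum_ns
            with lock:
                log.append((local_time[node_id], node_id))
            # Wait here until every node has reached the end of this quantum.
            barrier.wait()

    threads = [threading.Thread(target=node, args=(i,)) for i in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local_time, log
```

Because of the barrier, every entry for one quantum appears in the log before any entry for the next one, so the simulated clocks of the nodes never drift apart by more than one quantum.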
To run the simulation, hosts each consisting of a quad-core Intel Xeon processor, 2x8GB DDR3-1600 DIMMs, and an Intel Ethernet NIC are used. A second group of machines, each with an AMD quad-core processor, 1x8GB DDR3 DIMM, and a Realtek PCIe Ethernet NIC, is used to validate the model. Systems of 2 to 24 nodes are evaluated with a star network topology, running the MPI implementation of the NAS benchmarks.
Fig.3 compares pd-gem5, gem5, and measurements on an AMD quad-core processor. The pd-gem5 curve shows a non-linearity similar to that of the measured curve and fits it well. In contrast, the original gem5 result is more linear and deviates more from the measured result. In this way, the pd-gem5 model is validated.
Fig.3 Comparison of Packet Round Trip Latency
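The near-linear curve of the original gem5 model is what a NIC model with a fixed latency plus a serialization term would produce. A minimal sketch, assuming the form latency = l_NIC + S / b_NIC, follows. The function name, the units, and the parameter values in the usage note are hypothetical.

```python
def nic_latency_ns(packet_bytes, fixed_latency_ns, bandwidth_gbps):
    """Linear NIC latency: fixed part plus time to push the packet onto the wire.

    1 Gbit/s equals 1 bit per nanosecond, so bits / Gbit/s yields nanoseconds.
    """
    return fixed_latency_ns + packet_bytes * 8 / bandwidth_gbps
```

With a fixed latency of 1000 ns and a bandwidth of 1 Gbit/s, each additional 500 bytes adds the same 4000 ns. A real NIC does not behave this uniformly across packet sizes, which is the non-linearity pd-gem5 is enhanced to capture.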
Fig.4 shows the speedup of pd-gem5 running on multiple simulation hosts over pd-gem5 running on a single simulation host. The number of simulation hosts equals the number of simulated nodes divided by the number of cores per simulation host, that is, 0.25 times the number of simulated nodes, because each simulation host has four cores. For example, 2, 4, and 6 simulation hosts are used for 8, 16, and 24 simulated nodes. The speedup is the simulation time on a single simulation host divided by the simulation time on multiple simulation hosts.
When the number of simulated nodes is smaller than the number of (active) threads supported by a simulation host, pd-gem5 on a single simulation host performs slightly better than on multiple simulation hosts, because the synchronization overhead is lower (on-chip versus off-chip interconnect communication). However, running the NAS benchmarks on 2, 4, and 6 simulation hosts yields 1.2x, 1.6x, and 3.2x higher geometric-mean performance for 8, 16, and 24 simulated nodes than running them on a single simulation host. Note that the microbenchmark fails to complete in a reasonable amount of time when a single simulation host attempts to simulate a 24-node computer system because the simulation is extremely slow; thus there is no associated speedup point in Fig.4.
Fig.4 Simulation Speedup of Using Multiple Simulation Hosts over a Single Simulation Host
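The arithmetic behind the speedup figure can be sketched as follows. The helper names are hypothetical; the host count follows the rule of one simulated node per host core, and the geometric mean is the usual way to average ratios such as speedups across benchmarks.

```python
import math


def hosts_needed(simulated_nodes, cores_per_host=4):
    """Number of simulation hosts when each host core runs one simulated node."""
    return math.ceil(simulated_nodes / cores_per_host)


def speedup(single_host_time, multi_host_time):
    """Simulation time on one host divided by the time on multiple hosts."""
    return single_host_time / multi_host_time


def geometric_mean(values):
    """Geometric mean of positive values, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For 8, 16, and 24 simulated nodes and four cores per host, `hosts_needed` gives 2, 4, and 6 hosts, matching the configuration described above.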
Gem5-gpu is a simulator that models integrated CPU-GPU systems. It is based on the gem5 and GPGPU-Sim simulators. It is designed to simulate both systems with coherent caches that use a single virtual address space across the CPU and GPU, and systems with separate caches. Gem5-gpu adopts SLICC, gem5's cache coherence language, and adds a series of heterogeneous cache protocols. The MOESI protocol is used in the newly added models. The gem5-gpu simulator supports most unmodified CUDA 3.2 source code and allows the CPU and GPU to work simultaneously.
Fig.5 shows that, compared with the real system, gem5-gpu differs by less than 22% in most cases, and its performance is tightly correlated with that of the GPGPU simulation. In this way, the simulator is validated.
Fig.5 Runtimes Normalized to NVIDIA GTX 580
Besides gem5, ESESC is also a widely used simulator for computer systems. ESESC is an open-source, very fast simulator that models heterogeneous multicores (out-of-order, in-order, and GPUs) with detailed performance, power, and thermal models. [esesc] The simulator models the ARM ISA, supports time-based sampling, and can give detailed power and temperature reports.
ESESC is known for its speed, which it achieves through time-based sampling [Ehsan13]. Fig.6 compares it with single-threaded simulators without sampling and with multithreaded simulators: the speed of ESESC decreases slightly as the number of cores increases, but its simulation speed (in MIPS) remains clearly above that of the other simulators. Together with its detailed reports on physical characteristics (power and temperature), this makes ESESC widely applicable in its field.
Fig.6 Simulation Speed for Different Simulators
A simple example experiment is carried out. Computer systems with different L1 cache sizes, with and without an L2 cache, are simulated, and the IPC is used as the performance metric. The details are shown in Table 5.
Table 5. The Details of the Experiments
The results are as follows:
Table 6. The Results of the Verification Experiment
The ANOVA table is as follows:
Table 7. The ANOVA Table of the Experiment
The table shows that the effect of the cache size is significant, while the effect of the L2 cache is insignificant.
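The ANOVA computation itself is not spelled out. A minimal sketch, assuming a two-factor full factorial design without replication (rows for the cache size levels, columns for the L2 cache being absent or present), is shown below. The function name and the returned keys are hypothetical, and no data from the experiment is reproduced.

```python
def two_factor_anova(y):
    """ANOVA for y[i][j]: level i of factor A (rows), level j of factor B (columns).

    Assumes one observation per cell, so the interaction is treated as error.
    """
    a, b = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (a * b)
    row_means = [sum(row) / b for row in y]
    col_means = [sum(y[i][j] for i in range(a)) / a for j in range(b)]

    ssa = b * sum((m - grand) ** 2 for m in row_means)  # variation due to A
    ssb = a * sum((m - grand) ** 2 for m in col_means)  # variation due to B
    sst = sum((y[i][j] - grand) ** 2 for i in range(a) for j in range(b))
    sse = sst - ssa - ssb                               # unexplained variation

    mse = sse / ((a - 1) * (b - 1))
    return {
        "ssa": ssa, "ssb": ssb, "sse": sse, "sst": sst,
        "f_a": (ssa / (a - 1)) / mse,  # F-ratio for factor A
        "f_b": (ssb / (b - 1)) / mse,  # F-ratio for factor B
    }
```

A factor is judged significant when its F-ratio exceeds the tabulated F-value for the corresponding degrees of freedom at the chosen confidence level.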
The linear regression model of the IPC and cache size is
Moreover, increasing the cache size does decrease the miss rate and helps to improve performance. The result is therefore valid and consistent with real-world behavior.
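A linear regression of IPC on cache size can be obtained by ordinary least squares. The sketch below shows the computation in general form; the function name is hypothetical and the fitted coefficients of the experiment are not reproduced here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

Here `xs` would hold the cache sizes and `ys` the measured IPC values; a positive slope corresponds to IPC improving as the cache grows.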
Computer system simulation is becoming more and more important, both for its convenience, since it does not require the real system, and for its reliability, which has been validated by several experiments. Simulators are also branching into more and more types to fit different demands.
The gem5 simulator can simulate different CPU and GPU models and many computer systems, and it can be customized to the individual requirements of a simulation. Derived simulators such as pd-gem5 and gem5-gpu have been designed on this basis. Pd-gem5 is intended for simulating parallel or distributed systems. It can run on several hosts at the same time and is suitable for simulating networked systems. Gem5-gpu combines gem5 CPU simulation and GPGPU simulation to simulate the CPU and GPU together. It has an independent interface between gem5 and the GPGPU simulator and gives results very close to measurements of the real system.
The ESESC simulator excels at high-speed simulation. It can simulate multicore systems, in-order and out-of-order processors, and different kinds of caches. Compared with instruction-based simulators such as SESC, it is based on time-based sampling (TBS) and generates output data faster than some other simulators. The data is detailed and includes power and temperature calculations. However, it is limited to the ARM instruction set architecture, and a current problem is that it cannot handle multithreaded workloads.
These simulators thus differ in their features, and each has its own advantages and disadvantages. There is probably no simulator that is perfect for all tasks, but the development and evolution of computer simulation techniques and tools is evident. Current techniques have already made a great difference and have made simulation an important step in computer system analysis.