How many instructions does a processor need to execute to get through the first level of Doom? About 2.7 billion, in my case. Over the course of the several minutes it took me to stumble through the level, it rendered about 1080 frames, each of which required about 2.5 million instructions to produce. Another 700 million instructions went into initializing the game: loading textures, setting up data structures, etc.

I ported Doom to my GPGPU processor recently. While it doesn’t use advanced hardware features of this architecture like hardware multithreading, SIMD, or even floating point, it is a good test of the toolchain and libraries, being substantially larger than other test programs I’ve run. It did shake out some good bugs.

I forgot how much I sucked at this game

The nice thing about the instruction set simulator (emulator) is that I can instrument it at a fairly fine-grained level. Here is a breakdown of instructions by type:

load        22.71%
store       10.67%
branch      11.81%
arithmetic  54.45%

Interestingly, this matches the instruction profile of the 3D teapot renderer I wrote to within a few percent.

I would have expected a program designed for a different generation of hardware to have a different instruction profile (for example, Doom leans heavily on lookup tables, which I’d presume would weight it more towards memory loads). However, that doesn’t seem to be the case. Perhaps the instruction distribution is influenced more by the compiler than by how the program itself is written.
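
Collecting that kind of profile is easy in an instruction set simulator: bump a per-category counter on every instruction the dispatch loop executes and print percentages at the end of the run. Here is a minimal sketch of the idea in C; the decoder, opcode layout, and names below are made up for illustration and are not the actual Nyuzi emulator code.

/* Sketch of per-category instruction counting in an ISA simulator.
   The encoding here is invented: pretend the top two bits of each
   instruction word select its class. */
#include <stdio.h>

enum insn_class { CLASS_LOAD, CLASS_STORE, CLASS_BRANCH, CLASS_ARITH, NUM_CLASSES };

static const char *kClassNames[NUM_CLASSES] = { "load", "store", "branch", "arithmetic" };
static long long gCounts[NUM_CLASSES];

/* Stand-in decoder; a real emulator would classify from its actual decode logic. */
static enum insn_class classifyInstruction(unsigned int word)
{
    switch (word >> 30) {
    case 0: return CLASS_LOAD;
    case 1: return CLASS_STORE;
    case 2: return CLASS_BRANCH;
    default: return CLASS_ARITH;
    }
}

/* Called once per executed instruction from the main interpreter loop. */
static void countInstruction(unsigned int word)
{
    gCounts[classifyInstruction(word)]++;
}

int main(void)
{
    /* Fake instruction stream, just to exercise the counters. */
    unsigned int fakeProgram[] = { 0x00000001, 0x40000002, 0x80000003, 0xC0000004, 0xC0000005 };
    long long total = 0;

    for (int i = 0; i < 5; i++)
        countInstruction(fakeProgram[i]);

    for (int i = 0; i < NUM_CLASSES; i++)
        total += gCounts[i];

    for (int i = 0; i < NUM_CLASSES; i++)
        printf("%-10s %6.2f%%\n", kClassNames[i], 100.0 * gCounts[i] / total);

    return 0;
}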

The simulator gets around 8 frames per second on my Core i5 laptop, which works out to around 20 million instructions per second on average (8 frames/s times roughly 2.5 million instructions per frame). The simulator isn’t optimized; I’ve mostly intended it as a reference for co-verification of the hardware model, so that’s not too bad, I guess.

However, it’s a speed demon compared to the cycle-accurate Verilog simulation, which shlumps along at the equivalent of 73 kHz on my laptop. This is no fault of Verilator, the open-source tool I’m using to compile the model, which is actually very fast relative to other simulators; it’s just simulating the model at a very high level of detail. I managed to get it to initialize and render the first frame of the level over the course of an hour and 20 minutes. During that time, it executed around 353 million clock cycles:

total cycles          353,708,911
l2_writeback               88,342
l2_miss                   121,008
l2_hit                  4,505,554
store rollback count    2,200,227
store count             6,519,751
instruction_retire     77,984,252
instruction_issue      93,883,016
l1i_miss                    2,133
l1i_hit               158,202,234
l1d_miss                  304,961
l1d_hit                 9,045,137
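
As a rough sanity check, that cycle count squares with the quoted simulation speed: 353,708,911 cycles over about 80 minutes of wall time is

353,708,911 cycles / 4,800 s ≈ 74,000 cycles per second

which is in the neighborhood of the 73 kHz figure above (the wall time is only approximate).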

The performance is pretty crummy. It issues instructions on only about 26% of clock cycles; the rest of the time it sits idle, presumably waiting on memory. Of the instructions that do issue, only about 83% retire; the rest get rolled back because they were issued speculatively and couldn’t complete.
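
Those percentages fall straight out of the counters above:

issue rate  = instruction_issue / total cycles        = 93,883,016 / 353,708,911 ≈ 26.5%
retire rate = instruction_retire / instruction_issue  = 77,984,252 / 93,883,016  ≈ 83.1%
IPC         = instruction_retire / total cycles       = 77,984,252 / 353,708,911 ≈ 0.22

The last line is the overall retired-instructions-per-cycle figure for this run, including initialization.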

I’d estimate it would get a little over 5 frames per second running on an FPGA at 50 MHz. This architecture depends on hardware multithreading to hide latency and otherwise makes little effort to minimize it; it’s designed for highly parallel workloads. If the game used a real 3D renderer, those threads could be put to work rendering the scene in parallel.
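
One way to sanity-check that estimate against the counters above: at roughly 0.22 retired instructions per clock, a 50 MHz clock retires about 11 million instructions per second, and at roughly 2.5 million instructions per frame that works out to around 4 to 5 frames per second. If steady-state rendering does a bit better than this first-frame run (which includes all of the initialization), a little over 5 frames per second seems plausible.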