Most things don’t start out complicated. In fact, paradoxically, sometimes the things that start the simplest end up accumulating the most complexity over time. The process of loading code onto the FPGA version of the GPGPU I’ve been working on has become kind of messy, and I’ve been considering how to clean it up. The challenge with a project like this is that it highly unconstrained. Much like deciding what to have for dinner, the number of choices can be overwhelming. However, simplicity and ease of setup are a high priority for me, so that’s at least a useful guiding criterion.

There currently isn’t an operating system for this processor, so programs are self contained. Code at the entry of the program initializes the stack pointer(s) and sets up C runtime. The toolchain for this processor outputs statically linked executables in the ELF format. These are processed by a custom tool ‘elf2hex’ that expands the file to its in memory layout. ELF files are designed to be loaded by operating systems, and have a header with information about the executable including the entry instruction address. Since the processor begins executing instructions at address 0 on boot and doesn’t have logic to decode the header, the elf2hex tool inserts a jump instruction at address 0 to the entry point (clobbering part of the unused header).

This project spent its early life running only in simulation. Booting in this environment is trivial using a built-in function called $readmemh provided by Verilog. This function initializes a memory array from the file created by elf2hex. When the processor starts, the code is magically in the proper place in memory. This isn’t too far off from how this would work as a coprocessor, since a host would be doing this on its behalf. However, the FPGA configuration is a standalone processor, and this has become a more interesting use case in its own right.

The scheme described above also works on FPGA when using the on-chip block RAM as system memory. When the synthesis tool encounters $readmemh, it will initialize SRAM with the file’s contents. This can also be done using most vendor’s clunky 90’s style GUI to configure a memory initialization file. However, this means having to resynthesize the design every time the code changes. I learned to program after the mainframe era, so getting something right on the first try (or the tenth) isn’t really my style. After suffering through hour recompilations a few dumb mistakes in a row, I was ready to try something different.

Fortunately, someone else had already solved the problem for me. Brian Swetland built a JTAG stub for his CPU, as well as a tool that allows loading the image over USB using the USBBlaster hardware that is built into Altera’s dev board–without the use of TCL (which I object to on humanitarian grounds). The tool asserts the reset signal, loads the image into SRAM, and then deasserts reset. This worked pretty well, allowing me to run various test programs. However, when I subsequently ported a C++ compiler, the test programs predictably became too big to fit in on-chip SRAM.

I already had implemented an SDRAM controller because there isn’t enough on- chip memory to hold the VGA framebuffer. The logical next step was to load the program code into SDRAM. This adds complexity that doesn’t exist with the SRAM loader. SDRAM is stateful and operates in bursts. The JTAG tool allows loading code into the board many times without needing to reload the FPGA bitstream. This can also occur at any time, even while a SDRAM burst is active. This would leave the SDRAM controller in an undefined state if it weren’t also reset at the same time. The JTAG tool also needs to keep the processor in reset the until code is fully loaded to ensure it doesn’t prematurely start executing partially loaded code. While the image is loading, the processor needs to remain in reset, but the SDRAM must not be. This was not an issue with SRAM, because internal block ram is dual ported and the JTAG loader used the second port. This worked fine while everything else was held in reset; SRAM doesn’t even have a reset signal. That isn’t an option for SDRAM.

One solution would be to create two reset lines. One would control SDRAM controller and memory subsystem. The other would control the processor core. During boot, the host would first assert both reset lines to put everything into a known state. It would then deassert the reset line to the memory subsystem and leave the processor in reset. After loading the code, it would deassert processor reset and start execution. The SDRAM controller uses the AXI bus protocol, so the JTAG interface would need to be able to speak that to the SDRAM controller. This would also require bypass logic to let it take over from the AXI interconnect.

That seemed like a lot of work. Instead, I wrote a first stage bootloader that reads the image over the serial port into SDRAM. A loader program, which runs on a host machine, reads an ELF file (rather than the flattened hex file described previously), sends commands over the serial port to load data into ranges, clear ranges (for BSS segments), and jump to an entry point. For now, this first stage loader is loaded using the JTAG mechanism described above. This makes running programs a tedious two step process, but I assumed I would just convert this SRAM into a ROM and add a reset button on the board.

There were a few nice things about this setup. The serial loader was faster than JTAG. Also, using a serial loader is more portable across different FPGA devices (Altera has apparently changed the protocol for their USB blaster in newer revisions). However, making the emulator and Verilog simulator environments consistent with FPGA would add the complexity of needing to simulate the serial loader. This would be especially painful while debugging test programs, because it would mean having to wade through a slow copy loop before getting to the actual test program.

The other issue with using SDRAM is that the memory map between the emulator and FPGA environments is different. The emulator assumes a single flat address space starting at address 0. In the FPGA environment, SRAM and SDRAM need to be at different memory addresses. In the current configuration, SRAM starts at address 0 and SDRAM at 0x10000000. With the new boot scheme, I need to recompile the libraries and applications differently for emulation and FPGA and this currently involves manually changing the places in code where addresses are hard coded, including stack, heap, and framebuffer addresses. It would be straightforward to make the emulator mimic the memory layout of the FPGA environment, but I would still have to emulate the two step loading process required on FPGA.

Another alternative would be to move the bootloader to high memory and put SDRAM at address 0. In the FPGA configuration, a bootloader would load the image over serial (or from some other media). In the emulator situation, the emulator itself would load the image to address 0 and jump directly to it. From the program’s perspective, it would look the same once it started executing.

Although each of these solutions is fairly straightforward, I haven’t found one that I’m happy with. I think I have a few basic goals:

  1. The three test environments (Verilog simulation, emulation, and FPGA) should be similar enough to be transparent to the code that is executing.
  2. The process of loading code onto FPGA should be fast and easy: ideally one button click.
  3. The mechanism should be simple, easy to understand, and not hamper debugging software or hardware.

I have a tendency to overcomplicate things at times, so I have a nagging feeling that there is a simpler solution that I’m overlooking because of a bad assumption somewhere.

Update: a few former colleagues gave some good ideas. I ended up making the boot address configurable. The FPGA starts execution in a small boot ROM located in upper memory just below the register space. It reads a program over the serial port to the beginning of memory (I abandoned JTAG for now). I used one of the pushbuttons as a hardware reset on the FPGA board. The Verilog simulator and emulator boot to address 0, where they have already loaded the program. Since all environments assume the binary is statically linked at address 0, I can run the same binaries unmodified in any of them.