(2/2) Adventures in RISC-V - the Angry Goose Initiative
Part 2 of 2 - see part 1 for context!
Once I was able to wrap up my work on the SBI, it was time to sink my teeth into the RTL side of the project, known as LETC. Since the FPGA implementation was just starting to get off the ground, there was a lot more work to do.
Despite having gotten my hands dirty with Verilog a bit in the past, I had done next to nothing with SystemVerilog, so I definitely had to learn the ropes. Thankfully John had prepared a nice introduction for those of us new to the language to work through. This proved to be a great entry point for getting my head around the ins and outs of SystemVerilog, as well as the workflow and toolchain that we would be using for the project.
My first task was to implement part of the execute stage. LETC’s architecture consists of a 6-stage pipeline with instruction and data TLBs, at least two levels of cache, and a two-part execute stage.
One of the lovely things about SystemVerilog is the use of packed structs to specify complex interfaces between modules. This cleans up the code considerably as opposed to vanilla Verilog, and also makes it much easier to wire up different blocks. The execute stage would connect to four different blocks:
To the data TLB for load/store operations
To the decode stage, for receiving the decoded signals for a given instruction
To the second execute stage, to continue the execution
To the hazard mitigation logic (known as TGHM), so that the stage can be stalled/flushed as required.
At the time, neither TGHM nor the DTLB and its associated interface had been implemented. An ALU had already been implemented, but not integrated into the stage. So the main work to do on this stage was to write the combinational logic that takes the control signals from the decode stage and TGHM and ensures that the right source data is placed at the operands of the ALU and that the result is placed in the correct destination. Bypass signals would also be added so that data can be forwarded from other stages to accommodate hazard mitigation.
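To make the operand selection concrete, here is a rough behavioural sketch in Python rather than the actual SystemVerilog. The field names in `DecodedOp`, the 32-bit datapath width, and the small set of ALU operations are all assumptions made for illustration, not LETC's real interface.

```python
from dataclasses import dataclass
from typing import Optional

MASK32 = 0xFFFFFFFF  # assumption: 32-bit datapath


@dataclass
class DecodedOp:
    """Hypothetical subset of the signals arriving from the decode stage."""
    rs1_val: int
    rs2_val: int
    imm: int
    use_imm: bool  # select the immediate instead of rs2 as the second operand
    alu_op: str    # "add", "sub", "and", "or" or "xor"


def alu(op: str, a: int, b: int) -> int:
    """Tiny stand-in ALU for illustration."""
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "xor":
        result = a ^ b
    else:
        raise ValueError(f"unsupported op: {op}")
    return result & MASK32


def execute(d: DecodedOp,
            bypass_rs1: Optional[int] = None,
            bypass_rs2: Optional[int] = None) -> int:
    """Operand muxing: forwarded data wins over the register read,
    then the immediate select picks the second ALU operand."""
    a = d.rs1_val if bypass_rs1 is None else bypass_rs1
    b_reg = d.rs2_val if bypass_rs2 is None else bypass_rs2
    b = d.imm if d.use_imm else b_reg
    return alu(d.alu_op, a, b)
```

In this model a forwarded value is represented by passing `bypass_rs1` or `bypass_rs2`; in hardware that would be a bypass-valid signal driving a mux select.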
The first step to developing any digital logic block is to make a testbench, so that’s what I did. My philosophy for working on complicated projects is to start with the simplest possible components, and then add and test one feature at a time until all of the goals are reached. So once I was able to compile and run my testbench, I connected the ALU directly to the register inputs from decode (no muxing yet) and hardwired an add operation, so that the ALU simply added the two operands together every cycle. This worked:
Once I verified this was working, it was simply a matter of putting in the right muxing and interface connections so that everything worked as expected. Not a particularly challenging task, but valuable in getting comfortable with contributing RTL to this project.
After my changes were committed, John was able to wire the stage up to the other stages and execute some simple instructions:
Once my work on the execute stage was wrapped up, it was time to dive into my next task, which would prove to be the more challenging of the two blocks that I worked on throughout the term: the cache. This was especially fun because there were still many design decisions left to be made about how the cache would take shape.
Having not really thought very deeply about cache architecture since taking ECE 222 the previous year, I had to brush up on the theory of operation and terminology to understand the questions and challenges at hand. After some extensive brainstorming with John, we arrived at a cache design with the following specs:
Parameterized depth and line length for easy optimization experiments
Direct mapped
Implemented in LUTRAM rather than BRAM for ease of implementation. This could be changed later on if desired.
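For a direct-mapped cache with parameterized depth and line length, the address splitting can be sketched as below. This is a Python model under my own assumptions (byte addressing, 4-byte words, power-of-two parameters, and the default sizes shown), not the LETC RTL; `split_address` is a hypothetical helper name.

```python
def split_address(addr: int, depth: int = 64, words_per_line: int = 8,
                  bytes_per_word: int = 4):
    """Split a byte address into (tag, index, word offset, byte offset).

    depth          : number of cache lines (assumed power of two)
    words_per_line : line length in words  (assumed power of two)
    bytes_per_word : word size in bytes    (assumed power of two)
    """
    for value in (depth, words_per_line, bytes_per_word):
        if value <= 0 or value & (value - 1):
            raise ValueError("parameters must be powers of two")

    # Number of address bits consumed by each field.
    byte_bits = bytes_per_word.bit_length() - 1
    word_bits = words_per_line.bit_length() - 1
    index_bits = depth.bit_length() - 1

    byte_off = addr & (bytes_per_word - 1)
    word_off = (addr >> byte_bits) & (words_per_line - 1)
    index = (addr >> (byte_bits + word_bits)) & (depth - 1)
    tag = addr >> (byte_bits + word_bits + index_bits)
    return tag, index, word_off, byte_off
```

Because the cache is direct mapped, `index` selects exactly one line, and a hit is simply a match between the stored tag for that line and `tag`. Changing `depth` or `words_per_line` only moves the field boundaries, which is what makes optimization experiments cheap.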
There was also some thinking to do on how the cache’s refill FSM would handle refilling the cache lines when a miss occurs, and when to indicate to the associated stage that the data is valid for access. After some discussion with John, I arrived at the following behaviour:
When a memory access is attempted, the associated stage will wait for a “valid” signal, as per LETC’s internal memory protocol, LIMP. If there is a cache miss, the cache will first access that block of memory and load each word into the cache line before the valid signal is raised.
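The timing of the valid signal can be illustrated with a small cycle-by-cycle Python model. It assumes word addressing, a backing memory that returns one word per cycle, and a simple counter to track refill progress; the class and method names are hypothetical, and the real LIMP handshake is more involved than this.

```python
class RefillModel:
    """Behavioural model: valid is only raised once the whole line is loaded."""

    def __init__(self, memory, depth=4, words_per_line=4):
        self.mem = memory                   # backing store, word addressed
        self.depth = depth
        self.wpl = words_per_line
        self.lines = [[0] * words_per_line for _ in range(depth)]
        self.tags = [None] * depth          # None means the line is invalid
        self.refill_count = None            # None means the FSM is idle

    def cycle(self, word_addr):
        """Advance one clock with the stage holding word_addr steady.

        Returns (valid, data); data is None while valid is low.
        """
        line_addr = word_addr // self.wpl
        index = line_addr % self.depth
        tag = line_addr // self.depth

        if self.refill_count is None:       # idle: check for a hit
            if self.tags[index] == tag:
                return True, self.lines[index][word_addr % self.wpl]
            self.tags[index] = None         # miss: invalidate, start refill
            self.refill_count = 0
            return False, None

        # Refilling: fetch one word of the line per cycle.
        base = line_addr * self.wpl
        self.lines[index][self.refill_count] = self.mem[base + self.refill_count]
        self.refill_count += 1
        if self.refill_count == self.wpl:   # last word written
            self.tags[index] = tag
            self.refill_count = None
        return False, None
```

With a 4-word line, a miss in this model keeps valid low for five calls (one to detect the miss, four to load the words) and raises it on the sixth; a later access to another word in the same line hits immediately.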
John came up with a clever approach for the refill circuitry: a shift register. Essentially, the LUTRAM would have “word enable” signals that allow only one particular word at a given address to be written. This way, we could connect the data bus in parallel to all of the words for that address, and the shift register would enable the words one at a time as the block of memory is accessed.
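The shift-register idea can be modelled in a few lines of Python. This is only a sketch of the concept: `WordEnableShifter` and `refill_line` are hypothetical names, and the one-hot pattern starting at word 0 is my assumption about the ordering.

```python
class WordEnableShifter:
    """One-hot shift register producing per-word write enables."""

    def __init__(self, words_per_line):
        self.n = words_per_line
        self.enables = 0                 # all enables low when idle

    def start(self):
        self.enables = 1                 # enable word 0 first

    def shift(self):
        # Move the single hot bit up by one; it falls off the end when done.
        self.enables = (self.enables << 1) & ((1 << self.n) - 1)

    @property
    def done(self):
        return self.enables == 0


def refill_line(line, burst):
    """Write each incoming word into the one slot whose enable bit is set.

    The same data value is presented to every word slot (the shared bus);
    only the enabled slot latches it.
    """
    shifter = WordEnableShifter(len(line))
    shifter.start()
    for data in burst:
        for i in range(len(line)):
            if (shifter.enables >> i) & 1:
                line[i] = data
        shifter.shift()
    return shifter.done
```

For a 4-word line the enables step through 0b0001, 0b0010, 0b0100, 0b1000 and then 0, and that final all-zero state is a convenient "refill complete" indication.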
I started by implementing the shift register and testing it separately. I also had to create a simple address counter that could be loaded with the desired memory address, so the cache could independently access a consecutive block of memory locations without intervention from the core. While I was working on that, John put together the main structure of the cache, including the address splitting and the LUTRAM. I was then able to incorporate the shift register, counter, and the refill FSM. After much trial and error, we were able to see the cache being properly filled upon a miss, and the valid signal being raised at the right time once the refill was complete:
This is more or less where the project stands now. There are a lot more details to cover in the implementation, as well as future plans, but I’ll leave that for another post.
This was a super fun project to work on, and my first time working in SystemVerilog. I definitely have a taste for RTL design now, and look forward to doing more of it on future co-ops and personal projects. More updates to come!
If you would like to review my RTL work on the project, commits can be found here and here.