# CHIP-8 on an FPGA

An introduction to CHIP-8 and to creating a working CHIP-8 implementation for use on FPGAs. 27 Aug 2020

# Introduction

## What is CHIP-8

CHIP-8<sup>1</sup> is a binary, interpreted programming language created in the mid-70s so that programs could run independently of clock speeds and other architecture-specific challenges that were a problem at the time.

In the late 80s and early 90s, CHIP-8 became popular on devices like programmable calculators because so little is required to write an interpreter. Simplified CHIP-8 editions of many games were also made at this time, for example PONG, Pac-Man and Tetris.

Even now, people use CHIP-8 to learn how to write an emulator<sup>2</sup>, often implementing it in languages like Python<sup>3</sup>.

## What is an FPGA

A Field Programmable Gate Array is an integrated circuit that contains programmable logic blocks with a reconfigurable interconnect, and it can be reconfigured at any time.

This is useful for testing implementations of digital integrated circuits such as microprocessors, and for circuits where it is useful to be able to change the logic at any time. Accelerator cards in servers are a common use case for FPGAs these days, because the accelerator can keep being developed alongside the software even after it has been deployed to production.

## What is an HDL

A Hardware Description Language is a specialized type of programming language that provides an abstraction for describing the logic of digital and analogue circuits. Most HDLs work at what is referred to as the register-transfer level<sup>4</sup>. This makes it possible to create complicated circuits and designs where the various pieces of logic all run in parallel.

Today's most widely used HDLs are Verilog and VHDL. The big difference is that VHDL has a strong type system that requires more precise definitions in implementations. Because of this, VHDL will not convert bit sizes automatically, which typically makes VHDL code longer and more complicated.

HDL code can be synthesized into logic gates and a description of how they are connected (a netlist). This can be used to design an integrated circuit or to configure an FPGA. For an FPGA, the vendor software converts and optimizes the netlist into a proprietary bitstream format that describes the logic blocks and interconnects. This is the closest we get to what can be called a program for an FPGA.

# Architecture

One of the most important things to get in order before doing anything related to implementation is defining the architecture. This consists of defining the parts of the implementation, its modules and how they are connected. In Verilog, everything is divided into modules, which can be compared to components in a circuit. Modules are connected with wires, and wires allow a module to access the registers and logic functions of other modules.

High-level overview for the architecture of the FPGA implementation.

The architecture has been split into these modules with the following tasks:

• Timer - Responsible for generating the different clock pulses at the different frequencies required

• CPU - Also known as the microprocessor. It reads instructions from memory and executes them, in addition to controlling the GPU.

• CPU BCD - A sub-module of the CPU that is responsible for converting an 8-bit number to three 4-bit numbers representing the 3 digits of the input number.

• CPU RNG - Pseudo-random number generator that creates a new number for each clock cycle based on an algorithm where numbers in the sequence have no visible connection.

• Memory - Holds the memory for the microprocessor and a separate video memory area (VRAM) that the VGA signal generator can read from. In addition, the font set and the ROM are pre-programmed into the memory.

• GPU - Has the task of copying in sprites and clearing video memory area on behalf of the microprocessor.

• VGA signal generator - Generates a valid VGA signal with data it reads from the video memory.

• Keypad - Responsible for reading a 4×4 matrix keypad and turning it into a 16-bit register for the microprocessor to access.

# Implementation

## Picking the HDL

After going through the differences between Verilog and VHDL, I came to the conclusion that even though I would have preferred VHDL due to its more precise code, its ecosystem was not as big as I would like. Verilog has several open simulators such as Icarus Verilog and Verilator, while VHDL only had GHDL at the time of writing.

The available tools were critical to me in terms of being able to test and make changes quickly, so I picked Verilog for this project.

## Testing

To simulate test benches, which are a form of automated testing, I have used Icarus Verilog. It is an open source project that can perform event driven simulation.

Event driven simulation gives me the ability to create cycle delays in test benches without writing my own logic for this. This is used to wait a given number of cycles before checking that the result is what the test bench expects. This is not possible in other simulators such as Verilator, which uses a cycle simulation model.

The repo has a Makefile which can run the simulation of all the test modules in the tests/ directory. There are various unit tests which contain assertions.

Unit test which starts a PONG game and dumps the output of the framebuffer in the terminal.

## Modules

### Timer

The timer circuit is one of the most critical parts of the circuit and has the task of creating different pulses for the different parts of the circuit.

In this circuit 3 different clock pulses are needed:

• Pulse for microprocessor - 500Hz

• Pulse for delay & sound timer (register in microprocessor) - 60Hz

• Pulse for VGA signal generator - 25MHz

To generate the pulses, I wrote a simple module (see src/timer.v while reading). It works by calculating the number of system clock pulses required to reach the frequency for the various timers in advance during the synthesis of the circuit.

Each timer has a 32-bit register that it decrements by one for each positive edge it gets from the clock input. When it reaches zero, it is reset to the pre-calculated top value at the same time as the pulse output goes high. Because of this, the high time of the pulses the timer generates is equal to one period of the 50MHz system clock. Internally in this circuit, this does not create any noticeable problems.

It is important that the input clock frequency divides as evenly as possible into the pulse frequencies so that the timers are as accurate as possible. The built-in clock on the DE0-Nano FPGA I have obtained runs at 50MHz. This means that the VGA timer output is only triggered for every second pulse the timer receives from the built-in clock.
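The counting scheme described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the contents of src/timer.v: the names `reload_value`, `PulseTimer` and `count_pulses` are made up for the illustration, and reloading with "cycles per pulse minus one" is an assumption about how the top value is calculated.

```python
SYSTEM_CLOCK_HZ = 50_000_000  # DE0-Nano on-board clock, as stated in the text


def reload_value(target_hz, clock_hz=SYSTEM_CLOCK_HZ):
    """Top value of the down-counter (assumed: cycles per pulse minus one)."""
    return clock_hz // target_hz - 1


class PulseTimer:
    """Down-counter that raises its output for one system clock cycle."""

    def __init__(self, target_hz, clock_hz=SYSTEM_CLOCK_HZ):
        self.top = reload_value(target_hz, clock_hz)
        self.counter = self.top
        self.pulse = False

    def tick(self):
        """Model one positive edge of the system clock."""
        if self.counter == 0:
            # Reload and raise the pulse output in the same cycle.
            self.counter = self.top
            self.pulse = True
        else:
            self.counter -= 1
            self.pulse = False
        return self.pulse


def count_pulses(timer, cycles):
    """Number of output pulses seen during the given number of clock cycles."""
    return sum(timer.tick() for _ in range(cycles))
```

With a 50MHz input, `PulseTimer(25_000_000)` pulses on every second tick, while `reload_value(60)` is 833332 because 50MHz does not divide evenly by 60, which is the accuracy concern mentioned above.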

### CPU

The CPU is the main component of the implementation. Its task is to fetch the instruction that the program counter points at from memory and perform the operations of that instruction.

CHIP-8 has a 12-bit wide address space in which the CPU can read and write memory. To simplify things, I have used a 16-bit address space so that addresses are byte aligned.

#### Registers

CHIP-8 has several registers in its implementation:

• PC - Program counter - 16 bit wide and points to the address of the next instruction to be executed. It is automatically incremented by 2 for each CPU instruction that is run, unless the instruction itself modifies the program counter.

• I - Address register - 16 bit wide and used for memory related instructions. It holds the address that memory should be read from or written to.

• V0 to VF - Data registers - 16 8-bit wide registers that are used to temporarily hold values during operations. VF is also used to hold the carry result for some of the addition and subtraction instructions.

• SP - Stack pointer - 8 bit wide pointer to the place in the stack where the last call stored its return address.

• DT - Delay timer - Counts down by one 60 times per second until it reaches 0. Used by the programmer to maintain the correct speed.

• ST - Sound timer - Identical to Delay timer but intended for use with sound.

#### Stack

CHIP-8 has a call stack that keeps track of which addresses are to be returned to when the RET instruction is executed after a CALL instruction. The stack size is 64 bytes, so it can hold 32 16-bit addresses.
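As a rough illustration of how CALL and RET interact with such a stack, here is a small Python model. The class name `CallStack`, the choice to let the stack pointer index the next free slot, and the overflow checks are assumptions made for the sketch, not details taken from the Verilog source.

```python
STACK_DEPTH = 32  # 64 bytes / 2 bytes per address, as stated in the text


class CallStack:
    """Minimal model of the CHIP-8 call stack."""

    def __init__(self):
        self.slots = [0] * STACK_DEPTH
        self.sp = 0  # assumed: index of the next free slot

    def call(self, return_address, target):
        """CALL: store the return address, give back the new program counter."""
        if self.sp >= STACK_DEPTH:
            raise OverflowError("call stack overflow")
        self.slots[self.sp] = return_address
        self.sp += 1
        return target & 0x0FFF  # NNN is a 12-bit address

    def ret(self):
        """RET: give back the most recently stored return address."""
        if self.sp == 0:
            raise IndexError("RET with an empty call stack")
        self.sp -= 1
        return self.slots[self.sp]
```

Nested calls unwind in reverse order: after `call(0x202, 0x300)` followed by `call(0x302, 0x400)`, the first `ret()` yields 0x302 and the second yields 0x202.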

#### Instructions

All CHIP-8 instructions are 16 bits wide and can be divided into four nibbles of 4 bits each. Only the leading nibble is guaranteed to be part of what is called the opcode, which is what identifies an instruction.

Using the casez statement in Verilog, a decoder is generated that works out which instruction to execute based on which nibbles of the instruction match an opcode pattern. Instructions often have nibbles left over that are used as arguments for the instruction.

When an instruction is executed, individual nibbles are used directly or combined into a value with more bits. For example, the JP instruction (jump to address) has only the first nibble as the opcode, and the last three nibbles are used as a 12 bit number (NNN) to which the program counter is set.

##### OP Table

There are three differently sized parts of an instruction that can be read:

• 0000 NNNN NNNN NNNN - NNN - 12 bit wide
• 0000 0000 KKKK KKKK - KK - 8 bit wide
• 0000 XXXX YYYY ZZZZ - X, Y and Z - 4 bit wide
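To make the field layout concrete, the sketch below extracts the fields with shifts and masks and imitates a casez-style decoder with a list of mask/value pairs. It is written in Python purely for illustration; the names `decode_fields`, `PATTERNS` and `match_instruction` are hypothetical, and only a handful of instructions are included.

```python
def decode_fields(instruction):
    """Split a 16-bit instruction into the fields used by the OP table."""
    return {
        "op": (instruction >> 12) & 0xF,   # leading nibble
        "x": (instruction >> 8) & 0xF,     # second nibble
        "y": (instruction >> 4) & 0xF,     # third nibble
        "z": instruction & 0xF,            # fourth nibble
        "kk": instruction & 0xFF,          # lowest 8 bits
        "nnn": instruction & 0xFFF,        # lowest 12 bits
    }


# (mask, value, mnemonic): bits cleared in the mask are "don't care",
# similar in spirit to the ? wildcards in a casez pattern.
PATTERNS = [
    (0xFFFF, 0x00E0, "CLS"),
    (0xFFFF, 0x00EE, "RET"),
    (0xF000, 0x1000, "JP addr"),
    (0xF000, 0x6000, "LD Vx, byte"),
    (0xF00F, 0x8004, "ADD Vx, Vy"),
    (0xF0FF, 0xF033, "LD B, Vx"),
]


def match_instruction(instruction):
    """Return the mnemonic of the first matching pattern, or None."""
    for mask, value, name in PATTERNS:
        if (instruction & mask) == value:
            return name
    return None
```

For example, `decode_fields(0x6A2F)` gives X = 0xA and KK = 0x2F, and `match_instruction(0x6A2F)` identifies it as `LD Vx, byte`.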

| OP | ASM | Description |
|---|---|---|
| 00E0 | CLS | Clears the display, in this case zeros out the VRAM |
| 00EE | RET | Sets the program counter back to the last value from the call stack |
| 1NNN | JP addr | Sets the program counter to the value of NNN |
| 2NNN | CALL addr | Adds the current program counter to the call stack and sets the program counter to NNN |
| 3XKK | SE Vx, byte | Skips the next instruction if the data register Vx is equal to KK |
| 4XKK | SNE Vx, byte | Skips the next instruction if the data register Vx is not equal to KK |
| 5XY0 | SE Vx, Vy | Skips the next instruction if the data register Vx is equal to Vy |
| 6XKK | LD Vx, byte | Sets the value of the data register Vx to KK |
| 7XKK | ADD Vx, byte | Sets the value of the data register Vx to Vx + KK |
| 8XY0 | LD Vx, Vy | Sets the value of the data register Vx to the value of register Vy |
| 8XY1 | OR Vx, Vy | Sets Vx to the bitwise OR of Vx and Vy |
| 8XY2 | AND Vx, Vy | Sets Vx to the bitwise AND of Vx and Vy |
| 8XY3 | XOR Vx, Vy | Sets Vx to the bitwise EXCLUSIVE OR of Vx and Vy |
| 8XY4 | ADD Vx, Vy | Sets Vx to Vx + Vy. In addition, VF is set to the carry result of the operation |
| 8XY5 | SUB Vx, Vy | Sets Vx to Vx - Vy. In addition, VF is set to the borrow result of the operation |
| 8XY6 | SHR Vx | Shifts Vx one bit to the right and sets VF to the LSB of Vx |
| 8XY7 | SUBN Vx, Vy | Sets Vx to Vy - Vx. In addition, VF is set to the borrow result of the operation |
| 8XYE | SHL Vx, Vy | Shifts Vx one bit to the left and sets VF to the MSB of Vx |
| 9XY0 | SNE Vx, Vy | Skips the next instruction if the data register Vx is not equal to Vy |
| ANNN | LD I, addr | Sets register I to NNN |
| BNNN | JP V0, addr | Sets the program counter to V0 + NNN |
| CXKK | RND Vx, byte | Sets Vx to the random number from the RNG module bitwise ANDed with KK |
| DXYZ | DRW Vx, Vy, N | Draws a sprite at position X = Vx, Y = Vy, with length N (the Z nibble) using the GPU. VF is set to 0 or 1 based on whether the GPU detected a sprite collision |
| EX9E | SKP Vx | Skips the next instruction if the key with the value of Vx is pressed |
| EXA1 | SKNP Vx | Skips the next instruction if the key with the value of Vx is not pressed |
| FX07 | LD Vx, DT | Sets Vx to the value of the delay timer register |
| FX0A | LD Vx, K | Stops execution of instructions until a key is pressed and sets Vx to that key |
| FX15 | LD DT, Vx | Sets the delay timer to the value of register Vx |
| FX18 | LD ST, Vx | Sets the sound timer to the value of register Vx |
| FX1E | ADD I, Vx | Sets register I to I + Vx |
| FX29 | LD F, Vx | Sets register I to the memory location of the font sprite with ID Vx |
| FX33 | LD B, Vx | Uses the BCD module to split Vx into hundreds, tens and ones and stores these in memory at I, I + 1 and I + 2 |
| FX55 | LD [I], Vx | Stores V0 to Vx in memory from I to I + x. I is then set to I + x + 1 |
| FX65 | LD Vx, [I] | Fills V0 to Vx with the memory from I to I + x. I is then set to I + x + 1 |

#### BCD converting

The CPU has its own BCD conversion mini-circuit (see src/cpu_bcd.v) which takes an 8-bit binary number and splits it into three 4-bit nibbles containing the three decimal digits of that number.

After looking for possible solutions, I found an algorithm<sup>5</sup> which can be synthesized into a single cycle circuit and went with it.
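The text does not name the algorithm, so the following is only a guess at the general idea: a Python sketch of the well-known shift-and-add-3 ("double dabble") method, which uses nothing but shifts, comparisons and additions and can therefore be unrolled into combinational logic. The function name `bcd_convert` is hypothetical.

```python
def bcd_convert(value):
    """Convert an 8-bit number to (hundreds, tens, ones) with shift-and-add-3."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in 8 bits")
    # Bits 0..7 hold the binary input, bits 8..19 collect three BCD digits.
    scratch = value
    for _ in range(8):
        # Any BCD digit of 5 or more gets 3 added before the shift,
        # so that doubling it carries correctly into the next digit.
        for shift in (8, 12, 16):
            if ((scratch >> shift) & 0xF) >= 5:
                scratch += 3 << shift
        scratch <<= 1
    hundreds = (scratch >> 16) & 0xF
    tens = (scratch >> 12) & 0xF
    ones = (scratch >> 8) & 0xF
    return hundreds, tens, ones
```

In hardware the eight loop iterations would become eight chained stages of comparators and adders rather than eight clock cycles, which is what makes a single cycle conversion possible.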

#### RNG

I used the PRBS31 algorithm as the pseudo-random number generator (see src/cpu_rng.v). The only problem is that it is currently not seeded, which makes the RNG output predictable.
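For reference, PRBS31 is the linear-feedback shift register sequence defined by the polynomial x^31 + x^28 + 1. The Python sketch below advances such a register one step at a time, mirroring "a new number for each clock cycle". The class name `Prbs31`, the default seed of 1 and the choice to hand the low 8 bits to the CPU are assumptions for the illustration.

```python
class Prbs31:
    """31-bit LFSR for the polynomial x^31 + x^28 + 1."""

    MASK = 0x7FFFFFFF

    def __init__(self, seed=1):
        if seed & self.MASK == 0:
            raise ValueError("an all-zero state would lock up the LFSR")
        self.state = seed & self.MASK

    def step(self):
        """Advance one clock cycle and return the new 31-bit state."""
        new_bit = ((self.state >> 30) ^ (self.state >> 27)) & 1
        self.state = ((self.state << 1) | new_bit) & self.MASK
        return self.state

    def random_byte(self, kk=0xFF):
        """What RND Vx, byte would see: low 8 bits ANDed with KK (assumed)."""
        return self.state & 0xFF & kk
```

Because the start state is fixed, two generators created with the same seed produce exactly the same sequence, which is the predictability mentioned above.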

### Memory

Illustration of the CHIP-8 memory space

CHIP-8's internal memory is 4 KB, i.e. 4096 bytes. CHIP-8 was designed for systems with only 4 KB of main memory, so some regions of the memory space are reserved for the system's interpreter. These are the first 512 bytes and the last 352 bytes.

The memory can be accessed by both the microprocessor and the GPU, but the GPU has priority: in this implementation the microprocessor is blocked while the GPU is running. This had to be done to make the memory work properly.

In addition, the framebuffer memory is mirrored to a separate video memory (VRAM) that only the VGA module has read access to. This is because the VGA implementation needs to retrieve framebuffer data every clock cycle, as VGA is a signal that must output information at a given frequency instead of sending it in larger digital transmissions.

The font set and game ROM are loaded into memory using Verilog's \$readmemh system task as part of the start state of the circuit. The rest of the address space is zeroed.

Note that the reserved regions can be used for anything. I just chose to use the start to store the font set and framebuffer; see src/memory.v for the full implementation.
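The region boundaries follow directly from those numbers. A small Python sketch of the arithmetic, with constant and function names invented for the example:

```python
MEMORY_SIZE = 4096    # 4 KB
RESERVED_LOW = 512    # first 512 bytes
RESERVED_HIGH = 352   # last 352 bytes

PROGRAM_START = RESERVED_LOW                # first address free for programs
PROGRAM_END = MEMORY_SIZE - RESERVED_HIGH   # first address of the upper reserved area


def region(address):
    """Name the region a 12-bit address falls into."""
    if not 0 <= address < MEMORY_SIZE:
        raise ValueError("address outside the 4 KB memory space")
    if address < PROGRAM_START:
        return "reserved (low)"
    if address < PROGRAM_END:
        return "program"
    return "reserved (high)"
```

This puts the program area at 0x200 up to and including 0xE9F, leaving 3232 bytes for the ROM and its data.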

### Keypad

The internal wiring of a matrix keypad

For input, CHIP-8 uses a 4×4 keypad. This is because the original CHIP-8 implementation ran on a COSMAC VIP, which also had a keypad as its primary input device.

A matrix keypad is used. The decoder runs in a loop where it sets the output of one column high at a time and then checks whether any of the row inputs are high. If a row is high, it knows that the button at that column and row is pressed, and with this information the register bit of that button is updated.

Using a matrix keypad avoids having one wire per button. This means that fewer wires and IO pins are used on the FPGA than other solutions might have required.

The keypad decoder represents the state of the keypad as a 16-bit value in which every button has its own bit, as opposed to a 4-bit representation that can only represent one pressed button at a time.

The keypad module stores the 16-bit value in a register that is connected to the CPU module through a wire. The CPU has instructions that read the keypad value and are affected by the value.
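The scanning loop can be sketched in Python as follows. The hardware is replaced by a callback that returns the row inputs for a driven column, and the mapping of (column, row) to bit `row * 4 + column` is an assumption, since the text does not say which bit belongs to which button. `scan_keypad` and `make_keypad` are hypothetical names.

```python
def scan_keypad(read_rows):
    """Scan all four columns and return the 16-bit key state.

    read_rows(column) models driving one column high and returning the
    4-bit value seen on the row inputs.
    """
    keys = 0
    for column in range(4):
        rows = read_rows(column)
        for row in range(4):
            if (rows >> row) & 1:
                # Assumed layout: one bit per button, row-major order.
                keys |= 1 << (row * 4 + column)
    return keys


def make_keypad(pressed):
    """Simulated keypad; pressed is a set of (column, row) tuples."""
    def read_rows(column):
        rows = 0
        for pressed_column, pressed_row in pressed:
            if pressed_column == column:
                rows |= 1 << pressed_row
        return rows
    return read_rows
```

Pressing two buttons at once, for example `scan_keypad(make_keypad({(0, 0), (3, 3)}))`, sets two separate bits in the result, which is exactly what a 4-bit encoding could not express.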

### VGA

VGA, which stands for Video Graphics Array, is a standard created by IBM in the 80s. Compared to other standards such as DisplayPort and HDMI, it is easier to implement since it is much less complicated and it takes little to get a working image.

This is because VGA is analogue. It has three colours, red, green and blue, each carried on its own analogue signal in a range of 0 to 0.7V that sets the strength of the colour. There are also two digital signals for horizontal and vertical synchronization.

The synchronization signals are sent outside the cycles in which pixels are drawn, and there is extra padding on either side of the synchronization signals called the blanking interval.

I have decided to generate an image signal of 640 pixels in width and 480 pixels in height with an image refresh rate of 60Hz. I picked this since my FPGA has a 50MHz clock which makes the 25MHz pixel clock usable.

#### VGA signal drawing

Visualization of a 640x480@60Hz VGA signal.

#### VGA signal info

| Signal | Horizontal cycles | Vertical cycles |
|---|---|---|
| Pixels drawn | 640 | 480 |
| Before sync pulse | 16 | 10 |
| Sync pulse | 96 | 2 |
| After sync pulse | 48 | 33 |

Table which lists the number of cycles in the different parts of the signal
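A quick sanity check of these numbers: adding up the four parts gives the total number of pixel clock cycles per line and the total number of lines per frame, and dividing the 25MHz pixel clock by their product gives the refresh rate. The Python below only does that arithmetic; the names are made up for the example, and the totals do not depend on the order of the blanking parts.

```python
PIXEL_CLOCK_HZ = 25_000_000  # pixel clock used in this project

# Cycle counts per part of the signal: visible area, sync pulse,
# and the two blanking parts on either side of the sync pulse.
HORIZONTAL = {"visible": 640, "sync": 96, "blanking": 48 + 16}
VERTICAL = {"visible": 480, "sync": 2, "blanking": 10 + 33}


def total_cycles(parts):
    """Total length of one line (in pixels) or one frame (in lines)."""
    return sum(parts.values())


def refresh_rate(pixel_clock_hz=PIXEL_CLOCK_HZ):
    """Frames per second for the given pixel clock."""
    return pixel_clock_hz / (total_cycles(HORIZONTAL) * total_cycles(VERTICAL))
```

One line is 800 cycles and one frame is 525 lines, so a 25MHz pixel clock gives roughly 59.5 frames per second, close enough to the nominal 60Hz.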

#### VGA DAC

To generate the analogue VGA signal, I designed a 4-bit weighted resistor network per colour channel, enabling 12-bit colour. While CHIP-8 has a single bit per pixel (black or white), I decided to do it anyway so that I can do other projects which require multiple colours.

The resistor network is required to get the voltage down to the 0-0.7V range required by the VGA spec and to convert the digital outputs to an analogue value.

My FPGA has 3.3V outputs, which is several times the 0.7V of the VGA pins, so I need to use a resistor to bring the voltage on the VGA cable down to an acceptable level for the current<sup>6</sup>.

$$R_{colour} = \frac{U_{GPIO}-U_{VGA}}{I_{VGA}}=\frac{3.3V-0.7V}{18.7mA}=139.037Ω≈\underline{\underline{139Ω}}$$

To split the total resistance into four different values, one for each bit, we can simply multiply by 2<sup>bit</sup>. The values have also been rounded to common resistor values.

Note that the bit variable has to start at 1 to get correct results.

$$R_{bit0} = R_{colour} \cdot 2^{bit}=139Ω \cdot 2^1=278Ω≈\underline{\underline{270Ω}}$$
$$R_{bit1} = R_{colour} \cdot 2^{bit}=139Ω \cdot 2^2=556Ω≈\underline{\underline{560Ω}}$$
$$R_{bit2} = R_{colour} \cdot 2^{bit}=139Ω \cdot 2^3=1112Ω≈\underline{\underline{1.1kΩ}}$$
$$R_{bit3} = R_{colour} \cdot 2^{bit}=139Ω \cdot 2^4=2224Ω≈\underline{\underline{2.2kΩ}}$$
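The same calculation can be reproduced in a few lines of Python, which is handy for trying other voltages or bit depths. The function names are hypothetical and the values are the ones used in the formulas above; rounding to standard resistor values is left out.

```python
GPIO_VOLTAGE = 3.3     # V, FPGA output level
VGA_VOLTAGE = 0.7      # V, full-scale VGA colour level
VGA_CURRENT = 0.0187   # A, the 18.7mA used in the formula above


def channel_resistance():
    """Total series resistance needed per colour channel, in ohms."""
    return (GPIO_VOLTAGE - VGA_VOLTAGE) / VGA_CURRENT


def bit_resistances(r_colour, bits=4):
    """Resistor per bit: r_colour * 2^bit with bit counted from 1."""
    return [r_colour * 2 ** bit for bit in range(1, bits + 1)]


def parallel(resistances):
    """Combined resistance of resistors connected in parallel."""
    return 1 / sum(1 / r for r in resistances)
```

`bit_resistances(139)` returns `[278, 556, 1112, 2224]`, matching the unrounded values above. With all four bits driven high the resistors sit in parallel, and `parallel([278, 556, 1112, 2224])` comes out at roughly 148Ω, slightly above the 139Ω target.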

### GPU

How a 6 line long sprite would be stored,
notice how every line can be stored in a single byte.

In order to manipulate the framebuffer efficiently, this function has been separated into a dedicated Graphics Processing Unit (GPU). This is because the GPU circuit can perform in one cycle what would take the microprocessor dozens of instructions.

In CHIP-8, 8x1 pixels are stored in an 8-bit number, so even though the framebuffer is 64×32 pixels, it is stored as 8×32 (256) 8-bit numbers instead of the 2048 that would be needed if each pixel had its own byte. When I refer to a pixel block, I mean a group of 8 pixels stored in one byte.

The GPU has two commands: one called CLEAR that sets all the values in the framebuffer region to zero, and one called DRAW that has the task of drawing sprites.

A sprite is an image stored in memory that is 8 pixels wide and between 1 and 15 pixels high.

The draw command has 3 parameters:

• The address of where the sprite begins

• Number of lines in the sprite

• The X & Y coordinates of where to draw the sprite

The first thing the DRAW command checks is whether the X parameter is aligned to 8 bits, i.e. to a pixel block boundary. If it is not, an unaligned flag is set. Each sprite line is put in a 16-bit register and shifted by the number of bits it is away from being aligned. If the unaligned flag is set, both pixel blocks that the sprite line passes over are read instead of just one. A bitwise XOR operation is run between the pixel blocks to be drawn and those in memory. This allows you to draw a sprite and remove it by drawing it again in the same position. In addition, there is a collision register that is set if the bitwise AND between the sprite line and the pixel block in memory is non-zero.

NOTE: These GPU functions could also be implemented directly as part of the instructions in the CPU, but I decided to split them into their own module.
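A behavioural model of the DRAW steps might look like the Python below. It is a sketch under several assumptions that the text does not confirm: the most significant bit of a byte is the leftmost pixel, coordinates wrap around at the screen edges, and the framebuffer is a plain 256-byte array. The name `draw_sprite` is hypothetical.

```python
FB_WIDTH_BLOCKS = 8   # 64 pixels / 8 pixels per block
FB_HEIGHT = 32        # lines


def draw_sprite(framebuffer, sprite, x, y):
    """XOR a sprite into the framebuffer; return True on a collision.

    framebuffer: bytearray of 256 pixel blocks, row by row.
    sprite: iterable of 8-bit sprite lines, top line first.
    """
    collision = False
    shift = x % 8                 # bits away from being block aligned
    unaligned = shift != 0
    for line, line_bits in enumerate(sprite):
        row = (y + line) % FB_HEIGHT          # assumed: wrap vertically
        # Put the line in the upper byte of a 16-bit register and shift it.
        shifted = (line_bits << 8) >> shift
        blocks = [(shifted >> 8) & 0xFF]
        if unaligned:
            blocks.append(shifted & 0xFF)     # spill into the next block
        for offset, bits in enumerate(blocks):
            column = (x // 8 + offset) % FB_WIDTH_BLOCKS  # assumed: wrap
            index = row * FB_WIDTH_BLOCKS + column
            if framebuffer[index] & bits:
                collision = True              # a lit pixel gets erased
            framebuffer[index] ^= bits
    return collision
```

Drawing `[0xFF]` at X = 4 touches two pixel blocks (0x0F and 0xF0), and drawing the same sprite a second time restores the framebuffer and reports a collision, which is the draw-then-erase behaviour described above.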

## Putting it all together

After all the modules are implemented, they have to be connected together with a top module. This is the main module which wires all the other modules together, see src/chip8.v.

Diagram of how all the physical components are connected.

So in the end, after a bit of debugging, I got everything brought up. Sure, there are a ton of ways the implementation could be improved, cleaned up and expanded, but it was an interesting project and I learned a fair share of new things while doing it. But in the end...

It's working!