CHIP-8 [1] is a binary interpreted programming language created in the mid-1970s so that programs could run independently of clock speeds and other architecture-specific differences that were a problem at the time.
In the late 1980s and early 1990s, CHIP-8 became popular on devices like programmable calculators because an interpreter takes little effort to write. Simplified CHIP-8 versions of many games were also made during this time, for example PONG, Pac-Man and Tetris.
A Field Programmable Gate Array (FPGA) is an integrated circuit that contains programmable logic blocks with a reconfigurable interconnect that can be reconfigured at any time.
This is useful for testing implementations of digital integrated circuits such as microprocessors, and for circuits where it is useful to be able to change the logic at any time. Accelerator cards in servers are a common use case for FPGAs these days, since the accelerator can keep being developed alongside the software even after it has been deployed to production.
A Hardware Description Language (HDL) is a specialized type of programming language that provides an abstraction for describing the logic of digital and analogue circuits. Most HDLs work at what is referred to as the register-transfer level [4]. This makes it possible to create complicated circuits and designs where all the logic runs in parallel.
Today's most widely used HDLs are Verilog and VHDL. The big difference is that VHDL has a strong type system that requires more precise definitions in implementations. Because of this, VHDL will not convert bit sizes automatically, which typically makes VHDL code longer and more complicated.
An HDL design can be synthesized into logic gates and a description of how they are connected (a netlist). This can be used to design an integrated circuit or be fed to FPGA software. The software converts and optimizes it to generate a proprietary bitstream format which describes the logic blocks and interconnects. This is the closest we get to what can be called a program for an FPGA.
One of the most important things to get in order before doing anything related to implementation is defining the architecture. This consists of defining the parts of the implementation, its modules and how they are connected. In Verilog, everything is divided into modules, which can be compared to components in a circuit. These modules are connected with wires, and wires allow a module to access the registers and logic functions of other modules.
High-level overview for the architecture of the FPGA implementation.
The architecture has been split into these modules with the following tasks:
- Timer - Responsible for generating the clock pulses at the different frequencies required.
- CPU - Also known as the microprocessor. It reads instructions from memory and executes them, in addition to controlling the GPU.
- CPU BCD - A sub-module of the CPU that is responsible for converting an 8-bit number to three 4-bit numbers representing the three decimal digits of the input number.
- CPU RNG - Pseudo-random number generator that creates a new number for each clock cycle based on an algorithm where numbers in the sequence have no visible connection.
- Memory - Holds the memory for the microprocessor and a separate video memory area (VRAM) that the VGA signal generator can read from. In addition, the font set and the ROM are pre-programmed into the memory.
- GPU - Has the task of copying sprites into the video memory area and clearing it on behalf of the microprocessor.
- VGA signal generator - Generates a valid VGA signal with data it reads from the video memory.
- Keypad - Responsible for reading a 4×4 matrix keypad and turning it into a 16-bit register for the microprocessor to access.
After going through the differences between Verilog and VHDL, I came to the conclusion that even though I would have preferred VHDL due to its more precise code, its ecosystem was not as big as I would like. Verilog has several open simulators such as Icarus Verilog and Verilator, while VHDL only had GHDL at the time of writing.
The tools available were critical to me in terms of being able to test and make changes quickly, so I picked Verilog for this project.
To simulate test benches, which are a form of automatic testing, I have used Icarus Verilog. It is an open source project that can perform event driven simulation.
Event driven simulation gives me the ability to create cycle delays in test benches without writing my own logic for this. This is used to wait a given number of cycles before checking that a result matches what the test bench expects. This is not possible in other simulators such as Verilator, which uses a cycle simulation model.
The repo has a Makefile which can run the simulation of all the test modules in the tests/ directory. There are various unit tests which contain assertions.
Unit test which starts a PONG game and dumps the output of the framebuffer in the terminal.
The timer circuit is one of the most critical parts of the circuit and has the task of creating different pulses for the different parts of the circuit.
In this circuit 3 different clock pulses are needed:
- Pulse for microprocessor - 500Hz
- Pulse for delay & sound timer (register in microprocessor) - 60Hz
- Pulse for VGA signal generator - 25MHz
To generate the pulses, I wrote a simple module (see src/timer.v while reading). It works by calculating the number of system clock pulses required to reach the frequency for the various timers in advance during the synthesis of the circuit.
Each timer has a 32-bit register which it decrements by one for each positive pulse it gets from the clock input. When it reaches zero, it is reset to the pre-calculated top value at the same time as the pulse output becomes high. Because of this, the high period of the pulses the timer generates is equal to one period of the 50MHz system clock. Internally in this circuit, this will not create any noticeable problems.
It is important that the target frequencies divide the system clock frequency as evenly as possible so that the generated clocks are as accurate as possible. The built-in clock on the DE0-Nano FPGA I have obtained runs at 50MHz. This means that the VGA timer output is only triggered for every second pulse the timer receives from the built-in clock.
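The divider arithmetic can be illustrated with a small Python model. This is only a sketch of the counting scheme described above, not the Verilog in src/timer.v; the function names and the exact reload value (period minus one) are assumptions made for the illustration.

```python
SYSTEM_CLOCK_HZ = 50_000_000  # built-in DE0-Nano clock mentioned in the text


def top_value(target_hz):
    """System clock cycles per output pulse, rounded to the nearest integer."""
    return round(SYSTEM_CLOCK_HZ / target_hz)


def actual_frequency(target_hz):
    """Frequency actually produced once the divider has been rounded."""
    return SYSTEM_CLOCK_HZ / top_value(target_hz)


def pulse_train(period, cycles):
    """Model a down-counter that emits a one-cycle pulse when it hits zero.

    Assumption: the counter reloads to period - 1, so pulses are exactly
    `period` system clock cycles apart.
    """
    counter = period - 1
    pulses = []
    for _ in range(cycles):
        if counter == 0:
            pulses.append(1)
            counter = period - 1
        else:
            pulses.append(0)
            counter -= 1
    return pulses


if __name__ == "__main__":
    for name, hz in [("CPU", 500), ("delay/sound", 60), ("VGA", 25_000_000)]:
        print(name, top_value(hz), actual_frequency(hz))
    print(pulse_train(top_value(25_000_000), 6))
```

The 500Hz and 25MHz targets divide 50MHz exactly, while 60Hz does not, so the 60Hz pulse ends up very slightly off after rounding.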
The CPU is the main component of the implementation. Its task is to fetch the instruction from the memory location that the program counter points at and perform the operations of that instruction.
CHIP-8 has a 12-bit wide address space where the CPU can read and write memory. To simplify things, I have used a 16-bit address space so that it is byte aligned.
CHIP-8 has several registers in its implementation:
- PC - Program counter - 16 bits wide and points to the address of the next instruction to be executed. It is automatically incremented by 2 for each CPU instruction that is run, unless the instruction itself modifies the program counter.
- I - Address register - 16 bits wide and used for memory related instructions. It holds an address that indicates where memory should be read from or written to.
- V0 to VF - Data registers - 16 registers, each 8 bits wide, that are used to temporarily hold values during operations. VF is also used to hold the carry result for some of the addition and subtraction instructions.
- SP - Stack pointer - 8 bits wide and points to the place in the stack where the last call was stored.
- DT - Delay timer - Counts down by one 60 times per second until it reaches 0. Used by the programmer to maintain the correct speed.
- ST - Sound timer - Identical to the delay timer but intended for use with sound.
CHIP-8 has a call stack that keeps track of which addresses to return to when the RET instruction is executed after a CALL instruction. The stack size is 64 bytes, so it can hold 32 16-bit addresses.
All CHIP-8 instructions are 16 bits wide and can be divided into four nibbles. One nibble is 4 bits wide, and only the leading nibble is guaranteed to be part of what is called the OP code. The OP code is what identifies an instruction.
Using the casez statement in Verilog, a decoder is generated that works out which instruction to execute based on which nibbles of the instruction match an OP code. Instructions often have nibbles left over that are used as arguments for the instruction.
When an instruction is executed, the remaining nibbles are used on their own or combined into a value with several bits. For example, the JP instruction (jump to address) has only the first nibble as the OP code, and the last three nibbles are used as a 12-bit number (NNN) to which the program counter is set.
There are three different sizes of instruction parts that can be read:
- NNN - 12 bits wide, the lowest three nibbles of the instruction
- KK - 8 bits wide, the lowest two nibbles of the instruction
- X, Y and Z - 4 bits wide each, the second, third and fourth nibble of the instruction
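As a rough illustration, the argument fields can be extracted from a 16-bit instruction with shifts and masks. The Python sketch below is not taken from the CPU module; the function name `fields` and the dictionary keys are chosen here just for the example.

```python
def fields(instruction):
    """Split a 16-bit CHIP-8 instruction into its possible argument fields."""
    return {
        "nnn": instruction & 0x0FFF,      # lowest 12 bits
        "kk": instruction & 0x00FF,       # lowest 8 bits
        "x": (instruction >> 8) & 0xF,    # second nibble
        "y": (instruction >> 4) & 0xF,    # third nibble
        "z": instruction & 0xF,           # fourth nibble
    }


if __name__ == "__main__":
    # 0xD123 is a DXYZ (draw) instruction with X = 1, Y = 2, Z = 3.
    print(fields(0xD123))
```

Which of these fields is meaningful depends on the instruction, so a decoder would normally pick only the ones the matched OP code needs.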
|Opcode||Instruction||Description|
|00E0||CLS||Clears the display, in this case by zeroing out the VRAM|
|00EE||RET||Sets the program counter back to the last value from the call stack|
|1NNN||JP addr||Sets the program counter to the value of NNN|
|2NNN||CALL addr||Adds the current program counter to the call stack and sets the program counter to NNN|
|3XKK||SE Vx, byte||Skips the next instruction if the data register Vx is equal to KK|
|4XKK||SNE Vx, byte||Skips the next instruction if the data register Vx is not equal to KK|
|5XY0||SE Vx, Vy||Skips the next instruction if the data register Vx is equal to Vy|
|6XKK||LD Vx, byte||Sets the value of the data register Vx to KK|
|7XKK||ADD Vx, byte||Sets the value of the data register Vx to Vx + KK|
|8XY0||LD Vx, Vy||Sets the value of the data register Vx to the value of Vy|
|8XY1||OR Vx, Vy||Sets Vx to the bitwise OR of Vx and Vy|
|8XY2||AND Vx, Vy||Sets Vx to the bitwise AND of Vx and Vy|
|8XY3||XOR Vx, Vy||Sets Vx to the bitwise EXCLUSIVE OR of Vx and Vy|
|8XY4||ADD Vx, Vy||Sets Vx to Vx + Vy. In addition, VF is set to the carry from the operation|
|8XY5||SUB Vx, Vy||Sets Vx to Vx - Vy. In addition, VF is set based on the borrow from the operation|
|8XY6||SHR Vx||Shifts Vx one bit to the right and sets VF to the least significant bit of Vx from before the shift|
|8XY7||SUBN Vx, Vy||Sets Vx to Vy - Vx. In addition, VF is set based on the borrow from the operation|
|8XYE||SHL Vx||Shifts Vx one bit to the left and sets VF to the most significant bit of Vx from before the shift|
|9XY0||SNE Vx, Vy||Skips the next instruction if the data register Vx is not equal to Vy|
|ANNN||LD I, addr||Sets register I to NNN|
|BNNN||JP V0, addr||Sets the program counter to V0 + NNN|
|CXKK||RND Vx, byte||Sets Vx to the bitwise AND of the random number from the RNG module and KK|
|DXYZ||DRW Vx, Vy, Z||Draws a sprite at position X = Vx, Y = Vy with a height of Z lines using the GPU. VF is set to 0 or 1 depending on whether the GPU detected a sprite collision|
|EX9E||SKP Vx||Skips the next instruction if the key with the value of Vx is pressed|
|EXA1||SKNP Vx||Skips the next instruction if the key with the value of Vx is not pressed|
|FX07||LD Vx, DT||Sets Vx to the value of the delay timer register|
|FX0A||LD Vx, K||Stops execution of instructions until a key is pressed and sets Vx to the value of that key|
|FX15||LD DT, Vx||Sets the delay timer to the value of Vx|
|FX18||LD ST, Vx||Sets the sound timer to the value of Vx|
|FX1E||ADD I, Vx||Sets register I to I + Vx|
|FX29||LD F, Vx||Sets register I to the memory location of the font sprite with ID Vx|
|FX33||LD B, Vx||Uses the BCD module to split Vx into hundreds, tens and ones and stores these in memory at I, I + 1 and I + 2|
|FX55||LD [I], Vx||Stores V0 to Vx in memory from I to I + x. I is then set to I + x + 1|
|FX65||LD Vx, [I]||Fills V0 to Vx with the memory from I to I + x. I is then set to I + x + 1|
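The casez-style decoding described earlier can be modelled in Python as matching against patterns where `?` marks a don't-care nibble. This is a sketch of the idea only: the pattern list covers a handful of rows from the table, and `compile_pattern` and `decode` are names made up for this example rather than anything from the CPU source.

```python
def compile_pattern(pattern):
    """Turn a 4-character pattern such as '8??4' into a (mask, value) pair."""
    mask = 0
    value = 0
    for ch in pattern:
        mask <<= 4
        value <<= 4
        if ch != "?":          # fixed nibble: must match exactly
            mask |= 0xF
            value |= int(ch, 16)
    return mask, value


# A small subset of the instruction table, for illustration only.
PATTERNS = [
    ("00E0", "CLS"),
    ("00EE", "RET"),
    ("1???", "JP addr"),
    ("6???", "LD Vx, byte"),
    ("8??4", "ADD Vx, Vy"),
    ("D???", "DRW Vx, Vy, Z"),
    ("F?33", "LD B, Vx"),
]

DECODER = [(*compile_pattern(p), name) for p, name in PATTERNS]


def decode(instruction):
    """Return the name of the first matching pattern, or None if none match."""
    for mask, value, name in DECODER:
        if (instruction & mask) == value:
            return name
    return None


if __name__ == "__main__":
    for word in (0x00E0, 0x1234, 0x8AB4, 0xF133, 0x5000):
        print(hex(word), decode(word))
```

In hardware all patterns are compared at once, whereas this model checks them one by one, but the matching rule per pattern is the same.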
The CPU has its own BCD conversion mini-circuit (see src/cpu_bcd.v) which takes an 8-bit binary number and splits it into three 4-bit nibbles containing the three decimal digits of the number.
After looking for possible solutions, I found an algorithm [5] which can be synthesized into a single cycle circuit and went with it.
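One well-known algorithm that unrolls into purely combinational logic is "double dabble" (shift and add 3). Whether this is the algorithm used in src/cpu_bcd.v is an assumption; the Python sketch below only shows how such a conversion can work without division.

```python
def bcd8(value):
    """Convert an 8-bit number to (hundreds, tens, ones) using double dabble.

    Each loop iteration corresponds to one stage of logic if the loop is
    unrolled in hardware.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in 8 bits")
    hundreds = tens = ones = 0
    for bit in range(7, -1, -1):
        # Add 3 to any digit that is 5 or more so the shift carries correctly.
        if hundreds >= 5:
            hundreds += 3
        if tens >= 5:
            tens += 3
        if ones >= 5:
            ones += 3
        # Shift the whole digit chain left by one and bring in the next bit.
        hundreds = ((hundreds << 1) | (tens >> 3)) & 0xF
        tens = ((tens << 1) | (ones >> 3)) & 0xF
        ones = ((ones << 1) | ((value >> bit) & 1)) & 0xF
    return hundreds, tens, ones


if __name__ == "__main__":
    print(bcd8(0), bcd8(10), bcd8(255))
```

Since the input is only 8 bits, the loop always runs exactly eight times, which is what makes a single cycle implementation feasible.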
I used the PRBS31 algorithm as the pseudo-random number generator. The only problem is that it is currently not seeded, which makes the RNG predictable (see src/cpu_rng.v).
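PRBS31 is commonly defined by the polynomial x^31 + x^28 + 1, which can be modelled as a 31-bit shift register with XOR feedback. The Python sketch below assumes that definition and assumes the random byte is taken from the low 8 bits of the state; neither detail is confirmed by src/cpu_rng.v.

```python
MASK31 = (1 << 31) - 1


def prbs31_step(state):
    """Advance a PRBS31 shift register (x^31 + x^28 + 1) by one bit."""
    feedback = ((state >> 30) ^ (state >> 27)) & 1
    return ((state << 1) | feedback) & MASK31


def random_bytes(seed, count):
    """Return `count` bytes, stepping the register once per byte.

    Assumption: the byte is the low 8 bits of the state.
    """
    state = seed & MASK31
    out = []
    for _ in range(count):
        state = prbs31_step(state)
        out.append(state & 0xFF)
    return out


if __name__ == "__main__":
    # The same seed always gives the same sequence, which is why an
    # unseeded generator is predictable.
    print(random_bytes(1, 8))
    print(random_bytes(1, 8))
```

Note that a state of zero never changes, so any seeding scheme has to avoid an all-zero start value.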
Illustration of the CHIP-8 memory space
CHIP-8's internal memory is 4 KB, i.e. 4096 bytes. CHIP-8 was designed for systems with only 4 KB of main memory, and therefore some regions of the memory space are reserved for the system's interpreter. This includes the first 512 bytes and the last 352 bytes.
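The region sizes translate into addresses as follows. This is just the arithmetic from the paragraph above written out in Python; the constant names are made up for the example.

```python
MEMORY_SIZE = 4096     # 4 KB, addressable with 12 bits
RESERVED_LOW = 512     # first 512 bytes reserved for the interpreter
RESERVED_HIGH = 352    # last 352 bytes reserved for the interpreter

PROGRAM_START = RESERVED_LOW                 # first address a program can use
PROGRAM_END = MEMORY_SIZE - RESERVED_HIGH    # first address of the upper reserved region
PROGRAM_SPACE = PROGRAM_END - PROGRAM_START  # bytes left for the program

if __name__ == "__main__":
    print(hex(PROGRAM_START), hex(PROGRAM_END), PROGRAM_SPACE)
```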
The memory can be read by both the microprocessor and the GPU, but the GPU has priority because the microprocessor is blocked while the GPU is running in this implementation. This had to be done to make the memory work properly.
In addition, the framebuffer memory is mirrored to a separate video memory (VRAM) that only the VGA module has read access to. This is because the VGA implementation needs to retrieve framebuffer data every clock cycle, since VGA is a signal that must output information at a given frequency instead of sending it digitally in larger transmissions.
The font set and game ROM are loaded into memory using Verilog's $readmemh system task as part of the start state of the circuit. The rest of the address space is zeroed.
Note that the internal usage regions can be used for anything. I just chose to use the start to store the font set and framebuffer. See src/memory.v for the full implementation.
The internal wiring of a matrix keypad
For input, CHIP-8 uses a 4×4 keypad. This is because the original CHIP-8 implementation ran on a COSMAC VIP, which also had a keypad as its primary input device.
A matrix keypad is used, which works by having the decoder run in a loop where it sets the output of one column high at a time. It then checks whether any of the row inputs are high. If a row is high, it knows that the button at that column and row is pressed, and with this information the register bit of that button is updated.
Using a matrix keypad avoids having one wire per button. This means that fewer wires and I/O pins are used on the FPGA than other solutions might have required.
The keypad decoder represents the state of the keypad as a 16-bit value where every button has its own bit, as opposed to a 4-bit representation that can only represent one pressed button at a time.
The keypad module stores the 16-bit value in a register that is connected to the CPU module through a wire. The CPU has instructions that read the keypad value and behave differently depending on it.
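The scanning loop can be sketched in Python as follows. The set of pressed buttons stands in for the physical keypad, and the mapping from (row, column) to bit position is an assumption made for the illustration; the real decoder may order the bits differently to match CHIP-8 key values.

```python
ROWS = 4
COLUMNS = 4


def read_rows(pressed, column):
    """Simulate driving one column high and reading the four row inputs."""
    return [(row, column) in pressed for row in range(ROWS)]


def scan(pressed):
    """Scan all columns once and return the 16-bit keypad state.

    Assumption: the button at (row, column) maps to bit row * 4 + column.
    """
    state = 0
    for column in range(COLUMNS):
        for row, is_high in enumerate(read_rows(pressed, column)):
            if is_high:
                state |= 1 << (row * COLUMNS + column)
    return state


if __name__ == "__main__":
    # Two buttons held down at once are both visible in the result.
    print(bin(scan({(1, 2), (3, 3)})))
```

With 4 column outputs and 4 row inputs, 8 pins are enough to read all 16 buttons.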
VGA, which stands for Video Graphics Array, is a standard created by IBM in the 1980s. Compared to other standards such as DisplayPort and HDMI, it is easier to implement since it is much less complicated and it takes little to get a working image.
This is because VGA is analogue. It has three colour signals, red, green and blue, each sent over its own analogue line in a range of 0 to 0.7V which indicates the strength of the colour. There are also two digital signals for horizontal and vertical synchronization.
The synchronization signals are sent outside the cycles in which the pixels are drawn, with extra padding on each side of the synchronization pulse called the blanking interval.
I have decided to generate an image signal of 640 pixels in width and 480 pixels in height with an image refresh rate of 60Hz. I picked this since my FPGA has a 50MHz clock, which makes the 25MHz pixel clock usable.
|Signal||Horizontal cycles||Vertical cycles|
|Before sync pulse||16||10|
|After sync pulse||48||33|
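To see why a 25MHz pixel clock is close enough for 60Hz, the total number of cycles per frame can be worked out. The Python sketch below uses the padding values from the table together with sync pulse lengths of 96 pixels and 2 lines, which come from the commonly published 640×480 at 60Hz timing and are an assumption here since the table does not list them.

```python
# Cycles per line (in pixels) and per frame (in lines).
HORIZONTAL = {
    "visible": 640,
    "padding": 48 + 16,  # before and after the sync pulse (from the table)
    "sync": 96,          # assumption: standard 640x480 at 60Hz timing
}
VERTICAL = {
    "visible": 480,
    "padding": 10 + 33,  # before and after the sync pulse (from the table)
    "sync": 2,           # assumption: standard 640x480 at 60Hz timing
}


def refresh_rate(pixel_clock_hz):
    """Frames per second for a given pixel clock."""
    pixels_per_line = sum(HORIZONTAL.values())
    lines_per_frame = sum(VERTICAL.values())
    return pixel_clock_hz / (pixels_per_line * lines_per_frame)


if __name__ == "__main__":
    print(sum(HORIZONTAL.values()), sum(VERTICAL.values()))
    print(refresh_rate(25_000_000))
```

Under these assumptions a 25MHz clock gives a refresh rate slightly below 60Hz, which monitors generally tolerate.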
To generate the analogue VGA signal, I designed a 4-bit weighted resistor network per colour channel, enabling 12-bit colour. While CHIP-8 has a single bit per pixel (black or white), I decided to do it anyway so that I can do other projects which require multiple colours.
The resistor network is required to get the voltage down to the 0-0.7V range required by the VGA spec and to convert the digital outputs to an analogue value.
My FPGA has 3.3V outputs, which is several times the 0.7V of the VGA pins, and therefore I need resistance to bring the voltage on the VGA cable down to an acceptable level for the current [6].
To split the total resistance into four different values, one for each bit, we can simply multiply it by 2<sup>bit</sup>. The values have also been rounded to common resistor values.
Note that the bit variable has to start at 1 to get correct results.
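The resistor arithmetic can be sketched in Python. This example assumes a 75 ohm termination in the monitor and ideal outputs that drive either 3.3V or 0V; the resulting values are unrounded and are not the resistor values used in the actual build.

```python
V_OUT = 3.3              # FPGA output voltage from the text
V_MAX = 0.7              # maximum VGA colour voltage from the text
TERMINATION_OHMS = 75.0  # assumption: monitor input termination


def total_resistance():
    """Series resistance that divides V_OUT down to V_MAX into the termination."""
    return TERMINATION_OHMS * (V_OUT - V_MAX) / V_MAX


def bit_resistors(r_total, bits=4):
    """Resistor per bit, most significant bit first, with bit starting at 1."""
    return [r_total * 2 ** bit for bit in range(1, bits + 1)]


def output_voltage(code, resistors):
    """Voltage at the VGA pin for a given digital code.

    Bits set to 1 drive V_OUT through their resistor, bits set to 0 pull
    towards ground through theirs, and the termination also goes to ground.
    """
    bits = len(resistors)
    driven = 0.0
    total = 1.0 / TERMINATION_OHMS
    for index, resistance in enumerate(resistors):
        conductance = 1.0 / resistance
        total += conductance
        if (code >> (bits - 1 - index)) & 1:
            driven += conductance
    return V_OUT * driven / total


if __name__ == "__main__":
    resistors = bit_resistors(total_resistance())
    print([round(r) for r in resistors])
    print([round(output_voltage(code, resistors), 3) for code in range(16)])
```

Starting the bit variable at 1 means the four resistors in parallel end up slightly above the total resistance, so the maximum output stays just under 0.7V.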
How a 6 line long sprite would be stored,
notice how every line can be stored in a single byte.
In order to efficiently manipulate the framebuffer, this function has been separated into a dedicated Graphics Processing Unit (GPU). This is because the GPU circuit can perform in one cycle what the microprocessor would need dozens of instructions to do.
In CHIP-8, 8×1 pixels are stored in an 8-bit number, i.e. even though the framebuffer is 64×32 pixels, it is stored as 8×32 (256) 8-bit numbers instead of the 2048 that would have been needed if each pixel had its own byte. When I refer to a pixel block, it means a group of 8 pixels stored in one byte.
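As a small illustration of this layout, the pixel block and bit for a given pixel can be calculated as follows. The Python sketch assumes that rows are stored one after another and that the leftmost pixel of a block is the most significant bit; both are assumptions about the layout rather than details taken from the source.

```python
WIDTH = 64
HEIGHT = 32
BLOCKS_PER_ROW = WIDTH // 8  # 8 pixel blocks per row


def pixel_location(x, y):
    """Return (block index, bit index) for the pixel at (x, y)."""
    block = y * BLOCKS_PER_ROW + x // 8
    bit = 7 - (x % 8)  # assumption: leftmost pixel is the most significant bit
    return block, bit


if __name__ == "__main__":
    print(BLOCKS_PER_ROW * HEIGHT)   # total number of pixel blocks
    print(pixel_location(0, 0), pixel_location(8, 1), pixel_location(63, 31))
```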
The GPU has two commands: CLEAR, which sets all the values in the framebuffer region to zero, and DRAW, which has the task of drawing sprites.
A sprite is an image stored in memory that is 8 pixels wide and between 1 and 15 pixels high.
The draw command has 3 parameters:
- The address of where the sprite begins
- Number of lines in the sprite
- The X & Y coordinates of where to draw the sprite
The first thing the DRAW command checks is whether the X parameter is divisible by 8. If it is not, an unaligned flag is set. The sprite line is then placed in a 16-bit register and shifted by the number of bits it is away from being aligned. If the unaligned flag is set, both pixel blocks that the sprite line spans are read instead of just one. A bitwise XOR operation is run between the pixel blocks to be drawn and those in memory. This makes it possible to draw a sprite and remove it again by drawing it a second time in the same position. In addition, there is a collision register that is set if a bitwise AND between the sprite line and the pixel block in memory gives a non-zero result.
NOTE: These GPU functions could also be directly implemented as a part of the instructions in the CPU, but I decided to split it into its own module
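The DRAW steps described above can be modelled in a few lines of Python. This is a behavioural sketch, not the GPU module itself: the framebuffer is a plain 256-byte array, rows are assumed to be stored one after another, and coordinates that run off the edge are assumed to wrap around, which may differ from the real implementation.

```python
BLOCKS_PER_ROW = 8
HEIGHT = 32


def draw(framebuffer, sprite, x, y):
    """XOR a sprite into the framebuffer and return True on collision.

    `sprite` is a sequence of bytes, one byte per sprite line.
    """
    collision = False
    shift = x % 8                 # how far the sprite is from being aligned
    unaligned = shift != 0
    for line_number, line in enumerate(sprite):
        row = (y + line_number) % HEIGHT
        # Place the line in a 16-bit value and shift it into position.
        word = (line << 8) >> shift
        parts = [((x // 8) % BLOCKS_PER_ROW, word >> 8)]
        if unaligned:
            parts.append(((x // 8 + 1) % BLOCKS_PER_ROW, word & 0xFF))
        for block_x, bits in parts:
            address = row * BLOCKS_PER_ROW + block_x
            if framebuffer[address] & bits:
                collision = True  # a lit pixel is about to be turned off
            framebuffer[address] ^= bits
    return collision


if __name__ == "__main__":
    fb = bytearray(256)
    print(draw(fb, [0xFF], 4, 0), fb[0], fb[1])  # spans two pixel blocks
    print(draw(fb, [0xFF], 4, 0), fb[0], fb[1])  # drawing again erases it
```

Drawing the same sprite twice at the same position restores the framebuffer, and the second call reports a collision because every pixel it touches was already lit.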
After all the modules are implemented, they have to be connected together with a top module. This is the main module which wires all the various modules together, see src/chip8.v.
Diagram of how all the physical components are connected.
So in the end, after a bit of debugging, I got everything brought up. Sure, there are a ton of ways the implementation could be improved, cleaned up and expanded, but it was an interesting project and I learned a fair share of new things while doing it. But in the end...
It is working!
A program that makes it possible to emulate software or hardware.
While people often use CHIP-8 as a way to learn how microprocessors and hardware work, CHIP-8 is not an actual microprocessor but a binary interpreted programming language which shares many concepts of how it works with microprocessors.
The value of the current was taken from http://tinyvga.com/faq/electrical/how-is-vga-terminated. I have not been able to find much info about this, but this value worked fine in practice.