In the morning, I'd been thinking about how to accelerate the writing of data to RAM, and I came up with two ideas: exposing some of the PicoBlaze's registers directly to the rest of the FPGA, and bulk writing.
The idea behind the first is that, as I was using sC to sE for the address, I could eliminate the OUT commands that wrote the address to another latch, which meant that it would be faster. This meant that the code would change from:
    OUT addr0, lowaddr
    OUT addr1, midaddr
    OUT addr2, highaddr
    OUT colour, ocolour
to:
    OUT colour, ocolour
lowaddr, midaddr, highaddr and ocolour are output ports, and addr[0..2] and colour are registers.
My initial attempt to do this worked fine in behavioural simulation, but I found that it was inadequate for a real device (or a real simulation). The reason was that, while there's a simulation register called sc_contents (for the sC register), it's only used in the behavioural simulation - under normal mode, the registers are held in 8 RAM16X1D primitives.
To get around this, I needed to delve a bit into how the registers were being set - and I decided that I needed to add an 8-bit D-type flip-flop for each register I needed, with its enable pin asserted whenever that register was being written to. Luckily this wasn't too complicated.
So, the changes I made to the kcpsm3.v code were as follows (the additions are the sc_value, sd_value, se_value and sf_value ports, and the block marked "Extensions by Jason Tribbeck"):
In the module declaration:
    module kcpsm3(
        address, instruction, port_id, write_strobe, out_port, read_strobe,
        in_port, interrupt, interrupt_ack, reset,
        sc_value, sd_value, se_value, sf_value,
        clk) ;
    input interrupt, reset, clk ;
    output [7:0] sc_value, sd_value, se_value, sf_value;
    //
    ////////////////////////////////////////////////////////////////////////////////////
    //
    // Start of Main Architecture for KCPSM3
Before the simulation:
    FDRE stack_count_loop_register_bit_4 (
        .D(next_stack_address),
        .Q(stack_address),
        .R(internal_reset),
        .CE(not_active_interrupt),
        .C(clk));

    // Extensions by Jason Tribbeck
    wire sc_enable;
    wire sd_enable;
    wire se_enable;
    wire sf_enable;

    assign sc_enable = (instruction[11:8] == 4'hc) && (register_enable);
    assign sd_enable = (instruction[11:8] == 4'hd) && (register_enable);
    assign se_enable = (instruction[11:8] == 4'he) && (register_enable);
    assign sf_enable = (instruction[11:8] == 4'hf) && (register_enable);

    latch_8 rsc ( .d(alu_result), .q(sc_value), .en(sc_enable), .clk(clk), .set(1'b0), .reset(reset) );
    latch_8 rsd ( .d(alu_result), .q(sd_value), .en(sd_enable), .clk(clk), .set(1'b0), .reset(reset) );
    latch_8 rse ( .d(alu_result), .q(se_value), .en(se_enable), .clk(clk), .set(1'b0), .reset(reset) );
    latch_8 rsf ( .d(alu_result), .q(sf_value), .en(sf_enable), .clk(clk), .set(1'b0), .reset(reset) );
    //
    ////////////////////////////////////////////////////////////////////////////////////
    //
    // End of description for KCPSM3 macro.
The latch_8 module is declared as follows:
    module latch_8 (
        input wire [7:0] d,
        input wire en,
        input wire clk,
        input wire reset,
        input wire set,
        output wire [7:0] q
    );
        // One clock-enabled flip-flop per bit; each instance takes its own bit of d and q.
        FDCPE # (.INIT(1'b0)) b0 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[0]), .Q(q[0]));
        FDCPE # (.INIT(1'b0)) b1 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[1]), .Q(q[1]));
        FDCPE # (.INIT(1'b0)) b2 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[2]), .Q(q[2]));
        FDCPE # (.INIT(1'b0)) b3 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[3]), .Q(q[3]));
        FDCPE # (.INIT(1'b0)) b4 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[4]), .Q(q[4]));
        FDCPE # (.INIT(1'b0)) b5 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[5]), .Q(q[5]));
        FDCPE # (.INIT(1'b0)) b6 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[6]), .Q(q[6]));
        FDCPE # (.INIT(1'b0)) b7 ( .CLR(reset), .PRE(set), .CE(en), .C(clk), .D(d[7]), .Q(q[7]));
    endmodule
I've written these down here, in case anyone else needs this.
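For anyone who wants to check the idea before touching the Verilog, here is a rough behavioural model in Python of what each shadow register is meant to do: capture the ALU result on a clock edge only when the instruction's register field (bits 11 to 8) selects it and a register write is taking place. The class name and the example instruction words are made up for illustration; only bits 11 to 8 of the instruction matter to the model.

```python
class ShadowRegister:
    """Behavioural model of one clock-enabled 8-bit shadow register."""

    def __init__(self, index):
        self.index = index   # 0xC for sC, 0xD for sD, and so on
        self.value = 0       # cleared, as after reset

    def clock(self, instruction, register_enable, alu_result):
        """Advance one clock edge; capture alu_result only when enabled."""
        selected = ((instruction >> 8) & 0xF) == self.index
        if selected and register_enable:
            self.value = alu_result & 0xFF
        return self.value


rsc = ShadowRegister(0xC)
rsc.clock(0x00C12, True, 0x12)    # write targeting sC: captured
rsc.clock(0x00D34, True, 0x34)    # write targeting sD: ignored by rsc
rsc.clock(0x00C56, False, 0x56)   # sC selected but no register write: ignored
print(hex(rsc.value))             # 0x12
```

This is only a sanity check of the enable logic, not a substitute for simulating the real design.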
That didn't make a huge difference to the speed of plotting the Aeon Sportscars logo, so I went on to the next stage - bulk writing.
A lot of what I need to do is plotting effectively the same colour as a horizontal stripe. If 8 pixels are to be written, then the PicoBlaze needs to perform the following operation 8 times:
    OUT colour, ocolour
    ADD addr0, $01
    ADDC addr1, $00
    ADDC addr2, $00
This could be rather tedious if there's a lot of data to be written (in addition, the CPU runs at 1/4 the speed of the memory).
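To put a number on that, the per-pixel sequence above can be modelled in a few lines of Python. This is just a sketch: it treats the three address registers as 8-bit values with a carry rippling between them, and counts four instructions per pixel (the OUT plus the three adds). The function names and the start address are made up for the example.

```python
INSTRUCTIONS_PER_PIXEL = 4  # OUT + ADD + ADDC + ADDC


def step_address(addr0, addr1, addr2):
    """Model ADD addr0,$01 / ADDC addr1,$00 / ADDC addr2,$00 on 8-bit registers."""
    total = addr0 + 1
    addr0, carry = total & 0xFF, total >> 8
    total = addr1 + carry
    addr1, carry = total & 0xFF, total >> 8
    addr2 = (addr2 + carry) & 0xFF
    return addr0, addr1, addr2


def plot_stripe(start, pixels):
    """Return (final 24-bit address, instructions executed) for one stripe."""
    regs = (start & 0xFF, (start >> 8) & 0xFF, (start >> 16) & 0xFF)
    for _ in range(pixels):
        regs = step_address(*regs)
    final = regs[0] | (regs[1] << 8) | (regs[2] << 16)
    return final, pixels * INSTRUCTIONS_PER_PIXEL


final, executed = plot_stripe(0x00FFFE, 8)
print(hex(final), executed)
```

Starting two bytes below a 64K boundary shows the carry rippling into the top register, and an 8-pixel stripe costs 32 instructions before any loop overhead.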
Adding bulk write capability would do two things:
I opted for a maximum of 8 pixels to be plotted at the same time - this would lend itself nicely to using SDRAM in a later version. The length is held in the sF register (which I'd already exposed to the main VGA module).
This code was quite tricky to get right on both the FPGA and the PicoBlaze sides. I used some tricks I'd picked up from ARM programming for reducing instruction count - for example, if you have a value in register s0 that you want to step downwards, and stop under 8, you could do this:
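    loop: SUB s0, $08
          COMP s0, $08      ; carry is set when s0 < 8
          JUMP NC, loop     ; keep going while s0 >= 8
This is all well and good, but means you get 3 instructions per iteration. Instead, this code is better:
          SUB s0, $08
    loop: SUB s0, $08       ; carry is set when the subtraction underflows
          JUMP NC, loop     ; keep going until it does
          ADD s0, $08       ; undo the final overshoot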
This adds an additional instruction, but means you only have 2 per loop - very important if you've got quite a high number to count down from.
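The saving is easy to check with a small Python model of the two loops. It is a sketch under a few assumptions: s0 is an 8-bit register that starts at 8 or more, the first loop exits once s0 is under 8, and the second loop keeps jumping until the subtraction borrows. The function names are mine.

```python
def countdown_compare(s0):
    """SUB / COMP / JUMP each time round: three instructions per iteration."""
    executed = 0
    while True:
        s0 = (s0 - 8) & 0xFF      # SUB s0, $08
        executed += 3             # SUB + COMP + JUMP
        if s0 < 8:                # assumed exit test: stop once under 8
            return s0, executed


def countdown_borrow(s0):
    """SUB / JUMP in the loop, with one SUB before it and one ADD after it."""
    s0 = (s0 - 8) & 0xFF          # leading SUB s0, $08
    executed = 1
    while True:
        borrowed = s0 < 8         # this SUB is about to underflow
        s0 = (s0 - 8) & 0xFF      # SUB s0, $08
        executed += 2             # SUB + JUMP
        if borrowed:
            break
    s0 = (s0 + 8) & 0xFF          # ADD s0, $08 undoes the final overshoot
    return s0, executed + 1


print(countdown_compare(200))     # (0, 75)
print(countdown_borrow(200))      # (0, 52)
```

Both versions leave the same remainder in s0, but counting down from 200 takes 75 instructions with the compare and 52 without it. For small starting values the two come out about the same, which is why it only matters for high counts.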
Anyway, even with those additions it didn't seem to make much difference either - which leads me to think that the bottleneck is the SPI bus. The PIC is running at 32MHz, and the SPI bus is running at FOSC/4, which means 8MHz. At 8 bits per byte, that's at most 1MB per second (assuming zero overhead, although there is some).
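The arithmetic behind that ceiling is short enough to write down in Python. The 100,000-byte payload below is a made-up figure purely to give a feel for the timing; it is not the size of the logo data.

```python
FOSC_HZ = 32_000_000               # PIC oscillator frequency
SPI_CLOCK_HZ = FOSC_HZ // 4        # SPI clock at FOSC/4
BITS_PER_BYTE = 8

# Best case: one bit per SPI clock and no gaps between bytes.
max_bytes_per_second = SPI_CLOCK_HZ // BITS_PER_BYTE


def transfer_seconds(byte_count):
    """Lower bound on the time to move byte_count bytes over the SPI bus."""
    return byte_count / max_bytes_per_second


print(SPI_CLOCK_HZ, max_bytes_per_second)
print(transfer_seconds(100_000))   # hypothetical payload size
```

Any per-byte overhead on the PIC side only pushes the real figure further below this bound.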
However, these optimisations are important if I decide to go down the 16-bit route, and hopefully have a parallel interface (I was 2 pins short for a parallel interface - although I could've possibly multiplexed some of the configuration pins).
I may also consider adding an SPI memory chip onto the FPGA, and getting the PicoBlaze to initialise the screen memory from there on first boot-up (I don't think there's enough space on the current chip for this logo).
© Copyright 1997-2013
Tribbeck.com / Jason Tribbeck
All trademarks are the property of their respective owners.