Welcome to Tribbeck.com


Acceleration

In the morning, I'd been thinking about how to accelerate the writing of data to RAM, and I came up with two ideas:

  1. Getting direct access to the PicoBlaze registers
  2. Performing block writes

Direct access

I was already using registers sC to sE to hold the address, so if the FPGA logic could read those registers directly, I could eliminate the OUT commands that copied the address into a separate latch, which would be faster. The code to write a pixel would change from:

    OUT     addr0, lowaddr
    OUT     addr1, midaddr
    OUT     addr2, highaddr
    OUT     colour, ocolour

To simply:

    OUT     colour, ocolour

lowaddr, midaddr, highaddr and ocolour are output ports, and addr[0..2] and colour are registers.

My initial attempt worked fine in behavioural simulation, but I found that it was inadequate for the real device (or a realistic simulation of it). The reason is that while there's a register called sc_contents (for the sC register), it only exists in the behavioural simulation. In the real design, the registers are held in 8 RAM16X1D primitives, each of which stores one bit of all 16 registers, so there is no per-register signal to tap.

To get around this, I had to delve a bit into how the registers are written. I decided to add an 8-bit D-type flip-flop for each register I wanted to expose, with its enable pin asserted whenever that register is being written to. Luckily this wasn't too complicated.

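So, the changes I made to the kcpsm3.v code were as follows (the additions are the sc_value to sf_value ports and everything under the "Extensions by Jason Tribbeck" comment; the rest is existing code shown for context):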

In the module declaration:

module kcpsm3(
  address,
  instruction,
  port_id,
  write_strobe,
  out_port,
  read_strobe,
  in_port,
  interrupt,
  interrupt_ack,
  reset,
  sc_value,
  sd_value,
  se_value,
  sf_value,
  clk) ;

I/O declaration:

input        interrupt, reset, clk ;
output [7:0] sc_value, sd_value, se_value, sf_value;
//
////////////////////////////////////////////////////////////////////////////////////
//
// Start of Main Architecture for KCPSM3

Before the simulation-only section (the FDRE instance is existing code, shown for context):

 FDRE stack_count_loop_register_bit_4 ( 
 .D(next_stack_address[4]),
 .Q(stack_address[4]),
 .R(internal_reset),
 .CE(not_active_interrupt),
 .C(clk));

// Extensions by Jason Tribbeck
wire sc_enable;
wire sd_enable;
wire se_enable;
wire sf_enable;

assign sc_enable = (instruction[11:8] == 4'hc) && (register_enable);
assign sd_enable = (instruction[11:8] == 4'hd) && (register_enable);
assign se_enable = (instruction[11:8] == 4'he) && (register_enable);
assign sf_enable = (instruction[11:8] == 4'hf) && (register_enable);

latch_8 rsc
(
  .d(alu_result),
  .q(sc_value),
  .en(sc_enable),
  .clk(clk),
  .set(1'b0),
  .reset(reset)
);

latch_8 rsd
(
  .d(alu_result),
  .q(sd_value),
  .en(sd_enable),
  .clk(clk),
  .set(1'b0),
  .reset(reset)
);

latch_8 rse
(
  .d(alu_result),
  .q(se_value),
  .en(se_enable),
  .clk(clk),
  .set(1'b0),
  .reset(reset)
);

latch_8 rsf
(
  .d(alu_result),
  .q(sf_value),
  .en(sf_enable),
  .clk(clk),
  .set(1'b0),
  .reset(reset)
);
//
////////////////////////////////////////////////////////////////////////////////////
//
// End of description for KCPSM3 macro.

The latch_8 module is declared as follows:

module latch_8
(
  input wire [7:0] d,
  input wire en,
  input wire clk,
  input wire reset,
  input wire set,
  output wire [7:0] q
);

FDCPE # (.INIT(0)) b0 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[0]),
  .Q(q[0]));

FDCPE # (.INIT(0)) b1 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[1]),
  .Q(q[1]));

FDCPE # (.INIT(0)) b2 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[2]),
  .Q(q[2]));

FDCPE # (.INIT(0)) b3 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[3]),
  .Q(q[3]));

FDCPE # (.INIT(0)) b4 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[4]),
  .Q(q[4]));

FDCPE # (.INIT(0)) b5 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[5]),
  .Q(q[5]));

FDCPE # (.INIT(0)) b6 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[6]),
  .Q(q[6]));

FDCPE # (.INIT(0)) b7 (
  .CLR(reset),
  .PRE(set),
  .CE(en),
  .C(clk),
  .D(d[7]),
  .Q(q[7]));

endmodule

I've written these down here in case anyone else needs them.

That didn't make a huge difference to the speed of plotting the Aeon Sportscars logo, so I went on to the next stage: block writing.
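
To make the behaviour of these additions easier to check, here is a minimal Python sketch of the four shadow registers. It is only a model under assumptions: the function name clock_edge and the dictionary representation are mine and not part of kcpsm3.v, and the asynchronous clear is simplified so that it acts at the clock edge.

```python
# Behavioural model of the shadow registers for sC..sF.
# Signal names (instruction, register_enable, alu_result) mirror the Verilog;
# everything else here is illustrative.

SHADOWED = (0xC, 0xD, 0xE, 0xF)  # sC, sD, sE, sF


def clock_edge(shadow, instruction, register_enable, alu_result, reset=False):
    """Return the shadow-register values after one rising clock edge.

    shadow          -- dict mapping register number (0xC..0xF) to an 8-bit value
    instruction     -- 18-bit instruction word
    register_enable -- True when the core writes a register this cycle
    alu_result      -- 8-bit value being written
    """
    if reset:
        # FDCPE CLR is asynchronous in hardware; modelled at the edge here.
        return {reg: 0 for reg in SHADOWED}

    target = (instruction >> 8) & 0xF  # instruction[11:8] selects sX
    updated = dict(shadow)
    if register_enable and target in SHADOWED:
        updated[target] = alu_result & 0xFF
    return updated


regs = {reg: 0 for reg in SHADOWED}
regs = clock_edge(regs, 0xC << 8, True, 0x12)   # write to sC is captured
regs = clock_edge(regs, 0x3 << 8, True, 0x99)   # write to s3 is ignored
regs = clock_edge(regs, 0xD << 8, False, 0x55)  # no register write: sD holds
```

Only the register-select field of the instruction is filled in for these calls, since that is all the enable logic looks at.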

Block writing

A lot of what I need to plot is effectively a horizontal stripe of a single colour. If 8 pixels are to be written, then the PicoBlaze needs to perform the following operation 8 times:

    OUT     colour, ocolour
    ADD     addr0, $01
    ADDC    addr1, $00
    ADDC    addr2, $00

This gets rather slow if there's a lot of data to be written (in addition, the CPU runs at 1/4 the speed of the memory).

Adding block write capability would do two things:
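
As a rough illustration of the cost, the following Python sketch models the four-instruction sequence above, with the 24-bit address split across three 8-bit registers and the carry rippling through ADD and ADDC. The function names and the dictionary standing in for the video RAM are assumptions made for this example only.

```python
def add8(value, operand, carry_in=0):
    """8-bit add returning (result, carry_out), as ADD/ADDC would."""
    total = value + operand + carry_in
    return total & 0xFF, total >> 8


def plot_run(memory, addr, colour, count):
    """Write `colour` to `count` consecutive addresses, one pixel per pass.

    Returns (next address, number of instructions executed).
    """
    addr0, addr1, addr2 = addr & 0xFF, (addr >> 8) & 0xFF, (addr >> 16) & 0xFF
    instructions = 0
    for _ in range(count):
        memory[(addr2 << 16) | (addr1 << 8) | addr0] = colour  # OUT  colour, ocolour
        addr0, carry = add8(addr0, 0x01)                       # ADD  addr0, $01
        addr1, carry = add8(addr1, 0x00, carry)                # ADDC addr1, $00
        addr2, carry = add8(addr2, 0x00, carry)                # ADDC addr2, $00
        instructions += 4
    return (addr2 << 16) | (addr1 << 8) | addr0, instructions


video_ram = {}
next_addr, executed = plot_run(video_ram, 0x00FFFE, 0x1F, 8)
```

For an 8-pixel stripe this executes 32 instructions. The PicoBlaze takes two clock cycles per instruction, so that is 64 CPU clocks for 8 pixels of identical data.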

  1. Speed up the writing of horizontal lines
  2. Reduce the impact on the write FIFO

I opted for a maximum of 8 pixels to be plotted at the same time - this would lend itself nicely to using SDRAM in a later version. The length is held in the sF register (which I'd already exposed to the main VGA module).

This code was quite tricky to get right on both the FPGA and the PicoBlaze side. I used some tricks I'd picked up from ARM programming for reducing instruction count - for example, if you have a value in register s0 (8 or more) that you want to step downwards in eights, stopping once it is under 8, you could do this:

loop:
    SUB      s0, $08
    COMP     s0, $08
    JUMP     NC, loop

COMP sets the carry flag when s0 is less than the constant, so the loop continues on no-carry. This is all well and fine, but means you get 3 instructions per iteration. Instead, this code is better:

    SUB      s0, $08
loop:
    SUB      s0, $08
    JUMP     NC, loop
    ADD      s0, $08

SUB sets the carry flag when it borrows, so the loop runs until s0 goes below zero, and the final ADD puts back the 8 that was taken off once too often. This adds a couple of instructions outside the loop, but means you only have 2 per iteration - very important if you've got quite a high number to count down from.
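
The two loop shapes can be compared with a small Python model. It assumes the PicoBlaze behaviour that SUB and COMP set the carry flag on a borrow, that each loop keeps going while there is no borrow, and that s0 starts at 8 or more. The function names are made up for this sketch.

```python
def sub8(value, operand):
    """8-bit subtract returning (result, carry); carry is set on borrow."""
    return (value - operand) & 0xFF, value < operand


def three_instruction_loop(s0):
    """SUB / COMP / JUMP version: returns (remainder, instructions executed)."""
    executed = 0
    while True:
        s0, _ = sub8(s0, 8)        # SUB  s0, $08
        _, carry = sub8(s0, 8)     # COMP s0, $08 (flags only)
        executed += 3              # SUB, COMP and the JUMP
        if carry:                  # s0 is now under 8: fall out of the loop
            return s0, executed


def two_instruction_loop(s0):
    """Pre-subtract, then SUB / JUMP, then one ADD to undo the overshoot."""
    s0, _ = sub8(s0, 8)            # SUB  s0, $08 (before the loop)
    executed = 1
    while True:
        s0, carry = sub8(s0, 8)    # SUB  s0, $08
        executed += 2              # SUB and the JUMP
        if carry:                  # borrow: gone below zero
            break
    s0 = (s0 + 8) & 0xFF           # ADD  s0, $08
    return s0, executed + 1


print(three_instruction_loop(200))  # (0, 75)
print(two_instruction_loop(200))    # (0, 52)
```

Both versions leave the same remainder in s0. For a count of 8n the first costs 3n instructions and the second 2n + 2, so the saving only shows up once there are more than two iterations.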

Anyway, with those additions, it didn't seem to make that much difference either - which leads me to think that the bottleneck is the SPI bus. The PIC is running at 32 MHz, and the SPI bus is running at FOSC/4, which means 8 MHz. So bytes are being transmitted at 1 MHz at best, i.e. 1 MB/s (assuming zero overhead, although in practice there is overhead).
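
The SPI arithmetic can be written out as a few lines of Python. The clock figures are the ones quoted above; the helper names are hypothetical, and the result is a best case that ignores all protocol and software overhead.

```python
FOSC_HZ = 32_000_000          # PIC oscillator frequency
SPI_CLOCK_HZ = FOSC_HZ // 4   # SPI bit clock at FOSC/4 = 8 MHz
BITS_PER_BYTE = 8


def best_case_bytes_per_second():
    """Upper bound on SPI throughput with zero overhead."""
    return SPI_CLOCK_HZ // BITS_PER_BYTE


def best_case_transfer_seconds(n_bytes):
    """Minimum time to push n_bytes over the SPI link."""
    return n_bytes / best_case_bytes_per_second()
```

For example, best_case_transfer_seconds(1000) comes out at one millisecond, however quickly the FPGA can consume the data.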

However, these optimisations are important if I decide to go down the 16-bit route, and hopefully have a parallel interface (I was 2 pins short for a parallel interface - although I could've possibly multiplexed some of the configuration pins).

I may also consider adding an SPI memory chip onto the FPGA, and getting the PicoBlaze to initialise the screen memory from there on first boot-up (I don't think there's enough space on the current chip for this logo).


Updated: 2011-06-08 20:41:21
© Copyright 1997-2013
Tribbeck.com / Jason Tribbeck
All trademarks are the property of their respective owners.