Cloe v3

Just as Cloe v2 was a major enhancement to Cloe v1, Cloe v3 is an enhancement to Cloe v2. The principles of Cloe v3 are the same as those of Cloe v2, except that some fundamental bug-fixes and optimisations have been incorporated.

Bugs with Cloe v2

Only one thing was horribly wrong: SVC mode instructions could not be emulated, as calling a SWI while in SVC mode would cause R14 to become corrupted.

Cloe v3 bug fixes

Rather than using a SWI to emulate instructions, it would be better to use a mode which is not present in 26-bit mode (at least, not present in a usable form in 26-bit mode). The easiest mode to do this with is the Undefined ("und") mode.

Rather than calling a SWI, an unused co-processor number is used. Cloe claims the Undefined Instruction hardware vector, and checks to see if the instruction is one of this co-processor's instructions, much in the same way that the floating-point emulator works.

In the example given in the Cloe v2 document, this would mean that all the SWIs would be translated into some co-processor instruction.
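
As a rough illustration of the check described above, the sketch below tests whether a trapped instruction word belongs to one of Cloe's co-processors. It is written in Python purely for readability. The function names and the choice of co-processors 8 and 9 (the numbers used in the optimisations later in this document) are assumptions, and a real handler would be ARM code entered from the Undefined Instruction vector.

```python
# Assumption: Cloe claims co-processors 8 and 9, as in the optimisations below.
CLOE_COPROCESSORS = {8, 9}


def coprocessor_number(word):
    """Return the co-processor number of a co-processor instruction, else None."""
    major = (word >> 24) & 0xF
    # 1110 covers CDP/MRC/MCR; 1100 and 1101 cover LDC/STC.
    if major == 0b1110 or (major >> 1) == 0b110:
        return (word >> 8) & 0xF
    return None


def is_cloe_instruction(word):
    """True if the trapped instruction is one that Cloe should emulate."""
    return coprocessor_number(word) in CLOE_COPROCESSORS
```

Anything that is not one of Cloe's instructions (a floating-point instruction, for instance) would be passed on to the previous owner of the vector.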

Cloe v3 optimisations

  1. For small numbers of instructions, it would make sense to branch to the emulator code, rather than use the undefined instruction trap. Initially, these instructions would be:

    Inside Cloe, there would be 32 (+32) jump points, each relating to which register <rn> was, so MOVS pc,lr would become B cloe_movs_pc_lr.

    This optimisation means that:

    1. No instruction decode needs to occur
    2. No undefined instruction trap needs to occur
    3. The original instruction would not have to be stored in an array somewhere
    This would improve performance quite significantly.

  2. To improve the speed of instruction decode, many instructions could have their salient parts encoded as part of the co-processor instruction. For example, ADDNES r7,r3,pc could be encoded as MRCNE 8,0,4,7,3,0 (MRC rather than MCR, because the S bit occupies the bit that distinguishes the two). The values in the instruction are taken from: the co-processor number (8), the ALU opcode (4, for ADD), Ra (7) and Rb (3); the two zeros are the co-processor opcode fields, which are zero in this form. The 'pc' register is taken as read. So, for a generic instruction ALUxx{s} Ra,Rb,pc the following instruction table is used:

                31..28  27..24   23..20   19..16  15..12  11..8    7..4     3..0
      MRC/MCR   Condx   1 1 1 0  0 0 0 s  Ra      ALU     1 0 0 0  0 0 0 1  Rb

    A further optimisation, which would allow for ALUxx{s} Ra,Rb,Rc,<shift>#<constant>, relies on splitting the instruction further:

    <shift> is one of LSL (00), LSR (01), ASR (10) or ROR (11). For the table below, the first bit is encoded as 'T', and the second as 'U'.

    <constant> is between 0 and 31, and is bit-wise encoded as J, K, L, M and N.

    The new table looks like:

                31..28  27..24   23..20   19..16  15..12  11..8    7..4     3..0
      MRC/MCR   Condx   1 1 1 0  N T U s  Ra      ALU     1 0 0 M  L K J 1  Rb

    As you can see, it now uses two co-processors - number 8 and number 9 - because M occupies the lowest bit of the co-processor number field. However, as there are 14 co-processors that are unused, this should not present a problem.

    For example, the instruction BICGE r4,r3,pc,ASR#13 would become 2_1010 1110 0100 0100 1110 1001 1011 0011, or &AE44E9B3 (when disassembled, this becomes MCRGE CP9,2,R14,C4,C3,5).

    For some LDMFD r13!,{...}, the LDC/STC instructions could be used, and the mapping between the two is:

                31..28  27..24   23..20   19..16  15..12  11..8    7..4  3..0
      LDM/STM   Condx   1 0 0 P  U S W L  Rn      Register bit-field
      LDC/STC   Condx   1 1 0 P  U N W L  Rn      CRd     1 0 0 0  Offset

    Since APCS normally doesn't store r0-r3 on the stack, bits 0..7 would encode registers 4-11, and bits 12..15 would encode registers 12-15, as they would do in the LDM/STM instruction.

    Another advantage to these optimisations is that the original instruction also does not have to be stored in an array.

  3. To increase the amount of conversion performed in one lump, the following code could be checked for:

      CMP     Rn,#<constant>
      ADDCC   pc,pc,Rn,LSL#2
      B       <some address>
      B       <some other address>
      ; ...

    This kind of code is used to create a jump table: on entry, Rn is the number of the routine to call. It is used in SWI decode tables, typically with R11 as Rn. Cloe would add each of the addresses in the table, up to <constant>/2 instructions. Note that this may cause problems if the programmer has some data in the middle of the table, and is not expecting that particular value to be called...
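
The two encodings described under optimisation 2 can be sketched as follows. This is an illustrative Python model rather than Cloe's converter: the function and table names are made up for the example, the third operand is assumed to be pc as in the examples above, and ldm_to_ldc sets bit 26 because the ARM LDC/STC encoding has 110 in bits 27..25.

```python
CONDITIONS = {"EQ": 0, "NE": 1, "CS": 2, "CC": 3, "MI": 4, "PL": 5, "VS": 6,
              "VC": 7, "HI": 8, "LS": 9, "GE": 10, "LT": 11, "GT": 12,
              "LE": 13, "AL": 14}
ALU_OPCODES = {"AND": 0, "EOR": 1, "SUB": 2, "RSB": 3, "ADD": 4, "ADC": 5,
               "SBC": 6, "RSC": 7, "TST": 8, "TEQ": 9, "CMP": 10, "CMN": 11,
               "ORR": 12, "MOV": 13, "BIC": 14, "MVN": 15}
SHIFTS = {"LSL": 0, "LSR": 1, "ASR": 2, "ROR": 3}


def encode_alu(cond, op, s, ra, rb, shift="LSL", constant=0):
    """Encode ALU<cond>{S} Ra,Rb,pc,<shift>#<constant> as an MRC/MCR word."""
    tu = SHIFTS[shift]                          # T and U, as a two-bit value
    n, mlkj = constant >> 4, constant & 0xF     # N is bit 4; M L K J are bits 3..0
    return (CONDITIONS[cond] << 28 | 0b1110 << 24
            | n << 23 | tu << 21 | int(s) << 20
            | ra << 16 | ALU_OPCODES[op] << 12
            | 0b100 << 9 | mlkj << 5 | 1 << 4 | rb)


def decode_alu(word):
    """Recover (s, ra, alu_opcode, rb, shift_type, constant) from an encoded word."""
    constant = ((word >> 23) & 1) << 4 | (word >> 5) & 0xF
    return ((word >> 20) & 1, (word >> 16) & 0xF, (word >> 12) & 0xF,
            word & 0xF, (word >> 21) & 3, constant)


def ldm_to_ldc(word, cp=8):
    """Re-encode an LDM/STM word as LDC/STC, or return None if it cannot be."""
    reglist = word & 0xFFFF
    # Only LDM/STM words whose register list avoids r0-r3 can be mapped.
    if (word >> 25) & 7 != 0b100 or reglist & 0x000F:
        return None
    return ((word & 0xFFFF0000) | 1 << 26       # 100 -> 110 selects LDC/STC
            | (reglist & 0xF000) | cp << 8      # r12-r15 stay in bits 12..15
            | (reglist >> 4) & 0xFF)            # r4-r11 move to bits 0..7
```

With these definitions, encode_alu("GE", "BIC", False, 4, 3, "ASR", 13) gives the word &AE44E9B3 from the example above, and with the default LSL #0 the result reduces to the first, single co-processor table. Because the decoder only needs shifts and masks, the original instruction never has to be looked up.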

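The jump table check in optimisation 3 might look something like the sketch below. It is a Python illustration under stated assumptions: the function names are hypothetical, the code is held as a list of 32-bit words loaded at a known base address, and the number of table entries to follow is left to the caller. It stops at the first word that is not a branch, which is one cautious way of handling data in the middle of a table.

```python
def match_jump_table(words, index):
    """Return (rn, constant) if words[index:] starts CMP Rn,#c / ADDCC pc,pc,Rn,LSL#2."""
    if index + 1 >= len(words):
        return None
    cmp_word, add_word = words[index], words[index + 1]
    if (cmp_word & 0x0FF0F000) != 0x03500000:       # CMP Rn,#<immediate>
        return None
    rn = (cmp_word >> 16) & 0xF
    if add_word != (0x308FF100 | rn):               # ADDCC pc,pc,Rn,LSL#2
        return None
    # An ARM immediate is an 8-bit value rotated right by twice the rotate field.
    rotate, imm8 = (cmp_word >> 8) & 0xF, cmp_word & 0xFF
    constant = (imm8 >> 2 * rotate | imm8 << 32 - 2 * rotate) & 0xFFFFFFFF
    return rn, constant


def branch_target(word, address):
    """Destination of a B instruction at `address`, or None if `word` is not a B."""
    if (word >> 24) & 0xF != 0b1010:
        return None
    offset = word & 0x00FFFFFF
    if offset & 0x00800000:                         # sign-extend the 24-bit offset
        offset -= 0x01000000
    return address + 8 + (offset << 2)


def table_targets(words, base, index, count):
    """Targets of up to `count` table entries for the header at words[index]."""
    targets = []
    # pc reads as the ADDCC address + 8, so entry 0 is the third word after CMP;
    # the word in between is the branch taken when Rn is out of range.
    first = index + 3
    for i in range(first, min(first + count, len(words))):
        target = branch_target(words[i], base + 4 * i)
        if target is None:                          # data in the table: stop here
            break
        targets.append(target)
    return targets
```

Each address returned by table_targets would then be queued for conversion in the same pass.
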
Speed differences

With these optimisations, four speeds are achievable:

  1. Direct execution speed
  2. No decode speed
  3. Minimal decode speed
  4. Full decode speed

No decode speed would probably, at best (i.e. jumping to an address that has already been converted), have a 10:1 degradation in performance (i.e. 10 instructions would be performed instead of 1). Minimal decode would probably have a 20:1 degradation, and a full decode probably 75:1. Instructions which have to do a convert would add 4 to these values.

Using the same tests as before:

Number of instructions (total):  20391  100%
Number of direct execution:      18047  88.5%
Number of no-decode:               716   3.5%
Number of minimal decode:          489   2.4%
Number of full decode:            1139   5.6%

This means that the execution speed would be 89% of the speed of the processor. As this is a 'global' convert (i.e. it does not follow branches etc.), the actual figure would be higher - possibly around 92%. This is because a global convert also converts all the strings in the program, and all the data.

In order to get proper results, the convertor would have to be improved so that it actually follows code. The emulator would also have to be written, which would show the actual degradation in performance.


CLOE is © Jason Tribbeck 1998.