Zilog Z80 assembly code optimization


In general, programs can be made to run substantially faster only by first
determining where they spend their time. This requires determining which loops
(other than delay routines) the processor is executing most often. Reducing the
execution time of a frequently executed loop will have a major effect because
of the multiplying factor. It is thus critical to determine how often
instructions are being executed and to work on loops in the order of their
frequency of execution.

Once it is determined which loops the processor executes most frequently,
reduce their execution time with the following techniques:

Eliminate redundant operations. These may include a constant that is being
added during each iteration or a special case that is being tested repeatedly.

Another example is a constant value or a memory address that is being fetched
from memory each time rather than being stored in a register or register pair.
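
As a minimal sketch of this point (the labels buffer and offset are
hypothetical, and the loop length of 16 is arbitrary), the first loop fetches a
constant from memory on every pass, while the second loads it into C once:

    ; slow: the constant is fetched from memory on every iteration
            ld   hl, buffer
            ld   b, 16
    slow:   ld   a, (offset)     ; 13 T-states, every time round
            add  a, (hl)         ; 7 T-states
            ld   (hl), a
            inc  hl
            djnz slow

    ; faster: the constant is fetched once and kept in C
            ld   hl, buffer
            ld   b, 16
            ld   a, (offset)
            ld   c, a
    fast:   ld   a, (hl)         ; 7 T-states
            add  a, c            ; 4 T-states
            ld   (hl), a
            inc  hl
            djnz fast

The second loop saves 9 T-states per iteration at the cost of tying up C.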

Reorganize the loop to reduce the number of jump instructions. You can often
eliminate branches by changing the initial conditions, inverting the order of
operations, or combining operations. In particular, you may find it helpful to
initialize everything one step back, thus making the first iteration the same
as all the others.

Inverting the order of operations can be helpful if numerical comparisons are
involved, since the equality case may not have to be handled separately.
Reorganization may also combine condition checking inside the loop with the
overall loop control. 
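
One way to read "changing the initial conditions" is to count down to zero
instead of up to a limit, so that the decrement itself sets the flag the loop
tests. The sketch below assumes the loop body does not need an ascending
index; buffer is a hypothetical label.

    ; counting up: an explicit compare is needed on every pass
            ld   hl, buffer
            ld   c, 0
    up:     ld   (hl), 0
            inc  hl
            inc  c               ; 4 T-states
            ld   a, c            ; 4
            cp   10              ; 7
            jr   nz, up          ; 12 when taken

    ; counting down: DEC sets the zero flag, no compare needed
            ld   hl, buffer
            ld   c, 10
    down:   ld   (hl), 0
            inc  hl
            dec  c               ; 4 T-states
            jr   nz, down        ; 12 when taken

Both loops run their body ten times; the loop control costs 27 T-states per
pass in the first version and 16 in the second.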

If you call a function only once, use inline code rather than a function. This
will save a CALL and a RET (27 T-states together). Also make very small
functions macros rather than normal functions.

Try to take maximum advantage of specialized instructions such as
LD HL,(ADDR); LD (ADDR),HL; EX DE,HL; EX (SP),HL; DJNZ; and the block
move/compare instructions by organizing the registers in the right way. Thus it
is preferable to always use B or BC for a counter, HL for an indirect address,
and DE for another indirect address if needed.
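
A small sketch of this register convention, summing a run of bytes (table and
count are hypothetical names; count is assumed to be in the range 1 to 255):

    ; A = sum of count bytes starting at table (8-bit sum, wraps around)
            ld   hl, table       ; HL as the indirect address
            ld   b, count        ; B as the counter, so DJNZ can be used
            xor  a
    loop:   add  a, (hl)
            inc  hl
            djnz loop            ; 2 bytes, 13 T-states when taken

With the counter in C instead, the loop would end in DEC C followed by
JR NZ, which takes 3 bytes and 16 T-states per pass.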

Use the block move, block compare, and block I/O instructions to handle blocks
of data. These instructions can replace an entire program sequence, since they
combine counting and updating of pointers with the actual data manipulation or
transfer operations. Note, in particular, that the block move and block I/O
instructions transfer data to or from memory without using the accumulator.
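
For illustration, here is a hand-written copy loop next to its LDIR
equivalent. The labels src, dst and len are hypothetical, and len is assumed
to be non-zero (with BC = 0, LDIR copies 65536 bytes).

    ; hand-written copy: 52 T-states per byte, A is destroyed
            ld   hl, src
            ld   de, dst
            ld   bc, len
    copy:   ld   a, (hl)         ; 7
            ld   (de), a         ; 7
            inc  hl              ; 6
            inc  de              ; 6
            dec  bc              ; 6
            ld   a, b            ; 4
            or   c               ; 4
            jr   nz, copy        ; 12 when taken

    ; block move: 21 T-states per byte (16 for the last one), A untouched
            ld   hl, src
            ld   de, dst
            ld   bc, len
            ldir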

Use the 16-bit instructions whenever possible to manipulate 16-bit data. These
instructions are ADC, ADD, DEC, EX, INC, LD, POP, PUSH, and SBC.
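
As a quick comparison, adding DE to HL one byte at a time versus using the
16-bit form:

    ; 8 bits at a time: 6 bytes, 24 T-states, A is destroyed
            ld   a, l
            add  a, e
            ld   l, a
            ld   a, h
            adc  a, d
            ld   h, a

    ; 16-bit instruction: 1 byte, 11 T-states, A is preserved
            add  hl, de

Note that the two versions leave the flags in different states: ADD HL,DE
does not change the zero or sign flags.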

Use instructions that operate directly on data in user registers or in memory
to avoid having to save and restore the accumulator, HL, or an index register.
These instructions include DEC, EX, INC, LD, POP, PUSH, and the bit
manipulation and shift instructions.
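
For example, incrementing a byte in memory that HL points to:

    ; through the accumulator: 3 bytes, 18 T-states, A is destroyed
            ld   a, (hl)         ; 7
            inc  a               ; 4
            ld   (hl), a         ; 7

    ; directly in memory: 1 byte, 11 T-states, A is preserved
            inc  (hl)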

Minimize the use of the index registers, since they always require extra
execution time and memory. The index registers are generally used only as
backups to HL and in handling data structures that involve many fixed offsets.

Minimize the use of special Z80 instructions that require a 2-byte operation
code. These always require extra execution time and memory. Examples are BIT,
RES, SET, SLA, SRA, and SRL, as well as some load instructions such as
LD DE,(ADDR); LD (ADDR),BC; and LD SP,(ADDR).

Take advantage of specialized short instructions such as the accumulator
shifts (RLA, RLCA, RRA, and RRCA) and DJNZ.
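
To make the difference concrete, both of the following rotate A left by one
bit, but the general form needs a 2-byte operation code:

            rlc  a               ; CB 07: 2 bytes, 8 T-states
            rlca                 ; 07:    1 byte,  4 T-states

The short form is not a drop-in replacement when the flags matter: RLCA
leaves the sign, zero and parity flags unchanged, while RLC A sets them from
the result.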

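Use absolute jumps (JP) rather than relative jumps (JR) when speed matters. The
absolute jumps take less time if a branch actually occurs: a JP always takes
10 T-states, while a JR takes 12 T-states when taken (7 when not taken) but is
one byte shorter.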

Organize sequences of conditional jumps to minimize average execution time.
Branches that are often taken should come before ones that are seldom taken,
for example, checking for a result being negative (true 50% of the time if the
value is random) before checking for it to be zero (true less than 1% of the
time if the value is random).
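
A sketch of that ordering, with the value to be tested in A (negative and zero
are hypothetical labels for the two handlers):

            or   a               ; set the sign and zero flags from A
            jp   m, negative     ; likely case is tested first
            jp   z, zero         ; rare case is tested second
            ; positive, non-zero values continue here

With the tests in this order, negative values leave the sequence after a
single jump instruction instead of two.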

Test for conditions under which a sequence has no effect and branch around it
if the conditions hold. This will be profitable if the sequence is long, and it
frequently does not change the result. A typical example is the propagation of
carries through higher-order bytes. If a carry seldom occurs, it will be faster
on the average to test for it rather than simply propagate a 0.
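
As an illustration, here is one possible way to add the 8-bit value in A to a
24-bit little-endian number that HL points to, leaving early when there is no
carry into the higher bytes. This is only a sketch; note that HL is left
pointing at a different byte depending on where the routine exits.

            add  a, (hl)         ; add to the low byte
            ld   (hl), a         ; LD does not change the flags
            jr   nc, done        ; usually there is no carry: finished
            inc  hl
            inc  (hl)            ; carry into the middle byte
            jr   nz, done        ; it did not wrap to 0: finished
            inc  hl
            inc  (hl)            ; carry into the high byte
    done:

In the common case only three instructions execute. The straightforward
version would load, ADC A,0 and store each of the two upper bytes every time.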

A general way to reduce execution time is to replace long sequences of
instructions with tables. A single table lookup can perform the same operation
as a sequence of instructions if there are no special exits or program logic
involved. The cost is extra memory, but that may be justified if the memory is
available. If enough memory is available, a lookup table may be a reasonable
approach even if many of its entries are repetitive, that is, even if many
inputs produce the same output. In addition to its speed, table lookup is
also general, easy to program, and easy to change.
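
A minimal sketch of a table lookup, returning the square of a small number.
The routine name, the table and its range (0 to 15) are chosen only for
illustration, and the db directive is an assumption: some assemblers spell it
defb or .db.

    ; in:  A = value in the range 0..15
    ; out: A = value squared; DE and HL are destroyed
    square: ld   hl, squares
            ld   e, a
            ld   d, 0
            add  hl, de          ; HL = address of table entry
            ld   a, (hl)
            ret

    squares:
            db   0, 1, 4, 9, 16, 25, 36, 49
            db   64, 81, 100, 121, 144, 169, 196, 225

The lookup takes the same time for every input and replaces a multiplication
loop with sixteen bytes of data.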

Now for the even more practical approach:

The fewer bytes an instruction uses, the faster it generally executes. So
always look for a better way to do things. Note, however, that this might go
hand in hand with some disadvantages. Here are some examples:

Instead of ...        ... you write           Disadvantages?

 ld a, 0               sub a or xor a          flags are modified

 cp 0                  and a or or a           none for Z, S and carry
                                               (H, N and P/V differ)

 cp 1                  dec a                   A is modified,
                                               carry is not affected

 cp 255                inc a                   A is modified,
                                               carry is not affected

 srl a                 rrca                    not exactly the same effect
                                               (bit 0 rotates into bit 7)

 ld hl, ...            ld hl, ...              zero flag not affected,
 ld de, ...            ld de, -...             carry is inverted
 or a                  add hl, de
 sbc hl, de

 dec bc                cpi                     increments HL
 ld a,b                ret po
 or c
 ret z

Try, if possible, to use the shadow registers in frequently used loops. You can
reach them with the instructions EXX and EX AF,AF'. Note, however, that you
should do a DI to disable interrupts before the actual function and an EI
afterwards, since an interrupt routine may use the shadow registers itself.
Avoid having interrupts switched off all the time.
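
A sketch of the pattern, clearing a hypothetical 32-byte buffer without
disturbing the main BC, DE and HL:

            di                   ; the interrupt routine may use the shadows
            exx                  ; BC, DE, HL now name the alternate set
            ld   hl, buffer
            ld   b, 32
    clear:  ld   (hl), 0
            inc  hl
            djnz clear
            exx                  ; main BC, DE, HL are back, unchanged
            ei

The two EXX instructions cost 8 T-states in total, compared with 63 T-states
for pushing and popping three register pairs.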

Pass arguments to functions in registers, NOT by PUSHing/POPing or even
through variables in memory!

Keep often used variables, like the position of the main character, in
registers (optimally in those registers you have to pass to the draw function
as coordinates).

Avoid excessive mode switching using the IY register. Each switch costs 4
bytes, since a SET or RES on (IY+d) is a 4-byte instruction!

Take advantage of ROM functions / built-in functions as often as possible. This
not only saves you from coding them, it also saves storage, and they are likely
to be highly optimized.

If possible, use self-modifying code (only possible when code is in RAM) for
less code and faster execution.
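
One common form of self-modifying code is to write a value directly into the
operand byte of a later instruction. The sketch below assumes the code runs
from RAM; speed and buffer are hypothetical labels.

            ld   a, (speed)
            ld   (patch+1), a    ; overwrite the operand of ADD A,n
            ld   hl, buffer
            ld   b, 16
    loop:   ld   a, (hl)
    patch:  add  a, 0            ; the 0 is replaced at run time
            ld   (hl), a
            inc  hl
            djnz loop

Inside the loop, ADD A,n takes 7 T-states and needs no spare register,
whereas re-reading the variable with LD A,(speed) would take 13 T-states on
every pass.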

In certain circumstances (where you can disable all interrupts and don't have
to make any calls or other use of the stack), an extremely quick way of
retrieving data from memory is to set the stack pointer SP to the start of the
data and then POP bytes into a register pair. The POP takes 10 T-states and
gets 2 bytes at once, compared with 16 T-states for a LD HL,(nnn). It works
especially well in reading contiguous buffer data, because SP advances by
itself, while with LD HL,(...) you also need to update the load address for
each word, which makes POP several times faster overall.
(This of course also works with PUSHing to store data.)
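
A minimal sketch of the technique, reading six consecutive bytes into HL, DE
and BC. The labels buffer and saved_sp are hypothetical, and the dw directive
is an assumption (some assemblers spell it defw or .dw).

            di                   ; nothing else may touch the stack now
            ld   (saved_sp), sp  ; remember the real stack pointer
            ld   sp, buffer
            pop  hl              ; L = buffer[0], H = buffer[1]
            pop  de              ; E = buffer[2], D = buffer[3]
            pop  bc              ; C = buffer[4], B = buffer[5]
            ld   sp, (saved_sp)  ; restore the real stack pointer
            ei
            ret

    saved_sp:
            dw   0

Saving and restoring SP costs 40 T-states, so the trick pays off only when
enough data is read in one go.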

author: unknown TI-calculators coder