Pentium® Optimization Cross-Reference by Instruction

The following is a list of optimizations that may come in handy. The first column lists the instruction (or instruction sequence) to be replaced, in roughly alphabetical order.

The second column lists the CPU or CPUs to which the optimization applies; some entries are instead marked as applying to 16-bit or 32-bit code.

The third column contains one or more replacement code sequences that are faster or smaller (sometimes both) than the instruction in the first column. For a few obscure instructions, this column instead explains what the first-column instruction does.

The fourth column contains a description and/or examples.

                            replacement

instruction     CPUs        or action             description/notes

---------------------------------------------------------------------------



aad (imm8)      all        AL = AL+(AH*imm8)    If imm8 is omitted, 10 is

                           AH = 0               used. AAD is almost always

                                                slower, but only 2 bytes long.



aam (imm8)      all        AH = AL/imm8         Same as AAD.

                           AL = AL MOD imm8





add             16-bit     lea reg, [reg+reg+disp]



                                                Use LEA to add

                                                base + index + displacement

                                                Also preserves flags;

                                                for example:



                                                  add bx, 4



                                                can be replaced by:



                                                  lea  bx, [bx+4]



                                                when the flags must not

                                                be changed.









add             32-bit     lea reg, [reg+reg*scale+disp]



                                                Use LEA to add

                                                base + scaled index + disp

                                                Also preserves flags.

                                                (See previous example).

                                                The 32-bit form of LEA

                                                is much more powerful

                                                than the 16-bit version

                                                because of the scaling

                                                and because any of the 8

                                                general-purpose registers

                                                can be a base and any but

                                                ESP can be an index.



and reg, reg    Pent       test reg, reg        Use TEST instead of AND

                                                on the Pentium when only the

                                                flags are needed: fewer

                                                register conflicts give

                                                better pairing.





bswap           Pent       ror eax, 16          Pairs in U pipe, BSWAP

                                                doesn't pair.

                                                disadvantage: modifies flags

                                                (Not a direct replacement)





call dest1      286+       push offset dest2    When CALL is followed by

jmp  dest2                 jmp  dest1           a JMP, change the return

                                                address to the JMP destination.





call dest1      all        jmp  dest1           When a CALL is followed by a

ret                                             RET, the CALL can be replaced

                                                by a JMP.





cbw             386+       mov ah, 0            When you know AL < 128

                                                use MOV AH, 0 for speed.

                                                But use CBW for smaller

                                                code size.



cdq             486+       xor edx, edx         When you know EAX is not

                                                negative. Faster, better

                                                pairing.



                                                disadvantage: modifies flags





                Pent       mov edx, eax         Use when EAX may be positive

                           sar edx, 31          or negative; pairs better

                                                than CDQ.



cmp mem, reg    286        cmp reg, mem         reg, mem is 1 cycle faster



cmp reg, mem    386        cmp mem, reg         mem, reg is 1 cycle faster







dec reg16                  lea reg16, [reg16 - 1]   Use to preserve flags

                                                    for BX, BP, DI, SI



dec reg32                  lea reg32, [reg32 - 1]   Use to preserve flags

                                                    for EAX, EBX, ECX, EDX

                                                        EDI, ESI, EBP



div <op>         8088       shr accum, 1         When <op>  resolves to 2, use

                                                shift for division.

                                                (use CL for 4, 8, etc.)



div <op>         186+       shr accum, n         When <op>  resolves to a power

                                                of 2 use shifts for division.



enter imm16, 0  286+       push bp              ENTER is always slower

                           mov  bp, sp          and is 4 bytes long.

                           sub  sp, imm16       If imm16 = 0, PUSH/MOV

                                                is smaller.



                386+       push ebp

                32-bit     mov  ebp, esp

                           sub  esp, imm16



inc reg16                  lea reg16, [reg16 + 1]   Use to preserve flags

                                                    for BX, BP, DI, SI



inc reg32                  lea reg32, [reg32 + 1]   Use to preserve flags

                                                    for EAX, EBX, ECX, EDX

                                                        EDI, ESI, EBP

jcxz <dest>:    486+        test cx, cx          JCXZ is faster and

                           je   <dest>:          smaller on 8088-286.

                                                On the 386 it is

                                                about the same speed.



jecxz <dest>:  486+        test ecx, ecx        Never use JCXZ/JECXZ on 486

                           je   <dest>:          or Pentium except for

                                                compactness.



lea reg, mem   8088-286    mov reg, OFFSET mem  MOV reg, imm is faster

                                                on 8088-286. On the 386+

                                                they are the same speed.



        Note: There are many uses for LEA, see: add, inc, dec, mov, mul





leave           486+       mov sp, bp           LEAVE is only 1 byte

                           pop bp               long and is faster

                                                on the 186-386. The

                           mov esp, ebp         MOV/POP is much faster

                           pop ebp              on 486 and Pentium





lodsb           486+       mov al, [si]         LODS is only 1 byte long

                           inc si               and is faster on 8088-386,

                                                much slower on the 486.

                                                On the Pentium the MOV/INC

                                                or MOV/ADD instructions

                                                pair, taking only 1 cycle.



lodsw           486+       mov ax, [si]         see lodsb

                           add si, 2



lodsd           486+       mov eax, [esi]       see lodsb

                           add esi, 4





loop <dest>:     386+       dec cx               LOOP is faster and

                           jnz <dest>:           smaller on 8088-286.

                                                On 386+ DEC/JNZ is

loopd <dest>:               dec ecx              much faster. On the Pentium

                           jnz <dest>:           the DEC/JNZ instructions

                                                pair taking only 1 cycle.





loopXX <dest>:   486+       je  $+5              The 3 replacement instructions

( XX = e,ne,z or nz)       dec cx               are much faster on the 486+.

                           jnz <dest>:           LOOPxx is smaller and

                                                faster on 8088-286.

loopdXX <dest>:  486+       je  $+5              The speed is about the

                           dec ecx              same on the 386.

                           jnz <dest>:           JE as shown replaces

                                                LOOPNE/LOOPNZ; use JNE $+5

                                                for LOOPE/LOOPZ.





mov reg2, reg1  286+       lea reg2, [reg1+n]   LEA is faster, smaller and

 followed by:                                   preserves flags. This is a

 inc/dec/add/sub reg2                           way to do a MOV and ADD/SUB

                                                of a constant, n.







mov acc, reg    all        xchg acc, reg        Use XCHG for smaller code

                                                when one register's

                                                final value can be ignored.

                                                Note that acc = AL, AX or EAX.





mov mem, 1      Pent       lea bx, mem          Displacement/immediate does

                           mov [bx], 1          not pair. LEA/MOV can be used

                                                if other code can be placed

                                                in between to prevent AGIs.

                           mov ax, 1            MOV/MOV may be easier to pair.

                           mov mem, ax



mov [bx+2], 1   Pent       mov ax, 1            Better pairing because

                           mov [bx+2], ax       displacement/immediate

                                                instructions do not pair.



                           lea bx, [bx+2]

                           mov [bx], 1



movsb           486+       mov al, [si]         MOVS is faster and

                           inc si               smaller to move a single

                           mov [di], al         byte, word or dword

                           inc di               on the 8088-386.

                                                On the 486+ the MOV/INC

                                                method is faster.



                                                NOTE: REP MOVS is always

                                                faster to move a large block.





movsw           486+       mov ax, [si]         see MOVSB

                           add si, 2

                           mov [di], ax

                           add di, 2





movsd           486+       mov eax, [esi]       see MOVSB

                           add esi, 4

                           mov [edi], eax

                           add edi, 4





movzx r16, rm8  486+       xor bx, bx           MOVZX is faster and

                           mov bl, al           smaller on the 386.

                                                On the 486+ XOR/MOV

movzx r32, rm8  486+       xor ebx, ebx         is faster. Possible

                           mov bl, al           pairing on the Pentium.

                                                (source can be reg or mem)

movzx r32, rm16 486+       xor ebx, ebx         disadvantage: modifies flags

                           mov bx, ax





mul n           8088+      shl ax, cl           Use shifts or ADDs instead of

                                                multiply when n is a power of 2



mul n           Pent       add ax, ax           ADD is better than single

                                                shift because it pairs better.



mul             32-bit     lea                  Use LEA to multiply by

                                                2, 3, 4, 5, 8 or 9 (7 needs

                                                a LEA followed by a SUB).



                           lea eax, [eax+eax*4] (ex: multiply EAX * 5)



                                                LEA is better than SHL on the

                                                Pentium because it pairs in

                                                both pipes, SHL pairs only in

                                                the U pipe.



or reg, reg     Pent       test reg, reg        Better pairing because

                                                OR writes to register.

                                                (This is for src = dest.)



pop mem         486+       pop reg              Faster on 486+

                           mov mem, reg         Better pairing on Pentium



push mem        486+       mov  reg, mem        Faster on 486

                           push reg             Better pairing on Pentium



pushf           486+       rcr reg, 1           To save only the carry flag

                                                use a rotate (RCR or RCL)

                              or                into a register. RCR and RCL

                                                are pairable (U pipe only)

                           rcl reg, 1           and take 1 cycle. PUSHF is

                                                slow and not pairable.





popf            486+       rcl reg, 1           To restore only the carry flag.

                                                See PUSHF.

                              or



                           rcr reg, 1







repne scasb     Pent       loop1:               REP SCAS is faster and

                             mov al, [di]       smaller on 8088-486.

                             inc di             Expanded code is faster

                             cmp al, reg2       on Pentium due to pairing.

                             je  exit

                             dec cx

                             jnz loop1

                           exit:





shl reg, 1      Pent       add reg, reg         ADD pairs better. SHL

                                                only pairs in the U pipe.





stosb           486+       mov [di], al         STOS is faster and smaller

                           inc di               on the 8088-286, and the same

                                                speed on the 386. On the 486+

stosw           486+       mov [di], ax         the MOV/INC is slightly

                           add di, 2            faster.



stosd           486+       mov [edi], eax       REP STOS is faster on 8088-386.

                           add edi, 4           MOV/INC or MOV/ADD is faster

                                                on the 486+



                                                Note: use LEA SI, [SI+n]

                                                to advance SI without

                                                changing the flags.





xchg            all                             Use xchg acc, reg to do a

                                                1 byte MOV when one register

                                                can be ignored.



xchg reg1, reg2 Pent       push reg1            pushes and pops are 1 cycle

                           push reg2            faster on Pentium due to

                           pop  reg1            pairing.

                           pop  reg2



                                                disadvantage: uses stack



                Pent       mov  reg3, reg1      Faster and better pairing

                           mov  reg1, reg2      if reg3 is available.

                           mov  reg2, reg3



xlatb           486+       mov bh, 0            XLAT is faster and smaller

                           mov bl, al           on 8088-386. MOVs are faster

                           mov al, [bx+tbl]     on 486+. Best to rearrange

                                                instructions to prevent AGIs

xlatb           486+       xor ebx, ebx         and get pairing on Pentium.

                           mov bl, al           Force high part of BX/EBX

                           mov al, [ebx+tbl]    to zero outside of loop.

                                                (tbl = offset of the table,

                                                which XLAT takes from BX/EBX.)



                                                disadvantage: modifies flags
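
A few of the replacements in the table can be sanity-checked by modelling the register arithmetic in C. The sketch below is not the assembly itself and says nothing about cycle counts, pairing or flags. It only compares results for the CDQ, LEA-multiply, AAD and AAM entries. All function names (cdq_edx, mov_sar_edx, lea_times5, aad, aam, check_replacements) are hypothetical helpers introduced here, and registers are assumed to be modelled adequately by fixed-width unsigned integers.

```c
#include <assert.h>
#include <stdint.h>

/* CDQ: EDX becomes 32 copies of the sign bit of EAX. */
uint32_t cdq_edx(uint32_t eax) {
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* Model of MOV EDX, EAX / SAR EDX, 31. The arithmetic shift is written
   with unsigned math so the C does not rely on implementation-defined
   behaviour of >> on negative values. */
uint32_t mov_sar_edx(uint32_t eax) {
    uint32_t edx = eax;        /* mov edx, eax */
    edx = 0u - (edx >> 31);    /* sar edx, 31  */
    return edx;
}

/* Model of LEA EAX, [EAX+EAX*4]: multiply by 5, wrapping at 32 bits. */
uint32_t lea_times5(uint32_t eax) {
    return eax + (eax << 2);
}

/* AAD imm8 as the table describes it: AL = AL + AH*imm8, AH = 0.
   AX goes in and comes out as one 16-bit value. */
uint16_t aad(uint16_t ax, uint8_t imm8) {
    uint8_t al = (uint8_t)(ax & 0xFF);
    uint8_t ah = (uint8_t)(ax >> 8);
    return (uint8_t)(al + ah * imm8);   /* sum truncated to 8 bits, AH = 0 */
}

/* AAM imm8: AH = AL / imm8, AL = AL MOD imm8 (imm8 must be nonzero). */
uint16_t aam(uint16_t ax, uint8_t imm8) {
    uint8_t al = (uint8_t)(ax & 0xFF);
    return (uint16_t)(((al / imm8) << 8) | (al % imm8));
}

/* Compare each replacement with the instruction it stands in for.
   Returns 1 when every check passes; assert aborts otherwise. */
int check_replacements(void) {
    static const uint32_t samples[] = {
        0u, 1u, 5u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t eax = samples[i];
        assert(mov_sar_edx(eax) == cdq_edx(eax));
        assert(lea_times5(eax) == (uint32_t)(eax * 5u));
    }
    assert(aad(0x0205, 10) == 25);      /* AH = 2, AL = 5 -> 25 */
    assert(aam(25, 10) == 0x0205);      /* 25 -> AH = 2, AL = 5 */
    return 1;
}
```

Calling check_replacements() from a test program is enough to exercise all of the models. The XOR EDX, EDX form of the CDQ entry is covered implicitly: for the non-negative samples, cdq_edx returns 0.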




