Embedded Assembly Languages


Overview

Translating languages

Virtual machines

Abstractions for computers

High-level language

  • Level 5

  • Application-oriented languages

  • Programs compile into assembly language (Level 4)

Assembly language

  • Level 4

  • Instruction mnemonics that have a one-to-one correspondence to machine language

  • Calls functions written at the operating system level (Level 3)

  • Programs are translated into machine language (Level 2)

Operating system

  • Level 3

  • Provides services

  • Programs translated and run at the instruction set architecture level (Level 2)

Instruction set architecture

  • Level 2

  • Also known as conventional machine language

  • Executed by the microarchitecture (Level 1)

Microarchitecture

  • Level 1

  • Interprets conventional machine instructions (Level 2)

  • Executed by digital hardware (Level 0)

Digital logic

  • Level 0

  • CPU, constructed from digital logic gates

  • System bus

  • Memory

Data representation

  • Computer is a construction of digital circuits with two states: on and off

  • You need to be able to translate between these representations to examine the contents of the machine

  • Common number systems: binary, octal, decimal and hexadecimal

Binary representations

  • Electronic Implementation
    • Easy to store with bistable elements
    • Reliably transmitted on noisy and inaccurate wires

  • Digits are 1 and 0 (a binary digit is called a bit)

    1 = true

    0 = false

  • MSB – most significant bit

  • LSB – least significant bit

  • Bit numbering:

  • A bit string could have different interpretations

Unsigned binary integers

  • Each digit (bit) is either 1 or 0

  • Each bit represents a power of 2:
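A minimal C sketch of this weighting (the function name and bit layout are illustrative, not from the original notes): bit i contributes 2^i to the value when it is set.

#include <stdio.h>

/* Value of an n-bit unsigned binary string, bits[0] = LSB. */
unsigned value_of_bits(const int *bits, int n)
{
    unsigned value = 0;
    for (int i = 0; i < n; i++)
        if (bits[i])
            value += 1u << i;   /* bit i contributes 2^i */
    return value;
}

int main(void)
{
    int bits[] = {1, 0, 1, 1};  /* binary 1101, listed LSB first */
    printf("%u\n", value_of_bits(bits, 4));  /* prints 13 = 8 + 4 + 1 */
    return 0;
}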

Integer storage sizes

Large measurements

  • Kilobyte (KB), 2^10 bytes

  • Megabyte (MB), 2^20 bytes

  • Gigabyte (GB), 2^30 bytes

  • Terabyte (TB), 2^40 bytes

  • Petabyte (PB), 2^50 bytes

  • Exabyte (EB), 2^60 bytes

  • Zettabyte (ZB), 2^70 bytes

  • Yottabyte (YB), 2^80 bytes

Signed integers

The highest bit indicates the sign. 1 = negative, 0 = positive

If the highest digit of a hexadecimal integer is > 7, the value is negative. Examples: 8A, C5, A2, 9D

Two’s complement notation

Steps:

  • Complement (invert) each bit

  • Add 1
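A small C sketch of these two steps (assuming a two's complement machine, which is true of all the targets discussed here):

#include <stdio.h>

int main(void)
{
    int x = 5;
    int neg = ~x + 1;                /* step 1: invert each bit; step 2: add 1 */
    printf("%d %d\n", neg, -x);      /* both print -5 */
    printf("%08X\n", (unsigned)neg); /* FFFFFFFB on a 32-bit int: the two's complement pattern */
    return 0;
}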

Ranges of signed integers

Character

  • Character sets

    • Standard ASCII (0 – 127)
    • Extended ASCII (0 – 255)
    • ANSI (0 – 255)
    • Unicode (0 – 65,535)
  • Null-terminated String

    • Array of characters followed by a null byte
  • Using the ASCII table

    • back inside cover of book
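A quick C illustration of the null-terminated string layout mentioned above (an array of characters followed by a null byte):

#include <stdio.h>

int main(void)
{
    char s[] = "Hi";            /* stored as {'H', 'i', '\0'}: 3 bytes */
    int len = 0;
    while (s[len] != '\0')      /* scan until the null terminator */
        len++;
    printf("%d\n", len);        /* prints 2 */
    return 0;
}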

Representing Instructions

int sum(int x, int y)
{
   return x+y;
}
  • For this example, Alpha & Sun use two 4-byte instructions

    • Use differing numbers of instructions in other cases
  • PC uses 7 instructions with lengths 1, 2, and 3 bytes

    • Same for NT and for Linux
    • NT / Linux not fully binary compatible

Boolean algebra

  • Boolean expressions created from:
    • NOT, AND, OR
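For instance, XOR is not in this basic set but can be built from it; a minimal C sketch using the bitwise operators (the values are arbitrary):

#include <stdio.h>

int main(void)
{
    unsigned a = 0xF0, b = 0xCC;
    unsigned x = (a & ~b) | (~a & b);   /* XOR expressed with NOT, AND, OR */
    printf("%02X %02X\n", x, a ^ b);    /* both print 3C */
    return 0;
}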

Implementation of gates

Advanced Architecture

Basic architecture

Basic microcomputer design

  • clock synchronizes CPU operations

  • control unit (CU) coordinates sequence of execution steps

  • ALU performs arithmetic and logic operations

  • The memory storage unit holds instructions and data for a running program

  • A bus is a group of wires that transfer data from one part to another (data, address, control)

Clock

  • synchronizes all CPU and BUS operations

  • machine (clock) cycle measures time of a single operation

  • clock is used to trigger events

  • The clock cycle is the basic unit of time; a 1 GHz clock gives a cycle time of 1 ns

  • An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles

Instruction execution cycle

  • Fetch

  • Decode

  • Fetch operands

  • Execute

  • Store output

Pipeline

Multi-stage pipeline

  • Pipelining makes it possible for processor to execute instructions in parallel

  • Instruction execution divided into discrete stages

Example of a non-pipelined processor (e.g., the 80386): many cycles are wasted.

Pipelined execution

  • More efficient use of cycles, greater throughput of instructions: (80486 started to use pipelining)

    For k stages and n instructions, the number of required cycles is: k + (n – 1) compared to k*n

  • Pipelining requires buffers
    • Each buffer holds a single value
    • Ideal scenario: equal work for each stage
      • Sometimes it is not possible
      • Slowest stage determines the flow rate in the entire pipeline

  • Some reasons for unequal work stages

    • A complex step cannot be subdivided conveniently
    • An operation takes variable amount of time to execute, e.g. operand fetch time depends on where the operands are located
      • Registers
      • Cache
      • Memory
    • Complexity of operation depends on the type of operation
      • Add: may take one cycle
      • Multiply: may take several cycles
  • Operand fetch of I2 takes three cycles

    • Pipeline stalls for two cycles
      • Caused by hazards
    • Pipeline stalls reduce overall throughput

Wasted cycles (pipelined)

  • When one of the stages requires two or more clock cycles, clock cycles are again wasted.

    For k stages and n instructions, the number of required cycles is: k + (2n – 1)
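As a worked example of the two formulas above (the numbers are chosen purely for illustration): with k = 5 stages and n = 100 instructions, a non-pipelined design needs k × n = 500 cycles, an ideal pipeline needs k + (n − 1) = 104 cycles, and a pipeline whose slowest stage needs two clock cycles takes k + (2n − 1) = 204 cycles.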

Superscalar

  • A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).

    For k stages and n instructions, the number of required cycles is: k + n

  • Pentium: 2 pipelines

  • Pentium Pro: 3

Pipeline stages

  • Pentium 3: 10

  • Pentium 4: 20~31

  • Next-generation micro-architecture: 14

  • ARM7: 3

Hazards

  • Three types of hazards
    • Resource hazards
      • Occurs when two or more instructions use the same resource, also called structural hazards
    • Data hazards
      • Caused by data dependencies between instructions, e.g. result produced by I1 is read by I2
    • Control hazards
      • Default: sequential execution suits pipelining
      • Altering control flow (e.g., branching) causes problems, introducing control dependencies

Data hazards

add r1, r2, #10        ; write r1
sub r3, r1, #20        ; read r1

  • Forwarding: the result is passed directly to the dependent instruction as soon as it is produced, instead of waiting for it to be written back

Control hazards

            bz  r1, target        
            add r2, r4, 0
            ...
target:    add r2, r3, 0

  • Branches alter control flow

    • Require special attention in pipelining
    • Need to throw away some instructions in the pipeline
      • Depends on when we know the branch is taken
      • Pipeline wastes three clock cycles
        • Called branch penalty
    • Reducing branch penalty
      • Determine branch decision early
  • Delayed branch execution

    • Effectively reduces the branch penalty
    • We always fetch the instruction following the branch
      • Why throw it away?
      • Instead, place a useful instruction in that slot
      • This slot is called the delay slot

Branch prediction

  • Three prediction strategies

    • Fixed

      • Prediction is fixed

        • Example: branch-never-taken

          » Not appropriate for loop structures

    • Static

      • Strategy depends on the branch type
        • Conditional branch: always not taken
        • Loop: always taken
    • Dynamic

      • Takes run-time history to make more accurate predictions
  • Static prediction

    • Improves prediction accuracy over Fixed

  • Dynamic branch prediction

    • Uses runtime history
      • Takes the past n branch executions of the branch type and makes the prediction
    • Simple strategy
      • Prediction of the next branch is the majority of the previous n branch executions
      • Example: n = 3
        • If two or more of the last three branches were taken, the prediction is “branch taken”
      • Depending on the type of mix, we get more than 90% prediction accuracy
  • Impact of past n branches on prediction accuracy
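A minimal C sketch of the majority-of-the-last-three strategy described above (function and variable names are made up for illustration):

#include <stdio.h>

/* Predict "taken" if at least two of the last three outcomes were taken. */
int predict(const int history[3])
{
    return history[0] + history[1] + history[2] >= 2;
}

/* Shift in the newest outcome (1 = taken, 0 = not taken). */
void update(int history[3], int outcome)
{
    history[0] = history[1];
    history[1] = history[2];
    history[2] = outcome;
}

int main(void)
{
    int h[3] = {1, 1, 0};          /* last three branches: taken, taken, not taken */
    printf("%d\n", predict(h));    /* 1: predict taken */
    update(h, 0);                  /* actual outcome: not taken */
    printf("%d\n", predict(h));    /* history is now 1,0,0 -> 0: predict not taken */
    return 0;
}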

Multitasking

  • OS can run multiple programs at the same time.

  • Multiple threads of execution within the same program.

  • Scheduler utility assigns a given amount of CPU time to each running program.

  • Rapid switching of tasks

    • gives illusion that all programs are running at once
    • the processor must support task switching
    • scheduling policy, round-robin, priority

Cache

SRAM vs DRAM

The CPU-Memory gap

The gap widens between DRAM, disk, and CPU speeds.

Memory hierarchies

  • Some fundamental and enduring properties of hardware and software:

    • Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
    • The gap between CPU and main memory speed is widening.
    • Well-written programs tend to exhibit good locality.
  • They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Memory system in practice

Reading from memory

  • Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g., a 33 MHz memory bus). The wasted clock cycles are called wait states.

Cache memory

  • High-speed expensive static RAM both inside and outside the CPU.

    • Level-1 cache: inside the CPU
    • Level-2 cache: outside the CPU
  • Cache hit: when data to be read is already in cache memory

  • Cache miss: when data to be read is not in cache memory. Misses are classified as compulsory, capacity, and conflict misses.

  • Cache design: cache size, n-way, block size, replacement policy

Caching in a memory hierarchy

General caching concepts

  • Program needs object d, which is stored in some block b.
    • Cache hit
      • Program finds b in the cache at level k. E.g., block 14.
    • Cache miss
      • b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12.
      • If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
        • Placement policy: where can the new block go? E.g., b mod 4
        • Replacement policy: which block should be evicted? E.g., LRU
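A toy C sketch of the "b mod 4" placement policy above, using a tiny direct-mapped cache (all sizes and names are invented for illustration; a real cache works on tags and sets rather than whole block numbers):

#include <stdio.h>

#define NUM_SETS 4

typedef struct {
    int valid;
    int block;                  /* which memory block currently lives here */
} CacheLine;

/* Returns 1 on hit, 0 on miss; on a miss the new block evicts the old one. */
int access_block(CacheLine cache[NUM_SETS], int b)
{
    int set = b % NUM_SETS;     /* placement policy: block b may only go to set b mod 4 */
    if (cache[set].valid && cache[set].block == b)
        return 1;               /* cache hit */
    cache[set].valid = 1;       /* cache miss: replace (evict) the current occupant */
    cache[set].block = b;
    return 0;
}

int main(void)
{
    CacheLine cache[NUM_SETS] = {{0, 0}};
    printf("%d\n", access_block(cache, 14)); /* 0: compulsory miss */
    printf("%d\n", access_block(cache, 14)); /* 1: hit */
    printf("%d\n", access_block(cache, 10)); /* 0: conflict, 10 mod 4 == 14 mod 4 */
    return 0;
}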

Locality

  • Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.

    • Temporal locality: recently referenced items are likely to be referenced in the near future.
    • Spatial locality: items with nearby addresses tend to be referenced close together in time.
  • In general, programs with good locality run faster than programs with poor locality

  • Locality is the reason caches and virtual memory appear in both hardware and operating systems. Another example: a web browser caches recently visited web pages.

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
  • Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

stride-1 reference pattern

  • Does this function have good locality?
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

stride-N reference pattern

Blocked matrix multiply performance

  • Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
    • relatively insensitive to array size.

Cache-conscious programming

  • Make sure that memory is cache-aligned

  • Split data into hot and cold (list example)

  • Use union and bitfields to reduce size and increase locality
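A hedged C sketch of the hot/cold split for the list example above (the field names are made up; the point is that the fields touched on every traversal stay together, and the rarely used ones move behind a pointer):

/* Hot part: small, touched on every traversal, packs more nodes per cache line. */
struct node {
    int key;
    struct node *next;
    struct node_cold *cold;     /* rarely used fields live elsewhere */
};

/* Cold part: accessed only occasionally, so it does not pollute the cache. */
struct node_cold {
    char description[256];
    long stats[16];
};

A search that only compares keys now brings far fewer bytes into the cache per node visited.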

RISC vs. CISC

Trade-offs of instruction sets

  • Before 1980, the trend was to increase instruction complexity (ideally a one-to-one mapping from high-level constructs to machine instructions) in order to bridge the semantic gap and reduce fetches from memory. Selling points: number of instructions, addressing modes. (CISC)

  • Around 1980, RISC: simplify and regularize instructions so that advanced architectural techniques (pipelining, caches, superscalar execution) can be applied for better performance.

RISC

  • 1980: Patterson and Ditzel (Berkeley) coin the term RISC

  • Features

    • Fixed-length instructions
    • Load-store architecture
    • Register file
  • Organization

    • Hard-wired logic
    • Single-cycle instruction
    • Pipeline
  • Pros: small die size, short development time, high performance

  • Cons: low code density, not x86 compatible

RISC Design Principles

  • Simple operations

    • Simple instructions that can execute in one cycle
  • Register-to-register operations

    • Only load and store operations access memory
    • Rest of the operations on a register-to-register basis
  • Simple addressing modes

    • A few addressing modes (1 or 2)
  • Large number of registers

    • Needed to support register-to-register operations
    • Minimize the procedure call and return overhead
  • Fixed-length instructions

    • Facilitates efficient instruction execution
  • Simple instruction format

    • Fixed boundaries for various fields
      • opcode, source operands,…

CISC and RISC

  • CISC – complex instruction set

    • large instruction set
    • high-level operations (simpler for compiler?)
    • requires microcode interpreter (could take a long time)
    • examples: Intel 80x86 family
  • RISC – reduced instruction set

    • small instruction set
    • simple, atomic instructions
    • directly executed by hardware very quickly
    • easier to incorporate advanced architecture design
    • examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq), PowerPC, MIPS
| | CISC (Intel 486) | RISC (MIPS R4000) |
| --- | --- | --- |
| #instructions | 235 | 94 |
| Addr. modes | 11 | 1 |
| Inst. size (bytes) | 1-12 | 4 |
| GP registers | 8 | 32 |

Why RISC?

  • Simple instructions are preferred

    • Complex instructions are mostly ignored by compilers
      • Due to semantic gap
  • Simple data structures

    • Complex data structures are used relatively infrequently
    • Better to support a few simple data types efficiently
      • Synthesize complex ones
  • Simple addressing modes

    • Complex addressing modes lead to variable length instructions
      • Lead to inefficient instruction decoding and scheduling
  • Large register set

    • Efficient support for procedure calls and returns

      • Patterson and Sequin’s study

        • Procedure call/return: 12-15% of HLL statements

          »Constitute 31-33% of machine language instructions

          »Generate nearly half (45%) of memory references

    • Small activation record

      • Tanenbaum’s study
        • Only 1.25% of the calls have more than 6 arguments
        • More than 93% have fewer than 6 local scalar variables
        • Large register set can avoid memory references

ISA design issues

ISA: Instruction Set Architecture

Instruction set design

  • Issues when determining ISA
    • Instruction types
    • Number of addresses
    • Addressing modes

Instruction types

  • Arithmetic and logic

  • Data movement

  • I/O (memory-mapped, isolated I/O)

  • Flow control

    • Branches (unconditional, conditional)
      • set-then-jump (cmp AX, BX; je target)
      • Test-and-jump (beq r1, r2, target)
    • Procedure calls (register-based, stack-based)
      • Pentium: ret; MIPS: jr
      • Register: faster but limited number of parameters
      • Stack: slower but more general

Operand types

  • Instructions support basic data types

    • Characters
    • Integers
    • Floating-point
  • Instruction overload

    • Same instruction for different data types

    • Example: Pentium

      mov AL,address ;loads an 8-bit value

      mov AX,address ;loads a 16-bit value

      mov EAX,address ;loads a 32-bit value

  • Separate instructions

    • Instructions specify the operand size
    • Example: MIPS
lb    Rdest,address   ;loads a byte
lh    Rdest,address   ;loads a halfword 
                      ;(16 bits) 
lw    Rdest,address   ;loads a word 
                      ;(32 bits) 
ld    Rdest,address   ;loads a doubleword 
                      ;(64 bits)

Number of addresses

  • Four categories
    • 3-address machines
      • two for the source operands and one for the result
    • 2-address machines
      • One address doubles as source and result
    • 1-address machine
      • Accumulator machines
      • Accumulator is used for one source and result
    • 0-address machines
      • Stack machines
      • Operands are taken from the stack
      • Result goes onto the stack
| Number of addresses | Instruction | Operation |
| --- | --- | --- |
| 3 | OP A, B, C | A ← B OP C |
| 2 | OP A, B | A ← A OP B |
| 1 | OP A | AC ← AC OP A |
| 0 | OP | T ← (T-1) OP T |
  • A, B, C: memory or register locations
  • AC: accumulator
  • T: top of stack
  • T-1: second element of stack

3-address

Example: RISC machines, TOY

SUB Y, A, B    ; Y = A - B
MUL T, D, E    ; T = D × E
ADD T, T, C    ; T = T + C
DIV Y, Y, T    ; Y = Y / T

2-address

Example: IA32

MOV Y, A        ; Y = A
SUB Y, B        ; Y = Y - B
MOV T, D        ; T = D
MUL T, E        ; T = T × E
ADD T, C        ; T = T + C
DIV Y, T        ; Y = Y / T

1-address

Example: IA32’s MUL (EAX)

LD     D        ; AC = D
MUL    E        ; AC = AC × E
ADD    C        ; AC = AC + C
ST    Y        ; Y = AC
LD    A        ; AC = A
SUB    B        ; AC = AC – B
DIV     Y        ; AC = AC / Y
ST    Y        ; Y = AC

0-address

Example: IA32’s FPU, HP3000

PUSH     A        ; A
PUSH B        ; A, B
SUB            ; A-B
PUSH C        ; A-B, C
PUSH    D        ; A-B, C, D
PUSH E        ; A-B, C, D, E
MUL            ; A-B, C, D×E
ADD            ; A-B, C+(D×E)
DIV            ; (A-B) / (C+(D×E))
POP    Y

  • A basic design decision; could be mixed

  • Fewer addresses per instruction results in

    • a less complex processor
    • shorter instructions
    • longer and more complex programs
    • longer execution time
  • The decision has impacts on register usage policy as well

    • 3-address usually means more general-purpose registers
    • 1-address usually means fewer

Addressing modes

  • How do we specify the location of operands? There is a trade-off among address range, addressing flexibility, number of memory references, and address calculation.

  • Operands can be in three places

    • Registers
      • Register addressing mode
    • Part of instruction
      • Constant
      • Immediate addressing mode
      • All processors support these two addressing modes
    • Memory
      • Difference between RISC and CISC
      • CISC supports a large variety of addressing modes
      • RISC follows load/store architecture
  • Common addressing modes
    • Implied
    • Immediate (lda R1, 1)
    • Direct (st R1, A)
    • Indirect
    • Register (add R1, R2, R3)
    • Register indirect (sti R1, R2)
    • Displacement
    • Stack

Implied addressing

  • No address field; operand is implied by the instruction

    CLC ; clear carry

  • A fixed and unvarying address

Immediate addressing

  • Address field contains the operand value

    ADD 5; AC=AC+5

  • Pros: no extra memory reference; faster

  • Cons: limited range

Direct addressing

  • Address field contains the effective address of the operand

    ADD A; AC=AC+[A]

  • single memory reference

  • Pros: no additional address calculation

  • Cons: limited address space

Indirect addressing

  • Address field contains the address of a pointer to the operand

    ADD [A]; AC=AC+[[A]]

  • multiple memory references

  • Pros: large address space

  • Cons: slower

Register addressing

  • Address field contains the address of a register

    ADD R; AC=AC+R

  • Pros: only need a small address field; shorter instruction and faster fetch; no memory reference

  • Cons: limited address space

Register indirect addressing

  • Address field contains the address of the register containing a pointer to the operand

    ADD [R]; AC=AC+[R]

  • Pros: large address space

  • Cons: extra memory reference

Displacement addressing

  • Address field could contain a register address and an address

    MOV EAX, [A+ESI*4]

  • EA=A+[R×S] or vice versa

  • Several variants

    • Base-offset: [EBP+8]
    • Base-index: [EBX+ESI]
    • Scaled: [T+ESI*4]
  • Pros: flexible

  • Cons: complex

MOV EAX, [A+ESI*4]

  • Often a register, called the index register, is used for the displacement.

  • Usually, a mechanism is provided to efficiently increment the index register.

Stack addressing

  • Operand is on top of the stack

    ADD ; pop the top two values, add them, push the result

  • Pros: no address field is needed, so instructions are short and fast to fetch

  • Cons: limited by the LIFO access order

| Mode | Meaning | Pros | Cons |
| --- | --- | --- | --- |
| Implied | | Fast fetch | Limited instructions |
| Immediate | Operand = A | No memory ref | Limited operand |
| Direct | EA = A | Simple | Limited address space |
| Indirect | EA = [A] | Large address space | Multiple memory ref |
| Register | EA = R | No memory ref | Limited address space |
| Register indirect | EA = [R] | Large address space | Extra memory ref |
| Displacement | EA = A + [R] | Flexibility | Complexity |
| Stack | EA = stack top | No memory ref | Limited applicability |

IA32 addressing modes

Effective address calculation (IA32)

Based Addressing

  • Effective address is computed as base + signed displacement

    • Displacement:
      • 16-bit addresses: 8- or 16-bit number
      • 32-bit addresses: 8- or 32-bit number
  • Useful to access fields of a structure or record

    • Base register -> points to the base address of the structure
    • Displacement -> relative offset within the structure
  • Useful to access arrays whose element size is not 2, 4, or 8 bytes

    • Displacement -> points to the beginning of the array
    • Base register -> relative offset of an element within the array

Indexed Addressing

  • Effective address is computed as (index * scale factor) + signed displacement

    • 16-bit addresses:
      • displacement: 8- or 16-bit number
      • scale factor: none (i.e., 1)
    • 32-bit addresses:
      • displacement: 8- or 32-bit number
      • scale factor: 2, 4, or 8
  • Useful to access elements of an array (particularly if the element size is 2, 4, or 8 bytes)

    • Displacement -> points to the beginning of the array
    • Index register -> selects an element of the array (array index)
    • Scaling factor -> size of the array element

Based-Indexed Addressing

Based-indexed addressing with no scale factor

  • Effective address is computed as base + index + signed displacement

  • Useful in accessing two-dimensional arrays

    • Displacement -> points to the beginning of the array
    • Base and index registers point to a row and an element within that row
  • Useful in accessing arrays of records

    • Displacement -> represents the offset of a field in a record
    • Base and index registers hold a pointer to the base of the array and the offset of an element relative to the base of the array
  • Useful in accessing arrays passed on to a procedure

    • Base register -> points to the beginning of the array
    • Index register -> represents the offset of an element relative to the base of the array

Example

Assuming BX points to table1,

mov AX, [BX+SI]
cmp AX, [BX+SI+2]

compares two successive elements of table1.

Based-indexed addressing with scale factor

  • Effective address is computed as base + (index * scale factor) + signed displacement

  • Useful in accessing two-dimensional arrays when the element size is 2, 4, or 8 bytes

    • Displacement ==> points to the beginning of the array
    • Base register ==> holds offset to a row (relative to start of array)
    • Index register ==> selects an element of the row
    • Scaling factor ==> size of the array element
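The same computation seen from C (a small sketch; the array shape is arbitrary): the address of a[i][j] is base + (i*N + j)*sizeof(element), which is exactly what based-indexed addressing with a scale factor evaluates in one instruction.

#include <stdio.h>

#define M 3
#define N 4

int main(void)
{
    int a[M][N];
    int i = 2, j = 1;

    /* &a[i][j] == (char *)a + (i * N + j) * sizeof(int) */
    char *base = (char *)a;
    int  *p    = (int *)(base + (i * N + j) * sizeof(int));

    a[i][j] = 42;
    printf("%d %d\n", *p, p == &a[i][j]);   /* prints 42 1 */
    return 0;
}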

ARM Architecture

ARM history

  • 1983 developed by Acorn computers

    • To replace 6502 in BBC computers
    • 4-man VLSI design team
    • Its simplicity came from the team's inexperience
    • Match the needs for generalized SoC for reasonable power, performance and die size
    • The first commercial RISC implementation
  • 1990 ARM (Advanced RISC Machine), owned by Acorn, Apple and VLSI

ARM designs and licenses processor cores but does not fabricate chips

Why ARM?

  • One of the most licensed and thus widespread processor cores in the world

    • Used in PDA, cell phones, multimedia players, handheld game console, digital TV and cameras
    • ARM7: GBA, iPod
    • ARM9: NDS, PSP, Sony Ericsson, BenQ
    • ARM11: Apple iPhone, Nokia N93, N800
    • About 90% of 32-bit embedded RISC processors as of 2009
  • Used especially in portable devices due to its low power consumption and reasonable performance

ARM processors

  • A simple but powerful design

  • A whole family of designs sharing similar design principles and a common instruction set

Naming ARM

  • ARMxyzTDMIEJFS
    • x: series
    • y: MMU
    • z: cache
    • T: Thumb
    • D: debugger
    • M: Multiplier
    • I: EmbeddedICE (built-in debugger hardware)
    • E: Enhanced instruction
    • J: Jazelle (JVM)
    • F: Floating-point
    • S: Synthesizable version (source code version for EDA tools)
  • ARM7TDMI

    • 3 pipeline stages (fetch/decode/execute)
    • High code density/low power consumption
    • One of the most widely used ARM versions (for low-end systems)
    • All ARM cores after the ARM7TDMI include the TDMI features even if "TDMI" does not appear in their names
  • ARM9TDMI

    • Compatible with ARM7
    • 5 stages (fetch/decode/execute/memory/write)
    • Separate instruction and data cache
  • ARM11

ARM is a RISC

  • RISC: simple but powerful instructions that execute within a single cycle at high clock speed.

  • Four major design rules:

    • Instructions: reduced set/single cycle/fixed length
    • Pipeline: decode in one stage/no need for microcode
    • Registers: a large set of general-purpose registers
    • Load/store architecture: data processing instructions apply to registers only; load/store to transfer data from memory
  • Results in simple design and fast clock rate

  • The distinction blurs because CISC implements RISC concepts

ARM design philosophy

  • Small processor for lower power consumption (for embedded system)
  • High code density for limited memory and physical size restrictions
  • The ability to use slow and low-cost memory
  • Reduced die size for reducing manufacture cost and accommodating more peripherals

ARM features

  • Different from pure RISC in several ways:
    • Variable cycle execution for certain instructions: multiple-register load/store (faster/higher code density)
    • Inline barrel shifter leading to more complex instructions: improves performance and code density
    • Thumb 16-bit instruction set: 30% code density improvement
    • Conditional execution: improve performance and code density by reducing branch
    • Enhanced instructions: DSP instructions

  • Load/store architecture
  • A large array of uniform registers
  • Fixed-length 32-bit instructions
  • 3-address instructions

Registers

  • Only 16 registers are visible to a specific mode. A mode could access
    • A particular set of r0-r12
    • r13 (sp, stack pointer)
    • r14 (lr, link register)
    • r15 (pc, program counter)
    • Current program status register (cpsr)
    • The uses of r0-r13 are orthogonal

General-purpose registers

  • 6 data types (signed/unsigned)
  • All ARM operations are 32-bit. Shorter data types are only supported by data transfer operations.

Program counter

  • Store the address of the instruction to be executed
  • All instructions are 32-bit wide and word-aligned
  • Thus, the last two bits of pc are undefined.

Program status register (CPSR)

Processor modes

Register organization

Instruction sets

  • ARM/Thumb/Jazelle

Pipeline

  • Execution of a branch or direct modification of pc causes ARM core to flush its pipeline
  • ARM10 starts to use branch prediction
  • An instruction in the execution stage will complete even though an interrupt has been raised. Other instructions in the pipeline are abandoned.

Interrupts

ARM Instruction Set

ARM programmer model

  • The state of an ARM system is determined by the content of visible registers and memory.
  • A user-mode program can see 15 32-bit general-purpose registers (R0-R14), program counter (PC) and CPSR.
  • Instruction set defines the operations that can change the state.

Memory system

Byte ordering

ARM programmer model

Instruction set

Features of ARM instruction set

  • Load-store architecture

  • 3-address instructions

  • Conditional execution of every instruction

  • Possible to load/store multiple registers at once

  • Possible to combine shift and ALU operations in a single instruction

Data processing

  • They are move, arithmetic, logical, comparison and multiply instructions.

  • Most data processing instructions can process one of their operands using the barrel shifter.

  • General rules:

    • All operands are 32-bit, coming from registers or literals.
    • The result, if any, is 32-bit and placed in a register (with the exception of long multiply, which produces a 64-bit result)
    • 3-address format

MOV<cc><S>  Rd, <operands>

MOVCS R0, R1 @ if carry is set
             @ then R0:=R1

MOVS  R0, #0 @ R0:=0
             @ Z=1, N=0
             @ C, V unaffected

Conditional execution

  • Almost all ARM instructions have a condition field which allows them to be executed conditionally.

movcs R0, R1

Register movement

Addressing modes

  • Register operands

    ADD R0, R1, R2

  • Immediate operands

Shifted register operands

  • One operand to the ALU is routed through the barrel shifter, so it can be modified before it is used. This is useful for fast multiplication and for dealing with lists, tables and other complex data structures. (Similar to the displacement addressing mode in CISC.)
  • Some instructions (e.g. MUL, CLZ, QADD) do not use the barrel shifter.

  • It is possible to use a register to specify the number of bits to be shifted; only the bottom 8 bits of the register are significant.

Multiplication

MOV  R1, #35
MUL  R2, R0, R1
        or
ADD  R0, R0, R0, LSL #2  @ R0' = 5 x R0
RSB  R2, R0, R0, LSL #3  @ R2  = 7 x R0'

Shifted register operands

Encoding data processing instructions

Arithmetic

  • Add and subtraction

Setting the condition codes

  • Any data processing instruction can set the condition codes if the programmer wishes it to

64-bit addition

ADDS R2, R2, R0

ADC R3, R3, R1

Logical

Comparison

  • These instructions do not generate a result, but set condition code bits (N, Z, C, V) in CPSR. Often, a branch operation follows to change the program flow.

Multiplication

  • MUL R0, R1, R2 @ R0 = (R1xR2)[31:0]

  • Features:

    • Second operand can’t be immediate
    • The result register must be different from the first operand
    • Cycle count depends on the core type
    • If S bit is set, C flag is meaningless
  • See the reference manual

  • Multiply-accumulate (2D array indexing)

    MLA R4, R3, R2, R1 @ R4 = R3xR2+R1

  • Multiply with a constant can often be more efficiently implemented using shifted register operand

    MOV R1, #35

    MUL R2, R0, R1

    or

    ADD R0, R0, R0, LSL #2 @ R0' = 5 x R0

    RSB R2, R0, R0, LSL #3 @ R2  = 7 x R0'

Flow control instructions

  • Determine the instruction to be executed next

  • Branch instruction
B  label    
            …
label:    …
  • Conditional branches
MOV  R0, #0    
loop:        …
            ADD  R0, R0, #1
            CMP  R0, #10
            BNE  loop

Branch conditions

  • BL instruction saves the return address to R14 (lr)
        BL    sub     @ call sub
          CMP   R1, #5  @ return to here
          MOVEQ R1, #0
          …
sub:    …            @ sub entry point
        …
        MOV   PC, LR    @ return 

Conditional execution

CMP  R0, #5    
            BEQ  bypass     @ if (R0!=5) {
            ADD  R1, R1, R0 @  R1=R1+R0-R2 
            SUB  R1, R1, R2 @ }
bypass:    …
----------------------------------------------------
            CMP   R0, #5
            ADDNE R1, R1, R0
            SUBNE R1, R1, R2    @ smaller and faster

Rule of thumb: if the conditional sequence is three instructions or less, it is better to use conditional execution than a branch.

if ((R0==R1) && (R2==R3)) R4++
-----------------------------------------------------
          CMP   R0, R1
          BNE   skip
          CMP   R2, R3
          BNE   skip
          ADD   R4, R4, #1
skip:    …  
-----------------------------------------------------
          CMP   R0, R1
          CMPEQ R2, R3
          ADDEQ R4, R4, #1

Data transfer instructions

  • Move data between registers and memory

  • Three basic forms

    • Single register load/store
    • Multiple register load/store
    • Single register swap: SWP(B), atomic instruction for semaphore

Single register load/store

No STRSB/STRSH: STRB/STRH store both signed and unsigned values

  • A data item can be an 8-bit byte, a 16-bit half-word or a 32-bit word. Addresses must be aligned to the item size (e.g., a multiple of 4 for LDR/STR).

Addressing modes

  • Memory is addressed by a register and an offset.

    LDR R0, [R1] @ mem[R1]

  • Three ways to specify offsets:

    • Immediate

      LDR R0, [R1, #4] @ mem[R1+4]

    • Register

      LDR R0, [R1, R2] @ mem[R1+R2]

    • Scaled register @ mem[R1+4*R2]

      LDR R0, [R1, R2, LSL #2]

  • Pre-index addressing (LDR R0, [R1, #4]) without a writeback

  • Auto-indexing addressing (LDR R0, [R1, #4]!)

    Pre-index with writeback:

    the offset is applied before the access and the updated address is written back to the base register

  • Post-index addressing (LDR R0, [R1], #4)

    the base register is used as the address, then updated (written back) after the access

Comparisons

  • Pre-indexed addressing

LDR R0, [R1, R2] @ R0=mem[R1+R2]

@ R1 unchanged

  • Auto-indexing addressing

LDR R0, [R1, R2]! @ R0=mem[R1+R2]

@ R1=R1+R2

  • Post-indexed addressing

LDR R0, [R1], R2 @ R0=mem[R1]

@ R1=R1+R2
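In C terms this is only a loose analogy, not what a compiler necessarily emits, but post-indexed addressing behaves like *p++ (use the address, then advance it), while auto-indexing behaves like *(++p) (advance first, then use it):

#include <stdio.h>

int main(void)
{
    int a[3] = {10, 20, 30};
    int *p = a;

    int x = *p++;      /* like LDR R0, [R1], #4 : load from old address, then R1 += 4 */
    int y = *(++p);    /* like LDR R0, [R1, #4]! : R1 += 4 first, then load */

    printf("%d %d\n", x, y);   /* prints 10 30 */
    return 0;
}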

Summary of addressing modes

Load an address into a register

  • Note that all addressing modes are register plus offset. Can we issue LDR R0, table? The pseudo instruction ADR loads a register with an address
table:     .word    10
…
            ADR  R0, table
  • The assembler translates the pseudo instruction into a sequence of appropriate real instructions, e.g.

    sub r0, pc, #12

Application

            ADR R1, table
loop:     LDR R0, [R1]
            ADD R1, R1, #4
            @ operations on R0
            …
--------------------------------------------
            ADR R1, table
loop:     LDR R0, [R1], #4

            @ operations on R0
            …

Multiple register load/store

  • Transfer a block of data more efficiently.

  • Used for procedure entry and exit for saving and restoring workspace registers and the return address

  • For the ARM7, a transfer takes 2 + N×t cycles (N: number of words, t: time per word for sequential access). It increases interrupt latency since the instruction cannot be interrupted.

  • Registers are transferred in increasing register-number order; see the manual

LDMIA  R1, {R0, R2, R5} @ R0 = mem[R1]
                         @ R2 = mem[r1+4]
                         @ R5 = mem[r1+8]
  • LDM load multiple registers

  • STM store multiple registers

Suffix meanings:

  • IA: increment after

  • IB: increment before

  • DA: decrement after

  • DB: decrement before

LDM<mode> Rn, {<registers>}
IA: addr:=Rn    //1
IB: addr:=Rn+4    //2
DA: addr:=Rn-#<registers>*4+4    //3
DB: addr:=Rn-#<registers>*4        //4
For each Ri in <registers> 
  IB: addr:=addr+4
  DB: addr:=addr-4
  Ri:=M[addr]
  IA: addr:=addr+4
  DA: addr:=addr-4
<!>: Rn:=addr
LDMIA R0, {R1,R2,R3}
  or
LDMIA R0, {R1-R3}


R1: 10
R2: 20
R3: 30
R0: 0x10

LDMIA R0!, {R1,R2,R3}

R1: 10
R2: 20
R3: 30
R0: 0x01C
LDMIB R0!, {R1,R2,R3}

R1: 20
R2: 30
R3: 40
R0: 0x01C

LDMDA R0!, {R1,R2,R3}

R1: 40
R2: 50
R3: 60
R0: 0x018
LDMDB R0!, {R1,R2,R3}

R1: 30
R2: 40
R3: 50
R0: 0x018

  • Copy a block of memory
    • R9: address of the source
    • R10: address of the destination
    • R11: end address of the source
loop: LDMIA R9!, {R0-R7}
         STMIA R10!, {R0-R7}
         CMP   R9, R11    
          BNE   loop

  • Stack (full: the stack pointer points to the last used location; ascending: the stack grows towards increasing memory addresses)
| Mode | POP (LDM) | PUSH (STM) |
| --- | --- | --- |
| Full ascending (FA) | LDMFA = LDMDA | STMFA = STMIB |
| Full descending (FD) | LDMFD = LDMIA | STMFD = STMDB |
| Empty ascending (EA) | LDMEA = LDMDB | STMEA = STMIA |
| Empty descending (ED) | LDMED = LDMIB | STMED = STMDA |
STMFD R13!, {R2-R9} @ used for ATPCS
… @ modify R2-R9
LDMFD R13!, {R2-R9}

Swap instruction

  • Swap between memory and register. Atomic operation preventing any other instruction from reading/writing to that location until it completes

Software interrupt

  • A software interrupt instruction causes a software interrupt exception, which provides a mechanism for applications to call OS routines.

Load constants

  • No single ARM instruction can load an arbitrary 32-bit constant into a register, because ARM instructions are themselves only 32 bits long. There is a pseudo instruction for this.

Immediate numbers

  • Assemblers usually implement this in one of two ways, depending on the number you try to load.

  • Assume that you want to load 511 into R0

    • Construct in multiple instructions

      mov r0, #256

      add r0, #255

    • Load from memory; declare L511 .word 511

      ldr r0, L511        @ assembled as: ldr r0, [pc, #0]

  • Guideline: if you can construct it in two instructions, do it; otherwise, load it.

  • The assembler decides for you:

    ldr r0, =255        @ assembled as: mov r0, #255

    ldr r0, =511        @ assembled as: ldr r0, [pc, #4]

PC-relative modes

Instruction set

ARM Assembly Programming

GNU compiler and binutils

  • gcc: GNU C compiler

  • as: GNU assembler

  • ld: GNU linker

  • gdb: GNU project debugger

  • insight: a (Tcl/Tk) graphical interface to gdb

Object files

  • COFF (common object file format)

  • ELF (executable and linkable format)

  • Segments in the object file

    • Text: code
    • Data: initialized global variables
    • BSS: uninitialized global variables

GAS program format

            .file "test.s"
            .text
            .global main
            .type main, %function
main:
            MOV R0, #100
            ADD R0, R0, R0
            SWI #11
            .end

ARM assembly program

Control structures

  • A program implements algorithms to solve problems. Program decomposition and flow of control are important concepts for expressing algorithms.

  • Flow of control:

    • Sequence.
    • Decision: if-then-else, switch
    • Iteration: repeat-until, do-while, for
  • Decomposition: split a problem into several smaller and manageable ones and solve them independently. (subroutines/functions/procedures)

Decision

If statements

if (R1==1 || R1==5 || R1==12) R0=1;

TEQ   R1, #1
TEQNE R1, #5
TEQNE R1, #12
MOVEQ R0, #1
if (R1==0) zero
else if (R1>0) plus
else if (R1<0) neg

      CMP   R1, #0        
      BMI   neg
      BEQ   zero
      BPL   plus
neg:  ...
      B exit
zero: ...
      B exit
      ...
R0=abs(R0)

CMP    R0, #0
RSBMI  R0, R0, #0

Multi-way branches

CMP R0, #'0'
        BCC other   @ less than '0'
        CMP R0, #'9'
        BLS digit   @ between '0' and '9'
        ------------------------------------
        CMP R0, #'A'
        BCC other
        CMP R0, #'Z'
        BLS letter  @ between 'A' and 'Z'
        ------------------------------------
        CMP R0, #'a'
        BCC other
        CMP R0, #'z'
        BHI other   @ not between 'a' and 'z'
        ------------------------------------
letter: ... 

Switch statements

Iteration

repeat loops

while loops

GCD

int gcd (int i, int j)
{
    while (i!=j)
    {
      if (i>j)
        i -= j;
      else
        j -= i;
    }
    return i;
}
loop:  CMP   R1, R2
       SUBGT R1, R1, R2
       SUBLT R2, R2, R1
       BNE   loop

for loops

Procedures

  • Arguments: expressions passed into a function

  • Parameters: values received by the function

  • Caller and callee

  • How to pass arguments? By registers? By stack? By memory? In what order?

  • Who should save R5? Caller? Callee?

  • We need a protocol for these.

ARM Procedure Call Standard (APCS)

  • ARM Ltd. defines a set of rules for procedure entry and exit so that

    • Object codes generated by different compilers can be linked together
    • Procedures can be called between high-level languages and assembly
  • APCS defines

    • Use of registers
    • Use of stack
    • Format of stack-based data structure
    • Mechanism for argument passing

APCS register usage convention

  • R0-R3:

    • Used to pass the first 4 parameters

    • Caller-saved if necessary

  • R4-R11:

    • Register variables, must return unchanged

    • Callee-saved

  • R9-R15:

    • Registers for special purposes
    • Could be used as temporary variables if saved properly.

Argument passing

  • The first four word arguments are passed through R0 to R3.

  • Remaining parameters are pushed onto the stack in reverse order.

  • Procedures with no more than four word-sized parameters therefore avoid stack traffic and are cheaper to call.
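A hedged C illustration of these rules (the function and argument names are made up): for the call below, a compiler following APCS would be expected to pass a..d in R0-R3 and push e and f on the stack in reverse order.

/* With APCS, a..d travel in R0-R3; e and f have to go on the stack. */
int accumulate(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

int call_site(void)
{
    /* The last two arguments force stack traffic, so calls with at most
       four word-sized arguments are cheaper. */
    return accumulate(1, 2, 3, 4, 5, 6);
}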

Return value

  • One word value in R0

  • A value of length 2~4 words (R0-R1, R0-R2, R0-R3)

Function entry/exit

  • A simple leaf function with no more than four parameters has minimal call overhead. About 50% of calls are to leaf functions.

  • Save a minimal set of temporary variables
BL leaf2
       ...

leaf2: STMFD sp!, {regs, lr} @ save
       ...
       LDMFD sp!, {regs, pc} @ restore and
                               @ return

Accessing operands

  • A procedure often accesses operands in the following ways
    • An argument passed on a register: no further work
    • An argument passed on the stack: use stack pointer (R13) relative addressing with an immediate offset known at compile time
    • A constant: PC-relative addressing, offset known at compile time
    • A local variable: allocate on the stack and access through stack pointer relative addressing
    • A global variable: allocated in the static area and can be accessed by the static base relative (R9) addressing
main:
        MOV   R0, #0
        ...
        BL    func
        ...
func:   STMFD SP!, {R4-R6, LR}
        SUB   SP, SP, #0xC
        ...
        STR   R0, [SP, #0] @ v1=a1

        ...
        ADD   SP, SP, #0xC
        LDMFD SP!, {R4-R6, PC}

Optimizing ARM Assembly

Optimization

  • Compilers do perform optimization, but they have blind spots. Some optimization techniques cannot be expressed explicitly in C, for example:

    • Instruction scheduling
    • Register allocation
    • Conditional execution

    You have to use hand-written assembly to optimize critical routines.

  • Use ARM9TDMI as the example, but the rules apply to all ARM cores.

  • Note that the code is sometimes in armasm format, not GAS.

ARM optimization

  • Utilize ARM ISA’s features
    • Conditional execution
    • Multiple register load/store
    • Scaled register operand
    • Addressing modes

Instruction scheduling

  • ARM9 pipeline

  • Hazard/interlock: if the required data is a result of the previous instruction that is not yet available, the pipeline stalls.

  • No hazard, 2 cycles

  • One-cycle interlock

  • One-cycle interlock, 4 cycles

  • A branch takes 3 cycles due to stalls

Scheduling of load instructions

  • Loads occur frequently in compiled code, accounting for approximately 1/3 of all instructions. Careful scheduling of loads can avoid stalls.

  • 2-cycle stall. Total 11 cycles for a character.
  • It can be avoided by preloading and unrolling.
  • The key is to do some work when awaiting data

Load scheduling by preloading

  • Preloading: loads the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop.

  • Since loop i is loading data for loop i+1, there is always a problem with the first and last loops. For the first loop, insert an extra load outside the loop. For the last loop, be careful not to read any data. This can be effectively done by conditional execution.

Load scheduling by unrolling

  • Unroll and interleave the body of the loop. For example, we can perform three loops together. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.

Register allocation

  • APCS requires the callee to save R4-R11 and to keep the stack 8-byte aligned.

  • We push R12 only to keep the stack 8-byte aligned.

  • Unroll the loop to handle 8 words at a time and to use multiple load/store

  • What variables do we have?

  • We still need to assign carry and kr, but we have used 13 registers and only one remains.

    • Work on 4 words instead
    • Use stack to save least-used variable, here N
    • Alter the code
  • We notice that carry does not need to stay in the same register. Thus, we can use yi for it.

More than 14 local variables

  • If you need more than 14 local variables, then you store some on the stack.

  • Work outwards from the inner loops since they have more performance impact.

Packing

  • Pack multiple (sub-32bit) variables into a single register.

  • When shifting by a register amount, ARM uses bits 0~7 and ignores others.

  • Shift an array of 40 entries by shift bits.

  • Simulate SIMD (single instruction multiple data).

  • Assume that we want to merge two images X and Y to produce Z by

  • Load 4 bytes at a time

  • Unpack it and promote to 16-bit data

  • Work on 176x144 images
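A small C sketch of the packing idea (two 16-bit values kept in one 32-bit register-sized word; the values are arbitrary):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t lo = 0x1234, hi = 0xABCD;

    /* Pack: two 16-bit variables share one 32-bit word. */
    uint32_t packed = ((uint32_t)hi << 16) | lo;

    /* Unpack: recover the separate 16-bit values. */
    uint16_t lo2 = (uint16_t)(packed & 0xFFFFu);
    uint16_t hi2 = (uint16_t)(packed >> 16);

    printf("%08X %04X %04X\n", (unsigned)packed, (unsigned)hi2, (unsigned)lo2);
    /* prints ABCD1234 ABCD 1234 */
    return 0;
}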

Conditional execution

  • By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need of branches.

  • This improves efficiency since branches can take many cycles and also reduces code size.

Block copy example

void bcopy(char *to, char *from, int n)
{
  while (n--)
    *to++ = *from++;
}
@ arguments: R0: to, R1: from, R2: n
bcopy:  TEQ  R2, #0
        BEQ  end
loop:   SUB  R2, R2, #1
        LDRB R3, [R1], #1
        STRB R3, [R0], #1
        B    bcopy
end:    MOV  PC, LR
@ arguments: R0: to, R1: from, R2: n
@ rewrite "n--" as "--n>=0"
bcopy:  SUBS   R2, R2, #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR
@ arguments: R0: to, R1: from, R2: n
@ assume n is a multiple of 4; loop unrolling
bcopy:  SUBS   R2, R2, #4
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16; 
bcopy:  SUBS   R2, R2, #16
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        BPL   bcopy
        MOV   PC, LR
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16; 
bcopy:  SUBS   R2, R2, #16
        LDMPL  R1!, {R3-R6} 
        STMPL  R0!, {R3-R6}
        BPL    bcopy
        MOV    PC, LR

@ could be extended to copy 40 bytes at a time
@ if n is not a multiple of 40, add a copy_rest loop

Search example

int main(void)
{
  int a[10]={7,6,4,5,5,1,3,2,9,8};
  int i;
  int s=4;

  for (i=0; i<10; i++)
    if (s==a[i]) break;
  if (i>=10) return -1;
  else return i;
}
           .section     .rodata
.LC0:
       .word      7
       .word      6
       .word      4
       .word      5
       .word      5
       .word      1
       .word      3
       .word      2
       .word      9
       .word      8
           .text
           .global    main
        .type    main, %function
main: sub   sp, sp, #48
      adr   r4, L9 @ =.LC0
      add   r5, sp, #8
      ldmia r4!, {r0, r1, r2, r3} 
      stmia r5!, {r0, r1, r2, r3} 
      ldmia r4!, {r0, r1, r2, r3} 
      stmia r5!, {r0, r1, r2, r3} 
      ldmia r4!, {r0, r1} 
      stmia r5!, {r0, r1} 
mov  r3, #4
      str  r3, [sp, #0] @ s=4
      mov  r3, #0
      str  r3, [sp, #4] @ i=0

loop: ldr  r0, [sp, #4] @ r0=i
      cmp  r0, #10      @ i<10?
      bge  end
      ldr  r1, [sp, #0] @ r1=s
      mov  r2, #4
      mul  r3, r0, r2
      add  r3, r3, #8
      ldr  r4, [sp, r3] @ r4=a[i]
teq  r1, r4  @ test if s==a[i]
     beq  end  

     add  r0, r0, #1   @ i++
     str  r0, [sp, #4] @ update i
     b    loop  

end: str   r0, [sp, #4]
     cmp   r0, #10
     movge r0, #-1
     add   sp, sp, #48
     mov   pc, lr 

Optimization

  • Remove unnecessary load/store
  • Remove loop invariant
  • Use addressing mode
  • Use conditional execution

Search (remove load/store)

Search (loop invariant/addressing mode)

  • Remove unnecessary load/store
  • Remove loop invariant
  • Use addressing mode
  • Use conditional execution
  • From 22 words to 13 words and execution time is greatly reduced.
