Assembly

Embedded

发布日期: 2021-10-21

更新日期: 2021-11-08

文章字数: 7.1k

阅读时长: 44 分

阅读次数:

Overview

Translating languages

Virtual machines

Abstractions for computers

High-level language

Level 5
Application-oriented languages
Programs compile into assembly language (Level 4)

Assembly language

Level 4
Instruction mnemonics that have a one-to-one correspondence to machine language
Calls functions written at the operating system level (Level 3)
Programs are translated into machine language (Level 2)

Operating system

Level 3
Provides services
Programs translated and run at the instruction set architecture level (Level 2)

Instruction set architecture

Level 2
Also known as conventional machine language
Executed by Level 1 program (microarchitecture, Level 1)

Microarchitecture

Level 1
Interprets conventional machine instructions (Level 2)
Executed by digital hardware (Level 0)

Digital logic

Level 0
CPU, constructed from digital logic gates
System bus
Memory

Data representation

Computer is a construction of digital circuits with two states: on and off
You need to have the ability to translate between different representations to examine the content of the machine
Common number systems: binary, octal, decimal and hexadecimal

Binary representations

Electronic Implementation
- Easy to store with bistable elements
- Reliably transmitted on noisy and inaccurate wires

Digits are 1 and 0 (a binary digit is called a bit)

1 = true

0 = false
MSB –most significant bit
LSB –least significant bit
Bit numbering:
A bit string could have different interpretations

Unsigned binary integers

Each digit (bit) is either 1 or 0
Each bit represents a power of 2:

Integer storage sizes

Large measurements

•Kilobyte (KB), 210 bytes

•Megabyte (MB), 220 bytes

•Gigabyte (GB), 230 bytes

•Terabyte (TB), 240 bytes

•Petabyte

•Exabyte

•Zettabyte

•Yottabyte

Signed integers

The highest bit indicates the sign. 1 = negative, 0 = positive

If the highest digit of a hexadecmal integer is > 7, the value is negative. Examples: 8A, C5, A2, 9D

Two’s complement notation

Steps:

Complement (reverse) each bit
Add 1

Ranges of signed integers

Character

Character sets
- Standard ASCII (0 – 127)
- Extended ASCII (0 – 255)
- ANSI (0 – 255)
- Unicode (0 – 65,535)
Null-terminated String
- Array of characters followed by a null byte
Using the ASCII table
- back inside cover of book

Representing Instructions

int sum(int x, int y)
{
   return x+y;
}

For this example, Alpha & Sun use two 4-byte instructions
- Use differing numbers of instructions in other cases
PC uses 7 instructions with lengths 1, 2, and 3 bytes
- Same for NT and for Linux
- NT / Linux not fully binary compatible

Boolean algebra

Boolean expressions created from:
- NOT, AND, OR

Implementation of gates

Advanced Architecture

Basic architecture

Basic microcomputer design

clock synchronizes CPU operations
control unit (CU) coordinates sequence of execution steps
ALU performs arithmetic and logic operations
The memory storage unit holds instructions and data for a running program
A bus is a group of wires that transfer data from one part to another (data, address, control)

Clock

synchronizes all CPU and BUS operations
machine (clock) cycle measures time of a single operation
clock is used to trigger events
Basic unit of time, 1GHz→clock cycle=1ns
An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles

Instruction execution cycle

Fetch
Decode
Fetch operands
Execute
Store output

Pipeline

Multi-stage pipeline

Pipelining makes it possible for processor to execute instructions in parallel
Instruction execution divided into discrete stages

Example of a non-pipelined processor. For example, 80386. Many wasted cycles.

Pipelined execution

More efficient use of cycles, greater throughput of instructions: (80486 started to use pipelining)

For k stages and n instructions, the number of required cycles is: k + (n – 1) compared to k*n

Pipelining requires buffers
- Each buffer holds a single value
- Ideal scenario: equal work for each stage
  - Sometimes it is not possible
  - Slowest stage determines the flow rate in the entire pipeline

Some reasons for unequal work stages
- A complex step cannot be subdivided conveniently
- An operation takes variable amount of time to execute, e.g. operand fetch time depends on where the operands are located
  - Registers
  - Cache
  - Memory
- Complexity of operation depends on the type of operation
  - Add: may take one cycle
  - Multiply: may take several cycles
Operand fetch of I2 takes three cycles
- Pipeline stalls for two cycles
  - Caused by hazards
- Pipeline stalls reduce overall throughput

Wasted cycles (pipelined)

When one of the stages requires two or more clock cycles, clock cycles are again wasted.

For k stages and n instructions, the number of required cycles is: k + (2n – 1)

Superscalar

A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).

For k states and n instructions, the number of required cycles is: k + n
Pentium: 2 pipelines
Pentium Pro: 3

Pipeline stages

Pentium 3: 10
Pentium 4: 20~31
Next-generation micro-architecture: 14
ARM7: 3

Hazards(危害)

Three types of hazards
- Resource hazards
  - Occurs when two or more instructions use the same resource, also called structural hazards
- Data hazards
  - Caused by data dependencies between instructions, e.g. result produced by I1 is read by I2
- Control hazards
  - Default: sequential execution suits pipelining
  - Altering control flow (e.g., branching) causes problems, introducing control dependencies

Data hazards

add r1, r2, #10        ; write r1
sub r3, r1, #20        ; read r1

Forwarding: provides output result as soon as possible

Control hazards

            bz  r1, target        
            add r2, r4, 0
            ...
target:    add r2, r3, 0

Braches alter control flow
- Require special attention in pipelining
- Need to throw away some instructions in the pipeline
  - Depends on when we know the branch is taken
  - Pipeline wastes three clock cycles
    - Called branch penalty
- Reducing branch penalty
  - Determine branch decision early
Delayed branch execution
- Effectively reduces the branch penalty
- We always fetch the instruction following the branch
  - Why throw it away?
  - Place a useful instruction to execute
  - This is called delay slot

Branch prediction

Three prediction strategies
- Fixed
  - Prediction is fixed
    - Example: branch-never-taken
      
      »Not proper for loop structures
- Static
  - Strategy depends on the branch type
    - Conditional branch: always not taken
    - Loop: always taken
- Dynamic
  - Takes run-time history to make more accurate predictions
Static prediction
- Improves prediction accuracy over Fixed

Dynamic branch prediction
- Uses runtime history
  - Takes the past n branch executions of the branch type and makes the prediction
- Simple strategy
  - Prediction of the next branch is the majority of the previous n branch executions
  - Example: n = 3
    - If two or more of the last three branches were taken, the prediction is “branch taken”
  - Depending on the type of mix, we get more than 90% prediction accuracy
Impact of past n branches on prediction accuracy

Multitasking

OS can run multiple programs at the same time.
Multiple threads of execution within the same program.
Scheduler utility assigns a given amount of CPU time to each running program.
Rapid switching of tasks
- gives illusion that all programs are running at once
- the processor must support task switching
- scheduling policy, round-robin, priority

Cache

SRAM vs DRAM

The CPU-Memory gap

The gap widens between DRAM, disk, and CPU speeds.

Memory hierarchies(等级)

Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
- The gap between CPU and main memory speed is widening.
- Well-written programs tend to exhibit good locality.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Memory system in practice

Reading from memory

Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g.33 MHz). The wasted clock cycles are called wait states.

Cache memory

High-speed expensive static RAM both inside and outside the CPU.
- Level-1 cache: inside the CPU
- Level-2 cache: outside the CPU
Cache hit: when data to be read is already in cache memory
Cache miss: when data to be read is not in cache memory. When? compulsory, capacity and conflict.
Cache design: cache size, n-way, block size, replacement policy

Caching in a memory hierarchy

General caching concepts

Program needs object d, which is stored in some block b.
- Cache hit
  - Program finds b in the cache at level k. E.g., block 14.
- Cache miss
  - b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12.
  - If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
    - Placement policy: where can the new block go? E.g., b mod 4
    - Replacement policy: which block should be evicted? E.g., LRU

Locality

Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
- Temporal locality(时域): recently referenced items are likely to be referenced in the near future.
- Spatial locality(空间域): items with nearby addresses tend to be referenced close together in time.
In general, programs with good locality run faster then programs with poor locality
Locality is the reason why cache and virtual memory are designed in architecture and operating system. Another example is web browser caches recently visited webpages.

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

stride-1 reference pattern

Does this function have good locality?

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

stride-N reference pattern

Blocked matrix multiply performance

Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
- relatively insensitive to array size.

Cache-conscious programming

make sure that memory is cache-aligned

Split data into hot and cold (list example)

Use union and bitfields to reduce size and increase locality

RISC v.s. CISC

Trade-offs of instruction sets

Before 1980, the trend is to increase instruction complexity (one-to-one mapping if possible) to bridge the gap. Reduce fetch from memory. Selling point: number of instructions, addressing modes. (CISC)
1980, RISC. Simplify and regularize instructions to introduce advanced architecture for better performance, pipeline, cache, superscalar.

RISC

1980, Patternson and Ditzel (Berkeley),RISC
Features
- Fixed-length instructions
- Load-store architecture
- Register file
Organization
- Hard-wired logic
- Single-cycle instruction
- Pipeline
Pros: small die size, short development time, high performance
Cons: low code density, not x86 compatible

RISC Design Principles

Simple operations
- Simple instructions that can execute in one cycle
Register-to-register operations
- Only load and store operations access memory
- Rest of the operations on a register-to-register basis
Simple addressing modes
- A few addressing modes (1 or 2)
Large number of registers
- Needed to support register-to-register operations
- Minimize the procedure call and return overhead
Fixed-length instructions
- Facilitates efficient instruction execution
Simple instruction format
- Fixed boundaries for various fields
  - opcode, source operands,…

CISC and RISC

CISC – complex instruction set
- large instruction set
- high-level operations (simpler for compiler?)
- requires microcode interpreter (could take a long time)
- examples: Intel 80x86 family
RISC – reduced instruction set
- small instruction set
- simple, atomic instructions
- directly executed by hardware very quickly
- easier to incorporate advanced architecture design
- examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq), PowerPC, MIPS

	CISC (Intel 486)	RISC (MIPS R4000)
#instructions	235	94
Addr. modes	11	1
Inst. Size (bytes)	1-12	4
GP registers	8	32

Why RISC?

Simple instructions are preferred
- Complex instructions are mostly ignored by compilers
  - Due to semantic gap
Simple data structures
- Complex data structures are used relatively infrequently
- Better to support a few simple data types efficiently
  - Synthesize complex ones
Simple addressing modes
- Complex addressing modes lead to variable length instructions
  - Lead to inefficient instruction decoding and scheduling

Large register set
- Efficient support for procedure calls and returns
  - Patterson and Sequin’s study
    - Procedure call/return: 12-15% of HLL statements
      
      »Constitute 31-33% of machine language instructions
      
      »Generate nearly half (45%) of memory references
- Small activation record
  - Tanenbaum’s study
    - Only 1.25% of the calls have more than 6 arguments
    - More than 93% have less than 6 local scalar variables
    - Large register set can avoid memory references

ISA design issues

ISA：Instruction Set Architecture

Instruction set design

Issues when determining ISA
- Instruction types
- Number of addresses
- Addressing modes

Instruction types

Arithmetic and logic
Data movement
I/O (memory-mapped, isolated I/O)
Flow control
- Branches (unconditional, conditional)
  - set-then-jump (cmp AX, BX; je target)
  - Test-and-jump (beq r1, r2, target)
- Procedure calls (register-based, stack-based)
  - Pentium: ret; MIPS: jr
  - Register: faster but limited number of parameters
  - Stack: slower but more general

Operand types

Instructions support basic data types
- Characters
- Integers
- Floating-point
Instruction overload
- Same instruction for different data types
- Example: Pentium
  
  mov AL,address ;loads an 8-bit value
  
  mov AX,address ;loads a 16-bit value
  
  mov EAX,address ;loads a 32-bit value
Separate instructions
- Instructions specify the operand size
- Example: MIPS

lb    Rdest,address   ;loads a byte
lh    Rdest,address   ;loads a halfword 
                      ;(16 bits) 
lw    Rdest,address   ;loads a word 
                      ;(32 bits) 
ld    Rdest,address   ;loads a doubleword 
                      ;(64 bits)

Number of addresses

Four categories
- 3-address machines
  - two for the source operands and one for the result
- 2-address machines
  - One address doubles as source and result
- 1-address machine
  - Accumulator machines
  - Accumulator is used for one source and result
- 0-address machines
  - Stack machines
  - Operands are taken from the stack
  - Result goes onto the stack

Number of addresses	instruction	operation
3	OP A, B, C	A ← B OP C
2	OP A, B	A ← A OP B
1	OP A	AC ← AC OP A
0	OP	T ← (T-1) OP T

A, B, C: memory or register locations
AC: accumulator
T: top of stack
T-1: second element of stack

3-address

Example: RISC machines, TOY

SUB Y, A, B    ; Y = A - B
MUL T, D, E    ; T = D × E
ADD T, T, C    ; T = T + C
DIV Y, Y, T    ; Y = Y / T

2-address

Example: IA32

MOV Y, A        ; Y = A
SUB Y, B        ; Y = Y - B
MOV T, D        ; T = D
MUL T, E        ; T = T × E
ADD T, C        ; T = T + C
DIV Y, T        ; Y = Y / T

1-address

Example: IA32’s MUL (EAX)

LD     D        ; AC = D
MUL    E        ; AC = AC × E
ADD    C        ; AC = AC + C
ST    Y        ; Y = AC
LD    A        ; AC = A
SUB    B        ; AC = AC – B
DIV     Y        ; AC = AC / Y
ST    Y        ; Y = AC

0-address

Example: IA32’s FPU, HP3000

PUSH     A        ; A
PUSH B        ; A, B
SUB            ; A-B
PUSH C        ; A-B, C
PUSH    D        ; A-B, C, D
PUSH E        ; A-B, C, D, E
MUL            ; A-B, C, D× E
ADD            ; A-B, C+(D× E)
DIV             ; (A-B) / (C+(D× E))
POP    Y

A basic design decision; could be mixed
Fewer addresses per instruction results in
- a less complex processor
- shorter instructions
- longer and more complex programs
- longer execution time
The decision has impacts on register usage policy as well
- 3-address usually means more general-purpose registers
- 1-address usually means less

Addressing modes

How to specify location of operands? Trade-off for address range, address flexibility, number of memory references, calculation of addresses
Operands can be in three places
- Registers
  - Register addressing mode
- Part of instruction
  - Constant
  - Immediate addressing mode
  - All processors support these two addressing modes
- Memory
  - Difference between RISC and CISC
  - CISC supports a large variety of addressing modes
  - RISC follows load/store architecture

Common addressing modes
- Implied
- Immediate (lda R1, 1)
- Direct (st R1, A)
- Indirect
- Register (add R1, R2, R3)
- Register indirect (sti R1, R2)
- Displacement
- Stack

Implied addressing

No address field; operand is implied by the instruction

CLC ; clear carry
A fixed and unvarying address

Immediate addressing

Address field contains the operand value

ADD 5; AC=AC+5
Pros: no extra memory reference; faster
Cons: limited range

Direct addressing

Address field contains the effective address of the operand

ADD A; AC=AC+[A]
single memory reference
Pros: no additional address calculation
Cons: limited address space

Indirect addressing

Address field contains the address of a pointer to the operand

ADD [A]; AC=AC+[[A]]
multiple memory references
Pros: large address space
Cons: slower

Register addressing

Address field contains the address of a register

ADD R; AC=AC+R
Pros: only need a small address field; shorter instruction and faster fetch; no memory reference
Cons: limited address space

Register indirect addressing

Address field contains the address of the register containing a pointer to the operand

ADD [R]; AC=AC+[R]
Pros: large address space
Cons: extra memory reference

Displacement addressing

Address field could contain a register address and an address

MOV EAX, [A+ESI*4]
EA=A+[R×S] or vice versa
Several variants
- Base-offset: [EBP+8]
- Base-index: [EBX+ESI]
- Scaled: [T+ESI*4]
Pros: flexible
Cons: complex

MOV EAX, [A+ESI*4]

Often, register, called indexing register, is used for displacement.
Usually, a mechanism is provided to efficiently increase the indexing register.

Stack addressing

Operand is on top of the stack

ADD [R]; AC=AC+[R]
Pros: large address space
Pros: short and fast fetch
Cons: limited by FILO order

Mode	Meaning	Pros	Cons
Implied		Fast fetch	Limited instructions
Immediate	Operand=A	No memory ref	Limited operand
Direct	EA=A	Simple	Limited address space
Indirect	EA=[A]	Large address space	Multiple memory ref
Register	EA=R	No memory ref	Limited address space
Register indirect	EA=[R]	Large address space	Extra memory ref
Displacement	EA=A+[R]	Flexibility	Complexity
stack	EA=stack top	No memory ref	Limited applicability

IA32 addressing modes

Effective address calculation (IA32)

Based Addressing

Effective address is computed as base + signed displacement
- Displacement:
  - 16-bit addresses: 8- or 16-bit number
  - 32-bit addresses: 8- or 32-bit number
Useful to access fields of a structure or record
- Base register -> points to the base address of the structure
- Displacement -> relative offset within the structure
Useful to access arrays whose element size is not 2, 4, or 8 bytes
- Displacement -> points to the beginning of the array
- Base register -> relative offset of an element within the array

Indexed Addressing

Effective address is computed as (index * scale factor) + signed displacement
- 16-bit addresses:
  - displacement: 8- or 16-bit number
  - scale factor: none (i.e., 1)
- 32-bit addresses:
  - displacement: 8- or 32-bit number
  - scale factor: 2, 4, or 8
Useful to access elements of an array (particularly if the element size is 2, 4, or 8 bytes)
- Displacement -> points to the beginning of the array
- Index register -> selects an element of the array (array index)
- Scaling factor -> size of the array element

Based-Indexed Addressing

Based-indexed addressing with no scale factor

Effective address is computed as base + index + signed displacement
Useful in accessing two-dimensional arrays
- Displacement -> points to the beginning of the array
- Base and index registers point to a row and an element within that row
Useful in accessing arrays of records
- Displacement -> represents the offset of a field in a record
- Base and index registers hold a pointer to the base of the array and the offset of an element relative to the base of the array
Useful in accessing arrays passed on to a procedure
- Base register -> points to the beginning of the array
- Index register -> represents the offset of an element relative to the base of the array

Example

Assuming BX points to table1

 **mov AX,[BX+SI]**

 **cmp** **AX,[BX+SI+2]**

compares two successive elements of table1

Based-indexed addressing with scale factor

Effective address is computed as base + (index * scale factor) + signed displacement
Useful in accessing two-dimensional arrays when the element size is 2, 4, or 8 bytes
- Displacement ==> points to the beginning of the array
- Base register ==> holds offset to a row (relative to start of array)
- Index register ==> selects an element of the row
- Scaling factor ==> size of the array element

ARM Architecture

ARM history

1983 developed by Acorn computers
- To replace 6502 in BBC computers
- 4-man VLSI design team
- Its simplicity comes from the inexperience team
- Match the needs for generalized SoC for reasonable power, performance and die size
- The first commercial RISC implemenation
1990 ARM (Advanced RISC Machine), owned by Acorn, Apple and VLSI

Design and license ARM core design but not fabricate

Why ARM?

One of the most licensed and thus widespread processor cores in the world
- Used in PDA, cell phones, multimedia players, handheld game console, digital TV and cameras
- ARM7: GBA, iPod
- ARM9: NDS, PSP, Sony Ericsson, BenQ
- ARM11: Apple iPhone, Nokia N93, N800
- 90% of 32-bit embedded RISC processors till 2009
Used especially in portable devices due to its low power consumption and reasonable performance

ARM processors

A simple but powerful design
A whole family of designs sharing similar design principles and a common instruction set

Naming ARM

ARMxyzTDMIEJFS
- x: series
- y: MMU
- z: cache
- T: Thumb
- D: debugger
- M: Multiplier
- I: EmbeddedICE (built-in debugger hardware)
- E: Enhanced instruction
- J: Jazelle (JVM)
- F: Floating-point
- S: Synthesizible version (source code version for EDA tools)

Popular ARM architectures

ARM7TDMI
- 3 pipeline stages (fetch/decode/execute)
- High code density/low power consumption
- One of the most used ARM-version (for low-end systems)
- All ARM cores after ARM7TDMI include TDMI even if they do not include TDMI in their labels
ARM9TDMI
- Compatible with ARM7
- 5 stages (fetch/decode/execute/memory/write)
- Separate instruction and data cache
ARM11

ARM is a RISC

RISC: simple but powerful instructions that execute within a single cycle at high clock speed.
Four major design rules:
- Instructions: reduced set/single cycle/fixed length
- Pipeline: decode in one stage/no need for microcode
- Registers: a large set of general-purpose registers
- Load/store architecture: data processing instructions apply to registers only; load/store to transfer data from memory
Results in simple design and fast clock rate
The distinction blurs because CISC implements RISC concepts

ARM design philosophy(哲学)

Small processor for lower power consumption (for embedded system)
High code density for limited memory and physical size restrictions
The ability to use slow and low-cost memory
Reduced die size for reducing manufacture cost and accommodating more peripherals

ARM features

Different from pure RISC in several ways:
- Variable cycle execution for certain instructions: multiple-register load/store (faster/higher code density)
- Inline barrel shifter leading to more complex instructions: improves performance and code density
- Thumb 16-bit instruction set: 30% code density improvement
- Conditional execution: improve performance and code density by reducing branch
- Enhanced instructions: DSP instructions

Load/store architecture
A large array of uniform registers
Fixed-length 32-bit instructions
3-address instructions

Registers

Only 16 registers are visible to a specific mode. A mode could access
- A particular set of r0-r12
- r13 (sp, stack pointer)
- r14 (lr, link register)
- r15 (pc, program counter)
- Current program status register (cpsr)
- The uses of r0-r13 are orthogonal

General-purpose registers

6 data types (signed/unsigned)
All ARM operations are 32-bit. Shorter data types are only supported by data transfer operations.

Program counter

Store the address of the instruction to be executed
All instructions are 32-bit wide and word-aligned
Thus, the last two bits of pc are undefined.

Program status register (CPSR)

Processor modes

Register organization

Instruction sets

ARM/Thumb/Jazelle

Pipeline

Execution of a branch or direct modification of pc causes ARM core to flush its pipeline
ARM10 starts to use branch prediction
An instruction in the execution stage will complete even though an interrupt has been raised. Other instructions in the pipeline are abondond.

Interrupts

ARM Instruction Set

ARM programmer model

The state of an ARM system is determined by the content of visible registers and memory.
A user-mode program can see 15 32-bit general-purpose registers (R0-R14), program counter (PC) and CPSR.
Instruction set defines the operations that can change the state.

Memory system

Byte ordering

ARM programmer model

Instruction set

Features of ARM instruction set

Load-store architecture
3-address instructions
Conditional execution of every instruction
Possible to load/store multiple registers at once
Possible to combine shift and ALU operations in a single instruction

Data processing

They are move, arithmetic, logical, comparison and multiply instructions.
Most data processing instructions can process one of their operands using the barrel shifter.
General rules:
- All operands are 32-bit, coming from registers or literals.
- The result, if any, is 32-bit and placed in a register (with the exception for long multiply which produces a 64-bit result)
- 3-address format

MOV<cc><S>  Rd, <operands>

MOVCS R0, R1 @ if carry is set
             @ then R0:=R1

MOVS  R0, #0 @ R0:=0
             @ Z=1, N=0
             @ C, V unaffected

Conditional execution

Almost all ARM instructions have a condition field which allows it to be executed conditionally.

movcs R0, R1

Register movement

Addressing modes

Register operands

ADD R0, R1, R2
Immediate operands

Shifted register operands

One operand to ALU is routed through the Barrel shifter. Thus, the operand can be modified before it is used. Useful for fast multipliation and dealing with lists, table and other complex data structure. (similar to the displacement addressing mode in CISC.)
Some instructions (e.g. MUL, CLZ, QADD) do not read barrel shifter.

It is possible to use a register to specify the number of bits to be shifted; only the bottom 8 bits of the register are significant.

Multiplication

MOV  R1, #35
 MUL  R2, R0, R1
         or
 ADD  R0, R0, R0, LSL #2  @ R0’=5xR0
 RSB  R2, R0, R0, LSL #3  @ R2 =7xR0’

Shifted register operands

Encoding data processing instructions

Arithmetic

Add and subtraction

Setting the condition codes

Any data processing instruction can set the condition codes if the programmers wish it to

64-bit addition

ADDS R2, R2, R0

ADC R3, R3, R1

Logical

Comparison

These instructions do not generate a result, but set condition code bits (N, Z, C, V) in CPSR. Often, a branch operation follows to change the program flow.

Multiplication

MUL R0, R1, R2 @ R0 = (R1xR2)[31:0]
Features:
- Second operand can’t be immediate
- The result register must be different from the first operand
- Cycles depends on core type
- If S bit is set, C flag is meaningless
See the reference manual

Multiply-accumulate (2D array indexing)

MLA R4, R3, R2, R1 @ R4 = R3xR2+R1
Multiply with a constant can often be more efficiently implemented using shifted register operand

MOV R1, #35

MUL R2, R0, R1

ADD R0, R0, R0, LSL #2 @ R0’=5xR0

 **RSB R2, R0, R0, LSL #3 @ R2 =7xR0’**

Flow control instructions

Determine the instruction to be executed next

Branch instruction

B  label    
            …
label:    …

Conditional branches

MOV  R0, #0    
loop:        …
            ADD  R0, R0, #1
            CMP  R0, #10
            BNE  loop

Branch conditions

Branch and link

BL instruction saves the return address to R14 (lr)

        BL    sub     @ call sub
          CMP   R1, #5  @ return to here
          MOVEQ R1, #0
          …
sub:    …            @ sub entry point
        …
        MOV   PC, LR    @ return

Conditional execution

CMP  R0, #5    
            BEQ  bypass     @ if (R0!=5) {
            ADD  R1, R1, R0 @  R1=R1+R0-R2 
            SUB  R1, R1, R2 @ }
bypass:    …
----------------------------------------------------
            CMP   R0, #5
            ADDNE R1, R1, R0
            SUBNE R1, R1, R2    //smaller and faster

Rule of thumb: if the conditional sequence is three instructions or less, it is better to use conditional execution than a branch.

if ((R0==R1) && (R2==R3)) R4++
-----------------------------------------------------
          CMP   R0, R1
          BNE   skip
          CMP   R2, R3
          BNE   skip
          ADD   R4, R4, #1
skip:    …  
-----------------------------------------------------
          CMP   R0, R1
          CMPEQ R2, R3
          ADDEQ R4, R4, #1

Data transfer instructions

Move data between registers and memory
Three basic forms
- Single register load/store
- Multiple register load/store
- Single register swap: SWP(B), atomic instruction for semaphore

Single register load/store

No STRSB/STRSH since STRB/STRH stores both

signed/unsigned ones

The data items can be a 8-bit byte, 16-bit half-word or 32-bit word. Addresses must be boundary aligned. (e.g. 4’s multiple for LDR/STR)

Addressing modes

Memory is addressed by a register and an offset.

LDR R0, [R1] @ mem[R1]
Three ways to specify offsets:
- Immediate
  
  LDR R0, [R1, #4] @ mem[R1+4]
- Register
  
  LDR R0, [R1, R2] @ mem[R1+R2]
- Scaled register @ mem[R1+4*R2]
  
  LDR R0, [R1, R2, LSL #2]
Pre-index addressing (LDR R0, [R1, #4]) without a writeback
Auto-indexing addressing (LDR R0, [R1, #4]!)

Pre-index with writeback

calculation before accessing with a writeback
Post-index addressing (LDR R0, [R1], #4)

calculation after accessing with a writeback

Comparisons

Pre-indexed addressing

LDR R0, [R1, R2] @ R0=mem[R1+R2]

@ R1 unchanged

Auto-indexing addressing

LDR R0, [R1, R2]! @ R0=mem[R1+R2]

@ R1=R1+R2

Post-indexed addressing

LDR R0, [R1], R2 @ R0=mem[R1]

@ R1=R1+R2

Summary of addressing modes

Load an address into a register

Note that all addressing modes are register-offseted. Can we issue LDR R0, Table? The pseudo instruction ADR loads a register with an address

table:     .word    10
…
            ADR  R0, table

Assembler transfer pseudo instruction into a sequence of appropriate instructions

sub r0, pc, #12

Application

            ADR R1, table
loop:     LDR R0, [R1]
            ADD R1, R1, #4
            @ operations on R0
            …
--------------------------------------------
            ADR R1, table
loop:     LDR R0, [R1], #4

            @ operations on R0
            …

Multiple register load/store

Transfer a block of data more efficiently.
Used for procedure entry and exit for saving and restoring workspace registers and the return address
For ARM7, 2+Nt cycles (N:#words, t:time for a word for sequential access). Increase interrupt latency since it can’t be interrupted.
registers are arranged an in increasing order; see manual

LDMIA  R1, {R0, R2, R5} @ R0 = mem[R1]
                         @ R2 = mem[r1+4]
                         @ R5 = mem[r1+8]

LDM load multiple registers
STM store multiple registers

suffix meaning

IA increase after
IB increase before
DA decrease after
DB decrease before

LDM<mode> Rn, {<registers>}
IA: addr:=Rn    //1
IB: addr:=Rn+4    //2
DA: addr:=Rn-#<registers>*4+4    //3
DB: addr:=Rn-#<registers>*4        //4
For each Ri in <registers> 
  IB: addr:=addr+4
  DB: addr:=addr-4
  Ri:=M[addr]
  IA: addr:=addr+4
  DA: addr:=addr-4
<!>: Rn:=addr

LDMIA R0, {R1,R2,R3}
  or
LDMIA R0, {R1-R3}


R1: 10
R2: 20
R3: 30
R0: 0x10

LDMIA R0!, {R1,R2,R3}

R1: 10
R2: 20
R3: 30
R0: 0x01C

LDMIB R0!, {R1,R2,R3}

R1: 20
R2: 30
R3: 40
R0: 0x01C

LDMDA R0!, {R1,R2,R3}

R1: 40
R2: 50
R3: 60
R0: 0x018

LDMDB R0!, {R1,R2,R3}

R1: 30
R2: 40
R3: 50
R0: 0x018

Copy a block of memory
- R9: address of the source
- R10: address of the destination
- R11: end address of the source

loop: LDMIA R9!, {R0-R7}
         STMIA R10!, {R0-R7}
         CMP   R9, R11    
          BNE   loop

Stack (full: pointing to the last used; ascending: grow towards increasing memory addresses)

mode	POP	=LDM	PUSH	=STM
Full ascending (FA)	LDMFA	LDMDA	STMFA	STMIB
Full descending (FD)	LDMFD	LDMIA	STMFD	STMDB
Empty ascending (EA)	LDMEA	LDMDB	STMEA	STMIA
Empty descending (ED)	LDMED	LDMIB	STMED	STMDA

STMFD R13!, {R2-R9} @ used for ATPCS
… @ modify R2-R9
LDMFD R13!, {R2-R9}

Swap instruction

Swap between memory and register. Atomic operation preventing any other instruction from reading/writing to that location until it completes

Software interrupt

A software interrupt instruction causes a software interrupt exception, which provides a mechanism for applications to call OS routines.

Load constants

No ARM instruction loads a 32-bit constant into a register because ARM instructions are 32-bit long. There is a pseudo code for this.

Immediate numbers

Assemblers implement this usually with two options depending on the number you try to load.

Assume that you want to load 511 into R0
- Construct in multiple instructions
  
  mov r0, #256
  
  add r0, #255
- Load from memory; declare L511 .word 511
  
  ldr r0, L511 ldr r0, [pc, #0]
Guideline: if you can construct it in two instructions, do it; otherwise, load it.
The assembler decides for you

ldr r0, =255 mov r0, 255

ldr r0, =511 ldr r0, [pc, #4]

PC-relative modes

Instruction set

ARM Assembly Programming

GNU compiler and binutils

GNU compiler and binutils
gcc: GNU C compiler
as: GNU assembler
ld: GNU linker
gdb: GNU project debugger
insight: a (Tcl/Tk) graphic interface to gdb

Pipeline

COFF (common object file format)
ELF (extended linker format)
Segments in the object file
- Text: code
- Data: initialized global variables
- BSS: uninitialized global variables

GAS program format

            .file “test.s”
            .text
            .global main
            .type main, %function
main:
            MOV R0, #100
            ADD R0, R0, R0
            SWI #11
            .end

ARM assembly program

Control structures

Program is to implement algorithms to solve problems. Program decomposition and flow of control are important concepts to express algorithms.
Flow of control:
- Sequence.
- Decision: if-then-else, switch
- Iteration: repeat-until, do-while, for
Decomposition: split a problem into several smaller and manageable ones and solve them independently. (subroutines/functions/procedures)

Decision

If statements

if (R1==1 || R1==5 || R1==12) R0=1;

TEQ   R1, #1        ...
TEQNE R1, #5        ...
TEQNE R1, #12        ...
MOVEQ R0, #1        BNE fail

if (R1==0) zero
else if (R1>0) plus
else if (R1<0) neg

      CMP   R1, #0        
      BMI   neg
      BEQ   zero
      BPL   plus
neg:  ...
      B exit
Zero: ...
      B exit
      ...

R0=abs(R0)

CMP    R0, #0
RSBMI  R0, R0, #0

Multi-way branches

CMP R0, #`0’
        BCC other    @ less than ‘0’
        CMP R0, #`9’
        BLS digit   @ between ‘0’ and ‘9’
        ------------------------------------
        CMP R0, #`A’
        BCC other
        CMP R0, #`Z’
        BLS letter  @ between ‘A’ and ‘Z’
        ------------------------------------
        CMP R0, #`a’
        BCC other
        CMP R0, #`z’
        BHI other  @ not between ‘a’ and ‘z’  
        ------------------------------------
letter: ...

Switch statements

Iteration

repeat loops

while loops

GCD

int gcd (int i, int j)
{
    while (i!=j)
    {
      if (i>j)
        i -= j;
      else
        j -= i;
    }
}

Loop:  CMP   R1, R2
       SUBGT R1, R1, R2
       SUBLT R2, R2, R1
       BNE   loop

for loops

Procedures

Arguments: expressions passed into a function
Parameters: values received by the function
Caller and callee

How to pass arguments? By registers? By stack? By memory? In what order?

Who should save R5? Caller? Callee?

We need a protocol for these.

ARM Procedure Call Standard (APCS)

ARM Ltd. defines a set of rules for procedure entry and exit so that
- Object codes generated by different compilers can be linked together
- Procedures can be called between high-level languages and assembly
APCS defines
- Use of registers
- Use of stack
- Format of stack-based data structure
- Mechanism for argument passing

APCS register usage convention

0-4:
- Used to pass the first 4 parameters
- Caller-saved if necessary
4-11:
- Register variables, must return unchanged
- Callee-saved
9-15
- Registers for special purposes
- Could be used as temporary variables if saved properly.

Argument passing

The first four word arguments are passed through R0 to R3.
Remaining parameters are pushed into stack in the reverse order.
Procedures with less than four parameters are more effective.

Return value

One word value in R0
A value of length 2~4 words (R0-R1, R0-R2, R0-R3)

Function entry/exit

A simple leaf function with less than four parameters has the minimal overhead. 50% of calls are to leaf functions

Save a minimal set of temporary variables

BL leaf2
       ...

leaf2: STMFD sp!, {regs, lr} @ save
       ...
       LDMFD sp!, {regs, pc} @ restore and
                               @ return

Accessing operands

A procedure often accesses operands in the following ways
- An argument passed on a register: no further work
- An argument passed on the stack: use stack pointer (R13) relative addressing with an immediate offset known at compiling time
- A constant: PC-relative addressing, offset known at compiling time
- A local variable: allocate on the stack and access through stack pointer relative addressing
- A global variable: allocated in the static area and can be accessed by the static base relative (R9) addressing

main:
        LDR   R0, #0
        ...
        BL    func
        ...

func:   STMFD SP!, {R4-R6, LR}
        SUB   SP, SP, #0xC
        ...
        STR   R0, [SP, #0] @ v1=a1

        ...
        ADD   SP, SP, #0xC
        LDMFD SP!, {R4-R6, PC}

Optimizing ARM Assembly

Optimization

Compilers do perform optimization, but they have blind sites. There are some optimization tools that you can’t explicitly use by writing C, for example.
- Instruction scheduling
- Register allocation
- Conditional execution
You have to use hand-written assembly to optimize critical routines.
Use ARM9TDMI as the example, but the rules apply to all ARM cores.
Note that the codes are sometimes in armasm format, not gas.

ARM optimization

Utilize ARM ISA’s features
- Conditional execution
- Multiple register load/store
- Scaled register operand
- Addressing modes

Instruction scheduling

ARM9 pipeline

Hazard/Interlock: If the required data is the unavailable result from the previous instruction, then the process stalls.
No hazard, 2 cycles

One-cycle interlock

One-cycle interlock, 4 cycles

Brach takes 3 cycles due to stalls

Scheduling of load instructions

Load occurs frequently in the compiled code, taking approximately 1/3 of all instructions. Careful scheduling of loads can avoid stalls.

2-cycle stall. Total 11 cycles for a character.
It can be avoided by preloading and unrolling.
The key is to do some work when awaiting data

Load scheduling by preloading

Preloading: loads the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop.
Since loop i is loading data for loop i+1, there is always a problem with the first and last loops. For the first loop, insert an extra load outside the loop. For the last loop, be careful not to read any data. This can be effectively done by conditional execution.

Load scheduling by unrolling

Unroll and interleave the body of the loop. For example, we can perform three loops together. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.

Register allocation

APCS requires callee to save R4~R11 and to keep the stack 8-byte aligned.

We stack R12 only for making the stack 8-byte aligned.

Unroll the loop to handle 8 words at a time and to use multiple load/store

What variables do we have?

We still need to assign carry and kr, but we have used 13 registers and only one remains.
- Work on 4 words instead
- Use stack to save least-used variable, here N
- Alter the code
We notice that carry does not need to stay in the same register. Thus, we can use yi for it.

More than 14 local variables

If you need more than 14 local variables, then you store some on the stack.
Work outwards from the inner loops since they have more performance impact.

Packing

Pack multiple (sub-32bit) variables into a single register.

When shifting by a register amount, ARM uses bits 0~7 and ignores others.
Shift an array of 40 entries by shift bits.

Simulate SIMD (single instruction multiple data).
Assume that we want to merge two images X and Y to produce Z by

Load 4 bytes at a time

Unpack it and promote to 16-bit data

Work on 176x144 images

Conditional execution

By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need of branches.
This improves efficiency since branches can take many cycles and also reduces code size.

Block copy example

void bcopy(char *to, char *from, int n)
{
  while (n--)
    *to++ = *from++;
}

@ arguments: R0: to, R1: from, R2: n
bcopy:  TEQ  R2, #0
        BEQ  end
loop:   SUB  R2, R2, #1
        LDRB R3, [R1], #1
        STRB R3, [R0], #1
        B    bcopy
end:    MOV  PC, LR

@ arguments: R0: to, R1: from, R2: n
@ rewrite “n–-” as “-–n>=0”
bcopy:  SUBS   R2, R2, #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR

@ arguments: R0: to, R1: from, R2: n
@ assume n is a multiple of 4; loop unrolling
bcopy:  SUBS   R2, R2, #4
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        LDRPLB R3, [R1], #1
        STRPLB R3, [R0], #1
        BPL    bcopy
        MOV    PC, LR

@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16; 
bcopy:  SUBS   R2, R2, #16
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        LDRPL R3, [R1], #4
        STRPL R3, [R0], #4
        BPL   bcopy
        MOV   PC, LR

@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16; 
bcopy:  SUBS   R2, R2, #16
        LDMPL  R1!, {R3-R6} 
        STMPL  R0!, {R3-R6}
        BPL    bcopy
        MOV    PC, LR

@ could be extend to copy 40 byte at a time
@ if not multiple of 40, add a copy_rest loop

Search example

int main(void)
{
  int a[10]={7,6,4,5,5,1,3,2,9,8};
  int i;
  int s=4;

  for (i=0; i<10; i++)
    if (s==a[i]) break;
  if (i>=10) return -1;
  else return i;
}

Search

           .section     .rodata
.LC0:
       .word      7
       .word      6
       .word      4
     .word      5
       .word  5
       .word      1
       .word      3
       .word      2
       .word      9
       .word      8

           .text
           .global    main
        .type    main, %function
main: sub   sp, sp, #48
      adr   r4, L9 @ =.LC0
      add   r5, sp, #8
      ldmia r4!, {r0, r1, r2, r3} 
      stmia r5!, {r0, r1, r2, r3} 
      ldmia r4!, {r0, r1, r2, r3} 
      stmia r5!, {r0, r1, r2, r3} 
      ldmia r4!, {r0, r1} 
      stmia r5!, {r0, r1}

mov  r3, #4
      str  r3, [sp, #0] @ s=4
      mov  r3, #0
      str  r3, [sp, #4] @ i=0

loop: ldr  r0, [sp, #4] @ r0=i
      cmp  r0, #10      @ i<10?
      bge  end
      ldr  r1, [sp, #0] @ r1=s
      mov  r2, #4
      mul  r3, r0, r2
      add  r3, r3, #8
      ldr  r4, [sp, r3] @ r4=a[i]

teq  r1, r4  @ test if s==a[i]
     beq  end  

     add  r0, r0, #1   @ i++
     str  r0, [sp, #4] @ update i
     b    loop  

end: str   r0, [sp, #4]
     cmp   r0, #10
     movge r0, #-1
     add   sp, sp, #48
     mov   pc, lr

Optimization

Remove unnecessary load/store
Remove loop invariant
Use addressing mode
Use conditional execution

Search (remove load/store)

Search (loop invariant/addressing mode)

Remove unnecessary load/store
Remove loop invariant
Use addressing mode
Use conditional execution
From 22 words to 13 words and execution time is greatly reduced.

杰克成

https://jackhcc.github.io/posts/Embedded-Assembly-Languages.html

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源杰克成 !

Assembly

DL专栏14-DL环境搭建指南

Deep Learning环境配置保姆教程

2021-10-22 Deep Learning

Environment

DL专栏13-Attention机制总结

各类Attention原理与实现

2021-10-20 Deep Learning

Attention

Embedded Assembly Languages

Overview

Translating languages

Virtual machines

High-level language

Assembly language

Operating system

Instruction set architecture

Microarchitecture

Digital logic

Data representation

Binary representations

Unsigned binary integers

Integer storage sizes

Large measurements

Signed integers

Two’s complement notation

Ranges of signed integers

Character

Representing Instructions

Boolean algebra

Implementation of gates

Advanced Architecture

Basic architecture

Basic microcomputer design

Clock

Instruction execution cycle

Pipeline

Multi-stage pipeline

Pipelined execution

Wasted cycles (pipelined)

Superscalar

Pipeline stages

Hazards(危害)

Data hazards

Control hazards

Branch prediction

Multitasking

Cache

SRAM vs DRAM

The CPU-Memory gap

Memory hierarchies(等级)

Memory system in practice

Reading from memory

Cache memory

Caching in a memory hierarchy

General caching concepts

Locality

Blocked matrix multiply performance

Cache-conscious programming

RISC v.s. CISC

Trade-offs of instruction sets

RISC

RISC Design Principles

CISC and RISC

Why RISC?

ISA design issues

Instruction set design

Instruction types

Operand types

Number of addresses

3-address

2-address

1-address

0-address

Addressing modes

Implied addressing

Immediate addressing

Direct addressing

Indirect addressing

Register addressing

Register indirect addressing

Displacement addressing

Stack addressing

IA32 addressing modes

Effective address calculation (IA32)

Based Addressing

Indexed Addressing

Based-Indexed Addressing

ARM Architecture