Overview
Translating languages
Virtual machines
Abstractions for computers
High-level language
Level 5
Application-oriented languages
Programs compile into assembly language (Level 4)
Assembly language
Level 4
Instruction mnemonics that have a one-to-one correspondence to machine language
Calls functions written at the operating system level (Level 3)
Programs are translated into machine language (Level 2)
Operating system
Level 3
Provides services
Programs translated and run at the instruction set architecture level (Level 2)
Instruction set architecture
Level 2
Also known as conventional machine language
Executed by Level 1 program (microarchitecture, Level 1)
Microarchitecture
Level 1
Interprets conventional machine instructions (Level 2)
Executed by digital hardware (Level 0)
Digital logic
Level 0
CPU, constructed from digital logic gates
System bus
Memory
Data representation
Computer is a construction of digital circuits with two states: on and off
You need to have the ability to translate between different representations to examine the content of the machine
Common number systems: binary, octal, decimal and hexadecimal
Binary representations
- Electronic Implementation
- Easy to store with bistable elements
- Reliably transmitted on noisy and inaccurate wires
Digits are 1 and 0 (a binary digit is called a bit)
1 = true
0 = false
MSB –most significant bit
LSB –least significant bit
Bit numbering:
A bit string could have different interpretations
Unsigned binary integers
Each digit (bit) is either 1 or 0
Each bit represents a power of 2:
Integer storage sizes
Large measurements
•Kilobyte (KB), 210 bytes
•Megabyte (MB), 220 bytes
•Gigabyte (GB), 230 bytes
•Terabyte (TB), 240 bytes
•Petabyte
•Exabyte
•Zettabyte
•Yottabyte
Signed integers
The highest bit indicates the sign. 1 = negative, 0 = positive
If the highest digit of a hexadecmal integer is > 7, the value is negative. Examples: 8A, C5, A2, 9D
Two’s complement notation
Steps:
Complement (reverse) each bit
Add 1
Ranges of signed integers
Character
Character sets
- Standard ASCII (0 – 127)
- Extended ASCII (0 – 255)
- ANSI (0 – 255)
- Unicode (0 – 65,535)
Null-terminated String
- Array of characters followed by a null byte
Using the ASCII table
- back inside cover of book
Representing Instructions
int sum(int x, int y)
{
return x+y;
}
For this example, Alpha & Sun use two 4-byte instructions
- Use differing numbers of instructions in other cases
PC uses 7 instructions with lengths 1, 2, and 3 bytes
- Same for NT and for Linux
- NT / Linux not fully binary compatible
Boolean algebra
- Boolean expressions created from:
- NOT, AND, OR
Implementation of gates
Advanced Architecture
Basic architecture
Basic microcomputer design
clock synchronizes CPU operations
control unit (CU) coordinates sequence of execution steps
ALU performs arithmetic and logic operations
The memory storage unit holds instructions and data for a running program
A bus is a group of wires that transfer data from one part to another (data, address, control)
Clock
synchronizes all CPU and BUS operations
machine (clock) cycle measures time of a single operation
clock is used to trigger events
Basic unit of time, 1GHz→clock cycle=1ns
An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles
Instruction execution cycle
Fetch
Decode
Fetch operands
Execute
Store output
Pipeline
Multi-stage pipeline
Pipelining makes it possible for processor to execute instructions in parallel
Instruction execution divided into discrete stages
Example of a non-pipelined processor. For example, 80386. Many wasted cycles.
Pipelined execution
More efficient use of cycles, greater throughput of instructions: (80486 started to use pipelining)
For k stages and n instructions, the number of required cycles is: k + (n – 1) compared to k*n
- Pipelining requires buffers
- Each buffer holds a single value
- Ideal scenario: equal work for each stage
- Sometimes it is not possible
- Slowest stage determines the flow rate in the entire pipeline
Some reasons for unequal work stages
- A complex step cannot be subdivided conveniently
- An operation takes variable amount of time to execute, e.g. operand fetch time depends on where the operands are located
- Registers
- Cache
- Memory
- Complexity of operation depends on the type of operation
- Add: may take one cycle
- Multiply: may take several cycles
Operand fetch of I2 takes three cycles
- Pipeline stalls for two cycles
- Caused by hazards
- Pipeline stalls reduce overall throughput
- Pipeline stalls for two cycles
Wasted cycles (pipelined)
When one of the stages requires two or more clock cycles, clock cycles are again wasted.
For k stages and n instructions, the number of required cycles is: k + (2n – 1)
Superscalar
A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).
For k states and n instructions, the number of required cycles is: k + n
Pentium: 2 pipelines
Pentium Pro: 3
Pipeline stages
Pentium 3: 10
Pentium 4: 20~31
Next-generation micro-architecture: 14
ARM7: 3
Hazards(危害)
- Three types of hazards
- Resource hazards
- Occurs when two or more instructions use the same resource, also called structural hazards
- Data hazards
- Caused by data dependencies between instructions, e.g. result produced by I1 is read by I2
- Control hazards
- Default: sequential execution suits pipelining
- Altering control flow (e.g., branching) causes problems, introducing control dependencies
- Resource hazards
Data hazards
add r1, r2, #10 ; write r1
sub r3, r1, #20 ; read r1
- Forwarding: provides output result as soon as possible
Control hazards
bz r1, target
add r2, r4, 0
...
target: add r2, r3, 0
Braches alter control flow
- Require special attention in pipelining
- Need to throw away some instructions in the pipeline
- Depends on when we know the branch is taken
- Pipeline wastes three clock cycles
- Called branch penalty
- Reducing branch penalty
- Determine branch decision early
Delayed branch execution
- Effectively reduces the branch penalty
- We always fetch the instruction following the branch
- Why throw it away?
- Place a useful instruction to execute
- This is called delay slot
Branch prediction
Three prediction strategies
Fixed
Prediction is fixed
Example: branch-never-taken
»Not proper for loop structures
Static
- Strategy depends on the branch type
- Conditional branch: always not taken
- Loop: always taken
- Strategy depends on the branch type
Dynamic
- Takes run-time history to make more accurate predictions
Static prediction
- Improves prediction accuracy over Fixed
Dynamic branch prediction
- Uses runtime history
- Takes the past n branch executions of the branch type and makes the prediction
- Simple strategy
- Prediction of the next branch is the majority of the previous n branch executions
- Example: n = 3
- If two or more of the last three branches were taken, the prediction is “branch taken”
- Depending on the type of mix, we get more than 90% prediction accuracy
- Uses runtime history
Impact of past n branches on prediction accuracy
Multitasking
OS can run multiple programs at the same time.
Multiple threads of execution within the same program.
Scheduler utility assigns a given amount of CPU time to each running program.
Rapid switching of tasks
- gives illusion that all programs are running at once
- the processor must support task switching
- scheduling policy, round-robin, priority
Cache
SRAM vs DRAM
The CPU-Memory gap
The gap widens between DRAM, disk, and CPU speeds.
Memory hierarchies(等级)
Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
- The gap between CPU and main memory speed is widening.
- Well-written programs tend to exhibit good locality.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
Memory system in practice
Reading from memory
- Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g.33 MHz). The wasted clock cycles are called wait states.
Cache memory
High-speed expensive static RAM both inside and outside the CPU.
- Level-1 cache: inside the CPU
- Level-2 cache: outside the CPU
Cache hit: when data to be read is already in cache memory
Cache miss: when data to be read is not in cache memory. When? compulsory, capacity and conflict.
Cache design: cache size, n-way, block size, replacement policy
Caching in a memory hierarchy
General caching concepts
- Program needs object d, which is stored in some block b.
- Cache hit
- Program finds b in the cache at level k. E.g., block 14.
- Cache miss
- b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12.
- If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
- Placement policy: where can the new block go? E.g., b mod 4
- Replacement policy: which block should be evicted? E.g., LRU
- Cache hit
Locality
Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
- Temporal locality(时域): recently referenced items are likely to be referenced in the near future.
- Spatial locality(空间域): items with nearby addresses tend to be referenced close together in time.
In general, programs with good locality run faster then programs with poor locality
Locality is the reason why cache and virtual memory are designed in architecture and operating system. Another example is web browser caches recently visited webpages.
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
- Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?
int sum_array_rows(int a[M][N])
{
int i, j, sum = 0;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
sum += a[i][j];
return sum;
}
stride-1 reference pattern
- Does this function have good locality?
int sum_array_cols(int a[M][N])
{
int i, j, sum = 0;
for (j = 0; j < N; j++)
for (i = 0; i < M; i++)
sum += a[i][j];
return sum;
}
stride-N reference pattern
Blocked matrix multiply performance
- Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
- relatively insensitive to array size.
Cache-conscious programming
- make sure that memory is cache-aligned
- Split data into hot and cold (list example)
- Use union and bitfields to reduce size and increase locality
RISC v.s. CISC
Trade-offs of instruction sets
Before 1980, the trend is to increase instruction complexity (one-to-one mapping if possible) to bridge the gap. Reduce fetch from memory. Selling point: number of instructions, addressing modes. (CISC)
1980, RISC. Simplify and regularize instructions to introduce advanced architecture for better performance, pipeline, cache, superscalar.
RISC
1980, Patternson and Ditzel (Berkeley),RISC
Features
- Fixed-length instructions
- Load-store architecture
- Register file
Organization
- Hard-wired logic
- Single-cycle instruction
- Pipeline
Pros: small die size, short development time, high performance
Cons: low code density, not x86 compatible
RISC Design Principles
Simple operations
- Simple instructions that can execute in one cycle
Register-to-register operations
- Only load and store operations access memory
- Rest of the operations on a register-to-register basis
Simple addressing modes
- A few addressing modes (1 or 2)
Large number of registers
- Needed to support register-to-register operations
- Minimize the procedure call and return overhead
Fixed-length instructions
- Facilitates efficient instruction execution
Simple instruction format
- Fixed boundaries for various fields
- opcode, source operands,…
- Fixed boundaries for various fields
CISC and RISC
CISC – complex instruction set
- large instruction set
- high-level operations (simpler for compiler?)
- requires microcode interpreter (could take a long time)
- examples: Intel 80x86 family
RISC – reduced instruction set
- small instruction set
- simple, atomic instructions
- directly executed by hardware very quickly
- easier to incorporate advanced architecture design
- examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq), PowerPC, MIPS
CISC (Intel 486) | RISC (MIPS R4000) | |
---|---|---|
#instructions | 235 | 94 |
Addr. modes | 11 | 1 |
Inst. Size (bytes) | 1-12 | 4 |
GP registers | 8 | 32 |
Why RISC?
Simple instructions are preferred
- Complex instructions are mostly ignored by compilers
- Due to semantic gap
- Complex instructions are mostly ignored by compilers
Simple data structures
- Complex data structures are used relatively infrequently
- Better to support a few simple data types efficiently
- Synthesize complex ones
Simple addressing modes
- Complex addressing modes lead to variable length instructions
- Lead to inefficient instruction decoding and scheduling
- Complex addressing modes lead to variable length instructions
Large register set
Efficient support for procedure calls and returns
Patterson and Sequin’s study
Procedure call/return: 12-15% of HLL statements
»Constitute 31-33% of machine language instructions
»Generate nearly half (45%) of memory references
Small activation record
- Tanenbaum’s study
- Only 1.25% of the calls have more than 6 arguments
- More than 93% have less than 6 local scalar variables
- Large register set can avoid memory references
- Tanenbaum’s study
ISA design issues
ISA:Instruction Set Architecture
Instruction set design
- Issues when determining ISA
- Instruction types
- Number of addresses
- Addressing modes
Instruction types
Arithmetic and logic
Data movement
I/O (memory-mapped, isolated I/O)
Flow control
- Branches (unconditional, conditional)
- set-then-jump (cmp AX, BX; je target)
- Test-and-jump (beq r1, r2, target)
- Procedure calls (register-based, stack-based)
- Pentium: ret; MIPS: jr
- Register: faster but limited number of parameters
- Stack: slower but more general
- Branches (unconditional, conditional)
Operand types
Instructions support basic data types
- Characters
- Integers
- Floating-point
Instruction overload
Same instruction for different data types
Example: Pentium
mov AL,address ;loads an 8-bit value
mov AX,address ;loads a 16-bit value
mov EAX,address ;loads a 32-bit value
Separate instructions
- Instructions specify the operand size
- Example: MIPS
lb Rdest,address ;loads a byte
lh Rdest,address ;loads a halfword
;(16 bits)
lw Rdest,address ;loads a word
;(32 bits)
ld Rdest,address ;loads a doubleword
;(64 bits)
Number of addresses
- Four categories
- 3-address machines
- two for the source operands and one for the result
- 2-address machines
- One address doubles as source and result
- 1-address machine
- Accumulator machines
- Accumulator is used for one source and result
- 0-address machines
- Stack machines
- Operands are taken from the stack
- Result goes onto the stack
- 3-address machines
Number of addresses | instruction | operation |
---|---|---|
3 | OP A, B, C | A ← B OP C |
2 | OP A, B | A ← A OP B |
1 | OP A | AC ← AC OP A |
0 | OP | T ← (T-1) OP T |
- A, B, C: memory or register locations
- AC: accumulator
- T: top of stack
- T-1: second element of stack
3-address
Example: RISC machines, TOY
SUB Y, A, B ; Y = A - B
MUL T, D, E ; T = D × E
ADD T, T, C ; T = T + C
DIV Y, Y, T ; Y = Y / T
2-address
Example: IA32
MOV Y, A ; Y = A
SUB Y, B ; Y = Y - B
MOV T, D ; T = D
MUL T, E ; T = T × E
ADD T, C ; T = T + C
DIV Y, T ; Y = Y / T
1-address
Example: IA32’s MUL (EAX)
LD D ; AC = D
MUL E ; AC = AC × E
ADD C ; AC = AC + C
ST Y ; Y = AC
LD A ; AC = A
SUB B ; AC = AC – B
DIV Y ; AC = AC / Y
ST Y ; Y = AC
0-address
Example: IA32’s FPU, HP3000
PUSH A ; A
PUSH B ; A, B
SUB ; A-B
PUSH C ; A-B, C
PUSH D ; A-B, C, D
PUSH E ; A-B, C, D, E
MUL ; A-B, C, D× E
ADD ; A-B, C+(D× E)
DIV ; (A-B) / (C+(D× E))
POP Y
A basic design decision; could be mixed
Fewer addresses per instruction results in
- a less complex processor
- shorter instructions
- longer and more complex programs
- longer execution time
The decision has impacts on register usage policy as well
- 3-address usually means more general-purpose registers
- 1-address usually means less
Addressing modes
How to specify location of operands? Trade-off for address range, address flexibility, number of memory references, calculation of addresses
Operands can be in three places
- Registers
- Register addressing mode
- Part of instruction
- Constant
- Immediate addressing mode
- All processors support these two addressing modes
- Memory
- Difference between RISC and CISC
- CISC supports a large variety of addressing modes
- RISC follows load/store architecture
- Registers
- Common addressing modes
- Implied
- Immediate (lda R1, 1)
- Direct (st R1, A)
- Indirect
- Register (add R1, R2, R3)
- Register indirect (sti R1, R2)
- Displacement
- Stack
Implied addressing
No address field; operand is implied by the instruction
CLC ; clear carry
A fixed and unvarying address
Immediate addressing
Address field contains the operand value
ADD 5; AC=AC+5
Pros: no extra memory reference; faster
Cons: limited range
Direct addressing
Address field contains the effective address of the operand
ADD A; AC=AC+[A]
single memory reference
Pros: no additional address calculation
Cons: limited address space
Indirect addressing
Address field contains the address of a pointer to the operand
ADD [A]; AC=AC+[[A]]
multiple memory references
Pros: large address space
Cons: slower
Register addressing
Address field contains the address of a register
ADD R; AC=AC+R
Pros: only need a small address field; shorter instruction and faster fetch; no memory reference
Cons: limited address space
Register indirect addressing
Address field contains the address of the register containing a pointer to the operand
ADD [R]; AC=AC+[R]
Pros: large address space
Cons: extra memory reference
Displacement addressing
Address field could contain a register address and an address
MOV EAX, [A+ESI*4]
EA=A+[R×S] or vice versa
Several variants
- Base-offset: [EBP+8]
- Base-index: [EBX+ESI]
- Scaled: [T+ESI*4]
Pros: flexible
Cons: complex
MOV EAX, [A+ESI*4]
Often, register, called indexing register, is used for displacement.
Usually, a mechanism is provided to efficiently increase the indexing register.
Stack addressing
Operand is on top of the stack
ADD [R]; AC=AC+[R]
Pros: large address space
Pros: short and fast fetch
Cons: limited by FILO order
Mode | Meaning | Pros | Cons |
---|---|---|---|
Implied | Fast fetch | Limited instructions | |
Immediate | Operand=A | No memory ref | Limited operand |
Direct | EA=A | Simple | Limited address space |
Indirect | EA=[A] | Large address space | Multiple memory ref |
Register | EA=R | No memory ref | Limited address space |
Register indirect | EA=[R] | Large address space | Extra memory ref |
Displacement | EA=A+[R] | Flexibility | Complexity |
stack | EA=stack top | No memory ref | Limited applicability |
IA32 addressing modes
Effective address calculation (IA32)
Based Addressing
Effective address is computed as base + signed displacement
- Displacement:
- 16-bit addresses: 8- or 16-bit number
- 32-bit addresses: 8- or 32-bit number
- Displacement:
Useful to access fields of a structure or record
- Base register -> points to the base address of the structure
- Displacement -> relative offset within the structure
Useful to access arrays whose element size is not 2, 4, or 8 bytes
- Displacement -> points to the beginning of the array
- Base register -> relative offset of an element within the array
Indexed Addressing
Effective address is computed as (index * scale factor) + signed displacement
- 16-bit addresses:
- displacement: 8- or 16-bit number
- scale factor: none (i.e., 1)
- 32-bit addresses:
- displacement: 8- or 32-bit number
- scale factor: 2, 4, or 8
- 16-bit addresses:
Useful to access elements of an array (particularly if the element size is 2, 4, or 8 bytes)
- Displacement -> points to the beginning of the array
- Index register -> selects an element of the array (array index)
- Scaling factor -> size of the array element
Based-Indexed Addressing
Based-indexed addressing with no scale factor
Effective address is computed as base + index + signed displacement
Useful in accessing two-dimensional arrays
- Displacement -> points to the beginning of the array
- Base and index registers point to a row and an element within that row
Useful in accessing arrays of records
- Displacement -> represents the offset of a field in a record
- Base and index registers hold a pointer to the base of the array and the offset of an element relative to the base of the array
Useful in accessing arrays passed on to a procedure
- Base register -> points to the beginning of the array
- Index register -> represents the offset of an element relative to the base of the array
Example
Assuming BX points to table1
**mov AX,[BX+SI]**
**cmp** **AX,[BX+SI+2]**
compares two successive elements of table1
Based-indexed addressing with scale factor
Effective address is computed as base + (index * scale factor) + signed displacement
Useful in accessing two-dimensional arrays when the element size is 2, 4, or 8 bytes
- Displacement ==> points to the beginning of the array
- Base register ==> holds offset to a row (relative to start of array)
- Index register ==> selects an element of the row
- Scaling factor ==> size of the array element
ARM Architecture
ARM history
1983 developed by Acorn computers
- To replace 6502 in BBC computers
- 4-man VLSI design team
- Its simplicity comes from the inexperience team
- Match the needs for generalized SoC for reasonable power, performance and die size
- The first commercial RISC implemenation
1990 ARM (Advanced RISC Machine), owned by Acorn, Apple and VLSI
Design and license ARM core design but not fabricate
Why ARM?
One of the most licensed and thus widespread processor cores in the world
- Used in PDA, cell phones, multimedia players, handheld game console, digital TV and cameras
- ARM7: GBA, iPod
- ARM9: NDS, PSP, Sony Ericsson, BenQ
- ARM11: Apple iPhone, Nokia N93, N800
- 90% of 32-bit embedded RISC processors till 2009
Used especially in portable devices due to its low power consumption and reasonable performance
ARM processors
A simple but powerful design
A whole family of designs sharing similar design principles and a common instruction set
Naming ARM
- ARMxyzTDMIEJFS
- x: series
- y: MMU
- z: cache
- T: Thumb
- D: debugger
- M: Multiplier
- I: EmbeddedICE (built-in debugger hardware)
- E: Enhanced instruction
- J: Jazelle (JVM)
- F: Floating-point
- S: Synthesizible version (source code version for EDA tools)
Popular ARM architectures
ARM7TDMI
- 3 pipeline stages (fetch/decode/execute)
- High code density/low power consumption
- One of the most used ARM-version (for low-end systems)
- All ARM cores after ARM7TDMI include TDMI even if they do not include TDMI in their labels
ARM9TDMI
- Compatible with ARM7
- 5 stages (fetch/decode/execute/memory/write)
- Separate instruction and data cache
ARM11
ARM is a RISC
RISC: simple but powerful instructions that execute within a single cycle at high clock speed.
Four major design rules:
- Instructions: reduced set/single cycle/fixed length
- Pipeline: decode in one stage/no need for microcode
- Registers: a large set of general-purpose registers
- Load/store architecture: data processing instructions apply to registers only; load/store to transfer data from memory
Results in simple design and fast clock rate
The distinction blurs because CISC implements RISC concepts
ARM design philosophy(哲学)
- Small processor for lower power consumption (for embedded system)
- High code density for limited memory and physical size restrictions
- The ability to use slow and low-cost memory
- Reduced die size for reducing manufacture cost and accommodating more peripherals
ARM features
- Different from pure RISC in several ways:
- Variable cycle execution for certain instructions: multiple-register load/store (faster/higher code density)
- Inline barrel shifter leading to more complex instructions: improves performance and code density
- Thumb 16-bit instruction set: 30% code density improvement
- Conditional execution: improve performance and code density by reducing branch
- Enhanced instructions: DSP instructions
- Load/store architecture
- A large array of uniform registers
- Fixed-length 32-bit instructions
- 3-address instructions
Registers
- Only 16 registers are visible to a specific mode. A mode could access
- A particular set of r0-r12
- r13 (sp, stack pointer)
- r14 (lr, link register)
- r15 (pc, program counter)
- Current program status register (cpsr)
- The uses of r0-r13 are orthogonal
General-purpose registers
- 6 data types (signed/unsigned)
- All ARM operations are 32-bit. Shorter data types are only supported by data transfer operations.
Program counter
- Store the address of the instruction to be executed
- All instructions are 32-bit wide and word-aligned
- Thus, the last two bits of pc are undefined.
Program status register (CPSR)
Processor modes
Register organization
Instruction sets
- ARM/Thumb/Jazelle
Pipeline
- Execution of a branch or direct modification of pc causes ARM core to flush its pipeline
- ARM10 starts to use branch prediction
- An instruction in the execution stage will complete even though an interrupt has been raised. Other instructions in the pipeline are abondond.
Interrupts
ARM Instruction Set
ARM programmer model
- The state of an ARM system is determined by the content of visible registers and memory.
- A user-mode program can see 15 32-bit general-purpose registers (R0-R14), program counter (PC) and CPSR.
- Instruction set defines the operations that can change the state.
Memory system
Byte ordering
ARM programmer model
Instruction set
Features of ARM instruction set
Load-store architecture
3-address instructions
Conditional execution of every instruction
Possible to load/store multiple registers at once
Possible to combine shift and ALU operations in a single instruction
Data processing
They are move, arithmetic, logical, comparison and multiply instructions.
Most data processing instructions can process one of their operands using the barrel shifter.
General rules:
- All operands are 32-bit, coming from registers or literals.
- The result, if any, is 32-bit and placed in a register (with the exception for long multiply which produces a 64-bit result)
- 3-address format
MOV<cc><S> Rd, <operands>
MOVCS R0, R1 @ if carry is set
@ then R0:=R1
MOVS R0, #0 @ R0:=0
@ Z=1, N=0
@ C, V unaffected
Conditional execution
- Almost all ARM instructions have a condition field which allows it to be executed conditionally.
movcs R0, R1
Register movement
Addressing modes
Register operands
ADD R0, R1, R2
Immediate operands
Shifted register operands
- One operand to ALU is routed through the Barrel shifter. Thus, the operand can be modified before it is used. Useful for fast multipliation and dealing with lists, table and other complex data structure. (similar to the displacement addressing mode in CISC.)
- Some instructions (e.g. MUL, CLZ, QADD) do not read barrel shifter.
- It is possible to use a register to specify the number of bits to be shifted; only the bottom 8 bits of the register are significant.
Multiplication
MOV R1, #35
MUL R2, R0, R1
or
ADD R0, R0, R0, LSL #2 @ R0’=5xR0
RSB R2, R0, R0, LSL #3 @ R2 =7xR0’
Shifted register operands
Encoding data processing instructions
Arithmetic
- Add and subtraction
Setting the condition codes
- Any data processing instruction can set the condition codes if the programmers wish it to
64-bit addition
ADDS R2, R2, R0
ADC R3, R3, R1
Logical
Comparison
- These instructions do not generate a result, but set condition code bits (N, Z, C, V) in CPSR. Often, a branch operation follows to change the program flow.
Multiplication
MUL R0, R1, R2 @ R0 = (R1xR2)[31:0]
Features:
- Second operand can’t be immediate
- The result register must be different from the first operand
- Cycles depends on core type
- If S bit is set, C flag is meaningless
See the reference manual
Multiply-accumulate (2D array indexing)
MLA R4, R3, R2, R1 @ R4 = R3xR2+R1
Multiply with a constant can often be more efficiently implemented using shifted register operand
MOV R1, #35
MUL R2, R0, R1
or
ADD R0, R0, R0, LSL #2 @ R0’=5xR0
**RSB R2, R0, R0, LSL #3 @ R2 =7xR0’**
Flow control instructions
- Determine the instruction to be executed next
- Branch instruction
B label
…
label: …
- Conditional branches
MOV R0, #0
loop: …
ADD R0, R0, #1
CMP R0, #10
BNE loop
Branch conditions
Branch and link
- BL instruction saves the return address to R14 (lr)
BL sub @ call sub
CMP R1, #5 @ return to here
MOVEQ R1, #0
…
sub: … @ sub entry point
…
MOV PC, LR @ return
Conditional execution
CMP R0, #5
BEQ bypass @ if (R0!=5) {
ADD R1, R1, R0 @ R1=R1+R0-R2
SUB R1, R1, R2 @ }
bypass: …
----------------------------------------------------
CMP R0, #5
ADDNE R1, R1, R0
SUBNE R1, R1, R2 //smaller and faster
Rule of thumb: if the conditional sequence is three instructions or less, it is better to use conditional execution than a branch.
if ((R0==R1) && (R2==R3)) R4++
-----------------------------------------------------
CMP R0, R1
BNE skip
CMP R2, R3
BNE skip
ADD R4, R4, #1
skip: …
-----------------------------------------------------
CMP R0, R1
CMPEQ R2, R3
ADDEQ R4, R4, #1
Data transfer instructions
Move data between registers and memory
Three basic forms
- Single register load/store
- Multiple register load/store
- Single register swap: SWP(B), atomic instruction for semaphore
Single register load/store
No STRSB/STRSH since STRB/STRH stores both
signed/unsigned ones
- The data items can be a 8-bit byte, 16-bit half-word or 32-bit word. Addresses must be boundary aligned. (e.g. 4’s multiple for LDR/STR)
Addressing modes
Memory is addressed by a register and an offset.
LDR R0, [R1] @ mem[R1]
Three ways to specify offsets:
Immediate
LDR R0, [R1, #4] @ mem[R1+4]
Register
LDR R0, [R1, R2] @ mem[R1+R2]
Scaled register @ mem[R1+4*R2]
LDR R0, [R1, R2, LSL #2]
Pre-index addressing (LDR R0, [R1, #4]) without a writeback
Auto-indexing addressing (LDR R0, [R1, #4]!)
Pre-index with writeback
calculation before accessing with a writeback
Post-index addressing (LDR R0, [R1], #4)
calculation after accessing with a writeback
Comparisons
- Pre-indexed addressing
LDR R0, [R1, R2] @ R0=mem[R1+R2]
@ R1 unchanged
- Auto-indexing addressing
LDR R0, [R1, R2]! @ R0=mem[R1+R2]
@ R1=R1+R2
- Post-indexed addressing
LDR R0, [R1], R2 @ R0=mem[R1]
@ R1=R1+R2
Summary of addressing modes
Load an address into a register
- Note that all addressing modes are register-offseted. Can we issue LDR R0, Table? The pseudo instruction ADR loads a register with an address
table: .word 10
…
ADR R0, table
Assembler transfer pseudo instruction into a sequence of appropriate instructions
sub r0, pc, #12
Application
ADR R1, table
loop: LDR R0, [R1]
ADD R1, R1, #4
@ operations on R0
…
--------------------------------------------
ADR R1, table
loop: LDR R0, [R1], #4
@ operations on R0
…
Multiple register load/store
Transfer a block of data more efficiently.
Used for procedure entry and exit for saving and restoring workspace registers and the return address
For ARM7, 2+Nt cycles (N:#words, t:time for a word for sequential access). Increase interrupt latency since it can’t be interrupted.
registers are arranged an in increasing order; see manual
LDMIA R1, {R0, R2, R5} @ R0 = mem[R1]
@ R2 = mem[r1+4]
@ R5 = mem[r1+8]
LDM load multiple registers
STM store multiple registers
suffix meaning
IA increase after
IB increase before
DA decrease after
DB decrease before
LDM<mode> Rn, {<registers>}
IA: addr:=Rn //1
IB: addr:=Rn+4 //2
DA: addr:=Rn-#<registers>*4+4 //3
DB: addr:=Rn-#<registers>*4 //4
For each Ri in <registers>
IB: addr:=addr+4
DB: addr:=addr-4
Ri:=M[addr]
IA: addr:=addr+4
DA: addr:=addr-4
<!>: Rn:=addr
LDMIA R0, {R1,R2,R3}
or
LDMIA R0, {R1-R3}
R1: 10
R2: 20
R3: 30
R0: 0x10
LDMIA R0!, {R1,R2,R3}
R1: 10
R2: 20
R3: 30
R0: 0x01C
LDMIB R0!, {R1,R2,R3}
R1: 20
R2: 30
R3: 40
R0: 0x01C
LDMDA R0!, {R1,R2,R3}
R1: 40
R2: 50
R3: 60
R0: 0x018
LDMDB R0!, {R1,R2,R3}
R1: 30
R2: 40
R3: 50
R0: 0x018
- Copy a block of memory
- R9: address of the source
- R10: address of the destination
- R11: end address of the source
loop: LDMIA R9!, {R0-R7}
STMIA R10!, {R0-R7}
CMP R9, R11
BNE loop
- Stack (full: pointing to the last used; ascending: grow towards increasing memory addresses)
mode | POP | =LDM | PUSH | =STM |
---|---|---|---|---|
Full ascending (FA) | LDMFA | LDMDA | STMFA | STMIB |
Full descending (FD) | LDMFD | LDMIA | STMFD | STMDB |
Empty ascending (EA) | LDMEA | LDMDB | STMEA | STMIA |
Empty descending (ED) | LDMED | LDMIB | STMED | STMDA |
STMFD R13!, {R2-R9} @ used for ATPCS
… @ modify R2-R9
LDMFD R13!, {R2-R9}
Swap instruction
- Swap between memory and register. Atomic operation preventing any other instruction from reading/writing to that location until it completes
Software interrupt
- A software interrupt instruction causes a software interrupt exception, which provides a mechanism for applications to call OS routines.
Load constants
- No ARM instruction loads a 32-bit constant into a register because ARM instructions are 32-bit long. There is a pseudo code for this.
Immediate numbers
- Assemblers implement this usually with two options depending on the number you try to load.
Assume that you want to load 511 into R0
Construct in multiple instructions
mov r0, #256
add r0, #255
Load from memory; declare L511 .word 511
ldr r0, L511 ldr r0, [pc, #0]
Guideline: if you can construct it in two instructions, do it; otherwise, load it.
The assembler decides for you
ldr r0, =255 mov r0, 255
ldr r0, =511 ldr r0, [pc, #4]
PC-relative modes
Instruction set
ARM Assembly Programming
GNU compiler and binutils
GNU compiler and binutils
gcc: GNU C compiler
as: GNU assembler
ld: GNU linker
gdb: GNU project debugger
insight: a (Tcl/Tk) graphic interface to gdb
Pipeline
COFF (common object file format)
ELF (extended linker format)
Segments in the object file
- Text: code
- Data: initialized global variables
- BSS: uninitialized global variables
GAS program format
.file “test.s”
.text
.global main
.type main, %function
main:
MOV R0, #100
ADD R0, R0, R0
SWI #11
.end
ARM assembly program
Control structures
Program is to implement algorithms to solve problems. Program decomposition and flow of control are important concepts to express algorithms.
Flow of control:
- Sequence.
- Decision: if-then-else, switch
- Iteration: repeat-until, do-while, for
Decomposition: split a problem into several smaller and manageable ones and solve them independently. (subroutines/functions/procedures)
Decision
If statements
if (R1==1 || R1==5 || R1==12) R0=1;
TEQ R1, #1 ...
TEQNE R1, #5 ...
TEQNE R1, #12 ...
MOVEQ R0, #1 BNE fail
if (R1==0) zero
else if (R1>0) plus
else if (R1<0) neg
CMP R1, #0
BMI neg
BEQ zero
BPL plus
neg: ...
B exit
Zero: ...
B exit
...
R0=abs(R0)
CMP R0, #0
RSBMI R0, R0, #0
Multi-way branches
CMP R0, #`0’
BCC other @ less than ‘0’
CMP R0, #`9’
BLS digit @ between ‘0’ and ‘9’
------------------------------------
CMP R0, #`A’
BCC other
CMP R0, #`Z’
BLS letter @ between ‘A’ and ‘Z’
------------------------------------
CMP R0, #`a’
BCC other
CMP R0, #`z’
BHI other @ not between ‘a’ and ‘z’
------------------------------------
letter: ...
Switch statements
Iteration
repeat loops
while loops
GCD
int gcd (int i, int j)
{
while (i!=j)
{
if (i>j)
i -= j;
else
j -= i;
}
}
Loop: CMP R1, R2
SUBGT R1, R1, R2
SUBLT R2, R2, R1
BNE loop
for loops
Procedures
Arguments: expressions passed into a function
Parameters: values received by the function
Caller and callee
- How to pass arguments? By registers? By stack? By memory? In what order?
- Who should save R5? Caller? Callee?
- We need a protocol for these.
ARM Procedure Call Standard (APCS)
ARM Ltd. defines a set of rules for procedure entry and exit so that
- Object codes generated by different compilers can be linked together
- Procedures can be called between high-level languages and assembly
APCS defines
- Use of registers
- Use of stack
- Format of stack-based data structure
- Mechanism for argument passing
APCS register usage convention
0-4:
Used to pass the first 4 parameters
Caller-saved if necessary
4-11:
Register variables, must return unchanged
Callee-saved
9-15
- Registers for special purposes
- Could be used as temporary variables if saved properly.
Argument passing
The first four word arguments are passed through R0 to R3.
Remaining parameters are pushed into stack in the reverse order.
Procedures with less than four parameters are more effective.
Return value
One word value in R0
A value of length 2~4 words (R0-R1, R0-R2, R0-R3)
Function entry/exit
- A simple leaf function with less than four parameters has the minimal overhead. 50% of calls are to leaf functions
- Save a minimal set of temporary variables
BL leaf2
...
leaf2: STMFD sp!, {regs, lr} @ save
...
LDMFD sp!, {regs, pc} @ restore and
@ return
Accessing operands
- A procedure often accesses operands in the following ways
- An argument passed on a register: no further work
- An argument passed on the stack: use stack pointer (R13) relative addressing with an immediate offset known at compiling time
- A constant: PC-relative addressing, offset known at compiling time
- A local variable: allocate on the stack and access through stack pointer relative addressing
- A global variable: allocated in the static area and can be accessed by the static base relative (R9) addressing
main:
LDR R0, #0
...
BL func
...
func: STMFD SP!, {R4-R6, LR}
SUB SP, SP, #0xC
...
STR R0, [SP, #0] @ v1=a1
...
ADD SP, SP, #0xC
LDMFD SP!, {R4-R6, PC}
Optimizing ARM Assembly
Optimization
Compilers do perform optimization, but they have blind sites. There are some optimization tools that you can’t explicitly use by writing C, for example.
- Instruction scheduling
- Register allocation
- Conditional execution
You have to use hand-written assembly to optimize critical routines.
Use ARM9TDMI as the example, but the rules apply to all ARM cores.
Note that the codes are sometimes in armasm format, not gas.
ARM optimization
- Utilize ARM ISA’s features
- Conditional execution
- Multiple register load/store
- Scaled register operand
- Addressing modes
Instruction scheduling
- ARM9 pipeline
Hazard/Interlock: If the required data is the unavailable result from the previous instruction, then the process stalls.
No hazard, 2 cycles
- One-cycle interlock
- One-cycle interlock, 4 cycles
- Brach takes 3 cycles due to stalls
Scheduling of load instructions
- Load occurs frequently in the compiled code, taking approximately 1/3 of all instructions. Careful scheduling of loads can avoid stalls.
- 2-cycle stall. Total 11 cycles for a character.
- It can be avoided by preloading and unrolling.
- The key is to do some work when awaiting data
Load scheduling by preloading
Preloading: loads the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop.
Since loop i is loading data for loop i+1, there is always a problem with the first and last loops. For the first loop, insert an extra load outside the loop. For the last loop, be careful not to read any data. This can be effectively done by conditional execution.
Load scheduling by unrolling
- Unroll and interleave the body of the loop. For example, we can perform three loops together. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.
Register allocation
- APCS requires callee to save R4~R11 and to keep the stack 8-byte aligned.
- We stack R12 only for making the stack 8-byte aligned.
- Unroll the loop to handle 8 words at a time and to use multiple load/store
- What variables do we have?
We still need to assign carry and kr, but we have used 13 registers and only one remains.
- Work on 4 words instead
- Use stack to save least-used variable, here N
- Alter the code
We notice that carry does not need to stay in the same register. Thus, we can use yi for it.
More than 14 local variables
If you need more than 14 local variables, then you store some on the stack.
Work outwards from the inner loops since they have more performance impact.
Packing
- Pack multiple (sub-32bit) variables into a single register.
When shifting by a register amount, ARM uses bits 0~7 and ignores others.
Shift an array of 40 entries by shift bits.
Simulate SIMD (single instruction multiple data).
Assume that we want to merge two images X and Y to produce Z by
- Load 4 bytes at a time
- Unpack it and promote to 16-bit data
- Work on 176x144 images
Conditional execution
By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need of branches.
This improves efficiency since branches can take many cycles and also reduces code size.
Block copy example
void bcopy(char *to, char *from, int n)
{
while (n--)
*to++ = *from++;
}
@ arguments: R0: to, R1: from, R2: n
bcopy: TEQ R2, #0
BEQ end
loop: SUB R2, R2, #1
LDRB R3, [R1], #1
STRB R3, [R0], #1
B bcopy
end: MOV PC, LR
@ arguments: R0: to, R1: from, R2: n
@ rewrite “n–-” as “-–n>=0”
bcopy: SUBS R2, R2, #1
LDRPLB R3, [R1], #1
STRPLB R3, [R0], #1
BPL bcopy
MOV PC, LR
@ arguments: R0: to, R1: from, R2: n
@ assume n is a multiple of 4; loop unrolling
bcopy: SUBS R2, R2, #4
LDRPLB R3, [R1], #1
STRPLB R3, [R0], #1
LDRPLB R3, [R1], #1
STRPLB R3, [R0], #1
LDRPLB R3, [R1], #1
STRPLB R3, [R0], #1
LDRPLB R3, [R1], #1
STRPLB R3, [R0], #1
BPL bcopy
MOV PC, LR
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16;
bcopy: SUBS R2, R2, #16
LDRPL R3, [R1], #4
STRPL R3, [R0], #4
LDRPL R3, [R1], #4
STRPL R3, [R0], #4
LDRPL R3, [R1], #4
STRPL R3, [R0], #4
LDRPL R3, [R1], #4
STRPL R3, [R0], #4
BPL bcopy
MOV PC, LR
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16;
bcopy: SUBS R2, R2, #16
LDMPL R1!, {R3-R6}
STMPL R0!, {R3-R6}
BPL bcopy
MOV PC, LR
@ could be extend to copy 40 byte at a time
@ if not multiple of 40, add a copy_rest loop
Search example
int main(void)
{
int a[10]={7,6,4,5,5,1,3,2,9,8};
int i;
int s=4;
for (i=0; i<10; i++)
if (s==a[i]) break;
if (i>=10) return -1;
else return i;
}
Search
.section .rodata
.LC0:
.word 7
.word 6
.word 4
.word 5
.word 5
.word 1
.word 3
.word 2
.word 9
.word 8
.text
.global main
.type main, %function
main: sub sp, sp, #48
adr r4, L9 @ =.LC0
add r5, sp, #8
ldmia r4!, {r0, r1, r2, r3}
stmia r5!, {r0, r1, r2, r3}
ldmia r4!, {r0, r1, r2, r3}
stmia r5!, {r0, r1, r2, r3}
ldmia r4!, {r0, r1}
stmia r5!, {r0, r1}
mov r3, #4
str r3, [sp, #0] @ s=4
mov r3, #0
str r3, [sp, #4] @ i=0
loop: ldr r0, [sp, #4] @ r0=i
cmp r0, #10 @ i<10?
bge end
ldr r1, [sp, #0] @ r1=s
mov r2, #4
mul r3, r0, r2
add r3, r3, #8
ldr r4, [sp, r3] @ r4=a[i]
teq r1, r4 @ test if s==a[i]
beq end
add r0, r0, #1 @ i++
str r0, [sp, #4] @ update i
b loop
end: str r0, [sp, #4]
cmp r0, #10
movge r0, #-1
add sp, sp, #48
mov pc, lr
Optimization
- Remove unnecessary load/store
- Remove loop invariant
- Use addressing mode
- Use conditional execution
Search (remove load/store)
Search (loop invariant/addressing mode)
- Remove unnecessary load/store
- Remove loop invariant
- Use addressing mode
- Use conditional execution
- From 22 words to 13 words and execution time is greatly reduced.