計算機結構(上)

Chap01 Introduction

What can we learn

  • How programs finally executed on hardware
  • How to write an efficient program
  • How hardware and software work well together
  • How to improve hardware performance

What really matters is the functioning of the complete system, hardware, runtime system, compiler, operating system, and application
“In networking, this is called the “End to End argument” --- H&P

Why need to take this course

You will be able to answer the following questions:
Q1: How can I write a program with good performance?code B perform better than code A?
(I don't know why)

Code A
for (i = 0; i < N; i = i+1)
	for (j = 0; j < N; j = j+1)
		{r = 0;
			for (k = 0; k < N; k = k+1)
			{
				r = r + y[i][k]*z[k][j];
			};
			x[i][j] = r;
		};
========
Code B
for (jj = 0; jj < N; jj = jj+B)
	for (kk = 0; kk < N; kk = kk+B)
		for (i = 0; i < N; i = i+1)
			for (j = jj; j < min(jj+B-1,N); j = j+1)
			{
				r = 0;
				for (k = kk; k < min(kk+B-1,N); k = k+1)
				{
					r = r + y[i][k]*z[k][j];
				}
				x[i][j] = x[i][j] + r;
			};

Q2: CPU frequency ? Performance
Ans at Chap02
Q3: Why do Nvidia GPUs get so much attention today?
Heterogeneous(異質) Computing : Integrated CPU/GPU

Post PC Era

資訊界:小魚吃大魚

Ex. Cloud computing

  • Software as a Service (SaaS)
  • Amazon and Google

Chip War: Big core or Small core

Workload changes in data centers
Past: databases, financial services
Now: web service, cloud computing (accessing mails, photos, facebooks, etc)
=> Simple tasks & large amount of parallel tasks
=> Small Core is better

Ex. ARM

  • Founded at the year of 1990, in Cambridge, England
  • Co-founded by Apple & Acron
  • Emphasize low-power since day one
  • First low-power, 32 bit mobile processor
  • sells designs(IP license) instead of chips

Inside the Processor (CPU)

-Datapath: performs operations on data
-Control: sequences datapath, memory, ...
-Cache memory
-Small fast SRAM memory for immediate access to data

Moore’s Law (1965)

  • Gordon Moore, Intel founder
  • “The density of transistors in an integrated circuit will double every year.”
  • Reality
    • The density of silicon chips doubles every 18 months

Uniprocessor Performance

uni p
VAX : 25%/year 1978 to 1986
RISC + x86: 52%/year 1986 to 2002
RISC + x86: 20%/year 2002 to present

Concept Change

  • Power expensive,** Transistors free** (Can put more on chip than can afford to turn on)
    • so hot that make low efficiency
  • law of diminishing returns(收益遞減) on more HW for ILP( Instruction Level Parallelism, 指令層級平行)
    • 愈平行,效率提升的投資投酬愈少(control overhead)
  • Memory slow,** CPU fast** (200 clock cycles to DRAM memory, 4 clocks for multiply)
  • Power Wall + ILP Wall + Memory Wall = CPU Wall
  • Uniprocessor performance 2X / 1.5 yrs -> multiple cores (2X processors per chip / 2 years)

CPU constrained by power, instruction-level parallelism, memory latency

CPU history

8086 Introduced x86 ISA 1978
80286 Virtual memory 1982
80386 32-bit processor (1985)
80486 Pipelining Floating point unit 8 KB cache (1989)
Pentium Superscalar (1993)
Pentium Pro / II / III Dynamic execution Multimedia instructions 1995-9
Pentium 4 HyperThreading Deep pipeline (2001)
Pentium D (2005) Dual core 2 Pentium 4 cores
AMD big.LITTLE 大核+小核
AMD big.LITTLE

8 Design Principle for Computer Architecture/System

  1. Design for Moore’s Law
  2. Use abstraction to simplify design
  3. Make the common case fast
  4. Performance via parallelism
  5. Performance via pipelining
  6. Performance via prediction
  7. Hierarchy of memories
  8. Dependability via redundancy(多餘使其安全)

Chap02 Performance/Power/Cost

How to measure, report, and summarize performance/power/cost?

  • Metric(度量)
  • Benchmark(基準)

Two Notions of “Performance”

  • Response Time (latency)
  • Throughput

-> We focus on response time

Metrics for Performance Evaluation

  • Program execution time(Elapsed time經過時間)
  • Total time to complete a task, including disk access, I/O, etc
  • CPU execution time(使用CPU的時間)
  • doesn't count I/O or time spent running other programs
  • Our Focus: user CPU time(without system CPU time)
  • time spent executing the lines of code that are "in" program

performance(x) = 1 / execution_time(x)

For Embedded System

  • Hard real time
    • A fixed bound on the time to respond to or process an event
  • Soft real time(寬鬆)
    • An average response or a response within a limited time

CPU performance

CPU time
= CPU clockcycles x Clockcycle time
= CPU clockcycles / clockrate
(Because Clock Cycles = Instruction Count x Cycles per Instruction )
= Instruction Count x Cycles per Instruction x Clockcycle time

Instruction Count of program is determined by program, ISA and compiler
cycles per instruction is determined by CPU hardware

different instruction may have different CPI(Clocks Per Instruction)
-> use weighted average CPI(加權平均)

Aspects of CPU Performance

Chap01 Q2: CPU frequency(時脈) is not fully related to performance

MIPS

  • Millions of Instructions Per Second
  • = Clock Rate / (CPI * 10^6)
  • Can't be performance metric(CPI varies between programs)

CMOS IC power performance


$V ∝ Frequency$ => $power ∝ V^3$
high ascending rate

dynamic energy: energy that is consumed when transistors switch states from 0 to 1 and vice versa. It depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load  x Voltage^2

Why is Multi-Core Good for Energy-Efficiency?

only use 1/4 power

Requires explicitly parallel programming. Hard to do because

  • Programming for performance
  • Load balancing
  • Optimizing communication and synchronization

Manufacturing ICs

Yield(良率): proportion of working dies(晶片) per wafer
Nonlinear relation to area and defect rate
Yield

Benchmark

SPEC benchmark

  • Standard Performance Evaluation Corp. (SPEC)
    • Develops benchmarks for CPU, I/O, Web, …
    • CPU Benchmark
      • Programs to simulate actual workload
      • focuses on CPU performance(a little I/O)
      • Test both integer and floating point applications
        CINT (integer) and CFP (floating-point)

SPECRatio
Normalize execution times to reference computer
performance = time on standard computer/ time on computer being rated

Since ratios, geometric mean is proper

Power Benchmark

  • Power consumption of server at different workload levels(不同的工作量下,平均的工作效能)
  • Report power consumption of servers at different workload levels, divided into 10% increments
  • Performance: ssj_ops/sec(每秒完成的事務數)
  • Power: Watts (Joules/sec)

Amdahl’s Law
程式的平行化只能將執行時間縮短到一定程度(因為分散運算的overhead)

Low Power at Idle(x)

  • At 100% load: 295W
  • At 50% load: 246W (83%)
  • At 10% load: 180W (61%)
  • not porportional!

=> In general, the more load, the better power consumption rate
=> Consider designing processors to make power proportional to load

Chap03 Instruction Set Architecture(ISA)

Definition: Instruction set provides an layer of abstraction to programmers

ISA Design Principle

makes it easy to build the hardware and the compiler while maximizing performance and minimizing cost.

Interface Design
(portability, compatibility, generality)
(convenient at high level , efficient in low level)

Design Principles

  • Design Principle 1: Simplicity favors regularity(All operation takes three operands)
  • Design Principle 2: Smaller is faster(MIPS only 32 register)
  • Design Principle 3: Make the common thing fast(addi)
  • Design Principle 4: Good design demands good compromises(妥協)(three format of instruction code)

MIPS Instruction Set

Stanford MIPS be Memory are referenced with byte addresses in MIPScommercialized
a kind of RISC精簡指令集(Reduced Instruction Set Computing)

  • 31個 32-bit integer registers (R0 = const 0)
  • 32個 32-bit floating-point registers
  • 32-bit HI, LO, PC (program counter) -- internal, can't modify
    Memory are referenced with byte addresses in MIPS
    1 word = 4 byte
指令 意義 備註
add Ra Rb Rc Ra =Rb + Rc
load Ra A Ra =mem[A]
store Ra A mem[A] =Ra Memory are referenced with byte addresses in MIPS
Data transfer instructions
lw $t0, 8 ($s3) t0 = mem[8+reg[$s3]] lw dest offset base, offset must be constant
Ex.
sw $t0, 8 ($s3) mem[8+reg[$s3]]= $t0
addi $s3, $s3, 4 add constant value
srl $10, $16, 4 $t2 = $s0 >> 4 bits
sll $10, $16, 4 $t2 = $s0 << 4 bits
rs fill 0 because unused
or $t0, $t1, $t2 $t0 = $t1 | $t2
ori $6, $6, 0x00ff
and $t0, $t1, $t2 $t0 = $t1 & $t2
andi $6, $6, 0x0000
nor $t0, $t1, $t3 t0 = ~($t1 $ t3)
(nor $t0, $t1, $t3) = ~($t1), if $t3 =
~(A) = 1 if A = 0
~(A) = 0 if A != 0
lui $t0, 1010101010101010 # load upper immediate
// that is: 1010101010101010 0000000000000000

mult rs, rt
Multiplying two 32-bit numbers can yield a 64-bit number.
higher 32 bits are stored in HI
lower 32 bits are stored in LO
MFHI rd -- $rd = HI
MFLO rd -- $rd = LO

div rs, rt | rs / rt
Quotient stored in Lo
Remainder stored in Hi

Branch instructions
Conditional branches
beq reg1, reg2, L1
Go to statement L1 if [reg1] == [reg2]
bne reg1, reg2, L2
Go to statement L2 if [reg1] != [reg2]
Unconditional branches
j L1

jr(jump register)
unconditional jump to the address stored in a register
Provides full 32bits address

Use J for calling subroutines
Use Jal for calling functions
Use Jr for ending a subroutine by jumping to the return address (ra)

slt reg1, reg2, reg3
=> if reg1 = (reg2 < reg3) ? 1 : 0;

move copies a value from one register to another

Byte order: Big Endian vs. Little Endian
Big endian(順序, 一般數字寫法): byte 0 is most significant bits e.g., IBM/360/370, Motorola 68K, MIPS, Sparc, HP PA
Little endian(逆序): byte 0 is least significant bits e.g., Intel 80x86, DEC Vax, DEC Alpha

Alignment
object on the address that is multiple of their size(一次讀取大小的倍數)

Stored-Program Concept
Computers built on 2 key principles:

  1. Instructions are represented as numbers
  2. entire programs can be stored in memory to be read or written just like numbers

Shift is faster than multiplication
=> Multiplying by 4 is the same as shifting left by 2:

•Fetch instruction from mem [PC]
•without decision making instruction
•next instruction = mem [PC + instruction_size]

MIPS Instruction Format

instruction code - 32bit

R-type

  • rs, rt = source register, rd = destination register
    • 32個register => 5bit
  • shamt: shift amount
    • filling 0 when unused
  • func: function field(same operate code with different function code can looked as different argument)
    • Ex. add,addu, addi

I-type

for those have constant argument(immediate value)

first example, rs = $2(source) rt = $1(destination)

PC addressing mode
New PC = PC + 4(auto go to next instruction) + Immediate x 4

if want to jump farther than 16bit

bne s1 s2 L2 
jump L1 # equals to beq s1 s2 L1, but jump can go farther
L2:

J-type

j type

goto absolute address(I-type is relative address)
equal to immediate in the middle of the program counter address

指令表



function

pass value

  • $a0 ~ $a3 : 4 arguments

    • if # of parameters is larger than 4 – store to the stack
  • $v0 ~ $v1 : 2 return values
    preserve register values of caller

  • stack

Caller-save register

  • Caller saved register: 由Caller負責清理或存入stack frame
    • hold temporary quantities that need not be preserved across calls
    • Caller先備份,所以callee便可直接使用caller-save register
    • $t0~t7
  • Callee saved register: 由Callee負責清理或存入stack
    • hold values that should be preserved across calls(caller還需要用, callee用的時候要先備份)
    • callee should save them and restore them before returning to the caller
    • $s0~s7

call function: jal procedure_address (jump and link)

  • Store the return address (PC + 4) at $ra
  • set PC = procedure_address
    return : jr $ra

Frame pointer points to the first word of the procedure frame

procedure call stack

memory layout
Memory layout

  • text: code
  • static data: global variable
    • $gp is the offset of static data
  • dynamic data: heap
    • malloc, new
  • stack: automatic storage

Caller Steps

  1. pass the argument $a0,$a1
  2. save caller-saved registers
  3. jal

Callee Steps

  1. establish stack frame $sp
    subi $sp, $sp
  2. saved callee saved registers
    Ex. $ra, $fp, $s0-$s7
  3. establish frame pointer $fp
    Add $fp, $sp, -4
  4. Do Something
  5. put returned values in $v0, $v1
  6. restore(load) callee-saved registers
  7. pop the stack
  8. return: jr $ra

registers

nested procedure

ASCII (American Standard Code for Information Interchange)
8 bits/character

`lb sb`

Unicode (Universal Encoding)
16 bits/character

`lf sf`

performance

1
2
3
4
5
6
7
8
9
10
11
12
Clear1(int array[ ], int size)
{
int i;
for (i=0, i< size; i+= 1)
array[i] = 0;
}
Clear 2(int *array, int size)
{
int *p,
for (p = &array[0]; p < &array[size]; p = p+1)
*p = 0;
}

Clear2 > Clear1
pointer is faster than fetching value

1
2
3
4
5
6
7
8
move t0, $a0 			# p = &array[0]
sll t1, $a1, 2 # t1 = size x 4
add t2, $a0, $t1 # t2 = &array[size]
Loop2:
sw zero, 0($t0) # memory[p] = 0
addi t0, $t0, 4 # p= p+4
slt t3, $t0, $t2 # compare p & array[size]
bne t3, $zero, Loop2

Synchronization

SWAP: atomically interchange a value in a register for a value in memory; nothing else can interpose itself between the read and the write to the memory location

1
2
3
4
S4 = 1;
Swap ($S4, Lock);
If Lock = 0 # change success
enter critical section;

Load linked: load value from rs(atomic)

`ll rt offset(rs)`

Store conditional

`sc rt offset(rs)`

if location not changed since the ll , rt return 1, 否則回傳 0

1
2
3
4
5
6
swap s4 and s1
try:add t0,$zero,$s4 ;copy exchange value
ll t1,0($s1) ;load s1 to t1
sc t0,0($s1) ;store t0 to s1
beq t0,$zero,try ;if store fail, try again
add s4,$zero,$t1 ;put load value in $s4

MIPS Addresssing Mode

  1. Register addressing
    運算對象在register
    add
  2. Immediate addressing
    運算對象是constant
    addi
  3. Base addressing(The argument is at MEM)
    運算對象在memory
    lw
    way to get address
    1.
    .data # define data
    xyz:
    .word 1 # some data here

    .text # program code

    lw $5,xyz # equals to lw $5, offset($gp)
    2.
    la $6, xyz # $6 = &xyz
  4. PC-relative addressing
    beq
  5. Pseudodirect addressing
    j 100

from program to memory

Program -----------> assembly ---------> object file:
          compiler            assembler
machine language module + library module --------->  
						                   linker  
machine code program ---------> memory
                       loader

Assembler
Symbol table: translate variables into memory address
Psudoinstruction: common variation of assembly language instructions
Linker
executable

generate executables file header

  1. Place code and data modules symbolically in memory.
  2. Determine the addresses of data and instruction labels.
  3. Patch both the internal and external references

Loader
determine the size of the text and data segment
Creates an address space
Copies the instructions and data to memory
Initializes the machine registers and sets the stack pointer
Jump to a start-up routine

Dynamically Linked Libraries (DLL)
Loading the whole library even if all of the library is not used => libraries are not linked and loaded until the program is run, and use lazy procedure linkage

java

java
Java bytecode: use instruction set designed to interpret Java programs
Just In Time Compiler (JIT): compiler that operates at runtime, translating bytecodes into the native code of the compiler
Java Virtual Machine (JVM): The program that interprets Java bytecodes

ARM

AMD64 (2003): extended architecture to 64 bits
AMD64 (announced 2007): SSE5 instructions
Intel declined to follow...

Similar to MIPS
arm mips
do anything for code density
include
32-bit ARM instruction
16-bit Thumb instruction
8-bit Java Instruction Set

Current program status register: Top four bits of CPSR
use for performance
N : Negative
Z : Zero
C : Carry
V : Overflow

Conditional execution
ADDEQ r0,r1,r2 => r1-r2, If zero flag set then do r0 = r1 + r2

Intel x86 ISA

History: skipped

| IA-32 | MIPS |
| -------------------------- | --------------- | ----------------- |
| RISC | RISC |
| general purpose register | 8 | 32 |
| operand operation | 2 or 3 | 3 |
| operations to be performed | Register-memory | register-register |
| addressing modes | more | less |
| encoding | variable-length | fixed-length |

simple instruction -> higher performance
because complex instructions are hard to implement, slow down all instructions

assembly code for high performance(x)
More lines of code -> more errors and less productivity

Chap04 Building Single-Cycle Datapath and Control Unit

How to Design a Processor

  1. Analyze instruction set => datapath(include PC,... functional units) requirements
  2. Select set of datapath components and establish clocking methodology
  3. Assemble datapath
  4. Analyze implementation of each instruction to determine control points
  5. Assemble the control logic

Details

register-tranfer暫存器傳輸
RTL (Register Transfer Languages)
Ex. ADDUR[rd] <–R[rs] + R[rt], PC <–PC + 4
2.
Combinational Elements: outputs only depend on input
Ex. ALU, MUX, Adder
Storage Elements: outputs depend on input and state(clock)
Ex. Flip-Flop, register, memory
Register
register graph
Edge-triggered clocking
until next edge would the value change
3.
Instruction Fetch Unit: mem[PC]
Sequential code
PC <- PC + 4
Branch and Jump
PC <- Target addr
Memory Operations
Mem[R[rs] + SignExt[imm16]]
sign extension: increasing the number of bits without changing value (Ex. 10 0010 -> 0010 0010)
4.
Control Unit: set control flag to change the operation with the change of operation code
Control Flags

  • MemWr: write memory
  • MemtoReg: 0 => use ALU output 1 => use Mem value
  • RegDst: 1 => “rd” when 3 operand are all register; 0 => “rt”
  • RegWr: write to register
  • ALUsrc: 1=> immed; 0=>regB
  • ALUctr: “add”, “sub”
  • PCSrc: 1=> PC = PC + 4; 0=> PC = branch target address
instruction MemWr MemtoReg RegDst RegWr ALUsrc ALUctr PCSrc
Add 0 0 1 1 0 add 1
Load 0 1 0 1 1 add 1
Store 1 x x 0 1 add 1
Branch 0 x x 0 0 sub value(after sub)
Ori 0 0 0 1 1 or 1
Jump ?

use k-map to make control flags simple
ALU control
I-type: (00) add for load/store and (01) sub for beqand (10) for ori
R-type: (11, need to check funct field)

alu ctr truth table
truth table for main

functional unit used by instructions

Chap05 Pipeline

when CPI=1:

  • Load is too slow! -> Long Cycle Time
  • Real memory is not so nice as our idealized memory
    • cannot always get the job done in one (short) cycle

Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Multiple tasks operating simultaneously using different resources
Pipeline rate limited by slowest pipeline stage
Unbalanced lengths of pipe stages reduces speedup(切平均,增加效能最多)

Break the instruction into smaller steps(5 steps in MIPS):
Execute each step in 1 clock cycle

  • Instruction Fetch(Ifetch)
  • Instruction Decode and Register Fetch(Reg/Dec)
  • Execution, Memory Address Computation, or Branch Completion(Exec)
  • Memory Access or R-type instruction completion(Mem)
  • Write back to register(Wr)

simplified: IF ID EX MEM WB

The Four Stages of R-type: without Mem
The Four Stages of store: without Wr
The Three Stages of Beq: without Mem, Wr

pipe 2ns
Ideal speedup from pipelining == # of pipeline stages
(分成N段->理想上,速度提升N倍)

Pipeline Hazards

  • structural hazards: use the same resource at the same time(Single Memory)
    structure hazard
  • data hazards: use item before it is ready(load, update)
  • control hazards: make a decision before condition is evaluated(branch)
    Can always resolve hazards by waiting
    Need to detect and resolve hazards

Way to Solve Hazards

  1. structural
    1. Stall(put bubbles that do nothing)
      stall
    2. Split instruction and data memory(2 different memories)
    3. Delay R-type’s Write by One Cycle()
  2. data
    1. “Forward” result from one stage to another(算出答案後,在放入記憶體前,就先給之後的instruction使用)(可以同時read and save)
    2. Load-use data hazard: can't solved by forward, can solved by reordering instructions, otherwise, need stall lw hazard loaduse->forward
  3. control
    1. stall until the outcome of the branch is known
    2. solve branch earlier: put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC
    3. Predict: flush if the assumption is wrong
      dynamic scheme: history of branch (90% success)
    4. performance: 13% of branch instructions executed in SPECint2000, the CPI is slowdown of 1.13 versus the ideal case

Pipelined datapath

Pipeline registers
4bar

  • The registers are named between two stages

control in each stage

Pass control signals along just like the data
control