
Introduction to the Y86-64 Processor Architecture

Hirsun

image.png

ISA (Instruction Set Architecture)

An instruction set architecture (ISA), also called an instruction set or instruction set system, is the part of computer architecture that relates to programming. It covers the basic data types, the instruction set, registers, addressing modes, the memory system, interrupts, exception handling, and external I/O. An ISA includes a set of opcodes (machine language) and the basic commands executed by a particular processor.

Different processor "families", for example Intel IA-32 and x86-64, IBM/Freescale Power, and ARM, have different instruction set architectures. [1]

An instruction set architecture is distinct from a microarchitecture (a processor design approach for implementing the instruction set). Computers with different microarchitectures can share one instruction set. For example, the Intel Pentium and the AMD Athlon implement nearly identical versions of the x86 instruction set, yet their internal designs differ fundamentally.

Wikipedia

Designing an ISA

  • Designing the processor state visible to programmers
  • Designing a set of instructions
  • Encoding the instructions

All of the above will be illustrated with the Y86-64 ISA, a much simpler x86-like ISA that is sufficient to demonstrate the key concepts.

Circuits

image.png

Digital circuits

Logic gates

Logic gates: the basic computing elements of electronic circuits

image.png
  • Logic gates are always active: if some input to a gate changes, then within a small amount of time the output changes accordingly
  • Can be represented in hardware control language (HCL)
    • For example: out = a && b; out = a || b …

Combinational circuits

image.png

Example 1

If a and b are equal, output 1; otherwise, output 0. Built with AND, OR, and NOT gates.

image.png
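A minimal sketch of the equality circuit described above, written in Python rather than HCL. The function name `bit_eq` is hypothetical; bits are modelled as the integers 0 and 1, and NOT is modelled as `1 - x`.

```python
def bit_eq(a: int, b: int) -> int:
    """Return 1 when bits a and b are equal, using only AND, OR and NOT."""
    not_a = 1 - a          # NOT gate
    not_b = 1 - b          # NOT gate
    both_one = a & b       # AND gate: both inputs are 1
    both_zero = not_a & not_b  # AND gate: both inputs are 0
    return both_one | both_zero  # OR gate


# Print the full truth table.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, bit_eq(a, b))
```

In HCL notation the same logic would read roughly as `eq = (a && b) || (!a && !b)`.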

Example 2

selecting a or b according to s

image.png
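The selector (multiplexor) can be sketched the same way. The text does not say which input a given value of s selects, so this sketch assumes that s = 1 selects a and s = 0 selects b; the name `bit_mux` is hypothetical.

```python
def bit_mux(s: int, a: int, b: int) -> int:
    """Select a when s == 1 and b when s == 0 (assumed polarity)."""
    not_s = 1 - s                 # NOT gate
    return (s & a) | (not_s & b)  # two AND gates feeding one OR gate


print(bit_mux(1, 1, 0))  # s = 1, so the output follows a
print(bit_mux(0, 1, 0))  # s = 0, so the output follows b
```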

Example 3

From a single bit to multiple bits (word)

image.pngimage.png
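One way to picture the step from single bits to words is to apply the bit-level circuit to every bit position. The sketch below assumes 64-bit words and the selector polarity "s = 1 selects a"; `word_eq`, `word_mux` and `bits` are hypothetical helper names.

```python
WORD_BITS = 64  # assumed word size


def bits(x: int) -> list:
    """Split a word into its individual bits, least significant first."""
    return [(x >> i) & 1 for i in range(WORD_BITS)]


def word_eq(a: int, b: int) -> int:
    """Output 1 only if every pair of corresponding bits is equal."""
    result = 1
    for abit, bbit in zip(bits(a), bits(b)):
        bit_equal = (abit & bbit) | ((1 - abit) & (1 - bbit))
        result &= bit_equal  # AND together all per-bit results
    return result


def word_mux(s: int, a: int, b: int) -> int:
    """Run one bit-level multiplexor per bit position, all sharing s."""
    out = 0
    for i, (abit, bbit) in enumerate(zip(bits(a), bits(b))):
        out |= ((s & abit) | ((1 - s) & bbit)) << i
    return out


print(word_eq(0x1234, 0x1234), word_eq(0x1234, 0x1235))
print(hex(word_mux(1, 0xAAAA, 0x5555)), hex(word_mux(0, 0xAAAA, 0x5555)))
```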

Arithmetic Logic Unit (ALU)

image.png
  • Uses AND, OR, and NOT gates to implement arithmetic and logic operations
  • Computes the result and sets the condition codes
  • Inputs and outputs are multi-bit words
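A behavioural sketch of such an ALU, assuming 64-bit words and the four operations add, subtract, AND and XOR. It uses Python integer arithmetic instead of gates, and the function name `alu` and the operation labels are hypothetical. The condition codes follow the usual two's-complement definitions: ZF when the result is zero, SF when the sign bit is set, OF on signed overflow.

```python
MASK = (1 << 64) - 1   # keep results within 64 bits
SIGN = 1 << 63         # sign bit of a 64-bit word


def alu(op: str, a: int, b: int):
    """Return (result, condition codes) for `a op b` on 64-bit words."""
    a &= MASK
    b &= MASK
    if op == "add":
        r = (a + b) & MASK
        # Overflow: both operands have the same sign, the result differs.
        of = ((a ^ r) & (b ^ r) & SIGN) != 0
    elif op == "sub":
        r = (a - b) & MASK
        # Overflow: operands differ in sign, result sign differs from a.
        of = ((a ^ b) & (a ^ r) & SIGN) != 0
    elif op == "and":
        r, of = a & b, False
    elif op == "xor":
        r, of = a ^ b, False
    else:
        raise ValueError(f"unsupported operation: {op}")
    return r, {"ZF": r == 0, "SF": bool(r & SIGN), "OF": of}


print(alu("sub", 5, 5))          # zero result sets ZF
print(alu("add", 2**63 - 1, 1))  # largest positive + 1 overflows
```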

Storage elements

Storage elements are special electronic circuits that can retain data values

  • Storage elements can be read or written
  • Storage elements can be addressed
  • Storage elements rely on clocks to retain data values
image.png
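The role of the clock can be sketched with a toy model of a clocked register: the stored value changes only when the clock ticks, regardless of what is happening on the input in between. The class `ClockedRegister` and its attribute names are hypothetical and ignore all electrical detail.

```python
class ClockedRegister:
    """Toy model: the output only changes on a clock tick."""

    def __init__(self, initial: int = 0):
        self.input = initial   # value currently presented at the input
        self.output = initial  # value currently held and visible

    def tick(self) -> None:
        """Rising clock edge: load the input into the register."""
        self.output = self.input


reg = ClockedRegister(0)
reg.input = 42
print(reg.output)  # still the old value: no clock tick yet
reg.tick()
print(reg.output)  # the new value is visible after the tick
```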

Y86-64 processor

state

image.png
  • 15 64-bit general purpose registers
  • Condition codes
    • ZF: zero
    • SF: negative
    • OF: overflow
  • Program counter (PC): indicates the address of the next instruction
  • Memory
    • Byte-addressable storage array
    • Words stored in little-endian byte order

little-endian

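Big-endian means that the most significant byte (MSB) is stored at the lowest address, while little-endian means that the least significant byte (LSB) is stored at the lowest address.

A description in words can be abstract, so the figure below illustrates it. For example, the value 0x12345678 is stored as follows on CPUs with the two different byte orders: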

image.png

As the two diagrams above show, big-endian storage matches the way people naturally read numbers.
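The two byte orders can be checked directly with Python's built-in `int.to_bytes`, using the same value 0x12345678 stored in 4 bytes. Index 0 of each byte string corresponds to the lowest address.

```python
value = 0x12345678

little = value.to_bytes(4, "little")  # LSB at the lowest address
big = value.to_bytes(4, "big")        # MSB at the lowest address

print("little-endian:", little.hex(" "))  # 78 56 34 12
print("big-endian:   ", big.hex(" "))     # 12 34 56 78

# Reading the bytes back with the matching byte order restores the value.
assert int.from_bytes(little, "little") == value
assert int.from_bytes(big, "big") == value
```

Y86-64 stores its 8-byte words the little-endian way, so an 8-byte version of the same value would start with the bytes 78 56 34 12 followed by four zero bytes.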

Instruction set

image.pngimage.png

Encoding registers


Each register is uniquely specified by a 4-bit ID

image.png

ID 15 (0xF) indicates "no register"
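As a sketch of how two 4-bit IDs share one register-specifier byte: the first register goes in the high 4 bits and the second in the low 4 bits. The ID table below follows the numbering commonly given for Y86-64 in the CS:APP textbook; because the figure is not reproduced in the text here, treat it as an assumption. The names `REG_IDS` and `register_byte` are hypothetical.

```python
# Assumed Y86-64 register numbering (per the CS:APP textbook).
REG_IDS = {
    "rax": 0x0, "rcx": 0x1, "rdx": 0x2, "rbx": 0x3,
    "rsp": 0x4, "rbp": 0x5, "rsi": 0x6, "rdi": 0x7,
    "r8": 0x8, "r9": 0x9, "r10": 0xA, "r11": 0xB,
    "r12": 0xC, "r13": 0xD, "r14": 0xE,
}
NO_REGISTER = 0xF  # ID 15 means "no register"


def register_byte(ra, rb) -> int:
    """Pack two register names (or None) into one specifier byte."""
    id_a = NO_REGISTER if ra is None else REG_IDS[ra]
    id_b = NO_REGISTER if rb is None else REG_IDS[rb]
    return (id_a << 4) | id_b


print(hex(register_byte("rax", "rcx")))  # two real registers
print(hex(register_byte(None, "rbx")))   # first slot unused
```

With 15 registers and 16 possible 4-bit values, exactly one value is left over to mark an unused register slot.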

Instruction examples

image.png

Uniqueness (a requirement when designing an ISA)

  • The encodings must have a unique interpretation
  • A given sequence of bytes (machine code) can be interpreted as only one valid sequence of instructions
  • Starting from the first instruction, it is always possible to find the start byte of the next instruction
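The uniqueness property can be sketched as a byte-stream splitter: the high 4 bits of an instruction's first byte (its instruction code) determine the instruction length, so the start of the next instruction is always known. The length table follows the Y86-64 encoding as presented in the CS:APP textbook and is an assumption here, since the encoding figure is not reproduced in the text; `INSTR_LEN` and `split_instructions` are hypothetical names.

```python
# Assumed instruction lengths in bytes, keyed by instruction code.
INSTR_LEN = {
    0x0: 1,   # halt
    0x1: 1,   # nop
    0x2: 2,   # rrmovq / cmovXX
    0x3: 10,  # irmovq
    0x4: 10,  # rmmovq
    0x5: 10,  # mrmovq
    0x6: 2,   # OPq
    0x7: 9,   # jXX
    0x8: 9,   # call
    0x9: 1,   # ret
    0xA: 2,   # pushq
    0xB: 2,   # popq
}


def split_instructions(code: bytes) -> list:
    """Split machine code into instructions, starting from byte 0."""
    instructions = []
    pc = 0
    while pc < len(code):
        icode = code[pc] >> 4  # high 4 bits of the first byte
        if icode not in INSTR_LEN:
            raise ValueError(f"invalid instruction code at byte {pc}")
        length = INSTR_LEN[icode]
        if pc + length > len(code):
            raise ValueError(f"truncated instruction at byte {pc}")
        instructions.append(code[pc:pc + length])
        pc += length
    return instructions


# irmovq $0x100, %rbx ; addq %rax, %rbx ; halt
code = bytes.fromhex("30f30001000000000000" "6003" "00")
for instr in split_instructions(code):
    print(instr.hex(" "))
```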

Standard stages to execute one instruction

  • We have:
    • Hardware building blocks that can do arithmetic computation
    • Hardware storage elements to store data
    • Machine instructions defined
  • We want to put all these things together to build a CPU
    • That can read and understand a program in machine instructions
    • That can perform the functions specified by the machine instructions
      • By operating the computation and storage elements of the CPU
image.png
  • Because there are so many instructions, it would be unwise to design a specific hardware circuit for each one
  • The execution of instructions is standardized: all instructions follow the same steps, and each step uses the same shared hardware

Stages/Steps     Functions
Fetch            Read an instruction from memory
Decode           Read operands from registers
Execute          Compute a value or an address
Memory access    Read data from or write data to memory
Write back       Write results to registers
PC update        Update the PC to get ready for the next instruction
image.png
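A rough sketch of how every instruction can pass through the same six stages. It handles only three instructions (`halt`, `irmovq`, and `addq`), omits the condition codes, and assumes the CS:APP encodings (`30 F rB V` for `irmovq`, `60 rA rB` for `addq`, `00` for `halt`). The signal names `valA`, `valB`, `valC`, `valE` and `valP` and the function names are illustrative choices, not taken from the text above.

```python
MASK = (1 << 64) - 1
NO_REG = 0xF


def step(state: dict) -> None:
    """Execute one instruction by walking through the six stages."""
    mem, regs, pc = state["mem"], state["regs"], state["pc"]

    # Fetch: read the instruction bytes from memory.
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    if icode == 0x0:                       # halt
        state["halted"] = True
        return
    if icode not in (0x3, 0x6) or (icode == 0x6 and ifun != 0):
        raise ValueError("instruction not supported by this sketch")
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    if icode == 0x3:                       # irmovq: 8-byte constant
        valC = int.from_bytes(mem[pc + 2:pc + 10], "little")
        valP = pc + 10
    else:                                  # addq
        valC = 0
        valP = pc + 2

    # Decode: read operands from the register file.
    valA = regs[rA] if rA != NO_REG else 0
    valB = regs[rB] if rB != NO_REG else 0

    # Execute: compute a value.
    valE = valC if icode == 0x3 else (valB + valA) & MASK

    # Memory access: neither instruction touches data memory.

    # Write back: store the result in the destination register.
    regs[rB] = valE

    # PC update: move to the next instruction.
    state["pc"] = valP


def run(program: bytes) -> dict:
    state = {"mem": program, "regs": [0] * 15, "pc": 0, "halted": False}
    while not state["halted"]:
        step(state)
    return state


# irmovq $5, %rax ; irmovq $7, %rbx ; addq %rax, %rbx ; halt
program = bytes.fromhex(
    "30f00500000000000000" "30f30700000000000000" "6003" "00"
)
final = run(program)
print("rax =", final["regs"][0], "rbx =", final["regs"][3])
```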

Computed values

Stored in CPU on hardware lines/pins

image.png

Run the machine codes

Use an example program to show how the CPU runs a program in machine code form

image.pngimage.pngimage.pngimage.pngimage.pngimage.png

Pipeline

image.png
  • The whole production process is composed of multiple stages
  • The worker at each stage does only ONE thing
  • Products line up on the pipeline, and each one goes through all stages

Rethinking the sequential machine

Every instruction goes through six stages

In the sequential implementation, while an instruction is in one stage (e.g., execute), the hardware components of all the other stages are idle

This under-utilizes the processor hardware

Understanding the performance of pipeline

Executing an instruction takes 300 ps ($1\,\text{ps} = 10^{-12}\,\text{s}$)

How many instructions can we execute in 1 s? (throughput, IPS)
$1 / (300 \times 10^{-12}) \approx 3.33 \times 10^{9}$ instructions (when no instructions execute concurrently)

image.png

Decompose the execution of each instruction into 3 stages, each of which takes 100 ps to execute

image.png

How many instructions can we execute in 1 s? (throughput, IPS)

  • $1 / (100 \times 10^{-12}) = 10^{10}$

  • 3 times the throughput of the unpipelined execution above

  • Registers are added between every two consecutive pipeline stages

  • Each time a clock signal arrives, the result of stage-x is written to the register between stage-x and stage-(x+1)

  • Once the result of stage-x is written, the next stage can start executing with that result as its input

  • Accessing the registers introduces extra delay, so the end-to-end latency of finishing a single instruction increases

image.png
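The throughput and latency figures can be reproduced with a small calculation. The sketch assumes that the clock period equals the slowest stage delay plus the delay of one pipeline register; the 20 ps register delay in the last example is an assumed value (the text only says that registers add delay), and `pipeline_metrics` is a hypothetical name.

```python
def pipeline_metrics(stage_delays_ps, register_delay_ps=0.0):
    """Return (clock period in ps, instructions per second, latency in ps)."""
    clock_ps = max(stage_delays_ps) + register_delay_ps
    throughput_ips = 1e12 / clock_ps          # 1 s = 10^12 ps
    latency_ps = clock_ps * len(stage_delays_ps)
    return clock_ps, throughput_ips, latency_ps


# Unpipelined: one 300 ps block, register delay ignored.
print(pipeline_metrics([300]))
# Three 100 ps stages, register delay ignored (the idealised case above).
print(pipeline_metrics([100, 100, 100]))
# Same three stages with an assumed 20 ps register delay.
print(pipeline_metrics([100, 100, 100], register_delay_ps=20))
```

With the assumed register delay, throughput still improves over the unpipelined case, but by less than 3 times, and the latency of a single instruction grows from 300 ps to 360 ps.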
Redesign the CPU with pipeline
image.png
Bad pipeline design

Nonuniform partitioning
The clock period, and therefore both throughput and latency, is determined by the slowest stage

image.png
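To see the cost of uneven stages, the sketch below compares a uniform split with a nonuniform split of the same 300 ps of logic. The stage delays 50/150/100 ps are assumed example values, register delay is ignored, and `stage_idle_times` is a hypothetical name.

```python
def stage_idle_times(stage_delays_ps):
    """Per-stage idle time in each cycle: the clock must fit the slowest stage."""
    clock_ps = max(stage_delays_ps)
    return [clock_ps - delay for delay in stage_delays_ps]


for delays in ([100, 100, 100], [50, 150, 100]):
    clock_ps = max(delays)
    print(
        delays,
        "clock:", clock_ps, "ps,",
        "throughput:", round(1e12 / clock_ps / 1e9, 2), "GIPS,",
        "idle per stage:", stage_idle_times(delays),
    )
```

Both splits contain the same total amount of logic, but the nonuniform one must run at the pace of its 150 ps stage, so the faster stages sit idle for part of every cycle.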
Make the pipeline stages uniform
image.png
  • More stages: deep pipeline
  • More stage registers -> more time overhead
  • Sometimes, a stage cannot be decomposed any further
Data hazard in pipelines

  • Data dependencies: the results computed by one instruction are used as data by a following instruction
  • Data hazard: data dependencies have the potential to cause the pipeline to compute an erroneous result
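Data dependencies of this kind can be found mechanically by comparing the registers each instruction writes with the registers that later instructions read. In the sketch below, each instruction is a hypothetical tuple of (text, source registers, destination registers), and the `window` parameter (how many following instructions can still be in the pipeline before the result is written back) is an assumed value that depends on the actual pipeline design.

```python
def find_data_dependencies(program, window=3):
    """Return (producer index, consumer index, shared registers) triples."""
    found = []
    for i, (_, _, dests) in enumerate(program):
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            shared = set(dests) & set(program[j][1])
            if shared:
                found.append((i, j, sorted(shared)))
    return found


program = [
    ("irmovq $10, %rdx", [], ["rdx"]),
    ("irmovq $3, %rax", [], ["rax"]),
    ("addq %rdx, %rax", ["rdx", "rax"], ["rax"]),
    ("halt", [], []),
]

for producer, consumer, regs in find_data_dependencies(program):
    print(f"{program[consumer][0]!r} reads {regs} written by {program[producer][0]!r}")
```

Each reported pair is a potential hazard: if the consumer reads its registers before the producer has written them, it computes with stale values.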
image.png
Solution: Stalling
image.png
Other Problems
  • Stalling a pipeline reduces performance
  • There are other ways to remove data hazards
  • There are also control hazards, along with solutions for them
  • There are out-of-order pipelines (in which the instruction execution order is changed) and multi-issue pipelines
  • ……

Pipelines are very important to the performance of powerful contemporary CPUs, and very complex designs exist


Supplementary material

Y86-64 instruction set

1620889874072.png

Stages of some additional instructions

mrmovq

1620890018309.png

jg

1620890069997.png

cmovle

1620890163626.png

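x86/x64 instruction length

Section 1.1, "Instruction Byte Order", of the AMD manual, Volume 3, states explicitly: An instruction can be between one and 15 bytes in length.

Exercise notes

  1. Every instruction goes through the Decode, Write back, and Memory stages. The le xxx instruction has only these three stages

References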