Computer and Cache Memory
Computer components revisited


Computer Memory
Computer memory refers to any physical device capable of storing information.
Memory spans a wide range of types, technologies, performance levels, costs, etc. --- the characteristics of memory.
Memory organization: select the right memory, and put it in the right place.
Motivation
Choosing the right memory (physical device) is just like buying the right desk.
Underlying learning goals:
- trade-off between performance and cost
- design philosophies/techniques that use a constrained cost to achieve better performance
Characteristics of Memory
Memory can be categorized from many aspects:
- the technology it uses
- the physical media
- the cost
- the access speed
- the location (where it is in the computer)
Location
- internal memory or external memory?
- internal: registers in processor, main memory, cache, etc.
- external: the peripheral storage devices, such as disk and tape -- they are accessible to the processor via I/O controllers
Note: memory that stays plugged into the motherboard is generally counted as internal memory.
Access method
Sequential Access
- memory is organized into units of data (records)
- access each record in sequence: move from the current location to the desired location sequentially
- as a result, the time to access an arbitrary record is highly variable
Example: tape

Direct Access/Random Access
- each location in the memory has a unique address
- direct access often refers to secondary devices such as disks (access the block of data, possibly plus a sequential search)
- random access often refers to main memory or cache
- the time to access a given location is typically independent of the sequence of prior accesses and is often constant

Access Time (latency)
Measures of Performance -- Access Time (latency)
- for random access: the time it takes to perform a read or write operation, i.e., from the time an address is presented to the memory to the time data have been stored or become available
- for non-random access: from the time the instruction is issued to the time the data position is located for the read/write operation
Memory Cycle Time
Measures of Performance -- Memory Cycle Time
Primarily applied to random access memory: it consists of the access time plus any additional time required before a second access can be made (the cycle).
Example: in computer games, casting a skill (QWER) is followed by a cooldown before it can be cast again.
It is concerned with the system bus.
An associated concept is transfer rate = 1/memory cycle time.
Memory frequency is generally computed this way.
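As an illustrative calculation (the 10 ns cycle time is an assumed example, not a figure from the slides):

$$\text{transfer rate} = \frac{1}{\text{memory cycle time}} = \frac{1}{10\,\text{ns}} = 10^{8}\ \text{transfers per second} = 100\ \text{MHz}$$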
Physical materials/technologies
- semiconductor: RAM, cache
- magnetic: hard drive
- solid-state devices (a kind of semiconductor technology); replacing hard drives
- optical: CD
Volatile vs. Non-volatile
Physical features -- Volatile vs. Non-volatile
- volatile memory: needs power to maintain the stored information (contents are lost when power is off), e.g., registers, RAM
- non-volatile memory: does not need to be powered all the time, e.g., hard drive
Erasable vs. Non-erasable
Physical features -- Erasable vs. Non-erasable
- erasable: can be deleted or rewritten with new data, e.g., registers, hard drive
- non-erasable: cannot be deleted/rewritten, e.g., ROM (Read Only Memory), CD


Memory Hierarchy
In the hierarchy
- closer to the processor: faster but lower-capacity, more expensive memory
- farther from the processor: slower but higher-capacity, less expensive memory
- the essential design question: can we "allocate" information so that frequently accessed information is stored in the faster memory?
example: Program Counter register
Access latency

- Time used by CPU to execute one instruction
- 1 cycle for most instructions (1 GHz CPU, 1 cycle = 10^{-9} s)
- Time used to fetch a word from main memory
- 10 ~ 100 cycles
- Time used to fetch a block of data from disks
- 10,000 ~ 1,000,000 cycles
The gap between the CPU, main memory, and disks

The problem to solve: a fast CPU is wasted waiting for slow memory.
The processor has access to two levels of memory
- Level-1: capacity C1 = 100 words, access time T1 = 0.1s
- Level-2: capacity C2 = 1000 words, access time T2 = 1s
Access mechanism: if the data is in Level-1, access it directly; otherwise, first move the data from Level-2 to Level-1, and then access it from Level-1.
Access time: T1 on a hit vs. (T1 + T2) on a miss.
Locality
Clustering effect
In memory, bits form bytes, bytes form words, and multiple words form a block.
Basic idea: during program execution, memory accesses (for instructions and data) by the processor tend to cluster.
Reasons: programs typically contain loops/subroutines, which repeatedly access a small set of instructions; likewise, for data, operations on tables and arrays involve access to a clustered set of data.
During a specific time period, the processor wants to access one portion of main memory (one cluster).
Over time, the processor accesses different clusters.

Concept
- Principle of locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
- Temporal locality: recently referenced items (data, instructions) are likely to be referenced again in the near future
- Spatial locality: items (data, instructions) with nearby addresses tend to be referenced close together in time
THINK: WHY DOES LOCALITY EXIST?
Example: Two-Level Memory
Hit Ratio: the fraction of accesses involving only Level-1

Conclusion -- increase the hit ratio
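A worked example using the Level-1/Level-2 numbers above (H denotes the hit ratio; the two-level average below follows directly from the access mechanism):

$$T_{\text{avg}} = H \cdot T_1 + (1 - H)(T_1 + T_2)$$

With $T_1 = 0.1$ s and $T_2 = 1$ s: $H = 0.95$ gives $T_{\text{avg}} = 0.95(0.1) + 0.05(1.1) = 0.15$ s, close to $T_1$, while $H = 0.5$ gives $0.6$ s. Raising the hit ratio pulls the average access time toward $T_1$.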

Examples of locality
To compute the sum of all elements in a 2-D array:
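The original code figure is missing here; the following is a minimal C sketch of the good-locality version (the array name a and the sizes M and N are illustrative assumptions):

```c
#define M 4
#define N 4

/* Good locality: the inner loop walks one row, matching the row-major
 * layout of C arrays -- a stride-1 reference pattern. */
int sum_good(int a[M][N]) {
    int sum = 0;                    /* "sum" itself has temporal locality */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];         /* consecutive memory addresses       */
    return sum;
}
```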

Note that the array is stored in memory dimension by dimension (row-major order in C). Now consider using the following algorithm instead.
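A minimal C sketch of this version, under the same illustrative assumptions as above:

```c
/* Bad locality: the inner loop jumps between rows, so consecutive
 * accesses are N elements apart -- a stride-N (here stride-4) pattern. */
int sum_bad(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];         /* addresses N elements apart */
    return sum;
}
```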

With this traversal we get BAD LOCALITY.
Examples of temporal locality
Data references
- The access to "sum" in the inner loop
- Once accessed, it will be accessed again in the near future
Instruction references
- The instructions to do “sum += a[][]”
- Once executed, will be executed again in the near future
To understand locality for “data” and “instructions”:
They are essentially the same, as instructions are special data stored in memory
To measure locality
How do we evaluate how good locality is?
- Stride: the distance between two adjacent data accesses in memory, in units of one data element
- Stride-1 reference pattern: access the data one by one according to their memory addresses, as in the good-locality example
- Stride-k reference pattern: for example, the bad-locality example has a stride-4 reference pattern
- The smaller the stride, the better the locality
Cache Memory

Here the refrigerator plays the role of the cache memory.
Caching idea: add a small but fast memory inside the CPU.

The cache is organized into lines; we take the length of each cache line to equal the size of one memory block.
Concept

Cache is fast but has relatively small capacity.
Basic mechanism: the cache contains copies of portions of main memory (blocks).
- when the CPU wants to read a word:
- if the word is in the cache, access the word
- else a block of main memory (containing that word) is read into the cache
- then the word is delivered to the CPU
Due to locality of reference, the hit ratio is high (words can be found in the cache with high probability).
As a result, it is as if there were a memory with large capacity and fast access speed.
Example
Assume that when a program executes, it generates the following 20 memory accesses:
ABCDCCCDDDEGHGHGHGHB
The unit of data loading is "one block": one block contains two data variables.
- Cache size = 2 blocks
- Data access time
- Cache hit: 1 cycle
- Cache miss: 200 cycles


Performance with cache: access time = 1 * 15 + 5 * 200 = 1015 cycles (15 hits, 5 misses)
Performance without cache: access time = 200 * 20 = 4000 cycles
Note: we assume that checking for a hit/miss takes no time; fetching data from memory takes 200 cycles, and fetching from the cache takes 1 cycle.
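The hit/miss counts above can be checked with a small simulation. A minimal C sketch, assuming the variables pair into blocks as (A,B), (C,D), (E,F), (G,H) and LRU replacement of the 2 cached blocks (FIFO happens to evict the same blocks on this trace):

```c
#include <stdio.h>

int main(void) {
    const char *trace = "ABCDCCCDDDEGHGHGHGHB";
    int block[2]    = {-1, -1};  /* which memory block each cache line holds */
    int last_use[2] = { 0,  0};  /* logical time of each line's last use     */
    int hits = 0, misses = 0;

    for (int t = 0; trace[t] != '\0'; t++) {
        int b = (trace[t] - 'A') / 2;          /* block of this variable     */
        int line = -1;
        for (int i = 0; i < 2; i++)            /* is the block in the cache? */
            if (block[i] == b) line = i;
        if (line >= 0) {
            hits++;                            /* cache hit: 1 cycle         */
        } else {
            misses++;                          /* cache miss: 200 cycles     */
            line = (last_use[0] <= last_use[1]) ? 0 : 1;   /* LRU victim     */
            block[line] = b;                   /* load the block             */
        }
        last_use[line] = t + 1;
    }
    printf("hits=%d misses=%d\n", hits, misses);       /* prints 15 and 5    */
    printf("with cache:    %d cycles\n", hits * 1 + misses * 200);
    printf("without cache: %d cycles\n", 200 * 20);
    return 0;
}
```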
Multiple-Level Cache Organization (illustration)

Application

Structure of a Cache/Main-Memory System

The cache consists of m lines, where each line has a length of K words (the size of one block in memory).

m lines in the cache vs. M blocks in memory (m << M)
Access mechanism: when we want to access a word in memory, the whole block containing that word is copied into one line of the cache.
One cache line cannot be permanently allocated to one memory block -- a tag is used to indicate which block is currently stored in a cache line (more on this later).
Illustration of Cache Read Operation (below)

Typical Cache Organization (figure below)

Mapping Function -- Basic Elements of Cache Design
Mapping Function
The essential problem we want to address: map M memory blocks to m cache lines (m << M) -- the mapping function.

A mapping function defines where the needed blocks in memory are loaded into the cache (M blocks, m positions).
- direct mapping: a kind of fixed mapping, i.e., blocks in memory are allocated to fixed cache lines
- Q = M/m: Q blocks are mapped to one single cache line, a Q-to-1 mapping
- associative mapping: any block can go to any cache line
Direct mapping
Direct mapping Example
Map all students to 10 seats using direct mapping
- suppose the SID has 8 digits; split the SID into 2 parts: the first 7 digits and the last digit
- use your last digit as your seat number -- all students are divided into 10 groups; it is a fixed mapping
How can we know who is currently sitting in seat i (i = 0 to 9)?
A student can use the first 7 digits as a tag -- the tag is unique for that particular student.

SID = memory address (the tag in the cache is a portion of the memory address)
Implementation
- use the main memory address to implement direct mapping
- bear in mind that cache line = memory block (they contain the same number of words)

Note: this does not contradict what we learned about addressing; the reason is that s + w does not necessarily occupy all the bits.
Address interpretation
How would the cache interpret a main memory address?

Example: s = 4 bits, r = 2 bits
01 11 -- the last two bits (11) determine which line the block goes to
00 11, 01 11, 10 11, 11 11: the four blocks ending with 11 go to the same line; they are differentiated by the tag (the first two bits)
01 11: 0111 mod 4 = 3 -- the line with line number 3
Summary
Summary of Mapping Function:
- given a memory address of (s + w) bits, we know which block it is through the s bits
- among the s bits, there are r bits that tell us the line number in the cache that stores this block
- As there are more blocks than lines (s > r), the remaining (s - r) bits serve as a tag for the Cache line, telling us which block this line is currently storing
Multiple Blocks to One Line: the address has (s + w) bits
- the cache will interpret it as three fields: tag (s - r), line (r), word (w)
- line (r) determines which line in the cache the block maps to
- tag (s - r) identifies which block that line is currently storing
The CPU can directly check that line and compare the tag field to determine whether the block is currently in the cache.
Note: the tag is not fixed in advance; it is written when a block is loaded into the line.
How does CPU access a word?
- Given an address (s+w) bits
- Use r bits to locate the line in Cache
- Compare the (s - r) bits with the stored tag:
- if they match, the line stores the desired block; use the w bits to identify the desired word
- if they do not match, access memory and copy the block into the line

Example
Settings:
- word = one byte, block = 4 words (4 bytes)
- main memory has 16 Mbytes (note: 1 MB = 1024 KB; 1 KB = 1024B);
- that is, 2^{24} bytes -> 24 bits for the address
- we have 2^{22} blocks -> s = 22 bits, w = 2 bits
- cache has 64 KB = 2^{16} bytes
How many lines are there in the cache?
The number of lines is determined by the cache size and the line size: 2^{16} bytes / 4 bytes per line = 2^{14} lines.
2^{14} lines -> among the 22 block bits, r = 14 bits for the line number, leaving 8 bits for the tag.
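A minimal C sketch of this field split, using the widths derived in the example (the sample address value is an arbitrary illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* From the example: 24-bit address = tag (8) | line (14) | word (2). */
#define WORD_BITS 2
#define LINE_BITS 14

int main(void) {
    uint32_t addr = 0x16339C;   /* an arbitrary 24-bit address (assumed) */
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);
    /* Direct-mapped lookup: index the cache with 'line'; it is a hit
     * iff the tag stored at that line equals 'tag'. */
    printf("tag=0x%02X line=0x%04X word=%u\n",
           (unsigned)tag, (unsigned)line, (unsigned)word);
    return 0;
}
```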
Pros and Cons
- simple and inexpensive to implement
- main disadvantage: multiple blocks in memory are mapped to a fixed cache line
- if a program happens to access words repeatedly from two different blocks that map into the same line,
- the two blocks will be continually swapped in the cache -- low hit ratio
Associative Mapping
Key difference from direct mapping:
- a block in memory can be mapped to any line in the cache
- the cache will interpret the address as two fields: Tag and Word
- address = s + w bits; 2^s blocks in memory
- s serves as the tag, and it is stored together with the data
- note: there is no field that determines the line number
Associative Mapping Example
Settings:
- memory has 2^{22} blocks, each block has 4 bytes (data = 32 bits)
- address is 24 bits = 22 bits (s) + 2 bits (w)
- Tag: 22 bits

Implementation
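The implementation figure is missing here; below is a minimal C sketch of an associative lookup (the struct and function names are mine, and real hardware compares all tags in parallel rather than in a loop):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 16          /* illustrative; not from the slides */

struct aline {
    bool     valid;           /* does this line currently hold a block? */
    uint32_t tag;             /* the full 22-bit block number           */
    uint8_t  data[4];         /* one 4-byte block                       */
};

/* Fully associative: the block may sit in ANY line, so every stored
 * tag must be examined against the 22-bit tag from the address. */
int lookup(const struct aline cache[], uint32_t tag) {
    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;         /* hit: return the line index             */
    return -1;                /* miss: a line must be chosen/replaced   */
}
```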


A block can be mapped to any line in cache! What if the cache is full?
- For direct mapping, there is no choice.
- For associative mapping, we need to decide which block should be replaced --- replacement algorithms
The main disadvantage of associative mapping is the complex circuitry required to examine the tags of all cache lines in parallel.
Replacement Algorithms
Least Recently Used (LRU) -- the most popular replacement algorithm
- replace the block that has been in the cache longest with no reference to it
- this requires some extra index to record the time when a line is referenced -- the price to pay
- Simple for two-way set-associative mapping: use a single bit USE to indicate the most recently used line in the set
Other common replacement algorithms
First-in-first-out (FIFO): replace the block in the set that has been in the cache longest (what's the difference from LRU? see the sketch below)
Least Frequently Used (LFU): associate a counter with each line
Random (not based on usage): randomly pick one to replace -- some papers show that Random is only slightly worse than the usage-based algorithms above
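To make the FIFO-vs-LRU distinction concrete, here is a small C sketch (the field names are illustrative): FIFO picks its victim by load time, LRU by last-reference time; the scan is otherwise identical.

```c
#include <stddef.h>

struct cline {
    int load_time;        /* when the block was brought into the cache */
    int last_access;      /* when the block was last referenced        */
};

/* FIFO victim: oldest load_time. LRU victim: oldest last_access.
 * The only difference is which timestamp the scan compares. */
size_t victim(struct cline lines[], size_t n, int use_lru) {
    size_t v = 0;
    for (size_t i = 1; i < n; i++) {
        int a = use_lru ? lines[i].last_access : lines[i].load_time;
        int b = use_lru ? lines[v].last_access : lines[v].load_time;
        if (a < b) v = i;
    }
    return v;
}
```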
Set-Associative Mapping
not required
Write Policy
Why does inconsistency happen?
- multiple CPUs (e.g., in a server): each CPU has its own cache, which may store the same block of memory
- I/O can directly access main memory
We need a policy to deal with such inconsistency.
Write Policy I - Write Through
All writes go to cache as well as main memory
- multiple CPUs need to monitor main memory traffic to keep local (with respect to CPU) cache up to date
- lots of traffic
- slows down writes
Write Policy II - Write Back
- updates are initially made in the cache only; memory is updated later
- use an extra bit along with each cache line to indicate whether this line has been updated (set the bit when there is an update)
- when this cache line needs to be replaced, check this bit and write the line back to memory if the bit is set (so that updates are not lost)
Requirement: I/O has to access the cache, not main memory directly.
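A minimal C sketch of the dirty-bit bookkeeping described above; the two memory-side helpers are hypothetical stubs, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct wline {
    bool     valid;
    bool     dirty;           /* the extra bit: set when the line is written */
    uint32_t tag;
    uint8_t  data[4];
};

/* Hypothetical memory-side helpers, stubbed out for this sketch. */
static void write_block_to_memory(uint32_t tag, const uint8_t *d) {
    printf("flush dirty block %u to memory\n", (unsigned)tag); (void)d;
}
static void load_block_from_memory(uint32_t tag, uint8_t *d) {
    printf("load block %u from memory\n", (unsigned)tag); (void)d;
}

/* Writes update the cache only and mark the line dirty. */
void write_word(struct wline *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

/* On replacement, a dirty line is written back so updates are not lost. */
void replace_line(struct wline *line, uint32_t new_tag) {
    if (line->valid && line->dirty)
        write_block_to_memory(line->tag, line->data);
    load_block_from_memory(new_tag, line->data);
    line->tag = new_tag;
    line->valid = true;
    line->dirty = false;
}
```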
Exploiting locality
Hints to software developers
- Caching leverages locality; good locality makes good use of the cache
- General principles for writing programs with good locality:
- Focus your attention on inner loops, where the CPU spends most of its time
- Try to maximize spatial locality by reading data objects sequentially, with stride 1, in the order they are stored in memory
- Try to maximize temporal locality by using a data object as often as possible once it has been read from memory
Managing caches
Design considerations of caches
- Block size
- Bigger block sizes exploit spatial locality
- Too big: the block brings in lots of data that will never be used, wasting space and time
- Who is in, who is out?
- Replacement policy: if cache is full, new data going in, evict which data?「替换策略:如果缓存已满,则有新数据进入,逐出哪些数据?」
- Intuition: keep the data that will be used in the near future in the cache. But how do we know which data will be used in the near future?

- To efficiently find a data item in the cache:
- finding 1 data item among 1000 candidates takes too long!
- partition the cache into groups, and map different data into different groups
- finding a data item in a smaller group can be efficient, but space may not be fully utilized
Impacts to HW design
Impact on hardware design -- the memory hierarchy

- Conceptually, level K can be viewed as a cache of level K+1, storing a subset of level K+1's data
- If caching policies are smartly designed, most cache accesses will be hits
- The net effect is much like having a memory system that works at the speed of the highest level but offers the storage space of the lowest level, at a reasonably low price

Replacement Policy
LRU Replacement
In the Least Recently Used (LRU) page replacement policy, the page that was least recently used is replaced. Implementation: add a register to every page frame that contains the last time the page in that frame was accessed, and use a "logical clock" that advances by 1 tick each time a memory reference is made.
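A minimal C sketch of that register-plus-logical-clock scheme (the 3-frame size and all names are illustrative):

```c
#define NFRAMES 3

static int page_in[NFRAMES]   = {-1, -1, -1};  /* page held by each frame */
static int last_used[NFRAMES] = { 0,  0,  0};  /* per-frame time register */
static int clock_ticks = 0;                    /* logical clock           */

/* Reference a page; return the frame that ends up holding it. */
int reference(int page) {
    clock_ticks++;                 /* the clock advances 1 tick per reference */
    int lru = 0;
    for (int i = 0; i < NFRAMES; i++) {
        if (page_in[i] == page) {  /* hit: refresh this frame's register */
            last_used[i] = clock_ticks;
            return i;
        }
        if (last_used[i] < last_used[lru])
            lru = i;               /* track the least recently used frame */
    }
    page_in[lru] = page;           /* miss: replace the LRU page */
    last_used[lru] = clock_ticks;
    return lru;
}
```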
FIFO Replacement
First-in-first-out (FIFO)
The cache is managed with the FIFO replacement policy: when all 3 cache blocks are in use and a new data block must be loaded into the cache, one of the 3 blocks is replaced out of the cache -- the data block that was loaded earliest. We assume that at the very beginning the cache is empty.
With this algorithm, the cache evicts blocks in the order they were added, regardless of how often or how many times they were accessed before.
LIFO Replacement
Last-in-first-out (LIFO)
With this algorithm, the behavior is exactly the opposite of a FIFO queue: the cache evicts the block that was added most recently, regardless of how often or how many times it was accessed before.