Notes@HKU by Jax

Memory

Memory hierarchy

Memory comes with varying attributes. Our goal is to balance capacity, cost, access time, and access frequency. The memory hierarchy is a way to organize memory in a computer system to optimize these attributes.


| Type of memory | Usage | Category |
| --- | --- | --- |
| Registers | Instructions | Inbound memory |
| On-chip cache | Inside CPU | Inbound memory |
| Cache memory | Motherboard | Inbound memory |
| Main memory | RAM | Inbound memory |
| Secondary storage | Hard disk, SSD | Outbound / Off-line storage |

Going from top to bottom, the memory hierarchy has the following trends:

  • Capacity $\uparrow$
  • Cost per bit $\downarrow$
  • Access time $\uparrow$
  • Access frequency $\downarrow$

Principle of locality

Programs do not access all memory locations uniformly. Some memory locations have a higher tendency to be accessed than others.

Types of locality

There are two types of locality:

  • Temporal locality: If a memory location is accessed, it is likely to be accessed again in the near future.

  • Spatial locality: If a memory location is accessed, it is likely that nearby memory locations will be accessed in the near future.

Example of locality

Consider the following program:

arr = [1, 2, 3, 4]  # example data
s = 0
for x in arr:
    s += x

The memory at s is accessed on every iteration (temporal locality), while the elements of arr are accessed sequentially (spatial locality).

Memory organization

Assuming memory is addressed in units of 8 bits (1 byte), multi-byte data is stored differently in memory on different architectures.

  • Big endian: Stored from most to least significant byte (left to right).
  • Little endian: Stored from least to most significant byte (right to left).

Example storage

To store 0x11223344 in memory:

  • Big endian: 0x11 0x22 0x33 0x44
  • Little endian: 0x44 0x33 0x22 0x11
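
A minimal Python sketch of the two layouts, using the standard library's struct module (the snippet is illustrative, not part of the original notes):

import struct

value = 0x11223344

# Pack the same 32-bit integer with each byte order
big = struct.pack(">I", value)     # big endian: most significant byte first
little = struct.pack("<I", value)  # little endian: least significant byte first

print(big.hex(" "))     # 11 22 33 44
print(little.hex(" "))  # 44 33 22 11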

Unit of transfer

In modern systems, the CPU reads from cache memory word by word, while data is transferred between main memory and the cache in blocks (usually 4 KB) containing multiple words.

Access methods

| Access method | Data accessed by |
| --- | --- |
| Sequential | Beginning to end |
| Random | Any order, with constant latency |
| Associative | Content, not address. Used in cache memory & Content-Addressable Memory (CAM) |

Internal memory

Volatility

Volatile memory loses its contents when power is turned off.

Comparison of internal memory types

The following table compares the different types of internal memory:

| Memory Type | Category | Erasure | Write Mechanism | Volatility |
| --- | --- | --- | --- | --- |
| RAM | Read-write | Electrically, byte-level | Electrically | Volatile |
| ROM | Read-only | Not possible | Masks | Nonvolatile |
| PROM | Read-only | Not possible | Electrically | Nonvolatile |
| EPROM | Read-mostly | UV light, chip-level | Electrically | Nonvolatile |
| EEPROM | Read-mostly | Electrically, byte-level | Electrically | Nonvolatile |
| Flash | Read-mostly | Electrically, block-level | Electrically | Nonvolatile |

Flash memory is primarily used in USB drives and SSDs. It has faster write times than EEPROM, but a limited number of write cycles.

Types of RAM

| Characteristic | Dynamic RAM | Static RAM |
| --- | --- | --- |
| Storage technology | Uses transistors to store electric charges | Uses logic gates (latches) |
| Refreshing | Required (every few ms, due to leaking charge) | Not required |
| Speed | Slower (delay due to capacitance) | Faster |
| Usage | Main memory | Cache memory |
| Cost | Cheap | Very expensive |

Memory performance benchmarking

We are interested in the following key metrics:

  1. Access time: Time to read/write data
  2. Transfer time: Time to transfer data
  3. Memory cycle time: Access + Transfer time

Error Detection and Correction

Errors may arise from various sources, such as voltage spikes (lightning), electromagnetic interference (cosmic rays), and power supply problems.

Extra bits are required to detect errors. (Usually in secondary storage devices but not in main memory as it is volatile and less likely to have errors.)

Common error detection and correction methods include:

  • Parity Bit: Extra bit added to the data to make the number of 1s even/odd. (Can only detect an odd number of bit errors; cannot correct errors. See the sketch after this list.)
  • Hamming Code
  • Repetition Code
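
A minimal Python sketch of an even parity check (the helper name and example bits are only illustrative):

def even_parity_bit(bits):
    """Return the bit that makes the total number of 1s even."""
    return sum(bits) % 2

data = [1, 0, 1, 1, 0, 1, 0]              # 7 data bits
codeword = data + [even_parity_bit(data)]

received = codeword.copy()
received[2] ^= 1                          # flip one bit
print(sum(received) % 2 != 0)             # True: single-bit error detected

received[5] ^= 1                          # flip a second bit
print(sum(received) % 2 != 0)             # False: an even number of errors goes undetected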

Cache memory

Cache memory is hidden from software and managed by hardware. It stores copies of frequently accessed data to speed up future accesses to them.

Cache memory organization

Consider a word-addressed main memory, with each address of length $n$ bits. By organizing the memory into blocks (groups) of $K$ words, we have a total of $M = \frac{2^n}{K}$ blocks.

Generally, a cache contains $m \ll M$ lines (groups), each capable of holding one of the $M$ blocks.

Each cache line contains a tag (identifier) and one block of words.
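
As an illustration with assumed numbers: for $n = 16$-bit word addresses and blocks of $K = 4$ words, main memory contains $M = \frac{2^{16}}{4} = 16384$ blocks, while the cache might hold only, say, $m = 128$ of them at a time.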

(Figure: main memory laid out as $M = 2^n / K$ blocks of $K$ words each, next to a cache of $m$ lines, each line holding a tag and one block.)

Overview of cache access process

# Pseudocode sketch: address_mapping, cache, memory, mapping and write_policy
# stand for the hardware mechanisms described in the following sections.

def read(address):                                     # address requested by the CPU
    value, cache_position = address_mapping(address)   # look the address up in the cache

    # Cache hit
    if value is not None:
        return value

    # Cache miss: fetch the value from main memory and place it in the cache
    value = memory[address]
    replaced_address, replaced_value = replacement_algorithm(value, cache_position)

    # Write-back policy: an evicted (dirty) block is written back to its own location
    if write_policy == "write-back" and replaced_value is not None:
        memory[replaced_address] = replaced_value

    return value

def replacement_algorithm(value, cache_position):
    if mapping == "direct":
        cache[cache_position] = value    # each block maps to exactly one cache line
        return None, None

    # Associative mapping
    if not cache_is_full():
        append_to_cache(value)           # a free line is available
        return None, None
    # Otherwise evict a line chosen by the selected algorithm (see below)
    # and return its address and value so it can be written back.
    return evict_and_insert(value)

Replacement algorithms

The following are common replacement algorithms:

  1. Random replacement: Randomly select a block to replace.
  2. Least Recently Used (LRU): Replace the block that has not been used for the longest time (sketched below).
  3. Least Frequently Used (LFU): Replace the block that has been used the least number of times.
  4. First In First Out (FIFO): Replace the block that has been in the cache the longest.
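
As an illustration, a minimal LRU sketch for a fully associative cache; LRUCache and the load_block callback (standing in for a main-memory fetch) are names assumed for this example:

from collections import OrderedDict

class LRUCache:
    """Minimal sketch of LRU replacement for a fully associative cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()              # block number -> data, ordered by recency

    def access(self, block, load_block):
        if block in self.lines:                 # hit: mark as most recently used
            self.lines.move_to_end(block)
            return self.lines[block]
        if len(self.lines) >= self.capacity:    # full: evict the least recently used line
            self.lines.popitem(last=False)
        self.lines[block] = load_block(block)   # miss: fetch the block (e.g. from main memory)
        return self.lines[block]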

Address mapping

Address mapping interprets the address of a word in memory, and maps it to a cache line.

Interpreting the main memory address

The cache memory splits the given memory address into 3 parts:

$$\underbrace{[\text{ Tag }]\ \underbrace{[\text{ Number }]}_{s\text{ bits}}\ \underbrace{[\text{ Offset }]}_{k\text{ bits}}}_{\text{memory address}} \qquad \text{Block size} = K\text{ words} = 2^k\text{ bytes}$$

The specific number of bits for Number is different for each cache mapping method.

The offset identifies, counting from the left, the index of the word within the block.

| Mapping Type | Value of $s$ | Usage of Number | Usage of Tag | Adv. | Disadv. |
| --- | --- | --- | --- | --- | --- |
| Direct | $2^s$ = no. of cache lines | Determines the cache line directly | Ensures consistency | Fast | Multiple memory blocks can map to the same cache line |
| Full Associative | $s = 0$ | - | Matched across all cache lines in parallel | | Heavy processing (parallel) |
| Set Associative | $2^s$ = no. of cache sets | Determines the cache set | Matched within the set in parallel | | |

For set associative mapping, an $n$-way set associative cache divides the cache lines into sets, where each set holds $n$ blocks.

Example set associative

Given the address in binary: 1100 1001 0001 0010 1111, cache size: 1024 words, block size: 128 words, word size: 4 bytes. For a two-way set associative mapping, what are the tag, set number and offset?

Block size $= 128 \times 4 = 512$ bytes, so $2^9 = 2^k \implies k = 9$ bits.

Number of blocks in the cache $= \frac{1024}{128} = 8$, number of sets $= \frac{8}{2} = 4$, so $2^2 = 2^s \implies s = 2$ bits.

The remaining bits form the tag. Therefore, from left to right: tag = 1100 1001 0, set number = 00, offset = 1 0010 1111.
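
A minimal Python sketch that reproduces this split with bit operations (the variable names are only illustrative):

address = 0b11001001000100101111     # the 20-bit address from the example
k = 9                                # offset bits (block size 512 bytes = 2^9)
s = 2                                # set bits (4 sets = 2^2)

offset = address & ((1 << k) - 1)             # lowest k bits
set_number = (address >> k) & ((1 << s) - 1)  # next s bits
tag = address >> (k + s)                      # remaining high bits

print(f"{tag:09b} {set_number:02b} {offset:09b}")   # 110010010 00 100101111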

Performance


Cache miss penalty ($C$): Time to transfer a value from main memory to the cache (in nanoseconds)

Cache hit time ($H$): Time to access the cache

Cache hit rate ($P(\text{hit})$): $\frac{N(\text{hit})}{N(\text{access})} = 1 - \frac{N(\text{miss})}{N(\text{access})}$

Average memory access time: $H + P(\text{miss}) \cdot C$
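
As an illustration with assumed numbers: for $H = 1\ \text{ns}$, $C = 50\ \text{ns}$ and $P(\text{hit}) = 0.95$ (so $P(\text{miss}) = 0.05$), the average memory access time is $1 + 0.05 \times 50 = 3.5\ \text{ns}$.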

Unified and split cache

  1. Split Cache: The cache is divided into an instruction cache and a data cache, each of fixed size. This avoids pipeline hazards, and is the main trend today.
  2. Unified Cache: A single cache is used for both instructions and data, so space is automatically balanced between them. However, contention between instruction fetches and data accesses during parallel and pipelined execution creates a performance bottleneck.
