Cache Optimizations
2026-02-07 21:48
Status: #child
Tags: #software #engineering
Cache Optimization

Examples of system caching:
- Processor caches
    - Administered by the processor itself
    - Moves data from DRAM into SRAM cache memory
- Buffer cache
    - Administered by the kernel
    - Moves data from secondary storage devices into main DRAM memory
- Stdio cache
    - Administered at the application level (by the C standard library)
    - Moves data from secondary storage into the application's buffer cache
- Browser cache
    - Administered by the browser (application level)
    - Brings remote websites over the network onto local storage devices
Expense of storage
As we discussed in Memory Hierarchy, the battle is between the financial feasibility of a storage technology and the performance it provides.

Access cost framework
- $C_N$: cost of accessing $N$ units of data at once (or $N \cdot C_1$ if it's impossible to access those units at once)
- Cost is often measured in terms of time (latency)
- As an estimate, many access costs are approximated by $C_N \approx R + NU$
    - $R$ is the cost per request, independent of the number of units accessed by said request
    - $U$ is the cost per unit
System calls are a good example: consider the cost of a read(fd, buf, sz) system call.
Assuming that a read roughly obeys the rule stated above, will the cost per request be higher than the cost per unit, or vice versa? The sketch below is one way to find out empirically.
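A minimal sketch, not anything authoritative: it assumes a file named data.bin of at least 1 MiB exists (the filename, block sizes, and total are made up), and reads the same total amount using different block sizes, so per-request overhead shows up as the small-block runs taking disproportionately longer.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    const size_t total = 1 << 20;             /* read 1 MiB in each run */
    const size_t sizes[] = {1, 64, 4096, 65536};
    char *buf = malloc(65536);

    for (int i = 0; i < 4; i++) {
        int fd = open("data.bin", O_RDONLY);  /* hypothetical input file */
        if (fd < 0) { perror("open"); return 1; }

        double t0 = now_sec();
        size_t done = 0;
        while (done < total) {
            ssize_t n = read(fd, buf, sizes[i]);
            if (n <= 0) break;                /* EOF or error: stop early */
            done += (size_t)n;
        }
        double elapsed = now_sec() - t0;

        printf("block=%6zu  requests=%8zu  time=%.6fs\n",
               sizes[i], total / sizes[i], elapsed);
        close(fd);
    }
    free(buf);
    return 0;
}
```

On typical systems the 1-byte run is dramatically slower, because each read() pays the mode-switch and kernel-dispatch cost: for system calls, $R$ dwarfs $U$.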
More than cost
Other metrics we have considered in Memory Hierarchy are latency and throughput, which are inverses of each other for some storage technologies.
For example, on a true random access device the per-request cost is $R \approx 0$, which makes the cost of accessing $N$ units just $C_N \approx NU$.
For other technologies (like SSDs), latency and throughput can diverge!
- Assume the per-request cost is much smaller than the per-unit cost ($R \ll U$)
    - Example: main memory access
    - Then $C_N \approx NU$
    - Throughput with $N$-unit requests $= N/C_N \approx 1/U$
    - Latency $= C_1 \approx U$
- Assume the per-unit cost is much smaller than the per-request cost ($U \ll R$)
    - Examples: system calls, reading/writing a disk drive
    - Then $C_N \approx R$
    - Throughput with $N$-unit requests $= N/C_N \approx N/R$
    - Latency $= C_1 \approx R$
- Assume per-unit and per-request costs are comparable (neither dominates)
    - Example: network communication
    - Then $C_N = R + NU$ depends on both $R$ and $U$
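A quick worked example with assumed numbers (both costs are made up, roughly syscall-plus-copy territory): take $R = 10\,\mu\text{s}$ per request and $U = 0.01\,\mu\text{s}$ per unit.

$$C_1 \approx 10 + 0.01 \approx 10\,\mu\text{s} \qquad C_{4096} \approx 10 + 40.96 \approx 51\,\mu\text{s}$$

Latency per request grows about 5x, but per-unit throughput jumps from $1/C_1 \approx 0.1$ units/$\mu$s to $4096/C_{4096} \approx 80$ units/$\mu$s, roughly an 800x improvement. That divergence is exactly why the strategies below focus on the number of requests.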
Optimization strats
- Reducing the number of requests
    - Avoiding redundant requests
    - Do the same work in fewer requests
- Batching (read and write)
- Prefetching (read)
- Write coalescing (write)
    - See eMMCs
- Parallel access (read and write)
Batching
- Combine multiple requests into a single one
    - e.g., reading a 1 KB block at once instead of reading 1 byte 1000 times
- Reduces the total per-request cost: pay $R$ once instead of 1000 times (see the sketch below)
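A minimal sketch of that comparison (error and short-read handling deliberately omitted; /dev/zero just serves as an always-readable byte source):

```c
#include <fcntl.h>
#include <unistd.h>

/* 1000 requests: total cost ~ 1000R + 1000U */
static void read_one_at_a_time(int fd, char *buf) {
    for (int i = 0; i < 1000; i++)
        read(fd, buf + i, 1);   /* short-read/error handling omitted */
}

/* 1 request: total cost ~ R + 1000U */
static void read_batched(int fd, char *buf) {
    read(fd, buf, 1000);
}

int main(void) {
    char buf[1000];
    int fd = open("/dev/zero", O_RDONLY);  /* always-readable byte source */
    read_one_at_a_time(fd, buf);
    read_batched(fd, buf);
    close(fd);
    return 0;
}
```

Both functions move the same 1000 bytes; the batched one just pays the per-request cost once.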
Prefetching
- Read data into cache before we need it
- Reduces the number of requests the application has to wait on
- Often requires some context to predict what data will be needed next, which adds complexity (see the sketch below)
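One hedged example of prefetching on Linux/POSIX: posix_fadvise() with POSIX_FADV_WILLNEED asks the kernel to start reading a range into its buffer cache ahead of time. It is purely advisory (the kernel may ignore it), and data.bin and the 1 MiB chunk size are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* 1 MiB per chunk (arbitrary choice) */

int main(void) {
    int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    for (off_t off = 0;; off += CHUNK) {
        /* Advisory hint: start pulling the *next* chunk into the kernel's
         * buffer cache while we process the current one. */
        posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);

        ssize_t n = pread(fd, buf, CHUNK, off);
        if (n <= 0) break;   /* EOF or error; regular files return full
                                chunks except at the tail */
        /* ... process buf[0 .. n) here ... */
    }
    free(buf);
    close(fd);
    return 0;
}
```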
Write coalescing
- Delay writes to the underlying storage until several writes have accumulated in the cache, then flush them back in one request (see the sketch below)
- EXT4 Filesystem does some automatic coalescing when interacting with flash devices
    - Although this kinda sucks on the eMMC setup I have in mind...
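A sketch of application-level write coalescing, roughly what the stdio cache does internally (the buffer size and the wappend/wflush helpers are my own invented names, and error handling is omitted):

```c
#include <string.h>
#include <unistd.h>

#define WBUF_SIZE 4096          /* flush threshold: one page (arbitrary) */

static char wbuf[WBUF_SIZE];
static size_t wlen = 0;

/* Flush everything accumulated so far with a single write(). */
static void wflush(int fd) {
    if (wlen > 0) {
        write(fd, wbuf, wlen);  /* error/short-write handling omitted */
        wlen = 0;
    }
}

/* Append data; only touch the underlying fd when the buffer fills. */
static void wappend(int fd, const char *data, size_t n) {
    if (wlen + n > WBUF_SIZE)
        wflush(fd);
    if (n >= WBUF_SIZE) {       /* oversized write: bypass the buffer */
        write(fd, data, n);
        return;
    }
    memcpy(wbuf + wlen, data, n);
    wlen += n;
}

int main(void) {
    /* 10000 one-byte appends become just 3 write() system calls. */
    for (int i = 0; i < 10000; i++)
        wappend(STDOUT_FILENO, "x", 1);
    wflush(STDOUT_FILENO);
    return 0;
}
```

The tradeoff is durability: coalesced data sits in memory until the flush, so a crash loses it, which is why real systems pair this with fsync() or journaling.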
Parallel or background access
- Benefits from hardware that can service multiple requests concurrently
- Can also involve added abstractions at the application level, like threads or asynchronous I/O (see the sketch below)
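A minimal sketch of parallel access with POSIX threads (data.bin and the 1 MiB split are assumptions; compile with -pthread): two threads each issue a pread() for their half of the file, so two requests can be in flight at once on a device that supports it (e.g., an SSD).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HALF (1 << 20)   /* bytes per thread; assumes the file is >= 2 MiB */

struct job { int fd; off_t off; };

static void *reader(void *arg) {
    struct job *j = arg;
    char *buf = malloc(HALF);
    /* pread() takes an explicit offset, so the two threads don't race
     * on a shared file position. */
    ssize_t n = pread(j->fd, buf, HALF, j->off);
    printf("offset %lld: read %zd bytes\n", (long long)j->off, n);
    free(buf);
    return NULL;
}

int main(void) {
    int fd = open("data.bin", O_RDONLY);  /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    struct job jobs[2] = { { fd, 0 }, { fd, HALF } };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, reader, &jobs[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);   /* both requests were in flight at once */
    close(fd);
    return 0;
}
```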