- Introduction
- Architecture Overview
- Core Components
- Utility Components
- Algorithms & Policies
- Implementation Details
- Testing Strategy
- Future Extensions
The Cache Simulator is a comprehensive simulation tool for studying memory hierarchy behavior in computer systems. It models a two-level cache hierarchy with configurable parameters and advanced features like prefetching, cache coherence, and detailed statistics tracking. The simulator is designed to be highly customizable, allowing researchers and students to experiment with different cache configurations and analyze their performance.
The simulator takes as input a memory access trace that describes a sequence of read and write operations with their corresponding memory addresses. It then simulates how these accesses would flow through the cache hierarchy, tracking hits, misses, replacements, and other events. Various statistics are collected during the simulation, which can be used to evaluate the effectiveness of different cache designs and policies.
This document describes the design of the Cache Simulator, including its architecture, components, algorithms, and implementation details.
The Cache Simulator employs a modular, layered architecture that separates core cache functionality from utilities and user interface components:
┌─────────────────────────────────────────────────┐
│                  Applications                   │
│ (Main Program, Tests, Trace Generator, Scripts) │
└───────────────────────┬─────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────┐
│              Memory Hierarchy API               │
│     (MemoryHierarchy, Configuration, Stats)     │
└───────────────────────┬─────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────┐
│                 Core Components                 │
│ (Cache, StreamBuffer, Prefetchers, MESI, etc.)  │
└───────────────────────┬─────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────┐
│               Utility Components                │
│  (Trace Parsing, Logging, Visualization, etc.)  │
└─────────────────────────────────────────────────┘
The design follows several key architectural principles:
- Separation of Concerns: Each component has a focused responsibility.
- Modularity: Components can be developed, tested, and replaced independently.
- Extensibility: New policies, algorithms, and features can be easily added.
- Configurability: Extensive runtime customization options are available.
- Modern C++ Design: Leveraging C++17 features for safer, more expressive code.
The foundational data structures in the Cache Simulator are `CacheBlock` and `CacheSet`:
- `CacheBlock` represents a single block in the cache, containing:
  - Valid and dirty flags
  - Tag bits for address matching
  - MESI state for cache coherence
  - Optional data storage (for functional simulation)
- `CacheSet` represents a set of blocks in a set-associative cache, containing:
  - A vector of CacheBlocks
  - LRU ordering information for replacement
These structures are designed to be memory-efficient while still providing all necessary state information for accurate simulation.
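A minimal sketch of how these structures might be declared follows; the field names and the use of `std::optional` for the data payload are illustrative, not the simulator's exact definitions.
```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

namespace cachesim {

enum class MESIState { Modified, Exclusive, Shared, Invalid };

// One cache block: metadata only by default; the data payload is optional
// and populated only for functional simulation.
struct CacheBlock {
    bool valid = false;
    bool dirty = false;
    uint64_t tag = 0;
    MESIState state = MESIState::Invalid;
    std::optional<std::vector<uint8_t>> data;  // absent in timing-only runs
};

// One set of a set-associative cache: the blocks plus LRU ordering,
// where lru_order.front() is the most recently used way.
struct CacheSet {
    std::vector<CacheBlock> blocks;        // one entry per way
    std::vector<std::size_t> lru_order;    // way indices, MRU first

    explicit CacheSet(std::size_t associativity) : blocks(associativity) {
        for (std::size_t way = 0; way < associativity; ++way) {
            lru_order.push_back(way);
        }
    }
};

}  // namespace cachesim
```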
The `Cache` class is a central component representing a single cache level (L1 or L2). It manages:
- Cache organization (size, associativity, block size)
- Block lookup and placement
- Replacement policy implementation
- Prefetching integration
- Hit/miss statistics collection
- MESI protocol state transitions
The `Cache` class provides a clean `access()` method that handles all the complex operations of a cache access while maintaining detailed statistics.
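A hedged sketch of the kind of interface this implies is shown below; `access()` is named in the text, while the enum and accessor names are assumptions for illustration and may differ from the actual class.
```cpp
#include <cstdint>

namespace cachesim {

enum class AccessType { Read, Write };
enum class AccessResult { Hit, Miss };

// Illustrative interface only; the real class carries many more
// configuration, prefetching, and statistics members.
class Cache {
public:
    Cache(uint64_t size_bytes, uint32_t associativity, uint32_t block_size);

    // Performs a single access: looks up the block, updates replacement
    // state, triggers prefetching on a miss, and records statistics.
    AccessResult access(uint64_t address, AccessType type);

    uint64_t hits() const;
    uint64_t misses() const;
};

}  // namespace cachesim
```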
The `MemoryHierarchy` class orchestrates the entire cache hierarchy, managing:
- Multiple cache levels (L1, L2)
- Memory access flow through the hierarchy
- Prefetching configuration and coordination
- Statistics aggregation
- Trace file processing
- Performance metrics calculation
This class serves as the main API for applications interacting with the simulator, providing methods to access memory, process traces, and retrieve statistics.
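To illustrate, a plausible usage pattern from an application's point of view might look like the sketch below; the header name, `CacheConfig`, `process_trace`, and `print_stats` are assumed names for illustration, not documented API.
```cpp
#include <iostream>

#include "memory_hierarchy.h"  // hypothetical header; the real layout may differ

int main() {
    // Configure a small two-level hierarchy (values are illustrative).
    cachesim::CacheConfig l1{/*size=*/32 * 1024, /*assoc=*/4, /*block=*/64};
    cachesim::CacheConfig l2{/*size=*/256 * 1024, /*assoc=*/8, /*block=*/64};

    cachesim::MemoryHierarchy hierarchy(l1, l2);

    // Drive the hierarchy from a trace file and report aggregate results.
    hierarchy.process_trace("workload.trace");
    hierarchy.print_stats(std::cout);
    return 0;
}
```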
The `StreamBuffer` class implements a simple hardware prefetching mechanism:
- Prefetches sequential blocks following a cache miss
- Maintains a buffer of prefetched addresses
- Tracks hits in the buffer
- Shifts out consumed entries
- Provides statistics on buffer effectiveness
Stream buffers are a simple but effective prefetching technique that works well for sequential access patterns.
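A minimal sketch of this behavior, assuming block-aligned prefetch addresses and a fixed buffer depth (names and parameters are illustrative):
```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

namespace cachesim {

// Simple sequential stream buffer: on a cache miss it is refilled with
// the next N block addresses; hits shift consumed entries out the front.
class StreamBuffer {
public:
    StreamBuffer(std::size_t depth, uint64_t block_size)
        : depth_(depth), block_size_(block_size) {}

    // Refill the buffer with the blocks following the missed address.
    void allocate(uint64_t miss_address) {
        entries_.clear();
        uint64_t block = miss_address / block_size_;
        for (std::size_t i = 1; i <= depth_; ++i) {
            entries_.push_back((block + i) * block_size_);
        }
    }

    // Check whether an access hits in the buffer; if so, consume the entry.
    bool lookup(uint64_t address) {
        uint64_t block_addr = (address / block_size_) * block_size_;
        if (!entries_.empty() && entries_.front() == block_addr) {
            entries_.pop_front();
            ++hits_;
            return true;
        }
        return false;
    }

    uint64_t hits() const { return hits_; }

private:
    std::size_t depth_;
    uint64_t block_size_;
    std::deque<uint64_t> entries_;
    uint64_t hits_ = 0;
};

}  // namespace cachesim
```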
The `StridePredictor` class implements a more sophisticated prefetching mechanism:
- Detects constant stride patterns in memory accesses
- Maintains a table of recent addresses and strides
- Predicts future accesses based on detected strides
- Provides confidence levels for predictions
- Tracks prediction accuracy statistics
Stride prediction is effective for regular access patterns with non-unit strides, such as array traversals in multidimensional arrays.
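The core detection loop might look like the sketch below; the table keying, confidence threshold, and saturation limit are illustrative assumptions rather than the simulator's exact values.
```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

namespace cachesim {

// Per-entry state: the last address seen, the last observed stride,
// and a small saturating confidence counter.
struct StrideEntry {
    uint64_t last_address = 0;
    int64_t stride = 0;
    int confidence = 0;
};

class StridePredictor {
public:
    // Record an access; if the stride is confident, return a predicted address.
    std::optional<uint64_t> observe(uint64_t key, uint64_t address) {
        StrideEntry& e = table_[key];
        int64_t new_stride = static_cast<int64_t>(address) -
                             static_cast<int64_t>(e.last_address);
        if (new_stride == e.stride && new_stride != 0) {
            if (e.confidence < 3) ++e.confidence;  // saturate at 3
        } else {
            e.stride = new_stride;                 // new pattern: reset confidence
            e.confidence = 0;
        }
        e.last_address = address;

        if (e.confidence >= 2) {                   // confident enough to predict
            return address + e.stride;
        }
        return std::nullopt;
    }

private:
    std::unordered_map<uint64_t, StrideEntry> table_;
};

}  // namespace cachesim
```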
The `AdaptivePrefetcher` class implements an advanced prefetching mechanism:
- Dynamically selects between prefetching strategies
- Adjusts prefetch distance based on effectiveness
- Tracks different metrics to guide adaptation
- Supports multiple concurrently active strategies
- Maintains detailed effectiveness statistics
Adaptive prefetching can deliver good performance across a wider range of access patterns compared to fixed-strategy prefetchers.
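One way such adaptation could work in outline is sketched below: count how many issued prefetches turn out to be useful over an interval, then raise or lower the prefetch degree. The thresholds and limits are placeholders, not the simulator's tuned values.
```cpp
#include <cstdint>

namespace cachesim {

// Illustrative feedback loop: periodically compare useful prefetches
// against issued prefetches and adjust aggressiveness.
class AdaptiveDegreeController {
public:
    void record_issued() { ++issued_; }
    void record_useful() { ++useful_; }

    // Called at the end of each evaluation interval.
    void adapt() {
        if (issued_ == 0) return;
        double accuracy = static_cast<double>(useful_) / issued_;
        if (accuracy > 0.75 && degree_ < 8) {
            ++degree_;             // prefetches are paying off: be more aggressive
        } else if (accuracy < 0.25 && degree_ > 1) {
            --degree_;             // mostly wasted: back off
        }
        issued_ = useful_ = 0;     // start a new interval
    }

    uint32_t degree() const { return degree_; }

private:
    uint32_t degree_ = 2;
    uint64_t issued_ = 0;
    uint64_t useful_ = 0;
};

}  // namespace cachesim
```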
The `MESIProtocol` class implements the Modified-Exclusive-Shared-Invalid cache coherence protocol:
- Manages state transitions for cache blocks
- Handles local reads and writes
- Processes remote cache operations
- Coordinates writebacks for modified blocks
- Tracks protocol state transition statistics
The MESI protocol ensures coherent memory views in systems with multiple caches accessing shared memory.
The `TraceParser` class handles trace file input:
- Parses different trace file formats
- Validates memory access operations
- Provides detailed error information
- Offers both single-operation and bulk parsing
- Uses efficient, streaming file reading
This component decouples trace file handling from the core simulation logic.
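For illustration, assuming a simple text format with one `r`/`w` operation and a hexadecimal address per line (the simulator's actual formats may differ), parsing a single line could look like this:
```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

namespace cachesim {

enum class AccessType { Read, Write };

struct MemoryAccess {
    AccessType type;
    uint64_t address;
};

// Parse one trace line such as "r 0x7fff1234" or "w 0x1000".
// Returns std::nullopt for malformed lines so callers can report errors.
inline std::optional<MemoryAccess> parse_line(const std::string& line) {
    std::istringstream in(line);
    char op = 0;
    std::string addr_text;
    if (!(in >> op >> addr_text)) return std::nullopt;

    MemoryAccess access{};
    if (op == 'r' || op == 'R') access.type = AccessType::Read;
    else if (op == 'w' || op == 'W') access.type = AccessType::Write;
    else return std::nullopt;

    try {
        access.address = std::stoull(addr_text, nullptr, 16);
    } catch (const std::exception&) {
        return std::nullopt;  // not a valid hexadecimal address
    }
    return access;
}

}  // namespace cachesim
```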
The `Statistics` class manages simulation metrics:
- Collects and organizes various numerical metrics
- Calculates derived statistics
- Provides data export capabilities
- Supports metric comparison between runs
- Offers customizable reporting formats
This utility enables detailed analysis of simulation results.
The `Visualization` class provides text-based visualization:
- Generates ASCII visualizations of cache state
- Renders histograms and heatmaps of memory access patterns
- Supports colorized output where available
- Exports data for external visualization tools
- Provides progress indicators for long simulations
These visualizations aid in understanding simulator behavior.
The `ConfigManager` class manages configuration parameters:
- Loads configurations from files (JSON, INI)
- Parses command-line arguments
- Validates configuration parameters
- Generates parameter sweeps for benchmarking
- Supports configuration comparison
This utility simplifies experimentation with different simulator settings.
The `Logger` class provides diagnostic information:
- Supports multiple log levels
- Offers both file and console output
- Provides thread-safe logging
- Allows component-specific named loggers
- Enables selectively verbose output
Logging helps diagnose issues and understand simulator behavior.
The `Benchmark` class supports performance measurement:
- Times the execution of simulation functions
- Compares multiple configurations
- Calculates statistical measures of performance
- Provides formatted benchmark reports
- Supports repeated runs for variance analysis
This utility is useful for simulator development and optimization.
The `MemoryProfiler` class analyzes memory access patterns:
- Tracks memory access frequencies
- Detects common access patterns
- Profiles locality characteristics
- Identifies hot regions in memory
- Provides insights for optimization
This tool helps understand workload behavior and aids in prefetcher design.
The simulator implements several cache block replacement policies:
- LRU (Least Recently Used): Replaces the block that has not been accessed for the longest time. This is the default policy and is implemented using an ordering vector in each cache set.
- Pseudo-LRU: A tree-based approximation of LRU that requires less state information but still provides good performance.
- Random: Randomly selects a block for replacement, which can be effective for certain workloads and eliminates pathological cases that affect deterministic policies.
- FIFO (First-In-First-Out): Replaces the oldest block regardless of access frequency.
- NRU (Not Recently Used): A simple approximation of LRU using a single bit per block to indicate recent usage.
The policy can be configured on a per-cache basis.
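Since LRU is described above in terms of an ordering vector per set, a small sketch of that bookkeeping follows; it reuses the `CacheSet` layout sketched earlier and is illustrative rather than the exact implementation.
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace cachesim {

// Move the accessed way to the front of the ordering vector (MRU position).
inline void touch_lru(std::vector<std::size_t>& lru_order, std::size_t way) {
    auto it = std::find(lru_order.begin(), lru_order.end(), way);
    if (it != lru_order.end()) {
        lru_order.erase(it);
    }
    lru_order.insert(lru_order.begin(), way);
}

// Under LRU, the victim is the way at the back of the ordering vector.
inline std::size_t lru_victim(const std::vector<std::size_t>& lru_order) {
    return lru_order.back();
}

}  // namespace cachesim
```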
The simulator supports several prefetching algorithms:
- Sequential Prefetching: Prefetches a configurable number of sequential blocks after a miss. Implemented by the `StreamBuffer` class.
- Stride Prefetching: Detects regular strides in access patterns and prefetches accordingly. Implemented by the `StridePredictor` class.
- Adaptive Prefetching: Dynamically adjusts prefetching strategy and aggressiveness based on effectiveness. Implemented by the `AdaptivePrefetcher` class.
- Next-N-Line Prefetching: A simple variation that prefetches the next N blocks on any miss.
- PC-based Prefetching: Uses program counter information (if available in the trace) to predict access patterns specific to different code regions.
These algorithms can be combined in various ways and customized with different parameters.
The MESI protocol maintains cache coherence through four block states:
- Modified (M): The block is present only in the current cache and has been modified. It must be written back to memory when evicted.
- Exclusive (E): The block is present only in the current cache and matches memory.
- Shared (S): The block may be present in other caches and is clean.
- Invalid (I): The block is not present or valid in the cache.
The protocol handles transitions between these states based on local and remote operations, ensuring coherent memory views across caches.
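A compact sketch of the local-access side of these transitions follows; bus signalling, remote invalidations, and writebacks are deliberately omitted, so this is not the full protocol implementation.
```cpp
namespace cachesim {

enum class MESIState { Modified, Exclusive, Shared, Invalid };

// State after a local read, given whether any other cache holds the block.
// Illustrative only: a real implementation also emits bus transactions.
inline MESIState on_local_read(MESIState current, bool shared_elsewhere) {
    switch (current) {
        case MESIState::Invalid:
            // Miss: fetch the block; E if we are the sole holder, S otherwise.
            return shared_elsewhere ? MESIState::Shared : MESIState::Exclusive;
        case MESIState::Modified:
        case MESIState::Exclusive:
        case MESIState::Shared:
            return current;  // read hits do not change the state
    }
    return current;
}

// State after a local write: the local copy becomes Modified; other
// caches must be invalidated first (not shown here).
inline MESIState on_local_write(MESIState /*current*/) {
    return MESIState::Modified;
}

}  // namespace cachesim
```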
The simulator leverages several C++17 features:
- `std::optional`: Used to represent potentially missing values, such as nullable parameters and results.
- `std::variant`: Used for type-safe unions, particularly in return values that can represent different outcomes.
- `std::string_view`: Used for non-owning references to strings, improving performance in text processing.
- `std::filesystem`: Used for portable file operations in trace handling and result saving.
- Structured Bindings: Used to simplify code that returns multiple values, like tag and set index extraction.
- `if constexpr`: Used for compile-time conditional code in templates, avoiding macros and improving performance.
- Inline Variables: Used for cleaner declaration of constants and static members.
- `[[nodiscard]]`: Used to ensure callers handle the return values of functions that should not be called purely for their side effects.
- `auto` return type deduction: Simplifies function declarations while maintaining type safety.
These features improve code safety, clarity, and performance.
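As a small example of the structured-bindings point applied to the tag and set-index extraction mentioned above, a decomposition helper might look like this (the bit widths are illustrative, not the simulator's configured values):
```cpp
#include <cstdint>
#include <tuple>

namespace cachesim {

// Decompose an address into tag, set index, and block offset.
// Widths are illustrative: 64-byte blocks (6 offset bits), 64 sets (6 index bits).
inline std::tuple<uint64_t, uint64_t, uint64_t> decompose(uint64_t address) {
    constexpr uint64_t offset_bits = 6;
    constexpr uint64_t index_bits = 6;
    const uint64_t offset = address & ((1ULL << offset_bits) - 1);
    const uint64_t index = (address >> offset_bits) & ((1ULL << index_bits) - 1);
    const uint64_t tag = address >> (offset_bits + index_bits);
    return {tag, index, offset};
}

}  // namespace cachesim

// Structured bindings make the call site read naturally:
//   auto [tag, set_index, offset] = cachesim::decompose(address);
```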
Several optimizations are employed:
- Memory Pool Allocation: Cache blocks are allocated from a memory pool to reduce dynamic allocation overhead.
- Bit Manipulation: Fast bit operations are used for address decomposition and tag comparison.
- Vectorized Processing: Common operations are vectorized when processing multiple blocks.
- Cache-Friendly Data Structures: Data is organized to maximize spatial locality.
- Custom Hash Functions: Optimized hash functions for address lookup.
- String Interning: Reuse of common strings to reduce memory usage.
- Streaming I/O: Efficient trace file reading without loading entire files.
- Compiler Hints: Strategic use of likely/unlikely hints for branch prediction.
These optimizations significantly improve simulation speed, especially for large traces.
The code is organized following modern C++ practices:
- Namespace Usage: All components are within the `cachesim` namespace to avoid namespace pollution.
- Header-Only When Appropriate: Small utilities are implemented as header-only for simplicity.
- Separation of Interface and Implementation: Larger components separate declaration and definition.
- PIMPL Pattern: Used for components with complex internal details to improve compilation speed.
- Consistent Naming: Camel case for class names, snake case for methods and variables.
- Const Correctness: Rigorous use of const for parameters and methods.
- Forward Declarations: Used to minimize include dependencies.
- Explicit Memory Management: Clear ownership semantics with smart pointers.
- Modular Design: Each component has a single responsibility.
- Documentation: Comprehensive comments for all public interfaces.
This organization enhances maintainability, readability, and compile times.
The simulator employs a multi-level testing approach:
- Unit Tests: Verify individual components in isolation.
  - Test each cache operation separately
  - Verify replacement policy behavior
  - Validate prefetcher predictions
  - Check MESI state transitions
- Integration Tests: Verify interactions between components.
  - Test L1-L2 interactions
  - Verify prefetcher-cache coordination
  - Check the coherence protocol across caches
- Validation Tests: Verify the simulator against known outcomes.
  - Test with hand-crafted traces that have predictable results
  - Compare against simplified analytical models
  - Validate against academic benchmark results
- Property-Based Tests: Verify invariants and properties.
  - Ensure cache coherence is maintained
  - Verify inclusion properties when applicable
  - Check conservation of blocks
- Performance Tests: Measure and optimize performance.
  - Benchmark with large traces
  - Profile memory usage
  - Analyze scalability with increasing parameters
This comprehensive testing strategy ensures correctness and robustness.
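As an example of the hand-crafted-trace idea in the validation bullet, a tiny self-checking test might look like the sketch below; it uses plain asserts rather than a particular test framework and assumes the illustrative `Cache` interface sketched earlier.
```cpp
#include <cassert>
#include <cstdint>

#include "cache.h"  // hypothetical header for the Cache interface sketched earlier

int main() {
    using namespace cachesim;

    // Direct-mapped 1 KiB cache with 64-byte blocks: 16 sets.
    Cache cache(/*size_bytes=*/1024, /*associativity=*/1, /*block_size=*/64);

    // First access to a block must miss; a repeat access within the same
    // block must hit; an address mapping to the same set with a different
    // tag evicts it, so the original address misses again.
    assert(cache.access(0x0000, AccessType::Read) == AccessResult::Miss);
    assert(cache.access(0x0004, AccessType::Read) == AccessResult::Hit);
    assert(cache.access(0x0400, AccessType::Read) == AccessResult::Miss);  // same set, new tag
    assert(cache.access(0x0000, AccessType::Read) == AccessResult::Miss);  // evicted

    return 0;
}
```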
Planned extensions include:
- Additional Cache Levels: Support for an L3 cache and beyond.
- Victim Cache: Implementation of a small fully-associative victim cache.
- Non-Inclusive Policies: Support for non-inclusive and exclusive cache hierarchies.
- Replacement Policy Framework: A pluggable framework for custom replacement policies.
- Advanced Prefetching Algorithms: Implementation of more sophisticated prefetchers like GHB and Markov predictors.
- GPU Cache Modeling: Extensions for modeling GPU-specific cache architectures.
- Multi-Core Simulation: Support for simulating multiple cores with shared caches.
- Graphical Interface: Development of a GUI for visualization and configuration.
- Dynamic Trace Generation: Integration with instruction-level simulators.
- Machine Learning Integration: Using ML techniques for prefetching and replacement.
These extensions will enhance the simulator's capabilities and keep it relevant for future research.