Cache is one of the most important features of a processor, and it is a specification that manufacturers always publish. You may never have stopped to think about what a cache is and what it does, so we have prepared this article to clear up those doubts in a simple and understandable way.
The purpose of this article is not to explore every concept related to this memory system in depth, but to explain it in a clear and understandable way.
What is CPU cache?
Before we look at what a cache is, we need to clarify how the processor in our computers works. Put very simply, it fetches the data and instructions it needs to run from RAM.
When the processor needs information to perform its tasks, it requests that information from RAM, which delivers it as quickly as it can. This process is not instantaneous: because of latency it takes a number of clock cycles, on the order of tens of nanoseconds. That may seem like nothing to us, but for a processor it is an eternity, and while it waits it loses the opportunity to do other calculations and operations.
The cache was created to solve this problem. It is nothing more than a small amount of memory inside the processor whose purpose is to give the processor access to information as fast as possible. Because the cache is inside the processor, the information has to travel a very short distance to reach the place where it is processed, so the access time is much shorter than for RAM.
The amount of this memory is very small compared to RAM: a current high-end PC may have many gigabytes of RAM, but cache sizes are usually measured in KB and MB. This matters because access time tends to grow with the size of a memory. Accessing the cache is therefore much faster than accessing RAM, both because it is smaller and because it is closer, so the data has less distance to travel.
Processor cache levels
Processor cache is organized in several levels. Most current processors have three levels of this memory, known as L1, L2 and L3 cache. The lower levels are the fastest but have less capacity, while the higher levels are a little further from the execution units and take a few more cycles to access, but have greater capacity.
In addition, the L1 cache is usually split into an instruction L1 (L1i), where only instructions are stored, and a data L1 (L1d), which holds only data, while L2 and higher levels are unified, i.e. shared by data and instructions.
The higher levels on multi-core processors may or may not be shared between cores. For example, an 8-core processor may have one L1 per core (eight in total), an L2 shared by each pair of cores, and an L3 shared by all eight.
Finally, the cache relies on two key ideas to speed up processing: proximity and speed. To reduce latency, it is located close to the processing units or cores, so they do not have to fetch what they need from RAM or from virtual memory. On the other hand, its memory cells have much faster access times than those of other types of memory, which makes them very expensive, so caches do not have very large capacities because the cost would be too high.
To give you an idea, accessing main memory or RAM can take about 100 clock cycles, which is equivalent to about 50 ns at a clock speed of 2 GHz. It may not seem like much, but those are 100 cycles in which the processor could have executed dozens of instructions and instead has to wait. By contrast, thanks to its low latency, accessing L1 may take only about 3-5 cycles, while L2 can take 8-20 cycles and L3 30-80 cycles. This time saving translates into a pretty significant performance boost.
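To put these cycle counts into time units, the short sketch below converts them to nanoseconds. The 2 GHz clock is an assumption chosen because it makes 100 cycles come out at 50 ns; the function name and the table of ranges are only illustrative.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    At clock_ghz gigahertz, one cycle lasts 1 / clock_ghz nanoseconds.
    """
    return cycles / clock_ghz


CLOCK_GHZ = 2.0  # assumed clock frequency, chosen for illustration

# Approximate latency ranges quoted in the text, in clock cycles
LATENCY_CYCLES = {"L1": (3, 5), "L2": (8, 20), "L3": (30, 80), "RAM": (100, 100)}

for level, (low, high) in LATENCY_CYCLES.items():
    low_ns = cycles_to_ns(low, CLOCK_GHZ)
    high_ns = cycles_to_ns(high, CLOCK_GHZ)
    print(f"{level}: {low_ns:.1f} to {high_ns:.1f} ns")
```

On a faster clock the same number of cycles takes less time, so these figures should be read as orders of magnitude rather than exact values.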
How does the memory hierarchy work?
When the cache contains the data or instruction that the CPU is looking for, a hit is said to occur; otherwise a miss is said to occur. A system will perform better or worse depending on its hit rate and miss rate. Note that the larger the cache, the more likely it is that the requested information is still there and has not been evicted to make room for something else. The hit rate depends not only on the size of the cache, but also on the policies and algorithms it uses, among other factors.
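To see how much the hit rate matters, here is a rough sketch that estimates the average cost of one memory access. It is a simplified model, not a description of any real processor: it assumes each level's latency is paid whenever that level is consulted, and the hit rates (90%, 80%, 50%) are made-up values for illustration.

```python
def average_access_cycles(levels, memory_cycles):
    """Estimate the average number of cycles per memory access.

    levels: list of (latency_cycles, hit_rate) tuples, ordered from L1 outward.
    memory_cycles: latency of main memory, paid when every level misses.
    """
    total = 0.0
    reach_probability = 1.0  # probability that a request gets as far as this level
    for latency, hit_rate in levels:
        total += reach_probability * latency
        reach_probability *= 1.0 - hit_rate  # only misses continue outward
    total += reach_probability * memory_cycles
    return total


# Latencies taken from the ranges in the text; hit rates are assumptions
hierarchy = [(4, 0.9), (12, 0.8), (40, 0.5)]

print(f"with caches:    {average_access_cycles(hierarchy, 100):.1f} cycles")
print(f"without caches: {average_access_cycles([], 100):.1f} cycles")
```

With these assumed numbers the average access costs roughly 7 cycles instead of 100, which is why even a modest hit rate at each level pays off.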
Let’s imagine a CPU with 3 levels of cache that needs to find an instruction and a piece of data in order to execute. The search proceeds as follows:
- L1d and L1i: The CPU searches L1 first. If what it is looking for is there, because it was loaded by a previous access, a hit occurs and the data and instruction are obtained quickly. If not, a miss occurs.
- L2: When L1 misses, the CPU next searches L2 for the data and instruction (remember that this memory is unified). If they are found here, a hit occurs and they are fetched for execution. Otherwise, L2 misses and the search moves to the next level.
- L3 (LLC): The CPU looks for the information it needs in this last-level cache. If it is found, a hit occurs and it is fetched; if not, a miss occurs.
- Main memory or RAM & I/O: If what is requested is not found in the LLC, the CPU looks for it in RAM, which takes more cycles due to the higher latency. It is very likely to be there. However, this may not be the case…
- Secondary memory (swap or virtual memory): In that case the operating system must move the relevant part of the process from swap or virtual memory into RAM, giving priority to the data and instruction being sought. Once loaded into RAM, the CPU can access it. This step involves the largest number of wait cycles and is to be avoided.
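The search order described above can be sketched as a small simulation. This is a toy model under several assumptions: each cache level is a plain dictionary with no capacity limit, the latencies are picked from the ranges mentioned earlier, the swap step is left out, and the names lookup and RAM_CYCLES are invented for this example.

```python
RAM_CYCLES = 100  # assumed main memory latency in cycles


def lookup(address, levels, ram):
    """Search each cache level in order and fall back to RAM if all miss.

    levels: list of (name, latency_cycles, store) ordered from L1 outward,
            where store is a dict mapping address -> value.
    Returns (value, cycles_spent, where_found).
    """
    cycles = 0
    missed = []  # stores of the levels that did not have the address
    for name, latency, store in levels:
        cycles += latency
        if address in store:
            value, found = store[address], name
            break
        missed.append(store)
    else:
        # Every cache level missed, so pay the RAM latency as well
        cycles += RAM_CYCLES
        value, found = ram[address], "RAM"
    # Copy the value into the levels that missed so the next access is faster
    for store in missed:
        store[address] = value
    return value, cycles, found


levels = [("L1", 4, {}), ("L2", 12, {}), ("L3", 40, {})]
ram = {0x40: "some data"}

print(lookup(0x40, levels, ram))  # first access misses everywhere
print(lookup(0x40, levels, ram))  # second access hits in L1
```

The first access pays for every level plus RAM, while the repeated access is served by L1, which is the effect the hierarchy is designed to produce.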
Caches can be classified according to various parameters, such as:
Depending on its use
Depending on what the cache is used for, the following types can be distinguished:
- Scratchpad memory: A very fast type of local memory used for temporary storage of some data, intermediate calculation results, etc. This memory is similar to an L1 and is located next to the ALU.
- Victim buffer: A small, fully associative cache attached to L1 that acts as a repository for the data and instructions evicted from it.
- Auxiliary cache: Similar to the previous one, fully associative and with a FIFO replacement algorithm.
- Trace cache or L0: A very small memory close to the instruction decoding unit that stores the micro-operations produced by recently decoded instructions, so that they do not have to be decoded again the next time they are needed.
According to the replacement algorithm
This is how data is evicted from the cache to make room for other data:
- Random or RR: The line to replace is selected pseudo-randomly.
- LRU: Replaces the least recently used line, following the principle of temporal locality.
- FIFO: First in, first out. The line that has been in the cache the longest is replaced.
- MRU: Replaces the most recently used line.
- PLRU: Pseudo-LRU, an approximation of LRU implemented in set-associative caches, usually those with more than 4 ways.
- SLRU: Segmented LRU. The cache is divided into a probationary segment and a protected segment. The lines of the protected segment are ordered from oldest to newest access, and a probationary line that is accessed again is moved to the newest end of the protected segment, which gives it a second chance before being evicted.
- LFU: Evicts the least frequently accessed line first, using a per-line access counter.
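To make the difference between two of these policies concrete, the sketch below counts hits for LRU and FIFO on the same access sequence. It assumes a tiny fully associative cache of 3 lines, and the access sequence is invented for illustration; real hardware implements these policies in circuitry, not with Python containers.

```python
from collections import OrderedDict, deque


def count_hits_lru(accesses, capacity):
    """Count hits when the least recently used block is evicted."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits


def count_hits_fifo(accesses, capacity):
    """Count hits when the block that entered first is evicted."""
    cache = set()
    arrival_order = deque()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1  # a hit does not change the arrival order
        else:
            if len(cache) >= capacity:
                cache.discard(arrival_order.popleft())  # evict the oldest arrival
            cache.add(block)
            arrival_order.append(block)
    return hits


accesses = [1, 2, 3, 1, 4, 1, 5, 1]  # block 1 is reused often
print("LRU hits: ", count_hits_lru(accesses, 3))   # prints 3
print("FIFO hits:", count_hits_fifo(accesses, 3))  # prints 2
```

In this sequence LRU keeps the frequently reused block 1 in the cache, while FIFO evicts it simply because it arrived first. Other access patterns can favor a different policy.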
By location policy
Depending on how blocks of main memory are mapped to locations in the cache:
- Direct mapped: Each block of main memory is assigned to one specific cache line.
- Fully associative: Any main memory block can be stored in any cache location.
- n-way set associative: You have probably seen processor descriptions stating that the cache is 4-way, 8-way, etc. These are the ways. With a single way, each main memory block can only be stored in one specific location of the cache, which is the direct-mapped case. With more ways, direct mapping at the set level is combined with associative mapping at the block level: a block maps to one specific set, but can occupy any of the n ways within that set.
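The mapping between a memory address and a cache location can be sketched with a little arithmetic. The example assumes a 32 KiB cache with 64-byte lines, which are common but by no means universal values, and the function name split_address is invented here. The tag is what the cache compares to check whether a line in the selected set really holds the requested block.

```python
def split_address(address, cache_bytes, line_bytes, ways):
    """Split an address into (tag, set_index, offset) for a set-associative cache.

    ways=1 gives a direct-mapped cache; ways equal to the total number of
    lines gives a fully associative cache with a single set.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    offset = address % line_bytes          # byte position inside the line
    block_number = address // line_bytes   # which memory block the byte is in
    set_index = block_number % num_sets    # the only set this block may use
    tag = block_number // num_sets         # identifies the block within its set
    return tag, set_index, offset


CACHE_BYTES = 32 * 1024  # assumed cache size
LINE_BYTES = 64          # assumed line size
ADDRESS = 0x12345

for ways in (1, 8, 512):
    tag, set_index, offset = split_address(ADDRESS, CACHE_BYTES, LINE_BYTES, ways)
    print(f"{ways:3d}-way: tag={tag}, set={set_index}, offset={offset}")
```

As the number of ways grows, the number of sets shrinks and more bits move into the tag, which is why higher associativity gives a block more places to live at the price of more comparisons per lookup.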
According to write policy
- Write-through: Every write updates both the cache and main memory so that memory never holds stale data, thus maintaining consistency.
- Write-back: Only the cache is written. When a modified line is about to be replaced, its contents are first copied to the next level of the memory hierarchy.
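The practical difference between the two write policies is how often main memory is actually written. The sketch below counts memory writes for a sequence of stores. It is a simplified model with several assumptions: the write-back cache is fully associative, evicts with LRU, allocates a line on every store, and flushes its modified lines at the end. Both function names are invented for this example.

```python
from collections import OrderedDict


def write_through_memory_writes(store_addresses):
    """With write-through, every store is also sent to main memory."""
    return len(store_addresses)


def write_back_memory_writes(store_addresses, capacity):
    """With write-back, memory is written only when a modified line leaves the cache."""
    cache = OrderedDict()  # line address -> modified flag, in LRU order
    memory_writes = 0
    for line in store_addresses:
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        elif len(cache) >= capacity:
            _, modified = cache.popitem(last=False)  # evict least recently used
            if modified:
                memory_writes += 1  # copy the modified line back before dropping it
        cache[line] = True  # the store marks the line as modified
    # Flush the modified lines still in the cache at the end of the run
    memory_writes += sum(1 for modified in cache.values() if modified)
    return memory_writes


stores = [0, 0, 0, 1, 1, 0]  # repeated stores to two lines
print("write-through:", write_through_memory_writes(stores))  # prints 6
print("write-back:   ", write_back_memory_writes(stores, 2))  # prints 2
```

Write-back saves memory traffic when the same lines are written repeatedly, while write-through keeps memory up to date at all times, which is the consistency advantage mentioned above.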
In summary, the processor cache is the solution to the performance problem of the memory system: it speeds up the reads and writes that the processor must perform on main memory, which leads to higher overall performance.