This techerature is about cache memories. If
you have ever wondered what cache memories are, why they are
needed, or how they work, you are at the right place.
Motivation for this Techerature
There is a great deal of
information on cache memories scattered across the Internet.
So why one more? The problem with most of it is a common one: the
descriptions are usually very high level, and ordinary readers
may not be able to follow them. The other problem is that they
usually fail to let the reader appreciate objectively why such
memories are needed. Even where the need is established, the
explanation of the working mechanism fails to connect with readers
who have little or no prior knowledge of the subject. This
techerature will hopefully help the average reader understand the
need for, and the structure and functionality of, these wonderful
devices. It is written in plain English, without jargon.
Cache-Memories - 1 : Need of Cache Memory, what is cache memory, what is cache-hit, what is cache-miss
First of all, the tutorial establishes the need for cache memories:
why they are needed at all, using a very simple example.
It then attempts to explain more complex cache memories in
very simple language, with detailed diagrams and their working
principles.
Consider the following example of a very simple System on Chip
design with a microprocessor such as an ARM Cortex-M3.
The SoC will have software for the processor
to execute. Let us say that this software is stored in a
non-volatile memory, say a Flash Memory, as shown in the above
diagram.
The Problem: The Flash memory can either be integrated
within the SoC (e-flash, or embedded flash, as it is called) or,
for smaller technology nodes, i.e. smaller than 28 nm, the system
software is usually stored off the main SoC/microcontroller in a
separate Flash memory IC. In either case the time taken by the
processor to access the memory is significant. This is because
Flash memories are quite slow compared to SRAMs, for example.
And if the flash is off the main SoC, the access times are even
worse. Because of these large access times, the processor's
performance is compromised. This is the problem statement
here: a fast processor is being let down by a slow memory.
More elaborate example:
The Processor: The processor on the SoC is an ARM
Cortex-M3, which can run at, say, 100 MHz. This means a
clock period of 10 ns, so the processor can execute at best
about one instruction every 10 ns.
The Memory: The off-chip Flash in this example is an SPI-Flash. It
needs a 'ritual' of steps to be performed before it can send back
read data. First the chip select must be asserted, followed by a
series of steps, i.e. sending the command, the address etc., and
only then does the SPI-Flash start sending the data corresponding
to the address. In summary, it can be 100s of ns before the
SPI-Flash sends the first data. Once the initial 'ritual' has been
performed, data can be fetched sequentially from the start address
for any length. This means it is efficient to read multiple data
bytes once the initial 'ritual' is done. On the other hand, if we
read just one byte or word at a time and de-assert the chip
select, the whole 'ritual' must be performed again before the
next data byte/word can be fetched.
A bit more about
SPI/QSPI is here.
The processor runs a lot faster than the memory.
When the processor requests one 32 bit word, it is desirable to
also get a few more 'anticipatory' 32 bit data words from the
SPI-Flash and store them somewhere, so that, if the following
request from the processor is a sequential fetch from the next
address location, the SPI-Flash 'ritual' can be avoided. This
arrangement is shown in the figure below. Notice the '256 bit
Register + Control Logic' between the processor and the SPI
controller.
Figure-2: Simple System with Processor, SPI-Memory Controller,
and SPI Flash Memory with 256-bit register+control Logic
Using the scheme of Figure 2, when the processor sends
its first access, i.e. ACCESS0 @ address 0, it waits a long
time to get the data back, as the full SPI 'ritual' must be
performed. However, if the following accesses from the processor,
e.g. ACCESS1, ACCESS2 and ACCESS3, are to address 1, address 2 and
address 3 sequentially, then the processor will not wait as long
as it did for ACCESS0, because the '256 bit reg + controller'
logic fetches the locations at these addresses in anticipation
and stores them in the 256 bit register, so the initial SPI
'ritual' is avoided for these subsequent accesses. The processor
will still 'stall' between accesses, but the 'stall' times are
clearly reduced.
Worked out example:
SPI initial 'ritual' time (Tr) = 1000 ns
SPI time to fetch one 32 bit data word following the ritual (Ta) = 200 ns
Time needed to get eight 32 bit data words in Figure 1 (Ttf1):
Ttf1 = 8 x (Tr + Ta) = 8 x (1000 + 200) ns = 9600 ns
Time needed to get eight 32 bit data words in Figure 2 (Ttf2):
Ttf2 = Tr + 8 x Ta = 1000 ns + 8 x 200 ns = 2600 ns
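The arithmetic of the worked example can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the timing values are the assumed figures of the example, not measurements of any real SPI-Flash.

```python
# Timing figures assumed in the worked example (nanoseconds).
T_RITUAL_NS = 1000   # Tr: initial SPI 'ritual' (chip select, command, address)
T_WORD_NS = 200      # Ta: time to fetch one 32 bit word after the ritual


def time_without_register(n_words, tr=T_RITUAL_NS, ta=T_WORD_NS):
    """Figure 1 scheme: every word pays the full ritual plus one word fetch."""
    return n_words * (tr + ta)


def time_with_register(n_words, tr=T_RITUAL_NS, ta=T_WORD_NS):
    """Figure 2 scheme: one ritual, then the words are read back to back."""
    return tr + n_words * ta


ttf1 = time_without_register(8)
ttf2 = time_with_register(8)
print(f"Ttf1 = {ttf1} ns")                # Ttf1 = 9600 ns
print(f"Ttf2 = {ttf2} ns")                # Ttf2 = 2600 ns
print(f"speed-up = {ttf1 / ttf2:.2f}x")   # speed-up = 3.69x
```

Changing the number of words shows why longer sequential reads pay off: the ritual is paid once, so its share of the total time shrinks as more words are fetched per ritual.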
It is clear how the introduction of a 256 bit
register + controller helps to reduce the processor's average
access time, and so lets it run a lot faster. The controller is
the part that fetches the 'anticipatory' data words following the
first request from the processor and stores them in the 256 bit
register.
Consider one more aspect. Once this 256 bit register has fetched
the contents from the SPI flash, it can keep them for future use.
From then on, any time the processor fetches from an address whose
data is held in this 256 bit register, the external SPI access can
be avoided completely. Assuming the register delivers one word per
10 ns clock cycle, the time to access eight 32 bit words in this
case is 10 ns x 8 = 80 ns. Compared with the 2600 ns needed for
the first fetch, that is incredible.
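To get a feel for how much this reuse matters, the sketch below estimates the total fetch time when the same eight words are executed several times, as in a small loop. The loop scenario, the function names and the `passes` parameter are assumptions made for illustration; the timings are the figures used in this example (9600 ns without the register, 2600 ns for the first fill, 10 ns per word afterwards).

```python
T_NO_REGISTER_NS = 9600   # Ttf1: 8 words fetched one by one, no register
T_FIRST_PASS_NS = 2600    # Ttf2: first fetch of 8 words into the register
T_HIT_PASS_NS = 8 * 10    # 8 words served from the register, 10 ns each


def total_time_with_register(passes):
    """First pass fills the register; every later pass is served from it."""
    return T_FIRST_PASS_NS + (passes - 1) * T_HIT_PASS_NS


def total_time_without_register(passes):
    """Every pass goes out to the SPI-Flash, word by word."""
    return passes * T_NO_REGISTER_NS


for passes in (1, 10, 100):
    with_reg = total_time_with_register(passes)
    without_reg = total_time_without_register(passes)
    print(f"{passes:>3} passes: {with_reg} ns with register, "
          f"{without_reg} ns without")
```

The more often the stored words are reused, the less the one-off cost of the first fetch matters.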
In a hypothetical example, if the code the processor executes takes
256 bits or less to store, then the entire code can be held in
this 256 bit register and the processor will run very fast.
Hence, there are 2 advantages of this '256 bit register +
controller':
i). It helps to reduce the average access time from the off-chip
SPI-Flash.
ii). If the code that the processor intends to run can be found in
this 256 bit register, external fetches are avoided for that
portion of the code, and it will run extremely fast.
Note that the second advantage is the one
that is exploited more frequently, and it is the one a
cache memory typically tries to address. In practice, the
cache will be a lot larger than just 256 bits, as will be seen
in later chapters.
This raises a lot of questions, e.g. which 256 bits should be kept
in the register, when these 256 bits should be replaced/renewed
from the SPI-Flash, etc. These will be addressed in later chapters.
Now the question is how and when the 'control logic' around this
256 bit register should issue a read request to the Flash memory.
The answer is: when it does not have the data corresponding to the
address the processor has asked for. So the control logic must
first determine whether it has the data corresponding to the
address the processor has issued. This means that, in addition to
maintaining the 256 bits of data, the control logic must also
store the addresses (or a part of the address bits, as we will
learn later) for which it has data. Upon receiving a transaction
request (i.e. a read or a write) from the processor, the control
logic searches for the transaction address within its address
storage. If it determines that the 256 bit register has the data
corresponding to the address issued by the processor, it returns
that data. If the 256 bit register does not have the corresponding
data, the control logic issues a read request to the Flash memory
and gets eight 32 bit words.
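The decision the control logic makes can be sketched in software. The class below is a hypothetical model, not a description of real hardware: it assumes word addresses (address 1 is the second 32 bit word), assumes the eight stored words always form an aligned block, models the Flash as a plain Python list, and handles reads only.

```python
WORDS_PER_LINE = 8  # 8 x 32 bit words = 256 bits


class SingleLineCache:
    """A 256 bit register plus the address information needed to search it."""

    def __init__(self, flash):
        self.flash = flash        # hypothetical slow memory: a list of words
        self.valid = False        # the register holds nothing after reset
        self.base = 0             # word address of the first stored word
        self.data = [0] * WORDS_PER_LINE
        self.flash_reads = 0      # number of 8-word reads issued to the Flash

    def read(self, word_addr):
        # Start of the aligned 8-word block that contains word_addr.
        base = word_addr - (word_addr % WORDS_PER_LINE)
        if not (self.valid and base == self.base):
            # Data not present: fetch all 8 words from Flash in one go.
            self.data = self.flash[base:base + WORDS_PER_LINE]
            self.base = base
            self.valid = True
            self.flash_reads += 1
        return self.data[word_addr - self.base]


flash = [0x1000 + i for i in range(32)]   # made-up Flash contents
cache = SingleLineCache(flash)
values = [cache.read(a) for a in (0, 1, 2, 3, 9, 0)]
print([hex(v) for v in values], cache.flash_reads)
```

In this run only the accesses to addresses 0, 9 and the final 0 go out to the Flash; addresses 1, 2 and 3 are served from the register. The final access to address 0 has to be fetched again because the single register was overwritten by the block containing address 9, which hints at why real caches hold more than one block.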
In the above example, the 256 bit
register is a simple 'cache memory' and the control logic is a
simple cache controller. The storage space for the
addresses whose data the cache memory holds is
called the 'tag' memory (see the
cache
organization chapter to learn more about TAG and TAG memory).
The functionality of the cache controller is not discussed at the
moment; it will come later.
Why do we need cache memory?
It is clear from the above example and explanation that cache
memory is needed to speed up processor execution from slow
memories while keeping costs under control. Theoretically it is
possible to have an on-chip RAM, or even a ROM, of the size of the
flash memory, close to the processor, store everything in it, and
not have a cache at all. This would make processing very fast, but
the cost would be far too high. Remember that the cost per bit of
on-chip ROM/RAM is much higher than the cost per bit of storage in
off-chip flash. The cache memory is a compromise: it helps the
processor run faster, lets the system have lots of low-cost
storage (e.g. SPI-Flash), and helps keep the total system cost in
check.
Cache Hit
If the search of the stored addresses gives a positive result,
that is, the requested data is present in the cache memory, it is
called a 'Cache Hit'.
Cache Miss
On the other hand, if the search of the stored addresses gives a
negative result, that is, the requested data is not present in the
cache memory, it is called a 'Cache Miss'. A 'Miss' will then
trigger a read of the Flash memory, or whatever the slower memory
is.
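Hits and misses can be counted for a sequence of accesses with a short sketch. It assumes a cache that holds a single 256 bit line of eight words, word addresses, a 10 ns hit, and a miss that costs the full 2600 ns line fill from the worked example; the address trace and the function name are made up for illustration.

```python
WORDS_PER_LINE = 8            # one 256 bit line = 8 x 32 bit words
T_HIT_NS = 10                 # assumed: one processor clock at 100 MHz
T_MISS_NS = 1000 + 8 * 200    # assumed: ritual + 8-word fetch = 2600 ns


def run_trace(word_addresses):
    """Classify each access as a hit or a miss for a single-line cache."""
    stored_block = None           # block currently held; None means empty
    hits = misses = 0
    for addr in word_addresses:
        block = addr // WORDS_PER_LINE
        if block == stored_block:
            hits += 1
        else:
            misses += 1
            stored_block = block  # the miss refills the line from Flash
    return hits, misses


# A small loop over words 0..7 run three times, then the code moves on.
trace = list(range(8)) * 3 + [8, 9, 10]
hits, misses = run_trace(trace)
total_ns = hits * T_HIT_NS + misses * T_MISS_NS
print(f"hits={hits} misses={misses} "
      f"hit rate={hits / len(trace):.0%} time={total_ns} ns")
```

For this trace there are 25 hits and 2 misses, and the two misses account for almost all of the total time, which is why a high hit rate matters so much.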
Note that a cache memory has storage for data and storage for all
the corresponding addresses (or parts of addresses, as will become
evident later) for which it has data. Hence a cache memory is just
a couple of RAMs, i.e. the 'DATA RAM' and the 'TAG RAM', organized
in a way that enables the functionality of the cache memory.
While the data RAM stores the data, the tag RAM stores the
addresses or parts of the addresses.
This example explains a very simple cache memory, a 256 bit cache
memory. For more elaborate examples and more cache fundamentals,
see the
next
chapters.
Conclusion: What is a cache memory?
A cache memory is an intermediate memory
located between the processor and a relatively slower
memory, to help the processor speed up its execution. In the
example above the slower memory is Flash memory, but it does
not have to be Flash; it can be any memory that is
slower than the memory closer to the processor,
e.g. the slow memory can be DRAM.