At the end of the last chapter, we did a lot of work that wasn’t actually writing kernel code. So let’s review what we’re up to:
- GRUB loaded our kernel, and started running it.
- We’re currently running in ‘protected mode’, a 32-bit environment.
- We want to transition to ‘long mode’, the 64-bit environment.
- In order to do that, we have to do some work.
We’re on step four. More specifically, here’s what we have to do:
- Set up something called ‘paging’.
- Set up something called a ‘GDT’.
- Jump to long mode.
This section covers step one. The next two will cover the other two steps. Afterwards, we’ll be ready to stop writing assembly and start writing Rust!
By the way...
There’s something we’re going to skip here, which we’d want to do in a more serious kernel: check to make sure that our hardware can actually do this! We’re going to just assume that our ‘hardware’ can run in 64-bit mode, because we’re running our OS in QEMU, which supports all of these operations. But if we were to run our OS on a real 32-bit computer, it would end up crashing. We could check that it’s possible, and then print a nice error message instead. But, we won’t cover that here. It’s not particularly interesting, and we know that it will never fail. But it might be something you want to explore on your own, for extra credit.
So, step one: set up ‘paging’. What is paging? Paging is a way of managing memory. Our computer has memory, and we can think of memory as being a big long list of cells:
Each location in memory has an address, and we can use the address to distinguish between the cells: the value at cell zero, the value at cell ten.
But how many cells are there? This question has two answers: The first answer is how much physical memory (RAM) do we have in our machine? This will vary per machine. My machine has 8 gigabytes of memory or 8,589,934,592 bytes. But maybe your machine has 4 gigabytes of memory, or sixteen gigabytes of memory.
The second answer to how many cells there are is how many addresses can be used to refer to cells of memory? To answer that we need to figure out how many different unique numbers we can make. In 64-bit mode, we can create as many addresses as can be expressed by a 64-bit number. So that means we can make addresses from zero to (2^64) - 1. That’s 18,446,744,073,709,551,616 addresses! We sometimes refer to a sequence of addresses as an ‘address space’, so we might say “The full 64-bit address space has 2^64 addresses.”
So now we have an imbalance. We have only roughly 8.5 billion actual physical memory slots in an 8GB machine but quintillions of possible addresses we can make.
How can we resolve this imbalance? We don't want to be able to address memory that doesn't exist!
Here’s the strategy: we introduce two kinds of addresses: physical addresses and virtual addresses. A physical address is the actual, real value of a location in the physical RAM in the machine. A virtual address is an address anywhere inside of our 64-bit address space: the full range. To bridge between the two address spaces, we can map a given virtual address to a particular physical address. So we might say something like “virtual address 0x044a maps to the physical address 0x0011.” Software uses the virtual addresses, and the hardware uses physical addresses.
Mapping each individual address would be extremely inefficient; we would need to keep track of literally every memory address and where it points to. Instead, we split up memory into chunks, also called ‘pages’, and then map each page to an equal sized chunk of physical memory.
By the way... In the future we'll be using paging to help us implement something called "virtual memory". Besides helping us always be able to map a 64-bit number to a real place in physical memory, "virtual memory" is useful for other reasons. These reasons don't really come into play at this point, so we'll hold off on discussing them. For now, it's just important to know that we need paging to enter 64-bit long mode and that it's a good idea for many reasons including helping us resolve the fact the we have way less actual memory than possible addresses to refer to that memory.
Paging is actually implemented by a part of the CPU called an ‘MMU’, for ‘memory management unit’. The MMU will translate virtual addresses into their respective physical addresses automatically; we can write all of our software with virtual addresses only. The MMU does this with a data structure called a ‘page table’. As an operating system, we load up the page table with a certain data structure, and then tell the CPU to enable paging. This is the task ahead of us; it’s required to set up paging before we transition to long mode.
How should we do our mapping of physical to virtual addresses? You can make this easy, or complex, and it depends on exactly what you want your OS to be good at. Some strategies are better than others, depending on the kinds of programs you expect to be running. We’re going to keep it simple, and use a strategy called ‘identity mapping’. This means that every virtual address will map to a physical address of the same number. Nothing fancy.
Let’s talk more about the page table. In long mode, the page table is four levels deep, and each page is 4096 bytes in size. What do I mean by levels? Here are the official names:
- the Page-Map Level-4 Table (PML4),
- the Page-Directory Pointer Table (PDP),
- the Page-Directory Table (PD),
- and the Page Table (PT).
I’ve most commonly heard them referred to as a “level x page table”, where
goes from four to one. So the PML4 is a “level four page table,” and the PT is
a “level one page table.” They’re called ‘levels’ because they decend in order:
each entry in a level 4 page table points to a level 3 page table entry. Each
level 3 page table entry points at a level 2 page table entry, and each level 2
page table entry points at a level 1 page table entry. That entry then contains
the address. Whew! To get started, we only need one entry of each table.
Creating the page table
So here’s the strategy: create a single entry of each of these tables, then point them at each other in the correct way, then tell the CPU that paging should be enabled.
Creating page table entries
To create space for these page table entries, open up
boot.asm and add these
lines at the bottom:
section .bss align 4096 p4_table: resb 4096 p3_table: resb 4096 p2_table: resb 4096
We introduce a new section, ‘bss’. It stands for ‘block started by symbol’, and was introduced in the 1950s. The name doesn’t make much sense anymore, but the reason we use it is because of its behavior: entries in the bss section are automatically set to zero by the linker. This is useful, as we only want certain bits set to 1, and most of them set to zero.
resb directive reserves bytes; we want to reserve space for each entry.
align directive makes sure that we’ve aligned our tables properly. We
haven’t talked much about alignment yet: the idea is that the addresses here
will be set to a multiple of 4096, hence ‘aligned’ to 4096 byte chunks. We’ll
eventually talk more about alignment and why it’s important, but it doesn’t
matter a ton right now.
After this has been added, we have a single valid entry for each level. However, because our page four entry is all zeroes, we have no valid pages. That’s not super useful. Let’s set things up properly.
Pointing the entries at each other
In order to do this setup, we need to write some more assembly code! Open up
boot.asm. You can either leave in printing code, or remove it. If you do leave
it in, add this code before it: that way, if you see your message print out, you
know it ran successfully.
global start section .text bits 32 start: ; Point the first entry of the level 4 page table to the first entry in the ; p3 table mov eax, p3_table or eax, 0b11 mov dword [p4_table + 0], eax
If you recall,
; are comments. Leaving yourself excessive comments in assembly
files is a good idea. Let’s go over each of these lines:
mov eax, p3_table
This copies the contents of the first third-level page table entry into the
eax register. We need to do this because of the next line:
or eax, 0b11
We take the contents of
or it with
0b11, the result is written in
eax. First, let’s talk about
what this does, and then we’ll talk about why we want to do it.
When dealing with binary,
or is an operation that returns
1 if either value
0 if both are
0. In other words, if
b are a single
|or a b||0||1||1||1|
You’ll see charts like this a lot when talking about binary stuff. You can read
this chart from top to bottom, each column is a case. So the first column says
a is zero and
b is zero,
or a b will be zero.” The second column says
a is one and
b is zero,
or a b will be one.” And so on.
Now, we defined
p3_table in the BSS section, which means that it will start as
all zeroes. So when we
0b11, it means that the first two bits will
be set to one, keeping the rest as zeroes.
Okay, so now we know what we are doing, but why? Each entry in a page table contains an address, but it also contains metadata about that page. The first two bits are the ‘present bit’ and the ‘writable bit’. By setting the first bit, we say “this page is currently in memory,” and by setting the second, we say “this page is allowed to be written to.” There are a number of other settings we can change this way, but they’re not important for now.
Now that we have an entry set up properly, the next line is of interest:
mov dword [p4_table + 0], eax
mov instruction, but this time, copying
eax, where we’ve been
setting things up, into... something in brackets.
 means, “I will be giving
you an address between the brackets. Please do something at the place this
address points.” In other words,
 is like a dereference operator.
Now, the address we’ve put is kind of funny looking:
p4_table + 0. What’s up
+ 0? It’s not strictly needed: adding zero to something keeps it the
same. However, it’s intended to convey to the reader that we’re accessing the
zeroth entry in the page table. We’re about to see some more code later where we
will do something other than add zero, and so putting it here makes our code
look more symmetric overall. If you don’t like this style, you don’t have to put
These few lines form the core of how we’re setting up these page tables. We’re going to do the same thing over again, with slight variations.
Here’s the full thing again:
; Point the first entry of the level 4 page table to the first entry in the ; p3 table mov eax, p3_table or eax, 0b11 ; mov dword [p4_table + 0], eax
Once you feel like you’ve got a handle on that, let’s move on to pointing the page three table to the page two table!
; Point the first entry of the level 3 page table to the first entry in the ; p2 table mov eax, p2_table or eax, 0b11 mov dword [p3_table + 0], eax
The code is the same as above, but with
p3_table instead of
p4_table. Nothing more than that.
We have one last thing to do: set up the level two page table to have valid references to pages. We’re going to do something we haven’t done yet in assembly: write a loop!
Here’s the basic outline of loop in assembly:
- Create a counter variable to track how many times we’ve looped
- make a label to define where the loop starts
- do the body of the loop
- add one to our counter
- check to see if our counter is equal to the number of times we want to loop
- if it’s not, jump back to the top of the loop
- if it is, we’re done
It’s a little more detail-oriented than loops in other languages. Usually, you have curly braces or indentation to indicate that the body of the loop is separate, but we don’t have any of those things here. We also have to write the code to increment the counter, and check if we’re done. Lots of little fiddly bits. But that’s the nature of what we’re doing!
Let’s get to it!
; point each page table level two entry to a page mov ecx, 0 ; counter variable
In order to write a loop, we need a counter.
ecx is the usual loop counter
register, that’s what the
c stands for: counter. We also have a comment
indicating what we’re doing in this part of the code.
Next, we need to make a new label:
As we mentioned above, this is where we will loop back to when the loop continues.
mov eax, 0x200000 ; 2MiB
We’re going to store 0x200000 in
eax, or 2,097,152 which is equivalent to 2 MiB.
Here’s the reason: each page is two megabytes in size. So in order to get the
right memory location, we will multiply the number of the loop counter by 0x200000:
And so on. So our pages will be all next to each other, and 2,097,152 bytes in size.
Here’s that multiplication!
mul takes just one argument, which in this case
ecx counter, and multiplies that by
eax, storing the result in
eax. This will be the location of the next page.
or eax, 0b10000011
Next up, our friend
or. Here, we don’t just or
0b11: we’re also setting
another bit. This extra
1 is a ‘huge page’ bit, meaning that the pages are
2,097,152 bytes. Without this bit, we’d have 4KiB pages instead of 2MiB pages.
mov [p2_table + ecx * 8], eax
Just like before, we are now writing the value in
eax to a location. But
instead of it being just
p2_table + 0, we’re adding
ecx * 8 Remember,
is our loop counter. Each entry is eight bits in size:
0b10000011. So we need
to multiply the counter by eight, and then add it to
p2_table. Let’s take a
closer look: let’s assume
p2_table is zero, to make the math easier:
|ecx * 8||0||8||16||24||32|
|p2_table + ecx * 8||0||8||16||24||32|
We skip eight spaces each time, so we have room for all eight bits of the page table.
That’s the body of the loop! Now we need to see if we need to keep looping or not:
inc ecx cmp ecx, 512 jne .map_p2_table
inc instruction increments the register it’s given by one.
ecx is our loop
counter, so we’re adding to it. Then, we ‘compare’ with
cmp. We’re comparing
ecx with 512: we want to map 512 page entries overall. This will give us 512 *
2 mebibytes: one gibibyte of memory. It’s also why we wrote the loop: writing
out 512 entries by hand is possible, theoretically, but is not fun. Let’s make
the computer do the math for us.
jne instruction is short for ‘jump if not equal’. It checks the result of
cmp, and if the comparison says ‘not equal’, it will jump to the label
map_p2_table points to the top of the loop.
That’s it! We’ve written our loop and mapped our second-level page table. Here’s the full code of the loop:
; point each page table level two entry to a page mov ecx, 0 ; counter variable .map_p2_table: mov eax, 0x200000 ; 2MiB mul ecx or eax, 0b10000011 mov [p2_table + ecx * 8], eax inc ecx cmp ecx, 512 jne .map_p2_table
And, with this, we’ve now fully mapped our page table! We’re one step closer to being in long mode. Here’s the full code, all in one place:
; Point the first entry of the level 4 page table to the first entry in the ; p3 table mov eax, p3_table or eax, 0b11 ; mov dword [p4_table + 0], eax ; Point the first entry of the level 3 page table to the first entry in the ; p2 table mov eax, p2_table or eax, 0b11 mov dword [p3_table + 0], eax ; point each page table level two entry to a page mov ecx, 0 ; counter variable .map_p2_table: mov eax, 0x200000 ; 2MiB mul ecx or eax, 0b10000011 mov [p2_table + ecx * 8], eax inc ecx cmp ecx, 512 jne .map_p2_table
Now that we’ve done this, we have a valid initial page table. Time to enable paging!
Now that we have a valid page table, we need to inform the hardware about it. Here’s the steps we need to take:
- We have to put the address of the level four page table in a special register
- enable ‘physical address extension’
- set the ‘long mode bit’
- enable paging
These four steps are not particularly interesting, but we have to do them. First, let’s do the first step:
; move page table address to cr3 mov eax, p4_table mov cr3, eax
So, this might seem a bit redundant: if we put
eax, and then
cr3, why not just put
cr3? As it turns out,
cr3 is a special register, called a ‘control register’, hence the
cr registers are special: they control how the CPU actually works. In our
cr3 register needs to hold the location of the page table.
Because it’s a special register, it has some restrictions, and one of those is
that when you
cr3, it has to be from another register. So we need the
mov to set
p4_table in a register before we can set
Step one: done!
Next, enabling ‘physical address extension’:
; enable PAE mov eax, cr4 or eax, 1 << 5 mov cr4, eax
In order to set PAE, we need to take the value in the
cr4 register and
modify it. So first, we
mov it into
eax, then we use
or to change the
value. What about
1 << 5? The
<< is a ‘left shift’. It might be easier to
show you with a table:
See how the 1 moves left? So
1 << 5 is
100000 (or 2^5 if you like maths; incidentally 1<<n = 2^n). If you only need to set one
bit, this can be easier than writing out
100000 itself, as you don’t need to
count the zeroes.
After we modify
eax to have this bit set, we
mov the value back into
PAE has been set! Why is this what you need to do? It just is. The details are
not really in the scope of this tutorial.
Okay, so we have step two done. Time for step three: setting the long mode bit:
; set the long mode bit mov ecx, 0xC0000080 rdmsr or eax, 1 << 8 wrmsr
wrmsr instructions read and write to a ‘model specific
msr. This is just what you have to do to set this up. Again,
we won’t get into too much detail, as it’s not very interesting. Boilerplate.
Finally we are all ready to enable paging!
; enable paging mov eax, cr0 or eax, 1 << 31 or eax, 1 << 16 mov cr0, eax
cr0 is the register we need to modify. We do the usual “move to
some bits, move back to the register” pattern. In this case, we set bit 31 and
Once we’ve set these bits, we’re done! Here’s the full code listing:
; move page table address to cr3 mov eax, p4_table mov cr3, eax ; enable PAE mov eax, cr4 or eax, 1 << 5 mov cr4, eax ; set the long mode bit mov ecx, 0xC0000080 rdmsr or eax, 1 << 8 wrmsr ; enable paging mov eax, cr0 or eax, 1 << 31 or eax, 1 << 16 mov cr0, eax
... are we in long mode yet?
So, technically after paging is enabled, we are in long mode. But we’re not in real long mode; we’re in a special compatibility mode. To get to real long mode, we need a data structure called a ‘global descriptor table’. Read the next section to find out how to make one of these.