NANDHOO.


Chapter 16: ARM Kernel Development


Introduction


ARM processors power billions of devices: smartphones, tablets, embedded systems, and increasingly servers and desktops. ARM kernel development differs significantly from x86/x64 because of its RISC architecture, its different boot process, and the wide variety of hardware platforms. This chapter guides you through creating a kernel for the ARM architecture.


Why This Matters


ARM is everywhere. From Raspberry Pi to Apple M-series chips, from IoT devices to automotive systems, ARM dominates mobile and embedded computing. Understanding ARM kernel development opens opportunities in mobile OS development, embedded systems, and the growing ARM server market.


How to Study This Chapter


  1. Understand RISC principles - ARM is simpler than x86 in many ways
  2. Target specific hardware - ARM has many platforms (Raspberry Pi, Versatile PB, etc.)
  3. Use device trees - ARM systems describe hardware via device trees
  4. Test in QEMU - Start with emulation before real hardware
  5. Read ARM manuals - ARMv7/ARMv8 architecture reference manuals

ARM Boot Process


ARM vs x86 Boot


Aspect        | x86/x64          | ARM
Firmware      | BIOS/UEFI        | U-Boot/Vendor bootloader
Entry Mode    | 16-bit real mode | 32/64-bit mode (depends on variant)
Entry Point   | 0xFFFFFFF0       | Platform-specific
Boot Standard | MBR/GPT          | Platform-specific
Device Info   | PCI enumeration  | Device tree

Typical ARM Boot Sequence


1. Power On
   ↓
2. Boot ROM (SoC-specific, in silicon)
   ↓
3. First-stage bootloader (U-Boot SPL)
   ↓
4. Second-stage bootloader (U-Boot)
   ↓
5. Load kernel + device tree
   ↓
6. Jump to kernel entry (with parameters)
   ↓
7. Kernel initializes and runs
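
Step 6 hands parameters to the kernel in registers. As a minimal sketch, assuming the bootloader follows the 32-bit ARM Linux boot convention (r0 = 0, r1 = machine type, r2 = address of either an ATAGS list or a flattened device tree) and a little-endian kernel, the code below inspects what r2 points at. The names `boot_param_kind` and `classify_boot_param` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* What the pointer passed in r2 refers to (hypothetical names). */
enum boot_param_kind { BOOT_PARAM_UNKNOWN, BOOT_PARAM_ATAGS, BOOT_PARAM_DTB };

#define FDT_MAGIC 0xd00dfeedu /* first word of a device tree blob, big-endian */
#define ATAG_CORE 0x54410001u /* tag id of the first ATAG, little-endian here */

/* Read a 32-bit big-endian value one byte at a time. */
static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 32-bit little-endian value one byte at a time. */
static uint32_t read_le32(const uint8_t *p) {
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* Decide whether r2 points at a device tree, an ATAGS list, or neither. */
enum boot_param_kind classify_boot_param(const void *r2) {
    const uint8_t *p = (const uint8_t *)r2;
    if (p == 0) {
        return BOOT_PARAM_UNKNOWN;
    }
    if (read_be32(p) == FDT_MAGIC) {
        return BOOT_PARAM_DTB;
    }
    /* An ATAG starts with {size, tag}; the tag id is the second word. */
    if (read_le32(p + 4) == ATAG_CORE) {
        return BOOT_PARAM_ATAGS;
    }
    return BOOT_PARAM_UNKNOWN;
}
```

Reading byte by byte avoids both alignment and host-endianness assumptions, which is why the same function can be exercised on a development machine before it runs on the target.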

Project Setup for ARM


Directory Structure


arm-kernel/
├── boot/
│   └── boot.s            # ARM entry point
├── kernel/
│   ├── main.c            # Kernel main
│   ├── uart.c            # Serial driver
│   ├── mmu.c             # Memory management
│   └── interrupts.c      # Exception/interrupt handling
├── include/
│   └── types.h
├── linker.ld
└── Makefile

Cross-Compilation Toolchain


# Install ARM cross-compiler (Ubuntu/Debian)
sudo apt-get install gcc-arm-none-eabi gdb-multiarch

# Or for Linux userspace:
sudo apt-get install gcc-arm-linux-gnueabi

# Verify installation
arm-none-eabi-gcc --version

Makefile for ARM


# Makefile for ARM kernel (bare metal)

CC = arm-none-eabi-gcc
LD = arm-none-eabi-ld
OBJCOPY = arm-none-eabi-objcopy
QEMU = qemu-system-arm

# For Versatile PB (ARM926EJ-S)
CFLAGS = -mcpu=arm926ej-s -mfloat-abi=soft -nostdlib -ffreestanding \
         -Iinclude -Wall -Wextra -O2

LDFLAGS = -T linker.ld

SOURCES = boot/boot.o kernel/main.o kernel/uart.o kernel/mmu.o kernel/interrupts.o
TARGET = kernel.elf
BINARY = kernel.bin

all: $(BINARY)

boot/boot.o: boot/boot.s
	$(CC) $(CFLAGS) -c -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

$(TARGET): $(SOURCES) linker.ld
	$(LD) $(LDFLAGS) -o $@ $(SOURCES)

$(BINARY): $(TARGET)
	$(OBJCOPY) -O binary $< $@

# -nographic already routes the serial port to the terminal; adding
# -serial stdio as well makes QEMU complain that stdio is used twice
run: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -nographic

debug: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -serial stdio -s -S &
	gdb-multiarch $(TARGET) \
		-ex "target remote :1234" \
		-ex "break kernel_main" \
		-ex "continue"

clean:
	rm -f boot/*.o kernel/*.o $(TARGET) $(BINARY)

.PHONY: all run debug clean


ARM Boot Code (ARMv7)


Linker Script


linker.ld:

ENTRY(_start)

SECTIONS {
    . = 0x10000; /* Kernel load address for Versatile */

    .text : {
        *(.text.boot)
        *(.text*)
    }

    .rodata : {
        *(.rodata*)   /* Also catches .rodata.str1.4 string sections */
    }

    .data : {
        *(.data*)
    }

    .bss : {
        __bss_start = .;
        *(.bss*)
        *(COMMON)
        __bss_end = .;
    }

    . = ALIGN(8);
    . = . + 0x1000; /* 4KB stack */
    stack_top = .;
}


Boot Assembly (ARMv7)


boot/boot.s:

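.section .text.boot
.global _start

_start:
    @ We enter in supervisor mode

    @ Set up stack pointer
    ldr sp, =stack_top

    @ Clear BSS section
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0

clear_bss:
    cmp r0, r1
    bhs clear_done          @ Unsigned compare: addresses are not signed
    str r2, [r0], #4
    b clear_bss

clear_done:
    @ Jump to C code
    bl kernel_main

    @ Hang if kernel returns
hang:
    @ The ARM926EJ-S (ARMv5TEJ) has no wfe instruction, so use the
    @ CP15 wait-for-interrupt operation instead
    mov r0, #0
    mcr p15, 0, r0, c7, c0, 4
    b hang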
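
The BSS-clearing loop is short enough to read in assembly, but the same logic can be sketched in C, which also makes it easy to test on a host machine. The helper name `zero_words` is hypothetical; in a kernel it would be called with the linker-script symbols `__bss_start` and `__bss_end`, and only after a stack has been set up.

```c
#include <stdint.h>

/* Zero every 32-bit word in the half-open range [start, end).
 * Mirrors the clear_bss loop: compare, store zero, advance by 4 bytes. */
void zero_words(uint32_t *start, uint32_t *end) {
    for (uint32_t *p = start; p < end; p++) {
        *p = 0;
    }
}
```

In the kernel the call would look like `zero_words(__bss_start, __bss_end)` with both symbols declared as `extern uint32_t __bss_start[], __bss_end[];`. Declaring them as arrays makes the symbol itself evaluate to the address the linker assigned.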


UART Driver (Serial Output)


ARM platforms use memory-mapped UART (not port I/O like x86).


kernel/uart.c:

#include "types.h"

// UART0 base address for Versatile PB
#define UART0_BASE 0x101f1000

#define UART0_DR (*(volatile uint32_t *)(UART0_BASE + 0x00)) // Data register
#define UART0_FR (*(volatile uint32_t *)(UART0_BASE + 0x18)) // Flag register

// Flag register bits
#define UART_FR_TXFF (1 << 5) // Transmit FIFO full
#define UART_FR_RXFE (1 << 4) // Receive FIFO empty

void uart_putc(char c) {
    // Wait until transmit FIFO not full
    while (UART0_FR & UART_FR_TXFF);

    UART0_DR = c;
}

void uart_puts(const char *str) {
    while (*str) {
        if (*str == '\n') {
            uart_putc('\r'); // Add carriage return
        }
        uart_putc(*str++);
    }
}

char uart_getc(void) {
    // Wait until data available
    while (UART0_FR & UART_FR_RXFE);

    return UART0_DR & 0xFF;
}

void uart_init(void) {
    // UART is already initialized by QEMU
    // On real hardware, you'd configure baud rate, etc.
}
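
Because the UART is just memory, its registers can also be described with a struct instead of per-register macros. The sketch below is an alternative layout, not the driver used in this chapter: `struct uart_regs` and `uart_try_putc` are hypothetical names, and only the two offsets already used by the driver (0x00 and 0x18) are modeled. A side benefit is that the polling logic can be tested on a host by pointing it at an ordinary variable.

```c
#include <stdint.h>

/* Subset of the UART register block: data register at 0x00,
 * flag register at 0x18. Everything in between is skipped. */
struct uart_regs {
    volatile uint32_t dr;          /* 0x00: data register */
    volatile uint32_t reserved[5]; /* 0x04-0x14: not used in this sketch */
    volatile uint32_t fr;          /* 0x18: flag register */
};

#define UART_FR_TXFF (1u << 5) /* Transmit FIFO full */

/* Try to send one byte without blocking.
 * Returns 1 if the byte was written, 0 if the transmit FIFO is full. */
int uart_try_putc(struct uart_regs *uart, char c) {
    if (uart->fr & UART_FR_TXFF) {
        return 0;
    }
    uart->dr = (uint32_t)(unsigned char)c;
    return 1;
}
```

On the Versatile PB the struct pointer would be `(struct uart_regs *)0x101f1000`. The `volatile` qualifier matters on every field: without it the compiler may cache a flag read and spin forever.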


Kernel Main


kernel/main.c:

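#include "types.h"

extern void uart_init(void);
extern void uart_puts(const char *);

void kernel_main(void) {
    uart_init();

    uart_puts("ARM Kernel Starting...\n");
    uart_puts("Hello from ARM!\n");

    // Hang. The ARM926EJ-S predates the wfe instruction, so idle with
    // the CP15 wait-for-interrupt operation instead.
    while (1) {
        asm volatile("mcr p15, 0, %0, c7, c0, 4" : : "r"(0) : "memory");
    }
}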


include/types.h:

#ifndef TYPES_H
#define TYPES_H

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;

typedef signed char int8_t;
typedef signed short int16_t;
typedef signed int int32_t;
typedef signed long long int64_t;

typedef uint32_t size_t;
typedef uint8_t bool;

#define true 1
#define false 0
#define NULL ((void*)0)

#endif


Testing the Basic Kernel


make
make run

Expected output:

ARM Kernel Starting...
Hello from ARM!

ARM MMU (ARMv7)


Setting Up Page Tables


kernel/mmu.c:

#include "types.h"

extern void uart_puts(const char *);

// First-level page table (16KB aligned)
static uint32_t page_table[4096] __attribute__((aligned(16384)));

// Section descriptor bits
#define PT_SECTION   (1 << 1)
#define PT_B         (1 << 2)   // Bufferable
#define PT_C         (1 << 3)   // Cacheable
#define PT_AP_RW     (3 << 10)  // Access: read/write
#define PT_DOMAIN(x) ((x) << 5)
#define PT_XN        (1 << 4)   // Execute never

void mmu_section(uint32_t virt, uint32_t phys, uint32_t flags) {
    uint32_t idx = virt >> 20;  // 1 MB sections
    page_table[idx] = (phys & 0xFFF00000) | flags | PT_SECTION;
}

void mmu_init(void) {
    uart_puts("Initializing MMU...\n");

    // Clear page table
    for (int i = 0; i < 4096; i++) {
        page_table[i] = 0;
    }

    // Identity map the first 128 MB (RAM) as cacheable, bufferable memory
    for (uint32_t addr = 0; addr < 0x8000000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0) | PT_B | PT_C);
    }

    // Identity map the peripheral region (VIC, timers, UART at 0x10000000+)
    // without PT_B/PT_C so device registers are never cached. Without this
    // mapping, the first UART access after enabling the MMU would abort.
    for (uint32_t addr = 0x10000000; addr < 0x10200000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0));
    }

    // Set domain 0 to manager mode
    uint32_t dacr = 0x3;  // Domain 0: manager
    asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(dacr));

    // Set translation table base
    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"(page_table));

    // Invalidate the entire TLB so no stale translations survive
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r"(0));

    // Enable MMU
    uint32_t sctlr;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr |= 0x1;        // Enable MMU (M bit)
    sctlr |= (1 << 12);  // Enable I-cache
    sctlr |= (1 << 2);   // Enable D-cache
    asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr));

    uart_puts("MMU enabled\n");
}
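
To make the descriptor arithmetic concrete, here is a small host-testable sketch that reuses the bit definitions from kernel/mmu.c. The function names `section_index` and `section_descriptor` are hypothetical; they split `mmu_section` into its two steps so each can be checked on its own.

```c
#include <stdint.h>

#define PT_SECTION   (1u << 1)
#define PT_AP_RW     (3u << 10)
#define PT_DOMAIN(x) ((uint32_t)(x) << 5)

/* Index into the 4096-entry first-level table: one entry per 1 MB,
 * so the index is simply the top 12 bits of the virtual address. */
uint32_t section_index(uint32_t virt) {
    return virt >> 20;
}

/* Build a section descriptor: the top 12 bits of the physical address,
 * plus the attribute flags, plus the "this is a section" type bit. */
uint32_t section_descriptor(uint32_t phys, uint32_t flags) {
    return (phys & 0xFFF00000u) | flags | PT_SECTION;
}
```

For the UART at 0x101f1000, `section_index` gives 0x101 and `section_descriptor(0x101f1000, PT_AP_RW | PT_DOMAIN(0))` gives 0x10100C02: the low 20 bits of the address are discarded because one entry covers the whole megabyte.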


ARM Exception Handling


Vector Table


boot/boot.s (updated):

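.section .text.boot
.global _start

_start:
    @ Exception vector table. Each entry loads its handler address from
    @ the word table below, so it still works after being copied to 0x0.
    ldr pc, reset_addr
    ldr pc, undefined_addr
    ldr pc, swi_addr
    ldr pc, prefetch_abort_addr
    ldr pc, data_abort_addr
    nop                         @ Reserved
    ldr pc, irq_addr
    ldr pc, fiq_addr

reset_addr:          .word reset_handler
undefined_addr:      .word undefined_handler
swi_addr:            .word swi_handler
prefetch_abort_addr: .word prefetch_abort_handler
data_abort_addr:     .word data_abort_handler
irq_addr:            .word irq_handler
fiq_addr:            .word fiq_handler
                     .word 0    @ Pad the address table to 8 words

reset_handler:
    @ IRQ mode has its own banked sp: give it the top 1 KB of the stack area
    msr cpsr_c, #0xD2           @ IRQ mode, IRQ and FIQ masked
    ldr sp, =stack_top

    @ Set up the supervisor stack pointer below the IRQ stack
    msr cpsr_c, #0xD3           @ Supervisor mode, IRQ and FIQ masked
    ldr r0, =stack_top
    sub sp, r0, #0x400

    @ Copy vector table (8 vectors + 8 address words) to 0x00000000
    ldr r0, =_start
    mov r1, #0x0000
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}

    @ Clear BSS
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0

clear_bss:
    cmp r0, r1
    bhs clear_done              @ Unsigned compare for addresses
    str r2, [r0], #4
    b clear_bss

clear_done:
    @ Jump to C code
    bl kernel_main

hang:
    @ No wfe on the ARM926EJ-S: use CP15 wait-for-interrupt
    mov r0, #0
    mcr p15, 0, r0, c7, c0, 4
    b hang

@ Exception handlers
undefined_handler:
    b undefined_handler

swi_handler:
    @ System call handler (syscall_handler must be provided in C)
    push {r0-r12, lr}
    bl syscall_handler
    ldmfd sp!, {r0-r12, pc}^    @ Return and restore CPSR from SPSR

prefetch_abort_handler:
    b prefetch_abort_handler

data_abort_handler:
    b data_abort_handler

irq_handler:
    push {r0-r3, r12, lr}
    bl irq_dispatcher
    pop {r0-r3, r12, lr}
    subs pc, lr, #4

fiq_handler:
    b fiq_handler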
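
The vector layout and the per-exception return adjustment are easy to get wrong, so it can help to keep them in one table. The sketch below encodes the classic 32-bit ARM-state values as data; the struct and function names are hypothetical, and the table is meant as a reference aid rather than code the kernel needs at run time.

```c
#include <stdint.h>
#include <stddef.h>

/* One row per exception: where its vector sits relative to the vector
 * base, and how much to subtract from lr to get the return address. */
struct arm_exception {
    const char *name;
    uint32_t vector_offset;
    uint32_t return_adjust;
};

static const struct arm_exception arm_exceptions[] = {
    {"reset",          0x00, 0}, /* never returns */
    {"undefined",      0x04, 0}, /* movs pc, lr */
    {"swi",            0x08, 0}, /* movs pc, lr */
    {"prefetch abort", 0x0C, 4}, /* subs pc, lr, #4 */
    {"data abort",     0x10, 8}, /* subs pc, lr, #8 retries the access */
    {"irq",            0x18, 4}, /* subs pc, lr, #4 */
    {"fiq",            0x1C, 4}, /* subs pc, lr, #4 */
};

#define ARM_EXCEPTION_COUNT (sizeof(arm_exceptions) / sizeof(arm_exceptions[0]))

/* Address of the vector for table row idx, given the vector base
 * (0x00000000 or 0xFFFF0000). Returns 0 for an out-of-range row. */
uint32_t exception_vector_address(uint32_t base, size_t idx) {
    if (idx >= ARM_EXCEPTION_COUNT) {
        return 0;
    }
    return base + arm_exceptions[idx].vector_offset;
}

/* Address execution resumes at, given the lr value seen in the handler. */
uint32_t exception_return_address(size_t idx, uint32_t lr) {
    if (idx >= ARM_EXCEPTION_COUNT) {
        return 0;
    }
    return lr - arm_exceptions[idx].return_adjust;
}
```

Note the gap between 0x10 and 0x18: offset 0x14 is the reserved slot that the vector table fills with `nop`.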


Interrupt Controller


kernel/interrupts.c:

#include "types.h"

extern void uart_puts(const char *);

// Versatile Interrupt Controller
#define VIC_BASE       0x10140000
#define VIC_INTENABLE  (*(volatile uint32_t *)(VIC_BASE + 0x10))
#define VIC_INTDISABLE (*(volatile uint32_t *)(VIC_BASE + 0x14))

// Timer base address
#define TIMER0_BASE   0x101E2000
#define TIMER_LOAD    (*(volatile uint32_t *)(TIMER0_BASE + 0x00))
#define TIMER_VALUE   (*(volatile uint32_t *)(TIMER0_BASE + 0x04))
#define TIMER_CONTROL (*(volatile uint32_t *)(TIMER0_BASE + 0x08))
#define TIMER_INTCLR  (*(volatile uint32_t *)(TIMER0_BASE + 0x0C))

#define TIMER_EN       (1 << 7)
#define TIMER_PERIODIC (1 << 6)
#define TIMER_INTEN    (1 << 5)
#define TIMER_32BIT    (1 << 1)

static uint32_t tick_count = 0;

void irq_dispatcher(void) {
    // For simplicity, assume timer interrupt
    tick_count++;

    if (tick_count % 100 == 0) {
        uart_puts("Tick\n");
    }

    // Clear timer interrupt
    TIMER_INTCLR = 1;
}

void timer_init(void) {
    uart_puts("Initializing timer...\n");

    // Set timer to fire every 10ms (assuming 1MHz clock)
    TIMER_LOAD = 10000;

    // Enable timer (periodic, 32-bit, interrupts enabled)
    TIMER_CONTROL = TIMER_EN | TIMER_PERIODIC | TIMER_INTEN | TIMER_32BIT;

    // Enable timer interrupt in VIC (IRQ 4 for timer 0/1)
    VIC_INTENABLE = (1 << 4);

    // Enable IRQs in CPU
    uint32_t cpsr;
    asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    cpsr &= ~(1 << 7);  // Clear I bit (enable IRQ)
    asm volatile("msr cpsr_c, %0" : : "r"(cpsr));

    uart_puts("Timer enabled\n");
}
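
The dispatcher above assumes every IRQ comes from the timer. Once a second device is enabled, the handler has to look at which sources are pending. The sketch below shows one way to do that with a handler table, assuming the pending bits are read from the controller's IRQ status register (on the PL190 VIC this is the register at offset 0x00 from VIC_BASE). All names here are hypothetical, and `timer_irq` is only a stand-in handler.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_IRQS 32

typedef void (*irq_handler_fn)(void);

/* One slot per interrupt line; NULL means no handler is registered. */
irq_handler_fn irq_table[NUM_IRQS];

/* Register a handler for one interrupt line; out-of-range lines are ignored. */
void irq_register(unsigned irq, irq_handler_fn fn) {
    if (irq < NUM_IRQS) {
        irq_table[irq] = fn;
    }
}

/* Call the handler of every pending line that has one.
 * `status` is the value read from the IRQ status register.
 * Returns the number of handlers that ran. */
int irq_dispatch_status(uint32_t status) {
    int handled = 0;
    for (unsigned irq = 0; irq < NUM_IRQS; irq++) {
        if ((status & (1u << irq)) && irq_table[irq] != NULL) {
            irq_table[irq]();
            handled++;
        }
    }
    return handled;
}

/* Stand-in handler for the example: counts timer ticks. */
unsigned timer_ticks;
void timer_irq(void) {
    timer_ticks++;
}
```

In the kernel, `timer_init` would call `irq_register(4, timer_irq)` and `irq_dispatcher` would pass the status register value to `irq_dispatch_status`. Each real handler still has to clear its own device's interrupt, as the timer does through TIMER_INTCLR.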


AArch64 (64-bit ARM) Differences


Boot Code (AArch64)


.section .text.boot
.global _start

_start:
    // Check processor ID (multi-core systems)
    mrs x0, mpidr_el1
    and x0, x0, #0xFF
    cbz x0, primary_cpu
    b hang

primary_cpu:
    // Set up stack
    ldr x0, =stack_top
    mov sp, x0

    // Clear BSS
    ldr x0, =__bss_start
    ldr x1, =__bss_end
    mov x2, #0

clear_bss:
    cmp x0, x1
    b.hs clear_done         // Unsigned compare for addresses
    str x2, [x0], #8
    b clear_bss

clear_done:
    // Jump to kernel main
    bl kernel_main

hang:
    wfe
    b hang
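
The two system-register reads that matter most this early are MPIDR_EL1 (which core am I?) and CurrentEL (which exception level am I in?). The decoding is plain bit manipulation, sketched here in C so it can be checked on a host. The function names are hypothetical; on the target the inputs would come from `mrs x0, mpidr_el1` and `mrs x0, CurrentEL`.

```c
#include <stdint.h>

/* Affinity level 0 (bits 7:0) of MPIDR_EL1. This is the same mask the
 * boot code applies with `and x0, x0, #0xFF`. */
unsigned mpidr_core_id(uint64_t mpidr) {
    return (unsigned)(mpidr & 0xFF);
}

/* The boot code treats the core whose Aff0 field is zero as primary. */
int is_primary_cpu(uint64_t mpidr) {
    return mpidr_core_id(mpidr) == 0;
}

/* CurrentEL holds the exception level in bits 3:2. */
unsigned current_el(uint64_t currentel) {
    return (unsigned)((currentel >> 2) & 0x3);
}
```

Masking matters because MPIDR_EL1 usually has other bits set as well (bit 31 reads as one, for example), so comparing the raw register against zero would park every core.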


AArch64 MMU


// 4KB granule, 48-bit virtual address
#define PT_PAGE      (3 << 0)   // Page descriptor
#define PT_BLOCK     (1 << 0)   // Block descriptor
#define PT_TABLE     (3 << 0)   // Table descriptor
#define PT_VALID     (1 << 0)
#define PT_AF        (1 << 10)  // Access flag
#define PT_SH_INNER  (3 << 8)   // Inner shareable
#define PT_ATTR(x)   ((x) << 2) // Memory attributes

void mmu_init_aarch64(void) {
    // Set up page tables (simplified)
    // Real implementation would set up 4-level paging

    // Configure MAIR_EL1 (Memory Attribute Indirection Register)
    uint64_t mair = 0xFF;  // Attr0: normal memory, write-back cacheable
    asm volatile("msr mair_el1, %0" : : "r"(mair));

    // Configure TCR_EL1 (Translation Control Register)
    uint64_t tcr = 0;
    tcr |= (16ULL << 0);  // T0SZ = 16: 48-bit address space
    tcr |= (1ULL << 8);   // IRGN0: inner write-back, write-allocate cacheable
    tcr |= (1ULL << 10);  // ORGN0: outer write-back, write-allocate cacheable
    tcr |= (3ULL << 12);  // SH0: inner shareable
    tcr |= (0ULL << 14);  // TG0: 4KB granule
    asm volatile("msr tcr_el1, %0" : : "r"(tcr));

    // Set TTBR0_EL1 (page table base). It must point at valid tables
    // before the MMU is enabled; it is commented out here only because
    // this simplified example does not build any tables.
    // asm volatile("msr ttbr0_el1, %0" : : "r"(page_table));

    // Enable MMU
    uint64_t sctlr;
    asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= (1 << 0);  // M bit (MMU enable)
    sctlr |= (1 << 2);  // C bit (data cache)
    sctlr |= (1 << 12); // I bit (instruction cache)
    asm volatile("msr sctlr_el1, %0" : : "r"(sctlr));
    asm volatile("isb");
}
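
With a 4KB granule and 48-bit virtual addresses, each of the four table levels is indexed by 9 bits of the address and the last 12 bits are the offset within the page. The sketch below shows that split; `pt_index` and `page_offset` are hypothetical helper names.

```c
#include <stdint.h>

/* Table index for the given level (0 = top level, 3 = last level).
 * Level 0 uses bits 47:39, level 1 bits 38:30, level 2 bits 29:21,
 * level 3 bits 20:12. Each index selects one of 512 entries. */
unsigned pt_index(uint64_t va, unsigned level) {
    unsigned shift = 39 - 9 * level;
    return (unsigned)((va >> shift) & 0x1FF);
}

/* Byte offset inside the 4KB page (bits 11:0). */
uint64_t page_offset(uint64_t va) {
    return va & 0xFFF;
}
```

A walk reads entry `pt_index(va, 0)` of the table TTBR0_EL1 points to, follows it to the level 1 table, and so on. A block descriptor at level 1 or level 2 ends the walk early and maps 1 GB or 2 MB in one entry.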


Device Tree


ARM systems use device trees to describe hardware.


Example device tree snippet:

/ {
    compatible = "arm,versatile-pb";
    model = "ARM Versatile PB";

    memory {
        device_type = "memory";
        reg = <0x00000000 0x08000000>;  // 128 MB at 0x0
    };

    uart0: serial@101f1000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x101f1000 0x1000>;
        interrupts = <12>;
    };

    timer0: timer@101e2000 {
        compatible = "arm,sp804", "arm,primecell";
        reg = <0x101e2000 0x1000>;
        interrupts = <4>;
    };
};


Parsing device tree (simplified):

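struct fdt_header {
    uint32_t magic;
    uint32_t totalsize;
    // ... more fields
} __attribute__((packed));

// FDT header fields are stored big-endian, so swap the bytes before
// using them on a little-endian ARM core
static uint32_t fdt32_to_cpu(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00) |
           ((x << 8) & 0x00FF0000) | (x << 24);
}

void parse_device_tree(void *fdt) {
    struct fdt_header *header = (struct fdt_header *)fdt;

    if (fdt32_to_cpu(header->magic) != 0xd00dfeed) {  // FDT magic
        uart_puts("Invalid device tree\n");
        return;
    }

    uart_puts("Device tree found\n");
    // Parse nodes and properties...
}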
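
Before walking nodes, it is worth validating more of the header than the magic number. The sketch below reads the header fields at their byte offsets in the flattened device tree format (totalsize at 4, off_dt_struct at 8, off_dt_strings at 12, last_comp_version at 24) and applies a few sanity checks. The function name, the choice of checks, and the error codes are assumptions made for this example.

```c
#include <stdint.h>

#define FDT_MAGIC             0xd00dfeedu
#define FDT_HEADER_SIZE       40u /* ten 32-bit fields */
#define FDT_SUPPORTED_VERSION 17u /* format version this parser understands */

/* Read a 32-bit big-endian field one byte at a time, so the code does
 * not depend on CPU endianness or pointer alignment. */
static uint32_t be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 if the header looks usable, or a negative code:
 * -1 null pointer, -2 bad magic, -3 size too small,
 * -4 blob requires a newer parser, -5 offsets outside the blob. */
int fdt_check_header(const void *blob, uint32_t *totalsize_out) {
    const uint8_t *p = (const uint8_t *)blob;
    if (p == 0) {
        return -1;
    }
    if (be32(p + 0) != FDT_MAGIC) {
        return -2;
    }
    uint32_t totalsize = be32(p + 4);
    uint32_t off_struct = be32(p + 8);
    uint32_t off_strings = be32(p + 12);
    uint32_t last_comp = be32(p + 24);
    if (totalsize < FDT_HEADER_SIZE) {
        return -3;
    }
    if (last_comp > FDT_SUPPORTED_VERSION) {
        return -4;
    }
    if (off_struct >= totalsize || off_strings >= totalsize) {
        return -5;
    }
    if (totalsize_out != 0) {
        *totalsize_out = totalsize;
    }
    return 0;
}
```

Knowing `totalsize` early is useful for memory management too: the kernel must avoid overwriting the blob until it has finished parsing it.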


Raspberry Pi Specific


Raspberry Pi 3 Boot


On the Raspberry Pi, the GPU runs the bootloader:


1. GPU loads bootcode.bin
2. GPU loads start.elf (GPU firmware)
3. GPU loads kernel8.img (64-bit kernel)
4. GPU starts ARM cores
5. Kernel runs

config.txt for bare metal:

kernel=kernel8.img
arm_64bit=1

Raspberry Pi UART


// BCM2837 (Raspberry Pi 3) Mini UART
#define AUX_ENABLES     (*(volatile uint32_t *)(0x3F215004))
#define AUX_MU_IO_REG   (*(volatile uint32_t *)(0x3F215040))
#define AUX_MU_LSR_REG  (*(volatile uint32_t *)(0x3F215054))

void rpi_uart_init(void) {
    AUX_ENABLES = 1;  // Enable mini UART
    // A complete driver would also route GPIO 14/15 to the mini UART
    // and program the baud rate and data size before transmitting
}

void rpi_uart_putc(char c) {
    while (!(AUX_MU_LSR_REG & 0x20));  // Wait for TX ready
    AUX_MU_IO_REG = c;
}
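
One thing the minimal init above leaves out is the baud rate. The mini UART derives it from the core clock as baud = clock / (8 * (reg + 1)), where reg is the value written to the baud rate register. The sketch below inverts that formula; the function name is hypothetical, and the 250 MHz core clock used in the example is an assumption that depends on firmware settings.

```c
#include <stdint.h>

/* Value for the mini UART baud rate register.
 * From baud = clock / (8 * (reg + 1)) it follows that
 * reg = clock / (8 * baud) - 1, using integer division.
 * Returns 0 for inputs that would divide by zero or underflow. */
uint32_t mini_uart_baud_reg(uint32_t core_clock_hz, uint32_t baud) {
    if (baud == 0) {
        return 0;
    }
    uint32_t div = core_clock_hz / (8u * baud);
    return div ? div - 1 : 0;
}
```

With a 250 MHz core clock, `mini_uart_baud_reg(250000000, 115200)` yields 270. Because the mini UART clock follows the core clock, a firmware setting that changes the core frequency also changes the effective baud rate.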


Key Concepts


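  • A 32-bit ARM kernel typically starts in supervisor mode; an AArch64 kernel typically starts at EL2 or EL1, depending on the firmware
  • UART is memory-mapped, not port-based
  • The MMU uses a different page table format than x86
  • Exception vectors sit at 0x00000000 or 0xFFFF0000 (cores with the Security Extensions can relocate them through VBAR)
  • The VIC (Vectored Interrupt Controller) manages interrupts on the Versatile PB; most newer platforms use a GIC instead
  • Device tree describes platform hardware
  • AArch64 uses 4-level page tables similar to x64 (with a 4KB granule and 48-bit virtual addresses)
  • No BIOS - bootloader responsibilities differ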

Common Mistakes


  1. Wrong base addresses - Each platform has different peripheral addresses
  2. Endianness confusion - ARM can be little or big endian
  3. Cache and TLB maintenance - Not invalidating caches and TLBs when setting up the MMU
  4. Alignment - Unaligned accesses can fault, especially to device memory or while the MMU is off
  5. Missing memory barriers - ARM has relaxed memory model
  6. Wrong exception return - Use subs pc, lr, #4 for IRQ
  7. Forgetting device tree - Real hardware needs proper device enumeration

Debugging Tips


  • Use UART early - First thing to get working
  • QEMU is your friend - Test before real hardware; Versatile PB is well-supported
  • GDB multiarch - Use gdb-multiarch for ARM
  • Check alignment - Unaligned accesses can fault on ARM
  • Memory barriers - Use dmb, dsb, isb appropriately
  • Read manuals - ARM Architecture Reference Manual is essential

Mini Exercises


  1. Create a basic ARM kernel that prints to UART
  2. Implement simple printf for UART
  3. Set up MMU with identity mapping
  4. Create exception handlers for all vectors
  5. Initialize timer interrupt
  6. Implement basic keyboard/UART input
  7. Parse device tree to find UART address
  8. Port kernel to Raspberry Pi
  9. Implement AArch64 boot code
  10. Add multi-core support (boot secondary cores)

Review Questions


  1. How does the ARM boot process differ from x86?
  2. What is a device tree and why is it used?
  3. How do you enable the MMU on ARMv7?
  4. What are the ARM exception vectors?
  5. How does UART differ between ARM and x86?

Reference Checklist


By the end of this chapter, you should be able to:

  • Set up ARM cross-compilation toolchain
  • Write ARM boot assembly code
  • Initialize UART for serial output
  • Set up ARM MMU (ARMv7)
  • Handle ARM exceptions and interrupts
  • Initialize interrupt controller (VIC)
  • Set up timer interrupts
  • Understand device trees
  • Port kernel between ARM platforms
  • Use QEMU for ARM kernel testing

Next Steps


With both x86/x64 and ARM kernel experience, the next chapter explores Unix, Linux, and shell scripting. You'll learn Linux system programming, shell scripting for automation, and how to interact with the Linux kernel from user space.




Key Takeaway: ARM kernel development differs from x86 in boot process, memory management, and peripheral access. Understanding these differences and using device trees enables you to write kernels for the vast ARM ecosystem.