Virtual Memory#

Virtual memory is the operating-system layer that gives each process its own address space. It lets the kernel separate process memory from physical memory and decide which pages should be resident, shared, protected, or backed by storage.

What is Virtual Memory?#

Virtual memory gives a process a private view of memory that does not directly match the machine’s physical memory.

This supports several operating-system goals:

  • Programs can run even when their total virtual memory use is larger than available physical memory.

  • File I/O can be buffered and cached in memory.

  • Processes can share pages, such as executable code or shared library text.

  • The kernel can enforce memory protection between processes.

  • Virtual addresses can be translated to physical addresses by hardware and kernel-managed tables.

Memory Management Units#

The memory management unit, or MMU, is the hardware that translates virtual addresses into physical addresses.

The MMU also enforces memory protection and works with structures such as page tables and the translation lookaside buffer. In modern systems, the MMU is usually part of the CPU. When a process accesses memory, the CPU and MMU use the current process’s page tables to decide where the access goes and whether it is allowed.

Pages and Page Tables#

A page is a fixed-size block of virtual memory. A page table records how virtual pages map to physical pages.

Every process appears to have access to a large address space, but only some of that address space is actually mapped and resident at any given time. The operating system maintains page tables so the MMU can translate virtual addresses and enforce permissions.

Page Table Size#

Page size affects page-table overhead.

If a 32-bit system used 128-byte pages, it would need more than 33 million page entries to describe a 4 GB address space. Even with very compact entries, that overhead would be too large for each process.

With 4 KB pages, a 4 GB address space has 1,048,576 virtual pages. On 32-bit x86, each page-table entry is 32 bits, so a flat page table would still require about 4 MB per process. That is too much overhead for a small process.

Page Table Entries#

A page-table entry stores the physical page number and attributes for a virtual page.

For a 4 KB page in a 32-bit address space, 20 bits identify the page and 12 bits remain for flags and other information. Common attributes include whether the page is present, writable, executable, and modified.

Multi-Level Page Tables#

Multi-level page tables reduce memory overhead by allocating lower-level tables only for address ranges that are actually used.

In a two-level page table, part of the virtual address selects a first level entry, and another part selects a second level entry. The first level covers large regions. The second level is created only when a process actually uses pages in that region.

Two-Level Page Table#

The two-level page-table diagram shows how a virtual address can be split into indexes and an offset.

Two-level page table

MMU Address Translation Algorithm#

Address translation first checks the TLB. If the translation is not cached, the MMU and kernel page-table structures are used to find the physical page.

physical_address translate(virtual_address v_addr) {
   physical_address addr;
   if(tlb.contains(v_addr)) {
      addr = tlb[v_addr];
   } else {
      first10 = (v_addr >> 0x16) & 0xa;
      second10 = (v_addr >> 0xc) & 0xa;
      level_1 = page_table_1[first10];
      entry = level_1[second10];
      physical_page = (v_addr >> 0xc) & 0x14;
      if(physical_page >> RESIDENT_OFFSET & 0x01 == 0) {
         generate page fault;
      }
      addr = (physical_page << 0xc) & (v_addr & 0xc);
      tlb[v_addr] = addr;
   }
   return addr;
}

Key points:

  • The TLB caches recent virtual-to-physical translations.

  • A TLB miss requires a page-table lookup.

  • Page-table entries contain both page addresses and permission or status bits.

  • A nonresident page causes a page fault.

  • Once translation succeeds, the result can be cached in the TLB.

Page Faults#

A page fault is an exception raised when a memory reference cannot be completed using the current page-table entry.

Common causes include accessing a page that is not resident in physical memory, writing to a read-only page, or executing from a non-executable page. The operating system decides whether the fault can be repaired or whether the process must be notified or terminated.

Page Faults in UNIX#

UNIX systems usually handle nonresident pages by loading the page and retrying the instruction.

If the page cannot be loaded because memory is exhausted, the process may fail or the system may take stronger action. If the access violates permissions, the kernel normally sends SIGSEGV to the process.

Page Faults in Windows#

Windows also handles nonresident pages by loading the page and retrying the instruction when possible.

If the fault cannot be repaired, Windows raises an exception in the faulting process. Permission faults are also reported through the exception mechanism.

Page Replacement and Swapping#

Page replacement is the policy for choosing which physical page to reuse when memory is needed.

The swapper moves pages between physical memory and slower backing storage. It handles heap and stack pages, and it may also demand-page program text so execution can begin before the full program is loaded.

When the Swapper Runs#

The swapper runs when a needed virtual page is not resident or when the system needs more free physical pages.

The operating system may also reclaim memory when a page would be more useful for another process, the filesystem cache, or another kernel purpose. The swapper is therefore part of both memory protection and overall system performance.

Physical Page Contenders#

Many uses compete for physical pages.

Important contenders include the filesystem cache, shared memory regions, memory mapped files, executable and library text, process stacks, process heaps, and device DMA buffers. Some DMA regions must not be swapped because a device is using the physical address directly.

Swapper Algorithms#

A page replacement algorithm tries to minimize expensive page faults.

Useful measures include the total number of page faults, the page faults an optimal algorithm would have produced, and the working set of pages a process is actively using. No real algorithm can predict the future, so replacement policies approximate likely future behavior from recent history.

Page Classification#

Many processors record whether a page has been referenced or modified.

These bits divide pages into four classes:

  • Not referenced and not modified.

  • Not referenced and modified.

  • Referenced and not modified.

  • Referenced and modified.

The operating system can clear reference bits periodically to estimate which pages have been used recently.

Not Recently Used#

Not Recently Used, or NRU, evicts pages from the lowest available page class.

NRU is simple. It prefers pages that have not been referenced and have not been modified. It is inexpensive, but it is only a rough estimate of future page use.

FIFO Replacement#

First In First Out, or FIFO, evicts the oldest resident page.

FIFO assumes the oldest page is least likely to be used again. That assumption is often wrong, so FIFO is rarely used directly in production systems.

Second-Chance FIFO#

Second-chance FIFO improves FIFO by checking the reference bit before evicting a page.

If the oldest page has not been referenced, it can be evicted. If it has been referenced, the algorithm clears the reference bit and gives the page another chance. This avoids evicting heavily used pages simply because they are old.

Clock Replacement#

The clock algorithm is an efficient implementation of second-chance replacement.

It stores pages in a circular list and keeps a pointer to the next candidate. If the pointed-to page has a clear reference bit, it is evicted. If the bit is set, the algorithm clears it and advances the pointer.

Least Recently Used#

Least Recently Used, or LRU, evicts the page that has not been used for the longest time.

LRU is often close to optimal because programs tend to reuse pages they used recently. A perfect LRU implementation is expensive because the kernel would need to update ordering information on every memory access.

Not Frequently Used#

Not Frequently Used, or NFU, approximates LRU with counters.

At each clock interrupt, the operating system scans pages and increments a counter for each page whose reference bit is set. On a page fault, a page with a low counter is a candidate for eviction.

NFU is not forgetful enough. A page that was used heavily in the past can keep a high counter long after it stops being useful.

Aging#

Aging improves NFU by making old references matter less over time.

At each clock interrupt, the operating system shifts each counter right. If the page was referenced, it sets the most significant bit. The result is a small history of recent use. Aging still has limited precision, but it tracks recency better than plain NFU.

Belady’s Anomaly#

Belady’s anomaly is the surprising result that adding more physical pages can increase page faults for some replacement algorithms and access patterns.

For example, FIFO with the reference string 3 2 1 0 3 2 4 3 2 1 0 4 can produce 9 page faults with 3 physical pages and 10 page faults with 4 physical pages. More memory does not always help FIFO.

Belady’s Anomaly Diagram#

The diagram shows a reference string where FIFO performs worse with more physical pages.

Belady's anomaly

Modeling Page Replacement#

Page replacement algorithms can be modeled with a reference string and a fixed number of physical pages.

The reference string is the ordered list of page accesses. The model tracks which pages are in physical memory and which pages have been referenced but are not resident. On each reference, the algorithm either finds the page in memory or handles a page fault.

Modeling LRU#

LRU has the stack property: with more physical pages, the set of pages in memory is always a superset of the set that would be present with fewer physical pages.

LRU reference string

This property means LRU is not subject to Belady’s anomaly. Adding more physical pages cannot make LRU perform worse for the same reference string.

Distance Strings#

A distance string records how far a referenced page is from the top of the algorithm’s conceptual stack.

Pages not yet referenced have infinite distance. The distance string can estimate page faults for different physical memory sizes by counting how often the distance exceeds the number of available frames.

Fm = Sum(k = m+1, n, Ck) + Cinf

Here, Ck is the number of times distance k occurs, Cinf is the number of infinite distances, m is the number of physical pages, and n is the number of virtual pages.

Distance String Example#

The distance-string counts can be read from the LRU model.

LRU reference string

In this example, C1 = 4, C2 = 2, C3 = 1, C4 = 3, C5 = 2, C6 = 2, C7 = 1, and Cinf = 8. Therefore:

F1 = 2 + 1 + 3 + 2 + 2 + 1 + 8 = 19
F2 = 1 + 3 + 2 + 2 + 1 + 8 = 17
F5 = 2 + 1 + 8 = 11
F6 = 1 + 8 = 9

Design Considerations for Paging Systems#

A naive paging system could start a process with none of its pages in memory, but that would cause many page faults at startup.

Real systems try to avoid unnecessary faults. They use locality, read-ahead, working-set estimates, and page replacement policies to keep the pages a process is likely to use in memory.

Working Sets#

A working set is the set of pages a process is actively using during a period of execution.

Many programs show locality of reference. If a process is using one page now, it is likely to use the same page or nearby pages soon. Stacks, heaps, loops, and data structures often create these patterns.

Taking Advantage of Locality#

Operating systems use locality to reduce page faults.

They may load adjacent pages from executable files or libraries, continue loading pages asynchronously after a fault, and prefer evicting pages far from the current working set. These techniques trade a little extra work now for fewer faults later.

Costs of Paging Page Classes#

Not all page evictions cost the same amount.

Unmodified pages can usually be dropped and reloaded later from their original backing store. Modified pages must be written to swap or another backing store before their physical page can be reused.

Reducing Paging Cost#

The kernel can reduce future paging cost by cleaning modified pages in the background.

When the disk is otherwise idle, the system may write dirty pages that are likely eviction candidates. This makes later eviction cheaper because the page is already clean. Replacement policies also prefer clean pages when the performance tradeoff is reasonable.

Local and Global Paging#

Page replacement can be local or global.

In global replacement, a faulting process may cause the kernel to evict a page from any process. In local replacement, the kernel prefers to evict pages owned by the process that faulted. Local replacement can improve fairness, while global replacement can improve overall memory use.

Page Locking#

Page locking pins a page in physical memory so the swapper cannot evict it.

Pinned pages are needed for operations such as DMA, where a device reads from or writes to a specific physical memory region. During the DMA operation, the kernel must ensure that the physical page remains present and stable.

Copy on Write#

Copy on write, or COW, lets related processes share pages until one of them writes to a shared page.

After fork() or clone(), the kernel can copy page-table entries instead of copying every page. The shared entries are marked read-only. When the parent or child writes to one of those pages, the write causes a page fault. The kernel then copies the physical page, updates the faulting process’s page table, and marks the new page writable.

Backing Store#

Backing store is the persistent or recoverable location that backs a virtual page.

For heap, stack, and data pages, the backing store is usually a swap file or swap partition. Simpler systems often prefer swap partitions because the kernel can address the disk directly. Modern systems can also use swap files efficiently because the filesystem can provide stable block locations.

Hibernation#

Hibernation stores the machine’s memory state so the system can power down and later resume.

To hibernate, the operating system writes used physical pages to a hibernation file or swap area. On the next boot, the kernel detects the hibernated state, restores enough core state to resume, rebuilds page tables, and pages data back in as needed.

Hot Memory#

Hot memory is free memory kept available so the system can satisfy small bursts of memory demand without immediate page replacement.

Modern systems often keep some physical pages free even though using all memory for cache might seem optimal. This reserve reduces jitter when processes allocate memory, files are read, or the filesystem cache grows and shrinks.

Hot Memory Example#

The following diagram shows memory use on a long-running Linux system.

Hot memory

The example includes filesystem cache, software memory, buffers, swapped pages, and a small amount of free or recently freed memory. That free area gives the kernel room to respond quickly to new allocation requests.

Summary: Page Fault Handling#

Page fault handling is the path from a faulting memory instruction back to a runnable process.

The usual sequence is:

  1. The hardware traps into the kernel and saves the process state.

  2. The kernel identifies the faulting address and the reason for the fault.

  3. If the access is invalid, the kernel signals or terminates the process.

  4. If the access is valid, the kernel finds or creates a free physical page.

  5. If a dirty page must be evicted, the kernel schedules it to be written to backing storage.

  6. The kernel loads the needed page from backing storage if needed.

  7. The page-table entry is updated to mark the page resident with the correct permissions.

  8. The saved registers are restored.

  9. The faulting instruction is retried, and the process becomes runnable again.

Linux Kernel Module Case Study#

This case study shows the smallest structure of a Linux loadable kernel module.

 1
 2
 3#include <linux/module.h>
 4
 5
 6
 7/* Defines the license for this LKM */
 8
 9MODULE_LICENSE("GPL");
10
11
12
13/* Init function called on module entry */
14
15int my_module_init( void )
16
17{
18
19    printk(KERN_INFO "my_module_init called.  Module is now loaded.\n");
20
21
22
23    return 0;
24
25}
26
27
28
29/* Cleanup function called on module exit */
30
31void my_module_cleanup( void )
32
33{
34
35    printk(KERN_INFO "my_module_cleanup called.  Module is now unloaded.\n");
36
37
38
39    return;
40
41}
42
43
44
45/* Declare entry and exit functions */
46
47module_init( my_module_init );
48
49module_exit( my_module_cleanup );

Key points:

  • MODULE_LICENSE() declares the module license to the kernel.

  • module_init() registers the function called when the module loads.

  • module_exit() registers the function called when the module unloads.

  • printk() writes messages to the kernel log rather than standard output.

Linux /proc Module Case Study#

This case study shows a Linux kernel module that exposes state through a /proc entry.

  1#include <linux/module.h>
  2#include <linux/kernel.h>
  3#include <linux/proc_fs.h>
  4#include <linux/string.h>
  5#include <linux/vmalloc.h>
  6#include <asm/uaccess.h>
  7
  8/*
  9 * original code by:
 10 * http://www.ibm.com/developerworks/linux/library/l-proc/index.html
 11 *
 12 * fixes by: Joe Kaylor
 13 *
 14 */
 15MODULE_LICENSE("GPL");
 16MODULE_DESCRIPTION("Fortune Cookie Kernel Module");
 17MODULE_AUTHOR("M. Tim Jones");
 18
 19#define MAX_COOKIE_LENGTH       PAGE_SIZE
 20
 21static struct proc_dir_entry *proc_entry;
 22static char *cookie_pot;  // Space for fortune strings
 23static int cookie_index;  // Index to write next fortune
 24static int next_fortune;  // Index to read next fortune
 25
 26static const struct file_operations fops =
 27{
 28    .owner = THIS_MODULE
 29};
 30
 31ssize_t fortune_write(struct file *filp, const char __user *buff, unsigned long len, void *data )
 32{
 33    int space_available = (MAX_COOKIE_LENGTH-cookie_index)+1;
 34
 35    if (len > space_available)
 36    {
 37        printk(KERN_INFO "fortune: cookie pot is full!\n");
 38        return -ENOSPC;
 39    }
 40
 41    if (copy_from_user( &cookie_pot[cookie_index], buff, len ))
 42    {
 43        return -EFAULT;
 44    }
 45
 46    cookie_index += len;
 47
 48    cookie_pot[cookie_index-1] = 0;
 49
 50    return len;
 51}
 52
 53int fortune_read( char *page, char **start, off_t off, int count, int *eof, void *data )
 54{
 55    int len;
 56    if (off > 0)
 57    {
 58        *eof = 1;
 59        return 0;
 60    }
 61
 62    /* Wrap-around */
 63
 64    if (next_fortune >= cookie_index)
 65    {
 66        next_fortune = 0;
 67    }
 68    len = sprintf(page, "%s\n", &cookie_pot[next_fortune]);
 69    next_fortune += len;
 70
 71    return len;
 72}
 73
 74int init_fortune_module( void )
 75{
 76    int ret = 0;
 77    cookie_pot = (char *)vmalloc( MAX_COOKIE_LENGTH );
 78
 79    if (!cookie_pot)
 80    {
 81        ret = -ENOMEM;
 82    }
 83    else
 84    {
 85        memset( cookie_pot, 0, MAX_COOKIE_LENGTH );
 86
 87        proc_entry = create_proc_entry( "fortune", 0666, NULL );
 88
 89        if (proc_entry == NULL)
 90        {
 91            ret = -ENOMEM;
 92            vfree(cookie_pot);
 93            printk(KERN_INFO "fortune: Couldn't create proc entry\n");
 94        }
 95        else
 96        {
 97            cookie_index = 0;
 98            next_fortune = 0;
 99            proc_entry->read_proc = fortune_read;
100            proc_entry->write_proc = fortune_write;
101            proc_entry->size = MAX_COOKIE_LENGTH;
102            printk(KERN_INFO "fortune: Module loaded.\n");
103        }
104    }
105    return ret;
106}
107
108void cleanup_fortune_module( void )
109{
110    remove_proc_entry("fortune", &proc_entry);
111    vfree(cookie_pot);
112    printk(KERN_INFO "fortune: Module unloaded.\n");
113}
114
115
116module_init( init_fortune_module );
117module_exit( cleanup_fortune_module );
118

Key points:

  • The module allocates kernel memory for stored fortune strings with vmalloc().

  • create_proc_entry() creates the /proc/fortune interface.

  • copy_from_user() copies data safely from a user buffer into kernel memory.

  • The read handler returns one stored fortune at a time.

  • Cleanup removes the proc entry and frees the allocated memory.

Linux Character Device Case Study#

This case study shows how a Linux character device exposes kernel code through the file interface.

  1/*
  2 *  example from htttp://tldbp.org/LDP/lkmpg/2.6/html/x569.html
  3 *
  4 *  Various fixes and updates by Joe Kaylor
  5 *
  6 */
  7
  8
  9/*
 10 *  chardev.c: Creates a read-only char device that says how many times
 11 *  you've read from the dev file
 12 */
 13
 14#include <linux/kernel.h>
 15#include <linux/module.h>
 16#include <linux/fs.h>
 17#include <asm/uaccess.h>	/* for put_user */
 18
 19/*
 20 *  Prototypes - this would normally go in a .h file
 21 */
 22int init_module(void);
 23void cleanup_module(void);
 24static int device_open(struct inode *, struct file *);
 25static int device_release(struct inode *, struct file *);
 26static ssize_t device_read(struct file *, char *, size_t, loff_t *);
 27static ssize_t device_write(struct file *, const char *, size_t, loff_t *);
 28
 29#define SUCCESS 0
 30#define DEVICE_NAME "chardev"	/* Dev name as it appears in /proc/devices   */
 31#define BUF_LEN 80		/* Max length of the message from the device */
 32
 33/*
 34 * Global variables are declared as static, so are global within the file.
 35 */
 36
 37static int Major;		/* Major number assigned to our device driver */
 38static int Device_Open = 0;	/* Is device open?
 39				 * Used to prevent multiple access to device */
 40static char msg[BUF_LEN];	/* The msg the device will give when asked */
 41static char *msg_Ptr;
 42
 43static struct file_operations fops =
 44{
 45    .read = device_read,
 46    .write = device_write,
 47    .open = device_open,
 48    .release = device_release
 49};
 50
 51/*
 52 * This function is called when the module is loaded
 53 */
 54int init_module(void)
 55{
 56    Major = register_chrdev(0, DEVICE_NAME, &fops);
 57
 58    if (Major < 0)
 59    {
 60        printk(KERN_ALERT "Registering char device failed with %d\n", Major);
 61        return Major;
 62    }
 63
 64    printk(KERN_INFO "I was assigned major number %d. To talk to\n", Major);
 65    printk(KERN_INFO "the driver, create a dev file with\n");
 66    printk(KERN_INFO "'mknod /dev/%s c %d 0'.\n", DEVICE_NAME, Major);
 67    printk(KERN_INFO "Try various minor numbers. Try to cat and echo to\n");
 68    printk(KERN_INFO "the device file.\n");
 69    printk(KERN_INFO "Remove the device file and module when done.\n");
 70
 71    return SUCCESS;
 72}
 73
 74/*
 75 * This function is called when the module is unloaded
 76 */
 77void cleanup_module(void)
 78{
 79    /*
 80     * Unregister the device
 81     */
 82    /*int ret = */unregister_chrdev(Major, DEVICE_NAME);
 83    /*if (ret < 0)
 84    	printk(KERN_ALERT "Error in unregister_chrdev: %d\n", ret);*/
 85}
 86
 87/*
 88 * Methods
 89 */
 90
 91/*
 92 * Called when a process tries to open the device file, like
 93 * "cat /dev/mycharfile"
 94 */
 95static int device_open(struct inode *inode, struct file *file)
 96{
 97    static int counter = 0;
 98
 99    if (Device_Open)
100        return -EBUSY;
101
102    Device_Open++;
103    sprintf(msg, "I already told you %d times Hello world!\n", counter++);
104    msg_Ptr = msg;
105    try_module_get(THIS_MODULE);
106
107    return SUCCESS;
108}
109
110/*
111 * Called when a process closes the device file.
112 */
113static int device_release(struct inode *inode, struct file *file)
114{
115    Device_Open--;		/* We're now ready for our next caller */
116
117    /*
118     * Decrement the usage count, or else once you opened the file, you'll
119     * never get get rid of the module.
120     */
121    module_put(THIS_MODULE);
122
123    return 0;
124}
125
126/*
127 * Called when a process, which already opened the dev file, attempts to
128 * read from it.
129 */
130static ssize_t device_read(struct file *filp,	/* see include/linux/fs.h   */
131                           char *buffer,	/* buffer to fill with data */
132                           size_t length,	/* length of the buffer     */
133                           loff_t * offset)
134{
135    /*
136     * Number of bytes actually written to the buffer
137     */
138    int bytes_read = 0;
139
140    /*
141     * If we're at the end of the message,
142     * return 0 signifying end of file
143     */
144    if (*msg_Ptr == 0)
145        return 0;
146
147    /*
148     * Actually put the data into the buffer
149     */
150    while (length && *msg_Ptr)
151    {
152
153        /*
154         * The buffer is in the user data segment, not the kernel
155         * segment so "*" assignment won't work.  We have to use
156         * put_user which copies data from the kernel data segment to
157         * the user data segment.
158         */
159        put_user(*(msg_Ptr++), buffer++);
160
161        length--;
162        bytes_read++;
163    }
164
165    /*
166     * Most read functions return the number of bytes put into the buffer
167     */
168    return bytes_read;
169}
170
171/*
172 * Called when a process writes to dev file: echo "hi" > /dev/hello
173 */
174static ssize_t
175device_write(struct file *filp, const char *buff, size_t len, loff_t * off)
176{
177    printk(KERN_ALERT "Sorry, this operation isn't supported.\n");
178    return -EINVAL;
179}

Key points:

  • register_chrdev() asks the kernel for a major device number.

  • The file_operations structure connects file operations to driver functions.

  • device_open() prepares the message returned by the device.

  • put_user() copies bytes from kernel memory to the user buffer.

  • unregister_chrdev() removes the device registration when the module unloads.

Linux System Call Case Study#

This case study shows the pieces needed to add and test a simple Linux system call.

 1diff -r linux-source-2.6.38-orig//arch/x86/ia32/ia32entry.S linux-source-2.6.38//arch/x86/ia32/ia32entry.S
 2853a854
 3> 	.quad sys_addsyscall
 4diff -r linux-source-2.6.38-orig//arch/x86/include/asm/unistd_32.h linux-source-2.6.38//arch/x86/include/asm/unistd_32.h
 5348a349
 6> #define __NR_addsyscall		341
 7352c353
 8< #define NR_syscalls 341
 9---
10> #define NR_syscalls 342
11diff -r linux-source-2.6.38-orig//arch/x86/include/asm/unistd_64.h linux-source-2.6.38//arch/x86/include/asm/unistd_64.h
12671a672,673
13> #define __NR_addsyscall				303
14> __SYSCALL(__NR_addsyscall, sys_addsyscall)
15diff -r linux-source-2.6.38-orig//arch/x86/kernel/syscall_table_32.S linux-source-2.6.38//arch/x86/kernel/syscall_table_32.S
16342a343,344
17> 	.long sys_addsyscall
18> 

Key points:

  • The patch adds a system call number for the new call.

  • The syscall table is updated so the kernel can dispatch to the new handler.

  • The public syscall declaration is added to syscalls.h.

  • The implementation adds two integer arguments and returns the result.

 1#include <unistd.h>
 2#include <stdio.h>
 3
 4#define __NR_addsyscall (303)
 5
 6long addsyscall(int x, int y)
 7{
 8    return syscall(__NR_addsyscall, x, y);
 9}
10
11int main(int argc, char* argv[])
12{
13    int result = (int)addsyscall(2, 3);
14    printf("result = %d\n", result);
15    if(result == (-1))
16    {
17        perror("error");
18    }
19    return 0;
20}

Key points:

  • The test program defines the syscall number used by the patched kernel.

  • syscall() invokes the kernel entry directly.

  • The wrapper function gives the test a normal C function shape.

  • perror() reports failure if the syscall returns -1.

Minix Startup Case Study#

This case study shows a minimal Minix program using the System Event Framework startup path.

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <minix/syslib.h>
 4
 5int main(int argc, char **argv)
 6{
 7    sef_startup();
 8
 9    printf("Hello, World!\n");
10    return EXIT_SUCCESS;
11}

Key points:

  • sef_startup() initializes the Minix service environment.

  • The program then runs ordinary user code.

  • The example is useful as a minimal starting point for Minix service examples.

Minix Time Driver Case Study#

This case study shows a small Minix character driver that exposes the current time from CMOS through a device interface.

1#ifndef __TIME_H
2#define __TIME_H
3
4#define TIME_MAJOR (17)
5
6#endif /* __TIME_H */

Key points:

  • TIME_MAJOR defines the major number for the time device.

  • The header keeps the device number separate from the driver implementation.

  1#include <minix/drivers.h>
  2#include <minix/driver.h>
  3#include <unistd.h>
  4#include <errno.h>
  5#include <time.h>
  6#include <stdio.h>
  7#include <stdlib.h>
  8#include <minix/ds.h>
  9#include <machine/cmos.h>
 10#include "time.h"
 11
 12/*
 13 *
 14 *  Some code borrowed from /usr/src/drivers/readclock
 15 *
 16 */
 17
 18
 19FORWARD _PROTOTYPE( char * time_name,   (void) );
 20FORWARD _PROTOTYPE( int time_open,      (struct driver *d, message *m) );
 21FORWARD _PROTOTYPE( int time_close,     (struct driver *d, message *m) );
 22FORWARD _PROTOTYPE( struct device * time_prepare, (int device) );
 23FORWARD _PROTOTYPE( void time_from_cmos, (char* buffer, int size));
 24FORWARD _PROTOTYPE( int time_transfer,  (int procnr, int opcode,
 25                    u64_t position, iovec_t *iov,
 26                    unsigned nr_req) );
 27FORWARD _PROTOTYPE( void time_geometry, (struct partition *entry) );
 28
 29/* SEF functions and variables. */
 30FORWARD _PROTOTYPE( void sef_local_startup, (void) );
 31FORWARD _PROTOTYPE( int sef_cb_init, (int type, sef_init_info_t *info) );
 32
 33/* Entry points to the time driver. */
 34PRIVATE struct driver time_tab =
 35{
 36    time_name, time_open, time_close,
 37    nop_ioctl, time_prepare, time_transfer,
 38    nop_cleanup, time_geometry, nop_alarm,
 39    nop_cancel, nop_select, nop_ioctl, do_nop,
 40};
 41
 42/** Represents the /dev/time device. */
 43PRIVATE struct device time_device;
 44
 45PRIVATE char * time_name(void)
 46{
 47    printf("time_name()\n");
 48    return "time";
 49}
 50
 51PRIVATE int time_open(d, m)
 52struct driver *d;
 53message *m;
 54{
 55    printf("time_open()\n");
 56    return OK;
 57}
 58
 59PRIVATE int time_close(d, m)
 60struct driver *d;
 61message *m;
 62{
 63    printf("time_close()\n");
 64    return OK;
 65}
 66
 67PRIVATE struct device * time_prepare(dev)
 68int dev;
 69{
 70    time_device.dv_base.lo = 0;
 71    time_device.dv_base.hi = 0;
 72    time_device.dv_size.lo = 0;
 73    time_device.dv_size.hi = 0;
 74    return &time_device;
 75}
 76
 77void write_register(int reg_addr, int value)
 78{
 79    if(sys_outb(RTC_INDEX, reg_addr) != OK)
 80    {
 81        printf("cmos: outb failed of %x\n", RTC_INDEX);
 82        exit(1);
 83    }
 84    if(sys_outb(RTC_IO, value) != OK)
 85    {
 86        printf("cmos: outb failed of %x (index %x)\n", RTC_IO, reg_addr);
 87        exit(1);
 88    }
 89}
 90
 91int bcd_to_dec(int n)
 92{
 93    return ((n >> 4) & 0x0F) * 10 + (n & 0x0F);
 94}
 95
 96int dec_to_bcd(int n)
 97{
 98    return ((n / 10) << 4) | (n % 10);
 99}
100
101int read_register(int reg_addr)
102{
103    u32_t r;
104
105    if(sys_outb(RTC_INDEX, reg_addr) != OK)
106    {
107        printf("cmos: outb failed of %x\n", RTC_INDEX);
108        exit(1);
109    }
110    if(sys_inb(RTC_IO, &r) != OK)
111    {
112        printf("cmos: inb failed of %x (index %x) failed\n", RTC_IO, reg_addr);
113        exit(1);
114    }
115    return r;
116}
117
118int nflag = 0;
119int wflag = 0;
120int Wflag = 0;
121int y2kflag = 0;
122
123void get_time(struct tm *t)
124{
125    int osec, n;
126
127    do
128    {
129        osec = -1;
130        n = 0;
131        do
132        {
133            /* Clock update in progress? */
134            if (read_register(RTC_REG_A) & RTC_A_UIP) continue;
135
136            t->tm_sec = read_register(RTC_SEC);
137            if (t->tm_sec != osec)
138            {
139                /* Seconds changed.  First from -1, then because the
140                 * clock ticked, which is what we're waiting for to
141                 * get a precise reading.
142                 */
143                osec = t->tm_sec;
144                n++;
145            }
146        }
147        while (n < 2);
148
149        /* Read the other registers. */
150        t->tm_min = read_register(RTC_MIN);
151        t->tm_hour = read_register(RTC_HOUR);
152        t->tm_mday = read_register(RTC_MDAY);
153        t->tm_mon = read_register(RTC_MONTH);
154        t->tm_year = read_register(RTC_YEAR);
155
156        /* Time stable? */
157    }
158    while (read_register(RTC_SEC) != t->tm_sec
159            || read_register(RTC_MIN) != t->tm_min
160            || read_register(RTC_HOUR) != t->tm_hour
161            || read_register(RTC_MDAY) != t->tm_mday
162            || read_register(RTC_MONTH) != t->tm_mon
163            || read_register(RTC_YEAR) != t->tm_year);
164
165    if ((read_register(RTC_REG_B) & RTC_B_DM_BCD) == 0)
166    {
167        /* Convert BCD to binary (default RTC mode). */
168        t->tm_year = bcd_to_dec(t->tm_year);
169        t->tm_mon = bcd_to_dec(t->tm_mon);
170        t->tm_mday = bcd_to_dec(t->tm_mday);
171        t->tm_hour = bcd_to_dec(t->tm_hour);
172        t->tm_min = bcd_to_dec(t->tm_min);
173        t->tm_sec = bcd_to_dec(t->tm_sec);
174    }
175    t->tm_mon--;  /* Counts from 0. */
176
177    /* Correct the year, good until 2080. */
178    if (t->tm_year < 80) t->tm_year += 100;
179
180    if (y2kflag)
181    {
182        /* Clock with Y2K bug, interpret 1980 as 2000, good until 2020. */
183        if (t->tm_year < 100) t->tm_year += 20;
184    }
185}
186
187PRIVATE void time_from_cmos(buffer,size)
188char *buffer;
189int size;
190{
191    struct tm tVal;
192
193    get_time(&tVal);
194
195
196    snprintf(buffer, size,
197             "%04d-%02d-%02d %02d:%02d:%02d\n",
198             tVal.tm_year + 1900, tVal.tm_mon + 1,
199             tVal.tm_mday, tVal.tm_hour, tVal.tm_min,
200             tVal.tm_sec);
201}
202
203PRIVATE int time_transfer(proc_nr, opcode, position, iov, nr_req)
204int proc_nr;
205int opcode;
206u64_t position;
207iovec_t *iov;
208unsigned nr_req;
209{
210    int bytes, ret;
211    char buffer[1024];
212
213    printf("time_transfer()\n");
214
215    time_from_cmos(buffer, sizeof(buffer));
216
217    bytes = strlen(buffer) - position.lo < iov->iov_size
218            ? strlen(buffer) - position.lo
219            : iov->iov_size;
220    if(bytes <= 0)
221    {
222        return OK;
223    }
224    switch(opcode)
225    {
226    case DEV_GATHER_S:
227        ret = sys_safecopyto(proc_nr, iov->iov_addr, 0, (vir_bytes)(buffer+position.lo),bytes,D);
228        iov->iov_size -= bytes;
229        break;
230    default:
231        return EINVAL;
232    }
233    return ret;
234}
235
236PRIVATE void time_geometry(entry)
237struct partition *entry;
238{
239    printf("time_geometry()\n");
240    entry->cylinders = 0;
241    entry->heads     = 0;
242    entry->sectors   = 0;
243}
244
245PRIVATE void sef_local_startup()
246{
247    /*
248     * Register init callbacks. Use the same function for all event types
249     */
250    sef_setcb_init_fresh(sef_cb_init);
251    sef_setcb_init_lu(sef_cb_init);
252    sef_setcb_init_restart(sef_cb_init);
253
254    /*
255     * Register live update callbacks.
256     */
257    /* - Agree to update immediately when LU is requested in a valid state. */
258    sef_setcb_lu_prepare(sef_cb_lu_prepare_always_ready);
259    /* - Support live update starting from any standard state. */
260    sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard);
261
262    /* Let SEF perform startup. */
263    sef_startup();
264}
265
266PRIVATE int sef_cb_init(int type, sef_init_info_t *info)
267{
268    int do_announce_driver = TRUE;
269    switch(type)
270    {
271    case SEF_INIT_FRESH:
272        printf("startup\n");
273        break;
274    case SEF_INIT_LU:
275        printf("new version\n");
276        do_announce_driver = FALSE;
277        break;
278    case SEF_INIT_RESTART:
279        printf("restart\n");
280        break;
281    }
282    if(do_announce_driver)
283    {
284        driver_announce();
285    }
286    return OK;
287}
288
289PUBLIC int main(int argc, char **argv)
290{
291    /*
292     * Perform initialization.
293     */
294    sef_local_startup();
295
296    /*
297     * Run the main loop.
298     */
299    driver_task(&time_tab, DRIVER_STD);
300    return OK;
301}
302

Key points:

  • The driver table connects Minix driver operations to local functions.

  • read_register() and write_register() access CMOS registers through kernel calls.

  • get_time() reads a stable RTC value and converts BCD fields.

  • time_transfer() copies formatted time data to the requesting process.

  • The SEF callbacks handle fresh startup, live update, and restart.