Files and I/O#

UNIX systems use files as a common interface for many resources. Regular files, directories, devices, pipes, and sockets have different behavior, but they are all named in the filesystem and many of them support the same basic operations: open(), read(), write(), close(), and select().

Common attributes of all (UNIX) files#

Files live in the filesystem namespace under /. UNIX paths do not use drive letters. Each file has a name, an owning user, an owning group, and permission bits for the owner, group, and other users.

Many filesystems also store extended attributes, creation time, last access time, and modification time. Beyond these shared attributes, file types vary widely in structure and behavior.

Types of Files in Unix#

UNIX file types include regular files, symbolic links, directories, block device files, character device files, named pipes, UNIX domain sockets, and, on Solaris, doors.

Regular Files#

Regular files persist program data in a filesystem. They have ownership, permissions, and a defined size. They can normally be accessed sequentially or with random access through seek operations.

Sparse files are an important exception to the simple size model. A filesystem may represent large unwritten regions without allocating disk blocks for every byte.

Folders#

Directories map names to filesystem entries. Early UNIX implementations represented directories as special files that listed other files. Modern systems still expose some of that model, but programs modify directories through directory-specific system calls rather than by writing directory bytes directly.

Directories do not behave like regular files. The execute bit controls whether a process can traverse the directory and use it in path resolution. Read permission controls whether entries can be listed.

Block Device Files#

Block device files expose devices that operate in fixed-size blocks. Common examples are disks, optical drives, USB media, and RAM disks. Block devices usually support random access and buffered I/O.

Operating systems expose block devices through special filesystems or device-management tools. Linux has used several mechanisms over time, including device files and special filesystems such as devfs and sysfs.

Character Device Files#

Character device files expose stream-oriented devices. Common examples include terminals, serial ports, modems, network interfaces, sound devices, video devices, and tape drives.

Most character devices do not support random access. When they do, seeking is often expensive or device-specific.

Named Pipes/FIFOs#

Named pipes are pipes with names in the filesystem. They let unrelated processes communicate even when the processes do not share a parent-child relationship. We cover pipes in more detail in the IPC chapter.

Unix Domain Sockets#

UNIX domain sockets are sockets with names in the filesystem. They are similar to named pipes, but can support stream or datagram communication. Unlike network sockets, they do not use TCP/IP or UDP/IP.

Filesystem System Calls#

Most UNIX systems have many file-oriented system calls. The basic pattern is simple: open a resource, perform operations on its file descriptor, and close the descriptor when finished.

Filesystem System Calls#

Filesystem Calls#

Function

Description

open()

opens or creates files and returns a file descriptor

creat()

creates new files

close()

closes a file descriptor

lseek()

updates a file descriptor’s current file offset

read()

reads data from a file descriptor into a buffer

write()

writes data from a buffer to a file descriptor

dup()

duplicates one file descriptor

dup2()

updates a file descriptor to point to another one

fcntl()

changes file properties such as asynchronous I/O and file locks

ioctl()

handles device-specific control operations

stat()

returns permissions, size, timestamps, and other metadata

access()

tests for read, write, execute, or existence of a file

umask()

updates the file creation mask

chmod()

updates permission bits

More Filesystem System Calls#

Filesystem Calls#

Function

Description

chown()

changes file user or group ownership

truncate()

changes the length of a file

link()

creates a hard link

unlink()

removes a name from the filesystem

rmdir()

deletes an empty directory

remove()

combines unlink() and rmdir() behavior

rename()

renames a file, possibly moving it to another directory

symlink()

creates a symbolic link

readlink()

reads the value stored in a symbolic link

utime()

updates access and modification times

mkdir()

creates a directory

opendir()

opens a directory for reading

readdir()

reads the next directory entry

rewinddir()

resets directory reading to the beginning

closedir()

closes a directory descriptor

chdir()

changes the current working directory

getcwd()

gets the current working directory

sync()

flushes filesystem buffers to disk

Opening Files with open()#

open() opens a file and returns a file descriptor.

int open(const char *pathname, int flags, mode_t mode);
int open(const char *pathname, int flags);

pathname is the path to the file. flags controls how the file is opened. Common flags include O_APPEND, O_ASYNC, O_CREAT, O_DIRECT, O_SYNC, and O_TRUNC.

When O_CREAT is used, mode supplies permission bits. These are usually written in octal. For example, 0700 gives the owner read, write, and execute permission, while group and other users receive no permissions. 0660 gives the owner and group read/write permission.

The return value is a file descriptor on success and -1 on error.

Closing files with close()#

close() releases a file descriptor.

int close(int fd);

The argument is a descriptor returned by calls such as open(), dup(), or pipe(). The return value is 0 on success and -1 on failure.

Writing to a File#

write() copies bytes from a user buffer to a file descriptor.

ssize_t write(int fd, const void *buf, size_t count);

fd is the file descriptor, buf is the source buffer, and count is the number of bytes to write. A return value of -1 means an error occurred. A successful call usually returns count, but code that needs to be robust should handle partial writes.

Writing Example#

The systems-code-examples/read-write-demo directory contains small separate examples for direct read() and write() calls.

 1#include <sys/types.h>
 2#include <sys/stat.h>
 3#include <fcntl.h>
 4#include <unistd.h>
 5#include <string.h>
 6
 7int main(int argc, char* argv[])
 8{
 9    const char *data = "foobar";
10    int fd = open("file", O_CREAT | O_TRUNC | O_RDWR, 0666);
11    size_t length = strlen(data), offset = 0;
12    while(length > 0)
13    {
14        size_t written = write(fd, data + offset, length);
15        offset += written;
16        length -= written;
17    }
18    close(fd);
19}

Key points:

  • open() creates or truncates the file named file.

  • strlen() is used to compute the number of bytes to write.

  • The loop handles the possibility that write() writes fewer bytes than requested.

  • offset advances through the source string as bytes are written.

  • close() releases the descriptor after the write loop finishes.

Reading from a File#

read() copies bytes from a file descriptor into a user buffer.

ssize_t read(int fd, void *buf, size_t count);

The return value is -1 on error, 0 at end of file, and otherwise the number of bytes read. The number of bytes returned may be smaller than count.

Reading Example#

 1#include <sys/types.h>
 2#include <sys/stat.h>
 3#include <fcntl.h>
 4#include <unistd.h>
 5#include <string.h>
 6
 7int main(int argc, char* argv[])
 8{
 9
10    int fd = open("file", O_RDONLY, 0666);
11    size_t length;
12    char buffer[5];
13    while((length = read(fd, &buffer[0], 5)) != 0)
14    {
15        write(1, &buffer[0], length);
16    }
17    close(fd);
18
19}

Key points:

  • open() opens the existing file for reading.

  • The buffer is deliberately small so the loop has to run more than once for longer input.

  • read() returns 0 at end of file, which terminates the loop.

  • write(1, ...) writes the bytes to standard output.

  • This example omits detailed error handling so the read/write pattern is easy to see.

Complete File I/O Example#

The systems-code-examples/file_rw example combines creation, writing, reading, closing, and removing a file.

 1#include <unistd.h>
 2#include <fcntl.h>
 3#include <stdio.h>
 4#include <string.h>
 5
 6const char *file_data = "1234567890abcdefghijklmnopqrstuvwxyz";
 7const char *file_name = "temp.dat";
 8
 9int main(int argc, char* argv[])
10{
11
12    int fd = open(file_name, O_CREAT | O_TRUNC | O_RDWR, 0666);
13    if(fd == (-1))
14    {
15        printf("open returned (-1)\n");
16        return (-1);
17    }
18
19    size_t length = strlen(file_data);
20    size_t offset = 0;
21    while(length > 0)
22    {
23        size_t written = write(fd, file_data + offset, length);
24        if(written == (-1))
25        {
26            printf("write returned (-1)\n");
27            return (-1);
28        }
29        printf("wrote %ld bytes\n", written);
30        offset += written;
31        length -= written;
32    }
33
34    if(close(fd) == (-1))
35    {
36        printf("close returned (-1)\n");
37        return (-1);
38    }
39
40    printf("file contents written.\n");
41
42    fd = open(file_name, O_RDONLY, 0666);
43    if(fd == (-1))
44    {
45        printf("open returned (-1)\n");
46        return (-1);
47    }
48
49    char buffer[5];
50    //read in a loop until zero bytes are read
51    //zero return value means EOF
52    while((length = read(fd, &buffer[0], 5)) != 0)
53    {
54        if(length == (-1))
55        {
56            printf("read returned (-1)\n");
57            return (-1);
58        }
59        write(1, &buffer[0], length);
60    }
61    printf("\n");
62
63    if(close(fd) == (-1))
64    {
65        printf("close returned (-1)\n");
66        return (-1);
67    }
68
69    if(unlink(file_name) == (-1))
70    {
71        printf("unlink returned (-1)\n");
72        return (-1);
73    }
74
75    return 0;
76}
77
78

Key points:

  • The first open() creates temp.dat, truncating any existing file with the same name.

  • The write loop keeps going until all bytes from file_data are written.

  • The file is closed before it is reopened for reading.

  • The read loop writes each chunk to standard output as it is read.

  • unlink() removes the file’s name from the filesystem at the end of the example.

Seeking within a File#

Not all files support seeking. Regular files usually do. Pipes, sockets, and many character devices do not.

lseek() changes the current file offset for a descriptor.

off_t lseek(int fd, off_t offset, int whence);

whence is usually SEEK_SET for the beginning of the file, SEEK_CUR for the current offset, or SEEK_END for the end of the file. Seeking beyond the current end of a file can create a sparse file on filesystems that support sparse allocation.

Standard File Descriptors#

Every process starts with three standard file descriptors.

stdin

Standard input. The default descriptor value is 0.

stdout

Standard output. The default descriptor value is 1.

stderr

Standard error. The default descriptor value is 2.

The parent process may redirect any of these before the child program runs.

Duplicating File Descriptors#

dup() duplicates a file descriptor and returns a new descriptor number.

int dup(int fd);

Programs use duplicated descriptors when they need another reference to the same open file. They are also useful in shell-style redirection.

Redirecting File Descriptors#

dup2() redirects one descriptor to refer to another open file.

int dup2(int oldfd, int newfd);

If newfd is already open, dup2() closes it first. After the call, newfd refers to the same open file description as oldfd. This is how shells redirect standard input, output, and error.

Reading Folders#

Directories are read with directory-specific library calls such as opendir(), readdir(), and closedir().

Directory Traversal in C#

The systems-code-examples/dir_list example lists entries in the root directory.

 1#include <stdio.h>
 2#include <sys/types.h>
 3#include <dirent.h>
 4#include <errno.h>
 5
 6int main(int argc, char* argv[])
 7{
 8    const char *dir = "/";
 9
10    DIR *d = opendir(dir);
11
12    struct dirent *de;
13    while((de = readdir(d)) != NULL)
14    {
15        printf("name: %s\n", de->d_name);
16    }
17
18    closedir(d);
19    return 0;
20}
21
22

Key points:

  • opendir() returns a DIR* handle for reading directory entries.

  • readdir() returns one struct dirent at a time.

  • The loop ends when readdir() returns NULL.

  • closedir() releases the directory handle.

  • The example prints entry names, not full paths.

Directory Traversal in C++#

The systems-code-examples/dir_list_cpp example performs the same basic operation with C++17 filesystem support.

 1#include <iostream>
 2#include <filesystem>
 3
 4int main() {
 5    std::string dir = "/";
 6
 7    try {
 8        for (const auto& entry : std::filesystem::directory_iterator(dir)) {
 9            std::cout << "name: " << entry.path().filename().string() << "\n";
10        }
11    } catch (std::filesystem::filesystem_error& e) {
12        std::cout << "An error has occurred: " << e.what() << '\n';
13    }
14
15    return 0;
16}

Key points:

  • std::filesystem::directory_iterator wraps the directory traversal pattern in a C++ range-like interface.

  • The exception handler catches filesystem errors, such as permission problems.

  • entry.path().filename() extracts just the entry name.

  • The underlying work still depends on operating system directory operations.

Text Input#

Text-oriented programs often read a line, split it into words, and then process those words. C gives you low-level tools for this, which means programs must be careful about buffer sizes, terminators, and ownership of allocated memory.

Bounded Line Input Example#

The systems-code-examples/better_gets example implements a bounded line input function. The surrounding program also demonstrates signal and child-process handling, but the relevant part for this chapter is better_gets().

  1#include <stdio.h>
  2#include <stdlib.h>
  3#include <string.h>
  4#include <signal.h>
  5#include <unistd.h>
  6#include <sys/wait.h>
  7#include <sys/types.h>
  8
  9#define MAX_PID (1000)
 10
 11int childProcesses[MAX_PID];
 12
 13void addProcess(int childPid)
 14{
 15    printf("add(%d)\n", childPid);
 16    for(int i = 0; i < MAX_PID; i++)
 17    {
 18        if(childProcesses[i] == (-1))
 19        {
 20            childProcesses[i] = childPid;
 21            printf("added child pid = %d\n", childPid);
 22            return;
 23        }
 24    }
 25}
 26
 27void removeProcess(int childPid)
 28{
 29    printf("remove(%d)\n", childPid);
 30    for(int i = 0; i < MAX_PID; i++)
 31    {
 32        if(childProcesses[i] == childPid)
 33        {
 34            childProcesses[i] = (-1);
 35            printf("removed child pid = %d\n", childPid);
 36        }
 37    }
 38}
 39
 40int countProcesses()
 41{
 42    int count = 0;
 43    for(int i = 0; i < MAX_PID; i++)
 44    {
 45        if(childProcesses[i] >= 0)
 46        {
 47            count += 1;
 48        }
 49    }
 50    return count;
 51}
 52
 53void clean_up_child_process(int signal_number)
 54{
 55    printf("SIGCHLD recieved.\n");
 56    int status;
 57    pid_t pid;
 58    while((pid = waitpid(-1, &status, WNOHANG)) > 0)
 59    {
 60        if(pid > 0)
 61        {
 62            printf("pid %d exited\n", pid);
 63            removeProcess((int)pid);
 64        }
 65    }
 66    return;
 67}
 68
 69char input[12];
 70
 71int read_from_stdin()
 72{
 73    char buffer;
 74    int error;
 75    while((error = read(0, &buffer, 1)) == (-1))
 76    {
 77        if(error == 0)
 78        {
 79            return EOF;
 80        }
 81    }
 82    return buffer;
 83}
 84
 85void better_gets(char *buff, int len)
 86{
 87    int i;
 88    for(i = 0; i < len-1; i++)
 89    {
 90        int val = read_from_stdin();
 91        if(val == EOF)
 92        {
 93            break;
 94        }
 95        buff[i] = val;
 96        if(buff[i] == '\n')
 97        {
 98            break;
 99        }
100    }
101    buff[i] = (char)0;
102    //printf("read: [%s]\n", buff);
103}
104
105void parent()
106{
107    printf("parent continuing....\n");
108    for(int i = 0; i < 5; i++)
109    {
110        printf(">");
111        better_gets(&input[0], 12);
112        printf("you typed: %s\n", &input[0]);
113    }
114    while(countProcesses() > 0) {}
115    printf("parent exiting\n");
116}
117
118void child()
119{
120    printf("child starting.\n");
121    sleep(5);
122    printf("child exiting.\n");
123    exit(EXIT_SUCCESS);
124}
125
126int main(int argc, char** argv)
127{
128    for(int i = 0; i < MAX_PID; i++)
129    {
130        childProcesses[i] = (-1);
131    }
132
133    struct sigaction sigchld_action;
134    memset (&sigchld_action, 0, sizeof (struct sigaction));
135    sigchld_action.sa_handler = &clean_up_child_process;
136    sigaction (SIGCHLD, &sigchld_action, NULL);
137
138    for(int i = 0; i < 5; i++)
139    {
140        int childProcess = fork();
141        if(childProcess == 0)
142        {
143            child();
144            return 0;
145        }
146        else if(childProcess > 0)
147        {
148            addProcess(childProcess);
149        }
150        else
151        {
152            printf("fork failed\n");
153        }
154    }
155    parent();
156
157    return 0;
158}
159
160

Key points:

  • better_gets() accepts both a buffer and a maximum length.

  • The loop stops before the buffer is full so there is room for the null terminator.

  • Input stops at newline, end of file, or the configured length.

  • The implementation reads from descriptor 0, which is standard input.

  • The rest of the example shows why reusable input helpers matter in larger systems programs.

Tokenizing Input with strtok()#

The systems-code-examples/strtok example splits a string into tokens using whitespace delimiters.

 1#include <stdio.h>
 2#include <string.h>
 3#include <stdlib.h>
 4
 5int SplitString(char* str, char* tokens[], const int maxTokens);
 6void testSplitString();
 7void assertEqualsStr(const char *expected, const char *actual);
 8void assertEqualsInt(const int expected, const int actual);
 9
10int main(int argc, char* argv[])
11{
12    if(argc == 2 && strcmp(argv[1], "--test") == 0)
13    {
14        testSplitString();
15        printf("tests pass!\n");
16    }
17    return 0;
18}
19
20void testSplitString()
21{
22    const int maxTokens = 20;
23    char testStr[] = "hello\tworld testing 123\nhello again";
24    char* tokens[maxTokens];
25
26    int tokenCount = SplitString(testStr, tokens, maxTokens);
27
28    assertEqualsInt(6, tokenCount);
29    assertEqualsStr("hello", tokens[0]);
30    assertEqualsStr("world", tokens[1]);
31    assertEqualsStr("testing", tokens[2]);
32    assertEqualsStr("123", tokens[3]);
33    assertEqualsStr("hello", tokens[4]);
34    assertEqualsStr("again", tokens[5]);
35}
36
37void assertEqualsStr(const char* expected, const char* actual)
38{
39    if(strcmp(expected, actual) != 0)
40    {
41        printf("expected <%s> but was <%s>\n", expected, actual);
42        exit(1);
43    }
44}
45
46void assertEqualsInt(const int expected, const int actual)
47{
48    if(expected != actual)
49    {
50        printf("expected <%d> but was <%d>\n", expected, actual);
51        exit(1);
52    }
53}
54
55int SplitString(char* str, char* tokens[], const int maxTokens)
56{
57    int i = 0;
58    for(i = 0; i < maxTokens; i++)
59    {
60        char* token = strtok(str, " \t\n");
61        str = NULL;
62        tokens[i] = token;
63        if(token == NULL)
64        {
65            break;
66        }
67    }
68    return i;
69}

Key points:

  • strtok() modifies the input string by replacing delimiters with null terminators.

  • The first call receives the original string. Later calls pass NULL to continue tokenizing the same string.

  • The example stores pointers into the original string; it does not copy each token.

  • The test function makes the expected token sequence explicit.

Dynamic String Buffer#

The systems-code-examples/strbuffer_t example builds a small dynamic string buffer. It supports appending characters and returning a heap-allocated C string.

 1/*
 2 * Many students come to C from Java.
 3 *
 4 * This is a demonstration of how to create a StringBuffer lite in C.
 5 * The buffer is used by declaring a strbuffer_t, which acts as a handle for all operations.
 6 */
 7
 8#ifndef _STRINGBUFFER_H_
 9#define _STRINGBUFFER_H_
10
11typedef struct _strbuffer_t
12{
13    char* data;
14    int capacity;
15    int increment;
16    int pos;
17} strbuffer_t;
18
19#define MIN_CAPACITY 80
20#define MIN_INCREMENT 10
21
22extern void strbuffer_new(strbuffer_t* buffer);
23
24extern void strbuffer_init(strbuffer_t* buffer, int capacity, int increment);
25
26extern void strbuffer_resize_if_needed(strbuffer_t* buffer);
27
28extern void strbuffer_append(strbuffer_t* buffer, char c);
29
30extern char* strbuffer_tostring(strbuffer_t* buffer);
31
32extern void strbuffer_reset(strbuffer_t* buffer);
33
34extern char* strbuffer_getline(strbuffer_t* buffer, int *eof);
35
36
37#endif
 1#include "strbuffer.h"
 2
 3#include <stdio.h>
 4#include <string.h>
 5#include <stdlib.h>
 6
 7void strbuffer_new(strbuffer_t* buffer)
 8{
 9    strbuffer_init(buffer, MIN_CAPACITY, MIN_INCREMENT);
10}
11
12void strbuffer_init(strbuffer_t* buffer, int capacity, int increment)
13{
14    if (capacity < MIN_CAPACITY)
15        buffer->capacity = MIN_CAPACITY;
16    else buffer->capacity = capacity;
17
18    if (increment < MIN_INCREMENT)
19        buffer->increment = MIN_INCREMENT;
20    else buffer->increment = increment;
21
22    buffer->data = (char*) malloc(buffer->capacity * sizeof(char));
23    buffer->pos = 0;
24}
25
26
27void strbuffer_resize_if_needed(strbuffer_t* buffer)
28{
29    if (buffer->pos < buffer->capacity)
30        return;
31
32    int new_capacity = buffer->capacity + buffer->increment;
33    //printf("Resizing to %d\n", new_capacity);
34    buffer->data = (char*) realloc(buffer->data, new_capacity * sizeof(char));
35    buffer->capacity = new_capacity;
36}
37
38void strbuffer_append(strbuffer_t* buffer, char c)
39{
40    strbuffer_resize_if_needed(buffer);
41    buffer->data[buffer->pos++] = c;
42}
43
44char* strbuffer_tostring(strbuffer_t* buffer)
45{
46    char* repr = (char*)malloc((buffer->pos+1) * sizeof(char));
47    memcpy(repr, buffer->data, buffer->pos * sizeof(char));
48    repr[buffer->pos] = 0;
49    return repr;
50}
51
52void strbuffer_reset(strbuffer_t *buffer)
53{
54    buffer->pos=0;
55}
56
57
58char* strbuffer_getline(strbuffer_t* buffer, int *eof)
59{
60    int c;
61    *eof = 0;
62    strbuffer_reset(buffer);
63    for(;;)
64    {
65        c = getchar();
66        if (c == '\n' || c == '\r' || c == 0 || c == EOF)
67            break;
68        strbuffer_append(buffer, c);
69    }
70    if (c == EOF) *eof = 1;
71    return  strbuffer_tostring(buffer);
72}
73

Key points:

  • strbuffer_t stores the backing array, capacity, resize increment, and current position.

  • strbuffer_append() checks whether the buffer needs to grow before storing the next character.

  • realloc() is used to grow the buffer while preserving existing contents.

  • strbuffer_tostring() returns a separate heap allocation, so callers must eventually free it.

  • strbuffer_getline() builds a line one character at a time.

String Buffer Input Example#

The systems-code-examples/strbuffer-demo directory shows how the string buffer can be used to read lines and split them into words.

 1#include <stdio.h>
 2#include <string.h>
 3#include <stdlib.h>
 4
 5/*
 6 * This example shows a strbuffer_t-enabled getline() like function
 7 * and how to use strtok() to separate text with whitespace delimiters.
 8 */
 9
10#include "strbuffer.h"
11
12#define DELIMS " \n\r\t"
13
14int main(int argc, char* argv[])
15{
16    strbuffer_t buffer;
17    strbuffer_init(&buffer, 80, 10);
18    for (;;)
19    {
20        int eof;
21        char* line = strbuffer_getline(&buffer, &eof);
22        if (eof) break;
23        char* next_word = strtok(line, DELIMS);
24        while (next_word != NULL)
25        {
26            printf("word: %s\n", next_word);
27            next_word = strtok(NULL, DELIMS);
28        }
29        free(line);
30    }
31}

Key points:

  • The example uses strbuffer_getline() so line length is not fixed by a small stack buffer.

  • Each returned line is tokenized with strtok().

  • The loop exits when the input function reports end of file.

  • Each heap-allocated line is freed after it is processed.

Regular Expression Example#

The systems-code-examples/regex-demo example reads text and extracts matches using the POSIX regular expression API.

 1/* Credit: https://www.lemoda.net/c/unix-regex/ */
 2/* Credit: GNU C docs */
 3/* Credit to self for fixing some problems with handling on OS X and Linux */
 4
 5
 6#define WORD_REGEX "(\\w+)"
 7#define ALT_WORD_REGEX "([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)" // not used (yet)
 8
 9#include <stdlib.h>
10#include <stdio.h>
11#include <string.h>
12#include <regex.h>
13
14#include "strbuffer.h"
15#include "regex-match.h"
16
17void process_regex(const char* find_text, const char* regex_text);
18
19int main (int argc, char ** argv)
20{
21    int eof;
22    const char * regex_text = WORD_REGEX;
23
24    // if no arguments read text from stdin; else use argv[1] as text
25    if (argc < 2)
26    {
27        strbuffer_t buffer;
28        strbuffer_init(&buffer, 80, 10);
29        char* line = strbuffer_getline(&buffer, &eof);
30        process_regex(line, regex_text);
31        if (line) free(line);
32    }
33    else
34    {
35        process_regex(argv[1], regex_text);
36    }
37    return 0;
38}
39
40void process_regex(const char* find_text, const char* regex_text)
41{
42    regex_t regex;
43    basic_matchlist_t match_list;
44    regex_compile (&regex, regex_text);
45    basic_matchlist_init(&match_list);
46    regex_match (&regex, find_text, &match_list);
47    basic_matchlist_print(&match_list);
48    basic_matchlist_delete(&match_list);
49    regfree (&regex);
50}

Key points:

  • If the program receives an argument, it processes that argument as the input text.

  • If there is no argument, it reads a line from standard input with the string buffer helper.

  • regcomp() and related POSIX regex calls are wrapped in helper functions from the example.

  • The match list is explicitly initialized, printed, and deleted.

  • regfree() releases resources owned by the compiled regular expression.

Character List Case Study#

The systems-code-examples/charlist_t example stores characters in a tail queue. It is a small data-structure case study for text processing.

 1#include <stdio.h>
 2#include <stdlib.h>
 3
 4#include "charlist.h"
 5
 6void charlist_add_char(struct charlist_t* head_ptr, char c)
 7{
 8    struct charentry_t* my_entry = malloc(sizeof(struct charentry_t));
 9    my_entry->c = c;
10    TAILQ_INSERT_TAIL(head_ptr, my_entry, entries);
11}
12
13void charlist_add_string(struct charlist_t* head_ptr, char* text)
14{
15    for (char* next_char=text; *next_char != '\0'; next_char++)
16        charlist_add_char(head_ptr, *next_char);
17}
18
19void charlist_print(struct charlist_t* head_ptr)
20{
21    struct charentry_t* np;
22    TAILQ_FOREACH(np, head_ptr, entries)
23    {
24        printf("%c\n", np->c);
25    }
26}
27
28char* charlist_tostring(struct charlist_t* head_ptr)
29{
30    /* will cache size later */
31    int size = charlist_size_slow(head_ptr);
32    char* text = (char*)malloc(size * sizeof(char));
33    struct charentry_t* np;
34    int i=0;
35    TAILQ_FOREACH(np, head_ptr, entries)
36    {
37        text[i++] = np->c;
38    }
39    text[i] = '\0';
40    return text;
41}
42
43int charlist_size_slow(struct charlist_t* head_ptr)
44{
45    struct charentry_t* np;
46    int size = 0;
47    TAILQ_FOREACH(np, head_ptr, entries)
48    {
49        size++;
50    }
51    return size;
52}
53
54void charlist_delete(struct charlist_t* head_ptr)
55{
56    struct charentry_t* n1, *n2;
57    n1 = TAILQ_FIRST(head_ptr);
58    while (n1 != NULL)
59    {
60        n2 = TAILQ_NEXT(n1, entries);
61        free(n1);
62        n1 = n2;
63    }
64}
65
66void charlist_init(struct charlist_t* head_ptr)
67{
68    struct charlist_t head = TAILQ_HEAD_INITIALIZER(head);
69    *head_ptr = head;
70    /* create a tail queue */
71    TAILQ_INIT(head_ptr);
72}
73

Key points:

  • Each character is stored in a heap-allocated list entry.

  • TAILQ_INSERT_TAIL appends entries while preserving order.

  • charlist_tostring() converts the list back into a normal C string.

  • charlist_delete() walks the list and frees each entry.

  • This representation is flexible but much heavier than a contiguous string buffer.

Word Counting with Hash Tables#

The systems-code-examples/std-hsearch example uses the standard hsearch family to maintain word counts.

 1#include <stdio.h>
 2#include <stdlib.h>
 3
 4#include "wordtable.h"
 5
 6/*
 7 * This example is adapted from the man page for hsearch.
 8 * Annoyingly, that example has a single haswtable that made reuse of that code nearly impossible.
 9 * This one uses the _r() functions, which not only are able to work on different haswtables
10 * but are also reentrant. Sadly, these are a bit GNU specific, but who doesn't use gcc? LOL
11 */
12
13static char *data[] = { "alpha", "bravo", "charlie", "delta",
14                        "echo", "foxtrot", "golf", "hotel", "india", "juliet",
15                        "kilo", "lima", "mike", "november", "oscar", "papa",
16                        "quebec", "quebec", "romeo", "sierra", "tango", "uniform",
17                        "victor", "whisky", "whisky", "x-ray", "yankee", "zulu", "zulu"
18                      };
19
20
21int main(void)
22{
23    wordtable_t wtable;
24    wordtable_init(&wtable, 50);
25    int data_size = sizeof(data) / sizeof(char*);
26    int i;
27
28    /* insert all words to get word counts */
29    for (int i = 0; i < data_size; i ++)
30    {
31        wordentry_t* wp = wordtable_increment(&wtable, data[i]);
32        printf("new word i=%d, data=%s, count=%ld, id=%p\n", i, data[i], wp->count, wp);
33    }
34    printf("\n");
35
36    /* delete every 4th word */
37    for (int i = 0; i < data_size; i += 4)
38    {
39        wordentry_t* wp = wordtable_decrement(&wtable, data[i]);
40        printf("dec word i=%d, data=%s, count=%ld, id=%p\n", i, data[i], wp->count, wp);
41    }
42    printf("\n");
43
44    for (int i = 0; i < data_size; i++)
45    {
46        wordentry_t* wp = wordtable_lookup(&wtable, data[i]);
47        if (wp == NULL || wp->count == 0) continue;
48        printf("non-zero word= %s, count=%ld, id=%p\n", data[i], wp->count, wp);
49    }
50
51    printf("\n");
52    for (int i = 0; i < data_size; i++)
53    {
54        wordentry_t* wp = wordtable_lookup(&wtable, data[i]);
55        if (wp == NULL || wp->count > 0) continue;
56        printf("zero word= %s, count=%ld, id=%p\n", data[i], wp->count, wp);
57    }
58
59    for (int i = 0; i < data_size; i ++)
60    {
61        int result = wordtable_delete_entry(&wtable, data[i]);
62        printf("deleting word %s, result=%d\n", data[i], result);
63    }
64    wordtable_delete(&wtable);
65    exit(EXIT_SUCCESS);
66}

Key points:

  • wordtable_init() prepares a hash table with a fixed capacity.

  • Repeated words increment an existing count rather than creating a new logical word.

  • The example looks up words after insertion and after decrementing counts.

  • Cleanup is explicit because the table owns allocated entries.

  • Hash tables are useful when fast lookup matters more than sorted order.

Word Counting with Trees#

The systems-code-examples/std-tsearch example stores word information in a binary search tree.

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <string.h>
 4
 5#define __USE_GNU
 6
 7#define _GNU_SOURCE     /* Expose declaration of tdestroy() */
 8#include <search.h>
 9#include <time.h>
10
11
12static char *data[] = { "alpha", "bravo", "charlie", "delta",
13                        "echo", "foxtrot", "golf", "hotel", "india", "juliet",
14                        "kilo", "lima", "mike", "november", "oscar", "papa",
15                        "quebec", "quebec", "romeo", "sierra", "tango", "uniform",
16                        "victor", "whisky", "whisky", "x-ray", "yankee", "zulu", "zulu"
17                      };
18
19
20typedef struct _wordinfo_t
21{
22    char* word;
23    int count;
24} wordinfo_t;
25
26
27static int compare_by_word(const void *pa, const void *pb)
28{
29    const wordinfo_t* wi_a = pa, *wi_b = pb;
30
31    //printf("comparing %s to %s\n", wi_a->word, wi_b->word);
32    return strcmp(wi_a->word, wi_b->word);
33}
34
35static void action(const void *nodep, VISIT which, int depth)
36{
37    //int *datap;
38    wordinfo_t* datap;
39
40    switch (which)
41    {
42    case preorder:
43        break;
44    case postorder:
45        datap = *(wordinfo_t **) nodep;
46        printf("word %s count %d depth %d\n", datap->word, datap->count, depth);
47        break;
48    case endorder:
49        break;
50    case leaf:
51        datap = *(wordinfo_t **) nodep;
52        printf("word %s count %d depth %d\n", datap->word, datap->count, depth);
53        break;
54    }
55}
56
57int main(void)
58{
59    int i;
60    wordinfo_t* ptr;
61    wordinfo_t** val;
62    int data_size = sizeof(data) / sizeof(char*);
63    void *root = NULL;
64
65    for (i = 0; i < data_size; i++)
66    {
67        printf("insert word %s\n", data[i]);
68        ptr = (wordinfo_t*) malloc(sizeof(wordinfo_t));
69        ptr->word = data[i];
70        ptr->count = 0;
71        if (ptr == NULL) break;
72        val = (wordinfo_t**) tsearch((void *) ptr, &root, compare_by_word);
73        if (val != NULL)
74        {
75            wordinfo_t* entry = *val;
76            entry->count++;
77        }
78    }
79    twalk(root, action);
80    tdestroy(root, free);
81    exit(EXIT_SUCCESS);
82}

Key points:

  • tsearch() inserts or finds an entry using the comparison function.

  • The comparison function orders entries by word text.

  • Repeated words find the existing node and increment its count.

  • twalk() traverses the tree and calls a callback for each node.

  • tdestroy() releases the tree at the end.

Looking Ahead: I/O Performance#

Good I/O performance depends on choosing the right buffering strategy. Small buffers increase the number of system calls and lower throughput. Very large buffers may cause longer waits before processing can continue. Programs often need a balance.

Producer-consumer designs can improve throughput. One thread or process reads data while another processes it. Memory-mapped I/O is another approach, and we return to it in the IPC chapter.

Simple I/O Performance Experiment#

The dd command gives a quick view of how block size can affect throughput.

dd if=/dev/zero of=tmp.dat bs=1 count=1000000 - 671 kB/s
dd if=/dev/zero of=tmp.dat bs=10 count=100000 - 5.9 MB/s
dd if=/dev/zero of=tmp.dat bs=100 count=10000 - 38.9 MB/s
dd if=/dev/zero of=tmp.dat bs=1000 count=1000 - 244 MB/s
dd if=/dev/zero of=tmp.dat bs=10000 count=100 - 537 MB/s
dd if=/dev/zero of=tmp.dat bs=100000 count=10 - 834 MB/s
dd if=/dev/zero of=tmp.dat bs=1000000 count=1 - 461 MB/s

Increasing block size usually improves performance until some other limit dominates. These are single-run measurements, so system load and cache state can affect the numbers.

Reading/Writing Performance#

Vectored I/O, also called gather-scatter I/O, combines multiple buffers into one read or write operation. This reduces the number of system calls when a program has data split across several buffers.

Vectored I/O Example#

The systems-code-examples/vectored_io example writes three separate buffers with one writev() call.

 1#include <unistd.h>
 2#include <fcntl.h>
 3#include <stdio.h>
 4#include <string.h>
 5#include <sys/uio.h>
 6
 7char *file_data1 = "1234567890";
 8char *file_data2 = "abcdefghijk";
 9char *file_data3 = "lmnopqrstuvwxyz";
10const char *file_name = "temp.dat";
11
12int main(int argc, char* argv[])
13{
14
15    int fd = open(file_name, O_CREAT | O_TRUNC | O_RDWR, 0666);
16    if(fd == (-1))
17    {
18        printf("open returned (-1)\n");
19        return (-1);
20    }
21
22    struct iovec buffers[3];
23    buffers[0].iov_base = file_data1;
24    buffers[0].iov_len = strlen(file_data1);
25    buffers[1].iov_base = file_data2;
26    buffers[1].iov_len = strlen(file_data2);
27    buffers[2].iov_base = file_data3;
28    buffers[2].iov_len = strlen(file_data3);
29
30    int written = writev(fd, buffers, 3);
31    if(written == (-1))
32    {
33        printf("writev returned (-1)\n");
34        return (-1);
35    }
36    printf("wrote %d bytes\n", written);
37
38    close(fd);
39    return 0;
40}
41
42

Key points:

  • Each struct iovec entry describes one buffer and its length.

  • writev() writes the buffers in order as one logical operation.

  • This avoids three separate write() system calls.

  • Vectored I/O is useful for data with a header, body, and trailer stored in separate buffers.

  • The file descriptor is still opened and closed with the same calls used for ordinary file I/O.