Monday, January 5, 2009

Kernel basis(2): radix trees

Abstract: this article describes the radix tree data structure and how the Linux kernel uses it.

1. What is a radix tree?
As described by Wikipedia, a radix tree (also called a Patricia trie/tree or crit-bit tree) is a specialized set data structure based on the trie that is used to store a set of strings.


2. Linux kernel radix tree internals
struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;
};

struct radix_tree_node {
unsigned int height; /* Height from the bottom */
unsigned int count;
struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

In the Linux kernel, each radix tree node has 16 or 64 slots (RADIX_TREE_MAP_SHIFT of 4 or 6) and is indexed by a portion of the integer key. Slots in the bottom-level nodes point to the stored items. Empty slots contain a NULL pointer. For example, in a three-level radix tree with 64-slot nodes, each level is indexed by 6 bits of the key. Nodes that would have no children are not allocated, so a radix tree can hold sparsely populated key ranges, such as the pages of a sparse file, without wasting memory.
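The way a key is split into per-level slot numbers can be sketched in user space. This is only an illustration: MAP_SHIFT, slot_index and print_path are hypothetical names chosen here, not kernel symbols, and the shift of 6 assumes the configuration without CONFIG_BASE_SMALL.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed value: 6 is the shift used when CONFIG_BASE_SMALL is off. */
#define MAP_SHIFT 6
#define MAP_SIZE  (1UL << MAP_SHIFT)   /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

/*
 * Slot selected inside a node at the given height, where height 1 is
 * the bottom level. Hypothetical helper, not a kernel function.
 */
static unsigned int slot_index(unsigned long key, unsigned int height)
{
    assert(height >= 1);
    return (unsigned int)((key >> ((height - 1) * MAP_SHIFT)) & MAP_MASK);
}

/* Print the slot taken at each level, from the top node down. */
static void print_path(unsigned long key, unsigned int tree_height)
{
    unsigned int h;

    for (h = tree_height; h >= 1; h--)
        printf("height %u: slot %u\n", h, slot_index(key, h));
}
```

For the key 74565 (0x12345) in a tree of height 3, print_path(74565, 3) walks through slots 18, 13 and 5: the top 6 bits pick the slot in the root node, the middle 6 bits the slot in the next node, and the low 6 bits the slot holding the item.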

3. Usage
In the vanilla kernel, the PowerPC architecture uses a radix tree to map real and virtual IRQ numbers. The NFS code uses a radix tree to index inode structures to keep track of outstanding requests. The address_space structure used to keep track of backing store contains a radix tree which tracks in-core pages tied to that mapping. Among other things, this tree allows the memory management code to quickly find pages which are dirty or under writeback.

4. How to use radix trees?
There are two ways to initialize a radix tree:
#include <linux/radix-tree.h>
then
RADIX_TREE(name, gfp_mask); /* Declare and initialize */
or
struct radix_tree_root my_tree;
INIT_RADIX_TREE(&my_tree, gfp_mask); /* takes a pointer to the root */

The first form is a convenience wrapper: it declares the root and initializes it statically, while the second form initializes an existing root at run time.
After that, all functions declared in linux/radix-tree.h can be used to manipulate the tree.

5. Notes
(1) Synchronization
The radix-tree API requires users to provide all synchronization (with some specific exceptions).
For API usage, in general:
- any function _modifying_ the tree or tags (inserting or deleting
items, setting or clearing tags) must exclude other modifications, and
exclude any functions reading the tree.
- any function _reading_ the tree or tags (looking up items or tags,
gang lookups) must exclude modifications to the tree, but may occur
concurrently with other readers.

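The notable exceptions to this rule are the following functions, which the header comment in linux/radix-tree.h describes as callable without the caller's lock (most of them inside an rcu_read_lock() region; see that comment for the exact conditions):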
radix_tree_lookup
radix_tree_lookup_slot
radix_tree_tag_get
radix_tree_gang_lookup
radix_tree_gang_lookup_slot
radix_tree_gang_lookup_tag
radix_tree_gang_lookup_tag_slot
radix_tree_tagged
(2) Destroying a radix tree
There is no function for destroying a radix tree. It is, evidently, assumed that radix trees will last forever. In practice, deleting all items from a radix tree will free all memory associated with it other than the root node, which can then be disposed of normally.

6. Reference:
http://lwn.net/Articles/175432/
http://en.wikipedia.org/wiki/Radix_tree

Thursday, January 1, 2009

Kernel basis(1): the container_of macro

Short version:
This article explains how the container_of macro works and how to use
it (based on http://www.kroah.com/log/linux/container_of.html).

Long version:
The container_of macro is defined in linux/kernel.h:
/**
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
*
*/
#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) );})

Some fundamentals:
1. typeof:
This is a GNU C extension. When compiling in a strict ISO C mode (for
example with -ansi or -std=c99), it must be spelled __typeof__. It is an
operator rather than a macro, and it accepts two kinds of arguments: an
expression or a type.

Expression: typeof(x[0](1))
where x is an array of pointers to functions. The result is the type of
the value returned by the function. The expression itself is not evaluated.

Type: typeof(int *)
This is the type of pointers to int.
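
Both forms can be tried in a small user-space sketch. The names twice, SWAP, typeof_demo and swap_demo are made up for this illustration, and the code assumes gcc or another compiler that supports __typeof__.

```c
#include <assert.h>

/* Hypothetical function used only to build the array of function pointers. */
static int twice(int v)
{
    return 2 * v;
}

/* Type-generic swap: the temporary takes whatever type `a` has. */
#define SWAP(a, b) \
    do { __typeof__(a) swap_tmp_ = (a); (a) = (b); (b) = swap_tmp_; } while (0)

static int typeof_demo(void)
{
    int (*x[1])(int) = { twice };

    /* Expression form: inside __typeof__ the call x[0](1) is not made;
     * only its type (int) is used to declare r. */
    __typeof__(x[0](1)) r = x[0](21);

    /* Type form: the same as writing `int *p`. */
    __typeof__(int *) p = &r;

    assert(sizeof(r) == sizeof(int));
    *p += 1;
    return r;
}

static double swap_demo(void)
{
    double lo = 2.5, hi = 1.5;

    SWAP(lo, hi);       /* works for double without a separate macro */
    return hi - lo;
}
```

Here typeof_demo() returns 43 (twice(21) plus one added through p), and swap_demo() returns 1.0 once the two values have been exchanged.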

2. offsetof(TYPE, MEMBER)
This is an ANSI C library feature defined in stddef.h. It evaluates to
the offset (in bytes) of a given member within a struct or union type.
Typical implementations:
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE*)0)->MEMBER)

#define offsetof(TYPE, MEMBER) \
((size_t) ((char *)&((TYPE*)(0))->MEMBER - (char *)0))

gcc has a built-in offsetof. To use it when it is available, do
#undef offsetof
#ifdef __compiler_offsetof
#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
#else
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#endif
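
As a quick check of what offsetof yields, the sketch below compares the standard macro from stddef.h with an offset measured on a real object. struct sample and both helper functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure used only for this illustration. */
struct sample {
    char tag;
    int value;
    double weight;
};

/* Offset of `weight` measured on a real object instead of a null pointer. */
static size_t weight_offset_by_hand(void)
{
    struct sample s;

    return (size_t)((char *)&s.weight - (char *)&s);
}

/* The first member always sits at offset 0; later offsets depend on padding. */
static void check_offsets(void)
{
    assert(offsetof(struct sample, tag) == 0);
    assert(offsetof(struct sample, weight) == weight_offset_by_hand());
}
```

The exact offsets of value and weight depend on the platform's alignment rules, which is why the sketch compares the two methods rather than hard-coding numbers.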

Now, at last, we come to the container_of macro itself, which is easy to
understand given the above. The first line declares a pointer (__mptr) to
the const-qualified type of member and initializes it with ptr. So we have
two pointers (__mptr and ptr) to the same memory location, and the
compiler complains if ptr does not point to the member's type.

The second line finds the address in memory of the structure that
contains what ptr points to. The offsetof macro calculates the offset in
bytes of member from the start of type. Subtracting that offset from
__mptr (cast to char * so that the arithmetic is done in bytes) gives the
address of the containing structure, which is then cast to type *.
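
The two steps described above can also be written out by hand in plain ISO C, without typeof or statement expressions. This is a sketch under assumed names: struct link, struct item, item_from_node and roundtrip_id are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types used only for this illustration. */
struct link {
    struct link *next;
};

struct item {
    int id;
    struct link node;   /* embedded member that other code points to */
};

/* Hand-written equivalent of container_of(ptr, struct item, node). */
static struct item *item_from_node(struct link *ptr)
{
    /* Step 1: a typed copy of ptr; a wrong pointer type draws a diagnostic. */
    const struct link *mptr = ptr;

    assert(ptr != NULL);

    /* Step 2: step back by the member's byte offset to reach the container. */
    return (struct item *)((char *)mptr - offsetof(struct item, node));
}

/* Build an item, hand out only its embedded node, and recover the item. */
static int roundtrip_id(int id)
{
    struct item it;
    struct item *back;

    it.id = id;
    it.node.next = NULL;

    back = item_from_node(&it.node);
    assert(back == &it);
    return back->id;
}
```

Calling roundtrip_id(7) returns 7: only &it.node is passed around, yet the full item is recovered from it.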

How to use it?
This is useful when one software layer calls back into another.
 container_of_sample.c
/* test code to illustrate use of Linux kernel container_of macro
*
* Copyright (c) 2008 Cliff Brake, BEC Systems LLC
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
*/


#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* This program illustrates how the container_of macro works.
* The container_of macro is very useful in multi-layered
* software systems where you have progressively more detailed
* software layers. Below is an example of a bus layer,
* and then a device layer where a number of different
* devices might register with the bus.
* The device registers itself with the bus subsystem, and
* then the bus subsystem makes a callback into the device.
* Normally if there are multiple devices registered, the
* bus subsystem must store and pass a device structure
* when making callbacks. With the container_of macro, this is
* no longer necessary, and the bus subsystem only has to
* know about one generic device structure, and does not need visibility
* into lots of different device structures, or do tricks
* by casting void pointers, etc. With the container_of macro
* we can backcast from the generic data structure, to the containing
* data structure. This forces good separation of code in that
* the bus layer cannot modify data structures that are specific
* to the device layer.
*
*/



/**
* (from Linux kernel source)
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
*
*/

#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) );})

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)


/*==========================================================
* BUS layer code
*==========================================================*/


/* generic bus device structure */
struct bus_device
{
int general_device_param_a;
int general_device_param_b;
void (*device_callback)(struct bus_device * bd);
};

/* the following is a global list of
* devices that have registered with the
* bus subsystem. Normally this would
* be something like a dynamic linked list.
*/

struct bus_device * bd_list[5];

/* function to register a device with the bus */
void register_with_bus(struct bus_device * bd)
{
/* since this example only deals with one
* device, will put it in slot 0
*/


bd_list[0] = bd;
}

void start_bus()
{
int i;
struct bus_device * bd;

/* make callbacks to all devices on bus */
for (i=0;i<sizeof(bd_list)/sizeof(bd_list[0]);i++) {
bd = bd_list[i];
if (!bd) continue;
/* call device callback with generic
* bus device structure
*/

bd->device_callback(bd);
}
}

/*==========================================================
* device X specific code
* this would normally be in a different module
*==========================================================*/


/* structure that holds device X specific stuff, as well as
* generic bus_device structure
*/

struct device_x
{
int device_x_specific_param_a;
int device_x_specific_param_b;
struct bus_device bd;
};

void device_x_callback(struct bus_device * bd)
{
/* if we know the structure type that contains the bus_device structure,
* we can extract a pointer to the containing structure using the container_of
* macro
*/


/* ptr type member */
struct device_x * devx = container_of(bd, struct device_x, bd);

/* the above statement expands to
* struct device_x * devx = (
* {
* const typeof( ((struct device_x *)0)->bd ) *__mptr = (bd);
* (struct device_x *)( (char *)__mptr - ((size_t) &((struct device_x *)0)->bd) );
* }
* );
*/


printf("device_x_callback called!, device_x_specific_param_a = %i\n",
devx->device_x_specific_param_a);
}

void device_x_init()
{
/* dynamically allocate structures */
struct device_x * devx = malloc(sizeof(*devx));
if (!devx)
return; /* allocation failed, nothing to register */
memset(devx, 0, sizeof(*devx));

/* set a parameter in the device_x structure so
* we can test for this in the callback
*/

devx->device_x_specific_param_a = 1001;

/* set up callback function */
devx->bd.device_callback = device_x_callback;

/* we register the generic bus device structure
* as the bus layer does not need to know
* about the device_x structure. Note, the
* devx structure is not stored anywhere, yet
* its location is being preserved without
* specifically passing it to the bus
* layer.
*/

register_with_bus(&devx->bd);
}


int main()
{

/* test the above system */

/* first, initialize device_x */
device_x_init();

/* now, start the bus. This should make
* a callback into the device_x
*/

start_bus();
}

/* when run, this program prints:
* device_x_callback called!, device_x_specific_param_a = 1001
*/