Friday, December 4, 2009

Several filesystem related trees (0)

Coly has given an intro to the B_tree algorithm in Nov.'s seminar. Here I'm going to extend his introduction to a brief overview of several file system related trees.

Following is a list of trees I plan to write about(might be extended). I hope I can finish the list before the end of 2009 :)

1. B_tree
2. b+tree
3. Htree
4. Dancing tree
5. Hash tree
6. Tiger tree
7. rb-tree

Friday, September 18, 2009

glibc malloc implementation

I've been asked several time in interviews how libc implements malloc function. So here I give the a simple summary.

First, I'd like to copy some simple malloc implementation found from Internet.
simple explaintion of malloc:
struct mem_control_block {
int is_available;
int size;
1. If our allocator has not been initialized, initialize it.
2. Add sizeof(struct mem_control_block) to the size requested.
3. start at managed_memory_start.
4. Are we at last_valid address?
5. If we are:
A. We didn't find any existing space that was large enough
-- ask the operating system for more and return that.
6. Otherwise:
A. Is the current space available (check is_available from
the mem_control_block)?
B. If it is:
i) Is it large enough (check "size" from the mem_control_block)?
ii) If so:
a. Mark it as unavailable
b. Move past mem_control_block and return the pointer
iii) Otherwise:
a. Move forward "size" bytes
b. Go back go step 4
C. Otherwise:
i) Move forward "size" bytes
ii) Go back to step 4

However, the real-world in libc in much more complicated and thus much more efficient.
Since 2.3 release GNU C library (glibc) uses a modified ptmalloc2, based on "Doug Lea's Malloc" (dlmalloc), which is a lock free implementation.

Quoting glibc/malloc/malloc.c:
  The main properties of the algorithms are:
* For large (>= 512 bytes) requests, it is a pure best-fit allocator,
with ties normally decided via FIFO (i.e. least recently used).
* For small (<= 64 bytes by default) requests, it is a caching
allocator, that maintains pools of quickly recycled chunks.
* In between, and for combinations of large and small requests, it does
the best it can trying to meet both goals at once.
* For very large requests (>= 128KB by default), it relies on system
memory mapping facilities, if supported
The differences between memory allocated by sbrk() and mmap() are: 1). A released memory map does not create any "hole" that would need to be managed. 2). mmaped regions must be page-aligned. 3). invoking mmap and mfree is much slower than carving out an existing chunk of memory, because mmap will force kernel to zero out the memory it returns, which eats a lot of cache and memory bandwidth.

The basic memory structure in glibc malloc is:
struct malloc_chunk {

INTERNAL_SIZE_T prev_size; /* Size of previous chunk (if free). */
INTERNAL_SIZE_T size; /* Size in bytes, including overhead. */

struct malloc_chunk* fd; /* double links -- used only if free. */
struct malloc_chunk* bk;

/* Only used for large blocks: pointer to next larger size. */
struct malloc_chunk* fd_nextsize; /* double links -- used only if free. */
struct malloc_chunk* bk_nextsize;
There are two core elements in dlmalloc: Boundary tags and bins.

Basically, boundary tags are how memory chunks look like, and bins are how memory chunks are managed.
An allocated chunk looks like this (boundary tags):

chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of previous chunk, if allocated | |
| Size of chunk, in bytes |M|P|
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| User data starts here... .
. .
. (malloc_usable_size() bytes) .
. |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of chunk |

chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of previous chunk |
`head:' | Size of chunk, in bytes |P|
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Forward pointer to next chunk in list |
| Back pointer to previous chunk in list |
| Unused space (may be 0 bytes long) .
. .
. |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
`foot:' | Size of chunk, in bytes |
chunk in a list is known to be preceeded and followed by either
inuse chunks or the ends of memory.

A single-linked LIFO array of lists holding recently freed small chunks.
Chunks in fastbins keep their inuse bit set. malloc_consolidate
releases all chunks in fastbins and consolidates them with
other free chunks.

the "unsorted" bin
chunks being placed on it in free (and malloc_consolidate),
and taken off (to be either used or placed in bins) in malloc.

a bitvector recording whether bins are definitely empty so they can
be skipped over during during traversal

Thursday, July 16, 2009

First kernel patch

My real first kernel patch. I am quite exciting about it. Finally I got a Signed-off-by in the kernel commit log :)

Luckily enough, the patch got merged just before 2.6.30-rc3 was released.

commit ac046f1d6121ccdda6db66bd88acd52418f489b2
Author: Peng Tao
Date: Mon Jul 13 09:30:17 2009 -0400

ext4: fix null handler of ioctls in no journal mode

The EXT4_IOC_GROUP_ADD and EXT4_IOC_GROUP_EXTEND ioctls should not
flush the journal in no_journal mode. Otherwise, running resize2fs on
a mounted no_journal partition triggers the following error messages:

BUG: unable to handle kernel NULL pointer dereference at 00000014
IP: [] _spin_lock+0x8/0x19
*pde = 00000000
Oops: 0002 [#1] SMP

Signed-off-by: Peng Tao
Signed-off-by: "Theodore Ts'o"

Wednesday, July 15, 2009

Valerie's Advice to FS Developpers

From an interview from Linux journal.
Keep it in mind...

Keep talking to other file system developers face to face, keep experimenting with new file systems, keep talking to people in research and academia, keep paying attention to hardware trends. The way to avoid the “file systems are a solved problem” echo chamber is to stay in touch with both each other and the outside world.

Monday, July 13, 2009

Content Addressable Storage -- The Next Step

Recently, I was writing an slightly modified version of implementation of Content Addressable Storage (CAS) for our on-line image servers. I found it a perfect choice to apply CAS to image servers, where images are uploaded once and never changed, and where disk access throughput is a main bottleneck to many large-scale on-line image service providers.

As explained on wikipedia, CAS is a mechanism for storing information that can be retrieved based on its content, not its storage location. It is typically used for high-speed storage and retrieval of fixed content.

CAS Characteristics:
1. Storage is identified by its content
2. Designed to make the searching for a given document content very quick
3. Works most efficiently on data that does not change often

Currently, most servers store images as common files and rely on filesystem's mechanism to optimize reading speed. However, filesystems are limited by POSIX standards and are optimized only for everyday usage. For large-scale image servers (or any application storing large amount of images), CAS will be a better choice (for the following reasons).

1. deduplication: both block-level CAS and file-level CAS achieve efficient storage utilization by storing same data only once.
2. Aggregation: log-structured extent-based on-disk object-stores, storing many images continuously in a single large file, resulting only one disk seek if index files are saved/managed in-memory (verses 2~3 disk seeks per read for common files).

My CAS implementation is written by the fact that most images are small files (~10MB) and there are huge amount of them on image servers. The implementation should speed up accessing of these image files. I am planing some benchmarks this week and hope to post them here soon.

After that, I will be writing a paper illustrating the opportunistic of applying CAS system to large-scale image servers. Then, I also want to explore the possibility of changing the interface (to, maybe dbus?) and make it runnable on common desktop, because current gnome/kde also read a lot of thumbnails. Maybe I can make it quicker somehow. Who knows :)

Monday, June 29, 2009

Easter egg in opensuse grub

Yesterday I booted my laptop while shaking it. First I saw some error messages saying something like "not enough memory". Then amazing things happened. A beautiful grub boot screen came into my eye. It's a snowy background with several penguins running around. I'm impressed! After booting into my system, I decided to find out what actually happened.

opensuse grub uses a gfxmenu option in menu.list to set the colorful boot menus. In my case,
gfxmenu (hd0,2)/message
So I need to open (hd0,2)/message to see what's inside.
$cp /boot/message ~/test/grub/
$cpio -i <>
Good, let's first see the about file
$cat pabout.txt
Penguin theme originally made by Raphael Quinet
Modernized for openSUSE by Steffen Winterfeldt.

Like it or hate it? Edit gfxboot.cfg in /boot/message
to have it always or to get rid of it.
So gfxboot.cfg next. I found an option setting penguin theme there.
; penguin theme likelihood (in percent, -1 = auto)
Wow, the likelihood is set 0, which explains why I didn't see the theme before. I must be so lucky that my shaking hands made grub to read the bit wrongly and presented the penguin theme to me. What a lucky dog I am!
I really like the penguin grub theme. So I decided to have it always by setting penguin likelihood to 100. and then reinstall the picture. Of course don't forget to delete the original message before archiving the new one.
$rm message
$find ./ -print | cpio -ov > message
835 blocks
$mv message /boot/

OK, let me reboot to see the new grub theme.

For those who want to try opensuse grub, please download from the official repository (look for grub here). To install grub in command line, see here. Of course root privilege is needed.

Friday, June 19, 2009

tesseract-ocr: jpeg support added

I just pushed some patches to the github host to add jpeg image support with libjpeg.

One drawback is that, because the image operation infrastructure in tesseract doesn't allow passing libjpeg private structures between open and read operations, I have to open the file stream twice in opening and reading jpeg images.

To test the patch, pull from git:// and build tesseract-ocr. Then run some tests, this time against jpeg images :)

github copy of tesseract-ocr

We need to modify tesseract-ocr to support more image formats. So I set up a project host at github and pulled the source code from tesseract-ocr's googlecode host.

So, to get the new code base, just type the following commands:
1. git clone git://
2. cd tesseract-ocr-copy
3. ./configure
4. make
5. make install

works perfect :)

Great news: we have passed 1st round evaluation

Just got a notification email saying that our online namecard ocr project has passed the first round of Nokia Innovation Contest.

Nokia is very generous to plan to sponsor us with a N95. Cool!

Tuesday, May 12, 2009

libevent and so on

Since we need a HTTP server to provide efficient service to all kinds of clients, I started to look into some lightweight open source solutions. The first item that jumps into my eyes is libevent, because I happen to read a blog of a facebook developer's, stating that facebook is using libevent as a HTTP server in their haystack photo storage service.

Libevent is a lightweight event driven library wildly used in many applications, such as memcached and tor. Libevent has simple but efficient HTTP support. Here is a sample code building a simple HTTP server with evhttp:
#include <sys/queue.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#include <string.h>
#include <stdio.h>

#include <evhttp.h>

void ocr_handler(struct evhttp_request *req, void *arg)
struct evbuffer *buf;
struct evkeyval *header;

TAILQ_FOREACH(header, req->input_headers, next) {
fprintf(stdout, "key:%s\tvalue:%s\n", header->key, header->value);

int main(int argc, char **argv)
int err;
struct evhttp *httpd;
struct event_base *evbase;
int port = 1234;

evbase = event_init();

fprintf(stderr, "event method: %s\n", event_base_get_method(evbase));

httpd = evhttp_new(evbase);
if ((err = evhttp_bind_socket(httpd, NULL, port)) < 0) {
fprintf(stderr, "error binding http server to port %d\n", port);
return -1;

/* set callback for "/" */
evhttp_set_cb(httpd, "/", ocr_handler, NULL);

/* Set a send callback for all other requests. */
evhttp_set_gencb(httpd, NULL, NULL);


return 0;

However, as I looked into the library in details and wrote some test programs, it turns out that life is not that easy. I tried to dynamically create threads to serve incoming HTTP requests, but the code didn't work as I thought. After searching for a while, I found the problem:

Steven Grimm:
What libevent doesn't support is sharing a libevent instance across threads. It works just fine to use libevent in a multithreaded process where only one thread is making libevent calls.

Therefore, to use libevent in a multi-threaded program, we should create each thread a event base when initialising the program and call ev_set_base() after ev_set() but before ev_add(). Then we will have a thread pool to serve HTTP requests. There will a main thread listening to all incoming HTTP requests. When a request comes, it passes the request to some thread from the thread pool and wakes it up to handle the request.

Although this will work, we somehow end up with a master/worker thread architecture, where the main thread handles all reads from netwrok. This will certainly be a bottleneck when there are thousands of clients(think the C10K problem). I don't know how the facebook guys deal with this problem(maybe they patched libevent?:), But IMO, using an evhttp dispacher in a multi-threaded process, we'll end up this way.

So, currently, I'm planning to look at other solutions like lighttpd before making any decision on the server architecture.

Tuesday, April 28, 2009

A first glance at tesseract-ocr

So, I start to look into the name card OCR project. As suggested by Alex, I'd first look at the OCR engine developed by Google, tesseract-ocr.

From wikipedia:
Tesseract is a free optical character recognition engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 until 1995. After ten years without any development taking place, Hewlett Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0

It's a little strange that the featured tar balls listed on the project home page have compile errors. After modifying the source code, I successfully got the program binaries, but the language charsets aren't built. Then I change to the svn HEAD version, and it works.

To use tesseract, I simply type:
[bergwolf@bin]$./tesseract phototest.tif result
Tesseract Open Source OCR Engine
[bergwolf@bin]$cat result.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

Tesseract OCR engine is very accurate, and is very suitable for our name card OCR service, because usually we only have white background and black letters in our images.

However, the drawback is that it doesn't support many image formats. Most mobile device save camera photos in jpeg format. Currently, only tiff and bmp formats are recognizable by tesseract. If we want to use it as our OCR engine, two options are available: either patch tesseract with other image formats support, or use other tools like imagemagick to convert other image formats to tiff or bmp format, both of which shouldn't be very hard.

Saturday, April 18, 2009

Being one of opensource apprentices

A few days ago, I applied to be one of Alex Lau's 12 open source apprentices. After several conversations via email with Alex, I finally got a chance to see him face to face.

It was a wonderful and pleasant talk. Alex gave me a lot of advices, on my tech path and on my future plans. We talked about many open source projects, like name card OCR, SyncML, Ifolder, Logfs(and FTL), as well as p2p file systems. At this point, I'm planning to first look at the on-line name card OCR project. It is most related to the Maemo project in my laboratory, and maybe, I can find someone in the lab to implement it together with me.

My problem is that I applied for this year's GSoC program. But the GSoC slots allocation is not decided yet. So I don't know how much time I can devote to this program. Nevertheless, let's see the GSoC results first, which is due soon.

Monday, January 5, 2009

Kernel basis(2): radix trees

Abstract: this article talks about the data structure radix trees and its usage in Linux kernel.

1. What is a radix tree?
As described by wikipedia, A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings.

2. Linux kernel radix tree internals
struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;

struct radix_tree_node {
unsigned int height; /* Height from the bottom */
unsigned int count;
struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */

In Linux kernel, each radix tree node has 16/64 slots and can be indexed by a portion of the integer key. The leaf node points to the content. Empty slots contain a NULL pointer. For example, each node of the above three levels radix tree is indexed by 6 bits of the key. Nodes that have no children are not presented. Thus radix tree can be used to store sparse files.

3. Usage
In vanilla kernel, the PowPC architecture uses a radix tree to map real and virtual IRQ numbers. The NFS code uses a radix tree to index inode structures to keep track of outstanding requests. The address_space structure used to keep track of backing store contains a radix tree which tracks in-core pages tied to that mapping. Among other things, this tree allows the memory management code to quickly find pages which are dirty or under writeback.

4. How to use radix trees?
There are two ways to initialize a radix tree:
#include linux/radix-tree.h
RADIX_TREE(name, gfp_mask); /* Declare and initialize */
struct radix_tree_root my_tree;
INIT_RADIX_TREE(my_tree, gfp_mask);

The first approach is simply a wrapper of combination of the latter one.
Then all functions defined in linux/radix-tree.h can be used to manipulate the tree.

5. Notes
The radix-tree API requires users to provide all synchronizations(with some specific exceptions).
For API usage, in general,
- any function _modifying_ the tree or tags (inserting or deleting
items, setting or clearing tags) must exclude other modifications, and
exclude any functions reading the tree.
- any function _reading_ the tree or tags (looking up items or tags,
gang lookups) must exclude modifications to the tree, but may occur
concurrently with other readers.

The notable exceptions to this rule are the following functions:
(1)Destroying a radix tree.
There is no function for destroying a radix tree. It is, evidently, assumed that radix trees will last forever. In practice, deleting all items from a radix tree will free all memory associated with it other than the root node, which can then be disposed of normally.

6. Reference:

Thursday, January 1, 2009

Kernel basis(1): the container_of macro

Short version:
The following article explains how container_of macro works and how to use
it(based on

Long version:
The container_of macro is defined in linux/kernel.h
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) );})

Some fundamentals:
1. typeof:
This is one of GNU C extensions. In ANSI C and ISO C, this should be
__typeof__. The macro takes in two types of arguments: an expression or
a type.

Expression: typeof(x[0](1))
where x is an array of pointers to functions. The macro returns the value
of the function.

Type: typeof(int *)
This is the type of pointers to int.

2. offsetof(TYPE, MEMBER)
This is an ANSI C library feature defined in stddef.h. It evaluates to
offset(in bytes) of a given member within a struct or union type.
Typical implementations:
#define offsetof(TYPE, MEMBER) ((site_t) &((TYPE*)0)->MEMBER)

#define offsetof(TYPE, MEMBER) \
((size_t) ((char *)&((TYPE*)(0))->MEMBER - (char *)0))

gcc has buitin offsetof. To use it, do
#undef offsetof
#ifdef __compiler_offsetof
#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

Now, we come to the container_of macro at last. Really easy to understand.
The first line defines a pointer(__mptr) that points to *ptr. So we have
tow pointers(__mptr and ptr) both pointing to the same memory location.

The second line finds the real location in memory of the containing
structure that contains what ptr points to. The offsetof macro calculates
the memory offset of member starting from type. Then it dismiss the offset
from memory location __mptr and gives us the memory location of the
containing structure, type.

How to use it?
This is useful for calling back functions between different software layers.
/* test code to illustrate use of Linux kernel container_of macro
* Copyright (c) 2008 Cliff Brake, BEC Systems LLC
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* This program illustrates how the container_of macro works.
* The container of macro is very useful in multi layered
* software systems where you have progressivly more detailed
* software layers. Below is an example of a bus layer,
* and then a device layer where a number of different
* devices might register with the bus.
* The device registers itself with the bus subsystem, and
* then the bus subsystem makes a callback into the device.
* Normally if there are multiple devices registered, the
* bus subsystem must store and pass a device structure
* when making callbacks. With the container_of macro, this is
* no longer necessary, and the bus subsystem only has to
* know about one generic device structure, and does not need visibility
* into lots of different device structures, or do tricks
* by casting void pointers, etc. With the container_of macro
* we can backcast from the generic data structure, to the containing
* datastructure. This forces good separation of code in that
* that bus layer cannot modifiy data structures that are specific
* to the device layer.

* (from Linux kernel source)
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.

#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) );})

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

* BUS layer code

/* generic bus device structure */
struct bus_device
int general_device_param_a;
int general_device_param_b;
void (*device_callback)(struct bus_device * bd);

/* the following is a global list of
* devices that have registered with the
* bus subsystem. Normally this would
* be something like a dynamic linked list.

struct bus_device * bd_list[5];

/* function to register a device with the bus */
void register_with_bus(struct bus_device * bd)
/* since this example only deals with one
* device, will put it in slot 0

bd_list[0] = bd;

void start_bus()
int i;
struct bus_device * bd;

/* make callbacks to all devices on bus */
for (i=0;i<sizeof(bd_list)/sizeof(bd_list[0]);i++) {
bd = bd_list[i];
if (!bd) continue;
/* call device callback with generic
* bus device structure


* device X specific code
* this would normally be in a different module

/* structure that holds device X specific stuff, as well as
* generic bus_device structure

struct device_x
int device_x_specific_param_a;
int device_x_specific_param_b;
struct bus_device bd;

void device_x_callback(struct bus_device * bd)
/* if we know the structure type that contains the bus_device structure,
* we can extract a pointer to the containing structure using the container_of
* macro

/* ptr type member */
struct device_x * devx = container_of(bd, struct device_x, bd);

/* the above statement expands to
* struct device_x * devx = (
* {
* const typeof( ((struct device_x *)0)->bd ) *__mptr = (bd);
* (struct device_x *)( (char *)__mptr - ((size_t) &((struct device_x *)0)->bd) );
* }
* );

printf("device_x_callback called!, device_x_specific_param_a = %i\n",

void device_x_init()
/* dynamically allocate structures */
struct device_x * devx = malloc(sizeof(*devx));
memset(devx, 0, sizeof(*devx));

/* set a parameter in the device_x structure so
* we can test for this in the callback

devx->device_x_specific_param_a = 1001;

/* set up callback function */
devx->bd.device_callback = device_x_callback;

/* we register the generic bus device structure
* as the bus layer does not need to know
* about the device_x stucture. Note, the
* devx structure is not stored anywhere, yet
* its location is being preserved without
* specifically passing it to the bus
* layer.


int main()

/* test the above system */

/* first, initialize device_x */

/* now, start the bus. This should make
* a callback into the device_x


/* when run, this program returns:
* device_x_callback called!, device_x_specific_param_a = 1001