The second day of CLSF 2011 started with Wu Fengguang's IO-less throttling. It is a continuation of his work from last year. Fengguang has done great work on improving Linux kernel read-ahead, and for the last two years he has been working on improving writeback.
There are many dirty page writeback paths in the Linux kernel: fsync(), sync(), periodic writeback, background writeback, balance_dirty_pages() and the page reclaim path. Among them, balance_dirty_pages() and the page reclaim path are the worst for performance because they tend to write back random single pages. The purpose of balance_dirty_pages() is to balance page dirtying against page writeback, so that a task doesn't cause too much memory pressure. Currently, balance_dirty_pages() calls writeback_inodes_wb() to write back dirty pages. What Fengguang does is to change balance_dirty_pages() to:
1. let the dirtier task sleep instead of initiating writeback in balance_dirty_pages(), so the flusher can start writeback when necessary;
2. match page dirtying speed and writeback speed smoothly, so that page reclaim is avoided as much as possible;
3. balance the dirtying speed of each memory dirtier, so that a large writer doesn't eat all the memory and stall smaller writers for a long time.
There are a few parameters to tune, and the most important one is how long the dirtier task should sleep, which takes into account the current disk write bandwidth, system memory pressure and the task's dirtying speed.
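To make the sleep-time idea concrete, here is a minimal sketch of how such a pause could be derived from the three inputs mentioned above. It is not the kernel's code: the function name, the equal split of bandwidth between dirtiers, the linear pressure factor and the 0.2 second cap are all assumptions made for illustration.

```python
def pause_seconds(pages_dirtied, write_bw, n_dirtiers,
                  dirty_pages, setpoint, limit, max_pause=0.2):
    """Return how long a task should sleep after dirtying `pages_dirtied` pages.

    write_bw    -- estimated disk write bandwidth, in pages per second
    n_dirtiers  -- number of tasks currently dirtying pages
    dirty_pages -- current number of dirty pages in the system
    setpoint    -- dirty page count we would like to hover around
    limit       -- dirty page count at which dirtiers should be stopped
    max_pause   -- upper bound on a single sleep (assumed value)
    """
    # Share the disk bandwidth equally between dirtiers (assumption).
    base_rate = write_bw / n_dirtiers

    # Memory pressure factor: 1.0 at the setpoint, 0.0 at the limit,
    # above 1.0 when there are fewer dirty pages than the setpoint.
    # A linear ramp is used here purely to keep the sketch simple.
    pos_ratio = max(0.0, (limit - dirty_pages) / (limit - setpoint))

    task_rate = base_rate * pos_ratio  # pages per second allowed for this task
    if task_rate <= 0:
        return max_pause

    # Sleeping this long makes the task's average dirtying speed
    # equal to task_rate.
    return min(max_pause, pages_dirtied / task_rate)


if __name__ == "__main__":
    # Four writers sharing a disk that writes 25600 pages/s, at the setpoint.
    print(pause_seconds(32, 25600, 4, dirty_pages=1000, setpoint=1000, limit=2000))
```

The point of the sketch is only that the task is put to sleep for a computed time instead of being asked to write pages itself; the flusher remains the only one submitting writeback IO.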
The whole patchset is referred to as IO-less throttling and is getting merged in the 3.2 merge window, as decided at Kernel Summit last week. It is really time-consuming, critical work that needs a brilliant and patient mind. Fengguang said he drew more than 10K performance graphs and read even more in the past year.
While the IO-less throttling patches help improve writeback performance, Coly and Ma Tao said they are more interested in a side effect of the patches: per-task buffered write bandwidth control. The problem comes from the current implementation of the blkio cgroup, which accounts IO bandwidth to the task that submitted the bio. For buffered writes, however, the pages are most likely written back by the flusher thread, so the blkio controller does not work for buffered writes. Fengguang's IO-less throttling offers another way: control the write speed at balance_dirty_pages(). Fengguang said this should be possible, and easy to do in balance_dirty_pages().
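As a rough illustration of limiting a writer at dirtying time rather than at bio submission time, the sketch below keeps a per-task byte count and answers how long the task must sleep to stay under its limit. The class and its names are hypothetical and only model the idea; nothing here is taken from the patches.

```python
import time


class DirtyThrottle:
    """Per-task cap on buffered write speed, applied when data is dirtied."""

    def __init__(self, limit_bytes_per_sec, clock=time.monotonic):
        self.limit = limit_bytes_per_sec
        self.clock = clock            # injectable so the logic can be tested
        self.start = clock()
        self.dirtied = 0              # bytes dirtied since `start`

    def account(self, nbytes):
        """Record `nbytes` of newly dirtied data.

        Returns the number of seconds the task should sleep so that its
        average dirtying speed since `start` does not exceed the limit.
        """
        self.dirtied += nbytes
        # Earliest time at which this much data is allowed to be dirty.
        earliest = self.start + self.dirtied / self.limit
        return max(0.0, earliest - self.clock())


if __name__ == "__main__":
    throttle = DirtyThrottle(limit_bytes_per_sec=1_000_000)
    delay = throttle.account(500_000)   # task just "wrote" 500 KB to page cache
    time.sleep(delay)                   # the dirtier sleeps; no IO is issued here
```

Because the accounting happens where the task dirties the page, it does not matter that the flusher thread is the one that later submits the bio.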
The second topic was a memory management discussion led by Li Shaohua. We mainly discussed some issues found in Taobao's and Baidu's data centers. The first issue, presented by Coly, is whether to turn swap off on servers running all-in-memory services. With swap on, performance may suffer when pages are being swapped out; with swap off, the oom-killer is easily triggered when applications use too much memory. The real problem is that an application doesn't know about system memory pressure (malloc doesn't fail at all) until OOM. A possible solution would be to send a "sigfault" signal to the application when the system reaches the memory high watermark. Xie Guangjun from Baidu commented that swap is on in their deployment. Only small databases are all-in-memory. For file data they use auto tiering, and their search engine applications are not allowed to allocate memory dynamically...
Another issue at Taobao is about SSD+HD auto tiering. The cache hit ratio for HD data is very low. As a result, when there is a cache miss in SSD, reading from HD is very slow (3 IOs per file on average: dentry, inode, data). Possible solutions:
1. Use cgroups to control memory for SSD and HD separately, to keep the HD buffer cache in memory.
2. Use DIO to read this small data, to avoid wasting page cache on HD data.
3. Create a new zone and use a bdi->mapping flag to specify which zone to allocate memory from for each device, for the same reason as 1.
It was also mentioned that memcg charges buffer cache to its first user, which is unfair. A possible solution would be to create a task struct for each bdi and charge these pages to the bdi task.
In the afternoon, I first led a session on pNFS. The first issue is how large a layout the client should ask for when doing IO. The current pNFS client only asks for a 4KB layout in writepages, even if the wbc asks for much more. Fengguang agreed that the client knows how much it is going to write back, and that it is a pity to drop that information and leave the server to make a wild guess. Then I presented the idea of making Linux pNFS a full POSIX client. It is technically doable, but the main reason to do it (for EMC) would be business need.
After that, Xie Guangjun led a session about Baidu's work in their data centers. The first change is to use memcg to guide the oom-killer. They make the mapreduce jobtracker send down a memory limit per task, then the task tracker creates the new task with that memory limit (now using cgroups), and the oom-killer kills all processes in the same cgroup.
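For readers who have not used memcg, the sketch below shows roughly what "create a task with a memory limit" can look like from user space. It is not Baidu's code. The control file names memory.limit_in_bytes and cgroup.procs are those of the cgroup v1 memory controller; the function names, the per-task directory layout and the root path passed in are assumptions for illustration, and writing to a real cgroup mount needs the appropriate privileges.

```python
import os


def create_task_cgroup(memcg_root, task_id, limit_bytes, pid):
    """Create a memory cgroup for one task and move `pid` into it.

    memcg_root -- mount point of the memory controller (caller supplied)
    task_id    -- name of the per-task group (hypothetical naming scheme)
    """
    path = os.path.join(memcg_root, task_id)
    os.makedirs(path, exist_ok=True)

    # Hard limit on the memory charged to this group.
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))

    # Move the process (all of its threads) into the group; children
    # forked afterwards stay in the same group.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path


def pids_in_cgroup(path):
    """List the processes in a group, e.g. to kill them together on OOM."""
    with open(os.path.join(path, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]
```

With every process of a task in one group, whoever handles the out-of-memory condition can treat the group as the unit to kill instead of picking a single victim process.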
Another interesting thing is that Baidu is building a global resource manager for their data center. The idea is similar to Hadoop's new resource manager in MDS, but it will be a general-purpose one. Guangjun mentioned that Google has a similar implementation in their data centers as well.
The most interesting thing from Baidu is their work on flash. In the current deployment, they build SSDs from NAND with MTD+UBI+LVM. It is joint work with Huawei and fully implements the functionality of the mysterious FTL. In the future, they will build the SSD such that it bypasses the whole VFS layer and uses an ioctl API to do read/write/trim. It seems that even with some of Nick Piggin's scalability patches merged, the Linux VFS is still a hot spot of lock contention for high-speed storage. In our previous training from Whamcloud, they came to the same conclusion and are planning to drop the VFS interface in the future.
After Baidu, the Intel performance team led a discussion about the impact of emerging hardware, namely SSD and PCM, on current system architecture. SSD is already changing the system architecture at Baidu. PCM is likely to be even more revolutionary, possibly even removing the need for today's complex file system implementations. However, that is still far in the future. We need to focus on what we have at hand.
That's all for the two-day event. I didn't mean to write such a long story, but everything above is worth documenting, so I couldn't drop any of it...
Finally, thanks to Baidu and Fujitsu for their generous sponsorship. Thank you to all the CLSF committee members for making this happen. And thanks to EMC for sending me there. It was a really valuable experience and will benefit me for a long time.