Tuesday, November 1, 2011

CLSF 2011 - Day 2

The second day of CLSF 2011 started with Wu Fengguang's IO-less throttling. It is a continuation of his work from last year. Fengguang has done great work improving Linux kernel readahead, and for the last two years he has been working on improving writeback.

There are many dirty page writeback paths in the Linux kernel: fsync(), sync(), periodic writeback, background writeback, balance_dirty_pages() and the page reclaim path. Among them, balance_dirty_pages() and the page reclaim path are the worst for performance because they tend to write back random single pages. The purpose of balance_dirty_pages() is to balance page dirtying against page writeback, so that a task doesn't cause too much memory pressure. Currently, balance_dirty_pages() calls writeback_inodes_wb() to write back dirty pages. What Fengguang does is adjust balance_dirty_pages() to:
1. let the dirtier task sleep instead of initiating writeback in balance_dirty_pages(), so the flusher can start writeback when necessary;
2. make page dirtying and writeback speed match smoothly so that page reclaim is avoided as much as possible;
3. balance the dirty speed of each memory dirtier so that a large writer doesn't eat all memory and stall smaller writers for a long time.

There are a few parameters to tune, and the most important one is how long the dirtier task should sleep, which takes into account the current disk write bandwidth, system memory pressure and the task's dirtying speed.
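
To make the idea concrete, here is a minimal user-space-style sketch of the pause computation (my own simplification with made-up constants, not the actual kernel code): the dirtier sleeps for a period proportional to how much it dirtied and inversely proportional to a per-task ratelimit, which is assumed to be derived elsewhere from the measured disk write bandwidth and the dirty thresholds.

/*
 * Simplified sketch of the IO-less throttling idea, NOT the real
 * balance_dirty_pages() code: a task that dirtied `pages_dirtied`
 * pages sleeps for a period proportional to its dirtying volume and
 * inversely proportional to a per-task ratelimit.  The ratelimit is
 * assumed to be computed elsewhere from the measured disk write
 * bandwidth and the global/per-bdi dirty thresholds.
 */
#define HZ          100            /* made-up tick rate for the sketch */
#define MAX_PAUSE   (HZ / 5)       /* cap a single sleep at 200ms */

/* returns jiffies to sleep; task_ratelimit is in pages per second */
static long compute_pause(unsigned long pages_dirtied,
                          unsigned long task_ratelimit)
{
    long pause;

    if (task_ratelimit == 0)
        return MAX_PAUSE;          /* heavily over the dirty limit */

    pause = HZ * pages_dirtied / task_ratelimit;
    if (pause > MAX_PAUSE)
        pause = MAX_PAUSE;         /* keep the dirtier responsive */
    return pause;
}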

The whole patchset is referred to as IO-less throttling and is being merged in the 3.2 merge window, as decided at the Kernel Summit last week. It is really a time-consuming, critical piece of work that needs a brilliant and patient mind. Fengguang said he drew more than 10K performance graphs and read even more in the past year.

While the IO-less throttling patches help improve writeback performance, Coly and Ma Tao said they are more interested in a side effect of the patches: per-task buffered write bandwidth control. The problem is that the current blkio cgroup implementation accounts IO bandwidth to the task that submitted the bio. However, for buffered writes, the pages are most likely written back by the flusher thread, so the blkio controller does not work for buffered writes. Fengguang's IO-less throttling offers another way: control the write speed in balance_dirty_pages(). Fengguang said it should be possible and easy to do there.


The second topic was a memory management discussion led by Li Shaohua. We mainly discussed issues found in Taobao's and Baidu's data centers. The first issue Coly presented is whether to turn swap off on all-in-memory service servers. With swap on, performance may suffer when pages are swapped out; with swap off, the oom-killer is easily triggered when applications use too much memory. The real problem is that an application doesn't know about system memory pressure (malloc doesn't fail at all) until OOM. A possible solution would be to send a fault signal to the application when the system reaches a memory high watermark. Xie Guangjun from Baidu commented that swap is on in their deployment. Only small databases are all-in-memory. For file data, they use auto tiering, and their search engine applications are not allowed to allocate memory dynamically...

Another issue at Taobao is about SSD+HD auto tiering. The HD data cache hit ratio is very low. As a result, when there is a cache miss in SSD, reading from HD is very slow (on average 3 IOs per file: dentry, inode, data). Possible solutions:
1. Use cgroups to control memory for SSD and HD, to keep the HD buffer cache in memory.
2. Use DIO to read these small data, to avoid wasting page cache on HD data (see the sketch after this list).
3. Create a new zone and use a bdi->mapping flag to specify which zone to allocate memory from for the device, for the same reason as 1.
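
For option 2, here is a minimal sketch of such a direct read (the path and sizes are made up for illustration). O_DIRECT bypasses the page cache, so the buffer, file offset and length must be aligned, typically to the device's logical block size.

/* Minimal O_DIRECT read sketch for option 2 above (illustrative only:
 * the path and sizes are made up).  O_DIRECT bypasses the page cache,
 * so the buffer, file offset and length must be aligned to the
 * device's logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096, len = 4096;
    void *buf;
    ssize_t n;
    int fd;

    if (posix_memalign(&buf, align, len))        /* aligned buffer */
        return 1;

    fd = open("/hd/data/object.0001", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    n = pread(fd, buf, len, 0);                  /* aligned offset and length */
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes without polluting the page cache\n", n);

    close(fd);
    free(buf);
    return 0;
}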

It was also mentioned that memcg charges buffer cache to the first user, which is unfair. A possible solution would be to create a task struct for each bdi and charge these pages to the bdi task.

In the afternoon, I first led a session on pNFS. The first issue is how much layout the client should ask for per IO. The current pNFS client only asks for a 4KB-long layout in writepages, even if the wbc asks for much more. Fengguang agreed that the client has the information about how much to write back, and it is a pity to drop that information and let the server make a wild guess. Then I presented the idea of making pNFS Linux a full POSIX client. It is technically doable, but the main driver (for EMC) would be business need.


After that, Xie Guangjun led a session about Baidu's work in their data centers. The first change is to use memcg to guide the oom-killer. They make the MapReduce jobtracker send down a memory limit per task; the tasktracker then creates each new task with that memory limit (now using cgroups), and the oom-killer kills all processes in the same cgroup.
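
As a rough illustration only (the cgroup mount point, limit value and task binary below are my assumptions, not Baidu's actual setup), a tasktracker could do something like this before launching a task:

/* Hypothetical sketch of what a tasktracker might do: put a child task
 * into its own memory cgroup with a hard limit, assuming the memory
 * controller is mounted at /cgroup/memory (mount point, limit and task
 * binary are made up). */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    char pid[32];

    mkdir("/cgroup/memory/task42", 0755);                      /* one cgroup per task */
    write_file("/cgroup/memory/task42/memory.limit_in_bytes",
               "1073741824");                                  /* limit sent by the jobtracker */
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/cgroup/memory/task42/tasks", pid);            /* move ourselves in */
    execl("/bin/map_task", "map_task", (char *)NULL);          /* hypothetical task binary */
    return 1;
}

The limit value here (1GB) simply stands in for whatever the jobtracker sent down.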

Another interesting thing is that Baidu is building a global resource manager for their data centers. The idea is similar to Hadoop's new resource manager, but theirs will be a general-purpose one. Guangjun mentioned that Google has a similar implementation in their data centers as well.

The most interesting thing from Baidu is their work on flash. In the current deployment, they build SSDs from NAND with MTD+UBI+LVM. It is joint work with Huawei that fully implements the functionality of the mysterious FTL. In the future, they will build the SSD such that it bypasses the whole VFS layer and uses an ioctl API to do read/write/trim. It seems that even with some of Nick Piggin's scalability patches merged, the Linux VFS is still a hot spot of lock contention for high-speed storage. In our previous training from Whamcloud, they came to the same conclusion and are planning to drop the VFS interface in the future.

After Baidu, the Intel performance team led a discussion about the impact of emerging new hardware, namely SSD and PCM, on current system architecture. SSD is already changing system architecture at Baidu. PCM is likely to be even more revolutionary, possibly even removing the need for today's complex file systems. However, that is still far in the future, and we need to focus on what we have at hand.

That's all for the two-day event. I didn't mean to write such a long story, but all of the above is worth documenting, so I couldn't drop any of it...

In the end, thanks to Baidu and Fujitsu for their generous sponsorship, to the CLSF committee for making this happen, and to EMC for sending me there. It was really a valuable experience and will benefit me for quite a long time.

CLSF 2011 - Day 1

A year has passed since CLSF 2010 in Shanghai. This year, CLSF was held in Nanjing and sponsored by Baidu and Fujitsu. Coly has kindly built an official website for the event.

The workshop started with Btrfs updates given by Fujitsu's Zefan Li. Fujitsu has contributed a lot to btrfs development this year. The first important feature Fujitsu developed for btrfs is LZO transparent compression. Before, btrfs only used zlib, but the infrastructure allows adding more compression algorithms. LZO consumes less CPU and is faster than zlib. The current compression unit is the extent (at most 160KB) and every read/write needs to touch the whole extent, so the feature seems more suitable for the SRSW (single reader, single writer) use case.

Another interesting feature added to btrfs is auto-defragmentation. Copy-on-write filesystems like btrfs and WAFL all suffer from file fragmentation, and btrfs tries to (sort of) solve it in a nice and easy way. When a page is written, btrfs reads in the extent around the current page and marks those pages dirty in memory, so that later writeback can write contiguous blocks and on-disk extents tend to be less fragmented. A side effect is that it causes more IO than necessary and may hurt random write performance. While the idea of auto-defrag is cool, defragmentation consumes resources, so timing is very important. The best time to start auto-defrag and how far it should go (e.g., turning 5 fragments into 4 may not be worth doing at all) remain open questions.

Miao Xie also introduced the btrfs tree log and the sub-transaction feature. Before the tree log, btrfs maintained one big transaction (committed every 30 seconds by default), and if a user fsync'ed one file, the whole transaction had to be committed, which hurt the performance of other writers badly. With the tree log, btrfs maintains a write-ahead on-disk log for each such fsync operation, and when the transaction commits, it discards all these logs. As a result, fsyncing one file only writes back that file's dirty data through the tree log, and other writers can keep writing to the same transaction, which improves performance in the many-writers-with-fsync use case.

One defect of the tree log is that when writing new logs, it also rewrites all the old logs (again and again). The solution is to assign a sub-transaction id to each log and use it to avoid rewriting old logs, which improves performance in the large-file-with-many-fsyncs use case.

One question we are all interested in is when btrfs will be suitable for production. Currently the biggest concern is btrfsck, which has been delayed for years. Although btrfs has built-in RAID and scrub (which rebuilds broken RAID blocks automatically), people still think fsck is a must. Zefan explained that Chris Mason will demo btrfsck at LinuxCon later this month. The reason Chris didn't release btrfsck for so long is that he was trying different designs/implementations and did not want to corrupt user data with a half-done btrfsck. The good news is that Chris is pushing btrfs into Oracle's next UBL release.

The last but most important question for btrfs is performance. According to the Intel guys' tests, btrfs is slower than ext4 in many cases. Some argued that btrfs offers many fancy features ext4 doesn't, but these features are more attractive to desktop users than to server users, and server users are the real force pushing the Linux kernel forward these days.

The second topic was distributed/cluster storage and cloud, led by Novell's Zhang Jiaju. This was a general talk about open source cloud infrastructure. Distributed systems put a lot of pressure on DLM and sometimes make the DLM itself too heavy. We also discussed why Lustre isn't widely used outside of the HPC market. Jiaju thought the most important reason is that Lustre is not merged into the Linux kernel, so enterprise users do not trust it as much as in-kernel filesystems, even relative newcomers like Ceph.

Jiaju maintains corosync in SUSE Linux and has made an implementation of Paxos. Paxos came into the spotlight because of Google's Chubby. He also introduced corosync's Totem a little. Totem performs better than Paxos when cluster membership doesn't change. But we all agreed that quorum is not the best way to solve the large-cluster synchronization problem, which is why a typical Chubby cluster only has a few machines.

That's all for the morning; we continued the discussion over the lunch break.

Taobao held the first two sessions in the afternoon. Taobao uses ext4 heavily on many of their servers and is particularly interested in a few features. The first is Ted's bigalloc, which allocates data in larger chunks and is useful for large sequential write applications like HDFS. One issue with bigalloc is directories: with bigalloc, a minimum 1MB block for each directory is certainly a waste. Ma Tao has sent out preview patches; the main idea is to use the inode xattr space to store the inline data.

Metadata checksums, snapshots and compression are also on Taobao's radar. Metadata checksums are mainly for debugging purposes: when a file system corrupts, people can first check the metadata checksum, and if it doesn't match, it is very likely a disk problem rather than a file system bug. However, it doesn't help in no-journal mode. For snapshots, Yang Yongqiang is actively working on them.

The third topic in the afternoon was block layer updates, led by Li Shaohua. The most important change in the block layer is the removal of IO barriers in favor of FLUSH/FUA. IO barriers had been there for a very long time. They are a block layer concept, implemented via FLUSH/FUA, and serve two purposes: request ordering and forced flushing to physical media. Both the block layer and the disk cache may queue/reorder IO requests. IO barriers ensure that preceding requests are processed before the barrier and following requests after it. They also ensure that all data in the disk write cache is flushed before returning.

IO barriers are well known as performance killers, and enforcing ordering with IO barriers at the block layer is more than what is actually needed. Filesystems just want to ensure that some important requests reach the disk, not EVERYTHING sent to the block layer prior to the barrier. Filesystems should concern themselves with ordering, since that's where the information is, and not dump the problem onto the block layer. A filesystem that wants operations to happen in a specific order simply needs to issue them in the proper order, waiting for completion when necessary. The block layer can then reorder requests at will.

The resolution is to expose FLUSH/FUA and let filesystems decide. REQ_HARDBARRIER no longer exists, while REQ_FLUSH and REQ_FUA are added. A REQ_FLUSH request forces the write cache to be flushed before the operation begins, and REQ_FUA means the data associated with the request itself must be committed to persistent media by the time the request completes.

So the write procedure has changed from:
submit_bio(WRITE_BARRIER)
other reqs -> (wait) preflush -> (wait) bio reqs -> (wait) postflush -> other reqs
to:
submit_bio(WRITE_FLUSH_FUA)
other reqs -> preflush -> other reqs -> (preflush finishes) bio reqs -> other reqs -> (bio reqs finish) postflush

In both cases, if the device supports FUA, the bio requests are issued with the REQ_FUA flag and the postflush is omitted. The performance gain comes from the fact that one transaction no longer needs to wait for the IO requests of other transactions, at the cost of letting file systems maintain the dependencies among their own IO requests.
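
As a rough sketch of how a filesystem might use the new flags (the struct and helper names below are hypothetical; submit_bio(), WRITE and WRITE_FLUSH_FUA are the real interfaces of that era), a journal commit could look like this:

/*
 * Sketch only: a hypothetical journal commit using the new flags.
 * The struct and the helpers are made up for illustration.
 */
static void journal_commit(struct my_journal *j)
{
    int i;

    /* 1. Journal data blocks go out as ordinary writes; the block
     *    layer and the device are free to reorder them. */
    for (i = 0; i < j->nr_bios; i++)
        submit_bio(WRITE, j->bios[i]);

    /* 2. Wait for our own blocks only.  We no longer drain or order
     *    against other transactions' requests. */
    my_journal_wait_for_bios(j);               /* hypothetical helper */

    /* 3. The commit record: REQ_FLUSH forces the blocks from step 1
     *    out of the write cache first, and REQ_FUA makes the record
     *    itself durable (or the block layer adds a postflush if the
     *    device has no FUA). */
    submit_bio(WRITE_FLUSH_FUA, j->commit_bio);
}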

The last session was an industry one led by Huawei's chief scientist Hu Ziang. Ziang asked some questions on behalf of their engineers and then introduced Huawei's Linux strategy. Huawei used to be a very closed company, but now they are seeking to open up and embrace Linux and open source. They will build a very large Linux developer team in the near future. Good news for the China Linux community.

That's all for the first day of excitement.

Friday, September 2, 2011

X86 Calling Convention

Wikipedia has a definition: a calling convention is a scheme for how functions receive parameters from their callers and how they return results.

Basically, it is a compiler ABI and varies across platforms (like Windows and Linux). This is interesting and useful for debugging (at least for understanding how debuggers work...).

For example, a simple piece of code:
void f(int arg1, int arg2, int arg3, int arg4, float arg5, int arg6, float arg7,
        float arg8, int arg9, int arg10, int arg11, int arg12)
{
        printf("%d %d %d %d %f %d %f %f %d %d %d %d\n",
                arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9,
                arg10, arg11, arg12);
}

void main()
{
        f(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
}


On i386 Linux, compile the above and disassemble it:
[SID@test]$cc a.c -o a
a.c: In function ‘f’:
a.c:4: warning: incompatible implicit declaration of built-in function ‘printf’
[SID@test]$objdump -d a > a.s

In a.s, we can see that main calls f by passing every argument on the stack.
0804842e <main>:
 804842e: 55                      push   %ebp
 804842f: 89 e5                   mov    %esp,%ebp
 8048431: 83 e4 f0                and    $0xfffffff0,%esp
 8048434: 83 ec 30                sub    $0x30,%esp
 8048437: c7 44 24 2c 0c 00 00    movl   $0xc,0x2c(%esp)
 804843e: 00
 804843f: c7 44 24 28 0b 00 00    movl   $0xb,0x28(%esp)
 8048446: 00
 8048447: c7 44 24 24 0a 00 00    movl   $0xa,0x24(%esp)
 804844e: 00
 804844f: c7 44 24 20 09 00 00    movl   $0x9,0x20(%esp)
 8048456: 00
 8048457: b8 00 00 00 41          mov    $0x41000000,%eax
 804845c: 89 44 24 1c             mov    %eax,0x1c(%esp)
 8048460: b8 00 00 e0 40          mov    $0x40e00000,%eax
 8048465: 89 44 24 18             mov    %eax,0x18(%esp)
 8048469: c7 44 24 14 06 00 00    movl   $0x6,0x14(%esp)
 8048470: 00
 8048471: b8 00 00 a0 40          mov    $0x40a00000,%eax
 8048476: 89 44 24 10             mov    %eax,0x10(%esp)
 804847a: c7 44 24 0c 04 00 00    movl   $0x4,0xc(%esp)
 8048481: 00
 8048482: c7 44 24 08 03 00 00    movl   $0x3,0x8(%esp)
 8048489: 00
 804848a: c7 44 24 04 02 00 00    movl   $0x2,0x4(%esp)
 8048491: 00
 8048492: c7 04 24 01 00 00 00    movl   $0x1,(%esp)
 8048499: e8 26 ff ff ff          call   80483c4 <f>
 804849e: c9                      leave
 804849f: c3                      ret

Then on x86_64 Linux, the code compiles into the following, where parameters are passed to f in three ways: general purpose registers (di, si, dx, cx, r8d, r9d), xmm registers (xmm0~xmm2), and the function stack.
000000000040054b <main>:
 40054b: 55                      push   %rbp
 40054c: 48 89 e5                mov    %rsp,%rbp
 40054f: 48 83 ec 20             sub    $0x20,%rsp
 400553: c7 44 24 10 0c 00 00    movl   $0xc,0x10(%rsp)
 40055a: 00
 40055b: c7 44 24 08 0b 00 00    movl   $0xb,0x8(%rsp)
 400562: 00
 400563: c7 04 24 0a 00 00 00    movl   $0xa,(%rsp)
 40056a: 41 b9 09 00 00 00       mov    $0x9,%r9d
 400570: f3 0f 10 15 60 01 00    movss  0x160(%rip),%xmm2    # 4006d8 <__dso_handle+0x30>
 400577: 00
 400578: f3 0f 10 0d 5c 01 00    movss  0x15c(%rip),%xmm1    # 4006dc <__dso_handle+0x34>
 40057f: 00
 400580: 41 b8 06 00 00 00       mov    $0x6,%r8d
 400586: f3 0f 10 05 52 01 00    movss  0x152(%rip),%xmm0    # 4006e0 <__dso_handle+0x38>
 40058d: 00
 40058e: b9 04 00 00 00          mov    $0x4,%ecx
 400593: ba 03 00 00 00          mov    $0x3,%edx
 400598: be 02 00 00 00          mov    $0x2,%esi
 40059d: bf 01 00 00 00          mov    $0x1,%edi
 4005a2: e8 1d ff ff ff          callq  4004c4 <f>
 4005a7: c9                      leaveq
 4005a8: c3                      retq
 4005a9: 90                      nop
 4005aa: 90                      nop
 4005ab: 90                      nop
 4005ac: 90                      nop
 4005ad: 90                      nop
 4005ae: 90                      nop
 4005af: 90                      nop

So why the difference? Basically this is part of the System V AMD64 ABI, which GCC and ICC (the Intel compiler) implement on Linux, BSD and Mac, and which defines that rdi, rsi, rdx, rcx, r8, r9 can be used to pass integer parameters and xmm0-7 can be used to pass floating point parameters.

This leads to another question: why not the other registers? On x86_64, there are 16 general purpose registers that can hold integers (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8~r15) and 16 xmm registers that can hold floating point values (xmm0~xmm15). The compiler ABI divides them into volatile and non-volatile registers. Volatile registers are scratch registers presumed by the caller to be destroyed across a call. Non-volatile registers are required to retain their values across a function call and must be saved by the callee if used. So volatile registers are naturally suitable for function arguments, while there is overhead in using non-volatile registers (they must be saved).

The calling convention ABI is basically about which registers are volatile/non-volatile, which are reserved for special purposes (parameter passing, frame pointer, stack pointer, etc.), what the order of arguments on the stack is, who (caller or callee) is responsible for cleaning up the stack, as well as the stack layout/alignment.

For 64-bit x86, the two major calling conventions are:
1. Microsoft x64 calling convention. Operating system/compiler: Windows (Microsoft compiler, Intel compiler). Parameters in registers: rcx/xmm0, rdx/xmm1, r8/xmm2, r9/xmm3. Parameter order on stack: RTL (C). Stack cleanup by: caller. Notes: stack aligned on 16 bytes; 32 bytes shadow space on the stack; the specified 8 registers can only be used for parameters number 1, 2, 3 and 4.
2. System V AMD64 ABI convention. Operating system/compiler: Linux, BSD, Mac (GCC, Intel compiler). Parameters in registers: rdi, rsi, rdx, rcx, r8, r9, xmm0-7. Parameter order on stack: RTL (C). Stack cleanup by: caller. Notes: stack aligned on 16 bytes; red zone below the stack.

The above applies to user space applications and kernel space functions. As always, there is an exception: system calls. System calls trap from user space context into kernel space and have special requirements for parameter passing (a small example follows the list):
1. User-level applications pass integer arguments in the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.

2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

3. The number of the syscall has to be passed in register %rax.

4. System calls are limited to six arguments; no argument is passed directly on the stack.

5. Returning from the syscall, register %rax contains the result of the system call. A value in the range between -4095 and -1 indicates an error; it is -errno.

6. Only values of class INTEGER or class MEMORY are passed to the kernel.
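
To make the rules concrete, here is a small example (my own illustration, not from the ABI document) that invokes write(2) directly via the syscall instruction, following the register usage above:

/* A small example of the kernel calling convention above (my own
 * illustration): calling write(2) via the raw syscall instruction.
 * %rax holds the syscall number, arguments go in %rdi/%rsi/%rdx,
 * %rcx and %r11 are destroyed by the kernel, and the result
 * (or -errno) comes back in %rax. */
#include <stddef.h>

static long raw_write(int fd, const void *buf, size_t count)
{
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)                              /* result / -errno in %rax */
        : "a" (1L),                               /* __NR_write == 1 on x86_64 */
          "D" ((long)fd), "S" (buf), "d" (count)  /* %rdi, %rsi, %rdx */
        : "rcx", "r11", "memory"                  /* destroyed by syscall */
    );
    return ret;
}

int main(void)
{
    raw_write(1, "hello from raw syscall\n", 23);
    return 0;
}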

Thursday, September 1, 2011

CLSF 2010

I wasn't able to write down the details about CLSF (China Linux Storage and Filesystem workshop) last year (too busy at that time...). But luckily, I wrote this slide to record the major events at it. Here it is...

Virtual Machine File System

VMFS is an interesting paper. It tells how VMware builds its high-performance SAN file system. Since I have been writing a file system in the ESX kernel over the past year, I feel I am in a good position to understand VMFS's design and intentions. Here are the summary slides about it. I did my best to make sure no confidential information is leaked; all sentences and pictures are either from the paper or from the Internet.


Google Megastore

Reading Google's infrastructure papers is always a joy, and I have just read one recently. Google's Megastore paper presents how Google builds a transactional service on top of Bigtable. Here is the slide I wrote about it...

RCU

RCU had been a mystery to me for quite a long time. I finally had some time to walk through it and understand what happens and why. Here is the slide I presented at August's linuxfb.net (@linuxfb) seminar...
