Friday, October 26, 2012

ext3/4 and fsync

Our team ran into the "infamous" ext3 slowness on fsync() calls. We solved it rather easily by switching to ext4. But one might ask: WTF is going on with ext3 fsync() performance? And why does simply switching to ext4 solve it at all?

I googled for some time but found no obvious explanations or documents about why ext4 outperforms ext3 on fsync workloads. So I had to do it the RTFC way.

The phenomenon is obvious. Given a workload where thread A is sequentially extending a file with large chunks of data, and another thread B is doing periodic fsync() calls, the fsync() call is much slower on ext3 than on ext4. ext3's fsync() tries to write back the pages dirtied by thread A, and while it is writing them back, more pages are dirtied by thread A and also need to be written back. This creates quite a large latency for fsync() on ext3.
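To make the workload concrete, here is a minimal sketch of it (my own illustration; file names, chunk size and the fsync interval are arbitrary): thread A keeps appending 1MB chunks to a big file, while thread B appends a small record to its own file and times each fsync(). Build with something like cc -pthread -lrt workload.c.

/* Minimal sketch of the workload; file names, sizes and intervals
 * are arbitrary. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static char chunk[1 << 20];                     /* 1MB per append */

static void *writer(void *arg)                  /* thread A: extend a big file */
{
        int fd = open("bigfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

        for (;;)
                write(fd, chunk, sizeof(chunk));
        return NULL;
}

static void *syncer(void *arg)                  /* thread B: periodic fsync() */
{
        int fd = open("smallfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        char rec[512] = "small record";
        struct timespec t0, t1;

        for (;;) {
                sleep(1);
                write(fd, rec, sizeof(rec));
                clock_gettime(CLOCK_MONOTONIC, &t0);
                fsync(fd);                      /* slow on ext3, fast on ext4 */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("fsync took %.3f s\n",
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        memset(chunk, 'x', sizeof(chunk));
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, syncer, NULL);
        pthread_join(a, NULL);
        return 0;
}

On ext3 the reported fsync() times grow with how fast thread A dirties pages; on ext4 with delayed allocation they stay small.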

OTOH, ext4 (with default options) doesn't have this problem.

But WHY?

We know that ext3 and ext4 share similar design principles, but they are in fact two different file systems. The secret lies in their backing journal implementations and one key feature that comes with ext4: delayed allocation.

As we know, ext4 is built upon jbd2, which is a successor of ext3's building block, jbd. While jbd2 inherits a lot of its design from jbd, one thing that changed is how data=ordered mode is implemented.
data=ordered is an option controlling how ext3/4 manages data/metadata ordering. From Documentation/filesystems/ext4.txt:

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.  In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.

Both ext3 and ext4 implement this data=ordered mode and set it as the default. However, there is a small difference in how it is implemented in jbd and jbd2.

In jbd/jbd2, when a transaction is being committed, journal_submit_data_buffers() is called to write back the dirty data associated with the current transaction.

In jbd's journal_submit_data_buffers(), the code loops over commit_transaction->t_sync_datalist and relies on jbd buffer heads to ensure that all data has been written back.

In jbd2's journal_submit_data_buffers(), data=ordered is implemented via inodes instead of jbd buffer heads. All inodes touched by the committing transaction are tracked in commit_transaction->t_inode_list, and jbd2 just loops over that list, writing back each of them. There is also a small trick: jbd2 uses writepage instead of writepages to write back each inode mapping. The reason is that ext4's da_writepages does block allocation, while writepage only writes back blocks that are already allocated.
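A heavily simplified sketch of that loop, just to show its shape (this is my paraphrase of the 3.x-era code from memory, not a verbatim copy; field and helper names may differ slightly):

/* Simplified sketch of jbd2's data=ordered step at commit time: walk
 * the inodes attached to the committing transaction and write their
 * dirty pages back through ->writepage (via generic_writepages()),
 * so delayed-allocation pages with no blocks assigned yet are skipped. */
#include <linux/jbd2.h>
#include <linux/writeback.h>

static int journal_submit_data_buffers(journal_t *journal,
                                       transaction_t *commit_transaction)
{
        struct jbd2_inode *jinode;
        int err, ret = 0;

        list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
                struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
                struct writeback_control wbc = {
                        .sync_mode   = WB_SYNC_ALL,
                        .nr_to_write = mapping->nrpages * 2,
                };

                err = generic_writepages(mapping, &wbc);
                if (!ret)
                        ret = err;
        }
        return ret;
}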

So, see the trick? With ext4 data=ordered, fsync() in thread B will only write back data blocks that are already allocated. And with ext4's delayed allocation, most pages dirtied by thread A are not yet backed by allocated blocks, so they won't be written back by jbd2. The pages dirtied by thread B, on the other hand, are written back via da_writepages in ext4_sync_file(), which calls filemap_write_and_wait_range().

So this is really tricky: even for ext4, it only works with data=ordered mode and delayed allocation enabled. Also, if thread A is not extending the file but rather overwriting it, ext4 is close to ext3. :)


That's all.

Monday, May 7, 2012

2012 LSF/MM Summit Summary -- Day 2

Flash Media

The second day of the LSF/MM summit started with a flash media session led by Steven Sprouse from SanDisk. He began with an introduction to lifetime terabyte writes (LTW), which is defined as:
         physical capacity * write endurance
LTW =   -------------------------------------
            write amplification
Physical capacity is increasing, but write endurance decreases as write cycles increase (every write cycle hurts the NAND, so it stores data for a shorter amount of time). Write amplification is affected by many factors, like block size, provisioning, trim, etc.
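As a hedged back-of-the-envelope example (the numbers below are made up, not from the talk), the formula works out like this:

#include <stdio.h>

int main(void)
{
        /* Made-up example numbers, not from the talk. */
        double physical_capacity_tb = 0.256;    /* 256GB of NAND       */
        double write_endurance      = 3000;     /* P/E cycles          */
        double write_amplification  = 2.0;      /* workload dependent  */

        /* LTW = physical capacity * write endurance / write amplification
         *     = 0.256 * 3000 / 2.0 = 384 lifetime terabytes written  */
        printf("LTW = %.0f TB\n",
               physical_capacity_tb * write_endurance / write_amplification);
        return 0;
}

Lowering write amplification (better trim behavior, more over-provisioning) directly raises the lifetime writes the same NAND can absorb.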

Sprouse mentioned that some SSD vendors are starting to use hybrid SLC/MLC, where SLC is used for frequent journal writes and MLC for data blocks. Therefore it has been proposed that SSDs be given information to differentiate metadata writes from data writes; with tagging information, flash can better decide where to store the data.

Another point made by Sprouse is the definition of "random write". Different hardware has different capabilities for handling random writes. For flash, anything smaller than an erase block is a random write. However, erase block size is changing: it was 64KB back in 2004, but now most are 1MB. So it is really necessary that flash vendors expose such information to OS developers.

To help flash make better use of NAND geometry, possible ways for flash vendors and OS developers to cooperate are:
1. The OS tells flash which data may be dropped at the same time, so that flash can put this data in the same erase block. One example given by Ted is rpm/deb package files: they are likely to vanish at the same time during an upgrade.
2. Flash vendors report some geometry information to upper layers, like block size, page size, stripe size, etc.
3. Provide some tagging mechanisms so that userspace/fs can tag different data types, e.g., metadata and data, hot data and cold data, etc.
4. Filesystems help with cleanup (trim).

It was suggested that flash vendors make a list of the information they are willing to provide, and OS developers can look at it and decide what would be useful. However, even if there are standard ways to query this information, vendors are not forced to fill it in correctly.

Flash Cache in the Operating System

The discussion mainly revolved around integrating bcache. Flashcache is implemented by Facebook and based on device-mapper, while bcache is implemented by Google and performs better than flashcache. However, bcache needs so much information that is dropped/hidden by DM that it bypasses DM completely. To integrate bcache with DM, there would have to be a large amount of changes to DM, and some changes to the block layer as well. The discussion didn't reach any real conclusion though.

High IOPS and SCSI/Block

Roland Dreier introduced the two current ways of writing block drivers: registering one's own make_request, or using the block layer's make_request(). The former is so low level that it bypasses many useful block layer functions, while the latter is slow because of heavy lock contention. Jens Axboe presented his multi-queue patches that implement per-CPU queues coupled with a lightweight elevator. He promised to get these patches out soon, and they should help solve the performance problem.

Another issue discussed was Dell's vision of accessing PCIe flash devices. Currently PCIe flash devices still provide block interfaces, but they are hoping to switch to memory interfaces to get better performance. However, it was pointed out that a memory interface doesn't allow software error recovery: any error reported by the device is a fatal failure, so devices have to do all error handling themselves and only report an error as the last resort.

LBA Hinting and New Storage Commands

Frederick Knight led the discussion of handling shingled drives. Shingled drives can greatly increase disk density but require the OS to write in bands. There are three options:
1. transparent to the OS
2. banding: let the host manage geometry and expose new SCSI commands for handling bands
3. transparent with hints: make it look like a normal disk, but develop new SCSI commands to hint both ways between device and host about what the data is and what the device characteristics are, to try to optimize data placement

The second option was dropped by attendees immediately. The first option looks much like the current SSD situation, and the third option is the same direction that flash vendors are pursuing.

Storage Manager

Lukas Czerner led the session to give an update on his command-line storage manager. It mainly aims to be a generic storage manager and reduce the complexity of managing different storage devices and file systems. However, as underlying storage devices/filesystems may provide different type-dependent options, the new storage manager reduces complexity only if users don't need those miscellaneous options.

WRITE_SAME and UNMAP, FSTRIM

The session started with some complaints about the current trim command. In the ATA TRIM command, there are only two bytes used for the trim range, meaning the range can be at most 64K sectors, which is 32MB for each TRIM operation. Another problem is that the current block layer only allows a contiguous trim range. As TRIM is non-queued, the overhead goes up a lot when there are many distinct ranges to trim. Christoph Hellwig had a PoC to demonstrate multi-range trim in XFS, and it showed only ~1% overhead compared with the no-trim case.

Besides trim, scatter/gather SCSI commands have the same multi-range problem. There are two options for implementing multi-range commands in the block layer: allowing a single BIO to carry multiple ranges, or using linked BIOs for the ranges. After discussion, the latter was adopted.


NFS server issues

Bruce Fields led the session on the current status of knfsd with regard to features like change attributes, delegations/oplocks, share locks, delete-file recovery/server-side sillyrename, and lock recovery.

During the discussion, people asked about pNFS, and Boaz made the point that a kernel pNFS server may not happen because there is no developer interest in making it happen.


2012 LSF/MM Summit Summary -- Day 1

Earlier last month (April 1 and 2), I was in San Francisco attending the 2012 Linux Storage and Filesystem Summit. It was a great experience for me because I am a great fan of many of these kernel maintainers who have an in-depth understanding of file systems and storage, and sitting under the same roof, discussing cutting-edge technology topics, is just what I've been dreaming of for many years.

OK, enough stupid wording. Here is my summary for this event.

The LSF/MM summit is a small, by-invitation summit which focuses on collaboration and implementation. I was very lucky to be invited because of the work I am doing at EMC, pushing the Lustre client into the mainline kernel. It is also worth mentioning that EMC was a silver sponsor of the event.

The LSF/MM summit is a two-day event consisting of three tracks (IO, FS and MM). The complete schedule can be found here. I stayed in the filesystem track the whole time, so my summary will mainly cover discussions in the filesystem track. Discussions in the IO and MM tracks can be found here and here.

Runtime filesystem consistency check

This session covered a FAST paper by Ashvin Goel and others from the University of Toronto. The main idea is to record some consistency invariants and check them between the filesystem and the block layer, so that errors can be found earlier, at transaction commit point. The consistency invariants are predefined; there are 33 of them for ext3 in Recon, the PoC runtime filesystem checker they built. A more detailed introduction to Recon can be found here.

Writeback

Wu Fengguang started the writeback discussion with his work on improving the writeback situation. He then summarized his work on IO-less throttling, where the main intention is to minimize seeks, reduce lock contention and cache bouncing, and lower latency, with impressive performance gains, minor regressions and lots of algorithmic complexity.

For direct reclaim, pageout work has been moved from direct reclaim to the flusher threads. The work also focuses dirty reclaim waits on the dirtier tasks, for the benefit of interactive performance. Dirty pages at the end of the LRU are still a problem because scanning for them wastes lots of CPU. He suggested adding a new LRU list just for tracking dirty pages, but it requires a new page flag.

Memory control groups have their own problem with the dirty limit, mainly because there is only a global dirty limit, and flusher fairness is beyond control. There are only coarse options available, such as limiting the number of operations that can be performed on a per-inode basis or limiting the amount of IO that can be submitted.

The discussion then moved to blkcg IO control for buffered writes. Current blkcg IO control is useless for buffered writes, because blkcg throttles at submit_bio(), where there is mostly no context about the writer. Fengguang made a suggestion and RFC patchsets to implement buffered-write IO control in balance_dirty_pages(). However, Tejun Heo argued that blkcg should do its work in the block layer instead of messing with the mm layer. There were also comments that the algorithm of balance_dirty_pages() is already very complex, and doing IO control there will make it even more difficult to understand. The disagreement has been there for a long time and couldn't be resolved soon, so people asked for the discussion to continue in the MM track, and it later moved on to the mailing list.

Writeback and Stable Pages

The same topic was discussed at last year's LSF, and the conclusion was to make the writer wait when it wants to write to a page that is under writeback. However, Ted Ts'o reported long ext4 latencies when Google started to use the code. In brief, waiting for page writeback (to get stable pages) can lead to large process latency, and it is not necessary on every system. Stable pages are only required on systems where things like checksums calculated on the page require that the page be unchanged when it actually gets written.

Sage Weil and Boaz Harrosh laid out three options to handle the situation:
1. re-issue pages that are changed during IO;
2. wait on writeback for every page (current implementation);
3. Do a COW on the page under writeback when it is written to.

The first option was dropped instantly because it confuses storage that needs stable pages and is pure overhead for storage that doesn't.

The third option was discussed, but the overhead of COW is unknown and there are corner cases that need to be addressed, like what to do if the file is truncated after the COW page is created. So the suggested first step is to introduce some APIs to let storage tell the filesystem whether it needs stable pages, and to let the filesystem tell storage whether stable pages are supported. Then, for cases where stable pages are unnecessary, like Google's use, the file system doesn't need to do anything to provide stable pages. As for stable page support, some reporting method should be added to the writeback code path to find out which workloads are being affected and what those effects are. Then someone can propose how to implement the COW solution and address all the corner cases.

Copy Offload

NetApp's Frederick Knight led the copy offload session. The idea of copy offload is to allow SCSI devices to copy ranges of blocks without involving the host operating system. Copy offload has been in the SCSI standard since SCSI-1 days. EXTENDED COPY (XCOPY) uses two descriptors for the source and destination and a range of blocks. It can be implemented in either a push model (the source sends the blocks to the target) or a pull model (the target pulls from the source).

Token-based copy is far more complex. Each token has a ROD (Representation of Data) that allows arrays to give an identifier for what may be a snapshot. A token represents a device and a range of sectors that the device guarantees to be stable. However, if the device doesn't support snapshotting and the region gets overwritten for any reason, the token can be declined by the storage. This means that storage users have no idea of the lifetime of a token, and every time a token goes invalid, they need to either renew the token or fall back to real data transfer.

Token-based copy is somewhat similar to NFS's server-side copy (NFSv4.2 draft). It was suggested that the token format be standardized to possibly allow copy offload from a SCSI disk to a CIFS/NFS volume.

Kernel AIO/DIO Interfaces

The first session in the afternoon was led by Dave Kleikamp, who is trying to modify the kernel AIO/DIO APIs to provide in-kernel interfaces. He changed iov_iter to make it handle both iovec (from user space) and bio_vec (from kernel space). He modified the loopback device to let it send AIO and thereby avoid caching in the underlying filesystem. People suggested that swap-over-NFS could be adapted to use the same API.

RAID engine Unification

Boaz implemented a generic RAID engine for the pNFS object layout driver. Since the code is simple and efficient, he wants to push its usage and unify the kernel RAID implementations. Besides Boaz's RAID implementation, there are currently two others: MD and btrfs. Boaz said that his implementation is more complete and supports RAID stacking without extra data copies.

However, it seems the benefit is not so obvious, and people are hesitant to adopt it because the current code works just fine. Chris Mason suggested that Boaz start with MD because MD is much simpler than btrfs. If that works well, he can continue to change btrfs to use the new RAID engine.

xfstests

After many years of advocacy, xfstests has somehow become the most used regression test suite, not just for XFS but also for ext4 and btrfs. There are ~280 test cases and around 100 of them are filesystem independent. However, one nightmare is that all test files are numbered instead of properly named, so anyone who wants to use it needs to read a test case to find out what it actually does. It also needs to be reorganized so that similar functions are grouped in a directory instead of lying flat like right now.


So much for the first day. Here are some pictures of attendees taken by Chuck Lever:

Tuesday, March 27, 2012

Objects Initialization in C99

At a get-together party with some friends from linuxfb, we chatted about many interesting/boring/useful/nonsense topics. One of these topics made me think deeper, and I wanted to write something down about it.

The background is that Li Kai (@leekayak) reported that he had encountered a problem where an allocated memory area was not initialized to zero. Someone told him that it was because the kernel returns uninitialized memory through brk(). Coly (@colyli, @淘泊松) corrected him: that is impossible, because ever since the early 0.9 days of the Linux kernel, memory returned to user space has always been initialized, since otherwise there would be security issues. The uninitialized memory was more likely returned by the libc memory allocator, which manages memory fragments on its own: memory freed by the application is not necessarily returned to the kernel, so the next time it is requested, it is not re-initialized.

The discussion ended there, but later I started to think about when a programmer should initialize variables. After digging for a while, I found the following:

Storage classes
Specifier    Lifetime                        Scope                      Default initializer
auto         Block (stack)                   Block                      Uninitialized
register     Block (stack or CPU register)   Block                      Uninitialized
static       Program                         Block or compilation unit  Zero
extern       Program                         Block or compilation unit  Zero
(none) [1]   Dynamic (heap)                  --                         Uninitialized

[1] Allocated and deallocated using the malloc() and free() library functions.
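A tiny example of my own that matches the table: objects with static storage duration start out zeroed, while automatic and heap objects hold indeterminate values until you initialize them yourself.

#include <stdio.h>
#include <stdlib.h>

static int file_scope;                  /* static storage: zero-initialized */

int main(void)
{
        static int func_static;         /* static storage: zero-initialized */
        int on_stack = 0;               /* automatic: indeterminate unless
                                           explicitly initialized like this */
        int *heap = malloc(sizeof(*heap));

        if (heap)
                *heap = 0;              /* heap memory: also indeterminate  */

        printf("%d %d %d %d\n", file_scope, func_static, on_stack,
               heap ? *heap : 0);       /* prints "0 0 0 0"                 */
        free(heap);
        return 0;
}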

And the C99 standard further defines:

Except where explicitly stated otherwise, for the purposes of this subclause unnamed members of objects of structure and union type do not participate in initialization. Unnamed members of structure objects have indeterminate value even after initialization.
If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then:
  • if it has pointer type, it is initialized to a null pointer;
  • if it has arithmetic type, it is initialized to (positive or unsigned) zero;
  • if it is an aggregate, every member is initialized (recursively) according to these rules;
  • if it is a union, the first named member is initialized (recursively) according to these rules.

One interesting thing about the extra effort the compiler makes for programmers is that it may create memory holes and leaks. Look at the example code:
struct foo {
       int a,b;
} f = {.a=1,};
It is usually used this way because C99 will ensure that f.b is initialized to 0. But when it comes to:
struct foo {
       short a;
       int b;
} f = {.a=1,};
It also initializes f.b to 0, but it generates a two-byte hole between foo.a and foo.b on 32-bit machines. That is usually OK, but if the code is in the kernel and f is about to be sent to user space, it leaves a security hole. Therefore, in such unaligned-member cases, one needs to use the memset() family to initialize the structure.
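A minimal sketch of that advice, using the same struct layout as above:

#include <string.h>

struct foo {
        short a;        /* 2 bytes of padding follow a on most 32-bit ABIs */
        int   b;
};

static void init_foo(struct foo *f)
{
        /* Zero the whole object, padding included, before filling in the
         * fields; a designated initializer alone leaves the padding bytes
         * indeterminate, which matters if the struct is later copied to
         * user space from the kernel. */
        memset(f, 0, sizeof(*f));
        f->a = 1;
}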

Tuesday, November 1, 2011

CLSF 2011 - Day 2

The second day of CLSF 2011 started with Wu Fengguang's IO-less throttling. It is a continuation of his work from last year. Fengguang has done great work improving Linux kernel readahead, and for the last two years he has been working on improving writeback.

There are many dirty page writeback paths in the Linux kernel: fsync(), sync(), periodic writeback, background writeback, balance_dirty_pages() and the page reclaim path. Among them, balance_dirty_pages() and the page reclaim path are the worst for performance because they tend to write back random single pages. The purpose of balance_dirty_pages() is to balance page dirtying and page writeback so that a task doesn't cause too much memory pressure. Currently, balance_dirty_pages() calls writeback_inodes_wb() to write back dirty pages. What Fengguang does is adjust balance_dirty_pages() to:
1. let the dirtier task sleep instead of initiating writeback in balance_dirty_pages(), so the flusher can start writeback when necessary;
2. make the page dirtying and writeback speeds match smoothly so that page reclaim is avoided as much as possible;
3. balance the dirtying speed of each memory dirtier so that a large writer doesn't eat all memory and cause smaller writers to stall for a long time.

There are a few parameters to tune, and the most important one is how long the dirtier task should sleep, which takes into account the current disk write bandwidth, system memory pressure and the task's dirtying speed.
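Very roughly, and only as a toy sketch of my own (the real kernel algorithm has many more correction factors and smoothing), the sleep time comes out of something like the following:

/* Toy sketch of the IO-less throttling idea, not the kernel code: a task
 * that dirtied `pages_dirtied` pages sleeps long enough that its dirtying
 * rate matches what it is allowed, derived from the measured disk write
 * bandwidth and how close the system is to the dirty limit. */
static long dirtier_pause_ms(unsigned long pages_dirtied,
                             unsigned long write_bw_pages_per_sec,
                             unsigned long nr_dirty,
                             unsigned long dirty_limit)
{
        /* The closer we are to the dirty limit, the lower the allowed
         * rate, so heavy dirtiers are slowed down harder. */
        double pos_ratio = 1.0 - (double)nr_dirty / dirty_limit;
        double task_ratelimit = write_bw_pages_per_sec * pos_ratio;

        if (task_ratelimit < 1.0)
                task_ratelimit = 1.0;
        return (long)(1000.0 * pages_dirtied / task_ratelimit);
}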

The whole patchset is referred to as IO-less throttling and is being merged in the 3.2 merge window, as decided at the Kernel Summit last week. It is really time-consuming, critical work that needs a brilliant and patient mind. Fengguang said he drew more than 10K performance graphs and read even more in the past year.

While the IO-less throttling patches help improve writeback performance, Coly and Ma Tao said they are more interested in a side effect of the patches: per-task buffered-write bandwidth control. The problem comes from the current implementation of the blkio cgroup, which accounts IO bandwidth to the task that submitted the bio. However, for buffered writes, the pages are most likely written back by the flusher thread, so the blkio controller does not work for buffered writes. Fengguang's IO-less throttling gives another way: control the write speed in balance_dirty_pages(). Fengguang said it should be possible and easy to do there.


The second topic was a memory management discussion led by Li Shaohua. We mainly discussed some issues found in Taobao's and Baidu's data centers. The first issue Coly presented was whether to turn swap off on all-in-memory service servers. With swap on, performance may suffer when pages are swapped out; with swap off, the OOM killer is easily triggered when applications use too much memory. The real problem is that the application doesn't know about system memory pressure (malloc doesn't fail at all) until OOM. A possible solution would be to send a fault signal to the application when the system reaches a high memory watermark. Xie Guangjun from Baidu commented that swap is on in their deployment. Only small databases are all-in-memory. For file data, they use auto tiering, and their search engine applications are not allowed to allocate memory dynamically...

Another issue at Taobao is SSD+HD auto tiering. The HD data cache hit ratio is very low. As a result, when there is a cache miss in the SSD, reading from the HD is very slow (on average 3 IOs per file: dentry, inode, data). Possible solutions:
1. Use cgroups to control memory for SSD and HD, to keep the HD buffer cache in memory.
2. Use DIO to read this small data, to avoid wasting page cache on HD data.
3. Create a new zone and use a bdi->mapping flag to specify which zone to allocate memory from for a device, for the same reason as 1.

It was also mentioned that memcg charges the buffer cache to its first user, which is unfair. A possible solution would be to create a task struct for each bdi and charge these pages to the bdi task.

In the afternoon, I first led a session on pNFS. The first issue is how much layout the client should ask for per IO. The current pNFS client only asks for a 4KB-long layout in writepages, even if the wbc asks for much more. Fengguang agreed that the client has the information about how much to write back and it is a pity to drop that information and let the server make a wild guess. Then I presented the idea of making pNFS Linux a full POSIX client. It is technically doable, but the main reason (for EMC) would be business need.


After that, Xie Guangjun led a session about Baidu's work in their data centers. The first change is to use memcg to guide the OOM killer. They make the MapReduce jobtracker send down a memory limit per task; the tasktracker then creates the new task with that memory limit (now using cgroups), and the OOM killer kills all processes in the same cgroup.

Another interesting thing is that Baidu is building a global resource manager for their data centers. The idea is similar to Hadoop's new resource manager, but this one will be general purpose. Guangjun mentioned that Google has a similar implementation in their data centers as well.

The most interesting thing from Baidu is their work on flash. In their current deployment, they build SSDs from raw NAND with MTD+UBI+LVM. It is joint work with Huawei and fully implements the functionality of the mysterious FTL. In the future, they will build the SSD such that it bypasses the whole VFS layer and uses an ioctl API to do read/write/trim. It seems that even with some of Nick Piggin's scalability patches merged, the Linux VFS is still a hotbed of lock contention for high-speed storage. In our previous training from Whamcloud, they came to the same conclusion and are planning to drop the VFS interface in the future.

After Baidu, the Intel performance team led a discussion about the impact of emerging new hardware on current system architecture, namely SSD and PCM. SSD is already changing system architecture at Baidu. PCM is likely to be even more revolutionary, possibly even removing the need for today's complex file systems. However, that is still far in the future; we need to focus on what we have at hand.

That's all for the two-day event. I didn't mean to write such a long story, but all of the above is worth documenting, so I can't drop any of it...

Finally, thanks to Baidu and Fujitsu for their generous sponsorship, to all the CLSF committee members for making this happen, and to EMC for sending me there. It was a really valuable experience and will benefit me for quite a long time.

CLSF 2011 - Day 1

A year has passed since CLSF 2010 in Shanghai. This year, CLSF was held in Nanjing and sponsored by Baidu and Fujitsu. Coly kindly built an official website for the event.

The workshop started with btrfs updates given by Fujitsu's Zefan Li. Fujitsu has been contributing a lot to btrfs development this year. The first important feature Fujitsu developed for btrfs is LZO transparent compression. Previously, btrfs only used zlib, but the infrastructure allows adding more compression algorithms. LZO consumes less CPU and is faster than zlib. The current compression unit is the extent (at most 160KB), and every read/write needs to read/write the whole extent, so the feature seems more suitable for the SRSW use case.

Another interesting feature added to btrfs is auto-defragment. Copy-on-write filesystems like btrfs/WAFL all suffer from file fragmentation, so btrfs tries to (sort of) solve it in a nice and easy way: when a page is written, btrfs reads in the extent around the current page and marks those pages dirty in memory, so later writeback can write contiguous blocks and on-disk extents tend to be less fragmented. However, a side effect is that it causes more IO than necessary and may hurt random write performance. While the idea of auto-defrag is cool, defragmentation consumes resources, so timing is very important. What the best time is to start auto-defrag and how far it should go (e.g., turning 5 fragments into 4 may not be worth doing at all) remain open questions.

Miao Xie also introduced the btrfs tree log and the sub-transaction feature. Before the tree log, btrfs maintained one big transaction (committed every 30 seconds by default), and if a user fsynced one file, the whole transaction had to be committed, badly hurting the performance of other writers. With the tree log, btrfs maintains a write-ahead on-disk log for each such fsync operation, and when the transaction commits, it discards these logs. As a result, fsyncing one file only results in writing that file's dirty data to the tree log, and other writers can continue writing to the same transaction, which improves performance for the many-writers-with-fsync use case.

One defect of the tree log is that, when writing new log entries, it also rewrites all the old ones (again and again). The solution is to assign a sub-transaction id to each log entry and use it to avoid rewriting old entries. This improves performance for the large-file, many-fsyncs use case.

One question we are all interested in is when btrfs will be ready for production. Currently the biggest concern is btrfsck, which has been delayed for years. Although btrfs has built-in RAID and scrub (which rebuilds broken RAID blocks automatically), people still consider fsck a must. Zefan explained that Chris Mason will demo btrfsck at LinuxCon later this month. The reason he (Chris) didn't release btrfsck for so long is that he was trying different designs/implementations and did not want to corrupt user data with a half-done btrfsck. The good news is that Chris is pushing btrfs into Oracle's next release, UBL.

The last but most important question for btrfs is performance. According to the Intel guys' tests, btrfs is slower than ext4 in many cases. Some argued that btrfs offers many fancy features ext4 doesn't, but these features are more attractive to desktop users than to server users, and server users are the real force pushing the Linux kernel forward these days.

The second topic was distributed/cluster storage and cloud, led by Novell's Zhang Jiaju. This was a general talk about open source cloud infrastructure. Distributed systems put a lot of pressure on the DLM and sometimes make the DLM itself too heavy. We also discussed why Lustre isn't widely used outside the HPC market. Jiaju thought the most important reason is that Lustre is not merged into the Linux kernel, so enterprise users do not trust it as much as in-kernel filesystems, even newer ones like Ceph.

Jiaju maintains corosync in SUSE Linux and made an implementation of Paxos. Paxos came into people's view because of Google's Chubby. He also introduced a little about corosync's Totem. Totem performs better than Paxos when cluster membership doesn't change. But we all agreed that quorum is not the best way to solve large-cluster synchronization problems, and that's why a common Chubby cluster only has a few machines.

That's all for the morning. Below: continuing the discussion during the lunch break.

Taobao held the first two sessions in the afternoon. Taobao uses ext4 heavily on many of their servers, and they are particularly interested in a few features. First is Ted's bigalloc, which allocates larger chunks of data and is useful for large sequential-write applications like HDFS. One issue with bigalloc is directories: with bigalloc, a minimum 1MB block for each directory is certainly a waste. Ma Tao has sent out preview patches; the main idea is to use the inode xattr space to store the inline data.

Metadata checksums, snapshots and compression are also on Taobao's radar. The metadata checksum is mainly for debugging purposes: when the file system gets corrupted, people can first check the metadata checksum; if it doesn't agree, it is very likely a disk problem instead of a file system bug. However, it doesn't help in no-journal mode. As for snapshots, Yang Yongqiang is actively working on them.

The third topic in the afternoon was block layer updates, led by Li Shaohua. The most important change in the block layer is the dropping of IO barriers in favor of FLUSH/FUA. IO barriers had been there for a very long time. They are a concept in the block layer, implemented via FLUSH/FUA, and serve two purposes: request ordering and forced flushing to the physical medium. Both the block layer and the disk cache may queue/reorder IO requests. IO barriers ensure that preceding requests are processed before the barrier and following requests after it. They also ensure that all data in the disk write cache is flushed before returning.

IO barriers are well known as performance killers. However, enforcing ordering with IO barriers at the block layer is more than what is needed: filesystems just want to ensure that some important requests get to disk, not EVERYTHING sent to the block layer prior to the barrier. Filesystems should concern themselves with ordering, since that's where the information is, and not dump that problem onto the block layer. A filesystem which wants operations to happen in a specific order simply needs to issue them in the proper order, waiting for completion when necessary. The block layer can then reorder requests at will.

The resolution is to export FLUSH/FUA and let filesystems decide. Therefore REQ_HARDBARRIER no longer exists, while REQ_FLUSH and REQ_FUA are added. A REQ_FLUSH request forces the write cache to be flushed before beginning the operation, and REQ_FUA means the data associated with the request itself must be committed to persistent media by the time the request completes.

So the write procedure has changed from:
submit_bio(WRITE_BARRIER)
other reqs -> (wait)preflush -> (wait)bio reqs-> (wait)postflush -> other reqs
to:
submit_bio(WRITE_FLUSH_FUA)
other reqs -> preflush -> other reqs -> (preflush finishes) bio reqs -> other reqs -> (bio reqs finish) postflush

In both cases, if the device supports FUA, the bio requests are issued with the REQ_FUA flag and the postflush is omitted. Apparently, the performance gain comes from the fact that one transaction no longer needs to wait for the IO requests of other transactions, at the cost of letting file systems maintain the dependencies between different IO requests.
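As a rough sketch of how a filesystem would use the new flags on a 3.x-era kernel (simplified, with error handling and the real journaling logic omitted; treat the details as approximate), a commit block can be written with a preflush plus FUA in a single request:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

static void commit_end_io(struct bio *bio, int error)
{
        bio_put(bio);
}

/* Rough sketch: write one commit block so that the device flushes its
 * write cache first (REQ_FLUSH) and puts this block on stable media
 * before completion (REQ_FUA). If the device has no FUA support, the
 * block layer emulates it with a postflush. */
static void write_commit_block(struct block_device *bdev,
                               struct page *page, sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_NOFS, 1);

        bio->bi_bdev   = bdev;
        bio->bi_sector = sector;
        bio->bi_end_io = commit_end_io;
        bio_add_page(bio, page, PAGE_SIZE, 0);

        submit_bio(WRITE_FLUSH_FUA, bio);
}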

The last session was a "from industry" session led by Huawei's chief scientist Hu Ziang. Ziang asked some questions on behalf of their engineers and then introduced Huawei's Linux strategy. Huawei used to be a very closed company, but now they are seeking to open up and embrace Linux and open source. They will build a very large Linux developer team in the near future. Good news for the China Linux community.

That's all for the first day of excitement.

Friday, September 2, 2011

x86 Calling Convention

Wikipedia has a definition: a calling convention is a scheme for how functions receive parameters from their caller and how they return results.

Basically, it is a compiler ABI and varies on different platforms (like Windows and Linux). This is interesting and useful for debugging (at least for understanding how debuggers work...).

For example, a simple piece of code:
void f(int arg1, int arg2, int arg3, int arg4, float arg5, int arg6, float arg7,
       float arg8, int arg9, int arg10, int arg11, int arg12)
{
        printf("%d %d %d %d %f %d %f %f %d %d %d %d\n",
                arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9,
                arg10, arg11, arg12);
}

void main()
{
        f(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
}


On Linux i386, the above compiles to the following:
[SID@test]$cc a.c -o a
a.c: In function ‘f’:
a.c:4: warning: incompatible implicit declaration of built-in function ‘printf’
[SID@test]$objdump -d a > a.s

In a.s, we can see that main calls f by passing every argument on the stack.
0804842e <main>:
 804842e:  55                       push   %ebp
 804842f:  89 e5                    mov    %esp,%ebp
 8048431:  83 e4 f0                 and    $0xfffffff0,%esp
 8048434:  83 ec 30                 sub    $0x30,%esp
 8048437:  c7 44 24 2c 0c 00 00     movl   $0xc,0x2c(%esp)
 804843e:  00
 804843f:  c7 44 24 28 0b 00 00     movl   $0xb,0x28(%esp)
 8048446:  00
 8048447:  c7 44 24 24 0a 00 00     movl   $0xa,0x24(%esp)
 804844e:  00
 804844f:  c7 44 24 20 09 00 00     movl   $0x9,0x20(%esp)
 8048456:  00
 8048457:  b8 00 00 00 41           mov    $0x41000000,%eax
 804845c:  89 44 24 1c              mov    %eax,0x1c(%esp)
 8048460:  b8 00 00 e0 40           mov    $0x40e00000,%eax
 8048465:  89 44 24 18              mov    %eax,0x18(%esp)
 8048469:  c7 44 24 14 06 00 00     movl   $0x6,0x14(%esp)
 8048470:  00
 8048471:  b8 00 00 a0 40           mov    $0x40a00000,%eax
 8048476:  89 44 24 10              mov    %eax,0x10(%esp)
 804847a:  c7 44 24 0c 04 00 00     movl   $0x4,0xc(%esp)
 8048481:  00
 8048482:  c7 44 24 08 03 00 00     movl   $0x3,0x8(%esp)
 8048489:  00
 804848a:  c7 44 24 04 02 00 00     movl   $0x2,0x4(%esp)
 8048491:  00
 8048492:  c7 04 24 01 00 00 00     movl   $0x1,(%esp)
 8048499:  e8 26 ff ff ff           call   80483c4 <f>
 804849e:  c9                       leave
 804849f:  c3                       ret

Then on x86_64 Linux, the code compiles into the following, where parameters are passed to f in three ways: general purpose registers (di, si, dx, cx, r8d, r9d), xmm registers (xmm0~xmm2), and the stack.
000000000040054b <main>:
 40054b:  55                       push   %rbp
 40054c:  48 89 e5                 mov    %rsp,%rbp
 40054f:  48 83 ec 20              sub    $0x20,%rsp
 400553:  c7 44 24 10 0c 00 00     movl   $0xc,0x10(%rsp)
 40055a:  00
 40055b:  c7 44 24 08 0b 00 00     movl   $0xb,0x8(%rsp)
 400562:  00
 400563:  c7 04 24 0a 00 00 00     movl   $0xa,(%rsp)
 40056a:  41 b9 09 00 00 00        mov    $0x9,%r9d
 400570:  f3 0f 10 15 60 01 00     movss  0x160(%rip),%xmm2    # 4006d8 <__dso_handle+0x30>
 400577:  00
 400578:  f3 0f 10 0d 5c 01 00     movss  0x15c(%rip),%xmm1    # 4006dc <__dso_handle+0x34>
 40057f:  00
 400580:  41 b8 06 00 00 00        mov    $0x6,%r8d
 400586:  f3 0f 10 05 52 01 00     movss  0x152(%rip),%xmm0    # 4006e0 <__dso_handle+0x38>
 40058d:  00
 40058e:  b9 04 00 00 00           mov    $0x4,%ecx
 400593:  ba 03 00 00 00           mov    $0x3,%edx
 400598:  be 02 00 00 00           mov    $0x2,%esi
 40059d:  bf 01 00 00 00           mov    $0x1,%edi
 4005a2:  e8 1d ff ff ff           callq  4004c4 <f>
 4005a7:  c9                       leaveq
 4005a8:  c3                       retq
 4005a9:  90                       nop
 4005aa:  90                       nop
 4005ab:  90                       nop
 4005ac:  90                       nop
 4005ad:  90                       nop
 4005ae:  90                       nop
 4005af:  90                       nop

So why the difference? Basically this is part of the System V AMD64 ABI convention, which GCC and ICC (the Intel compiler) implement on Linux, BSD and Mac, and which defines that rdi, rsi, rdx, rcx, r8 and r9 can be used to pass integer parameters and xmm0-7 can be used to pass floating point parameters.

This leads to another question: why not other registers? On x86_64, there are 16 general purpose registers that can hold integers (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8~r15), and 16 xmm registers that can hold floating point values (xmm0~xmm15). They are divided by the compiler ABI into volatile and non-volatile registers. Volatile registers are scratch registers presumed by the caller to be destroyed across a call. Non-volatile registers are required to retain their values across a function call and must be saved by the callee if used. So volatile registers are naturally suitable for function arguments, while using non-volatile registers carries overhead (they must be saved).

The calling convention ABI is basically about which registers are volatile/non-volatile, which are reserved for special purposes (parameter passing, frame pointer, stack pointer, etc.), what the order of arguments on the stack is, who (caller or callee) is responsible for cleaning up the stack, as well as stack layout/alignment.

Microsoft x64 calling convention (64-bit)
  OS, compiler:              Windows (Microsoft compiler, Intel compiler)
  Parameters in registers:   rcx/xmm0, rdx/xmm1, r8/xmm2, r9/xmm3
  Parameter order on stack:  RTL (C)
  Stack cleanup by:          caller
  Notes: Stack aligned on 16 bytes. 32 bytes of shadow space on the stack.
         The specified 8 registers can only be used for parameters 1, 2, 3 and 4.

System V AMD64 ABI convention (64-bit)
  OS, compiler:              Linux, BSD, Mac (GCC, Intel compiler)
  Parameters in registers:   rdi, rsi, rdx, rcx, r8, r9, xmm0-7
  Parameter order on stack:  RTL (C)
  Stack cleanup by:          caller
  Notes: Stack aligned on 16 bytes. Red zone below the stack.

The above table only covers user space applications and kernel space functions. As always, there is an exception, and here the exception is system calls. System calls trap the user space context into kernel space and have special requirements for parameter passing:
1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.

2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

3. The number of the syscall has to be passed in register %rax.

4. System-calls are limited to six arguments, no argument is passed directly on the stack.

5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.

6. Only values of class INTEGER or class MEMORY are passed to the kernel.
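To make the kernel-side convention concrete, here is a small example of my own (not taken from the ABI document) that invokes write(2) directly with the syscall instruction on x86_64 Linux, following the rules above: syscall number in %rax, arguments in %rdi/%rsi/%rdx, %rcx and %r11 clobbered, and the result (or -errno) back in %rax.

#include <unistd.h>

/* Raw write(2) via the syscall instruction, x86_64 Linux only. */
static long raw_write(int fd, const void *buf, size_t count)
{
        long ret;

        asm volatile ("syscall"
                      : "=a" (ret)               /* result (or -errno) in %rax */
                      : "0" (1),                 /* __NR_write == 1 on x86_64  */
                        "D" ((long)fd),          /* 1st arg in %rdi            */
                        "S" (buf),               /* 2nd arg in %rsi            */
                        "d" (count)              /* 3rd arg in %rdx            */
                      : "rcx", "r11", "memory"); /* clobbered by syscall       */
        return ret;
}

int main(void)
{
        raw_write(1, "hello via raw syscall\n", 22);
        return 0;
}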