Friday, October 26, 2012

ext3/4 and fsync

Our team met a problem of the "infamous" ext3 slowless on fsync() calls. We solved it rather easily by switching to ext4. But one might ask, WTF is going on with ext3 fsync() performance? And why simply switching to ext4 solves it at all?

I googled some time but found no obvious explanations nor documents about why ext4 outperforms ext3 in fsync-workloads. So I had to do it RTFC way.

The phenomenon is obvious. Given a workload that thread A is sequentially extending a file with large chunk of data, and another thread B is doing periodical fsync() calls, the fsync() call is much slower on ext3 than on ext4. ext3 fsync() tries to write back pages that is dirtied by thread A, and while it writes back, more pages are dirtied by thread A and need to be written back. This creates quite large latency for fsync() on ext3.

OTOH, ext4 (with default options) doesn't have such problem.

But WHY?

 We know that ext3 and ext4 share similar design principles but they are in fact two different file systems. The secret lies in their backing journal implementation and one key feature that comes with ext4, delayed allocation.

As we know, ext4 is built upon jbd2 which is a successor of ext3's building block, jbd. While jbd2 inherits a lot of designs from jbd, one thing that is changed is how data=ordered mode implemented.
data=ordered is an option about how ext3/4 manages data/metadata ordering. From Documentation/filesystems/ext4.txt:

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.  In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.

ext3/4 both implements this data=ordered mode and set it as default. However, there is a small difference between how it is implemented in jbd and jbd2.

In jbd/jbd2, when a transaction is being committed, they call journal_submit_data_buffers() to writeback dirty data associated with current transaction.

In jbd's journal_submit_data_buffers(), it loops agains commit_transaction->t_sync_datalist and relies on jbd buffer head to ensure that all data has been written back.

In jbd2's journal_submit_data_buffers(), it implements data=ordered via inodes instead of jbd buffer heads. All inodes touched by the committing transaction is tracked in commit_transaction->t_inode_list and jbd2 just loops to write back each of them. There is also a small trick, jbd2 uses writepage instead of writepages to write back each inode mapping. The reason behind this is, that ext4's da_writepages will do block allocation, while writepage only writes back allocated blocks.

So, see the tricks? With ext4 data=ordered, fsync() in thread B will only write back data blocks that are already allocated. And with ext4's delayed allocation, most pages dirtied by thread A are not allocated with blocks so won't be written back by jbd2. While for pages dirtied by thread B, they are written back via da_writepages in ext4_sync_file() that calls filemap_write_and_wait_range().

So this is really tricky, because even for ext4, it only works for data=ordered mode and delayed allocation enabled. Also if thread A is not extending the file but rather over writing, ext4 is close to ext3. :)


That's all.