Friday, October 26, 2012

ext3/4 and fsync

Our team ran into the "infamous" ext3 slowness on fsync() calls. We solved it rather easily by switching to ext4. But one might ask, WTF is going on with ext3 fsync() performance? And why does simply switching to ext4 solve it at all?

I googled for some time but found no obvious explanation or document about why ext4 outperforms ext3 in fsync-heavy workloads. So I had to do it the RTFC way.

The phenomenon is easy to see. Given a workload in which thread A is sequentially extending a file with large chunks of data and another thread B is periodically calling fsync(), the fsync() call is much slower on ext3 than on ext4. ext3's fsync() tries to write back the pages dirtied by thread A, and while it writes them back, thread A dirties more pages that also need to be written back. This creates quite a large latency for fsync() on ext3.

OTOH, ext4 (with default options) doesn't have this problem.
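
For anyone who wants to reproduce the workload, here is a minimal sketch of it. It is not the program we used: the function name run_workload, the file names a.dat and b.dat, the max_bytes cap and all default values are made up for illustration, and I am assuming thread B writes and fsyncs its own small file while thread A extends a different one.

```python
import os
import threading
import time


def run_workload(directory, chunk_size=1 << 20, duration=2.0,
                 fsync_interval=0.1, max_bytes=256 << 20):
    """Thread A appends to a.dat; thread B writes and fsyncs b.dat.

    Returns the fsync() latencies (in seconds) observed by thread B.
    All names and defaults here are illustrative assumptions.
    """
    stop = threading.Event()
    latencies = []
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    fd_a = os.open(os.path.join(directory, "a.dat"), flags, 0o644)
    fd_b = os.open(os.path.join(directory, "b.dat"), flags, 0o644)
    chunk = b"x" * chunk_size

    def thread_a():
        # Sequentially extend the file with large chunks, up to max_bytes.
        written = 0
        while not stop.is_set() and written + chunk_size <= max_bytes:
            written += os.write(fd_a, chunk)

    def thread_b():
        # Dirty a little data of our own, then time the fsync() call.
        while not stop.is_set():
            os.write(fd_b, b"y" * 4096)
            start = time.monotonic()
            os.fsync(fd_b)
            latencies.append(time.monotonic() - start)
            stop.wait(fsync_interval)

    threads = [threading.Thread(target=thread_a),
               threading.Thread(target=thread_b)]
    try:
        for t in threads:
            t.start()
        time.sleep(duration)
    finally:
        stop.set()
        for t in threads:
            t.join()
        os.close(fd_a)
        os.close(fd_b)
    return latencies
```

Run it once against a directory on an ext3 mount and once on an ext4 mount, and compare max(latencies) between the two runs.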

But WHY?

We know that ext3 and ext4 share similar design principles, but they are in fact two different file systems. The secret lies in their backing journal implementations and one key feature that comes with ext4: delayed allocation.

As we know, ext4 is built upon jbd2, which is the successor of ext3's building block, jbd. While jbd2 inherits a lot of its design from jbd, one thing that changed is how data=ordered mode is implemented.
data=ordered is an option that controls how ext3/4 orders data writes relative to metadata writes. From Documentation/filesystems/ext4.txt:

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.  In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.

ext3 and ext4 both implement this data=ordered mode and set it as the default. However, there is a small difference in how it is implemented in jbd and jbd2.

In jbd/jbd2, when a transaction is being committed, the commit code calls journal_submit_data_buffers() to write back the dirty data associated with the current transaction.

jbd's journal_submit_data_buffers() loops over commit_transaction->t_sync_datalist and relies on jbd buffer heads to ensure that all data has been written back.

jbd2's journal_submit_data_buffers() implements data=ordered via inodes instead of jbd buffer heads. All inodes touched by the committing transaction are tracked in commit_transaction->t_inode_list, and jbd2 just loops over them to write back each one. There is also a small trick: jbd2 uses writepage instead of writepages to write back each inode's mapping. The reason is that ext4's da_writepages will do block allocation, while writepage only writes back pages whose blocks are already allocated.

So, see the trick? With ext4 data=ordered, fsync() in thread B will only write back data blocks that are already allocated. And with ext4's delayed allocation, most pages dirtied by thread A have no blocks allocated yet, so jbd2 won't write them back. Pages dirtied by thread B, on the other hand, are written back via da_writepages, because ext4_sync_file() calls filemap_write_and_wait_range().
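
To make the difference concrete, here is a toy model of it. This is not kernel code and models no real data structure: the function pages_written_by_fsync and its parameters are invented for illustration. It only counts how many dirty pages thread B's fsync() ends up waiting for, under the behaviour described above.

```python
def pages_written_by_fsync(delalloc, a_dirty_pages, b_dirty_pages,
                           a_pages_have_blocks=False):
    """Toy count of the pages that thread B's fsync() has to wait for.

    delalloc=False models ext3/jbd: every dirty data page attached to
    the committing transaction is flushed.
    delalloc=True models ext4/jbd2 as described in the post: the commit
    only flushes pages that already have blocks allocated.
    """
    # B's own pages are always written back by the fsync() path itself.
    flushed = b_dirty_pages
    if not delalloc or a_pages_have_blocks:
        # A's pages belong to the transaction and are flushed at commit.
        flushed += a_dirty_pages
    return flushed


# Thread A has dirtied 1000 pages, thread B has dirtied 4.
ext3_pages = pages_written_by_fsync(False, 1000, 4)
ext4_pages = pages_written_by_fsync(True, 1000, 4)
```

In this model ext3_pages is 1004 and ext4_pages is 4. Passing a_pages_have_blocks=True with delalloc=True gives 1004 again, because pages that already have blocks are flushed at commit time.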

So this is really tricky, because even on ext4 it only works when data=ordered mode is in use and delayed allocation is enabled. Also, if thread A is overwriting the file rather than extending it, ext4 behaves much like ext3. :)
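
If you want to check whether a given mount meets both conditions, something like the sketch below can help. The helper name fsync_trick_applies is made up, and it rests on two assumptions: that a missing data= option means the default (ordered), and that delayed allocation is on unless the nodelalloc option is present. It only inspects an option string that you pass in, so treat the answer as a hint rather than proof.

```python
def fsync_trick_applies(mount_options):
    """Report whether an ext4 mount option string looks like it has
    data=ordered and delayed allocation both in effect.

    Assumptions: an absent data= option means "ordered", and delayed
    allocation is enabled unless "nodelalloc" is listed.
    """
    opts = mount_options.split(",")
    data_mode = "ordered"
    for opt in opts:
        if opt.startswith("data="):
            data_mode = opt.split("=", 1)[1]
    delalloc = "nodelalloc" not in opts
    return data_mode == "ordered" and delalloc
```

For example, fsync_trick_applies("rw,relatime,data=ordered") returns True, while fsync_trick_applies("rw,nodelalloc") and fsync_trick_applies("rw,data=writeback") both return False.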

That's all.


  1. Yes, we have also encountered this bug, and we are using ext3 on the Red Hat stock kernel 2.6.32-220. It turns out that the barrier option, which is implemented more fully in 2.6.32 than in 2.6.18, is the performance killer. See my blog:

    1. Not just barrier: even if you mount with nobarrier, data=ordered mode will kill your fsync performance, and that is rooted directly in the jbd implementation.

    2. I see. Previously we used ext3 with the barrier option, and we rely on fsync heavily. After remounting the file system with nobarrier, the performance went back to normal. I talked about this with Tao Ma at CLK 2012. We agreed the reason was that the 2.6.18 kernel did not fully implement the barrier option, i.e. the FUA command of SCSI might just be ignored by some devices, so the old kernel got better performance at a higher risk of data corruption. This bug was fixed somewhere between 2.6.18 and 2.6.32.

    3. Not sure if I got you. I know that DM in old kernels had problems with its barrier implementation. But I'm not sure whether that still holds true if you were not using LVM. I remember seeing some files in the Red Hat 2.6.18 kernel dealing with barriers at the block layer. Anyway, it's been a long time since I last looked at the 2.6.18 kernel code. So I may be completely wrong...

  2. Hmm. I'm not sure that's the root cause of the problem, the problem being "fsync on B is slow because we also need to unnecessarily sync what A is doing". What you say is true, but from what I figure, ext4 mostly ends up working *around* the problem because it uses delayed alloc.

    JBD in ordered mode would require any data associated with metadata to be flushed before the metadata. With delayed alloc, the writer's data has no metadata changes associated yet because the actual storage allocation hasn't happened yet. fsync() on one file will require other files' metadata (and any associated data) to be flushed... just that, now, there is no "other files' metadata"


    1. Isn't it exactly what I wrote in the blog? :-)