Tuesday, November 1, 2011

CLSF 2011 - Day 1

One year passed since last CLSF2010 in Shanghai. This year, CLSF is held in Nanjing and sponsored by Baidu and Fujitsu. And Coly has kindly built an official website for the event.

The workshop started with Btrfs updates given by Fujitsu's Zefan Li. Fujitsu has been contributing a lot to btrfs development this year. The first important feature Fujitsu developed for btrfs is LZO transparent compression. Before, btrfs only uses Zlib but the infrastructure allows adding more compression algorithms. LZO consumes less CPU and is faster than Zlib. The current compression unit is extent (at most 160KB) and every RW needs to read/write whole extent. So the feature seems to be more suitable for SRSW use case.

Another interesting feature added to btrfs is auto-defragment. Copy-on-write filesystems like btrfs/WAFL all suffer from file fragmentation. So btrfs tries to (sort of) solve it in a nice and easy way. When a page is written, btrfs tries to read in the extent around current page and mark pages dirty in memory. So later writeback can write continuous blocks and ondisk extents tend to be less fragmented. However, a side effect is that it causes more IO than necessary and may hurt random write performance. While the idea of auto-defrag is cool, defragmentation consumes resources and thus timing is very important. What is the best time for starting auto-defrag and how far it should go (e.g., 5 frags into 4 may not be worth doing at all) remains still a open question.

Miao Xie also introduced btrfs tree log and sub-transaction feature. Before tree log, btrfs maintains a big transaction (default 30sec) and if user fsync one file, the whole transaction has to be committed, hurting very much the performance of other writers. With tree log, btrfs maintains a write-ahead on-disk log for each of such fsync operations. And when transaction commits, it discards all these logs. As a result, fsync one file will only result in writing back the file's dirty data in tree log and other writers can continue writing to the same transaction, which improves performance in many writers with fsync use case.

One defect of tree log is that, when writing new logs, it also writes all old logs (again and again). The solution is to assign a sub-transaction id to each log and use it to avoid writing old logs. And it improves performance in large file many fsync use case.

One question that we are all interested is when btrfs will be suitable for production. Currently the most concern is about btrfsck, which has been delayed for years. Although btrfs has built-in RAID and scrub (which rebuilds broken RAID block automatically), people still think fsck a must. Zefan explained that Chris Mason will demo btrfsck at LinuxCon later this month. The reason he (Chris) didn't release btrfsck for so long is that he was trying different designs/implementations and do not want to corrupt user data with half-done btrfsck. Good news is that Chris is pushing btrfs into Oracle's next release UBL.

The last but most important question for btrfs is performance. According to Intel guys' tests, btrfs is slower than ext4 in many cases. Some argued that btrfs presents many fancy features ext4 don't. But these features are more attractive to desktop users than to server users, and server users are the real force to push Linux kernel going forward these days.

The second topic is Distributed/cluster storage and cloud, led by Novell's Zhang Jiaju. This is a general talk about open source cloud infrastructure. Distributed systems put a lot of pressure on DLM and sometimes make DLM itself too heavy. We also discussed why Lustre isn't wildly used outside of HPC market. Jiaju thought the most important reason is Lustre is not merged in Linux kernel thus enterprise users do not trust them as much as Kernel filesystems even like ceph.

Jiaju maintains corosync in SUSE Linux and made an implementation of Paxos. Paxos came into people's eye because of Google Chubby. He also introduced a little about corosync's Totem. Totem performs better than Paxos when cluster membership doesn't change. But we all agree that quorum is not the best way to solve large cluster synchronization problem and that's why common Chubby cluster only have several machines.

That's all for the morning. Bellow is we are continuing discussion during lunch break.

Taobao holds first two sessions in the afternoon.Taobao is using ext4 heavily in many of their servers and they are particularly interested in some features. First is Ted's bigalloc, which tries to allocate larger chunk of data and is useful for large sequential write applications like HDFS. One issue with bigalloc is directory. With bigalloc, minimal 1MB block for each directory is certainly a waste. Ma Tao has sent out preview patches , the main idea is to make use of the inode xattr space to store the inline data.

Metadata checksum, snapshot and compression are also in Taobao's eyes. Meta checksum is mainly for debug purpose. When file system corrupts, people can first check meta checksum. If checksum doesn't agree, it is very likely a disk problem instead of file system bug. However, it doesn't help no journal mode. For snapshot, Yang Yongqiang is actively working on it.

The third topic in the afternoon is block layer updates, led by Li Shaohua. The most important change in block layer is the dropping of IO barriers, in favor of FLUSH/FUA. IO barrier has been there for a very long time. They are a concept in the block layer, implemented via FLUSH/FUA and serve two purposes: request ordering and forced flushing to physcial medium. Both block layer and disk cache may queue/reorder io requests. IO barriers ensure preceding requests are processed before the barrier and following requests after. Also they ensure that all data in disk write cache are flushed before returning.

IO barriers are well-known as performance killers. However, enforcing ordering with IO barriers at block layer is more than enough. Filesystems just want to ensure some important requests to get to disk, instead of EVERYTHING sent to block layer privier to IO barriers. Filesystems should concern themselves with ordering, since that's where the information is, and not dump that problem into the block layer. A filesystem which wants operations to happen in a specific order will simply need to issue them in the proper order, waiting for completion when necessary. The block layer can then reorder requests at will.

The resolution is to export FLUSH/FUA and let filesystems decide. Therefore REQ_HARDBARRIER no longer exists, while REQ_FLUSH and REQ_FUA are added. REQ_FLUSH request forces the write cache to be flushed before beginning the operation, and REQ_FUA means the data associated with the request itself must be committed to persistent media by the time the request completes.

So the procedure of write has changed from:
submit_bio(WRITE_BARRIER)
other reqs -> (wait)preflush -> (wait)bio reqs-> (wait)postflush -> other reqs
to:
submit_bio(WRITE_FLUSH_FUA)
other reqs -> preflush -> other reqs -> (preflush finishes) bio reqs -> other reqs -> (bio reqs finish) postflush

For both cases, if device supports FUA, the bio reqs are issued with REQ_FUA flag and the postflush is omitted. Apparently, the performance gain comes from the fact that one transaction on longer need to wait for IO requests of other transactions, at the cost of letting file systems maintain the dependency of different io requests.

The last session is a from industry session led by Huawei's chief scientist Hu Ziang. Ziang asked some questions for their engineers and then introduced Huawei's Linux strategy. Huawei used to be a very closed company but now they are seeking to get open and embrace Linux and open source. They will build a very large Linux developer team in near future. Good news for China Linux community.

That's all for the first day of excitement.

No comments:

Post a Comment