Tech rider: May 2012

Flash Media

The second day of LSF/MM summit started with flash media led by Steven Sprouse from SanDisk. He started with an introduction of lifetime terabyte writes, which is defined as:

         physical capacity * write endurance
LTW =   -------------------------------------
            write amplification

Physical capacity is increasing but write endurance is decreasing as write cycle increases (every write cycle hurts NAND so that it stores data for shorter amount of time). Write amplification is affected by many factors, like block size, provisioning, trim etc.

Sprouse mentioned that some SSD vendors are starting to use hybrid SLC/MLC where SLC is used for frequent journal write and MLC for data blocks. Therefore the request for SSD to ask for information to differentiate metadata write and data write has been proposed, with tagging information, flash can better decide where to storage the data.

Another point made by Sprouse is the definition of "random write". Different hardware has different capability of handling random write. For flash, anything smaller than erase block is random write. However, erase block size is changing. It was 64KB back in 2004 but now most are 1MB. So it is really necessary that flash vendors expose such information to OS developers.

To help flash to get better NAND geometry, possible ways that flash vendor and OS developers can cooperate are:
1. OS tells flash what data may be dropped at the same time, so that flash can put these data in the same erase block. One example given by Ted is rpm/deb package files. They are likely to vanish at the same time during upgrade.
2. Flash vendor reports some geometry information to upper layers, like block size, page size, stripe size, etc.
3. Provide some tagging mechanisms so that userspace/fs can tag different data types, e.g., metadata and data, hot data and code data, etc.
4. filesystems help on cleaning up (trim).

It is suggested that flash vendors make a list of information that they are willing to provide and OS developers can look at it and decide what can be useful. However, even if there are standard ways to query these information, vendors are not forced to fill them in correctly.

Flash Cache in the Operating System

The discussion mainly happened around integrating bcache. Flashcache is implemented by Facebook and based on device-mapper, which bcache is implemented by Google and performs better than flashcache. However, bcache need so many information dropped/hidden by DM that it bypasses DM completely. To integrate bcache with DM, there would be a large amount of changes to DM, and some changes to block layer as well. The discussion didn't result in any real conclusion though.

High IOPS and SCSI/Block

Roland Dreier introduces two current modes of writing block drivers: register its own make_request, and using bolck layer make_request(). The former is too low level that it bypasses many useful block layer functions, while the latter is slow because of heavy lock contention. Jen Axboe proposed his multi-queue patches that implements per-CPU queue coupled with lightweight elevator. He promised to get these patches soon and it will help solve the performance problem.

Another issue discussed is Dell's vision of accessing PCIe flash devices. Currently PCIe flash devices still provide block interfaces, but they are hoping to change to use memory interfaces to get better performance. However, it is pointed out that memory interface doesn't do software error recovery. Any error reported by device is fatal failure so devices have to all error handling and only report error as the last resort.

LBA Hinting and New Storage Commands

Frederick Knight led the discussion of handling shingle drives. Shingle drives can largely increase disk density but require OS to write in band. There are three options:
1. transparent to OS
2. banding: let the host manage geometry and expose new SCSI commands for handling bands
3. transparent with Hints: make it look like a normal disks but develop new SCSI commands to hint both ways between device and host what the data is and device characteristics are to try to optimize data placement

The second option is dropped by attendees immediately. And the first option looks more like current SSD's situation and the third option is the same situation that flash vendors are pursuing.

Storage Manager

Lukas Czerner led the session to give an update of his command line storage manager. It mainly aims to be a generic storage manager and reduce the complexity of manage different storage devices and file systems. However, as underlying storage device/filesystems may provide different type-dependent options, the new storage manager reduce complexity iff users don't need those misc options.

WRITE_SAME and UNMAP, FSTRIM

The session started with some complaint about current trim command. In the ATA TRIM command, there are only two bytes used for trim range, meaning the range can be at most 64K sectors, which is 32MB for each TRIM operation. Another problem is current block layer only allows for continuous trim range. As TRIM is non-queue, the overhead goes up a lot when there are a lot of distinct ranges to trim. Christoph Hellwig had some PoC to demonstrate multi-range trim in XFS and it showed only ~1% overhead compared with no trim case.

Besides trim, scatter/gatter SCSI commands have the same multi-range problem. There are two options to implement multi-range command in block layer, allowing single BIO to carry multiple ranges, and use a linked BIOs for the range. After discussion, the latter is adopted.

NFS server issues

Bruce Fields led the session with current status of knfsd with regard to features like change attributes, delegation/oplocks, share locks, delete-file recovery/server-side sillyrename and lock recovery.

During the discussion, people asked about pNFS and Boaz made a point that kernel pNFS server may not be happening because there is no developer interest in making it happen.

Earlier last month (April 1 and 2), I was in San Francisco to attend 2012 Linux Storage and Filesystem Summit. It was a great experience for me because I am a great fan of many of these kernel maintainers who have in-depth understanding of file systems and storage, and sitting under the same roof, discussing cutting edge technology topics, is just what I've been dreaming of for many years.

OK, enough stupid wording. Here is my summary for this event.

The LSF/MM summit is a small by-invitation summit which focuses on collaboration and implementation. I was very lucky to be invited because of the work that I am doing in EMC, pushing Lustre client into mainline kernel. Also it is worth mentioning that EMC is silver sponsor for the event.

The LSF/MM summit is a two-day event, and consists of three tracks (IO, FS and MM). The complete schedule can be found here. I stayed in the filesystem track all the time so my summary will be mainly about discussions in the filesystem track. For discussions on IO and MM tracks, they can be found here and here.

Runtime filesystem consistency check

It is a FAST paper written by Ashvin Goel and others from the University of Toronto. The main idea is to record some consistency invariants and check them between the filesystem and the block layer, so that errors can be found earlier at transaction commit point. The consistency invariants are predefined and there are 33 for ext3 in Recon, the PoC of runtime filesystem checker they built. A more detailed introduction of Recon can be found here.

Writeback

Wu Fengguang started writeback discussion with his work on improving the writeback situation. Then he concluded his work on IO-less throttling, where the main intentions is to minimize seeks, get less lock contention and cache bouncing, lower latency, with impressive performance gains, minor regressions and lots of algorithm complexity.

For direct reclaim, pageout work has been moved from direct reclaim to flusher threads. It also focuses dirty reclaim waits on dirtier tasks for the benefit of interactive performance. Dirty pages in the end of LRU are still a problem because scanning for them wastes lots of CPU. The he suggested adding a new LRU list just for tracking dirty page, and it requires a new page flag.

Memory control groups have its problem with dirty limit mainly because there is only a global dirty limit, and flusher fairness is beyond control. There are only coarse options available such as limiting the number of operations that can be performed on a per-inode basis or limiting the amount of IO that can be submitted.

The discussion then moved to buffer write blkcg IO control. Current blkcg IO control is useless with regard to buffer write, because blkcg throttles at summit_io(), where there is mostly no context of the writer. Fengguang made a suggestion and RFC patchsets to implement buffer write IO control in balance_dirty_pages. However Tejun Heo argued that blkcg should do its work in block layer, instead of messing up with mm layer. Also there are comments that the algorithm of balance_dirty_pages() is already very complex, doing IO controlling there will make it even more difficult to understand. The dis-agreement has been there for a long time and it couldn't reach conclusion soon so people askes the discussion to continue in MM track and it later moves on on mailing list.

Writeback and Stable Pages

The same topic was discussed in last year's LSF and the conclusion was to let writer wait when it wants to write a page that is under writeback. However Ted Ts'o reported ext4 long latency when Google started to use the code. In brief, waiting page writeback (to get stable pages) can lead to large process latency. And it is not necessary for every system. Stable pages are only required for some systems where things like checksums calculated on the page require that the page be unchanged when it actually gets written.

Sage Weil and Boaz Harrosh made three options to handle the situation:
1. re-issue pages that are changed during IO;
2. wait on writeback for every page (current implementation);
3. Do a COW on the page under writeback when it is written to.

The first option was dropped instantly because it confuses storage that need stable page and is purely overhead for storage that doesn't need it.

The third option was discussed on but the overhead of COW is unknown and there are some corner cases that need to be addressed like what to do if file is truncated after COW page is created. So the first step as suggested is to introduce some APIs to let storage tell filesystem if it needs stable page, and let filesystem tell storage if storage is supported. Then for cases where stable pages are unnecessary like Google's use, file system doesn't need to do anything to send stable pages. As for stable page support, some reporting method should be added to writeback code path to find out what workload are being affected and what those affects are. Then someone can propose on how to implement the COW solution and address all the corner cases.

Copy Offload

NetApp's Frederick Knight led copy offload session. The idea of copy offload is to allow SCSI devices to copy ranges of blocks without involving the host operating system. Copy offload has been in SCSI standard since SCSI-1 days. EXTENDED COPY (XCOPY) uses two descriptors for the source and destination and a range of blocks. It can be implemented in either push model (source sends the blocks to the target) or pull model (target pulls from source).

TOKEN based copy is far more complex. Each token has a ROD (Representation of Data) that allows arrays to give an identifier for what may be a snapshot. A token represents a device and a range of sectors that the device guarantees to be stable. However, if the device doesn't support snapshotting and the region gets overwritten for any reason, the token can be declined by storage. It means that storage users have no idea of the lifetime of a token, and every time a token goes invalid, they need to either renew the token or do real data transfer.

Token based copy is somewhat similar to NFS's server side copy (NFSv4.2 draft). And it is suggested that token format need to be standardized to possibly allow copy offload from SCSI disk to CIFS/NFS volume.

Kernel AIO/DIO Interfaces

The first session in the afternoon was led by Dave Kleikamp who is trying to modify kernel AIO/DIO APIs to provide in-kernel interfaces. He changed iov_iter to make it handle both iovec (from userspace) and bio_vec (from kernel space). He modified loopback device to let it send AIO and therefore avoid caching in underlying filesystem. And people suggested that swap-over-NFS can be adapted to use the same API.

RAID engine Unification

Boaz implemented a generic RAID engine for pNFS object layout driver. Since the code is simple and efficient, he wants to push its usage and unify kernel RAID implementations. Currently besides Boaz's RAID implementation, there are two other RAID implementations, MD and btrfs. Boaz said that his implementation is more complete and support RAID stacking without extra data copy.

However, it seems the benefit is not so obvious and people are hesitating to adopt it because current code works just good. Chris Mason suggested Boaz to start with MD because MD is much simpler than btrfs. If that works well, he can continue to change btrfs to use the new RAID engine.

xfstests

With many years of advocating, xfstests has somehow become the most used regression test suit for not just xfs, but also ext4 and btrfs. There are ~280 test cases and around 100 of them are filesystem independent. However, one nightmare is that all test files are numbered instead of properly named. So anyone who wants to use it need to read the test case and find out what it actually does. Also it need to be reorganized so that similar functions are grouped in a directory instead of lying flat like right now.

So much for the first day. Here are some pictures of attendees taken by Chuck Level:

Tech rider

Monday, May 7, 2012

2012 LSF/MM Summit Summary -- Day 2

2012 LSF/MM Summit Summary -- Day 1

The BARD

Labels

Where I'm Heading

Blog Archive