ZFS on macOS
in Sysadmin / Storage
ZFS is a filesystem developed at Sun Microsystems for their Solaris operating system. This is an experience report from recently setting it up as the filesystem for my secondary storage drives on macOS.
In contrast to other common filesystems you may encounter, ZFS is fully transactional and copy-on-write (COW). At the most basic level, you can imagine it as if the filesystem were a database of small blocks, indexed in B-trees and checksummed for verification. Notably, the COW design allows nearly instantaneous “snapshots” to be recorded and incremental backups between them. It also includes the feature sets of Logical Volume Management (LVM), RAID (mdadm), caching (bcache/flashcache), and transparent compression out of the box.
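For instance, taking a snapshot and then replicating only the changes between two snapshots takes just a couple of commands. This is a sketch — the dataset and snapshot names here are hypothetical:

```shell
# Take a named snapshot (near-instant, since COW means no data is copied)
zfs snapshot pool0/data@monday

# Later: take another snapshot, then send only the blocks changed in between
zfs snapshot pool0/data@tuesday
zfs send -i pool0/data@monday pool0/data@tuesday | zfs recv backup/data
```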
At one point, ZFS seemed like the future of filesystems, with Apple intending to use it in OS X (see: Time Machine). Furthermore, Sun released ZFS as open source under the CDDL license; however, due to licensing controversy it was never merged into the mainline Linux kernel. Thus ZFS was squandered, never to be enjoyed as widely as it might otherwise have been.
In recent years, it’s become possible to run ZFS on Linux through the efforts of the ZFS on Linux project and the incorporation of easy-to-install ZFS packages in the Ubuntu repos. A related project, OpenZFSOnOSX, brings ZFS to macOS by using a Solaris Porting Layer (SPL) to translate Solaris system calls for the Darwin kernel, allowing the ZFS code to run mostly unchanged. This obviously brings some performance penalty, but hopefully not a significant one for spinning disks, which are likely to be I/O bound anyway.
Installation
Installation was easy using the provided package on the OpenZFSOnOSX website, or
$ brew cask install openzfs
which installs a few kernel extensions:
$ kextstat | grep net.lundman
69 1 0xffffff7f816aa000 0x498 0x498 net.lundman.kernel.dependencies.31 (12.5.0) CF3A4A39-BA8C-4DFF-8BA7-B3C04D69457E
70 1 0xffffff7f816ab000 0x11f5000 0x11f5000 net.lundman.spl (1.7.2) 0AB91572-CACF-39DF-86B3-116FF8CDCB8E <69 7 5 4 3 1>
71 1 0xffffff7f828b0000 0x2d2000 0x2d2000 net.lundman.zfs (1.7.2) 0F708776-FDC2-39C5-87CE-42CFF3C5DD48 <70 25 7 5 4 3 1>
Tooling
ZFS has some pretty amazing tooling in the `zfs`, `zpool`, and `zdb` commands. `zpool` is used to create and manage pools of disks that logically present as a single volume (like LVM), whereas `zfs` is used to manage individual datasets, which are nested hierarchically within pools but may have different tunings (`recordsize`, etc.).
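As a rough sketch of that division of labor (the dataset name here is hypothetical):

```shell
# Pool level: physical devices, health, and overall capacity
zpool status pool0
zpool list pool0

# Dataset level: nested filesystems within the pool, each individually tunable
zfs list -r pool0
zfs get recordsize,compression pool0/somedataset
```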
One feature I found myself using a lot for monitoring was `zpool iostat`, which gives a current picture of I/O operations across the different media in the pool:
$ zpool iostat -v 3
                                                  capacity     operations     bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
pool0                                            408G  10.5T    399    202  8.76M  13.2M
  raidz1                                         408G  10.5T    399    202  8.76M  13.2M
    media-4381EF74-91A2-A841-9073-00DD77A83EAA      -      -    133     70  2.92M  4.39M
    media-A00A5F1E-98B4-AA4C-ABE6-D19D52451730      -      -    133     69  2.92M  4.39M
    media-3D6C5673-A976-544C-9206-31ED6F4986E4      -      -    133     62  2.92M  4.40M
logs                                                -      -      -      -      -      -
  media-F0A2BB1A-583A-4D31-A9FD-887C7B326862     260K  7.00G      0      0      5     11
cache                                               -      -      -      -      -      -
  media-599EF695-4A06-488E-8679-FA03ED260C8D     519M  92.3G      0      0      1  7.16K
----------------------------------------------  -----  -----  -----  -----  -----  -----
Configuration
After installation, I made a pool of my three drives using the `zpool` command:
$ zpool create pool0 raidz disk1 disk2 disk3
Striped, mirrored, or RAID-Z?
You may be familiar with the different RAID levels (RAID0, RAID1, etc). Striping in ZFS corresponds to RAID0, mirroring to RAID1, RAIDZ-1 to RAID5, and RAIDZ-2 to RAID6.
These different setups offer different tradeoffs in:
- Performance (IOPS)
- Performance (bandwidth)
- Space efficiency
- Resiliency to drive failure
For example, striping offers the best performance, since all disks may be used independently in parallel. However, in this configuration the failure of any one disk renders the entire set unusable, since each disk holds 1/N of the data.
There are many in-depth looks at these different configurations, and even calculators, but they’re outside the scope of this article. Suffice it to say, because it is my home computer and I am somewhat stingy / out of available drive bays, I chose RAIDZ-1 for my configuration of 3 x 4TB drives, resulting in ~8 TB of usable space while being able to tolerate the failure of any one drive.
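The arithmetic behind that choice is simple: RAIDZ-1 dedicates one disk’s worth of space to parity, leaving (N - 1) disks of usable capacity:

```shell
# RAIDZ-1 usable capacity: one disk's worth of space goes to parity
disks=3
size_tb=4
echo $(( (disks - 1) * size_tb ))   # 8 (TB usable, before filesystem overhead)
```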
RAIDZ carries some notable performance limitations; however, these can be mitigated somewhat by using an SSD for both write-ahead logging and caching (i.e., using the SSD for its better low-latency random I/O). Since I have an SSD in my desktop as the main root device, I created a small 8GB partition to use for write-ahead logging (the so-called ZIL, or ZFS intent log; when split out onto a separate device, sometimes referred to as an SLOG), and a medium-sized 100G partition to use as a cache. These were added to the pool with:
$ zpool add pool0 cache disk4s1
$ zpool add pool0 log disk4s2
Tuning for MySQL
My computer runs a somewhat large MySQL instance (~3TB data), and MySQL on ZFS is not a good match with the out-of-the-box default configuration. The main reasons for this are:
- Mismatched page size: ZFS and MySQL both manage data in fixed-size pages; however, the default MySQL page size is 16K, while the default ZFS page size (the `recordsize` property) is 128K. This leads to thrashing, as updates to small MySQL pages require read-modify-writes of the larger ZFS records.
- Redundancy: Both ZFS and MySQL perform write-ahead logging, checksumming, and (optionally) compression of their data, which is redundant in this case.
Percona has a great guide on how to tune ZFS+MySQL to play nicely together, and there are some additional notes on the OpenZFS Wiki. In the end, for my setup, I went with:
- Two separate ZFS datasets, `mysql/data` and `mysql/logs`.
- For the `data` dataset: `recordsize=64k`, `compression=lz4`, `primarycache=metadata`, `logbias=throughput`.
- For the `logs` dataset: `recordsize=128k` (the default) and `primarycache=metadata`.
- MySQL tunings: `innodb_page_size=64k`, `skip-log-bin`, `innodb_doublewrite=off`, `innodb_checksum_algorithm=none`.
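The ZFS side of these settings can be applied when creating the datasets. A sketch, assuming the datasets live under `pool0` (the same properties could equally be applied to existing datasets with `zfs set`):

```shell
# Parent dataset to hold both
zfs create pool0/mysql

# Data files: small records, compression, metadata-only ARC caching
zfs create -o recordsize=64k -o compression=lz4 \
    -o primarycache=metadata -o logbias=throughput pool0/mysql/data

# Logs: default 128k recordsize, metadata-only ARC caching
zfs create -o primarycache=metadata pool0/mysql/logs
```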
With these settings, we match the page size of MySQL to the recordsize of ZFS, and also disable some of the redundant consistency settings in MySQL (double-writing and checksumming) that are not strictly necessary since they are provided by ZFS.
In addition, because both MySQL and ZFS perform in-memory caching, I limited the size of ZFS’s in-memory (ARC) cache to 4GB and disabled data caching (`primarycache=metadata`) in favor of MySQL’s `innodb_buffer_pool`, which is reported to be 7-200% faster.
Memory leak?
The biggest issue I ran into was importing my large MySQL dataset. ZFS seemed to be leaking memory (visible in Activity Monitor as a ballooning `kernel_task`), even when I limited the size of the in-memory ARC cache with `kstat.zfs.darwin.tunable.zfs_arc_max`. This would eventually degrade performance, as MySQL and ZFS would compete for available memory.
Thankfully, I eventually stumbled on a link to a GitHub issue with the following helpful configuration snippet:
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_max=4294967296
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_limit=3221225472
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_min=1610612736
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_min=1342177280
sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=536870912
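These values are plain byte counts, and decode to round binary sizes:

```shell
echo $((4 * 1024 * 1024 * 1024))       # zfs_arc_max:        4 GiB    = 4294967296
echo $((3 * 1024 * 1024 * 1024))       # zfs_arc_meta_limit: 3 GiB    = 3221225472
echo $((3 * 1024 * 1024 * 1024 / 2))   # zfs_arc_min:        1.5 GiB  = 1610612736
echo $((5 * 1024 * 1024 * 1024 / 4))   # zfs_arc_meta_min:   1.25 GiB = 1342177280
echo $((512 * 1024 * 1024))            # zfs_dirty_data_max: 512 MiB  = 536870912
```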
Apparently, when under intense write load (as when I was importing my MySQL dataset), the memory allocator in SPL can fail to release dirty pages promptly, leading to what appears to be a memory leak.
After applying these sysctls (and making them persistent by adding them to `/etc/zfs/zsysctl.conf`), memory usage seems to be appropriately capped.
Conclusions
While ZFS on macOS is definitely still a bit bleeding-edge, I found it reasonably reliable, with no kernel panics or other major issues observed. Performance definitely takes a hit, although the authors claim that it may improve in time: to date they have focused primarily on stability rather than performance (which seems wise, given some of the backlash to early bugs in Btrfs). In addition, ZFS offers a RAID5-like configuration not otherwise available on macOS. I’ll be curious to see how it performs going forward, and whether any issues arise. I’m also interested in setting up a system that uses ZFS snapshots to perform incremental backups offsite.
If you’re interested in diving into more of the technical details of ZFS, Chris Siebenmann has an excellent series of blog posts about it.