ZFS on macOS
in Sysadmin / Storage
ZFS is a filesystem developed at Sun Microsystems for their Solaris operating system. This is an experience report from recently setting it up as the filesystem for my secondary storage drives on macOS.
In contrast to other common filesystems you may encounter, ZFS is fully transactional and copy-on-write (COW). At the most basic level, you can imagine it as if the filesystem were a database of small blocks, indexed in B-trees and checksummed for verification. Notably, the COW design allows nearly instantaneous “snapshots” to be recorded and incremental backups between them. It also includes the feature sets of Logical Volume Management (LVM), RAID (mdadm), caching (bcache/flashcache), and transparent compression out of the box.
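For instance, taking a snapshot and then replicating only the changes between two snapshots takes just a couple of commands. This is a sketch — the dataset and snapshot names here are hypothetical:

```shell
# Take a named snapshot (near-instant, since COW means no data is copied)
zfs snapshot pool0/data@monday

# Later: take another snapshot, then send only the blocks changed in between
zfs snapshot pool0/data@tuesday
zfs send -i pool0/data@monday pool0/data@tuesday | zfs recv backup/data
```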
At one point, ZFS seemed like the future of filesystems, with Apple intending to use it in OS X (see: Time Machine). Furthermore, Sun released ZFS as open source under the CDDL license; however, due to licensing controversy it was never merged into the mainline Linux kernel. Thus ZFS was squandered, never to be enjoyed as widely as it might otherwise have been.
In recent years, it’s become possible to run ZFS on Linux through the efforts of the ZFS on Linux project and the incorporation of easy-to-install ZFS packages in the Ubuntu repos. A related project, OpenZFSOnOSX, brings ZFS to macOS by using a Solaris Porting Layer (SPL) to translate Solaris system calls for the Darwin kernel, allowing the ZFS code to run mostly unchanged. This obviously brings some performance penalty, but hopefully not a significant one for spinning disks, which are likely to be I/O bound anyway.
Installation
Installation was easy using the provided package on the OpenZFSOnOSX website, or
$ brew cask install openzfs
which installs a few kernel extensions:
$ kextstat | grep net.lundman
69 1 0xffffff7f816aa000 0x498 0x498 net.lundman.kernel.dependencies.31 (12.5.0) CF3A4A39-BA8C-4DFF-8BA7-B3C04D69457E
70 1 0xffffff7f816ab000 0x11f5000 0x11f5000 net.lundman.spl (1.7.2) 0AB91572-CACF-39DF-86B3-116FF8CDCB8E <69 7 5 4 3 1>
71 1 0xffffff7f828b0000 0x2d2000 0x2d2000 net.lundman.zfs (1.7.2) 0F708776-FDC2-39C5-87CE-42CFF3C5DD48 <70 25 7 5 4 3 1>
Tooling
ZFS has some pretty amazing tooling in the `zfs`, `zpool`, and `zdb` commands. `zpool` is used to create and manage pools of disks that logically present as a single volume (like LVM), whereas `zfs` is used to manage individual datasets, which are nested hierarchically within pools but may have different tunings (`recordsize`, etc.).
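As a rough sketch of that division of labor (the dataset name here is hypothetical):

```shell
# Pool level: physical devices, health, and overall capacity
zpool status pool0
zpool list pool0

# Dataset level: nested filesystems within the pool, each individually tunable
zfs list -r pool0
zfs get recordsize,compression pool0/somedataset
```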
One feature I found myself using a lot for monitoring was `zpool iostat`, which gives a current picture of I/O operations across the different media in the pool:
$ zpool iostat -v 3
                                                  capacity     operations     bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
pool0                                            408G  10.5T    399    202  8.76M  13.2M
  raidz1                                         408G  10.5T    399    202  8.76M  13.2M
    media-4381EF74-91A2-A841-9073-00DD77A83EAA      -      -    133     70  2.92M  4.39M
    media-A00A5F1E-98B4-AA4C-ABE6-D19D52451730      -      -    133     69  2.92M  4.39M
    media-3D6C5673-A976-544C-9206-31ED6F4986E4      -      -    133     62  2.92M  4.40M
logs                                                -      -      -      -      -      -
  media-F0A2BB1A-583A-4D31-A9FD-887C7B326862     260K  7.00G      0      0      5     11
cache                                               -      -      -      -      -      -
  media-599EF695-4A06-488E-8679-FA03ED260C8D     519M  92.3G      0      0      1  7.16K
----------------------------------------------  -----  -----  -----  -----  -----  -----
Configuration
After installation, I made a pool of my three drives using the `zpool` command:
$ zpool create pool0 raidz disk1 disk2 disk3
Striped, mirrored, or RAID-Z?
You may be familiar with the different RAID levels (RAID0, RAID1, etc). Striping in ZFS corresponds to RAID0, mirroring to RAID1, RAIDZ-1 to RAID5, and RAIDZ-2 to RAID6.
These different setups offer different tradeoffs in:
- Performance (IOPS)
- Performance (bandwidth)
- Space efficiency
- Resiliency to drive failure
For example, striping offers the best performance, since all disks may be used independently in parallel. However, in this configuration the failure of any one disk renders the entire set unusable, since each disk holds 1/N of the data.
There are many in-depth looks at these different configurations, and even calculators, but they’re outside the scope of this article. Suffice it to say, because it is my home computer and I am somewhat stingy / out of available drive bays, I chose RAIDZ-1 for my configuration of 3 x 4TB drives, resulting in ~8 TB of usable space while being able to tolerate the failure of any one drive.
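The arithmetic behind that choice is simple: RAIDZ-1 dedicates one disk’s worth of space to parity, leaving (N - 1) disks of usable capacity:

```shell
# RAIDZ-1 usable capacity: one disk's worth of space goes to parity
disks=3
size_tb=4
echo $(( (disks - 1) * size_tb ))   # 8 (TB usable, before filesystem overhead)
```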
RAIDZ carries some notable performance limitations; however, these can be mitigated somewhat by using an SSD for both write-ahead logging and caching (i.e., using the SSD for its better low-latency random I/O). Since I have an SSD in my desktop as the main root device, I created a small 8GB partition to use for write-ahead logging (the so-called ZIL, or ZFS intent log; when split out onto a separate device, sometimes referred to as an SLOG), and a medium-sized 100G partition to use as a cache. These were added to the pool with:
$ zpool add pool0 cache disk4s1
$ zpool add pool0 log disk4s2
Tuning for MySQL
My computer runs a somewhat large MySQL instance (~3TB data), and MySQL on ZFS is not a good match with the out-of-the-box default configuration. The main reasons for this are:
- Mismatched page size: ZFS and MySQL both manage data in fixed-size pages; however, the default MySQL page size is 16K, while the default ZFS page size (the `recordsize` property) is 128K. This leads to thrashing, as updates to small MySQL pages require read-modify-writes of the larger ZFS records.
- Redundancy: Both ZFS and MySQL perform write-ahead logging, checksumming, and (optionally) compression of their data, which is redundant in this case.
Percona has a great guide on how to tune ZFS+MySQL to play nicely together, and there are some additional notes on the OpenZFS Wiki. In the end, for my setup, I went with:
- Two separate ZFS datasets, `mysql/data` and `mysql/logs`.
- For the `data` dataset: `recordsize=64k`, `compression=lz4`, `primarycache=metadata`, `logbias=throughput`.
- For the `logs` dataset: `recordsize=128k` (the default) and `primarycache=metadata`.
- MySQL tunings: `innodb_page_size=64k`, `skip-log-bin`, `innodb_doublewrite=off`, `innodb_checksum_algorithm=none`.
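The ZFS side of these settings can be applied when creating the datasets. A sketch, assuming the datasets live under `pool0` (the same properties could equally be applied to existing datasets with `zfs set`):

```shell
# Parent dataset to hold both
zfs create pool0/mysql

# Data files: small records, compression, metadata-only ARC caching
zfs create -o recordsize=64k -o compression=lz4 \
    -o primarycache=metadata -o logbias=throughput pool0/mysql/data

# Logs: default 128k recordsize, metadata-only ARC caching
zfs create -o primarycache=metadata pool0/mysql/logs
```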
With these settings, we match the page size of MySQL to the recordsize of ZFS, and also disable some of the redundant consistency settings in MySQL (double-writing and checksumming) that are not strictly necessary since they are provided by ZFS.
In addition, because both MySQL and ZFS perform in-memory caching, I limited the size of ZFS’s in-memory (ARC) cache to 4GB and disabled data caching (`primarycache=metadata`) in favor of MySQL’s `innodb_buffer_pool`, which is reported to be 7-200% faster.
Memory leak?
The biggest issue I ran into was importing my large MySQL dataset. ZFS seemed to be leaking memory (visible in Activity Monitor as a ballooning `kernel_task`), even when I limited the size of the in-memory ARC cache with `kstat.zfs.darwin.tunable.zfs_arc_max`. This would eventually degrade performance, as MySQL and ZFS would compete for available memory.
Thankfully, I eventually stumbled on a link to a GitHub issue with the following helpful configuration snippet:
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_max=4294967296
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_limit=3221225472
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_min=1610612736
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_min=1342177280
sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=536870912
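These values are plain byte counts, and decode to round binary sizes:

```shell
echo $((4 * 1024 * 1024 * 1024))       # zfs_arc_max:        4 GiB    = 4294967296
echo $((3 * 1024 * 1024 * 1024))       # zfs_arc_meta_limit: 3 GiB    = 3221225472
echo $((3 * 1024 * 1024 * 1024 / 2))   # zfs_arc_min:        1.5 GiB  = 1610612736
echo $((5 * 1024 * 1024 * 1024 / 4))   # zfs_arc_meta_min:   1.25 GiB = 1342177280
echo $((512 * 1024 * 1024))            # zfs_dirty_data_max: 512 MiB  = 536870912
```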
Apparently, when under intense write load (as when I was importing my MySQL dataset), the memory allocator in SPL can fail to release dirty pages promptly, leading to what appears to be a memory leak.
After applying these sysctls (and making them persistent by adding them to `/etc/zfs/zsysctl.conf`), memory usage seems to be appropriately capped.
Conclusions
While ZFS on macOS is definitely still a bit bleeding-edge, I found it reasonably reliable, with no kernel panics or other major issues observed. Performance definitely takes a hit, although the authors claim that it may improve in time: to date they have focused primarily on stability rather than performance (which seems wise, given some of the backlash to early bugs in Btrfs). In addition, ZFS offers a RAID5-like configuration not otherwise available on macOS. I’ll be curious to see how it performs going forward, and whether any issues arise. I’m also interested in setting up a system that uses ZFS snapshots to perform incremental backups offsite.
If you’re interested in diving into more of the technical details of ZFS, Chris Siebenmann has an excellent series of blog posts about it.