♆ Build with BSD

Practical Engineering with FreeBSD.

ZFS Tutorial Part 1

By Will Green, Last updated 2016-03-16

Introduction

ZFS is an open-source filesystem used by FreeBSD, illumos, Solaris, and Linux. This series of tutorials takes you through all the important aspects of ZFS, including pool and filesystem creation, quotas, compression, disk failures, fault tolerance, snapshots, clones, caching, and the ZFS architecture itself. To allow you to safely experiment we use files instead of disks: that way you can use your laptop or virtual machine to build and break complex setups.

In this first tutorial I provide a brief overview of ZFS and show you how to manage ZFS pools: the foundation of ZFS. In subsequent parts we look at ZFS filesystems and their features in more depth. All the ZFS tutorials should work on any recent ZFS system including Ubuntu 16.04 LTS. We welcome your thoughts, criticisms and questions @BuildWithBSD.

"Let your hook be always cast; in the pool where you least expect it, there will be a fish." » Ovid

ZFS Overview

ZFS takes a different approach to storage management. Rather than layering a filesystem on top of separate volume management and RAID software, it looks after the complete stack, from devices to filesystems. Because of this it has many features that are difficult or impossible with traditional Unix filesystems, for example: instant snapshots, copy-on-write clones, data-rot protection, and the ability to add new disks or filesystems on the fly.

The architecture of ZFS has four levels. One or more ZFS filesystems exist in a ZFS pool, which consists of one or more vdevs, each composed of one or more devices (typically disks or SAN volumes). Don't worry if this doesn't make sense yet; we'll explore all these levels throughout the tutorial series.

The following diagram shows a simple ZFS pool consisting of two mirrored pairs of disks. There are three filesystems sharing the pool resources. For resilience each mirror stores the data twice: once on each disk in the mirror. We will create a pool like this later in the tutorial.
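
As a preview, a pool like the one in the diagram can be created with a single command. Here's a sketch using a hypothetical pool name (tank) and hypothetical device names (da0 to da3); later in this tutorial we'll build an equivalent pool from files instead:

# zpool create tank mirror da0 da1 mirror da2 da3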

Getting Started

Requirements

WARNING: FreeBSD/PC-BSD 10.1 users on SSD should make sure they're on 10.1-RELEASE-p11 or greater to avoid a kernel panic. See FreeBSD-EN-15:07.zfs for details.
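
You can check which release and patch level you're running with freebsd-version (available since FreeBSD 10.0):

$ freebsd-version -ku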

Ubuntu 16.04

If you're running Ubuntu 16.04 LTS (Xenial Xerus) your kernel has ZFS support, but you still need to install the command-line tools:

$ sudo apt-get install zfsutils-linux
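
The zfs kernel module normally loads automatically the first time you run a ZFS command; if it doesn't, you can load it by hand (usually unnecessary, but harmless):

$ sudo modprobe zfs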

Creating Files

Normally you would create your storage pool with disks or on a SAN. For this tutorial series we use files: this allows you to experiment with ZFS on a personal machine or VM.

Create five 256 MiB files to act as disks (note: the dd commands below use the FreeBSD size suffix bs=1m; GNU dd on Linux expects bs=1M):

$ mkdir /tmp/zfstut
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk1
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk2
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk3
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/disk4
$ dd bs=1m count=256 if=/dev/zero of=/tmp/zfstut/sparedisk

$ ls -lh /tmp/zfstut
total 3
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk1
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk2
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk3
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 disk4
-rw-r--r--  1 flux  wheel   256M Jan 23 17:54 sparedisk
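
Writing out zeros with dd is fine for files this small, but if you'd rather not wait you can create sparse files instead; a hedged alternative using truncate(1), which is available on both FreeBSD and Linux:

$ for d in disk1 disk2 disk3 disk4 sparedisk; do truncate -s 256M /tmp/zfstut/$d; done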

ZFS Commands

ZFS has just two commands: zpool, which manages pools, and zfs, which manages filesystems. They're actually user friendly!

If you run either command with no options it gives you a handy summary.
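
For example, running them bare prints their subcommand summaries (output varies slightly between platforms and versions, so it isn't shown here):

$ zpool
$ zfs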

Pools

All ZFS filesystems live in a pool and share its resources. The first step in using ZFS is to create a pool. ZFS pools are administered using the zpool command.

Before creating new pools you should check for existing pools to avoid confusing them with your tutorial pools. You can check what pools exist with zpool list:

$ zpool list 
no pools available

On systems that use ZFS by default you will see existing pools like this:

$ zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank    38G  3.82G  34.2G         -     3%    10%  1.00x  ONLINE  -

Be careful not to change or destroy any existing pools during the tutorial. We'll give our pools distinctive names to make them easy to identify. In subsequent output we'll only show the pools used in the tutorial.

Single Disk Pool

The simplest pool consists of a single device. Pools are created using zpool create. You need appropriate privileges to create or change ZFS pools. In this tutorial we show the commands being run as root, but you can also use sudo or a role with appropriate privileges. We can create a single disk pool as follows:

# zpool create herring /tmp/zfstut/disk1
# zpool list herring
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
herring   240M    50K   240M         -     1%     0%  1.00x  ONLINE  -

No volume management, configuration, newfs or mounting is required. You now have a working pool complete with mounted ZFS filesystem under /herring. We will learn about adjusting mount points in part 2 of the tutorial.
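
You can confirm the filesystem exists and is mounted with either of the following; the exact output varies by platform, so it isn't shown here:

# zfs list herring
# df -h /herring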

Create a file in the 'herring' filesystem:

# dd bs=1m count=64 if=/dev/random of=/herring/foo
# ls -lh /herring
total 60845
-rw-r--r--  1 root  wheel    64M Jan 23 17:55 foo
# zpool list herring
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
herring   240M  64.2M   176M         -    21%    26%  1.00x  ONLINE  -

The new file is using about a quarter of the pool capacity (indicated by the CAP value). NB. If you run the list command before ZFS has finished writing to disk you will see lower ALLOC and CAP values than shown above; wait a few moments and try again.

Now destroy your pool with zpool destroy:

# zpool destroy herring
# zpool list herring
cannot open 'herring': no such pool

Note that zpool destroy doesn't ask for confirmation; you will only receive a warning if files on the pool are in use. We'll see in a later tutorial how you can recover a pool you've accidentally destroyed.

Mirrored Pool

A pool composed of a single disk doesn't offer any redundancy: if the disk fails our data is lost. One method of providing redundancy is to create a pool out of a mirrored pair of disks (the pair form a vdev):

# zpool create trout mirror /tmp/zfstut/disk1 /tmp/zfstut/disk2
# zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M    50K   240M         -     1%     0%  1.00x  ONLINE  -

To see more detail about the pool use zpool status:

# zpool status trout
  pool: trout
 state: ONLINE
  scan: none requested
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     0
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

We can see our pool contains one mirror of two disks. Let's create a file and see how ALLOC changes:

# dd bs=1m count=64 if=/dev/random of=/trout/foo

# zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M  64.2M   176M         -    20%    26%  1.00x  ONLINE  -

As before, about a quarter of the pool's capacity has been used, but the data is now stored redundantly across both disks.

Mirror Resilience

Let's test the resilience of our mirror by overwriting some of one disk with random data:

# dd bs=1m seek=10 count=1 conv=notrunc if=/dev/random of=/tmp/zfstut/disk1

ZFS will spot the damaged data if we try to access it; however, we can also force a check of the whole pool with zpool scrub:

# zpool scrub trout

We can then inspect the pool status again:

# zpool status trout
  pool: trout
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
  attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
  using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.12M in 0h0m with 0 errors on Sat Jan 23 18:10:39 2016
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     9
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

The disk file is fine; it's just some of its data that's damaged, so zpool clear should do the trick:

# zpool clear trout
# zpool status trout
  pool: trout
 state: ONLINE
  scan: scrub repaired 1.12M in 0h0m with 0 errors on Sat Jan 23 18:10:39 2016
config:

  NAME                   STATE     READ WRITE CKSUM
  trout                  ONLINE       0     0     0
    mirror-0             ONLINE       0     0     0
      /tmp/zfstut/disk1  ONLINE       0     0     0
      /tmp/zfstut/disk2  ONLINE       0     0     0

errors: No known data errors

Our pool is back to a healthy state: the data was repaired using the other disk in the mirror.
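
If you want to convince yourself the file is intact, read it back in full: ZFS verifies block checksums on every read, so an error-free read means the data is good:

# dd bs=1m if=/trout/foo of=/dev/null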

That's all very well if the data is corrupted, but what about a disk failure? We can simulate a whole disk failure by truncating the disk file and running another scrub:

# echo > /tmp/zfstut/disk1

# zpool scrub trout
# zpool status trout
  pool: trout
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
  invalid.  Sufficient replicas exist for the pool to continue
  functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jan 23 18:15:11 2016
config:

  NAME                     STATE     READ WRITE CKSUM
  trout                    DEGRADED     0     0     0
    mirror-0               DEGRADED     0     0     0
      1905457780109944468  UNAVAIL      0     0     0  was /tmp/zfstut/disk1
      /tmp/zfstut/disk2    ONLINE       0     0     0

errors: No known data errors

The disk file we truncated is showing as UNAVAIL, but no data errors are reported for the pool as a whole. We can still read and write to the pool:

# dd bs=1m count=64 if=/dev/random of=/trout/bar 
# ls -lh /trout
total 131165
-rw-r--r--  1 root  wheel    64M Jan 23 18:17 bar
-rw-r--r--  1 root  wheel    64M Jan 23 18:04 foo

To maintain redundancy we should replace the broken disk with another using zpool replace:

# zpool replace trout /tmp/zfstut/disk1 /tmp/zfstut/sparedisk

Check to see if our pool is healthy again:

# zpool status trout
  pool: trout
 state: ONLINE
  scan: resilvered 128M in 0h0m with 0 errors on Fri Jan 22 22:17:59 2016
config:

  NAME                       STATE     READ WRITE CKSUM
  trout                      ONLINE       0     0     0
    mirror-0                 ONLINE       0     0     0
      /tmp/zfstut/sparedisk  ONLINE       0     0     0
      /tmp/zfstut/disk2      ONLINE       0     0     0

errors: No known data errors

If you are quick enough, or your device is slow enough, you may see the resilvering in progress. Resilvering is analogous to remirroring in traditional RAID, but it only copies blocks that contain data; this can save many hours on large magnetic disks.
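
You can watch a resilver's progress by re-running zpool status. Here's a rough sketch of polling it from the shell (press Ctrl+C to stop):

# while true; do zpool status trout | grep -A 2 scan:; sleep 1; done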

Adding to a Mirrored Pool

You can add disks to a pool without taking it offline. Let's double the size of our trout pool by adding a second mirror using zpool add:

# zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   240M   128M   112M         -    40%    53%  1.00x  ONLINE  -

# zpool add trout mirror /tmp/zfstut/disk3 /tmp/zfstut/disk4

# zpool list trout
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
trout   480M   128M   352M         -    20%    26%  1.00x  ONLINE  -

This happens almost instantly, and the filesystems within the pool remain available during the addition. Looking at the status now shows the pool consists of two mirrors:

# zpool status trout
  pool: trout
 state: ONLINE
  scan: resilvered 128M in 0h0m with 0 errors on Fri Jan 22 22:17:59 2016
config:

  NAME                       STATE     READ WRITE CKSUM
  trout                      ONLINE       0     0     0
    mirror-0                 ONLINE       0     0     0
      /tmp/zfstut/sparedisk  ONLINE       0     0     0
      /tmp/zfstut/disk2      ONLINE       0     0     0
    mirror-1                 ONLINE       0     0     0
      /tmp/zfstut/disk3      ONLINE       0     0     0
      /tmp/zfstut/disk4      ONLINE       0     0     0

errors: No known data errors

We can see where the data is currently written in our pool using zpool iostat -v:

# zpool iostat -v trout
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
trout                       128M   352M      4      5   480K   144K
  mirror                    128M   112M      4      5   480K   144K
    /tmp/zfstut/sparedisk      -      -      0     26    351  1.90M
    /tmp/zfstut/disk2          -      -      4      5   484K   151K
  mirror                     11K   240M      0      0      0    488
    /tmp/zfstut/disk3          -      -      0      1    509  49.8K
    /tmp/zfstut/disk4          -      -      0      1    509  49.8K
-------------------------  -----  -----  -----  -----  -----  -----

All the data is currently written on the first mirror pair, and none on the second. This makes sense as the second pair of disks was added after the data was written and ZFS doesn't move existing data around. However, if we write some new data to the pool the new mirror will be used:

# dd bs=1m count=128 if=/dev/random of=/trout/quuxx

# zpool iostat -v trout
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
trout                       256M   224M      3      7   441K   260K
  mirror                    178M  62.4M      3      6   441K   182K
    /tmp/zfstut/sparedisk      -      -      0     18    159  1.19M
    /tmp/zfstut/disk2          -      -      3      6   444K   188K
  mirror                   78.8M   161M      0      9      0   614K
    /tmp/zfstut/disk3          -      -      0     10    185   633K
    /tmp/zfstut/disk4          -      -      0     10    185   633K
-------------------------  -----  -----  -----  -----  -----  -----

Note how more of the new data has been written to the new mirror than to the old one: ZFS tries to make the best use of all the resources in the pool. As more writes occur, the mirrors will gradually move towards balance.
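
zpool iostat also accepts an interval in seconds, which makes it easy to watch the balance shift as new data is written (press Ctrl+C to stop):

# zpool iostat -v trout 5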

Finally, we should destroy the trout pool, as we won't be using it in the next tutorial:

# zpool destroy trout

Conclusion

That's it for part 1. I hope this has given you a taste of the power of ZFS and a solid foundation in ZFS pools. In part 2 we will look at managing ZFS filesystems, including properties, quotas, and compression. ♆

» Read part 2 of the ZFS tutorial.

Build with BSD ©2016 Will Green. Share your thoughts with @BuildWithBSD. Hosted on FreeBSD.