
ZFS from A to Z (a ZFS tutorial)

Edit: There's enough info here for this post to start being useful, but it's still not anywhere close to done. Any chance this can get stickied?  :D

 

@wpirobotbuilder @looney @alpenwasser here it is!  :D Thought you guys would be interested in this.

 

If you notice any errors, find important information that is missing, or have any questions, please ask! I will be maintaining this thread for as long as I am a forum member.

 

1 What is the purpose of this tutorial?

This tutorial attempts to thoroughly cover the basics of ZFS: What it is, why you would want to use it, how to set it up, and how to use it.

 

This tutorial is meant to be understandable by someone with no prior knowledge of ZFS, although a basic knowledge of *nix operating systems is expected for the installation section.

 

2 What is ZFS?

ZFS is a file system, similar to how NTFS is the file system that Windows uses, HFS+ is the file system that OS X uses, and ext4 is the file system that most current Linux distros use by default. (ZFS is also a logical volume manager, but I won't get into that now.) A file system controls how a computer reads, writes, and organizes files. This post isn't about file systems in general though (and frankly I don't know enough about file systems to write a long post about them), so if you'd like to know more, start here: http://en.wikipedia.org/wiki/File_system

 

ZFS was originally developed by Sun Microsystems (now part of Oracle) for its operating system, Solaris. ZFS is the default file system of Solaris and is an option for FreeBSD, but it gets a bit more complicated on Linux, where it is not a native option. ZFS cannot be shipped with Linux in its current state because the CDDL (Common Development and Distribution License) that ZFS is licensed under is incompatible with the GPL (GNU General Public License) that Linux is licensed under. You can read more here: http://en.wikipedia.org/wiki/Common_Development_and_Distribution_License

 

3 Why should I use ZFS?

Opinion: ZFS is the best file system and storage integrity option in existence right now.

Fact: ZFS can provide more redundant, cheaper, and faster storage than any other RAID solution that I am aware of.

 

3.1 Redundancy

ZFS provides redundancy mainly through software RAID, with similar RAID levels to other solutions, but what sets ZFS apart is its block-level checksumming and online data integrity checking. Normal redundant hardware and software RAID levels allow for disk failures and replacements, but they fail to check for silent data corruption.

 

Silent data corruption is caused by the occasional read and write error in hard drives, and usually does not get detected until a degraded RAID array fails to rebuild. At that point it is too late to do anything and your supposedly redundant storage solution becomes useless. (To clarify: when this happens, all data is lost).

 

ZFS accounts for silent data corruption by storing a checksum of all used data blocks, which can be checked for integrity at any time through a process known as scrubbing. Scrubbing a ZFS pool (ZFS refers to a group of storage devices as a pool; a pool is roughly what other solutions call an array, but not quite, as I'll explain later) re-computes the checksum of all used blocks and compares it to the stored checksum. If they do not match, then the corrupted block can be re-constructed using the stripes and/or parity blocks from the other disks. It is pertinent to note that scrubbing can be done during file system use. The FS (file system) does not need to be un-mounted or otherwise taken offline.

 

Another issue with many RAID solutions is the RAID5/6 write hole problem, wherein if power is lost during a write operation, some blocks can be left unwritten or half written and the FS won't know, leaving it with corrupted files. Hardware RAID accounts for this with battery backup, but battery backup is a hacked solution at best as it is useless if the battery is missing, dead, or old.

ZFS accounts for the write hole problem by using copy-on-write, meaning that when data is modified, ZFS does not overwrite the existing blocks in place, but instead writes the new data to free blocks and then adjusts pointers accordingly. The checksums discussed previously also ensure that a block of data matches what was supposed to be written.

 

3.2 Cost

Cost is always a limiting factor. Hardware RAID requires expensive controllers, some software RAID solutions require you to pay for the software (FlexRAID, for example), and most solutions encourage the use of enterprise-class (read: SAS) disks because of their higher speeds and lower read/write error rates.

 

ZFS, by contrast, is free, encourages the use of cheap disks (read: SATA), and solves the issue of read/write errors in software instead of hardware, making it incredibly cheap. The only potential for extra cost in a ZFS implementation arises from ZFS's desire for lots of RAM and a fast cache drive (usually an SSD). See the sample configurations in section 10 for a better look at cost analysis.

 

3.3 Speed

"So if ZFS is redundant AND cheap, it can't be fast, can it? There's no way you can have all three," you must be thinking.

 

Well I'm here to tell you that you thought wrong! Implemented properly, ZFS can be faster than just about all forms of hardware and software RAID.

 

ZFS achieves its speed through caching, logging, deduplication, and compression. 

 

3.3.1 Caching & Logging

By default, ZFS will use up to 7/8 of your system memory as a kind of level 1 cache and can be configured to use a fast disk (read: an SSD) as a kind of level 2 cache. In ZFS terms the memory-based cache is known as the ARC, or adaptive replacement cache, and the fast disk-based cache is known as the L2ARC, or level 2 ARC. ZFS also has an optional ZIL, or ZFS Intent Log, a dedicated partition on a fast disk that ZFS uses to log synchronous write transactions temporarily before they are flushed to the main pool in bursts.
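If you're curious how big the ARC and L2ARC actually are at any given moment, ZFS on Linux exposes its cache statistics through a proc file. A quick, hedged check (the exact field names may differ slightly between ZoL versions):

grep -E "^(size|c_max|l2_size)" /proc/spl/kstat/zfs/arcstats

Here size is the current ARC size in bytes, c_max is the maximum the ARC is allowed to grow to, and l2_size is roughly how much data currently sits in the L2ARC.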

 

3.3.2 Deduplication & Compression

ZFS natively supports both deduplication and compression. Deduplication, for those who don't know, allows copies or near-copies of data to be stored as references to the original data instead of as straight copies, which saves space at the cost of a bit of CPU time and a lot of memory. Deduplication and compression can be enabled for an entire ZFS pool, or just a few datasets. I'll get into datasets later, but for now just think of them as folders.

 

4 Why shouldn't I use ZFS?

There are a few important caveats that one should be aware of before use:

 

4.1 License Incompatibility

As discussed earlier, the license that ZFS is under (the CDDL) is not compatible with the license that Linux is under (the GPL), so no Linux distro ships with ZFS. This means installing it as the root FS is difficult for most distros and impossible for others.

 

4.2 ZOL is Incomplete

The ZFSonLinux (ZOL) project is not perfect. It is under constant development and has most of the features of Oracle ZFS, but it is missing a few things, such as built-in Samba integration. For example, Oracle ZFS allows Samba sharing of a ZFS dataset with one command; ZOL does not have that feature yet.

 

4.3 No Defragmentation

ZFS performs very poorly once the FS starts to get full (past 80% according to Oracle, but some reports show degradation at lower percentages). This is due to ZFS not having defragmentation functionality (according to Wikipedia, defragmentation is in the works). Because of the L2ARC and ZIL (see 3.3.1 if you don't know what those are), as well as some FS voodoo that I don't fully understand, this presents less of an issue than one might think.

 

4.4 Hardware

For ZFS to deliver maximum performance, it needs a decent amount of RAM and a decent sized L2ARC. For home use, 8GB of memory and a 64-120GB SSD are sufficient, though power users may notice some benefit at up to 32GB of RAM and 250-500GB of SSD space. ECC memory is desirable for maximum data integrity, but not required.

 

5 What do I need to use ZFS?

I alluded to this topic a bit in the last section, but here I will explicitly define ZFS's requirements.

 

5.1 Operating System

To use ZFS you must be running one of the following OSs:

 

5.1.1 Solaris & Variants

These include Solaris, OpenSolaris, and Illumos.

Little-known fact: Solaris is free and can be downloaded from Oracle's website.

 

5.1.2 FreeBSD

It is likely that ZFS works on other BSD variants, but BSD isn't my cup of tea (I know little to nothing about it) so you'll have to find another source to either confirm or deny that.

 

5.1.3 Linux

I have personally used ZFS on Linux Mint and CentOS for extended periods of time with no problems and have tested it on Debian. As far as I'm aware it will run on most distros with one caveat: for the most part you cannot install the root FS on ZFS. There are some hacks to get this to work, but I have yet to try any of them. (If you really want everything on ZFS, it's worth looking into Solaris.) For a full list of the distros that are supported by ZOL, see section 9.3.

 

5.1.4 OS X

ZFS can actually be used on OS X as OS X is a Unix operating system, but that's outside of my expertise so I will not be covering that.

5.2 Drives

5.2.1 Array Drives

In general your main array drives will be HDDs. SSDs will work fine, but you will be paying a LOT more for your storage. The drives don't need to be anything special, though drives in the same array DO need to be the same size.

 

Since ZFS takes care of errors by saving checksums of all data (and pointers to data, for that matter), the quality of the drives used is rather unimportant. For example, there is no need to use enterprise drives for their lower read/write error rates, although it is worth using them for raw throughput if you're looking to build an array with very high throughput capabilities. However, even for applications requiring high throughput you can usually get away with 7,200 RPM consumer-grade drives and a larger ARC and L2ARC for lower total cost.

 

In my personal array I use WD Reds, but just about anything will work: all of WD's consumer line (black, red, green, blue) and all of Seagate's consumer line (desktop series, NAS series, etc.)

5.2.2 L2 ARC/ZIL Drives

For a low-end to mid-range ZFS server it's okay to put the L2ARC and ZIL partitions on the same drive, but for a high-end server there should be multiple drives (I'll get into that more in section 10, sample setups). That being said, the L2ARC and ZIL have essentially the same hardware requirement in that they should be SSDs. It should be noted that both the ZIL and L2ARC are optional for ZFS implementations, but they will greatly improve performance.

 

For low to mid range ZFS servers most consumer SSDs will do, and size is up to the individual user. I would recommend 64-240GB, with 120GB being sufficient for most applications. For drives of similar price, the drive with higher write endurance should be chosen as the L2 ARC/ZIL drives experience significant amounts of writing, even if the workload is mainly reads (moving data in and out of the cache requires heavy writing). The Intel 530 and Corsair Neutron GTX are both very good choices as they excel in sustained random performance (See Anandtech's SSD reviews (section 9.4) for a more in-depth explanation of SSD performance consistency).

 

If multiple drives are used for the L2ARC/ZIL, they are treated as a JBOD by ZFS. Therefore, in order to optimize the L2ARC/ZIL for the most IOPS/GB, high-end systems should use multiple SSDs; it is better to have more, smaller SSDs than a few large ones so that the overall throughput is better. 240GB drives are the sweet spot for this. As far as specific drive choice, I would advise Intel DC S3500 or DC S3700 series drives. Both are optimized for performance consistency and have high write endurance, with the S3700 series having the higher endurance of the two. PCIe or 12Gb/s SAS SSDs are a superior option, but that market is so small right now that I won't be touching on it here.

5.3 Memory

Memory is one thing ZFS really loves. Your average non-ZFS consumer-grade file server will run comfortably on 2GB of RAM, but a low to mid-range ZFS-based server should have an 8-16GB ARC, with no real upper bound for high-end servers (although too much is wasteful). Applications that do a lot of random reading (like virtual machines) will benefit the most from a large ARC/L2ARC. Sequential applications such as a media server will see very little benefit from a large ARC/L2ARC, and would likely run fine on 4GB.

 

Note that these memory specs are for the ARC, not the system memory. If the system is going to run other memory-heavy applications, take those into account when spec'ing the machine. If you want an ARC of 16GB and are running another server application that takes 10GB of RAM, the server should probably have 32GB of memory.
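If you want to explicitly cap the ARC rather than let ZFS take its default share of memory, ZFS on Linux exposes a zfs_arc_max module parameter (the same mechanism the appendix uses for zfs_arc_meta_limit). A hedged sketch, capping the ARC at 16GB (the value is in bytes):

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

That only lasts until reboot; to make it persistent, add the following line to /etc/modprobe.d/zfs.conf:

options zfs zfs_arc_max=17179869184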

 

The most important point about system memory for ZFS-based fileservers, however, is that it should be ECC (error correcting) memory. One of the few vulnerabilities of ZFS is memory corruption. Data stored in memory is liable to be corrupted by a rare bit-flip from cosmic radiation, but ECC memory solves this issue nicely.

5.4 Other Considerations

5.4.1 Motherboard

For a low-end ZFS server, any cheap consumer motherboard will do; however, for mid-range or high-end builds, a server motherboard should be used for its ECC memory and buffered DIMM support. See section 10 for more info.

5.4.2 CPU

ZFS is generally not too computationally intensive, but if you want to use some of its advanced features (compression, deduplication, L2ARC, etc.) it will need more horsepower to ensure ideal performance. Low-end and lower-mid-range systems will work well on an Intel Pentium or i3, or an AMD APU. Upper-mid-range systems will need quad-core chips such as a 4C Xeon E3, a 4C Xeon E5 (the E5 series supports buffered DIMMs, needed for more than 32GB of memory), or 4C Opterons. High-end systems will need one or more higher-core-count Xeon E5s or Opterons.

 

6 How do I get started?

6.1 Installation

 

6.1.1 Debian 7

wget http://archive.zfsonlinux.org/debian/pool/main/z/zfsonlinux/zfsonlinux_2%7Ewheezy_all.deb
dpkg -i zfsonlinux_2~wheezy_all.deb
apt-get update
apt-get install debian-zfs

Tested and working as of 12/23/2013 on Debian 7.0.2

6.1.2 Ubuntu 12.04

sudo apt-get install python-software-properties
sudo apt-add-repository --yes ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs

Note that the package python-software-properties is necessary for the command apt-add-repository, but is not included on a minimal installation of Ubuntu.

 

Tested and working as of 12/23/2013 on Ubuntu Server 12.04.3

 

6.1.3 CentOS 6.x

The CentOS install is a bit finicky.

First make sure you're running the newest kernel in the el6 repository (2.6.32-431 at the time of writing). The install WILL SILENTLY FAIL and you will not be able to use any of the ZFS functionality if you do not use this kernel. I'm hardly a Linux expert so I don't know exactly why this is, but if I find a fix/workaround in the future I'll be sure to add it here.

 

To update the kernel, run:

yum update kernel*
reboot now

YOU MUST REBOOT. The system will continue to run the old kernel and ZFS will not install until you reboot.

 

After rebooting, run:

yum update
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release-1-3.el6.noarch.rpm
yum install zfs

Tested and working as of 1/16/2014 on CentOS 6.4 and CentOS 6.5

6.1.4 Solaris

1. do nothing

2. ZFS is native

3. ?????

4. profit

 

6.1.5 FreeBSD

1. do nothing
2. ZFS is native
3. ?????
4. profit

6.2 Basic Setup

 

Note that the following test setup was done on an Ubuntu Server 12.04.3 virtual machine. Setup should be similar on most OSs, but if you run into any problems, please ask questions in the comments and I will answer them.

 

6.2.1 Creating Your First Zpool

To create a zpool, first find out what disks you have on your system. Open a terminal and enter the following command:

lsblk | grep "sd" 

The output should look something like this:

sda      8:0    0     8G  0 disk
├─sda1   8:1    0     4G  0 part /
├─sda2   8:2    0     1K  0 part
└─sda5   8:5    0     4G  0 part [SWAP]
sdb      8:16   0     8G  0 disk
sdc      8:32   0     8G  0 disk
sdd      8:48   0     8G  0 disk
sde      8:64   0     1G  0 disk
Notice that in this example I have 5 drives, labeled sda, sdb, sdc, sdd, and sde (sd refers to the Linux SCSI disk driver). The first drive, sda, is my boot drive and has 3 partitions labeled sda1, sda2, and sda5. The other 4 drives have no partitions.
 
In this example I will create a raidz array using the disks sdb, sdc, and sdd. Note that raidz is ZFS's version of RAID 5. Other levels include mirror (raid1), raidz2 (raid6), and raidz3 (triple fault tolerant raid, unofficially known as raid7).
 
To create a zpool, use the command:
sudo zpool create <pool name> <vdev>
For example:
eric@[Ubuntu]$ sudo zpool create happy-fun-pool raidz sdb sdc sdd -f
Note that the -f (force) option is needed if the drives already have partitions or data on them THAT YOU WANT TO OVERWRITE; ZFS will then use the entire disk and create new partitions.
 
To make sure that this command worked correctly, enter:
sudo zpool status <pool-name>
For example:
eric@[Ubuntu]$ sudo zpool status happy-fun-pool

Should output something like this:

  pool: happy-fun-pool
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        happy-fun-pool  ONLINE       0     0     0
          sdb           ONLINE       0     0     0
          sdc           ONLINE       0     0     0
          sdd           ONLINE       0     0     0

errors: No known data errors
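raidz is only one of the levels mentioned above; the others are created in exactly the same way. A rough, hedged sketch with hypothetical disk names (a mirror wants at least two disks, raidz2 and raidz3 want correspondingly more):

sudo zpool create <pool name> mirror sdb sdc
sudo zpool create <pool name> raidz2 sdb sdc sdd sde
sudo zpool create <pool name> raidz3 sdb sdc sdd sde sdf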

6.2.2 Adding a ZIL and L2ARC

Find out which disk is your SSD using  the following command:

lsblk | grep "sd"

The output should look something like this:

eric@[Ubuntu]$ lsblk | grep "sd"
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0     8G  0 disk
├─sda1   8:1    0     4G  0 part /
├─sda2   8:2    0     1K  0 part
└─sda5   8:5    0     4G  0 part [SWAP]
sdb      8:16   0     8G  0 disk
├─sdb1   8:17   0     8G  0 part
└─sdb9   8:25   0     8M  0 part
sdc      8:32   0     8G  0 disk
├─sdc1   8:33   0     8G  0 part
└─sdc9   8:41   0     8M  0 part
sdd      8:48   0     8G  0 disk
├─sdd1   8:49   0     8G  0 part
└─sdd9   8:57   0     8M  0 part
sde      8:64   0     1G  0 disk
Note that by looking at the 4th column we see that sde is smaller than the rest of my drives (it is 1GB while the rest are 8GB). I will be using it as my ZIL and L2ARC drive.
In order to have both the ZIL and L2ARC on the same drive, we need to give them separate partitions. To do that, use your favorite partitioning tool (fdisk is always a good one; a tutorial can be found here: http://www.tldp.org/HOWTO/Partition/fdisk_partitioning.html). Note that the ZIL doesn't need much space at all. If you're using a 120GB SSD for your ZIL/L2ARC, a 1GB ZIL should be sufficient. The rest should be left for the L2ARC.
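If you'd rather script the partitioning than walk through fdisk interactively, here's a hedged sketch using parted, assuming a 120GB SSD that shows up as /dev/sde (adjust the device name and sizes to your hardware; mklabel will destroy anything already on the disk):

sudo parted -s /dev/sde mklabel gpt
sudo parted -s /dev/sde mkpart zil 1MiB 1GiB
sudo parted -s /dev/sde mkpart l2arc 1GiB 100%

That gives you a ~1GB partition for the ZIL and leaves the rest of the drive for the L2ARC.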
 
To ensure that your partitioning was successful, run lsblk | grep "sd" again:
eric@[Ubuntu]$ lsblk | grep "sd"
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0     8G  0 disk
├─sda1   8:1    0     4G  0 part /
├─sda2   8:2    0     1K  0 part
└─sda5   8:5    0     4G  0 part [SWAP]
sdb      8:16   0     8G  0 disk
├─sdb1   8:17   0     8G  0 part
└─sdb9   8:25   0     8M  0 part
sdc      8:32   0     8G  0 disk
├─sdc1   8:33   0     8G  0 part
└─sdc9   8:41   0     8M  0 part
sdd      8:48   0     8G  0 disk
├─sdd1   8:49   0     8G  0 part
└─sdd9   8:57   0     8M  0 part
sde      8:64   0     1G  0 disk
├─sde1   8:65   0   800M  0 part
└─sde2   8:66   0   223M  0 part
Note that the drive sde now has two partitions labeled sde1 and sde2
 
 
After partitioning your SSD, you'll want to add the partitions to your zpool as cache and log devices. You can do so using the following command:
sudo zpool add <pool-name> cache </cache/partition/location> log </log/partition/location>
WARNING: For reasons explained under the warning section here: https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/ you should specify your log and cache drives by their device ids. To find those, enter:
ls /dev/disk/by-id/ | grep "ata"

That should output something like this. Note that your drive names will be very different if you are not in a virtual machine:

eric@[Ubuntu]$ ls /dev/disk/by-id/ | grep "ata"
ata-VBOX_CD-ROM_VB2-01700376
ata-VBOX_CD-ROM_VB5-1a2b3c4d
ata-VBOX_HARDDISK_VB2df90434-7b71f9c8
ata-VBOX_HARDDISK_VB2df90434-7b71f9c8-part1
ata-VBOX_HARDDISK_VB2df90434-7b71f9c8-part9
ata-VBOX_HARDDISK_VB512670cf-970bdc54
ata-VBOX_HARDDISK_VB512670cf-970bdc54-part1
ata-VBOX_HARDDISK_VB512670cf-970bdc54-part9
ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f
ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part1
ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part2
ata-VBOX_HARDDISK_VBf3faef2d-33fd94d2
ata-VBOX_HARDDISK_VBf3faef2d-33fd94d2-part1
ata-VBOX_HARDDISK_VBf3faef2d-33fd94d2-part2
ata-VBOX_HARDDISK_VBf3faef2d-33fd94d2-part5
ata-VBOX_HARDDISK_VBfd18c5cd-dbe29221
ata-VBOX_HARDDISK_VBfd18c5cd-dbe29221-part1
ata-VBOX_HARDDISK_VBfd18c5cd-dbe29221-part9

Now, to actually add the cache and log devices to my zpool, I entered the following. (tip: tab completion makes your life a lot easier):

eric@[Ubuntu]$ sudo zpool add happy-fun-pool cache /dev/disk/by-id/ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part1 log /dev/disk/by-id/ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part2

To ensure the devices were added successfully, again run sudo zpool status <pool-name>. My output looked like this:

eric@[Ubuntu]$ sudo zpool status happy-fun-pool
  pool: happy-fun-pool
 state: ONLINE
  scan: none requested
config:

        NAME                                           STATE     READ WRITE CKSUM
        happy-fun-pool                                 ONLINE       0     0     0
          sdb                                          ONLINE       0     0     0
          sdc                                          ONLINE       0     0     0
          sdd                                          ONLINE       0     0     0
        logs
          ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part2  ONLINE       0     0     0
        cache
          ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part1  ONLINE       0     0     0

errors: No known data errors

To check for the sizes of all devices (to ensure the correct partitions/drives have been used), you can use:

sudo zpool iostat -v

Which will output something like:

eric@[Ubuntu]$ sudo zpool iostat -v
                                                 capacity     operations    bandwidth
pool                                           alloc   free   read  write   read  write
---------------------------------------------  -----  -----  -----  -----  -----  -----
happy-fun-pool                                  148K  23.8G      0      0    712  4.19K
  sdb                                          54.5K  7.94G      0      0    235  1.02K
  sdc                                          39.5K  7.94G      0      0    231  1.01K
  sdd                                          54.5K  7.94G      0      0    233  1.02K
logs                                               -      -      -      -      -      -
  ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part2      0   218M      0      0     20  1.98K
cache                                              -      -      -      -      -      -
  ata-VBOX_HARDDISK_VB524c75de-aaa4cc3f-part1  28.5K   795M      0      0    116     44
---------------------------------------------  -----  -----  -----  -----  -----  -----

6.2.3 Mounting and Unmounting the Pool

Mounting your new zpool is trivial. Just set the mountpoint:

sudo zfs set mountpoint=<mountpoint> <pool-name>

For example:

eric@[Ubuntu]$ sudo zfs set mountpoint=/home/eric/pool happy-fun-pool

Then simply mount the filesystem with either:

sudo zfs mount -a 

To mount all pools, or:

sudo zfs mount <pool-name> 

To mount a single pool.

 

To check where a pool mounts to, enter:

sudo zfs list 

Which will show you the mountpoint:

eric@[Ubuntu]$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
happy-fun-pool   128K  23.4G    30K  /home/eric/pool

Similarly to unmount all pools enter:

sudo zfs umount -a

or enter:

sudo zfs umount <pool-name> 

to unmount a single pool.

 

7 The Zpool

7.1 Datasets

From a very basic standpoint datasets are just directories, but they are capable of a lot more. ZFS lets you set properties for datasets that are different from the rest of the pool, as well as take snapshots of any given dataset or even an entire pool.

7.1.1 Creating Datasets

Creating datasets is fairly straightforward. Simply use the following command:

zfs create <pool_name/path/to/dataset> 

For example:

zfs create happy-fun-pool/test

Note that there is no slash in front of the pool name.
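Datasets can be nested, and each one can carry its own properties. As a small, hedged example building on the hypothetical dataset created above, you could give a child dataset a quota so it can't eat the whole pool:

zfs create happy-fun-pool/test/documents
zfs set quota=100G happy-fun-pool/test/documents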

7.1.2 Compression

Compression is one of ZFS's many, many configurable properties.

 

To enable it, simply enter:

zfs set compression=<algorithm> <dataset> 

Where algorithm is one of the following:

on | off | lzjb | gzip | gzip-[1-9] | zle | lz4 

Here's my personal system:

# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
tank                  3.95T  6.70T   384K  /tank
tank/media            3.02T  6.70T   336K  /tank/media
tank/media/movies     1.73T  6.70T  1.73T  /tank/media/movies
tank/media/music      41.3G  6.70T  41.3G  /tank/media/music
tank/media/pictures    272K  6.70T   272K  /tank/media/pictures
tank/media/tvshows    1.25T  6.70T  1.25T  /tank/media/tvshows
tank/minecraft         272K  6.70T   272K  /tank/minecraft
tank/nfs               400M  6.70T   400M  /tank/nfs
tank/server           1.26M  6.70T   320K  /tank/server
tank/server/vms        975K  6.70T   975K  /tank/server/vms
tank/upload            847K  6.70T   304K  /tank/upload
tank/upload/active     272K  6.70T   272K  /tank/upload/active
tank/upload/finished   272K  6.70T   272K  /tank/upload/finished
tank/user              945G  6.70T   288K  /tank/user
tank/user/eric         945G  6.70T   945G  /tank/user/eric
tank/vdisks           3.02G  6.70T  3.02G  /tank/vdisks

One of the cool things about using ZFS datasets is that they inherit properties from their parents. For example, I enabled lz4 compression for the whole pool with:

zfs set compression=lz4 tank 

And then disabled it for my media dataset, because movies, pictures, music, etc. are generally incompressible, so trying to compress them would just waste CPU cycles.

zfs set compression=off tank/media 

Now the child datasets of tank/media (specifically: tank/media/movies, tank/media/music, tank/media/pictures, and tank/media/tvshows) all have compression disabled as well.

 

This can be confirmed with a simple:

# zfs get compression tank/media/pictures
NAME                 PROPERTY     VALUE     SOURCE
tank/media/pictures  compression  off       inherited from tank/media

It's worth noting that compression is computationally very cheap and you should basically always enable it for your storage pools. Also of note is that compression only applies to data written after it is turned on. ZFS will not go back and compress data that was written before compression was enabled.
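To see how much space compression is actually saving you, each dataset exposes a read-only compressratio property. For example, using my pool from above:

zfs get compressratio tank

A value of 1.00x means nothing compressed; higher is better.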

7.1.3 Deduplication

Deduplication is fairly interesting on ZFS. For those who don't know, deduplication is the process of storing duplicate data as a pointer to the original data instead of storing the same data twice. In practice, this means that things like backups, virtual disks, and documents will take up much less disk space than they otherwise would. Deduplication is accomplished by taking the SHA256 hash of every block written to the disk and storing the hash value in memory. Then, for every new block that comes through, if its hash matches an existing hash, the block is stored as a pointer to the original data instead of being written to disk. It does come with some caveats, however.

 

Deduplication takes a LOT of memory to store the deduplication table (or DDT). On the order of about 5GB of memory per TB of used disk space. To make things worse, ZFS will only use up to 1/4 of max ARC (memory cache) size for the DDT by default. I'll go into this more in a later post as there's a lot more to ZFS dedup tuning.

 

To enable it, use:

zfs set dedup=on <dataset> 
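Once dedup has been running for a while, you can check how much it is actually saving you; the pool exposes a read-only dedupratio property (hedged example with a placeholder pool name):

zpool get dedupratio <pool name>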

7.2 Snapshots

nothing here yet (lots to come)

7.3 Scrubbing

Scrubbing is the process of reading all the data on the filesystem, hashing each block (both data blocks and metadata blocks), and comparing it to the stored hash. If the hashes do not match, then ZFS will attempt to correct any errors using the non-corrupt data on redundant disks. This process is one of the selling points of ZFS and is one of the main sources of its reliability. Scrubs should be run between once a week and once a month, and they should be run during non-peak load hours as they will hamper system performance (though the system will still be very much usable). To be explicit: scrubs are performed while the filesystem is online and mounted, not offline as with things like fsck.

 

To scrub the filesystem, run:

zpool scrub <pool name> 

To get the status of the scrub, run:

zpool status <pool name> 

and you will get an output that looks something like this:

# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Jan 16 00:31:05 2014
    65.0G scanned out of 5.93T at 403M/s, 4h14m to go
    0 repaired, 1.07% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdb                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
        logs
          scsi-3600605b0052bc3b01a5a6b2c099f2d74-part1  ONLINE       0     0     0
        cache
          scsi-3600605b0052bc3b01a5a6b2c099f2d74-part2  ONLINE       0     0     0

errors: No known data errors

Note that scrubs run fairly quickly. My array consists of 6x3TB WD reds and it is scrubbing at just over 400MB/s.

 

If you accidentally start a scrub when you don't want to, or the performance penalty is more than you're comfortable with, you can run:

zpool scrub -s tank

to cancel the scrub.
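Since scrubs should happen somewhere between weekly and monthly, it's worth scheduling them instead of remembering to run them by hand. A hedged sketch of a system crontab entry (assumes the pool is named tank, that zpool lives in /sbin, and that 2am Sunday is off-peak for you); add it to /etc/crontab or a file in /etc/cron.d/:

0 2 * * 0 root /sbin/zpool scrub tank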

 

7.4 Resilvering

Resilvering is the process of rebuilding a redundant ZFS vdev after a drive fails and a new one is inserted. It should happen by default when a dead drive is replaced. More on this later.
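Until I write that up properly, the short version: after physically swapping in the new disk, you tell ZFS about the replacement and it starts resilvering on its own. A hedged sketch with placeholder device names:

zpool replace <pool name> <old device> <new device>
zpool status <pool name>

The status output will show the resilver in progress along with an estimated time to completion.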

 

I'm going to pull a drive from a real (well virtual) system to show what the output looks like from a real, degraded system.

7.5 Pool Monitoring

ZFS has a few good commands for monitoring your ZFS devices and datasets.

 

One of my favorites is:

zpool iostat -v <interval> 

Which will give you real time io statistics and cache/log usage. Here's a sample output from my system:

zpool iostat -v 2
                                                  capacity     operations    bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                            5.93T  10.3T  2.01K    138   250M  3.82M
  raidz2                                        5.93T  10.3T  2.01K    138   250M  3.82M
    sdc                                             -      -    741     37  63.5M  1.19M
    sdb                                             -      -    725     39  63.9M  1.15M
    sdf                                             -      -    688     39  62.4M  1.13M
    sdg                                             -      -    674     35  63.2M  1.18M
    sdh                                             -      -    667     35  64.2M  1.17M
    sdi                                             -      -    631     37  63.8M  1.18M
logs                                                -      -      -      -      -      -
  scsi-3600605b0052bc3b01a5a6b2c099f2d74-part1   400K  1016M      0      0      0      0
cache                                               -      -      -      -      -      -
  scsi-3600605b0052bc3b01a5a6b2c099f2d74-part2  25.8G   168G      0     44      0  3.65M
----------------------------------------------  -----  -----  -----  -----  -----  -----

It will continue to spit this output at you as it changes with time until you stop it with ctrl-c.

 

Another very useful command for viewing the status of your pool and its devices is:

zpool status <pool name> 

This will give you some basic health information about your pool.

# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Jan 16 00:31:05 2014
    293G scanned out of 5.93T at 420M/s, 3h54m to go
    0 repaired, 4.82% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdb                                         ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
            sdi                                         ONLINE       0     0     0
        logs
          scsi-3600605b0052bc3b01a5a6b2c099f2d74-part1  ONLINE       0     0     0
        cache
          scsi-3600605b0052bc3b01a5a6b2c099f2d74-part2  ONLINE       0     0     0

errors: No known data errors

Here you can see my pool is running a scrub and all the devices in the pool are in good health. The read/write/checksum columns will report the number of errors for that given operation.

7.6 Expanding a Zpool

<nothing here yet>

7.7 ZFS Raid Levels

<nothing here yet>

 

8 Performance Analysis

<nothing here yet>

 

9 Further Reading

9.1 ZFS Administration Best Practices

Oracle's guide on best ZFS administration practices (this is definitely worth a read)

http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html

9.2 ZFS Primer

A ZFS primer from Oracle:

https://blogs.oracle.com/partnertech/entry/a_hands_on_introduction_to1

9.3 ZOL Website

The ZFS on Linux homepage. This page has a list of all supported distributions.

http://zfsonlinux.org/

9.4 Anandtech SSD Reviews

Reference these when choosing an SSD for the L2 ARC/ZIL.

http://www.anandtech.com/tag/ssd

9.5 ZFSonLinux Setup Guide

This is where I first learned about ZFS

https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

 

10 Sample Setups

Please note that these are only sample setups. If you notice hardware/brand bias in these builds, it's not because I don't like brand X or that I think brand Y performs better than brand Z, it's just because these are brands that I happen to use. These are just sample builds.

 

To be perfectly clear, THESE ARE JUST EXAMPLES. I don't want anyone in the comments telling me "If you swap x for y in the build it will be $10 cheaper." That's not the point of this. These are just ballparks.

 

Also note that if anyone has specific system requirements/brand preferences, tell me what you're looking for in the comments and I will gladly throw together some system specs for you. (e.g.: "I want a 16TB box and I prefer Seagate, AMD, and Kingston. I also want it to be fast so I can run VMs on it.")

 

Also, if you have specific questions like "how should I partition my SSD for maximum performance?" please ask.

10.1 Entry Level: $600, 3TB

CPU: Pentium G3220                             $  60
mobo: ASUS H81M-K mATX board                   $  60
memory: 4GB of cheap DDR3                      $  40
storage: 2x3TB WD Reds                         $ 260
boot drive: 60GB Kingston SSDNow V300 SSD      $  70
case: NZXT Source 210                          $  40
PSU: Corsair CX 430                            $  45

Throw the 2x3TB drives in a mirror and you've got yourself a nice, low power box with 3TB of available storage.

10.2 Lower-Mid range: $1200, 9TB

CPU: Core i3 4130                              $ 130
mobo: Supermicro MBD-X10SSL-F-O                $ 170
memory: 2x4GB of unbuffered, ECC memory        $ 105
storage: 4x3TB WD Reds                         $ 535
boot drive: 240GB Crucial M500                 $ 150
case: NZXT Source 210                          $  40
PSU: Corsair CX 430                            $  45

Put the 4x3TB drives in raidz (raid 5). Use 100GB of the M500 for boot, 20GB for swap, 60GB for the L2ARC and 1GB for the ZIL. Leave the rest of the drive empty so that it will still get high IOPS when the L2ARC is full.

 

Note: I went with an M500 specifically because it has power-loss data protection (aka capacitor banks to flush the cache), which I felt was important for a server. It's also very cheap given its size.

10.4 Mid range: $1800

10.5 Upper-Mid range: $2800, 24TB

CPU: Xeon E3 1220v3                            $ 205
mobo: Supermicro MBD-X10SSL-F-O                $ 170
memory: 2x8GB of unbuffered, ECC memory        $ 200
storage: 8x4TB WD Reds                         $1520
boot drive: 120GB Crucial M500                 $  90
case: Fractal Define R4                        $ 110
PSU: SeaSonic G Series 550W                    $  85
ZIL/L2ARC drive: 180GB Intel 530               $ 150
Heatsink: Noctua NH-L9i                        $  50
HBA: LSI 9211-4i                               $ 165

Note that you will need some molex-to-sata power cables because the PSU does not have enough sata power connectors. Also you will probably need to get 

 

Put the 8 HDDs in raidz2, install your OS to the Crucial drive, and use the Intel drive for the ZIL/L2ARC. Use 140GB of the Intel drive for the L2ARC and 1-5GB for the ZIL. This gives an extra 20% over provisioning so that the drive stays fast and healthy over time.

 

I would enable deduplication on a box like this since L2ARC is big enough to hold a very large dedup table.

 

If this is enough storage for you, but you want the server to be faster, upgrade the WD reds to WD SE drives (or similar enterprise SATA drive), and add in a second ZIL/L2ARC drive with the same partition scheme as the first to double your cache size and cache IOPS. (Note that to add another cache drive to this system you will need the 8 port HBA rather than the 4 port one). Furthermore, consider adding an infiniband HCA, 10GigE NIC, or dual/quad port GigE NIC to take full advantage of the speed this setup can offer.

 

If this setup is fast enough for you but you want more space, you should upgrade the HBA to the 8-port card, get two more drives, and mount them in the 5.25" bays.


Appendix A: Deduplication

This section was adapted from an email I sent to a joint Stony Brook University/Harvey Mudd College research team I'm working with that is developing deduplication software and wanted to compare their software to ZFS dedup. I apologize if there are any bits in here that seem out of place.

 

As mentioned in section 7.1.3, there is a lot you should know before running deduplication on ZFS.

 

To get maximum performance and dedup ratios with ZFS deduplication you will need to consider:

    1. record size
    2. total DDT (dedup table) size
    3. checksum (hashing) throughput
 
1. Record Size:
The default ZFS recordsize (block size) is 128k. This is only half true, however: recordsize is dynamic, and 128k is the maximum. Recordsize can still be tuned to any value between 0.5k and 128k, but be aware that smaller records mean more entries in the DDT and you will quickly run out of memory.
 
You can use:
# zfs set recordsize=<recordsize> <pool name/path/to/dataset>
to change the recordsize. Also to find what recordsize is being used, you can just run:
# zfs get recordsize <pool name/path/to/dataset>
It is important to note that the zfs man page says the following:
 
    recordsize=size
        Specifies a suggested block size for files in the file system. This prop-
        erty is designed solely for use with database workloads that access files
        in  fixed-size  records. ZFS automatically tunes block sizes according to
        internal algorithms optimized for typical access patterns.
 
but some online sources say that altering recordsize can increase performance and dedup ratio if you know what you're doing, so I suppose the best advice is to try various recordsizes and see if they perform better than the default.
 
2. DDT size
From what I've read, each entry in the DDT takes 320 bytes. I've run into a few sources that conflicted on this value, but 320 is the number that I've seen the most. If you do the math, this means that with a 128k recordsize, 1TB on disk takes 2.5GB in memory, which contradicts the "advised 5GB of memory per 1TB on disk" that I've read in a few ZFS sysadmin guides. This is because ZFS uses a dynamic recordsize and 128k is the default max.
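Spelling that arithmetic out (rounded figures):

1 TB / 128 KB per record = ~8 million records
~8 million records x 320 bytes per DDT entry = ~2.5 GB of memory

If the average record ends up smaller than 128k (which it will, since recordsize is dynamic), the entry count and the memory use scale up accordingly, which is where the more conservative 5GB-per-TB guideline comes from.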
 
More importantly, the DDT is considered metadata in the eyes of ZFS, so its max size in the ARC can be configured. (by default metadata will only consume up to 1/4 of the total allowed ARC size)
 
To adjust the ARC options at run time you'll want to look in /sys/module/zfs/parameters. To change the metadata size limit in the ARC you'll want to run something like:
# echo 10737418240 >> /sys/module/zfs/parameters/zfs_arc_meta_limit
to set it to 10GB.
 
Note that this is a run time configuration only. To configure a value persistent through reboot you'll have to create a config file:
# vi /etc/modprobe.d/zfs.conf
and add the following line:
options zfs zfs_arc_meta_limit=10737418240
Notes:
a. All the values in /sys/module/zfs/parameters will be 0 by default. To see what the current values actually are you can run
# cat /proc/spl/kstat/zfs/arcstats
b. To have the values in /etc/modprobe.d/zfs.conf take effect, you must reboot the system.
 
c. You can get some useful info about the size of the DDT by using:
# zpool status <pool name> -D

The DDT info will be at the bottom.

 
It's important to note that once it fills its allotted chunk of the ARC, the DDT will spill into the L2ARC if one is present.
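Before turning dedup on for real, you can also ask ZFS to estimate how well a pool's existing data would deduplicate, and therefore roughly how big the DDT would get. A hedged example (zdb -S walks the whole pool, so it can take a long time and hammer the disks):

# zdb -S <pool name>

The output ends with a simulated DDT histogram and an estimated dedup ratio.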
 
3. Checksumming
ZFS uses the SHA256 algorithm to checksum all data on disk, and if dedup is enabled it will compare these checksums to decide whether two blocks contain the same data. By default ZFS does not compare the actual data in each block, just the checksums.

Checksumming is not exactly tunable, but it's worth noting that on slow systems with very fast disk arrays, the CPU may actually bottleneck the disks because it cannot compute checksums fast enough. If you're using even remotely modern hardware this likely will not be an issue.
 
4. Extraneous:
The zdb (ZFS debug) command can give you a lot of good info if you pass it
the appropriate flags.
 
 
Sources/further reading:
This article mentions that DDT entry size is 320bytes.
 
Blog from one of the guys who wrote the original ZFS dedup functionality.
 
Another Oracle blog, the one on dedup usage and performance considerations.
 
The Gentoo wiki article on ZFS. Explains how to tune/configure ZoL.
 
The Oracle ZFS administration guide. The ZoL website links to this page,
saying that "the majority of it is directly applicable to ZFS on Linux."
 
The ZoL documentation.
 
A ZoL user guide that is linked to from the ZoL docs page.
 
Specifically these two pages on deduplication.
 
Also, the man pages for zfs, zpool, and zdb. (these on their own actually 
contain a lot of the data from the above sources)


@wpirobotbuilder @looney @alpenwasser here it is!  :D Thought you guys would be interested in this.

I am. :D


@wpirobotbuilder @looney @alpenwasser here it is!  :D Thought you guys would be interested in this.

Most definitely.



Aw man, I was going to write something up with some recommended ZFS configurations. Beat me to it :P


Aw man, I was going to write something up with some recommended ZFS configurations. Beat me to it :P

Sorry :P


This is a beast of a tutorial, good work


This is a beast of a tutorial, good work

Thanks  :D  I've put a lot of time into it, so I just hope someone can get some use out of it :)


This is brilliant to say the least. Thanks for the effort.


So this tutorial is nearing completion, but I'm about to start school again so I won't have much time to work on it.

 

Are there any topics that I should have covered or that people would like me to cover that I haven't covered yet?


One suggestion: change from this color to this color (one to the right of the first blue on the 'More Colors' selector). It's easier to read on a dark background for night/dark theme users.
 

4.4 Hardware

For ZFS to deliver maximum performance, it needs a decent amount of RAM and a decent sized cache. For home use, 8GB of memory and a 64-120GB SSD are sufficient, though power users may notice some benefit at up to 32GB of RAM and 250-500GB of SSD space. ECC memory is desirable for maximum data integrity, but not required.

Hmm. What do you mean by cache? Since you are specifically talking about ZFS and not a zpool in a vdev, I assume you mean L2ARC. 

If you mean that, read bullet point 7 on this slide:

[slide image]

Which came from here and was linked here at the FreeNAS forums.

I hope you see what the problem would be in using a 64GB+ SSD for L2ARC when a large amount of RAM is needed to keep up with that much cache. Their example of 32GB of RAM for a 120GB L2ARC would mean a 64GB L2ARC would need a minimum of 16GB of RAM, roughly.

Now, if you meant cache as in normal cache for a single volume (rather than the whole FS), then I have the question of "Does ZFS treat that as L2ARC or not?" I assume not, since normal cache doesn't need to be stored in RAM like L2ARC does.

Also, one question:

Let's say I make a VDev using ZFS with HDDs connected to my motherboard's onboard controller. Then I move the HDD's to an HBA controller later. 

Will ZFS care? Will it see the VDev still be seen and work fine or will I have to rebuild it?


-snip-

<interesting info>

Interesting, I've never heard that before. I'll keep that info in mind but I'm not sure I fully believe it and here's why:

 

First of all, that slideshow is specifically about ZFS on FreeBSD. I've never read anything like that about ZFS on Linux or Solaris ZFS, and they are entirely different code bases at this point (at least as far as I know). Specifically, I don't believe that an L2ARC takes up that much memory for indexing. In that thread, someone mentions that each index entry in the ARC for the L2ARC takes around 200 bytes of memory. Then in this thread (http://forums.freenas.org/threads/at-what-point-does-l2arc-make-sense.17373/) one of the same members says each entry takes 380 bytes in memory. Let's assume the higher number is right and do a little math:

 

We know that ZFS uses dynamic recordsize, with a default max of 128k. Even if your average recordsize in your L2ARC is 16k, you'll only use as much memory as 1/40 of your L2ARC size for indexing. So you could very feasibly run a system with 32GB of memory and a 500GB L2ARC without any problems. (around 12.5GB of the ARC would be used for L2ARC indexing). If your average recordsize in the L2ARC was 64k then you would only be using around 3GB of memory to index 500GB of L2ARC.
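For reference, the arithmetic behind those figures (at 380 bytes per L2ARC index entry):

500 GB L2ARC / 16 KB average record = ~30 million entries, x 380 bytes = ~12 GB of ARC used for indexing
500 GB L2ARC / 64 KB average record = ~8 million entries,  x 380 bytes = ~3 GB of ARC used for indexing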

 

I think the numbers are a little off in that presentation, or the writer doesn't fully understand ZFS recordsize.

 

 

Also, one question:

Let's say I make a VDev using ZFS with HDDs connected to my motherboard's onboard controller. Then I move the HDD's to an HBA controller later. 

Will ZFS care? Will it see the VDev still be seen and work fine or will I have to rebuild it?

I haven't tested that specific case, but as long as the controller has IT mode firmware, then it should work without a hitch.

 

Edit: I changed the colors.


I haven't tested that specific case, but as long as the controller has IT mode firmware, then it should work without a hitch.

Awesome.

Then one last question: How can I know if a controller has IT mode firmware? The specific one I need to know about is this one.

 

I don't want to have to flash the controller's firmware, so buying one with it is kind of the only option. However, I have no idea how to know if it has that or not.


Awesome.

Then one last question: How can I know if a controller has IT mode firmware? The specific one I need to know about is this one.

 

I don't want to have to flash the controller's firmware, so buying one with it is kind of the only option. However, I have no idea how to know if it has that or not.

Google is your friend: here's the product page on the Supermicro website: http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=E

 

"The AOC-USAS2-L8e supports 122 devices as a HBA in IT mode."

 

Also, that page has the drivers and tools you will need for IT mode (and the user guide PDF; Supermicro's user guides tend to be very good, so I would give that a read).


What's your take on btrfs? Or is that another post entirely? :P

I haven't played around with btrfs at all, so I can't say too much about it. The developers somewhat recently changed its status from "not ready for production," to "ready for experimental use," or something to that end.

 

Give it a few years and I think it will start to be the filesystem of choice for reliable data storage on a medium sized scale, mainly just because it is license-compatible with Linux.

 

Put it this way: I plan on using btrfs for the root filesystem on my next server, and XFS (not ZFS) for data storage.


Google is your friend: here's the product page on the Supermicro website: http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=E

 

Also, that page has the drivers and tools you will need for IT mode (And the user guide PDF. Supermicro's user guides tend to be very good, so I would give that a read.

Well. I feel dumb. Sorry. I mean, I saw that IT mode thing on that page before thanks to Google, but I wasn't sure if that meant I would have to do something similar to flashing the firmware or not. 

I figured if it said that, it either had an option in the BIOS to switch to IT mode, or it required something much more intensive to get into IT mode. 

And if it's the former, I'll probably just go into the BIOS from the start and do that then try it without bothering with drivers or tools. I plan on using FreeNAS, so installing the drivers or messing with the tools would probably be a nightmare for someone who wants it to be as plug N play as possible. 

Will do. Thanks.

Edit:

Ya know what. I haven't had much sleep. I'm going to get some. I'm sorry for all the stupidity coming from me as of the past several posts. I'm just gonna shut up now.

Awesome tutorial. I should learn a lot from it once I'm not stupid anymore.


I haven't played around with btrfs at all, so I can't say too much about it. The developers somewhat recently changed its status from "not ready for production," to "ready for experimental use," or something to that end.

 

Give it a few years and I think it will start to be the filesystem of choice for reliable data storage on a medium sized scale, mainly just because it is license-compatible with Linux.

 

Put it this way: I plan on using btrfs for the root filesystem on my next server, and XFS (not ZFS) for data storage.

 

From what I gather, its redundancy features save loads of time and space when copying large files, as the filesystem will make a hard link of the file instead, and when you write to the link it will write the changes to a new file or new sectors.

 

Also (I believe ZFS has this): built-in file-system volume management. Finally: no more of this pussyfooting around.

 

Do you however see btrfs replacing ext4 as a desktop and workstation solution? Or would that be tantamount to giving a chess-board to a baby?

Thoroughness rating
#########


Do you however see btrfs replacing ext4 as a desktop and workstation solution? Or would that be tantamount to giving a chess-board to a baby?

I believe there has been talk from openSUSE, Oracle Linux, and RHEL of making btrfs their default root filesystem once it becomes stable.


I'm sorry for all the stupidity coming from me as of the past several posts.

Not stupidity :) I always appreciate critiques and new information. When I have some free time I'm definitely going to look more into some specifics about ARC usage for indexing the L2ARC, and more specifics about the inner workings of the ZIL.

 

Get some sleep though. Sleep deprivation sucks.

