ZFS Record Size
Marsell Kukuljevic of Joyent wrote me to say (paraphrasing):
"I thought ZFS record size is variable: by default it's 128K, but write 2KB of data (assuming nothing else writes), then only 2KB writes to disk (excluding metadata). What does record size actually enforce?
I assume this is affected by transaction groups, so if I write 6 2K files, it'll write a 12K record, but if I write 6 32K files, it'll write two records: 128K and 64K. That causes a problem with read and write magnification in future writes though, so I'm not sure if such behaviour makes sense. Maybe recordsize only affects writes within a file?
I'm asking this in context of one of the recommendations in the evil tuning guide, to use a recordsize of 8K to match with Postgres' buffer size. Fair enough, I presume this means that records written to disk are then always at most 8KB (ignoring any headers and footers), but how does compression factor into this?
I've noticed that Postgres compresses quite well. With LZJB it still gets ~3x. Assuming a recordsize of 8K, then it'd be about 3KB written to disk for that record (again, excluding all the metadata), right?"
The recordsize property enforces the size of the largest block written to a ZFS file system or volume. There is an excellent blog post about the ZFS recordsize by Roch Bourbonnais here. Note that ZFS does not always read/write recordsize bytes. For instance, a write of 2KB to a file will typically result in at least one 2KB write (and maybe more than one, for metadata). The recordsize is simply the largest block that ZFS will read/write. The interested reader can verify this by using DTrace on bdev_strategy(); a sketch follows. Also note that because of the way ZFS maintains information about allocated/free space on disk (i.e., space maps), a smaller recordsize should not result in more space or time being used to maintain that information.
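Here is a rough sketch of that exercise (untested here; it assumes an illumos/SmartOS system where the fbt provider can see bdev_strategy(), and it uses B_READ, which DTrace's io.d library should define as an inline). Run it while doing the writes and it aggregates the sizes of the physical I/O:

# dtrace -n '
fbt::bdev_strategy:entry
{
        /* split physical I/O by direction and aggregate the sizes */
        @[args[0]->b_flags & B_READ ? "read" : "write"] =
            quantize(args[0]->b_bcount);
}'

When the pool lives on a file (as in the examples below), the I/O goes through the underlying file system rather than straight to a block device, so this is best done against a pool built on real devices.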
Instead of repeating the blog post, let's do some experimenting.
To make things easy (i.e., we don't want to sift through tens of thousands of lines of zdb(1M) output), we'll create a small pool and work with that. I'm assuming you are on a system that supports ZFS and has zdb. SmartOS would be an excellent choice...
# mkfile 100m /var/tmp/poolfile
# zpool create testpool /var/tmp/poolfile
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      default
testpool  compression  off       default
#
An alternative to using files (/var/tmp/poolfile) is to create a child dataset using the zfs command and run zdb on the child dataset. This also cuts down on the amount of data displayed by zdb.
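For example (the dataset name is arbitrary):

# zfs create testpool/small
# zdb -dddddddd testpool/small

In the examples below I just run zdb against the whole pool, since it is tiny.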
We'll start with the simplest case:
# dd if=/dev/zero of=/testpool/foo bs=128k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        21    1    16K   128K   128K   128K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 03:50:24 2013
        mtime   Thu Mar 21 03:50:24 2013
        ctime   Thu Mar 21 03:50:24 2013
        crtime  Thu Mar 21 03:50:24 2013
        gen     2462
        mode    100644
        size    131072
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:1b4800:20000 20000L/20000P F=1 B=2462/2462

                segment [0000000000000000, 0000000000020000) size  128K
...
#
From the above output, we can see that the "foo" file has one block. It is on vdev 0 (the only vdev in the pool), at offset 0x1b4800 (relative to the 4MB label at the beginning of every disk), and its size is 0x20000 (=128K). Note that if you're following along and don't see the "/foo" file in your output, run sync, or wait a few seconds. Generally, it can take up to 5 seconds before the data is on disk. This implies that zdb reads from disk, bypassing the ARC (which is what you want in a file system debugger).
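If you want to convince yourself that the offset really is relative to that 4MB of label, you can pull the block straight out of the backing file. This is just a sketch, assuming a bash-style shell; the 0x1b4800 comes from the zdb output above:

# dd if=/var/tmp/poolfile bs=512 iseek=$(( (0x400000 + 0x1b4800) / 512 )) count=256 2>/dev/null | od -c | head

Since /foo was written from /dev/zero and compression is off, you should see nothing but zero bytes.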
Now let's do the same for a 2KB file.
# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=2k count=1
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        22    1    16K     2K     2K     2K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:21:25 2013
        mtime   Thu Mar 21 04:21:25 2013
        ctime   Thu Mar 21 04:21:25 2013
        crtime  Thu Mar 21 04:21:25 2013
        gen     2839
        mode    100644
        size    2048
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:180000:800 800L/800P F=1 B=2839/2839

                segment [0000000000000000, 0000000000000800) size    2K
...
#
So, as Marsell notes, the block size is variable. Here, the foo file is at offset 0x180000 and its size is 0x800 (=2K). What if we give dd a block size larger than 128KB?
# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=256k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        23    2    16K   128K   258K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:23:32 2013
        mtime   Thu Mar 21 04:23:32 2013
        ctime   Thu Mar 21 04:23:32 2013
        crtime  Thu Mar 21 04:23:32 2013
        gen     2868
        mode    100644
        size    262144
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:1f0c00:400 0:12b5a00:400 4000L/400P F=2 B=2868/2868
               0  L0 0:1b3800:20000 20000L/20000P F=1 B=2868/2868
           20000  L0 0:180000:20000 20000L/20000P F=1 B=2868/2868

                segment [0000000000000000, 0000000000040000) size  256K
...
#
This time, the file has two blocks, each 128KB. Because the data does not fit into one block, there is one indirect block (a block containing block pointers) at 0x1f0c00, and it is 0x400 (1KB) on disk. The indirect block is compressed; decompressed, it is 0x4000 bytes (=16KB). The "4000L/400P" refers to the logical size (4000L) and the physical size (400P): logical is the size after decompression, physical is the size as compressed on disk. Note that the compression property does not apply to indirect blocks; like other metadata, they are always compressed (with lzjb, as far as I know).
Now we'll try creating six 2KB files and see what that gives us. (Note that some output has been omitted.)
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; done
...
# sync
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:80000:800 800L/800P F=1 B=4484/4484

        path    /f2
Indirect blocks:
               0 L0 0:81200:800 800L/800P F=1 B=4484/4484

        path    /f3
Indirect blocks:
               0 L0 0:81a00:800 800L/800P F=1 B=4484/4484

        path    /f4
Indirect blocks:
               0 L0 0:82200:800 800L/800P F=1 B=4484/4484

        path    /f5
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4484/4484

        path    /f6
Indirect blocks:
               0 L0 0:87200:800 800L/800P F=1 B=4484/4484
...
So, they all fit within the same 128KB region (between 0x80000 and 0xa0000), and they are all in the same transaction group (4484). There is a gap between the space used for f1 and f2, and another before f6, but f2 through f5 are contiguous. Does this result in one write to the disk? Hard to say, as it is difficult to correlate writes to ZFS files with writes to the disk, and also because the "disk" here is actually a file. It should be possible to find out by running the bdev_strategy() DTrace sketch from earlier against a pool built on real disks. Would we get the same behavior if the writes were in separate transaction groups? Let's try to find out.
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; sleep 6; done
...
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:ef400:800 800L/800P F=1 B=4827/4827

        path    /f2
Indirect blocks:
               0 L0 0:fd800:800 800L/800P F=1 B=4828/4828

        path    /f3
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4829/4829

        path    /f4
Indirect blocks:
               0 L0 0:88000:800 800L/800P F=1 B=4831/4831

        path    /f5
Indirect blocks:
               0 L0 0:8b200:800 800L/800P F=1 B=4832/4832

        path    /f6
Indirect blocks:
               0 L0 0:8ca00:800 800L/800P F=1 B=4833/4833
...
Each write is in a different transaction group, and the blocks are not contiguous; some are in different 128KB regions.
So, back to Marsell's questions. Marsell says that if he writes six 2KB files, it will be one 12KB write. That is not clear from the above output. It may be six 2KB writes, it might be one 128KB write, or it could even be one 12KB write. I ran the first six-by-2KB test a second time, and all of the files were contiguous on disk. It is also possible that even when the writes are in different transaction groups, they all end up contiguous.
Let's write six 32KB files.
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=32k count=1; done
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:8da00:8000 8000L/8000P F=1 B=5108/5108

        path    /f2
Indirect blocks:
               0 L0 0:a8e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f3
Indirect blocks:
               0 L0 0:b0e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f4
Indirect blocks:
               0 L0 0:b8e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f5
Indirect blocks:
               0 L0 0:efc00:8000 8000L/8000P F=1 B=5108/5108

        path    /f6
Indirect blocks:
               0 L0 0:da400:8000 8000L/8000P F=1 B=5108/5108
...
The writes are all in the same transaction group, but not all in the same 128KB region. In fact, a single write may be spread across transaction groups. Note that this implies that there can be data loss, i.e., not all of the data written in one write call ends up on disk if there is a power failure. ZFS guarantees consistency of the file system, i.e., each transaction is all or nothing. If a write spans multiple transactions, some of those transactions may not make it to disk. Applications concerned about this should either use synchronous writes or have some other recovery mechanism. Note that synchronous writes use the ZFS intent log (ZIL), so performance may not be compromised.
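For example, if an application cannot open its files with O_DSYNC itself, you can force synchronous semantics per dataset using the standard sync property (shown here purely for illustration):

# zfs set sync=always testpool

(and zfs set sync=standard testpool to put it back). With sync=always, each write is committed to stable storage via the ZIL before the write call returns.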
Here is a single write of 4MB.
# dd if=/dev/zero of=/testpool/big bs=4096k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
        path    /big
Indirect blocks:
               0 L1  0:620000:400 0:1300000:400 4000L/400P F=32 B=5410/5410
               0  L0 0:200000:20000 20000L/20000P F=1 B=5409/5409
           20000  L0 0:220000:20000 20000L/20000P F=1 B=5409/5409
           40000  L0 0:240000:20000 20000L/20000P F=1 B=5409/5409
           60000  L0 0:260000:20000 20000L/20000P F=1 B=5409/5409
           80000  L0 0:280000:20000 20000L/20000P F=1 B=5409/5409
           a0000  L0 0:2a0000:20000 20000L/20000P F=1 B=5409/5409
           c0000  L0 0:2c0000:20000 20000L/20000P F=1 B=5409/5409
           e0000  L0 0:2e0000:20000 20000L/20000P F=1 B=5409/5409
          100000  L0 0:300000:20000 20000L/20000P F=1 B=5409/5409
          120000  L0 0:320000:20000 20000L/20000P F=1 B=5409/5409
          140000  L0 0:340000:20000 20000L/20000P F=1 B=5409/5409
          160000  L0 0:360000:20000 20000L/20000P F=1 B=5409/5409
          180000  L0 0:380000:20000 20000L/20000P F=1 B=5409/5409
          1a0000  L0 0:3a0000:20000 20000L/20000P F=1 B=5409/5409
          1c0000  L0 0:3c0000:20000 20000L/20000P F=1 B=5409/5409
          1e0000  L0 0:3e0000:20000 20000L/20000P F=1 B=5409/5409
          200000  L0 0:400000:20000 20000L/20000P F=1 B=5409/5409
          220000  L0 0:420000:20000 20000L/20000P F=1 B=5409/5409
          240000  L0 0:440000:20000 20000L/20000P F=1 B=5409/5409
          260000  L0 0:460000:20000 20000L/20000P F=1 B=5409/5409
          280000  L0 0:485e00:20000 20000L/20000P F=1 B=5410/5410
          2a0000  L0 0:4a5e00:20000 20000L/20000P F=1 B=5410/5410
          2c0000  L0 0:4c5e00:20000 20000L/20000P F=1 B=5410/5410
          2e0000  L0 0:500000:20000 20000L/20000P F=1 B=5410/5410
          300000  L0 0:520000:20000 20000L/20000P F=1 B=5410/5410
          320000  L0 0:540000:20000 20000L/20000P F=1 B=5410/5410
          340000  L0 0:560000:20000 20000L/20000P F=1 B=5410/5410
          360000  L0 0:580000:20000 20000L/20000P F=1 B=5410/5410
          380000  L0 0:5a0000:20000 20000L/20000P F=1 B=5410/5410
          3a0000  L0 0:5c0000:20000 20000L/20000P F=1 B=5410/5410
          3c0000  L0 0:5e0000:20000 20000L/20000P F=1 B=5410/5410
          3e0000  L0 0:600000:20000 20000L/20000P F=1 B=5410/5410
The write is spread across two transaction groups. Examining the code in zfs_write(), you can see that each write is broken into recordsize-sized chunks, and each chunk results in a separate transaction (see the calls to dmu_tx_create() in that code). Those transactions can be assigned to different transaction groups (see the calls to dmu_tx_assign()).
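Here is an (untested) sketch of watching that with DTrace, counting the dmu_tx_assign() calls made on behalf of a single dd write(2). If each 128KB chunk really gets its own transaction, the 4MB write above should produce a count in the neighborhood of 32:

# dtrace -n '
syscall::write:entry /execname == "dd"/ { self->in = 1; }

/* transactions assigned while we are inside dd'"'"'s write(2) */
fbt::dmu_tx_assign:entry /self->in/ { @assigns = count(); }

syscall::write:return /self->in/ { self->in = 0; }'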
If the recordsize is set to 8K, the maximum size of a block will be 8KB. Let's give that a try and look at the results. Blocks that are already allocated are not affected.
# zfs set recordsize=8192 testpool
# zfs get recordsize testpool
NAME      PROPERTY    VALUE    SOURCE
testpool  recordsize  8K       local
# dd if=/dev/zero of=/testpool/smallblock bs=128k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
        path    /smallblock
Indirect blocks:
               0 L1  0:64e800:400 0:1312c00:400 4000L/400P F=16 B=5653/5653
               0  L0 0:624800:2000 2000L/2000P F=1 B=5653/5653
            2000  L0 0:627c00:2000 2000L/2000P F=1 B=5653/5653
            4000  L0 0:632800:2000 2000L/2000P F=1 B=5653/5653
            6000  L0 0:634800:2000 2000L/2000P F=1 B=5653/5653
            8000  L0 0:636800:2000 2000L/2000P F=1 B=5653/5653
            a000  L0 0:638800:2000 2000L/2000P F=1 B=5653/5653
            c000  L0 0:63a800:2000 2000L/2000P F=1 B=5653/5653
            e000  L0 0:63c800:2000 2000L/2000P F=1 B=5653/5653
           10000  L0 0:63e800:2000 2000L/2000P F=1 B=5653/5653
           12000  L0 0:640800:2000 2000L/2000P F=1 B=5653/5653
           14000  L0 0:642800:2000 2000L/2000P F=1 B=5653/5653
           16000  L0 0:644800:2000 2000L/2000P F=1 B=5653/5653
           18000  L0 0:646800:2000 2000L/2000P F=1 B=5653/5653
           1a000  L0 0:648800:2000 2000L/2000P F=1 B=5653/5653
           1c000  L0 0:64a800:2000 2000L/2000P F=1 B=5653/5653
           1e000  L0 0:64c800:2000 2000L/2000P F=1 B=5653/5653
Basically, the behavior is the same as with the default 128KB recordsize, except that the maximum size of a block is 8KB. This should hold for all blocks (data and metadata). Any modified metadata (due to copy-on-write) will also use the smaller block size. As for the performance implications, I'll leave that to the Roch Bourbonnais blog referenced at the beginning.
For compression, nothing really changes: the recordsize is still the maximum (logical) size of a block; compression just determines how many of those bytes actually land on disk. We'll reset the recordsize to the default and turn on lzjb compression.
# zfs set recordsize=128k testpool
# zfs set compression=lzjb testpool
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      local
testpool  compression  lzjb      local
#
And write 256KB...
# dd if=/dev/zero of=/testpool/zero bs=128k count=2
2+0 records in
2+0 records out
# zdb -dddddddd testpool
...
        path    /zero
Indirect blocks:
Good, so compressing nothing but zeros resulted in no data blocks at all.
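A quick way to see the effect without zdb (plain ls and du, nothing ZFS-specific):

# ls -l /testpool/zero
# du -h /testpool/zero

ls reports the 256KB logical size, while du should show essentially nothing allocated. Now let's write some data that doesn't compress away entirely.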
# dd if=/usr/dict/words of=/testpool/foo.compressed bs=128k count=2
1+1 records in
1+1 records out
# zdb -dddddddd testpool
...
        path    /foo.compressed
Indirect blocks:
               0 L1  0:6b2200:400 0:1390800:400 4000L/400P F=2 B=5830/5830
               0  L0 0:690600:15200 20000L/15200P F=1 B=5830/5830
           20000  L0 0:6a5800:ca00 20000L/ca00P F=1 B=5830/5830
So, we have one indirect block and two blocks of compressed data. The first block of compressed data is 0x15200 (86528) bytes, the second is 0xca00 (51712) bytes. The two blocks are contiguous on disk, so it is possible they are written in one write to the disk.
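As a quick cross-check, ZFS itself reports the ratio it is getting (compressratio is a standard, read-only property):

# zfs get compressratio testpool

That is an aggregate over the dataset's used space, not a per-file number, but on this little pool it is dominated by /foo.compressed.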
To conclude, recordsize is handled at the block level: it is the maximum size of a block that may be written by ZFS. Existing data and metadata are not changed when the recordsize is changed or when compression is turned on. As for performance tuning, I would be careful about putting too much faith in the ZFS evil tuning guide. It is dated, some of the descriptions are not accurate, and some things are missing.
I'll have another ZFS-related blog post soon. I'm currently waiting for a bug to be fixed in zdb.
Post written by rachelbalik