February 13, 2014 - by Mr. Max Bruning
This post assumes some knowledge of ZFS internals. You can obtain this knowledge by reading the source code, and/or by taking a ZFS Internals course. The post also makes extensive use of mdb(1) (see http://www.illumos.org/man/1/mdb). A full description of mdb is well beyond the scope of this blog post.
When teaching the ZFS Internals course, I often give students the following lab:
"For an application that reads data from a file, find the data in the ARC."
The ARC (Adjustable Replacement Cache) is an in-memory cache of recently and/or frequently accessed data/metadata from disk. ZFS file system (and volume) data and metadata are read/written via the ARC. A good description can be found in the source code at usr/src/uts/common/fs/zfs/arc.c.
A more general (and possibly more useful) question is to identify how much of the ARC a given file, file system, or volume is using.
To do the lab, we'll set up a simple ZFS pool using a file, put some known data into a file in the pool, run a program to read the data, and then look for the data in the ARC. For this lab, you'll need a system running SmartOS (illumos, OpenIndiana, and probably Solaris 10 and 11 variants should also work). The system should not be "busy": if there is a lot of file system activity, the data for the file may not stay cached for very long.
Here are the first steps:
# mkfile 100m /var/tmp/zfsfile <-- create a file to be used for the pool.
# zpool create testpool /var/tmp/zfsfile
# cp /usr/dict/words /testpool/words <-- our file with known data
# zpool export testpool
# zpool import -d /var/tmp testpool
We export and then import the pool to clear the ARC of any data left from the cp(1).
Now we'll read in the words file and find it in ARC (or not, if the system is very busy). First we'll just go through the steps, then we'll go through some explanation.
# dd if=/testpool/words of=/dev/null bs=128k
1+1 records in
1+1 records out
#
# ls -i /testpool/words
2040 /testpool/words
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> 0t2040=K <- convert inumber (object id) to hex
7f8
>
> ::dbufs -o 7f8 -n testpool| ::dbuf
addr object lvl blkid holds os
ffffff00cf433e80 7f8 0 0 0 testpool
ffffff00cf43d9f8 7f8 0 1 0 testpool
ffffff00ce7a31b0 7f8 1 0 2 testpool
ffffff00ce7b8860 7f8 0 bonus 1 testpool
>
>
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf
dmu_buf_t db = {
uint64_t db.db_object = 0x7f8
uint64_t db.db_offset = 0 <-- beginning of file
uint64_t db.db_size = 0x20000 <-- 128k
void *db.db_data = 0xffffff0043402000 <-- location of ARC data buffer
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
>
> ffffff0043402000,10/c
1st <-- this is the beginning of the "words" file
2nd
3rd
> ffffff00ccdb0ee0::print -t arc_buf_t
arc_buf_t {
arc_buf_hdr_t *b_hdr = 0xffffff00cfb0c708
arc_buf_t *b_next = 0
kmutex_t b_evict_lock = {
void *[1] _opaque = [ 0 ]
}
void *b_data = 0xffffff0043402000
arc_evict_func_t *b_efunc = dbuf_do_evict
void *b_private = 0xffffff00cf433e80
}
> ffffff00cfb0c708::print -t arc_buf_hdr_t
arc_buf_hdr_t {
dva_t b_dva = {
uint64_t [2] dva_word = [ 0x100, 0x24400 ]
}
uint64_t b_birth = 0x1e596
uint64_t b_cksum0 = 0x2f6c9bcce37c
... <-- output omitted
arc_buf_hdr_t *b_hash_next = 0xffffff00c82e8a90
arc_buf_t *b_buf = 0xffffff00ccdb0ee0
...
uint64_t b_size = 0x20000
uint64_t b_spa = 0x1fc28bd029207b7b
arc_state_t *b_state = ARC_mfu <-- buffer is in MFU list
...
}
>
>
The data structures used to maintain the ARC are arc_buf_hdr_t and arc_buf_t. These data structures are used to determine if a buffer is in the ARC, and, if so, where (MRU, MFU, MRU ghost, MFU ghost, L2ARC). (The ghost lists are used to determine when an MRU or MFU cache is too small.) But they do not identify which object the data/metadata belongs to. For this, the dmu_buf_impl_t structure (hereafter referred to as "dbuf" structures) can be used. Note that not everything in the ARC is mapped by dbufs.
The following diagram shows the data structures used by the DMU to manage data and metadata in the ARC.
DBUF_HASH(objset, objid, level, blkid)
|
| ______ ------> hash chain of dmu_buf_impl_t structs
| |-----|0 | _ ___
| |-----| | |------>|_|------------>| | dnode_t __
| |-----| __|_| dnode_handle_t _____ |__|----------->| |dnode_phys_t
|-->|-----|-->| |--------------->| | |_|(in metadata)
|-----| |____|dmu_buf_impl_t | | data/metadata
|-----| | | | (or bonus buffer)
|-----| |----------- | |
|-----| | | |
|_____|hash_table_mask+1 | /-->|____|
dbuf_hash_table.hash_table | |
__ V__|_
| |
buf_hash(spa, dva, birth) |______|arc_buf_t (NULL for bonus buffer)
| ^
| _____ |
| |----|0 ___V___
|->|----|----------------->| | arc_buf_hdr_t
|----| |______|
|----| |------> hash chain of arc_buf_hdr_t
|____|ht_mask+1
buf_hash_table.ht_table
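The same relationships can also be sketched in C. The following is a deliberately simplified model based only on the fields shown in the mdb output above (most members omitted, types reduced); it is not the actual illumos source, so see sys/dbuf.h, sys/arc.h, and arc.c for the real definitions.

/*
 * Simplified sketch (not the illumos source) of how a dbuf maps data in
 * the ARC: the dmu_buf_impl_t identifies *what* the data is (object,
 * offset, block), while the arc_buf_t / arc_buf_hdr_t pair tracks *where*
 * it lives in the cache.  Field names follow the mdb output above.
 */
#include <stdint.h>
#include <stddef.h>

typedef struct arc_buf_hdr {
        uint64_t b_dva[2];                /* on-disk address: part of the buf_hash() key */
        uint64_t b_birth;                 /* birth txg: also part of the key */
        uint64_t b_size;                  /* size of the cached buffer */
        struct arc_buf *b_buf;            /* arc_buf_t(s) sharing this header */
        struct arc_buf_hdr *b_hash_next;  /* buf_hash_table chain */
        /* b_state: ARC_mru, ARC_mfu, the ghost lists, l2arc, ... */
} arc_buf_hdr_t;

typedef struct arc_buf {
        arc_buf_hdr_t *b_hdr;             /* up to the header */
        void *b_data;                     /* the cached data itself */
        void *b_private;                  /* back-pointer to the dbuf, or NULL */
} arc_buf_t;

typedef struct dmu_buf_impl {
        uint64_t db_object;               /* object id ("inumber") */
        uint64_t db_offset;               /* byte offset within the object */
        uint64_t db_size;                 /* block size */
        void *db_data;                    /* equals b_data when ARC-backed */
        arc_buf_t *db_buf;                /* NULL for bonus buffers */
        struct dmu_buf_impl *db_hash_next; /* dbuf_hash_table chain */
} dmu_buf_impl_t;

/* How much ARC space does this dbuf map?  (Mirrors the mdb pipelines below.) */
static uint64_t
dbuf_arc_size(const dmu_buf_impl_t *db)
{
        return (db->db_buf != NULL ? db->db_buf->b_hdr->b_size : 0);
}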
The following describes the mdb commands that were used to find the data.
> ::dbufs -o 7f8 -n testpool| ::dbuf
The ::dbufs dcmd walks the cache of allocated dmu_buf_impl_t structures. The "-o 7f8" only displays entries with object id 0x7f8, the "inumber" of the words file, and the "-n testpool" only shows those entries in the testpool object set. The ::dbuf dcmd displays a summary of each dmu_buf_impl_t.
The output of the above command shows the address of the dmu_buf_impl_t, the object id, the level of indirection, the block id, the number of holds on the object, and the object set name. ZFS can use up to 6 levels of indirect blocks.
The object id will either be a number (for instance, 0x7f8), or "mdn" (meta dnode), which is used for objset_phys_t structures that are in memory. The objset_phys_t data structure contains information about the meta object set (the MOS), which describes the root of a pool, child datasets, clones, snapshots, the dedup table, volumes, and the space map for a pool, among other things. There are also objset_phys_t structures for each dataset, clone, volume, child dataset, and snapshot, which locate the objects (files, directories) within the object set.
The block id identifies which block in the object is referenced by the dmu_buf_impl_t, or the block id contains the string "bonus". The bonus buffer (a field in the dnode_phys_t) contains attributes (ownership, timestamps, permissions, etc.) of an object. Note that entries marked "bonus" have a NULL value for the arc_buf_t * field in the dmu_buf_impl_t. The bonus buffer is in the ARC, but it is there as part of the dnode_phys_t for the object. The bonus DMU buffers are copies of the data from the corresponding dnode_phys_t, and the dnode_phys_t that contains the bonus buffer is also in the DMU cache (and ARC).
The "holds" value says how many things are currently using the DMU buffer. The buffer can not be freed if the hold count is non-zero.
Here again is the output of "::dbufs -o 7f8 -n testpool | ::dbuf":
addr object lvl blkid holds os
ffffff00cf433e80 7f8 0 0 0 testpool
ffffff00cf43d9f8 7f8 0 1 0 testpool
ffffff00ce7a31b0 7f8 1 0 2 testpool
ffffff00ce7b8860 7f8 0 bonus 1 testpool
For object id 0x7f8, there are 4 dbufs. The first is for the first block (128k bytes) of the file, and the second is for the second block. The third is for a level 1 indirect block; it contains the block pointers that describe the blocks of the first two entries. The last one is for the bonus buffer for the file. The dnode_phys_t describing the file is in an "mdn" dbuf.
At this point, we get more information about the first dbuf.
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf db_dnode_handle
dmu_buf_t db = {
uint64_t db.db_object = 0x7f8
uint64_t db.db_offset = 0
uint64_t db.db_size = 0x20000
void *db.db_data = 0xffffff0043402000
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
struct dnode_handle *db_dnode_handle = 0xffffff00ccf520c8
>
The db member describes the buffer. The db.db_data field is the address where the buffer starts in memory. Going to that address shows the first 128k of data for the words file.
> ffffff0043402000,20000/c
1st <-- this is the beginning of the "words" file
2nd
3rd
...
>
The arc_buf_t contains a pointer to the arc_buf_hdr_t for the buffer, which in turn shows that the buffer is in the ARC_mfu cache. Note that the address of the buffer in the arc_buf_t (b_data) matches the db_data field in the dmu_buf_impl_t. The b_private field in the arc_buf_t is a pointer back to the dmu_buf_impl_t.
Now let's look at the dnode_t for the file.
> ffffff00cf433e80::print -t dmu_buf_impl_t db_dnode_handle | ::print -t dnode_handle_t dnh_dnode | ::print -t dnode_t
dnode_t {
...
list_node_t dn_link = { <-- linked list of all dnodes on system
...
}
struct objset *dn_objset = 0xffffff00cd063040
uint64_t dn_object = 0x7f8
struct dmu_buf_impl *dn_dbuf = 0xffffff00ce7ae268 <-- dbuf for dnode_phys_t
struct dnode_handle *dn_handle = 0xffffff00ccf520c8
dnode_phys_t *dn_phys = 0xffffff00cf8e8000
dmu_object_type_t dn_type = 0t19 (DMU_OT_PLAIN_FILE_CONTENTS)
uint16_t dn_bonuslen = 0xa8
uint8_t dn_bonustype = 0x2c
...
uint32_t dn_dbufs_count = 0x4
...
refcount_t dn_holds = {
uint64_t rc_count = 0x4
}
...
list_t dn_dbufs = { <-- dbufs with this dnode_t
...
}
struct dmu_buf_impl *dn_bonus = 0xffffff00ce7b8860
...
}
The dnode_t contains a pointer (dn_dbuf) to a dmu_buf_impl_t. Let's look at this:
> ffffff00ce7ae268::dbuf
addr object lvl blkid holds os
ffffff00ce7ae268 mdn 0 3f 1 testpool
>
So, the dbuf that contains the dnode_phys_t for the words file is a meta dnode object, at indirect level 0 and block id 0x3f. Let's take a closer look.
> ffffff00ce7ae268::print -t dmu_buf_impl_t
dmu_buf_impl_t {
dmu_buf_t db = {
uint64_t db_object = 0
uint64_t db_offset = 0xfc000
uint64_t db_size = 0x4000
void *db_data = 0xffffff00cf8e5000
}
struct objset *db_objset = 0xffffff00cd063040
struct dnode_handle *db_dnode_handle = 0xffffff00cd063060
struct dmu_buf_impl *db_parent = 0xffffff00ce795e48
struct dmu_buf_impl *db_hash_next = 0
uint64_t db_blkid = 0x3f
blkptr_t *db_blkptr = 0xffffff00cf8f2f80
...
dbuf_states_t db_state = 4 (DB_CACHED)
refcount_t db_holds = {
uint64_t rc_count = 0x1
}
arc_buf_t *db_buf = 0xffffff00ccdb0d30
...
void *db_user_ptr = 0xffffff00ccf51e80
...
}
This dbuf is for a 16k (0x4000) byte block at offset 0xfc000. Note that the blkid (0x3f) times the block size (0x4000) gives the offset of 0xfc000. This is a block of dnode_phys_t structures.
> ::sizeof dnode_phys_t
sizeof (dnode_phys_t) = 0x200
>
> 4000%200=K <-- block size is 0x4000, dnode_phys_t size is 0x200
20 <-- 32 dnode_phys_t / block
>
> 7f8%20=K <-- 0x7f8 is the object id for words
3f <-- matches the db_blkid
> 3f*20=K <-- where does block containing 7f8 begin?
7e0
> 7f8-7e0=K <-- get offset from beginning of block
18
> ffffff00cf8e5000+(18*200)=K <-- get address of dnode_phys_t for words file
ffffff00cf8e8000 <-- matches dn_phys in dnode_t above
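In C, the same arithmetic looks like this. It is just a restatement of the mdb calculation above, using the values observed on this system (0x200-byte dnode_phys_t, 0x4000-byte meta-dnode block); it is not ZFS code.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t objid   = 0x7f8;               /* object id ("inumber") of words */
        uint64_t dnsize  = 0x200;               /* sizeof (dnode_phys_t) */
        uint64_t blksize = 0x4000;              /* size of the meta-dnode block */
        uint64_t per_blk = blksize / dnsize;    /* 0x20 dnode_phys_t per block */
        uint64_t blkid   = objid / per_blk;     /* 0x3f, matches db_blkid */
        uint64_t slot    = objid % per_blk;     /* 0x18, index within the block */
        uint64_t db_data = 0xffffff00cf8e5000ULL; /* db_data of the mdn dbuf */

        /* Prints 0xffffff00cf8e8000, matching dn_phys in the dnode_t above. */
        printf("dnode_phys_t for object %#llx (blkid %#llx) is at %#llx\n",
            (unsigned long long)objid, (unsigned long long)blkid,
            (unsigned long long)(db_data + slot * dnsize));
        return (0);
}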
If the ARC buffer is evicted, a callback (dbuf_do_evict()) will clean up the dmu_buf_impl_t. See the comment before dbuf_clear() in uts/common/fs/zfs/dbuf.c for some details. Here is the same ::dbufs command run as before, but after some ARC/dbuf evictions.
> ::dbufs -o 7f8 -n testpool | ::dbuf
addr object lvl blkid holds os
ffffff00cf433e80 7f8 0 0 0 testpool
ffffff00ce7a31b0 7f8 1 0 1 testpool
ffffff00ce7b8860 7f8 0 bonus 1 testpool
So, one of the buffers (the one containing the second block of the file) is no longer cached.
An interesting question to ask may be: For a given file, how much of the file data/metadata is in ARC?
To do this, we'll use a file I have been intermittently looking at over time.
# ls -li /var/tmp/foo.out
1337 -rw-r--r-- 1 root root 13328871 Jan 6 09:56 /var/tmp/foo.out
#
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic
uppc apix scsi_vhci ufs ip hook neti sockfs arp usba
stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc
logindmux ptm sppp nfs ]
> ::dbufs -o 0t1337 | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
6832128 <-- decimal, about 6.8MB
>
Maybe more interesting is how much ARC space a given dataset or volume is using. The following shows total space in ARC used by the testpool dataset.
> ::dbufs -n testpool | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
411648
>
Volumes are used in the Joyent Public Cloud (JPC) for kvm-based virtual machines. Let's get the space used by a Fedora instance. First, the list of virtual machines.
# vmadm list
UUID TYPE RAM STATE ALIAS
8d02acc6-a9cc-e033-f196-aa6841702872 OS 768 running vm1
9d3ccd6c-511b-60c5-db07-d742941bb62b KVM 1024 running ubuntu-1
a5528066-171a-694f-85ef-cac9928c9fd3 OS 2048 running vm1
dd6fb539-cfac-c84a-f336-d1232a6f673e OS 2048 running -
62293978-3947-eb5a-dcdf-a6b4728b39bf KVM 8192 running maxfedora
ed3f45b1-833a-438d-8214-3876a58d9371 OS 8192 running moray1
Volumes are more difficult, as we cannot use the volume name with the "-n" option to ::dbufs. The following command line walks the set of all dnode_t structures on the system. For each one, it gets the type of the dnode_t, looking for a type value of 0x17 (DMU_OT_ZVOL, i.e., a volume). For each dnode_t of that type, it prints out the name of the volume.
> ::walk dnode_t d | ::print dnode_t dn_phys | ::print dnode_phys_t dn_type | ::grep ".==17" | ::eval '
This is a long one-liner. Briefly, it walks the list of dnode_t structures in memory. For each dnode_t, the walker stores the address of the dnode_t in an mdb variable ("d"). For each dnode_t, it prints the dn_phys value (dn_phys is the address of a dnode_phys_t, which is an in-memory copy of the same data structure that is on disk). The dn_type field gives the "type" of the dnode_phys_t. If the type is 0x17, the dnode_t (and corresponding dnode_phys_t) is for a ZFS volume. The "::eval" gets the value of the "d" variable (the dnode_t with type equal to 0x17) and prints the dn_objset for that dnode. At the end, this one-liner will list the names of all of the ZFS volumes currently on the system.
To find the amount of ARC space consumed by the "maxfedora" virtual machine (uuid = 62293978-3947-eb5a-dcdf-a6b4728b39bf-disk1), we can find all dbufs whose dnode handle takes us to the dnode_t for the volume. Using the above command, we want the first dnode_t.
> ::walk dnode_t d | ::print dnode_t dn_phys | ::print dnode_phys_t dn_type | ::grep ".==17" | ::eval '
The dnode_t contains a list of all dbufs that are used for it. We'll walk the list of dbufs, and for each one that has a non-NULL arc buf pointer, we'll get the size from the arc buf header and add them up. To walk the list of dbufs in the dnode_t, we need to know the address of the list.
> ::offsetof dnode_t dn_dbufs
offsetof (dnode_t, dn_dbufs) = 0x248, sizeof (...->dn_dbufs) = 0x20
>
Adding the offset of the dn_dbufs member to the address of the dnode_t for the "62293978-3947-eb5a-dcdf-a6b4728b39bf-disk0" volume, we'll walk the list of dbufs for the volume. This volume is the system disk for the "maxfedora" image.
> ffffff0dc5845710 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
216375296 <-- number of bytes mapped in ARC for the volume
>
And here is the data disk (/data) in the Fedora instance.
> ffffff11c639ebe0 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
743317504
>
As mentioned earlier, not all of what is in the ARC is mapped by dbufs. First of all, not all dbufs refer to ARC buffers (the db_buf field in the dmu_buf_impl_t can be NULL). All but 2 of these instances of the dbufs are there to get quick access to bonus buffers. The bonus buffers are in the ARC, but as part of the dnode_phys_t which contains them. There are many ARC buffers that do not have a pointer back to a dbuf (b_private in the arc_buf_t is NULL). I have looked at some of these ARC buffers and have found a few different types of metadata, but also some buffers which contain data. One possible reason for this is prefetch.
Here is a way to see all of ARC that is mapped by dbufs.
> ::walk dmu_buf_impl_t d | ::print dmu_buf_impl_t db_buf | ::grep ".!=0" | ::eval "
And here is a way to see the total space used by ARC buffers.
> ::walk arc_buf_t | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size !sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
1236963840
>
Note that this number matches closely with the size shown by:
> ::arc !grep size
size = 1200 MB
...
>
Determining the cause of the difference (1181588480 for dbufs, and 1236963840 for arc buffers) is left as an exercise for the reader.
All of this is very interesting, but also quite a few steps. It would be nice to have an lsarc command that lists what files are in the ARC, how much data/metadata is in the ARC for a given file/dataset/volume, a breakdown between data and metadata, and even which ARC cache the data is on (MRU or MFU). Once you understand that the dbufs provide a map of (most of) the ARC, this command becomes possible.
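As a rough illustration of the accounting such an lsarc would do, here is a toy, user-level sketch. It aggregates sizes over a list of dbuf-like records the same way the mdb pipelines above do (skip dbufs with no ARC buffer, then sum b_size), and adds the data/metadata breakdown mentioned above. The record type and the sample values are made up for illustration; a real lsarc would fill them in by walking the kernel's dbuf and ARC structures, for example as an mdb dcmd.

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical flattened view of one dbuf.  A real tool would populate
 * these by following dmu_buf_impl_t -> arc_buf_t -> arc_buf_hdr_t as
 * shown earlier in this post.
 */
typedef struct dbuf_rec {
        const char *dr_os;      /* object set (dataset/volume) name */
        uint64_t dr_object;     /* object id */
        uint64_t dr_arc_size;   /* b_size of the ARC buffer, 0 if none */
        int dr_is_meta;         /* indirect block, dnode block, ... */
} dbuf_rec_t;

int
main(void)
{
        /* Made-up sample records, standing in for a kernel walk. */
        dbuf_rec_t recs[] = {
                { "testpool", 0x7f8, 0x20000, 0 },      /* words, block 0 */
                { "testpool", 0x7f8, 0x20000, 0 },      /* words, block 1 */
                { "testpool", 0x7f8, 0x4000,  1 },      /* level-1 indirect */
                { "testpool", 0x7f8, 0,       0 },      /* bonus dbuf: no ARC buf */
        };
        uint64_t data = 0, meta = 0;
        size_t i;

        for (i = 0; i < sizeof (recs) / sizeof (recs[0]); i++) {
                if (recs[i].dr_arc_size == 0)   /* same filter as ::grep ".!=0" */
                        continue;
                if (recs[i].dr_is_meta)
                        meta += recs[i].dr_arc_size;
                else
                        data += recs[i].dr_arc_size;
        }
        printf("%s object %#llx: data %llu bytes, metadata %llu bytes in ARC\n",
            recs[0].dr_os, (unsigned long long)recs[0].dr_object,
            (unsigned long long)data, (unsigned long long)meta);
        return (0);
}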