XFS

From Wikitech
XFS and our servers

This page describes how we format our XFS partitions and why.

How they're formatted

root@db1047:/a/sqldata# xfs_info /dev/sda6
meta-data=/dev/sda6              isize=256    agcount=4, agsize=109108672 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=436434688, imaxpct=5
         =                       sunit=64     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=213120, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

How they get formatted

fenari:/home/midom/xfsfix is a Python script that looks up the right device and UUID names, then prints executable bash code that looks something like this:

root@db1047:/a/sqldata# python /root/xfsfix
umount /dev/sda6
mkfs.xfs -f -d sunit=512,swidth=4096 -L data /dev/sda6
xfs_admin -U f1363f7d-8a44-4abe-9e38-bf2171e265c8 /dev/sda6
mount /dev/sda6
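
For readers without access to fenari, here is a minimal sketch of how a script could emit the commands shown above. It is not the real xfsfix: the function name format_commands and its parameters are hypothetical, and the device, UUID and stripe values are passed in by hand, since how xfsfix discovers them is not described on this page.

```python
import shlex


def format_commands(device, uuid, sunit, swidth, label="data"):
    """Return shell commands that reformat `device` as XFS, one per line.

    Hypothetical helper for illustration; it only builds text and runs nothing.
    sunit and swidth are passed straight to mkfs.xfs -d.
    """
    dev = shlex.quote(device)
    return "\n".join([
        f"umount {dev}",
        f"mkfs.xfs -f -d sunit={sunit},swidth={swidth} -L {shlex.quote(label)} {dev}",
        # Set the filesystem UUID after mkfs has created a fresh one.
        f"xfs_admin -U {shlex.quote(uuid)} {dev}",
        f"mount {dev}",
    ])


print(format_commands("/dev/sda6", "f1363f7d-8a44-4abe-9e38-bf2171e265c8", 512, 4096))
```

Printing the commands instead of running them keeps the destructive mkfs.xfs -f step under the operator's control: the output can be read first and then pasted into a shell.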

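You'll notice that the sunit and swidth numbers put out by the script (512 and 4096) differ from what xfs_info prints (64 and 512). mkfs.xfs takes these values in 512-byte sectors, while xfs_info reports them in filesystem blocks (4096 bytes here), so both describe a 256 KiB stripe unit and a 2 MiB stripe width. In the conversation below, Domas at first guesses that the script's numbers are too high, but states that the settings shown by xfs_info are good enough as is.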
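
The two sets of numbers can be compared by converting them to a common unit. The sketch below assumes that the mkfs.xfs -d sunit/swidth values are counted in 512-byte sectors and that xfs_info reports in filesystem blocks of bsize=4096 bytes, as in the output above. The helper name sectors_to_blocks is made up for this example.

```python
SECTOR_BYTES = 512  # unit assumed for mkfs.xfs -d sunit= / swidth=


def sectors_to_blocks(sectors, block_bytes=4096):
    """Convert a count of 512-byte sectors into filesystem blocks."""
    total = sectors * SECTOR_BYTES
    if total % block_bytes:
        raise ValueError("not a whole number of filesystem blocks")
    return total // block_bytes


# Values taken from the mkfs.xfs command line printed by the script.
for name, sectors in (("sunit", 512), ("swidth", 4096)):
    blocks = sectors_to_blocks(sectors)
    kib = sectors * SECTOR_BYTES // 1024
    print(f"{name}: {sectors} sectors = {blocks} blocks = {kib} KiB")
```

Under these assumptions, sunit=512 sectors comes out as 64 blocks (256 KiB) and swidth=4096 sectors as 512 blocks (2048 KiB), which are the values xfs_info shows.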

Why they're formatted this way


[11:55 AM] <domas> 70U
[11:56 AM] <maplebed> so domas any thoughts on that pastebin?
[11:57 AM] <domas> hmmmm
[11:57 AM] <domas> *shrug*
[11:57 AM] • domas looks some more
[11:57 AM] <maplebed> Jeff_Green notices that the sunit=64 and swidth=512 is also present on db26
[11:57 AM] <domas> not on other machines?
[11:57 AM] <maplebed> (in the quest to see what's "right" that seems like a good place to start)
[11:57 AM] <domas> db26 is LVM
[11:57 AM] <maplebed> I haven't looked at other machines yet.
[11:58 AM] <maplebed> db42 is the same...
[11:59 AM] <domas> I guess I just have too high numbers there
[11:59 AM] <domas> it is not in bytes but in 512b sectors
[12:00 PM] <maplebed> not blocks? (which are set to 4096)?
[12:01 PM] <domas> pain oh pain
[12:01 PM] <Jeff_Green> ha
[12:01 PM] <domas> 'sectors' is usually in 512 in linux
[12:02 PM] <domas> 512*512 is 256k alignment
[12:02 PM] <maplebed> at any rate, I've got to run; if you think the current settings are fine I'll update the docs.
[12:02 PM] <domas> they are good enough
[12:02 PM] <domas> 32k alignment is good enough too
[12:02 PM] <domas> the major thing is not to have 16k partitioned
[12:02 PM] <domas> meh, we're talking 10% perf here
[12:03 PM] <domas> and we're not overloading i/o anyway
[12:03 PM] <Jeff_Green> domas: could you email/wiki/something us some notes on your tweaks?
[12:03 PM] <domas> jeff_green: there're not that many!
[12:03 PM] <domas> but I can try!
[12:03 PM] <Jeff_Green> i saw we tweak raid-related stripe stuff only?
[12:04 PM] <Jeff_Green> at CL we ended up tweaking only agcount (to 32) and the usual mount options, curious what/why you tweak
[12:04 PM] <domas> jeff_green a/g doesn't matter much, we have just one file that is big enough =)
[12:05 PM] <domas> jeff_green: stripe alignment is to avoid multiple reads for one block
[12:08 PM] <Jeff_Green> how does that interact with striping in hardware RAID?
[12:10 PM] <domas> well
[12:10 PM] <domas> if you don't align files on stripe boundaries
[12:10 PM] <domas> if a file is made out of 16k blocks
[12:10 PM] <domas> and you have 64k stripe
[12:10 PM] <domas> and it is not aligned
[12:11 PM] <domas> so, 25% of blocks will need two I/Os instead of one
[12:11 PM] <domas> because the block will reside on two disks
[12:11 PM] <domas> now, if you align them all on stripe boundary, all blocks are residing just on one disk
[12:11 PM] <domas> (I'm not counting mirrors)
[12:12 PM] <domas> back in the day it was much more painful, as we had to align partitions too
[12:12 PM] <domas> (which is what xfsfix was mostly for)
[12:12 PM] <domas> we were editing partitiontable with xfsfix before
[12:12 PM] <Jeff_Green> ok. I'm going to apply this to db1040 as a comprehension exercise
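
Domas's 25% figure can be reproduced with a small count. This is only an illustration: it assumes 16 KiB blocks laid out back to back, a 64 KiB per-disk stripe unit, and data starting at an arbitrary 4 KiB offset from a stripe boundary. The function name and the offset are chosen for the example, not taken from our servers.

```python
def straddling_fraction(block_bytes, stripe_bytes, offset_bytes, n_blocks=10_000):
    """Fraction of consecutive blocks that cross a stripe-unit boundary."""
    straddling = 0
    for i in range(n_blocks):
        start = offset_bytes + i * block_bytes
        end = start + block_bytes - 1  # last byte of this block
        # A block needs two I/Os if its first and last bytes fall in
        # different stripe units, i.e. on two different disks.
        if start // stripe_bytes != end // stripe_bytes:
            straddling += 1
    return straddling / n_blocks


KIB = 1024
print(straddling_fraction(16 * KIB, 64 * KIB, offset_bytes=0))        # aligned
print(straddling_fraction(16 * KIB, 64 * KIB, offset_bytes=4 * KIB))  # shifted by 4 KiB
```

With the aligned layout no block crosses a boundary. With the 4 KiB shift, one block in every four does, which matches the 25% mentioned in the log.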