The overlayfs volatile mount option

Posted on 2023-10-29 In Linux

Introduction

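Recently I ran into a container-related production issue at work. In short, K8s reported a "PLEG is not healthy" error. Initial investigation showed that one of docker's goroutines was stuck in umount while holding a lock, so the handler that queries container state could not acquire the lock, which eventually surfaced as the error at the k8s level.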

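By monitoring umount events and reading the relevant source code, I found that when an overlayfs is unmounted, it syncs the filesystem that holds the upper layer, which triggers writeback of a large number of dirty pages. If the machine has a lot of memory and has seen heavy IO, there will be many dirty pages, and the overlayfs umount blocks for too long while waiting for the disk IO to complete.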
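
To get a rough feel for how much data such a sync may have to flush, the dirty-page counters in /proc/meminfo can be inspected. The sketch below parses text in that format and adds up the Dirty and Writeback fields. The sample numbers are made up, and these counters are system-wide rather than per filesystem, so treat the result as an upper bound only.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def pending_writeback_kb(meminfo: dict) -> int:
    """Dirty pages not yet written plus pages currently being written back."""
    return meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)


# Made-up sample; on a real host use open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       263856820 kB
Dirty:            8123456 kB
Writeback:          20480 kB
"""

info = parse_meminfo(SAMPLE)
print(pending_writeback_kb(info), "kB still waiting to reach disk")
```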

The stack trace of the umount is as follows:

[<0>] wb_wait_for_completion+0x5a/0x90
[<0>] __writeback_inodes_sb_nr+0xa0/0xd0
[<0>] writeback_inodes_sb+0x3d/0x50
[<0>] _sync_filesystem+0x55/0x60
[<0>] sync_filesystem+0x33/0x50
[<0>] ovl_sync_fs+0x61/0xa0 [overlay]
[<0>] _sync_filesystem+0x33/0x60
[<0>] sync_filesystem+0x44/0x50
[<0>] generic_shutdown_super+0x27/0x120
[<0>] kill_anon_super+0x18/0x30
[<0>] deactivate_locked_super+0x3b/0x90
[<0>] deactivate_super+0x42/0x50
[<0>] cleanup_mnt+0x109/0x170
[<0>] _cleanup_mnt+0x12/0x20
[<0>] task_work_run+0x70/0xb0
[<0>] exit_to_user_mode_prepare+0x1b6/0x1c0
[<0>] syscall_exit_to_user_mode+0x27/0x50
[<0>] do_syscall_64+0x69/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

The key is the ovl_sync_fs function, which syncs the entire filesystem that holds the overlayfs upper layer:

// from kernel 5.10
/* Sync real dirty inodes in upper filesystem (if it exists) */
static int ovl_sync_fs(struct super_block *sb, int wait)
{
        struct ovl_fs *ofs = sb->s_fs_info;
        struct super_block *upper_sb;
        int ret;

        if (!ovl_upper_mnt(ofs))
                return 0;

        if (!ovl_should_sync(ofs))
                return 0;
        /*
         * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
         * All the super blocks will be iterated, including upper_sb.
         *
         * If this is a syncfs(2) call, then we do need to call
         * sync_filesystem() on upper_sb, but enough if we do it when being
         * called with wait == 1.
         */
        if (!wait)
                return 0;
        
        // Find the filesystem (super block) that holds the upper layer
        upper_sb = ovl_upper_mnt(ofs)->mnt_sb;

        down_read(&upper_sb->s_umount);
        // Run the sync: dirty pages of the whole filesystem are written back to disk, which can take a long time
        ret = sync_filesystem(upper_sb);
        up_read(&upper_sb->s_umount);

        return ret;
}

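Clearly, for many k8s scenarios, syncing the filesystem is redundant. Each time an instance starts, k8s mounts a fresh rootfs, and when the instance exits the rootfs is unmounted and deleted (my description may not be fully accurate). So it would be best to have a way to avoid the forced flush when overlayfs is unmounted.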

Solution

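Looking closely at ovl_sync_fs, it performs two checks at the very beginning: one checks whether an upper layer exists, and the other, ovl_should_sync, checks whether overlayfs should sync at all. The key to solving the problem is probably whether ovl_should_sync can be used to bypass the sync.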
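
The early returns in ovl_sync_fs can be summarized as a small decision function. This is only a userspace model of the control flow quoted above. It assumes that ovl_should_sync simply returns false when the filesystem was mounted with volatile, which is my reading of the patch, not something shown in the excerpt.

```python
def ovl_should_sync(volatile: bool) -> bool:
    # Assumption: the kernel helper returns "not mounted with volatile".
    return not volatile


def ovl_sync_fs_syncs_upper(has_upper: bool, volatile: bool, wait: bool) -> bool:
    """Return True if ovl_sync_fs would reach sync_filesystem(upper_sb)."""
    if not has_upper:                  # no upper layer: nothing to sync
        return False
    if not ovl_should_sync(volatile):  # volatile mount: skip the sync
        return False
    if not wait:                       # only the wait == 1 pass does the work
        return False
    return True


for volatile in (False, True):
    result = ovl_sync_fs_syncs_upper(has_upper=True, volatile=volatile, wait=True)
    print(f"volatile={volatile}: syncs upper filesystem = {result}")
```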

kernel

The ovl_should_sync function was introduced by a patch submitted on 2020-08-31:

[PATCH v5] overlayfs: Provide a mount option "volatile" to skip sync - Vivek Goyal

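This patch gives overlayfs a new mount option, volatile. When mounted with this option, overlayfs omits some sync operations, including both syncs of individual files and syncs of the whole filesystem.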

The affected places are:

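umount no longer forces a sync of the upper layer's filesystem, which is exactly the scenario of this issue.
remount may also sync the filesystem that holds the upper layer; this comes from a patch merged into mainline in 2020:
[PATCH] ovl: sync dirty data when remounting to ro mode - Union Filesystem
The reason is that once overlayfs has been remounted read-only, the kill_anon_super -> generic_shutdown_super -> sync_filesystem path taken when unmounting overlayfs sees that the filesystem is read-only and skips sync_filesystem. So when overlayfs is remounted from read-write to read-only, a sync_filesystem is performed right away, so that the sync is not missed at the final umount. The volatile option cancels this sync_filesystem call.
fsync calls on individual files are skipped when the volatile mount option is set.
When a file is copied up to the upper layer, vfs_fsync() is also called. With the volatile option, it is skipped.
In the O_SYNC case, the volatile option also bypasses the sync, and writes fall back to overlayfs's default write behavior.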

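In essence, all of these sync operations exist to keep data on overlayfs from being lost when the system crashes. The volatile mount option fits Kubernetes use cases very well. If the kernel crashes while data is being written to overlayfs, kubelet always creates a new container rather than reusing the previous rootfs. In kubernetes, therefore, a container's rootfs is ephemeral. Using the volatile option in a pod is safe because there is no chance of an old rootfs being reused. It is also safe for stateful containers, because data that needs to persist should be written to external volumes, which are not affected by the volatile flag at runtime.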
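
For experimenting outside of a container runtime, an overlayfs can be mounted by hand with the option appended to the usual lowerdir/upperdir/workdir list. The helper below only builds the mount command line. The directory paths are hypothetical, and actually running the command needs root and a kernel that supports volatile.

```python
import shlex


def overlay_mount_cmd(lower, upper, work, merged, volatile=False):
    """Build the argv for mounting an overlayfs, optionally with 'volatile'."""
    opts = [
        "lowerdir=" + ":".join(lower),  # multiple lower layers are ':'-separated
        "upperdir=" + upper,
        "workdir=" + work,
    ]
    if volatile:
        opts.append("volatile")
    return ["mount", "-t", "overlay", "overlay", "-o", ",".join(opts), merged]


# Hypothetical directories, for illustration only.
cmd = overlay_mount_cmd(
    ["/tmp/ovl/lower"], "/tmp/ovl/upper", "/tmp/ovl/work", "/tmp/ovl/merged",
    volatile=True,
)
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run(cmd, check=True) on a test machine.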

containerd

Of course, a new mount option needs runtime support before the rootfs can be mounted with it.

There has already been a lot of related discussion in the containerd community:

This PR, [overlay] add configurable mount options to overlay snapshotter by dmcgowan · Pull Request #8676, allows the overlayfs mount options to be configured, and it was backported to 1.6.24.

requirements

To sum up, the overlayfs volatile feature requires the following versions:

Linux kernel >= 5.10
Corresponding patch: [PATCH v5] overlayfs: Provide a mount option "volatile" to skip sync - Vivek Goyal
containerd >= 1.6.24 or containerd >= 1.7.4
Corresponding PR: [overlay] add configurable mount options to overlay snapshotter by dmcgowan · Pull Request #8676

containerd configuration:

# /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    mount_options = ["volatile"]
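
After changing the configuration, one way to check the result is to look at the overlay entries in /proc/mounts. The sketch below assumes that overlayfs lists volatile among its options there (worth verifying on your kernel) and also includes a simple kernel version check. The sample text is made up.

```python
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare a `uname -r` style string against a minimum major.minor."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def overlay_mounts_without_volatile(mounts_text: str) -> list:
    """Return mount points of overlay filesystems mounted without 'volatile'."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "overlay":
            continue
        if "volatile" not in fields[3].split(","):
            missing.append(fields[1])
    return missing


# Made-up sample in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sdb / ext4 rw,relatime 0 0
overlay /run/c1/rootfs overlay rw,relatime,lowerdir=/l1,upperdir=/u1,workdir=/w1,volatile 0 0
overlay /run/c2/rootfs overlay rw,relatime,lowerdir=/l2,upperdir=/u2,workdir=/w2 0 0
"""

print(overlay_mounts_without_volatile(SAMPLE_MOUNTS))
print(kernel_at_least("5.10.0-28-amd64", 5, 10))
```

On a real node, pass open("/proc/mounts").read() and platform.release() instead of the samples.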

reference

Default mount options of the ext4 filesystem

Posted on 2023-07-29 Edited on 2023-10-29 In Linux

Introduction

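The official Ext4 documentation lists many mount options, and some of them are marked as defaults, such as delalloc. However, these default options do not show up in procfs under /proc/mounts. Take delalloc: there is a nodelalloc option that disables delalloc, so the two are mutually exclusive, yet neither of them appears.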

$ cat /proc/mounts | grep ext4
/dev/sdb / ext4 rw,relatime,discard,errors=remount-ro,data=ordered 0 0

In another file belonging to the filesystem, however, the full set of options is visible:

$ cat /proc/fs/ext4/sdb/options    
rw
bsddf
nogrpid
block_validity
dioread_nolock
discard
delalloc
nowarn_on_error
journal_checksum
barrier
auto_da_alloc
user_xattr
acl
noquota
resuid=0
resgid=0
errors=remount-ro
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0
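
To make the gap between the two views concrete, the two outputs above can be compared as sets. This is a minimal sketch using the sample output from this post; the helper names are my own.

```python
MOUNTS_LINE = "/dev/sdb / ext4 rw,relatime,discard,errors=remount-ro,data=ordered 0 0"

# Contents of /proc/fs/ext4/sdb/options as shown above (one option per line
# in the real file; whitespace-separated here to keep the example short).
EXT4_OPTIONS = """
rw bsddf nogrpid block_validity dioread_nolock discard delalloc
nowarn_on_error journal_checksum barrier auto_da_alloc user_xattr acl
noquota resuid=0 resgid=0 errors=remount-ro commit=5 min_batch_time=0
max_batch_time=15000 stripe=0 data=ordered inode_readahead_blks=32
init_itable=10 max_dir_size_kb=0
"""


def options_from_mounts_line(line: str) -> set:
    """The fourth field of a /proc/mounts line is the comma-separated option list."""
    return set(line.split()[3].split(","))


def options_from_ext4_file(text: str) -> set:
    """The ext4 options file lists one option per line; any whitespace works here."""
    return set(text.split())


shown = options_from_mounts_line(MOUNTS_LINE)
full = options_from_ext4_file(EXT4_OPTIONS)

hidden_defaults = sorted(full - shown)  # in effect, but not printed in /proc/mounts
vfs_only = sorted(shown - full)         # printed by the VFS layer, not by ext4

print("hidden in /proc/mounts:", hidden_defaults)
print("VFS-level only:", vfs_only)
```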

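With this question in mind, I went through the 6.1.36 kernel source to work out how a filesystem exposes mount information through procfs, and how mount options are handled when a filesystem is created and mounted.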

Data in procfs

/proc/{pid}/mounts

With mount namespaces, mount points can be isolated between processes and are no longer globally consistent. So /proc/mounts is actually a symbolic link to /proc/self/mounts.

$ ls -l /proc/mounts
lrwxrwxrwx 1 root root 11 Jul 30 05:31 /proc/mounts -> self/mounts

The function that renders it is show_vfsmnt:

// fs/proc_namespace.c
static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
{
	struct proc_mounts *p = m->private;
	struct mount *r = real_mount(mnt);
	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
	struct super_block *sb = mnt_path.dentry->d_sb;
	int err;

	if (sb->s_op->show_devname) {
		err = sb->s_op->show_devname(m, mnt_path.dentry);
		if (err)
			goto out;
	} else {
		mangle(m, r->mnt_devname ? r->mnt_devname : "none");
	}
	seq_putc(m, ' ');
	/* mountpoints outside of chroot jail will give SEQ_SKIP on this */
	err = seq_path_root(m, &mnt_path, &p->root, " \t\n\\");
	if (err)
		goto out;
	seq_putc(m, ' ');
	show_type(m, sb);
	seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw");
	err = show_sb_opts(m, sb);
	if (err)
		goto out;
	show_mnt_opts(m, mnt);
	if (sb->s_op->show_options)
		err = sb->s_op->show_options(m, mnt_path.dentry);
	seq_puts(m, " 0 0\n");
out:
	return err;
}

Three functions mainly output the options: show_sb_opts(), show_mnt_opts(), and sb->s_op->show_options. Of these, show_sb_opts() and show_mnt_opts() print a small set of generic VFS-level options.

// fs/proc_namespace.c
static int show_sb_opts(struct seq_file *m, struct super_block *sb)
{
	static const struct proc_fs_opts fs_opts[] = {
		{ SB_SYNCHRONOUS, ",sync" },
		{ SB_DIRSYNC, ",dirsync" },
		{ SB_MANDLOCK, ",mand" },
		{ SB_LAZYTIME, ",lazytime" },
		{ 0, NULL }
	};
	const struct proc_fs_opts *fs_infop;

	for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
		if (sb->s_flags & fs_infop->flag)
			seq_puts(m, fs_infop->str);
	}

	return security_sb_show_options(m, sb);
}

static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{
	static const struct proc_fs_opts mnt_opts[] = {
		{ MNT_NOSUID, ",nosuid" },
		{ MNT_NODEV, ",nodev" },
		{ MNT_NOEXEC, ",noexec" },
		{ MNT_NOATIME, ",noatime" },
		{ MNT_NODIRATIME, ",nodiratime" },
		{ MNT_RELATIME, ",relatime" },
		{ MNT_NOSYMFOLLOW, ",nosymfollow" },
		{ 0, NULL }
	};
	const struct proc_fs_opts *fs_infop;

	for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
		if (mnt->mnt_flags & fs_infop->flag)
			seq_puts(m, fs_infop->str);
	}

	if (mnt_user_ns(mnt) != &init_user_ns)
		seq_puts(m, ",idmapped");
}
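
The table-driven pattern used by show_sb_opts() and show_mnt_opts() is easy to model in userspace. In the sketch below the flag values are written from my memory of the kernel headers and should be treated as illustrative. The point is how show_vfsmnt assembles the line from the device, mount point, type, ro/rw, the two flag tables, and the filesystem's own option string.

```python
# (flag bit, text) tables in the style of the kernel's proc_fs_opts arrays.
# The numeric values are illustrative and not verified against a given kernel.
SB_OPTS = [
    (1 << 4, ",sync"),
    (1 << 7, ",dirsync"),
    (1 << 6, ",mand"),
    (1 << 25, ",lazytime"),
]
MNT_OPTS = [
    (0x01, ",nosuid"),
    (0x02, ",nodev"),
    (0x04, ",noexec"),
    (0x08, ",noatime"),
    (0x10, ",nodiratime"),
    (0x20, ",relatime"),
    (0x80, ",nosymfollow"),
]


def show_flags(flags: int, table) -> str:
    """Concatenate the text of every table entry whose bit is set in flags."""
    return "".join(text for bit, text in table if flags & bit)


def mounts_line(devname, mountpoint, fstype, readonly, sb_flags, mnt_flags, fs_options):
    """Assemble one /proc/mounts line in the same order as show_vfsmnt."""
    return (
        f"{devname} {mountpoint} {fstype} "
        + ("ro" if readonly else "rw")
        + show_flags(sb_flags, SB_OPTS)    # show_sb_opts()
        + show_flags(mnt_flags, MNT_OPTS)  # show_mnt_opts()
        + fs_options                       # sb->s_op->show_options()
        + " 0 0"
    )


line = mounts_line("/dev/sdb", "/", "ext4", False, 0, 0x20,
                   ",discard,errors=remount-ro,data=ordered")
print(line)
```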

For the ext4 filesystem, sb->s_op->show_options ends up in _ext4_show_options.

// fs/ext4/super.c
static const struct super_operations ext4_sops = {
    // ...
	.show_options	= ext4_show_options,
    // ...
};

static int ext4_show_options(struct seq_file *seq, struct dentry *root)
{
	return _ext4_show_options(seq, root->d_sb, 0);
}

/*
 * Show an option if
 *  - it's set to a non-default value OR
 *  - if the per-sb default is different from the global default
 */
static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
			      int nodefs)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_super_block *es = sbi->s_es;
	int def_errors, def_mount_opt = sbi->s_def_mount_opt;
	const struct mount_opts *m;
	char sep = nodefs ? '\n' : ',';

#define SEQ_OPTS_PUTS(str) seq_printf(seq, "%c" str, sep)
#define SEQ_OPTS_PRINT(str, arg) seq_printf(seq, "%c" str, sep, arg)

	if (sbi->s_sb_block != 1)
		SEQ_OPTS_PRINT("sb=%llu", sbi->s_sb_block);

	for (m = ext4_mount_opts; m->token != Opt_err; m++) {
		int want_set = m->flags & MOPT_SET;
		if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
		    (m->flags & MOPT_CLEAR_ERR) || m->flags & MOPT_SKIP)
			continue;
		if (!nodefs && !(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
			continue; /* skip if same as the default */
		if ((want_set &&
		     (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
		    (!want_set && (sbi->s_mount_opt & m->mount_opt)))
			continue; /* select Opt_noFoo vs Opt_Foo */
		SEQ_OPTS_PRINT("%s", token2str(m->token));
	}

	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
		SEQ_OPTS_PRINT("resuid=%u",
				from_kuid_munged(&init_user_ns, sbi->s_resuid));
	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
		SEQ_OPTS_PRINT("resgid=%u",
				from_kgid_munged(&init_user_ns, sbi->s_resgid));
	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
	if (test_opt(sb, ERRORS_RO) && def_errors != EXT4_ERRORS_RO)
		SEQ_OPTS_PUTS("errors=remount-ro");
	if (test_opt(sb, ERRORS_CONT) && def_errors != EXT4_ERRORS_CONTINUE)
		SEQ_OPTS_PUTS("errors=continue");
	if (test_opt(sb, ERRORS_PANIC) && def_errors != EXT4_ERRORS_PANIC)
		SEQ_OPTS_PUTS("errors=panic");
	if (nodefs || sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ)
		SEQ_OPTS_PRINT("commit=%lu", sbi->s_commit_interval / HZ);
	if (nodefs || sbi->s_min_batch_time != EXT4_DEF_MIN_BATCH_TIME)
		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
	if (sb->s_flags & SB_I_VERSION)
		SEQ_OPTS_PUTS("i_version");
	if (nodefs || sbi->s_stripe)
		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
			(sbi->s_mount_opt ^ def_mount_opt)) {
		if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
			SEQ_OPTS_PUTS("data=journal");
		else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
			SEQ_OPTS_PUTS("data=ordered");
		else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
			SEQ_OPTS_PUTS("data=writeback");
	}
	if (nodefs ||
	    sbi->s_inode_readahead_blks != EXT4_DEF_INODE_READAHEAD_BLKS)
		SEQ_OPTS_PRINT("inode_readahead_blks=%u",
			       sbi->s_inode_readahead_blks);

	if (test_opt(sb, INIT_INODE_TABLE) && (nodefs ||
		       (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
		SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
	if (nodefs || sbi->s_max_dir_size_kb)
		SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
	if (test_opt(sb, DATA_ERR_ABORT))
		SEQ_OPTS_PUTS("data_err=abort");

	fscrypt_show_test_dummy_encryption(seq, sep, sb);

	if (sb->s_flags & SB_INLINECRYPT)
		SEQ_OPTS_PUTS("inlinecrypt");

	if (test_opt(sb, DAX_ALWAYS)) {
		if (IS_EXT2_SB(sb))
			SEQ_OPTS_PUTS("dax");
		else
			SEQ_OPTS_PUTS("dax=always");
	} else if (test_opt2(sb, DAX_NEVER)) {
		SEQ_OPTS_PUTS("dax=never");
	} else if (test_opt2(sb, DAX_INODE)) {
		SEQ_OPTS_PUTS("dax=inode");
	}
	ext4_show_quota_options(seq, sb);
	return 0;
}

/proc/fs/ext4/{device}/options

When an ext4 filesystem is mounted, __ext4_fill_super() calls ext4_register_sysfs() to register the procfs entry:

// fs/ext4/sysfs.c
int ext4_register_sysfs(struct super_block *sb)
{
	// ...
	if (sbi->s_proc) {
		proc_create_single_data("options", S_IRUGO, sbi->s_proc,
				ext4_seq_options_show, sb);
		// ...
	}
	return 0;
}

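The function that renders the data is ext4_seq_options_show(), which also ends up calling _ext4_show_options(), but with the last argument nodefs set to 1, which makes the output differ from /proc/{pid}/mounts.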

// fs/ext4/super.c
int ext4_seq_options_show(struct seq_file *seq, void *offset)
{
	struct super_block *sb = seq->private;
	int rc;

	seq_puts(seq, sb_rdonly(sb) ? "ro" : "rw");
	rc = _ext4_show_options(seq, sb, 1);
	seq_puts(seq, "\n");
	return rc;
}

nodefs causes two differences: it decides whether sep is ',' or '\n', and when nodefs is 0 some options are omitted from the output.

static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
			      int nodefs)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_super_block *es = sbi->s_es;
	int def_errors, def_mount_opt = sbi->s_def_mount_opt;
	const struct mount_opts *m;
	char sep = nodefs ? '\n' : ',';

	if (sbi->s_sb_block != 1)
		SEQ_OPTS_PRINT("sb=%llu", sbi->s_sb_block);

	for (m = ext4_mount_opts; m->token != Opt_err; m++) {
		int want_set = m->flags & MOPT_SET;
		if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
		    m->flags & MOPT_SKIP)
			continue;
		if (!nodefs && !(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
			continue; /* skip if same as the default */
		if ((want_set &&
		     (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
		    (!want_set && (sbi->s_mount_opt & m->mount_opt)))
			continue; /* select Opt_noFoo vs Opt_Foo */
		SEQ_OPTS_PRINT("%s", token2str(m->token));
	}

	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
		SEQ_OPTS_PRINT("resuid=%u",
				from_kuid_munged(&init_user_ns, sbi->s_resuid));
	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
		SEQ_OPTS_PRINT("resgid=%u",
				from_kgid_munged(&init_user_ns, sbi->s_resgid));
	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
	if (test_opt(sb, ERRORS_RO) && def_errors != EXT4_ERRORS_RO)
		SEQ_OPTS_PUTS("errors=remount-ro");
	// ...
	if (nodefs || sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ)
		SEQ_OPTS_PRINT("commit=%lu", sbi->s_commit_interval / HZ);
	if (nodefs || sbi->s_min_batch_time != EXT4_DEF_MIN_BATCH_TIME)
		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
	if (nodefs || sbi->s_stripe)
		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
			(sbi->s_mount_opt ^ def_mount_opt)) {
		if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
			SEQ_OPTS_PUTS("data=journal");
		else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
			SEQ_OPTS_PUTS("data=ordered");
		else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
			SEQ_OPTS_PUTS("data=writeback");
	}
	if (nodefs ||
	    sbi->s_inode_readahead_blks != EXT4_DEF_INODE_READAHEAD_BLKS)
		SEQ_OPTS_PRINT("inode_readahead_blks=%u",
			       sbi->s_inode_readahead_blks);

	if (test_opt(sb, INIT_INODE_TABLE) && (nodefs ||
		       (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
		SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
	if (nodefs || sbi->s_max_dir_size_kb)
		SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
	// ...
}

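So it looks like the output differs mainly because nodefs is 1 in one case and 0 in the other. In the logic below, an option is skipped if it is a default option and nodefs is 0.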

	if (!nodefs && !(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
		continue; /* skip if same as the default */
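
The XOR test is the heart of the difference, so here is a small userspace model of the loop. The bit values and the four-entry option table are hypothetical; only the two skip conditions follow the kernel code quoted above.

```python
from typing import NamedTuple


class MountOpt(NamedTuple):
    name: str
    bit: int
    want_set: bool  # True for "foo" (MOPT_SET), False for "nofoo" (MOPT_CLEAR)


# Hypothetical bit values, for illustration only.
DELALLOC = 0x1
DISCARD = 0x2

EXT4_MOUNT_OPTS = [
    MountOpt("delalloc", DELALLOC, True),
    MountOpt("nodelalloc", DELALLOC, False),
    MountOpt("discard", DISCARD, True),
    MountOpt("nodiscard", DISCARD, False),
]


def show_options(s_mount_opt: int, def_mount_opt: int, nodefs: bool) -> list:
    shown = []
    for m in EXT4_MOUNT_OPTS:
        differs_from_default = m.bit & (s_mount_opt ^ def_mount_opt)
        if not nodefs and not differs_from_default:
            continue  # skip if same as the default
        is_set = bool(s_mount_opt & m.bit)
        if is_set != m.want_set:
            continue  # select "nofoo" vs "foo"
        shown.append(m.name)
    return shown


defaults = DELALLOC            # what s_def_mount_opt recorded
current = DELALLOC | DISCARD   # discard was added at mount time

print(show_options(current, defaults, nodefs=False))  # like /proc/mounts
print(show_options(current, defaults, nodefs=True))   # like /proc/fs/ext4/<dev>/options
```

With nodefs=False only discard is printed, because delalloc matches the recorded default; with nodefs=True both appear.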

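The default options here should be the ones described in the documentation. I assumed at first that this was the s_default_mount_opts field of the on-disk super block, so I checked it with tune2fs and found that it is not.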

$ sudo tune2fs -l /dev/sdb
tune2fs 1.46.5 (30-Dec-2021)
// ...
Default mount options:    user_xattr acl
// ...

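Only user_xattr and acl are listed here, so in theory the other options, such as delalloc, should also be printed when nodefs is 0. But they are not. So we still need to find out how the sbi->s_def_mount_opt field is actually set.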

Default mount options

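Anyone familiar with Linux filesystems knows that the VFS has a generic super block and that each filesystem also has its own super block, and that the on-disk and in-memory forms differ somewhat. For ext4, the on-disk super block layout is: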

// fs/ext4/ext4.h
/*
 * Structure of the super block
 */
struct ext4_super_block {
/*00*/	__le32	s_inodes_count;		/* Inodes count */
	__le32	s_blocks_count_lo;	/* Blocks count */
	__le32	s_r_blocks_count_lo;	/* Reserved blocks count */
	__le32	s_free_blocks_count_lo;	/* Free blocks count */
/*10*/	__le32	s_free_inodes_count;	/* Free inodes count */
	__le32	s_first_data_block;	/* First Data Block */
	__le32	s_log_block_size;	/* Block size */
	__le32	s_log_cluster_size;	/* Allocation cluster size */
/*20*/	__le32	s_blocks_per_group;	/* # Blocks per group */
	__le32	s_clusters_per_group;	/* # Clusters per group */
	__le32	s_inodes_per_group;	/* # Inodes per group */
	__le32	s_mtime;		/* Mount time */
/*30*/	__le32	s_wtime;		/* Write time */
	__le16	s_mnt_count;		/* Mount count */
	__le16	s_max_mnt_count;	/* Maximal mount count */
	__le16	s_magic;		/* Magic signature */
	__le16	s_state;		/* File system state */
	__le16	s_errors;		/* Behaviour when detecting errors */
	__le16	s_minor_rev_level;	/* minor revision level */
/*40*/	__le32	s_lastcheck;		/* time of last check */
	__le32	s_checkinterval;	/* max. time between checks */
	__le32	s_creator_os;		/* OS */
	__le32	s_rev_level;		/* Revision level */
/*50*/	__le16	s_def_resuid;		/* Default uid for reserved blocks */
	__le16	s_def_resgid;		/* Default gid for reserved blocks */
	/*
	 * These fields are for EXT4_DYNAMIC_REV superblocks only.
	 *
	 * Note: the difference between the compatible feature set and
	 * the incompatible feature set is that if there is a bit set
	 * in the incompatible feature set that the kernel doesn't
	 * know about, it should refuse to mount the filesystem.
	 *
	 * e2fsck's requirements are more strict; if it doesn't know
	 * about a feature in either the compatible or incompatible
	 * feature set, it must abort and not try to meddle with
	 * things it doesn't understand...
	 */
	__le32	s_first_ino;		/* First non-reserved inode */
	__le16  s_inode_size;		/* size of inode structure */
	__le16	s_block_group_nr;	/* block group # of this superblock */
	__le32	s_feature_compat;	/* compatible feature set */
/*60*/	__le32	s_feature_incompat;	/* incompatible feature set */
	__le32	s_feature_ro_compat;	/* readonly-compatible feature set */
/*68*/	__u8	s_uuid[16];		/* 128-bit uuid for volume */
/*78*/	char	s_volume_name[EXT4_LABEL_MAX];	/* volume name */
/*88*/	char	s_last_mounted[64] __nonstring;	/* directory where last mounted */
/*C8*/	__le32	s_algorithm_usage_bitmap; /* For compression */
	/*
	 * Performance hints.  Directory preallocation should only
	 * happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.
	 */
	__u8	s_prealloc_blocks;	/* Nr of blocks to try to preallocate*/
	__u8	s_prealloc_dir_blocks;	/* Nr to preallocate for dirs */
	__le16	s_reserved_gdt_blocks;	/* Per group desc for online growth */
	/*
	 * Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set.
	 */
/*D0*/	__u8	s_journal_uuid[16];	/* uuid of journal superblock */
/*E0*/	__le32	s_journal_inum;		/* inode number of journal file */
	__le32	s_journal_dev;		/* device number of journal file */
	__le32	s_last_orphan;		/* start of list of inodes to delete */
	__le32	s_hash_seed[4];		/* HTREE hash seed */
	__u8	s_def_hash_version;	/* Default hash version to use */
	__u8	s_jnl_backup_type;
	__le16  s_desc_size;		/* size of group descriptor */
/*100*/	__le32	s_default_mount_opts;
	__le32	s_first_meta_bg;	/* First metablock block group */
	__le32	s_mkfs_time;		/* When the filesystem was created */
	__le32	s_jnl_blocks[17];	/* Backup of the journal inode */
	/* 64bit support valid if EXT4_FEATURE_COMPAT_64BIT */
/*150*/	__le32	s_blocks_count_hi;	/* Blocks count */
	__le32	s_r_blocks_count_hi;	/* Reserved blocks count */
	__le32	s_free_blocks_count_hi;	/* Free blocks count */
	__le16	s_min_extra_isize;	/* All inodes have at least # bytes */
	__le16	s_want_extra_isize; 	/* New inodes should reserve # bytes */
	__le32	s_flags;		/* Miscellaneous flags */
	__le16  s_raid_stride;		/* RAID stride */
	__le16  s_mmp_update_interval;  /* # seconds to wait in MMP checking */
	__le64  s_mmp_block;            /* Block for multi-mount protection */
	__le32  s_raid_stripe_width;    /* blocks on all data disks (N*stride)*/
	__u8	s_log_groups_per_flex;  /* FLEX_BG group size */
	__u8	s_checksum_type;	/* metadata checksum algorithm used */
	__u8	s_encryption_level;	/* versioning level for encryption */
	__u8	s_reserved_pad;		/* Padding to next 32bits */
	__le64	s_kbytes_written;	/* nr of lifetime kilobytes written */
	__le32	s_snapshot_inum;	/* Inode number of active snapshot */
	__le32	s_snapshot_id;		/* sequential ID of active snapshot */
	__le64	s_snapshot_r_blocks_count; /* reserved blocks for active
					      snapshot's future use */
	__le32	s_snapshot_list;	/* inode number of the head of the
					   on-disk snapshot list */
#define EXT4_S_ERR_START offsetof(struct ext4_super_block, s_error_count)
	__le32	s_error_count;		/* number of fs errors */
	__le32	s_first_error_time;	/* first time an error happened */
	__le32	s_first_error_ino;	/* inode involved in first error */
	__le64	s_first_error_block;	/* block involved of first error */
	__u8	s_first_error_func[32] __nonstring;	/* function where the error happened */
	__le32	s_first_error_line;	/* line number where error happened */
	__le32	s_last_error_time;	/* most recent time of an error */
	__le32	s_last_error_ino;	/* inode involved in last error */
	__le32	s_last_error_line;	/* line number where error happened */
	__le64	s_last_error_block;	/* block involved of last error */
	__u8	s_last_error_func[32] __nonstring;	/* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
	__u8	s_mount_opts[64];
	__le32	s_usr_quota_inum;	/* inode for tracking user quota */
	__le32	s_grp_quota_inum;	/* inode for tracking group quota */
	__le32	s_overhead_clusters;	/* overhead blocks/clusters in fs */
	__le32	s_backup_bgs[2];	/* groups with sparse_super2 SBs */
	__u8	s_encrypt_algos[4];	/* Encryption algorithms in use  */
	__u8	s_encrypt_pw_salt[16];	/* Salt used for string2key algorithm */
	__le32	s_lpf_ino;		/* Location of the lost+found inode */
	__le32	s_prj_quota_inum;	/* inode for tracking project quota */
	__le32	s_checksum_seed;	/* crc32c(uuid) if csum_seed set */
	__u8	s_wtime_hi;
	__u8	s_mtime_hi;
	__u8	s_mkfs_time_hi;
	__u8	s_lastcheck_hi;
	__u8	s_first_error_time_hi;
	__u8	s_last_error_time_hi;
	__u8	s_first_error_errcode;
	__u8    s_last_error_errcode;
	__le16  s_encoding;		/* Filename charset encoding */
	__le16  s_encoding_flags;	/* Filename charset encoding flags */
	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
	__le32	s_reserved[94];		/* Padding to the end of the block */
	__le32	s_checksum;		/* crc32c(superblock) */
};

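It has a field s_default_mount_opts, which is exactly what tune2fs shows as "Default mount options". The value is stored persistently on disk; it is usually written when mkfs creates the filesystem and can also be changed with tune2fs.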

The default values it allows are:

// fs/ext4/ext4.h
/*
 * Default mount options
 */
#define EXT4_DEFM_DEBUG		0x0001
#define EXT4_DEFM_BSDGROUPS	0x0002
#define EXT4_DEFM_XATTR_USER	0x0004
#define EXT4_DEFM_ACL		0x0008
#define EXT4_DEFM_UID16		0x0010
#define EXT4_DEFM_JMODE		0x0060
#define EXT4_DEFM_JMODE_DATA	0x0020
#define EXT4_DEFM_JMODE_ORDERED	0x0040
#define EXT4_DEFM_JMODE_WBACK	0x0060
#define EXT4_DEFM_NOBARRIER	0x0100
#define EXT4_DEFM_BLOCK_VALIDITY 0x0200
#define EXT4_DEFM_DISCARD	0x0400
#define EXT4_DEFM_NODELALLOC	0x0800
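
As a sketch of how this bitmask is read, the decoder below turns an s_default_mount_opts value into option names. The numeric values come from the defines above; the names are chosen to resemble what tune2fs prints and are an assumption on my part. Note that the journaling mode is a two-bit field, not an independent flag.

```python
EXT4_DEFM_JMODE = 0x0060  # two-bit field selecting the data journaling mode

FLAG_NAMES = [
    (0x0001, "debug"),
    (0x0002, "bsdgroups"),
    (0x0004, "user_xattr"),
    (0x0008, "acl"),
    (0x0010, "uid16"),
    (0x0100, "nobarrier"),
    (0x0200, "block_validity"),
    (0x0400, "discard"),
    (0x0800, "nodelalloc"),
]

JMODE_NAMES = {
    0x0020: "journal_data",
    0x0040: "journal_data_ordered",
    0x0060: "journal_data_writeback",
}


def decode_default_mount_opts(value: int) -> list:
    """Decode an on-disk s_default_mount_opts value into option names."""
    names = [name for bit, name in FLAG_NAMES if value & bit]
    jmode = value & EXT4_DEFM_JMODE
    if jmode:
        names.append(JMODE_NAMES[jmode])
    return names


# 0x000c matches the "user_xattr acl" shown by tune2fs earlier in this post.
print(decode_default_mount_opts(0x000C))
```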

mkfs can use a configuration file to set the value of the default_mntopts field in the super block of a newly created filesystem:

// /etc/mke2fs.conf
[defaults]
        base_features = sparse_super,large_file,filetype,resize_inode,dir_index,ext_attr
        default_mntopts = acl,user_xattr
        enable_periodic_fsck = 0
        blocksize = 4096
        inode_size = 256
        inode_ratio = 16384

[fs_types]
        ext3 = {
                features = has_journal
        }
        ext4 = {
                features = has_journal,extent,huge_file,flex_bg,metadata_csum,64bit,dir_nlink,extra_isize
        }
        small = {
                inode_ratio = 4096
        }
        floppy = {
                inode_ratio = 8192
        }
        big = {
                inode_ratio = 32768
        }
        huge = {
                inode_ratio = 65536
        }
        news = {
                inode_ratio = 4096
        }
        largefile = {
                inode_ratio = 1048576
                blocksize = -1
        }
        largefile4 = {
                inode_ratio = 4194304
                blocksize = -1
        }
        hurd = {
             blocksize = 4096
             inode_size = 128
             warn_y2038_dates = 0
        }

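It is now clear that the s_def_mount_opt field of the in-memory superblock and the s_default_mount_opts field of the on-disk super block do not correspond one-to-one: the kernel sets ext4_sb_info -> s_def_mount_opt in its mount code. Interestingly, EXT4_DEFM_NODELALLOC can be set in the on-disk superblock to change the default-delalloc behavior at mount time.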

Next I traced how the kernel sets mount options: sys_mount() -> do_mount() -> path_mount() -> do_new_mount() -> vfs_get_tree() -> ext4_get_tree() -> get_tree_bdev() -> ext4_fill_super() -> __ext4_fill_super().

// fs/ext4/super.c
static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
{
	struct ext4_super_block *es = NULL;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct flex_groups **flex_groups;
	ext4_fsblk_t block;
	struct inode *root;
	struct ext4_fs_context *ctx = fc->fs_private;

	// Load sbi->s_es; es points to the super block data in its on-disk layout
	err = ext4_load_super(sb, &logical_sb_block, silent);

	es = sbi->s_es;

	// Parse es->s_default_mount_opts (a __le32 bitmask): the default mount
	// options that can be set at mkfs time
	ext4_set_def_opts(sb, es);

	// Parse es->s_mount_opts (this field is a string of mount options)
	err = parse_apply_sb_mount_options(sb, ctx);
	if (err < 0)
		goto failed_mount;

	// This assignment is the key: every option parsed so far is recorded in s_def_mount_opt
	sbi->s_def_mount_opt = sbi->s_mount_opt;

	// Apply the options passed as mount arguments; these are no longer recorded in s_def_mount_opt
	ext4_apply_options(fc, sb);

	// sbi->s_def_mount_opt may also be set here
	if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
		err = ext4_load_and_init_journal(sb, es, ctx);
		if (err)
			goto failed_mount3a;
	}
}

// Set sbi->s_mount_opt based on es->s_default_mount_opts
static void ext4_set_def_opts(struct super_block *sb,
			      struct ext4_super_block *es)
{
	unsigned long def_mount_opts;

	/* Set defaults before we parse the mount options */
	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
	set_opt(sb, INIT_INODE_TABLE);
	if (def_mount_opts & EXT4_DEFM_DEBUG)
		set_opt(sb, DEBUG);
	if (def_mount_opts & EXT4_DEFM_BSDGROUPS)
		set_opt(sb, GRPID);
	if (def_mount_opts & EXT4_DEFM_UID16)
		set_opt(sb, NO_UID32);
	/* xattr user namespace & acls are now defaulted on */
	set_opt(sb, XATTR_USER);
#ifdef CONFIG_EXT4_FS_POSIX_ACL
	set_opt(sb, POSIX_ACL);
#endif
	if (ext4_has_feature_fast_commit(sb))
		set_opt2(sb, JOURNAL_FAST_COMMIT);
	/* don't forget to enable journal_csum when metadata_csum is enabled. */
	if (ext4_has_metadata_csum(sb))
		set_opt(sb, JOURNAL_CHECKSUM);

	if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_DATA)
		set_opt(sb, JOURNAL_DATA);
	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_ORDERED)
		set_opt(sb, ORDERED_DATA);
	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
		set_opt(sb, WRITEBACK_DATA);

	if (le16_to_cpu(es->s_errors) == EXT4_ERRORS_PANIC)
		set_opt(sb, ERRORS_PANIC);
	else if (le16_to_cpu(es->s_errors) == EXT4_ERRORS_CONTINUE)
		set_opt(sb, ERRORS_CONT);
	else
		set_opt(sb, ERRORS_RO);
	/* block_validity enabled by default; disable with noblock_validity */
	set_opt(sb, BLOCK_VALIDITY);
	if (def_mount_opts & EXT4_DEFM_DISCARD)
		set_opt(sb, DISCARD);

	if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
		set_opt(sb, BARRIER);

	/*
	 * enable delayed allocation by default
	 * Use -o nodelalloc to turn it off
	 */
	if (!IS_EXT3_SB(sb) && !IS_EXT2_SB(sb) &&
	    ((def_mount_opts & EXT4_DEFM_NODELALLOC) == 0))
		set_opt(sb, DELALLOC);

	if (sb->s_blocksize == PAGE_SIZE)
		set_opt(sb, DIOREAD_NOLOCK);
}

// If no data journaling mode (DATA_FLAGS) has been set so far, the mode chosen here is also recorded as a default
static int ext4_load_and_init_journal(struct super_block *sb,
				      struct ext4_super_block *es,
				      struct ext4_fs_context *ctx)
{
	// ...

	/* We have now updated the journal if required, so we can
	 * validate the data journaling mode. */
	switch (test_opt(sb, DATA_FLAGS)) {
	case 0:
		/* No mode set, assume a default based on the journal
		 * capabilities: ORDERED_DATA if the journal can
		 * cope, else JOURNAL_DATA
		 */
		if (jbd2_journal_check_available_features
		    (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE)) {
			set_opt(sb, ORDERED_DATA);
			sbi->s_def_mount_opt |= EXT4_MOUNT_ORDERED_DATA;
		} else {
			set_opt(sb, JOURNAL_DATA);
			sbi->s_def_mount_opt |= EXT4_MOUNT_JOURNAL_DATA;
		}
		break;
	case EXT4_MOUNT_ORDERED_DATA:
	case EXT4_MOUNT_WRITEBACK_DATA:
		// ...
	}
	// ...
}

As we can see, during mounting sbi->s_def_mount_opt and es->s_default_mount_opts do not simply mirror each other; ext4 applies additional settings to sbi->s_def_mount_opt.
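
The ordering in __ext4_fill_super() can be condensed into a three-step model: build the defaults, snapshot them, then apply the options given at mount time. The in-memory bit values and the tiny option table below are hypothetical; the EXT4_DEFM_* values are the ones listed earlier.

```python
# Hypothetical in-memory option bits (stand-ins for EXT4_MOUNT_*).
DELALLOC, DISCARD, BARRIER = 0x1, 0x2, 0x4

# On-disk default flags, values as in fs/ext4/ext4.h quoted above.
EXT4_DEFM_NOBARRIER = 0x0100
EXT4_DEFM_DISCARD = 0x0400
EXT4_DEFM_NODELALLOC = 0x0800

# name -> (bit, turn_on); a tiny stand-in for the real option table.
USER_OPTS = {
    "delalloc": (DELALLOC, True), "nodelalloc": (DELALLOC, False),
    "discard": (DISCARD, True), "nodiscard": (DISCARD, False),
    "barrier": (BARRIER, True), "nobarrier": (BARRIER, False),
}


def fill_super(on_disk_defm: int, mount_args: list):
    """Return (s_mount_opt, s_def_mount_opt) after a simplified mount."""
    s_mount_opt = 0
    # Step 1: ext4_set_def_opts() mixes built-in defaults with on-disk flags.
    if on_disk_defm & EXT4_DEFM_DISCARD:
        s_mount_opt |= DISCARD
    if not (on_disk_defm & EXT4_DEFM_NOBARRIER):
        s_mount_opt |= BARRIER
    if not (on_disk_defm & EXT4_DEFM_NODELALLOC):
        s_mount_opt |= DELALLOC
    # Step 2: everything set so far becomes "the default".
    s_def_mount_opt = s_mount_opt
    # Step 3: options given at mount time change s_mount_opt only.
    for name in mount_args:
        bit, turn_on = USER_OPTS[name]
        s_mount_opt = (s_mount_opt | bit) if turn_on else (s_mount_opt & ~bit)
    return s_mount_opt, s_def_mount_opt


opt, default = fill_super(0, ["discard"])
print(f"differs from default: {opt ^ default:#x}")  # only the DISCARD bit
```

Only bits in opt ^ default are printed in /proc/mounts, which is why delalloc stays hidden there while a discard passed at mount time shows up.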

Summary

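Ext4 mount options come from a few places: the superblock written to disk at mkfs time, and the arguments passed at mount time. However, neither maps directly onto the in-memory mount options and default mount options; ext4 also sets options based on other information.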

To see the ext4 mount options actually in effect, rely on /proc/fs/ext4/{device}/options.

The Linux access control model and process credentials

Posted on 2021-09-01 Edited on 2023-10-29 In Linux

The Linux access control model

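The traditional Linux access control model is DAC (Discretionary Access Control). The DAC model is built on a discretionary access control policy: it allows legitimate users to access the objects the policy permits, acting as a user or as a member of a user group, while preventing unauthorized users from accessing them; some users can also grant access to objects they own to other users at their own discretion. In Linux, users and user groups correspond to user and group, and objects are shared resources such as files, directories, and IPC. For an object such as a file, rwx permissions can be set separately for different subjects. The granularity of subjects is coarse, though: permissions can only be set for the file owner, users in the same group, and other users, not for each user individually.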
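
The owner/group/other check can be sketched in a few lines. This is a simplified model: it ignores root's ability to bypass the check, ACLs, and the distinction between effective and filesystem uid. The function and parameter names are my own.

```python
def dac_allows(mode: int, file_uid: int, file_gid: int,
               uid: int, gids: set, want: str) -> bool:
    """Classic rwx check: pick exactly one class (owner, group, other)."""
    if uid == file_uid:
        shift = 6  # owner bits
    elif file_gid in gids:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    granted = (mode >> shift) & 0o7
    needed = sum({"r": 4, "w": 2, "x": 1}[c] for c in want)
    return (granted & needed) == needed


# A file with mode 0640 owned by uid 0, gid 42 (uid/gid values are made up).
print(dac_allows(0o640, 0, 42, uid=1000, gids={1000}, want="r"))      # not owner, not in group
print(dac_allows(0o640, 0, 42, uid=1000, gids={1000, 42}, want="r"))  # member of group 42
```

Only one class is consulted: an owner whose own bits deny access is refused even if the "other" bits would allow it.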

PS: SELinux introduces the MAC model, which is not covered here.

Process user ID credentials

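As is well known, every operation on Linux is carried out by a process, for example running a command in a shell. When an operation requires a permission check, the process traps into the kernel through a system call and the kernel performs the check. It follows naturally that, since the Linux DAC model controls permissions by user and group, a process must carry information about its user and group. Concretely, each process has a set of numbers representing the user IDs and group IDs it belongs to. The following focuses on user ID credentials; group IDs work and are implemented in a similar way, so they are not repeated here. These IDs are called process credentials. For user IDs there are three: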

Real user ID
Effective user ID
Saved set-user-ID

Do we need to keep three uids?

Is one ID enough?

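Some may wonder why three ids need to be kept. Would it work to keep only the ID of the user who started the process? After the current user logs in through the login process, its uid is recorded. Every program the user starts afterwards is a descendant of the login process, so as long as a child process inherits its uid credential from its parent, the uid is preserved.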

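This design is sufficient in most scenarios. But some programs have special permission needs: an ordinary user running them must get the permissions of the file's owner. For example, user passwords are stored in /etc/shadow, which ordinary users can neither read nor write. Yet the passwd program lets users change their own passwords. That is, when a user runs passwd, they must suddenly be able to modify /etc/shadow, and the program must still identify the user who started it. How can this be done? With the design above, an ordinary user running passwd gets a process with a non-zero uid, which certainly has no read or write access to /etc/shadow.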

set-uid: are two IDs enough?

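So early developers came up with the idea of adding a flag bit to the file attributes: the set-user-id bit. Sticking with the design above, when a process execs an executable with the set-user-id bit set, the process uid is changed to the file owner, and the problem of ordinary users being unable to read or write /etc/shadow is solved. But this introduces another problem: a program like passwd can no longer tell which user started it, so it does not know whose password to change. Clearly one uid per process is not enough; at least one more is needed. One records the user who ran the program, and the other records the user id actually used for permission checks. This is exactly what real-uid and effective-uid do. real-uid is the id of the user who started the process, and effective-uid is the user id actually used for permission checks. In most cases the two are the same. When a set-uid program is run, effective-uid is changed to the owner of the program file.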

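This design is still not good enough, because changing effective-uid becomes a one-way trip. What if a process needs to switch back and forth between the launching user and the file owner? Once effective-uid is set back to real-uid, the file owner's user id is lost (the process would have to go to great lengths to look up the owner in the filesystem inode of the executable it is running).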

The principle of least privilege: three IDs

This need to switch back and forth is a real one. There is the principle of least privilege (first proposed by Saltzer and Schroeder):

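Every program and every user of the system should operate with the least set of privileges necessary to complete the job.
A very important reason for limiting the security privileges your code needs to run is to reduce the damage it can cause when it is exploited by a malicious user. If your code runs with only the least privileges, it is hard for a malicious user to do damage with it. If you require users to run your code with administrator privileges, any security flaw in the code can potentially cause greater damage through a malicious user who exploits it.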

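Following the principle of least privilege, privileges should be acquired only while a critical operation is actually being performed and dropped the rest of the time. For passwd, for example, it is best to hold root privileges only while reading and writing /etc/shadow and to give them up at other times (for example, while waiting for user input).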

Specific rules

This leads to today's implementation, in which a process keeps three uids. They are initialized as follows:

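real-uid is the user who started the process.
If the process runs a set-uid program, effective-uid is the owner of the file; otherwise it is the same as real-uid, the user who started the process.
saved-uid is copied from effective-uid.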

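An ordinary unprivileged user is allowed to use certain system calls to move effective-uid back and forth between real-uid and saved-uid. While the process runs, permission checks are always based on the effective uid. A set-uid program with a sound security design switches effective-uid to the file owner only when it needs the special privileges, and keeps it as the user who started the process the rest of the time.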
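
The rules above can be simulated without touching real credentials. The sketch below models the three uids and a seteuid-like operation. It is a simplified model (no capabilities, no group IDs), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Creds:
    ruid: int  # real uid: who started the process
    euid: int  # effective uid: used for permission checks
    suid: int  # saved set-user-ID: remembers the euid set at exec time

    def seteuid(self, uid: int) -> None:
        """An unprivileged process may only pick one of its three current ids."""
        if self.euid == 0 or uid in (self.ruid, self.euid, self.suid):
            self.euid = uid
        else:
            raise PermissionError(f"cannot switch effective uid to {uid}")


def exec_program(parent: Creds, file_owner: int, setuid_bit: bool) -> Creds:
    """Apply the initialization rules when a program is exec'ed."""
    # Without the set-uid bit the effective uid is simply kept, which is
    # normally equal to the real uid.
    euid = file_owner if setuid_bit else parent.euid
    return Creds(ruid=parent.ruid, euid=euid, suid=euid)


shell = Creds(1000, 1000, 1000)                          # an ordinary user's shell
passwd = exec_program(shell, file_owner=0, setuid_bit=True)
print(passwd)

passwd.seteuid(passwd.ruid)  # drop privileges while waiting for input
print(passwd)
passwd.seteuid(passwd.suid)  # regain them just before the privileged write
print(passwd)
```

On a real system the current values can be read with os.getresuid(), and os.seteuid() performs the actual switch.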

System interfaces

Linux provides several system interfaces for modifying process credentials:

(Figure: some system calls that modify process credentials)

References:

The Linux Programming Interface, Chapter 9
https://wizardforcel.gitbooks.io/syracuse-sec-lecture-notes/content/3.html

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Posted on 2021-08-19 Edited on 2023-10-29 In Container, Paper Notes

Younge, Andrew J., et al. “A tale of two systems: Using containers to deploy HPC applications on supercomputers and clouds.” 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 2017.

container

Docker
Shifter
Charliecloud
Singularity

DevOps

Deployment workflow:

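Use Docker containers on the local machine (desktops mostly run Windows or macOS, and Docker supports both), and keep the Dockerfile and project code in a git project.
Push the project to a remote repository, and push the container image to a container registry.
Pull the code on multiple platforms (EC2, cluster, supercomputer) and run it in containers.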

environment

Image environment:
- HPCG benchmark
- Intel MPI Benchmark suite (IMB)
- base image: Centos 7, both benchmarks were built using the Intel 2017 Parallel Studio, which includes the latest Intel compilers and Intel MPI library.
  
  Pull the image:
  docker pull ajyounge/hpcg-container
Cray XC30 supercomputing platform
- hardware:
  
  Volta includes 56 compute nodes packaged in a single enclosure, with each node consisting of two Intel Ivy Bridge E5-2695v2 2.4 GHz processors (24 cores total), 64GB of memory, and a Cray Aries network interface.
- shared file system
  
  Shared file system support is provided by NFS I/O servers projected to compute nodes via Cray’s proprietary DVS storage infrastructure.
- OS: Cray Compute Node Linux (CNL ver. 5.2.UP04, based on SUSE Linux 11), Linux kernel v3.0.101
```
The kernel is too old and had to be modified before Singularity could be used. Specifically, support for loopback devices and the EXT3 file system was added.
```
  - config:
    
    Specifically, we configure Singularity to mount /opt/cray, as well as /var/opt/cray for each container instance.
    
    In order to leverage the Aries interconnect as well as advanced shared memory intra-node communication mechanisms, we dynamically link Cray’s MPI and associated libraries provided in /opt/cray directly within the container
    
    The dynamically linked libraries include:
    - Cray’s uGNI messaging interface
    - XPMEM shared memory subsystem
    - Cray PMI runtime libraries
    - uDREG registration cache
    - application placement scheduler (ALPS)
    - configure workload manager
    - some Intel Parallel Studio libraries
Amazon EC2: c3.8xlarge
- hardware:
  - cpu: Intel Xeon “Ivy-Bridge” E5-2680 v2 (2.8 GHz, 8 cores, hyperthread) x 2
  - memory: 60GB of RAM
  - disk: 2x320 GB SSDs
  - network: 10 Gb Ethernet network
- OS: RHEL7
  - config:
    Uses SR-IOV, with the ixgbevf kernel module loaded.
- Docker: v1.19

benchmark

Benchmarks are reported as the average of 10 trials for IMB and 3 trials for HPCG, with negligible run-to-run variance that is therefore not shown.

IMB

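Measures network bandwidth and latency, which correspond to the performance of communication between MPI nodes. Both fully statically linked and dynamically linked builds were tested.
- PingPong bandwidth
  
  Singularity containers linked against Cray MPI achieve the highest bandwidth, close to native. This shows that the choice of MPI library strongly affects performance, and that the build optimized for the specific machine performs best.
- PingPong Latency
  
  With Singularity linked against Cray MPI, latency is essentially the same as native with dynamic linking. The statically linked build has the lowest latency.
HPCG

Performance of an MPI program.

As the number of ranks grows, Cray's performance advantage over EC2 becomes visible; Singularity linked against Cray MPI performs close to native; linked against Intel MPI it performs even worse than a KVM virtual machine.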

Performance Evaluation of Container-based Virtualization for High Performance Computing Environments

Posted on 2021-08-18 Edited on 2023-10-29 In Container, Paper Notes

Xavier, Miguel G., et al. “Performance evaluation of container-based virtualization for high performance computing environments.” 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, 2013.

containers

LXC(Linux Container) 2.0.9
docker 17.03.0-ce, build 60ccb22
singularity 2.2.1

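Compared with the other two container technologies, Singularity deliberately leaves out some features: for example, it does not change the user at startup and does not use cgroups. Both have a positive effect on performance.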

environment

CPU model Intel(R) Xeon(R) CPU E5-2683v4 @ 2.10GHz (64-core node); Memory 164 GB DDR3-1,866 MHz, 72-bit wide bus at 14.9 GB/s on P244br and a HPE Dynamic Smart Array B140i Disk; OS Ubuntu 16.04 (64-bit) distribution was installed on the host machine.

benchmarks

Run a basic command: echo hello world
HPL

Measures CPU performance. Build environment: GNU C/C++ 5.4, OpenMPI 2.0.2.

For the HPL benchmark, the performance results depend on two main factors: the Basic Linear Algebra Subprogram (BLAS) library, and the problem size. We used in our experiments the GotoBLAS library, which is one of the best portable solutions, freely available to scientists. Searching for the problem size that can deliver peak performance is extensive; instead, we used the same problem size 10 times (10 N, 115840 Ns) for performance analysis.

BLAS library: GotoBLAS; problem size: 10 N, 115840 Ns

The LXC was not able to achieve native performance presenting an average overhead of 7.76%, Docker overhead was 2.89%, this could be probably caused by the default CPU use restrictions set on the daemon which by default each container is allowed to use a node’s CPU for a predefined amount of time. Singularity was able to achieve a better performance than native with 5.42% because is not emulating a full hardware level virtualization (only the mount namespace) paradigm and as the image itself is only a single metadata lookup this can yield in very high performance benefits.

TODO: why is Singularity faster than bare metal? Could Docker or LXC release more performance by tuning the cgroup configuration?
IOzone

Measures I/O.

We ran the benchmark with a file size of 15GB and 64KB for the record size, under two (2) scenarios. The first scenario was a totally contained filesystem (without any bind or mount volume), and the second scenario was a NFS binding from the local cluster.

Docker advanced multi-layered unification filesystem (AUFS) has it drawbacks. When an application running in a container needs to write a single new value to a file on a AUFS, it must copy on write up the file from the underlying image. The AUFS storage driver searches each image layer for the file. The search order is from top to bottom. When it is found, the entire file is copied up to the container’s top writable layer. From there, it can be opened and modified.

The reason Docker's no-bind reads and writes are generally slow is AUFS.

TODO: between sequential and random reads and writes, the relative performance of bind and no-bind is exactly reversed. Why? (Guess: it may have to do with the file system and the mounted disk.)
STREAM

Measures memory bandwidth.

Singularity performs best because no cgroup limits its resources.
MVAPICH OSU Micro-Benchmarks 5.3.2

Measures MPI communication bandwidth and latency.

These results can be explained due to different implementations of the network isolation of the virtualization systems. While Singularity container does not implement virtualized network devices, both Docker and LXC implement network namespace that provides an entire network subsystem. COS network performance degradation is caused by the extra complexity of transmit and receive packets (e.g. Daemon processes).
NAMD

Measures GPU performance.
- Environment:
  
  The performance studies were executed on a Dell PowerEdge R720, with 2*Intel(R) Xeon(R) CPU E5-2603 @ 1.80GHz (8 cores) and a NVIDIA Tesla K20M. From a system point of view, we used Ubuntu 16.04.2 (64-bit), with NVIDIA cuda 8.0 and the NVIDIA driver version 375.26.
- version:
  - Singularity 2.2.1
  - Docker 17.03.0-ce, build 60ccb22
  - LXC 2.0.9
- detail:
  
  We ran those GPU benchmarks on a Tesla K20m with “NAMD x86_64 multicore CUDA version 2017-03-16” [on the stmv dataset (1066628 Atoms)], using the 8 cores and the GPU card, without any specific additional configuration, except the use of the “gpu4singularity” code for Singularity and the “nvidia-docker” tool for Docker.
- result:
  
  Unit: days per nanosecond (days/ns). Lower is better.

source code

The authors open-sourced the scripts used to run the tests on GitHub.

https://github.com/ArangoGutierrez/containers-benchs

HPC container runtime performance overhead: At first order, there is none

Posted on 2021-08-17 Edited on 2023-10-29 In Container, Paper Notes

Torrez, Alfred, Reid Priedhorsky, and Timothy Randles. “HPC container runtime performance overhead: At first order, there is none.” (2020).

containers

Charliecloud
Shifter
Singularity

environment

LANL’s CTS-1 clusters Grizzly (1490 nodes, 128 GiB RAM/node; HPCG) and Fog (32 nodes, 256 GiB RAM/node; SysBench, STREAM, and HPCG)

Tests were run in each of the three containers and on bare metal.

benchmarks

SysBench

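CPU performance. 36 threads compute the prime numbers below 40 million.

Elapsed time is almost identical across the 4 environments.
STREAM

Memory performance. Built with STREAM_ARRAY_SIZE=2,000,000,000 and run with the Slurm argument --cpu_bind=v,core,map_cpu:23. 100 separate threads were run.

We compiled with STREAM_ARRAY_SIZE set to 2 billion to match the recommended 4× cache and pinned the process to a semi-arbitrary core using the Slurm argument --cpu_bind=v,core,map_cpu:23.

The measured bandwidth is almost identical across the 4 environments.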
HPCG(High Performance Conjugate Gradients)

We used a cube dimension of 104 and a run time of 60 seconds, all 36 cores per node, one MPI rank per core, and one thread per rank.
memory usage

To understand node memory usage with STREAM, we computed MemTotal – MemFree from /proc/meminfo, sampled at 10-second intervals.

Bare metal total node usage was a median of 50.8 MiB. Charliecloud added 1200 MiB, Shifter 16 MiB, and Singularity 37 MiB.

Charliecloud's higher memory usage is probably due to the 1.2 GiB image stored in tmpfs.
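
The MemTotal minus MemFree figure quoted above is simple to reproduce. The following is only a sketch of one possible way to compute it, not the paper's tooling: the function name used_memory_kib is made up for this illustration, and it takes any input stream holding text in /proc/meminfo format, so it can be tried on a string as well as on the real file.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns MemTotal - MemFree in KiB, parsed from text in
// /proc/meminfo format ("Key:   value kB" per line). Returns -1 if either
// field is missing.
long used_memory_kib(std::istream& meminfo) {
    long total = -1;
    long free_kib = -1;
    std::string line;
    while (std::getline(meminfo, line)) {
        std::istringstream fields(line);
        std::string key;
        long value = 0;
        if (!(fields >> key >> value)) {
            continue;  // skip lines that are not "key value ..."
        }
        if (key == "MemTotal:") {
            total = value;
        } else if (key == "MemFree:") {
            free_kib = value;
        }
    }
    if (total < 0 || free_kib < 0) {
        return -1;
    }
    return total - free_kib;
}
```

To sample a live system, one would open /proc/meminfo with std::ifstream (from <fstream>) and call the function every 10 seconds, as the paper describes.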

For HPCG, we sampled at 10-second intervals the writeable/private field of pmap(1), which reports memory consumption of individual processes. Median memory usage for all three container technologies is, to two significant figures, 0.64% lower than bare metal at 1 node, 0.53% lower at 8 nodes, 0.53–0.54% lower at 64 nodes, and 1.2% higher at 512 nodes, a minimal difference.

Implementation of member pointers under the Itanium C++ ABI

Posted on 2019-02-19 Edited on 2023-10-29 In C++

Itanium C++ ABI

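The Itanium C++ ABI is an ABI for C++. As an ABI, it gives precise rules for implementing the language, ensuring that separately compiled parts of a program can interoperate successfully. Although it was originally developed for the Itanium architecture, it is not platform-specific and can be layered portably on top of an arbitrary C ABI. It is therefore used as the standard C++ ABI for many major operating systems on all major architectures, and is implemented in many major C++ compilers, including GCC and Clang.

Put simply, on x64 Linux both GCC and Clang follow the Itanium C++ ABI. So today we look at how member pointers are implemented under it.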

pointer to data member

A pointer to data member is an offset from the base address of the class object containing it, represented as a ptrdiff_t. It has the size and alignment attributes of a ptrdiff_t. A NULL pointer is represented as -1.

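A pointer to data member is implemented as an offset within the class object and can be viewed as a value of type ptrdiff_t.

Let's look at an example:

#include <cstddef>
#include <iostream>

struct Test
{
    int a;
    char b;
    double c;
};

int main()
{
    int Test::*ptr2a = &Test::a;
    char Test::*ptr2b = &Test::b;
    double Test::*ptr2c = &Test::c;

    // Reinterpret each member pointer as the ptrdiff_t offset it stores
    std::cout << *(std::ptrdiff_t*)(&ptr2a) << std::endl;
    std::cout << *(std::ptrdiff_t*)(&ptr2b) << std::endl;
    std::cout << *(std::ptrdiff_t*)(&ptr2c) << std::endl;
}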

The output is 0, 4, 8. Taking alignment into account, these are indeed the offsets of the members.
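
The same observation can be checked against offsetof, and the -1 encoding of a null pointer to data member can be seen directly. This is a sketch that assumes an Itanium-ABI platform such as GCC or Clang on x64 Linux; the struct Sample and the helper raw_value are names made up for this illustration, and memcpy is used instead of a pointer cast only to avoid aliasing problems.

```cpp
#include <cstddef>
#include <cstring>

struct Sample {
    int a;
    char b;
    double c;
};

// Hypothetical helper: copies out the bytes of a pointer to data member and
// returns them as the ptrdiff_t that the Itanium C++ ABI says they hold.
template <typename T>
std::ptrdiff_t raw_value(T Sample::*member) {
    static_assert(sizeof(member) == sizeof(std::ptrdiff_t),
                  "expects the Itanium C++ ABI representation");
    std::ptrdiff_t value;
    std::memcpy(&value, &member, sizeof value);
    return value;
}
```

For example, raw_value(&Sample::c) should equal offsetof(Sample, c), and a null int Sample::* should yield -1, which is why offset 0 can still refer to the first member.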

pointer to member function

A pointer to member function is a pair as follows:

ptr:

For a non-virtual function, this field is a simple function pointer. (Under current base Itanium psABI conventions, that is a pointer to a GP/function address pair.) For a virtual function, it is 1 plus the virtual table offset (in bytes) of the function, represented as a ptrdiff_t. The value zero represents a NULL pointer, independent of the adjustment field value below.

adj:

The required adjustment to this, represented as a ptrdiff_t.

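A pointer to member function consists of a ptr part and an adj part. ptr has two cases, pointing to a non-virtual function or to a virtual function. adj is the adjustment to apply to this and can be viewed as a ptrdiff_t.

PS: I am not entirely sure what adj is for; my guess is that it is related to multiple inheritance. I will add more once I know; for now the focus is on ptr.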
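
The guess about multiple inheritance can be probed with a small experiment. When a member function of a second base class is converted to a pointer to member of the derived class, this has to be shifted to the base subobject before the call, and that shift is what adj stores. The sketch below assumes an Itanium-ABI compiler; Base1, Base2, Derived, MemFnRep, inspect and base2_offset are all names invented for the illustration. On x64 adj is the plain byte offset; the ARM variant of the ABI stores twice the offset and uses the low bit for the virtual flag.

```cpp
#include <cstddef>
#include <cstring>

struct Base1 { long x = 1; };
struct Base2 {
    long y = 2;
    long get_y() { return y; }
};
struct Derived : Base1, Base2 {};

// The two-word layout the ABI text describes: ptr first, then adj.
struct MemFnRep {
    std::ptrdiff_t ptr;
    std::ptrdiff_t adj;
};

// Hypothetical helper: copies out the raw ptr/adj pair of a member function pointer.
MemFnRep inspect(long (Derived::*pmf)()) {
    static_assert(sizeof(pmf) == sizeof(MemFnRep),
                  "expects the Itanium C++ ABI representation");
    MemFnRep rep;
    std::memcpy(&rep, &pmf, sizeof rep);
    return rep;
}

// Byte offset of the Base2 subobject inside a Derived object.
std::ptrdiff_t base2_offset() {
    Derived d;
    return reinterpret_cast<char*>(static_cast<Base2*>(&d)) -
           reinterpret_cast<char*>(&d);
}
```

Calling inspect(&Derived::get_y) and comparing the adj field with base2_offset() shows the relationship: get_y expects this to point at the Base2 part, so a call through a Derived object first adds adj to the object address.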

pointer to non-virtual function

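For a non-virtual function, the ptr part is simply the function address. We can use it to obtain the address of the member function and even call it directly:

#include <cstdint>
#include <iostream>

struct Test
{
    void func() {
        std::cout << this << "  Test::func() is called\n";
    }
};
int main()
{
    Test t;
    
    auto ptr2func = &Test::func;
    
    // Get the address of func (the ptr part is the first word)
    uint64_t addr = *(uint64_t*)&ptr2func;
    
    // Inline assembly, equivalent to the commented-out line below.
    // The call may clobber every caller-saved register, so list them.
    asm volatile("leaq %0, %%rdi ; callq *%1" : : "m"(t), "r"(addr)
                 : "rdi", "rax", "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11", "memory", "cc");
    // (t.*ptr2func)();
}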

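Here ptr2func is defined as a pointer to member function, and its ptr part, that is, the function address, is extracted and stored in addr. The address of t is then loaded into the rdi register to serve as the this pointer: in the x64 calling convention rdi holds the first argument of a function call, so this, the implicit first argument, goes into rdi. Finally the call instruction invokes the function through the address in addr. It prints this, with the same effect as calling (t.*ptr2func)() directly.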

pointer to virtual function

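For a virtual function, the ptr part is the function's offset in the virtual table (in bytes) plus 1. A value of 0 represents a NULL member function pointer.

So if we know where the virtual table is (the first word of the object is the virtual table pointer), we can combine it with the offset encoded in ptr to obtain the function's address and call it:

#include <cstddef>
#include <cstdint>
#include <iostream>

struct Test
{
    virtual void f1() {
        std::cout << this << "  Test::f1() is called\n";
    }
    virtual void f2() {
        std::cout << this << "  Test::f2() is called\n";
    }
};

int main() {
    Test t;

    auto ptr2f1 = &Test::f1;

    // Get the address of the virtual table
    uint8_t* vtable = *(uint8_t**)(&t);
    // Get the offset of f1 in the virtual table
    std::ptrdiff_t f1_offset = *(std::ptrdiff_t*)(&ptr2f1) - 1;
    // Get the address of f1
    uint64_t f1_addr = *(uint64_t*)(vtable + f1_offset);
    // Call it; the following two statements are equivalent.
    // The call may clobber every caller-saved register, so list them.
    asm volatile ("leaq %0, %%rdi; callq *%1" : : "m" (t), "r" (f1_addr)
                  : "rdi", "rax", "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11", "memory", "cc");
    (t.*ptr2f1)();
}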

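As you can see, we first read the virtual table address vtable from the first word of the object, then obtained f1's offset in the virtual table, f1_offset, from the ptr part of the member function pointer. Dereferencing then gives the address of f1, and finally we call it. rdi holds the this pointer, as discussed earlier. The result is equivalent to (t.*ptr2f1)().

Linux system calls on x64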

Posted on 2019-02-02 Edited on 2023-10-29 In x86

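Linux system calls on x64

Before we start: this post assumes some Linux background, including the difference between a system call and its wrapper function in the C runtime (crt). See my earlier introduction to Linux system calls on IA32, and my notes on system calls from the book Linux Kernel Development (《Linux内核设计与实现》).

As is well known, on IA32 a Linux system call is made through the int 0x80 interrupt, which goes through the interrupt descriptor table and invokes system_call(). The system call number is passed in eax; the arguments are passed in ebx, ecx, edx, esi, edi, ebp, in that order; the return value comes back in eax.

Today, the x86-64 architecture has a dedicated instruction, syscall. It does not go through the interrupt descriptor table and is faster. The system call number is passed in rax; the arguments are passed in rdi, rsi, rdx, r10, r8, r9, in that order; the return value comes back in rax.

Clearly the system call ABI changed drastically. The instruction used, the register carrying the system call number, the argument registers, the return register, and even the numbers assigned to the system calls all differ considerably between 32-bit and 64-bit. In principle a system call table is backward compatible: each update may only append new numbers, and existing numbers stay unchanged. I found an answer on Stack Exchange that explains why the table was changed between 32-bit and 64-bit: when the x86-64 architecture appeared, the ABI (argument passing, return value) was different anyway, so the kernel developers took the opportunity to make long-awaited changes and optimize cache-line usage. For example, the frequently used sys_read/sys_write/sys_open/sys_close occupy the first four system call numbers (0 to 3); sys_exit used to be near the front (number 1), but every process calls it only once, on exit, so it now has the later number 60.

Presumably for compatibility, I can still make system calls through int 0x80 on Ubuntu with kernel 4.14.0. Note that int 0x80 uses the 32-bit system call numbers and only the low 32 bits of each register, so it works only while the addresses involved fit in 32 bits. Here is the test code: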

section .data
str: db "Hello world"
str_len equ $-str

section .text
global _start
[bits 64]
_start:
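    mov eax, 4           ; system call number of sys_write (32-bit table)
    mov ebx, 1           ; first argument: int fd
    mov ecx, str         ; second argument: char *buf
    mov edx, str_len     ; third argument: size_t count
    int 0x80

    mov eax, 1           ; system call number of sys_exit (32-bit table)
    mov ebx, 0           ; first argument: int status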
    int 0x80

That said, on x86_64 Linux it is better to make system calls through syscall:

section .data
str: db "Hello world"
str_len equ $-str

section .text
global _start
[bits 64]
_start:
    mov eax, 1           ; system call number of sys_write
    mov rdi, 1           ; first argument: int fd
    mov rsi, str         ; second argument: char *buf
    mov rdx, str_len     ; third argument: size_t count
    syscall

    mov eax, 60          ; system call number of sys_exit
    mov rdi, 0           ; first argument: int status
    syscall
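
The same register convention can be exercised from C++ with GCC-style inline assembly. The wrapper below is a minimal sketch, not a replacement for the C library: raw_syscall3 is a name made up for this illustration, it handles only three arguments, and on anything other than x86-64 it simply falls back to the libc syscall() function. The clobber list reflects the fact that the syscall instruction overwrites rcx and r11.

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: issues a system call with up to three arguments.
// On x86-64 the number goes in rax and the arguments in rdi, rsi, rdx;
// the result comes back in rax (a negative errno value on failure).
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                 : "rcx", "r11", "memory");
    return ret;
#else
    // Assumption for portability of the sketch only: defer to libc elsewhere.
    return syscall(nr, a1, a2, a3);
#endif
}
```

For example, raw_syscall3(SYS_write, 1, reinterpret_cast<long>("Hello world"), 11) writes the string to standard output, mirroring the assembly listing above.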

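A bug in a multithreaded code example from TLPI

Posted on 2018-08-31 Edited on 2023-10-29 In Linux

PS: TLPI is short for the book The Linux Programming Interface.

Today I tried running a program from Chapter 30 of TLPI and kept hitting a bug at run time. The program is not hard; it mainly demonstrates the use of pthread condition variables: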

#include <pthread.h>
#include "tlpi_hdr.h"

static pthread_cond_t thread_died = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;

static int tot_threads = 0;
static int num_live = 0;

static int num_unjoined = 0;

enum tstate
{
    TS_ALIVE,
    TS_TERMINATED,
    TS_JOINED
};

static struct
{
    pthread_t tid;
    enum tstate state;
    int sleep_time;
} *thread;

static void *thread_func(void *arg)
{
    int idx = *((int *)arg);
    int s;

    sleep(thread[idx].sleep_time);
    printf("Thread %d terminating\n", idx);

    s = pthread_mutex_lock(&thread_mutex);
    if (s != 0)
    {
        errExitEN(s, "pthread_mutex_lock");
    }

    num_unjoined++;
    thread[idx].state = TS_TERMINATED;

    s = pthread_mutex_unlock(&thread_mutex);
    if (s != 0)
        errExitEN(s, "pthread_mutex_unlock");

    s = pthread_cond_signal(&thread_died);
    if (s != 0)
        errExitEN(s, "pthread_cond_signal");

    return NULL;
}

int main(int argc, char *argv[])
{
    int s, idx;

    thread = calloc(argc - 1, sizeof(*thread));
    if (thread == NULL)
        errExit("calloc");
    for (idx = 0; idx < argc-1; ++idx)
    {
        thread[idx].sleep_time = getInt(argv[idx+1], GN_NONNEG, NULL);
        thread[idx].state = TS_ALIVE;
        s = pthread_create(&thread[idx].tid, NULL, thread_func, &idx);
        if (s != 0)
            errExitEN(s, "pthread_create");
    }

    tot_threads = argc - 1;
    num_live = tot_threads;

    while (num_live > 0)
    {
        s = pthread_mutex_lock(&thread_mutex);
        if (s != 0)
            errExitEN(s, "pthread_mutex_lock");

        while (num_unjoined == 0)
        {
            s = pthread_cond_wait(&thread_died, &thread_mutex);
            if (s != 0)
                errExitEN(s, "pthread_cond_wait");
        }

        for (idx = 0; idx < tot_threads; ++idx)
        {
            if (thread[idx].state == TS_TERMINATED)
            {
                s = pthread_join(thread[idx].tid, NULL);
                if (s != 0)
                    errExitEN(s, "pthread_join");

                thread[idx].state = TS_JOINED;
                num_live--;
                num_unjoined--;

                printf("Reaped thread %d (num_live=%d)\n", idx, num_live);
            }
        }

        s = pthread_mutex_unlock(&thread_mutex);
        if (s != 0)
            errExitEN(s, "pthread_mutex_unlock");
    }

    exit(EXIT_SUCCESS);
}

Under the debugger it often ran correctly, but when run normally the failure was very consistent:

$ ./a.out 1 2 1
Thread 3 terminating
Thread 1 terminating
Thread 2 terminating

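And then it hangs…

Looking closely at the output, these lines are printed by the created threads. The program first creates the threads in a loop, passing the address of the loop index to each thread as its argument:

// in main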
for (idx = 0; idx < argc-1; ++idx)
{
    // ...
    s = pthread_create(&thread[idx].tid, NULL, thread_func, &idx);
    // ...
}

Then each thread dereferences the pointer to idx to get its index and prints it:

// in the function of each new thread; arg is the argument passed in
int idx = *((int *)arg);
// ...
printf("Thread %d terminating\n", idx);
// ...

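As you may have spotted, this is a textbook race condition: multiple threads access the same variable through a pointer with no synchronization or mutual exclusion. A new thread may not dereference the pointer until after the loop index has been incremented (which is exactly the order in which things ran on my machine).

A fairly simple fix is to cast the int index directly to void * and pass it by value.

// in main: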
for (idx = 0; idx < argc-1; ++idx)
{
    // ...
    s = pthread_create(&thread[idx].tid, NULL, thread_func, (void*)idx);
    // ...
}

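// in the function of the new thread
int idx = (int)arg;

This change looks simple but is somewhat problematic, because in the C standard conversions between integer and pointer types are implementation-defined. The following is quoted from cppreference: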

Any integer can be cast to any pointer type. Except for the null pointer constants such as NULL (which doesn’t need a cast), the result is implementation-defined, may not be correctly aligned, may not point to an object of the referenced type, and may be a trap representation.

Any pointer type can be cast to any integer type. The result is implementation-defined, even for null pointer values (they do not necessarily result in the value zero). If the result cannot be represented in the target type, the behavior is undefined (unsigned integers do not implement modulo arithmetic on a cast from pointer)

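In fact, on 64-bit x86 a pointer is 8 bytes and an int is 4 bytes, so it takes little thought to see that converting between them is unsafe.

The errata on the TLPI website also mention this problem and give two ways to avoid converting between pointers and integers.

One solution is to pass the address of the current thread[idx]. This only requires conversions between pointer types, which C permits.

s = pthread_create(&thread[idx].tid, NULL, threadFunc, &thread[idx]);

The thread function then only needs to subtract the address of the first element from the address passed in to obtain the index:

struct tinfo *tptr = arg;
int idx = tptr - thread; /* Obtain index in 'thread' array */

The other solution is to use uintptr_t instead of int. uintptr_t was introduced in C99 and is defined in the header <stdint.h>. It is an unsigned integer type capable of holding a pointer value.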
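
The uintptr_t approach might look roughly like the sketch below. It is written in C++ and is only an illustration of the idea, not the code from the TLPI errata: index_to_arg, arg_to_index and thread_func are names chosen here, and thread_func does no real work beyond decoding its argument.

```cpp
#include <cstdint>

// Hypothetical helper: packs a non-negative index into the void* argument of
// a thread start routine by going through uintptr_t.
void* index_to_arg(int idx) {
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(idx));
}

// Hypothetical helper: recovers the index inside the new thread.
int arg_to_index(void* arg) {
    return static_cast<int>(reinterpret_cast<std::uintptr_t>(arg));
}

// Start routine with the signature pthread_create() expects. The index
// arrives by value, so later changes to the loop variable in main cannot
// affect it.
void* thread_func(void* arg) {
    int idx = arg_to_index(arg);
    // ... the thread would use thread[idx] here ...
    return index_to_arg(idx);
}
```

In main the creation call would then become pthread_create(&thread[idx].tid, NULL, thread_func, index_to_arg(idx)). Each thread receives its own copy of the index, which removes the race on the shared loop variable.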

Reading notes on Linux Kernel Development (《Linux内核设计与实现》): system calls

Posted on 2018-08-23 Edited on 2023-10-29 In Linux

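Before we start: I previously put together a rough overview of how Linux system calls work on IA32 processors… Treat this post as a supplement and a review.

Communicating with the kernel

System calls provide a layer between user-space processes and hardware devices. In Linux, system calls are the only means user space has of interfacing with the kernel: apart from exceptions and traps, they are the only legal entry point into the kernel.

APIs, POSIX, and the C library

Typically, applications are programmed against an API implemented in user space rather than directly against system calls. These APIs do not map one-to-one onto system calls (some use no system call at all). A simple example: an application calls printf() in the C library, printf() calls write() in the C library, and only the C library's write() invokes the write() system call provided by the kernel.

It is easy to see that standardizing the API gives source-level portability. In the Unix world, the POSIX standard is the most widely used.

On Linux, the C library implements the main API of a Unix system, including the functions required by the standard library and the wrappers around the system call interface.

System calls

System calls (syscalls) are usually made through functions in the C library. The kernel must provide the functionality a system call is specified to deliver, but how it is implemented is not prescribed. This is the Unix philosophy of “separating mechanism and policy”.

As an example, getpid():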

SYSCALL_DEFINE0(getpid)
{
    return task_tgid_vnr(current);  // return current->tgid
}

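SYSCALL_DEFINE0 here is a macro; after expansion the code looks like this:

asmlinkage long sys_getpid(void)

asmlinkage is a gcc extension that tells the compiler to look only on the stack for this function's arguments. All system calls carry this qualifier. Second, the function returns long, for compatibility between 32-bit and 64-bit systems: a system call's return type is int in user space and long in kernel space. Finally, the sys_bar() form is the naming convention used in Linux.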

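System call numbers

Every system call in Linux is assigned a system call number. A user-space process uses this number to say which system call it wants; the process never refers to the system call by name.

The kernel keeps a list of all registered system calls in the system call table, stored in sys_call_table.

System call performance

Linux system calls are fast. One reason is that context switches are fast, with entry to and exit from the kernel optimized to be lean and efficient; another is that the system call handler and the system calls themselves are simple.

System call handler

An application signals the kernel that it wants a system call through a software interrupt: it raises an exception that makes the system switch to kernel mode and run the exception handler. On x86 the predefined software interrupt number is 128, triggered by the int $0x80 instruction. This causes a switch to kernel mode and the execution of exception vector 128, which is the system call handler, named system_call(). It is tightly tied to the hardware architecture. More recently, x86 processors added an instruction called sysenter, which provides a faster, more specialized way of trapping into the kernel to make a system call than int.

Denoting the correct system call

On x86 the system call number is passed to the kernel in the eax register. Before trapping into the kernel, user space puts the number of the desired system call in eax. Other architectures work similarly.

system_call() checks the validity of the given system call number by comparing it with NR_syscalls. If it is greater than or equal to NR_syscalls, the function returns -ENOSYS. Otherwise the corresponding system call is executed: call *sys_call_table(,%rax,8).

Because each entry in the system call table is 64 bits (8 bytes) wide, the kernel multiplies the given system call number by 8. On x86-32 systems, 4 is used instead of 8.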
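
The bounds check followed by an indexed call can be modelled in user space. The code below is a toy model rather than kernel code: demo_call_table, DEMO_NR_SYSCALLS, demo_dispatch and the two stand-in handlers are invented for this illustration, and the array index plays the role of the scaled lookup in call *sys_call_table(,%rax,8).

```cpp
#include <cerrno>
#include <cstddef>

using syscall_fn = long (*)();

// Stand-in handlers; real system calls would do actual work.
long demo_getpid() { return 42; }
long demo_getuid() { return 1000; }

// Hypothetical miniature system call table, indexed by system call number.
static const syscall_fn demo_call_table[] = {demo_getpid, demo_getuid};
static const std::size_t DEMO_NR_SYSCALLS =
    sizeof(demo_call_table) / sizeof(demo_call_table[0]);

// Mirrors the check described above: reject out-of-range numbers with
// -ENOSYS, otherwise call through the table entry.
long demo_dispatch(unsigned long nr) {
    if (nr >= DEMO_NR_SYSCALLS) {
        return -ENOSYS;
    }
    return demo_call_table[nr]();
}
```

Indexing an array of function pointers already multiplies the index by the entry size, which is the job the explicit factor of 8 (or 4 on x86-32) does in the assembly form.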

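Parameter passing

Besides the system call number, most system calls need some external arguments passed in. On x86-32 systems, ebx, ecx, edx, esi, edi hold the first five arguments, in that order. Needing six or more is rare; in that case a single register holds a pointer to the user-space memory where all the arguments are stored.

The return value is also passed to user space in a register. On x86 it is stored in eax.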

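System call implementation

Implementing a Linux system call does not require much concern for how it relates to the system call handler, so adding a system call to Linux is relatively easy.

Implementing system calls

A system call should have a clear purpose; selecting different behavior by passing different arguments is discouraged. ioctl() is a counterexample. Portability and robustness must also be kept in mind at all times.

Verifying the parameters

System calls must verify that all of their arguments are valid and legal. System calls run in kernel space, and if users could pass invalid input to the kernel unchecked, security and stability would have no guarantee.

One of the most important checks is whether a pointer is valid. The kernel must ensure that:

The memory region the pointer refers to belongs to user space.
The memory region the pointer refers to lies within the process's address space.
For a read, the memory is marked readable; for a write, writable; for execution, executable.

The kernel provides two functions that perform the required checks and copy data back and forth between kernel space and user space.

copy_to_user() writes data to user space and takes three arguments: the destination address in the process's address space, the source address in kernel space, and the number of bytes.
copy_from_user() reads data from user space. Its three arguments are analogous to those of copy_to_user().

On failure, both functions return the number of bytes they could not copy; on success they return 0. When such an error occurs, the system call returns the standard -EFAULT. Note that both functions may block, which happens on a page fault.

The last check is for valid permission. A caller can use the capable() function to check whether it has the capability to operate on the specified resource. A nonzero return value means it does; zero means it does not.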

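System call context

The kernel is in process context while executing a system call, and the current pointer refers to the current task.

In process context the kernel can sleep and can be preempted. When the system call returns, control is still in system_call(), which eventually switches back to user space and lets the user process continue.

Final steps in binding a system call

Once a system call has been written, register it as an official system call:

First, add an entry to the end of the system call table.
For each supported architecture, the system call number must be defined in <asm/unistd.h>.
The system call must be compiled into the kernel image. It is enough to put it in a relevant file under kernel/, such as sys.c, which contains all sorts of system calls.

Accessing the system call from user space

Usually system calls are supported by the C library. A user program gets access to them by including the headers and linking against the C library. But if you have only just written a system call, glibc does not support it. You can then access the system call directly through a set of macros provided by Linux itself. These macros are _syscalln(), where n ranges from 0 to 6 and stands for the number of arguments passed to the system call. Each macro takes 2+2×n parameters: the first is the return type, the second is the name of the system call, followed by the type and name of each argument in the order the system call takes them.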

For example:

// for long open(const char *filename, int flags, int mode)
#define __NR_open 5
_syscall3(long, open, const char *, filename, int, flags, int, mode)

These macros expand into a C function containing inline assembly.
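
On current systems the _syscalln() macros are, as far as I know, no longer shipped in the headers exported to user space; the usual way to reach a system call by number is the syscall() function from the C library. The sketch below shows that approach under this assumption; my_getpid and my_gettid are wrapper names made up for the illustration, while syscall(), SYS_getpid and SYS_gettid come from <unistd.h> and <sys/syscall.h>.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical wrapper: invokes getpid by system call number instead of
// through the dedicated libc wrapper.
long my_getpid() {
    return syscall(SYS_getpid);
}

// Hypothetical wrapper: gettid is a classic case of a system call that is
// commonly reached through syscall().
long my_gettid() {
    return syscall(SYS_gettid);
}
```

Unlike a raw trap into the kernel, syscall() follows the libc convention of returning -1 and setting errno on failure. A newly added system call could be invoked the same way by passing its number as the first argument.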

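Why not implement it as a system call

Linux tries hard to avoid adding a new system call every time a new abstraction appears, which keeps its system call interface clean.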