A note on a bug in a multithreaded code example from TLPI

P.S. TLPI is short for the book The Linux Programming Interface.


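Today I tried running a program from Chapter 30 of TLPI, and it kept failing at run time. The program is not hard; it mainly demonstrates how to use pthread condition variables: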

#include <pthread.h>
#include "tlpi_hdr.h"

static pthread_cond_t thread_died = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;

static int tot_threads = 0;     /* total number of threads created */
static int num_live = 0;        /* threads not yet joined by main */

static int num_unjoined = 0;    /* threads that have terminated but are not yet joined */

enum tstate
{
    TS_ALIVE,
    TS_TERMINATED,
    TS_JOINED
};

static struct
{
    pthread_t tid;
    enum tstate state;
    int sleep_time;
} *thread;

static void *thread_func(void *arg)
{
    int idx = *((int *)arg);    /* read the index through the pointer passed by main */
    int s;

    sleep(thread[idx].sleep_time);
    printf("Thread %d terminating\n", idx);

    s = pthread_mutex_lock(&thread_mutex);
    if (s != 0)
    {
        errExitEN(s, "pthread_mutex_lock");
    }

    num_unjoined++;
    thread[idx].state = TS_TERMINATED;

    s = pthread_mutex_unlock(&thread_mutex);
    if (s != 0)
        errExitEN(s, "pthread_mutex_unlock");

    /* wake up main so that it can join this thread */
    s = pthread_cond_signal(&thread_died);
    if (s != 0)
        errExitEN(s, "pthread_cond_signal");

    return NULL;
}

int main(int argc, char *argv[])
{
    int s, idx;

    thread = calloc(argc - 1, sizeof(*thread));
    if (thread == NULL)
        errExit("calloc");
    for (idx = 0; idx < argc-1; ++idx)
    {
        thread[idx].sleep_time = getInt(argv[idx+1], GN_NONNEG, NULL);
        thread[idx].state = TS_ALIVE;
        /* every thread receives the address of the same loop variable */
        s = pthread_create(&thread[idx].tid, NULL, thread_func, &idx);
        if (s != 0)
            errExitEN(s, "pthread_create");
    }

    tot_threads = argc - 1;
    num_live = tot_threads;

    while (num_live > 0)
    {
        s = pthread_mutex_lock(&thread_mutex);
        if (s != 0)
            errExitEN(s, "pthread_mutex_lock");

        /* wait until at least one thread has terminated */
        while (num_unjoined == 0)
        {
            s = pthread_cond_wait(&thread_died, &thread_mutex);
            if (s != 0)
                errExitEN(s, "pthread_cond_wait");
        }

        /* join every thread that has marked itself as terminated */
        for (idx = 0; idx < tot_threads; ++idx)
        {
            if (thread[idx].state == TS_TERMINATED)
            {
                s = pthread_join(thread[idx].tid, NULL);
                if (s != 0)
                    errExitEN(s, "pthread_join");

                thread[idx].state = TS_JOINED;
                num_live--;
                num_unjoined--;

                printf("Reaped thread %d (num_live=%d)\n", idx, num_live);
            }
        }

        s = pthread_mutex_unlock(&thread_mutex);
        if (s != 0)
            errExitEN(s, "pthread_mutex_unlock");
    }

    exit(EXIT_SUCCESS);
}

When I later stepped through it in a debugger it usually ran correctly, yet a normal run failed in the same way every time:

$ ./a.out 1 2 1
Thread 3 terminating
Thread 1 terminating
Thread 2 terminating

And then it hangs...

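Looking closely at the output, these lines are the ones printed by the newly created threads. The program first creates the threads in a loop, passing the address of the loop index to each thread as its argument: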

// in main
for (idx = 0; idx < argc-1; ++idx)
{
    // ...
    s = pthread_create(&thread[idx].tid, NULL, thread_func, &idx);
    // ...
}

Then each thread dereferences that pointer to idx to obtain its index, and prints it:

// in each new thread's function; arg is the argument that was passed in
int idx = *((int *)arg);
// ...
printf("Thread %d terminating\n", idx);
// ...

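As you may have spotted, this is a textbook race condition: every thread reads the same variable through a pointer, with no synchronization or mutual exclusion. A new thread may not dereference the pointer until after the loop index has already been incremented (which is exactly the order in which things happened on my machine). That also accounts for the hang in the run above: the printed indices are 3, 1 and 2, so no thread ever sets thread[0].state to TS_TERMINATED, main can never reap slot 0, and num_live never reaches 0. The thread that read 3 even indexes past the end of the three-element array.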
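
The race can be made visible with a small standalone sketch. It is not the book's program: `struct probe`, `read_shared` and `count_stale_reads` are names made up for this illustration, and the shared index is an `atomic_int` only so that the demo itself contains no data race in the C11 sense. The logical race (a thread reading the index after main has already changed it) is the same one described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical helper type: what one thread was given and what it saw. */
struct probe {
    atomic_int *shared_idx;   /* address of the creator's loop index */
    int observed;             /* value this thread read through it */
};

static void *read_shared(void *arg)
{
    struct probe *p = arg;
    /* By the time this runs, the creator may already have moved on. */
    p->observed = atomic_load(p->shared_idx);
    return NULL;
}

/* Creates n threads the way the buggy loop does (all of them look at one
 * shared index) and returns how many saw an index other than their own,
 * or -1 on error. */
int count_stale_reads(int n)
{
    if (n <= 0)
        return -1;

    pthread_t *tids = calloc((size_t)n, sizeof *tids);
    struct probe *probes = calloc((size_t)n, sizeof *probes);
    if (tids == NULL || probes == NULL) {
        free(tids);
        free(probes);
        return -1;
    }

    atomic_int idx;
    atomic_init(&idx, 0);
    int created = 0;
    for (int i = 0; i < n; ++i) {
        atomic_store(&idx, i);
        probes[i].shared_idx = &idx;
        if (pthread_create(&tids[i], NULL, read_shared, &probes[i]) != 0)
            break;
        created++;
    }
    atomic_store(&idx, n);    /* value of the loop variable once the loop ends */

    int stale = 0;
    for (int i = 0; i < created; ++i) {
        pthread_join(tids[i], NULL);
        /* A thread can only see its own index or a later one. */
        assert(probes[i].observed >= i && probes[i].observed <= n);
        if (probes[i].observed != i)
            stale++;
    }

    free(tids);
    free(probes);
    return created == n ? stale : -1;
}
```

The result differs from run to run; any non-zero return value means that at least one thread picked up a stale index. Build with `-pthread`.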

A fairly simple fix is to cast the int index directly to void * and pass it by value as the argument.

// in main:
for (idx = 0; idx < argc-1; ++idx)
{
    // ...
    s = pthread_create(&thread[idx].tid, NULL, thread_func, (void*)idx);
    // ...
}

// in the new thread's function
int idx = (int)arg;

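This change looks simple, but it has a problem of its own: in the C standard, the result of converting between integer and pointer types is implementation-defined. The following is quoted from cppreference: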

Any integer can be cast to any pointer type. Except for the null pointer constants such as NULL (which doesn’t need a cast), the result is implementation-defined, may not be correctly aligned, may not point to an object of the referenced type, and may be a trap representation.

Any pointer type can be cast to any integer type. The result is implementation-defined, even for null pointer values (they do not necessarily result in the value zero). If the result cannot be represented in the target type, the behavior is undefined (unsigned integers do not implement modulo arithmetic on a cast from pointer)

In fact, on 64-bit x86 a pointer occupies 8 bytes while an int occupies 4, so converting between the two is plainly unsafe.
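
The width mismatch can be checked directly. The sketch below models the truncation with unsigned types so that it stays well-defined; `survives_int_roundtrip` is a hypothetical helper written for this note, not something from TLPI, and the `static_assert` records the assumption that a pointer-sized integer is at least as wide as an int.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: uintptr_t is no narrower than unsigned int. */
static_assert(sizeof(uintptr_t) >= sizeof(unsigned int),
              "pointer-sized integers are expected to be at least int-sized");

/* Returns 1 if the pointer-sized bit pattern `bits` is unchanged after being
 * squeezed through an int-sized unsigned value, 0 if high bits were lost. */
int survives_int_roundtrip(uintptr_t bits)
{
    unsigned int narrowed = (unsigned int)bits;   /* keeps only the low bits */
    return (uintptr_t)narrowed == bits;
}
```

A small index such as 5 passes the check on any platform; a bit pattern that uses the upper half of an 8-byte pointer does not.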

The errata on the TLPI website also mention this bug, and give two ways to avoid the unsafe conversion between pointers and int.

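  • One fix is to pass the address of the current thread[idx] element, so that only a conversion between pointer types (struct pointer to void * and back) is needed, which C allows.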

    s = pthread_create(&thread[idx].tid, NULL, threadFunc, &thread[idx]);

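    The thread function then only has to subtract the address of the first element from the address it was passed to get its index (the struct needs a tag, here tinfo, for this declaration to work):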

    struct tinfo *tptr = arg;
    int idx = tptr - thread; /* Obtain index in 'thread' array */
  • The other fix is to use uintptr_t in place of int. uintptr_t was introduced in C99 and is defined in the header <stdint.h>. It is an unsigned integer type capable of holding a pointer value.
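
Both approaches from the errata can be tried side by side in a cut-down sketch. This is not the book's listing: `struct tinfo` is reduced to two fields, and `seen_idx`, `by_element_address`, `by_uintptr` and `run_with_approach` are hypothetical names introduced here. The `uintptr_t` variant assumes, as holds on mainstream platforms, that a small non-negative int survives the trip through `uintptr_t` and `void *`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the per-thread record. */
struct tinfo {
    pthread_t tid;
    int seen_idx;              /* index the thread worked out for itself */
};

static struct tinfo *thread;   /* array with one element per thread */

/* Approach 1: the argument is the address of this thread's own element. */
static void *by_element_address(void *arg)
{
    struct tinfo *tptr = arg;
    int idx = (int)(tptr - thread);   /* pointer difference gives the index */
    assert(idx >= 0);
    thread[idx].seen_idx = idx;
    return NULL;
}

/* Approach 2: the index itself travels inside the pointer-sized argument. */
static void *by_uintptr(void *arg)
{
    int idx = (int)(uintptr_t)arg;
    thread[idx].seen_idx = idx;
    return NULL;
}

/* Starts n threads with the chosen approach (1 or 2) and returns how many
 * of them recorded the correct index, or -1 on error. */
int run_with_approach(int approach, int n)
{
    if (n <= 0 || (approach != 1 && approach != 2))
        return -1;

    thread = calloc((size_t)n, sizeof *thread);
    if (thread == NULL)
        return -1;

    int created = 0;
    for (int idx = 0; idx < n; ++idx) {
        thread[idx].seen_idx = -1;
        int s = (approach == 1)
            ? pthread_create(&thread[idx].tid, NULL, by_element_address,
                             &thread[idx])
            : pthread_create(&thread[idx].tid, NULL, by_uintptr,
                             (void *)(uintptr_t)idx);
        if (s != 0)
            break;
        created++;
    }

    int correct = 0;
    for (int idx = 0; idx < created; ++idx) {
        pthread_join(thread[idx].tid, NULL);
        if (thread[idx].seen_idx == idx)
            correct++;
    }

    free(thread);
    thread = NULL;
    return created == n ? correct : -1;
}
```

With either approach every thread gets a value that is fixed at the moment pthread_create is called, so `run_with_approach(1, 4)` and `run_with_approach(2, 4)` should both report 4 correct indices. Build with `-pthread`.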