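This article describes how to use the Exit Code in a Pod's abnormal status information to narrow down the cause of a problem.
Viewing Pod abnormal status information
Run the following command to view the status of the abnormal Pod:
kubectl describe pod <pod name>
The output is similar to the following: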
Containers:
  kubedns:
    Container ID:   docker://5fb8adf9ee62afc6d3f6f3d9590041818750b392dff015d7091eaaf99cf1c945
    Image:          ccr.ccs.tencentyun.com/library/kubedns-amd64:1.14.4
    Image ID:       docker-pullable://ccr.ccs.tencentyun.com/library/kubedns-amd64@sha256:40790881bbe9ef4ae4ff7fe8b892498eecb7fe6dcc22661402f271e03f7de344
    Ports:          10053/UDP, 10053/TCP, 10055/TCP
    Host Ports:     0/UDP, 0/TCP, 0/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-dir=/kube-dns-config
      --v=2
    State:          Running
      Started:      Tue, 27 Aug 2019 10:58:49 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 27 Aug 2019 10:40:42 +0800
      Finished:     Tue, 27 Aug 2019 10:58:27 +0800
    Ready:          True
    Restart Count:  1
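In the container list of the output, the Exit Code under the Last State field is the status code from the last time the program exited. A nonzero value means the program exited abnormally, and the code can be used to analyze the cause further.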
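The same field can also be read programmatically. The following is a minimal sketch, assuming the Pod has been exported with kubectl get pod <pod name> -o json and that the JSON follows the standard Pod status layout (status.containerStatuses[].lastState.terminated.exitCode). The helper name last_exit_codes and the sample data are illustrative only and are not part of any Kubernetes client library.

```python
import json


def last_exit_codes(pod: dict) -> dict:
    """Map container name -> exit code of its last terminated state."""
    codes = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated is not None:
            codes[status["name"]] = terminated["exitCode"]
    return codes


# Trimmed, hand-written sample that mirrors the describe output above.
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "kubedns",
        "restartCount": 1,
        "state": {"running": {"startedAt": "2019-08-27T02:58:49Z"}},
        "lastState": {"terminated": {"exitCode": 255, "reason": "Error"}}
      }
    ]
  }
}
""")

print(last_exit_codes(sample))  # {'kubedns': 255}
```

Containers that have never terminated have no lastState.terminated entry, so they are simply left out of the result.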
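Exit code description
- The exit code must be in the range 0 to 255.
- 0 means the program exited normally.
- If the program exits because of an external interruption, the exit code is in the range 129 to 255. For example, the operating system delivers a signal to the program through kill -9 or ctrl+c, and the program is terminated by SIGKILL or SIGINT respectively.
- An abnormal exit caused by the program itself usually produces an exit code in the range 1 to 128. In some scenarios a program may also choose to use codes in the range 129 to 255.
- If the specified exit code is outside 0 to 255 (for example, exit(-1)), it is converted automatically, so the code that is finally reported still falls within 0 to 255.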
If the specified exit code is denoted as code, the conversion is as follows:
- When code is negative: (256 - (|code| % 256)) % 256
- When code is positive: code % 256
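The conversion can be checked with a few lines of Python. This is a small sketch, assuming the usual Linux behavior in which only the low 8 bits of the exit status are reported to the parent process; the function name normalize_exit_code is made up for this example.

```python
def normalize_exit_code(code: int) -> int:
    """Reduce an arbitrary integer to the 0-255 range reported to the parent."""
    # Python's % returns a non-negative result for a positive modulus,
    # so one expression covers both the negative and the positive case.
    return code % 256


for requested in (-1, 1, 255, 256, 257):
    print(requested, "->", normalize_exit_code(requested))
# -1 -> 255
# 1 -> 1
# 255 -> 255
# 256 -> 0
# 257 -> 1
```

Note that 256 maps to 0, so an out-of-range code can end up looking like a normal exit.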
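Common abnormal exit codes
137: the program was killed by the SIGKILL signal. Possible causes:
- Most often, a container in the Pod reached its memory limit (resources.limits), that is, it ran out of memory (OOM). Resource limits are enforced through Linux cgroups: when a container's memory usage reaches the limit, the container is forcibly killed (similar to kill -9). In this case, describe pod shows the Reason as OOMKilled.
- The host itself ran out of memory (OOM), so the kernel chose some processes to kill in order to free memory.
  Note
  Whether the process was killed because of a cgroup limit or because the node itself ran short of resources, a record can be found in the system log. On Ubuntu the system log is the file /var/log/syslog, and on CentOS it is /var/log/messages. On both systems the kernel log can also be viewed with journalctl -k.
- The livenessProbe (liveness check) failed, so kubelet stopped the container.
- The process was killed by a malicious trojan process.
1 and 255: these usually indicate a general error, and the specific cause has to be located through the container logs. For example, the program may have signaled an abnormal exit with exit(1) or exit(-1), and -1 is converted to 255 according to the conversion rules.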
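The checklist for these common codes can be turned into a first-pass triage helper. The sketch below is only an illustration of that checklist: the function name diagnose and the wording of the hints are invented here, and the reason argument is assumed to be the Reason value shown by describe pod.

```python
def diagnose(exit_code: int, reason: str = "") -> str:
    """Return a first-pass hint for a container's last exit code."""
    if exit_code == 0:
        return "normal exit"
    if exit_code == 137:
        if reason == "OOMKilled":
            return "killed by SIGKILL: container reached its memory limit"
        # Reason is not OOMKilled, so the remaining causes apply.
        return ("killed by SIGKILL: check for host OOM, a failed "
                "livenessProbe, or an external kill")
    if exit_code in (1, 255):
        return "general error: check the container logs"
    return "no hint: look up the code in the reference tables"


print(diagnose(137, "OOMKilled"))
print(diagnose(255, "Error"))
```

A hint like this only narrows the search. The system log and the container logs are still needed to confirm the actual cause.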
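Linux standard signals
When a Linux program is interrupted from outside, it receives a signal, and its exit code is the signal number plus 128. For example, SIGKILL has signal number 9, so the exit code is 9 + 128 = 137. The table below lists more standard signal numbers. Where three values are given for one signal, they apply to different architectures, and the middle one is the value on x86 and ARM (see signal(7)).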
| Signal | Value | Action | Comment |
|---|---|---|---|
| SIGHUP | 1 | Term | Hangup detected on controlling terminal or death of controlling process |
| SIGINT | 2 | Term | Interrupt from keyboard |
| SIGQUIT | 3 | Core | Quit from keyboard |
| SIGILL | 4 | Core | Illegal Instruction |
| SIGABRT | 6 | Core | Abort signal from abort(3) |
| SIGFPE | 8 | Core | Floating-point exception |
| SIGKILL | 9 | Term | Kill signal |
| SIGSEGV | 11 | Core | Invalid memory reference |
| SIGPIPE | 13 | Term | Broken pipe: write to pipe with no readers; see pipe(7) |
| SIGALRM | 14 | Term | Timer signal from alarm(2) |
| SIGTERM | 15 | Term | Termination signal |
| SIGUSR1 | 30,10,16 | Term | User-defined signal 1 |
| SIGUSR2 | 31,12,17 | Term | User-defined signal 2 |
| SIGCHLD | 20,17,18 | Ign | Child stopped or terminated |
| SIGCONT | 19,18,25 | Cont | Continue if stopped |
| SIGSTOP | 17,19,23 | Stop | Stop process |
| SIGTSTP | 18,20,24 | Stop | Stop typed at terminal |
| SIGTTIN | 21,21,26 | Stop | Terminal input for background process |
| SIGTTOU | 22,22,27 | Stop | Terminal output for background process |
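The "signal number plus 128" rule can be reversed to recover the signal from an exit code. The following sketch uses Python's standard signal module and assumes it runs on Linux, since signal numbers and the set of available signals differ between platforms; the function name signal_from_exit_code is illustrative.

```python
import signal


def signal_from_exit_code(code: int):
    """Return the signal name for an exit code of 128 + n, else None."""
    if not 129 <= code <= 255:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # No signal with this number on the current platform.
        return None


for code in (130, 137, 143):
    print(code, "->", signal_from_exit_code(code))
```

Because a program may also choose a code in the range 129 to 255 on its own, the result is a strong hint rather than proof that a signal was delivered.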
C/C++ exit codes
/usr/include/sysexits.h standardizes a set of exit codes (for C/C++ only), as shown in the following table:
| Definition | Exit code | Description |
|---|---|---|
| #define EX_OK | 0 | successful termination |
| #define EX__BASE | 64 | base value for error messages |
| #define EX_USAGE | 64 | command line usage error |
| #define EX_DATAERR | 65 | data format error |
| #define EX_NOINPUT | 66 | cannot open input |
| #define EX_NOUSER | 67 | addressee unknown |
| #define EX_NOHOST | 68 | host name unknown |
| #define EX_UNAVAILABLE | 69 | service unavailable |
| #define EX_SOFTWARE | 70 | internal software error |
| #define EX_OSERR | 71 | system error (e.g., can't fork) |
| #define EX_OSFILE | 72 | critical OS file missing |
| #define EX_CANTCREAT | 73 | can't create (user) output file |
| #define EX_IOERR | 74 | input/output error |
| #define EX_TEMPFAIL | 75 | temp failure; user is invited to retry |
| #define EX_PROTOCOL | 76 | remote error in protocol |
| #define EX_NOPERM | 77 | permission denied |
| #define EX_CONFIG | 78 | configuration error |
| #define EX__MAX | 78 | maximum listed value |
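When a container exits with a code between 64 and 78, it can be worth checking it against these definitions. The lookup below is a sketch that copies the values from the table; the names SYSEXITS and sysexits_name are invented for this example, and a match only suggests a meaning, because a program is free to use these numbers for other purposes.

```python
# Values copied from the sysexits.h table above.
SYSEXITS = {
    64: ("EX_USAGE", "command line usage error"),
    65: ("EX_DATAERR", "data format error"),
    66: ("EX_NOINPUT", "cannot open input"),
    67: ("EX_NOUSER", "addressee unknown"),
    68: ("EX_NOHOST", "host name unknown"),
    69: ("EX_UNAVAILABLE", "service unavailable"),
    70: ("EX_SOFTWARE", "internal software error"),
    71: ("EX_OSERR", "system error (e.g., can't fork)"),
    72: ("EX_OSFILE", "critical OS file missing"),
    73: ("EX_CANTCREAT", "can't create (user) output file"),
    74: ("EX_IOERR", "input/output error"),
    75: ("EX_TEMPFAIL", "temp failure; user is invited to retry"),
    76: ("EX_PROTOCOL", "remote error in protocol"),
    77: ("EX_NOPERM", "permission denied"),
    78: ("EX_CONFIG", "configuration error"),
}


def sysexits_name(code: int):
    """Return the sysexits.h macro name for a code, or None if not listed."""
    entry = SYSEXITS.get(code)
    return entry[0] if entry else None


print(sysexits_name(78))  # EX_CONFIG
```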
Exit code reference
For the meaning of additional exit codes, see the following table:
| Exit code | Meaning | Example | Comments |
|---|---|---|---|
| 1 | Catchall for general errors | let "var1 = 1/0" | Miscellaneous errors, such as "divide by zero" and other impermissible operations |
| 2 | Misuse of shell builtins (according to Bash documentation) | empty_function() {} | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
| 126 | Command invoked cannot execute | /dev/null | Permission problem or command is not an executable |
| 127 | "command not found" | illegal_command | Possible problem with $PATH or a typo |
| 128 | Invalid argument to exit | exit 3.14159 | exit takes only integer args in the range 0 - 255 |
| 128+n | Fatal error signal "n" | kill -9 $PPID of script | $? returns 137 (128 + 9) |
| 130 | Script terminated by Control-C | Ctl-C | Control-C is fatal error signal 2, (130 = 128 + 2, see above) |
| 255 | Exit status out of range | exit -1 | exit takes only integer args in the range 0 - 255 |
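Several rows of this table can be reproduced locally. The sketch below assumes a Linux system with a POSIX sh on the PATH; run_sh is a helper written for this example, and hypothetical_missing_command_xyz is a deliberately nonexistent command name.

```python
import subprocess


def run_sh(command: str) -> int:
    """Run a command line with sh -c and return the exit code sh reports."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# An explicit exit code is passed through unchanged.
print(run_sh("exit 3"))
# /dev/null exists but is not executable (the 126 row).
print(run_sh("/dev/null"))
# The command cannot be found (the 127 row).
print(run_sh("hypothetical_missing_command_xyz"))
# The inner shell kills itself with SIGKILL; the outer shell
# reports 128 + 9 and re-exits with that code (the 128+n row).
print(run_sh("sh -c 'kill -9 $$'; exit $?"))
```

The last case goes through an outer shell on purpose: Python's subprocess reports a directly signaled child as a negative return code (for example -9), while a shell reports it as 128 plus the signal number, which is the form that shows up in the Pod status.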