docker run hangs: a troubleshooting record

1. Fault description

  Over the past two days I ran into a very strange problem. The complete description of the failure follows:

1) It started when a colleague reported that a worker node in the k8s cluster had gone NotReady, and the node's kubelet error log was full of entries like these:

E0603 01:50:51.455117   76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.456039   76268 remote_runtime.go:332] ExecSync e86c1b8d460ae2dfbb3fa0369e1ba6308962561f6c7b1076da35ff1db229ebc6 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523473   76268 remote_runtime.go:332] ExecSync dfddd3a462cf2d81e10385c6d30a1b6242961496db59b9d036fda6c477725c6a '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523491   76268 remote_runtime.go:332] ExecSync a6e8011a7f4a32d5e733ae9c0da58a310059051feb4d119ab55a387e46b3e7cd '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523494   76268 remote_runtime.go:332] ExecSync 0f85e0370a366a4ea90f7f21db2fc592a7e4cf817293097b36607a748191e195 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.935857   76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053326   76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053328   76268 remote_runtime.go:332] ExecSync a944b50db75702b200677511b8e44d839fa185536184812145010859fe4dbe57 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:53.035958   76268 remote_runtime.go:332] ExecSync 5bca3245ed12b9c470cce5b48490839761a021640e7cf97cbf3e749c3a81f488 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:54.438308   76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478244   76268 remote_runtime.go:332] ExecSync c09247eb9167dfc9f0956a5de23f5371c95a030b0eaafdf8518bc494c41bea9f 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478529   76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.955916   76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:04.668234   76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:07.306240   76268 remote_runtime.go:332] ExecSync 08807433ab5376c75501f9330a168a87734c0f738708e1c423ff4de69245d604 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:17.296389   76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:37.267301   76268 remote_runtime.go:332] ExecSync e5e029786289b2efe8c0ddde19283e0e36fc85c235704b2bbe9133fb520cb57c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:49.835358   76268 remote_runtime.go:332] ExecSync ee846bc29ffbd70e5a7231102e5fd85929cdac9019d97303b12510a89f0743d8 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:52.468602   76268 remote_runtime.go:332] ExecSync 4ca67d88a771ef0689c206a2ea706770b75889fddedf0d38e0ce016ac54c243d '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:05.470375   76268 remote_runtime.go:332] ExecSync 165d53f51c0e611e95882cd2019ef6893de63eaab652df77e055d8f3b17e161e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:07.475034   76268 remote_runtime.go:115] StopPodSandbox "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:07.475126   76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59"}
E0603 01:52:07.475208   76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:07.475270   76268 pod_workers.go:186] Error syncing pod 1b4efdb0-82c5-11e9-bae1-005056a23aab ("app-2034f7b2f71a91f71d2ac3115ba33a4afe9dfe27-1-59747f99cf-zv75k_maxhub-fat-fat(1b4efdb0-82c5-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:20.880257   76268 remote_runtime.go:115] StopPodSandbox "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:20.880367   76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348"}
E0603 01:52:20.880455   76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:20.880472   76268 pod_workers.go:186] Error syncing pod 98adf988-840f-11e9-bae1-005056a23aab ("app-f8a857f59f6784bb87ed44c2cd13d86e0663bd29-2-68dd78fc7f-h7qq4_project-394f23ca5e64aad710030c7c78981ec294a1bf59(98adf988-840f-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:21.672344   76268 remote_runtime.go:332] ExecSync cdb69e42aa1c2f261c1b30a9d4e511ec2be2f50050938f943fd714bfad71f44b 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.132342   76268 remote_runtime.go:332] ExecSync c1e134e598dae5dcd439c036b13d289add90726b32fe90acda778b524b68f01c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.362812   76268 remote_runtime.go:332] ExecSync 8881290b09a1f88d8b323a9be1236533ac6750a58463a438a45a1cd9c44aa7b3 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.649141   76268 remote_runtime.go:332] ExecSync ba1af801f817bc3cba324b5d14af7215acbff2f79e5b204bd992a3203c288d9e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.875760   76268 remote_runtime.go:332] ExecSync 3a04819fc488f5bb1d7954a00e33a419286accadc0c7aa739c7b81f264d7c3c0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.876992   76268 remote_runtime.go:332] ExecSync f61dfa21713d74f9f8c72df9a13b96a662feb1582f84b910204870c05443cfe0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
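
For the record, entries like these can be pulled from the kubelet logs with a filter such as the following; a sketch that assumes kubelet runs as a systemd unit named kubelet (it may log to a file instead, depending on how the node was provisioned):

# pull the runtime timeout errors out of the kubelet journal
journalctl -u kubelet --since "2019-06-03" | grep -E 'ExecSync|StopPodSandbox' | grep 'context deadline exceeded'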

2) In the messages log, the dockerd entries contained the following errors:

Jun  4 11:10:16 k8s-node145 dockerd: time="2019-06-04T11:10:16.894554055+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.453079842+08:00" level=info msg="shim reaped" id=19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458578126+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458628597+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.500849138+08:00" level=error msg="19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd cleanup: failed to delete container from containerd: no such container"
Jun  4 11:15:27 k8s-node145 dockerd: time="2019-06-04T11:15:27.809076915+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.252794583+08:00" level=info msg="shim reaped" id=226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257559564+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257611410+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.291278605+08:00" level=error msg="226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3 cleanup: failed to delete container from containerd: no such container"
Jun  4 11:15:39 k8s-node145 dockerd: time="2019-06-04T11:15:39.794587143+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.077775311+08:00" level=info msg="shim reaped" id=e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0
Jun  4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.079700724+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  4 11:16:57 k8s-node145 dockerd: time="2019-06-04T11:16:57.262180392+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/16ea66bd6a288acaf44b98179f5d1533ae0e5df683d8e6bcfff9b19d8840b6c5/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:04 k8s-node145 dockerd: time="2019-06-04T11:17:04.279961690+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.634709458+08:00" level=info msg="shim reaped" id=f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea
Jun  4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.636388105+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.241859584+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.980239680+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/5cdd5bf269b7b08e2a8f971e386dd52b398fd7f4d8a7c5b70276e8386a980343/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:25:31 k8s-node145 dockerd: time="2019-06-04T11:25:31.821280121+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.330601768+08:00" level=info msg="shim reaped" id=b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868161+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868997+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.374385142+08:00" level=error msg="b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0 cleanup: failed to delete container from containerd: no such container"
Jun  4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.918871781+08:00" level=info msg="shim reaped" id=e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554
Jun  4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.926022215+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

3) The docker service itself was up. Checking container states with docker ps -a showed that newly created containers were stuck in the Created state, i.e. creation was failing.
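
A quick way to isolate just those containers is docker ps's status filter; a minimal sketch (the --filter and --format flags are standard docker CLI, the column layout is my own):

# list only containers stuck in the Created state
docker ps -a --filter status=created --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}'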

4) Creating a container by hand produced the trace below. It looked stuck at the stage of reading the docker daemon's response, even though the docker service was in the running state and docker ps worked normally.

# strace docker run --rm registry.gz.cvte.cn/egg-demo/dev:dev-635f82b ls
futex(0x56190f2b6490, FUTEX_WAKE, 1) = 1
read(3, "HTTP/1.1 201 Created Api-Versio"..., 4096) = 297
futex(0xc4204d6548, FUTEX_WAKE, 1) = 1
read(3, 0xc420639000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x56190f2b70e8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc4204ef548, FUTEX_WAKE, 1) = 1
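
The repeated read(3, ...) = -1 EAGAIN shows the client polling fd 3, its HTTP connection to dockerd, without ever getting the rest of the response. A hedged sketch for confirming what fd 3 points at (it assumes the hung docker run is still alive and is the only process matching the pgrep pattern):

# resolve fd 3 of the hung docker client
CLIENT_PID=$(pgrep -f 'docker run' | head -n1)
readlink "/proc/${CLIENT_PID}/fd/3"   # typically prints socket:[inode]
ss -xp | grep "pid=${CLIENT_PID}"     # map that socket to /var/run/docker.sock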

5) Checking the messages log again, the following error kept appearing:

Jun  4 10:42:01 k8s-node145 systemd-logind: Failed to start session scope session-413369.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:43:01 k8s-node145 systemd-logind: Failed to start session scope session-413370.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:44:01 k8s-node145 systemd-logind: Failed to start session scope session-413371.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413372.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413373.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:46:01 k8s-node145 systemd-logind: Failed to start session scope session-413374.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:47:01 k8s-node145 systemd-logind: Failed to start session scope session-413375.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:48:01 k8s-node145 systemd-logind: Failed to start session scope session-413376.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:49:01 k8s-node145 systemd-logind: Failed to start session scope session-413377.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413378.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413379.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:51:01 k8s-node145 systemd-logind: Failed to start session scope session-413380.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:52:01 k8s-node145 systemd-logind: Failed to start session scope session-413381.scope: The maximum number of pending replies per connection has been reached

2. Troubleshooting

1) Searching on the dockerd error from the messages log (msg="stream copy error: reading from a closed fifo"): others who hit something similar traced it to container resource limits set so low that the container process got OOM-killed. That shouldn't produce the docker run hang seen here, though, so this lead went nowhere.
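
To rule the OOM theory in or out quickly, the kernel log can be checked for OOM-killer activity around the incident window; a sketch using the standard CentOS 7 log locations:

# any OOM kills around the incident?
dmesg -T | grep -iE 'out of memory|oom-killer'
grep -i 'killed process' /var/log/messages | tail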

2) Others reported it as a docker bug, so I checked the docker version installed on this node:

docker-ce.x86_64                        3:18.09.2-3.el7                installed
docker-ce-cli.x86_64                    1:18.09.5-3.el7                installed

  This is not a version from the docker-ce-stable repo. The master node's docker version, for comparison:

docker-ce-17.03.2.ce-1.el7.centos.x86_64
docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch

  The versions don't even match...
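
A hedged one-liner for auditing versions across the cluster (the hostnames are placeholders, and it assumes rpm-based nodes reachable over SSH):

# compare installed docker packages node by node
for h in k8s-master k8s-node145; do
  echo "== $h =="
  ssh "$h" "rpm -qa 'docker-ce*' | sort"
done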

3) Restarting the docker service on the faulty worker node and retesting brought kubelet back to normal, but docker ps -a still showed three daemonset pod containers in the Created state, so something was still wrong.

4) Four hours into this incident and out of leads, I decided to swap the docker version. I picked the following packages from the docker-ce-stable repo:

docker-ce-18.09.6-3.el7.x86_64
docker-ce-cli-18.09.6-3.el7.x86_64

  docker run still failed; containers stayed in the Created state.
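
For reference, the swap was along these lines; a sketch that assumes the docker-ce-stable repo is enabled on the node:

# replace the mismatched packages with a pinned version from docker-ce-stable
systemctl stop kubelet docker
yum remove -y docker-ce docker-ce-cli
yum install -y docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6-3.el7
systemctl start docker kubelet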

5) At this point the node's docker info read:

Containers: 36
 Running: 31
 Paused: 0
 Stopped: 5
Images: 17
Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.14.4.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 251.4GiB
Name: k8s-172-17-84-144
ID: XQYD:6IMZ:IGRL:L4TO:J53F:GYMA:VCWL:2DCT:YZVA:RHAQ:MT2D:F6Q7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

6) Could the storage driver be involved? I edited the docker.service file to remove the -s overlay2 --storage-opt overlay2.override_kernel_check=true startup flags and brought dockerd up on the overlay storage driver instead. Sure enough, guessing blindly didn't pay off.
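
The edit amounts to changing the ExecStart line; a minimal sketch as a systemd drop-in, which is equivalent to editing docker.service in place (the drop-in path is my own choice, not what was literally done here):

# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -s overlay

# reload units and restart the daemon
systemctl daemon-reload
systemctl restart docker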

7) Out of other options, I went for the blunt approach: uninstall docker, rename the /var/lib/docker, /var/lib/docker-engine, and /var/run/docker directories, and reinstall docker. The problem was still there. More fortune-telling.
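
The blunt reset, as a sketch (the .bak suffix is my own; package version as above):

# move docker state out of the way and reinstall
systemctl stop docker
yum remove -y docker-ce docker-ce-cli
mv /var/lib/docker{,.bak}
mv /var/lib/docker-engine{,.bak}
mv /var/run/docker{,.bak}
yum install -y docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6-3.el7
systemctl start docker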

8) After steps 6 and 7, my gut said this problem was system-level. So I went back to the earlier "The maximum number of pending replies per connection has been reached" error. It was produced by systemd-logind, but could the two be related? I had never seen this error before, so I googled it.

9) The search turned up the following:

On 15/06/16 19:05, marcin at saepia.net wrote:
> I have recently started to get the error response
> 
> "The maximum number of pending replies per connection has been reached"
> 
> to my method calls.

The intention of this maximum is to prevent denial-of-service by a bus
client. The dbus-daemon allows exactly one reply to each message that
expects a reply, therefore it must allocate memory every time it
receives a message that expects a reply, to record that fact. That
memory can be freed when it sees the reply, or when the process from
which it expects a reply disconnects (therefore there can be no reply
and there is no longer any point in tracking/allowing it).

To avoid denial of service, the dbus-daemon limits the amount of memory
that it is prepared to allocate on behalf of any particular client. The
limit is relatively small for the system bus, very large for the session
bus, and configurable (look for max_replies_per_connection in
/etc/dbus-1/session.conf).

So it appears to be a limit the system imposes to keep a single program from consuming so many resources that it causes denial of service. Next, which package owns /etc/dbus-1/session.conf, and what else does it ship?

[root@k8s-node-145 eden]# rpm -qf /etc/dbus-1/session.conf
dbus-1.10.24-7.el7.x86_64
[root@k8s-172-17-84-144 eden]# rpm -ql dbus-1.10.24-7.el7.x86_64
/etc/dbus-1
/etc/dbus-1/session.conf
/etc/dbus-1/session.d
/etc/dbus-1/system.conf
/etc/dbus-1/system.d
/run/dbus
/usr/bin/dbus-cleanup-sockets
/usr/bin/dbus-daemon
/usr/bin/dbus-monitor
/usr/bin/dbus-run-session
/usr/bin/dbus-send
/usr/bin/dbus-test-tool
/usr/bin/dbus-update-activation-environment
/usr/bin/dbus-uuidgen
/usr/lib/systemd/system/dbus.service
/usr/lib/systemd/system/dbus.socket
/usr/lib/systemd/system/messagebus.service
/usr/share/dbus-1/session.conf

At the end of /usr/share/dbus-1/session.conf there is a max_replies_per_connection parameter that matches the error in the messages log. Could this limit be the trigger? Its default is 50000; I changed it to 100000 and restarted dbus.service.
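
For reference, the knob is a <limit> element inside <busconfig>, as documented in dbus-daemon(1); a minimal sketch of the edit, with the value from the experiment above (the surrounding lines stand in for the rest of the existing file):

<busconfig>
  ...
  <limit name="max_replies_per_connection">100000</limit>
</busconfig>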

10) Running docker run again succeeded. But that raises a question: the other nodes also have max_replies_per_connection set to 50000, so what tripped the limit on this one? I set max_replies_per_connection back to 50000, restarted dbus.service again, and docker run still worked fine. All that's left is to wait and see whether the problem comes back.
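
If it does come back, one way to catch the offending bus client in the act is to watch method calls on the system bus; a sketch using dbus-monitor from the dbus package listed above (run as root, and note that on some systems the bus policy must permit eavesdropping, so treat this as an assumption):

# watch method calls on the system bus to see which client floods dbus-daemon
dbus-monitor --system "type='method_call'"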
