tornado底层epoll陷入死循环,cpu100%
该进程所有fd详情信息:其中fd对应信息如下所示
4:tornado的epoll监听树的(epfd)句柄
3:listen监听socket
5:怀疑是gevent启动的epoll监听树的(epfd)句柄
lrwx------ 1 root root 64 Feb 22 15:42 0 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 1 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 10 -> socket:[3858679084]
lrwx------ 1 root root 64 Feb 22 15:42 11 -> socket:[3858531359]
lrwx------ 1 root root 64 Feb 22 15:42 12 -> socket:[3893062439]
lrwx------ 1 root root 64 Feb 22 15:42 13 -> socket:[4021272049]
lrwx------ 1 root root 64 Feb 22 15:42 14 -> socket:[3736829588]
lrwx------ 1 root root 64 Feb 22 15:42 15 -> socket:[3858821377]
lrwx------ 1 root root 64 Feb 19 13:42 2 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 3 -> socket:[3736762451]
lrwx------ 1 root root 64 Feb 22 15:42 35 -> socket:[3858851998]
lrwx------ 1 root root 64 Feb 22 15:42 36 -> socket:[3858856577]
lrwx------ 1 root root 64 Feb 22 15:42 37 -> socket:[3858548454]
lrwx------ 1 root root 64 Feb 22 15:42 38 -> socket:[3858769612]
lrwx------ 1 root root 64 Feb 22 15:42 39 -> socket:[3858868371]
lrwx------ 1 root root 64 Feb 22 15:42 4 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 Feb 22 15:42 40 -> socket:[3858870568]
lrwx------ 1 root root 64 Feb 22 15:42 41 -> socket:[3858870569]
lrwx------ 1 root root 64 Feb 22 15:42 5 -> anon_inode:[eventfd]
l-wx------ 1 root root 64 Feb 22 15:42 6 -> /opt/log/api_20220225102901.log
lrwx------ 1 root root 64 Feb 22 15:42 7 -> socket:[3736762452]
lrwx------ 1 root root 64 Feb 22 15:42 8 -> socket:[3736745829]
lrwx------ 1 root root 64 Feb 22
tornado服务,监听fd为:3和7
python 32 root 3u IPv4 3736762451 0t0 TCP *:http (LISTEN)
python 32 root 7u IPv6
堆栈一直在循环打印下面的信息(10,35,36,37,15,39,11,38)
15:24:57.286761 epoll_wait(4, [{EPOLLIN, {u32=10, u64=145376053035018}}, {EPOLLIN, {u32=35, u64=119443040501795}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=36, u64=105853763977252}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=37, u64=92676804313125}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=15, u64=163247411953679}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=39, u64=47201690583079}}, {EPOLLIN, {u32=11, u64=156190780686347}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=38, u64=66795331387430}}], 64, 140) = 8
15:24:57.286845 epoll_ctl(4, EPOLL_CTL_MOD, 36, {EPOLLIN, {u32=36, u64=105853763977252}}) = 0
15:24:57.286902 epoll_ctl(4, EPOLL_CTL_MOD, 37, {EPOLLIN, {u32=37, u64=92676804313125}}) = 0
15:24:57.286960 epoll_ctl(4, EPOLL_CTL_MOD, 15, {EPOLLIN, {u32=15, u64=163247411953679}}) = 0
15:24:57.287029 epoll_ctl(4, EPOLL_CTL_MOD, 39, {EPOLLIN, {u32=39, u64=47201690583079}}) = 0
15:24:57.287088 epoll_ctl(4, EPOLL_CTL_MOD, 38, {EPOLLIN, {u32=38, u64=66795331387430}}) = 0
疑惑点
如果是陷入tornado epoll的死循环,epoll_wait为啥不能接受新的连接请求(也就是获取到客户端的请求),因为epoll_wait没有看到fd=3的读事件,按理说是可以的。
是fd=3 socket已经重红黑树里面删掉了还是有其他原因。当然也有可能是没看到,如果有的话,应该会多处于CLOSE_WAIT状态的socket
其中fd=10,fd=35,fd=11的都是tcp连接,服务端TCP状态为Close_WAIT(远程客户端调用服务端的请求,信息发送完毕,客户端已经发起并关闭socket,服务端socket接收缓冲区有数据),不知道为啥不会处理?
下面是这两个fd对应的socket信息:netstat -p 查出来的数据
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1586 0 zte-rdcloud-testce:http 172.30.0.4:60970 CLOSE_WAIT 32/python
tcp 4088 0 zte-rdcloud-testce:http 172.30.0.4:43076 CLOSE_WAIT 32/python
tcp 4088 0 zte-rdcloud-testce:http 172.30COMMAND PID USER FD TYPE .0.4:43076 CLOSE_WAIT 32/python
fd:10,35,31用lsof查出来的数据:
lsof:
python 32 root 10u IPv4 3858679084 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:60970 (CLOSE_WAIT)
python 32 root 35u IPv4 3858851998 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:43076 (CLOSE_WAIT)
python 32 root 11u IPv4 3858531359 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.3:40016 (CLOSE_WAIT)
其他fd(15,36,37,38,39):netstat 查不出来,lsof查出来的数据
lsof:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 32 root 15u sock 0,7 0t0 3858821377 protocol: TCP
python 32 root 36u sock 0,7 0t0 3858856577 protocol: TCP
python 32 root 37u sock 0,7 0t0 3858548454 protocol: TCP
python 32 root 38u sock 0,7 0t0 3858769612 protocol: TCP
python 32 root 39u sock 0,7 0t0 3858868371 protocol: TCP
---------------------------------------------------------------------------------------------------------------------------------
新进展:
进程号:41
tornado:死循环调用epoll_wait,又不处理socket里面的数据
strace堆栈信息:循环打印下面的信息,显示一直循环调用。这这个socket都是客户端的连接,并且已经发送完数据
epoll_wait(4, [{EPOLLIN, {u32=24, u64=523904405733400}}, {EPOLLIN, {u32=25, u64=551181743030297}}, {EPOLLIN, {u32=15, u64=528938107404303}}, {EPOLLIN, {u32=37, u64=382690176008229}}, {EPOLLIN, {u32=27, u64=446019468787739}}, {EPOLLIN, {u32=26, u64=499624955609114}}, {EPOLLIN, {u32=38, u64=326211356065830}}], 64, 256) = 7
netstat:查看这几个socket,tcp状态信息(显示接受队列有数据,但是不读取)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1 0 zte-rdcloud-testce:http 172.30.0.2:50762 CLOSE_WAIT 41/python
tcp 1830 0 zte-rdcloud-testce:http 172.30.0.4:38716 CLOSE_WAIT 41/python
tcp 884 0 zte-rdcloud-testce:http 172.30.0.3:51108 CLOSE_WAIT 41/python
tcp 1011 0 zte-rdcloud-testce:http 172.30.0.3:36216 CLOSE_WAIT 41/python
tcp 1 0 zte-rdcloud-testce:http 172.30.0.2:53818 CLOSE_WAIT 41/python
tcp 1 0 zte-rdcloud-testce:http 172.30.0.4:56326 CLOSE_WAIT 41/python
tcp 6979 0 zte-rdcloud-testce:http 172.30.0.3:50746 CLOSE_WAIT 41/python