[zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue
Sash Nagarkar
sash at dronedeploy.com
Thu Jun 12 04:15:43 CEST 2014
Hello ZMQ devs,
We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
for a service we're providing. Sorry if this is too verbose, and I
hope this is the right place to ask the question.
TL;DR: ROUTER socket doesn't receive messages from a DEALER even
though netstat shows several megabytes in the TCP receive queue
(nothing in the send queue). Other connected DEALERs work fine.
The ROUTER socket is running on a server with ample CPU & memory
headroom, with several DEALER clients that connect, exchange messages,
and may disconnect abruptly, sometimes repeatedly. We're exclusively using
multipart messages with the first part always being the ZMQ socket
identity, which persists across DEALER connect/disconnects. In other
words, each DEALER client uses the same socket identity across many
connects and disconnects.
Most of the time, things hum along smoothly (several thousand messages
exchanged, several dozen connect/disconnects). However, on rare
occasions, one of the DEALER clients connects and sends
messages to the ROUTER that end up never making it to the ROUTER
process. The ROUTER process continues to receive messages from other
DEALER clients.
Further debugging on the ROUTER server shows one (or more) TCP
connections from the client DEALER that are in the CLOSE_WAIT state
with several megabytes of data sitting in the receive queue to the
ROUTER. We also see one connection from the client DEALER in the
ESTABLISHED state with a receive queue that is growing.
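To make the TCP-level state concrete, here is a minimal sketch using plain Python sockets. It does not involve ZMQ or our actual service, and the names close_wait_demo and payload are made up for illustration. It only shows what CLOSE_WAIT with a non-empty receive queue means at the kernel level (the peer has closed, but the local process has neither read the data nor closed its end); it says nothing about why libzmq would stop reading.

```python
import socket


def close_wait_demo(payload=b"x" * 4096):
    """Plain-TCP illustration (no ZMQ): the peer closes while the accepting
    side has not yet read the data it was sent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()

    client.sendall(payload)           # data queues up on the accepting side
    client.close()                    # peer sends FIN

    # At this point netstat would be expected to list the accepted connection
    # with a non-zero Recv-Q and, once the FIN has arrived, in CLOSE_WAIT.
    # It stays that way until the application reads the data and closes.
    received = bytearray()
    while True:
        chunk = conn.recv(65536)
        if not chunk:                 # empty read: end of stream (peer's FIN)
            break
        received.extend(chunk)

    conn.close()                      # closing our end leaves CLOSE_WAIT
    listener.close()
    return bytes(received)
```

In this sketch the queued bytes are still fully readable after the peer has gone away, which matches what netstat reports on our server: the data reached the machine, but nothing has drained it from the socket.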
It appears that the DEALER client died abruptly at least once, but then
returned with the same identity and resumed sending messages to the
ROUTER. However, none of the subsequent messages are delivered to the
ROUTER process. Any ideas on why this would be the case?
I would have provided a test case, but we aren't able to consistently
reproduce the issue. I've copied the output from netstat (with
obfuscated IPs) below, in case it helps.
Questions:
- What would cause the receive queue to fill up like this on a ROUTER
while it continues to receive messages from other clients? It's clear
that the messages are all making it to the ROUTER machine.
- Is it safe for DEALER sockets to abruptly disconnect and then reuse
their socket identity?
- How can we mitigate this situation? The closest thing I see is
ZMQ_LINGER, but that applies only to the outgoing queue and not the
incoming one.
- Is there anything I could investigate myself to figure out whether
this is an issue in PyZMQ vs. libzmq? Where should I start?
Other potentially relevant info:
- The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
problem socket and some others. All other nodes in the system
continue to send and receive messages just fine.
- The ROUTER's send queues are pretty much empty.
- We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on Ubuntu 14.04.
$ netstat -a
Active Internet connections (servers and established)
Proto  Recv-Q Send-Q Local Address     Foreign Address   State
tcp         0      0 *:12501           *:*               LISTEN
tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED
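For anyone who wants to watch for this condition, the following is a rough sketch of how the netstat output above could be filtered for connections with unread data. The function name stuck_connections and the min_recv_q parameter are hypothetical, and the parsing assumes the six-column layout shown above (Proto, Recv-Q, Send-Q, local address, foreign address, state); other netstat variants may need adjustments.

```python
def stuck_connections(netstat_output, min_recv_q=1):
    """Return (state, recv_q, foreign_address) for each TCP row whose
    Recv-Q is at least min_recv_q. Assumes six whitespace-separated columns."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines, and anything that is not a TCP row.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            rows.append((fields[5], recv_q, fields[4]))
    return rows


# The rows from the netstat output above (IPs obfuscated as in the original).
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:12501 *:* LISTEN
tcp 1816956 0 server-ip.:12501 clientA-ip:42571 CLOSE_WAIT
tcp 1551036 0 server-ip.:12501 clientA-ip:42858 CLOSE_WAIT
tcp 0 0 server-ip.:12501 clientB-ip:34000 ESTABLISHED
tcp 5265541 0 server-ip.:12501 clientA-ip:43469 ESTABLISHED
"""

for state, recv_q, peer in stuck_connections(SAMPLE):
    print(f"{peer:<20} {state:<12} {recv_q} bytes unread")
```

Run against the sample, this flags the three clientA connections (two in CLOSE_WAIT, one ESTABLISHED), with 8633533 bytes unread in total, and skips the listener and the healthy clientB connection.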
Please let me know if further information would help. Thank you for
helping build ZMQ, it's been a huge pleasure to work with so far.
Cheers,
Sash