[zeromq-dev] Cleaning up file descriptors for dead router peers
Marcin Romaszewicz
marcin at brkt.com
Tue Jun 23 22:08:39 CEST 2015
Hi All,
I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
working around, and I'd love some advice, but I suspect the answer is that
what I want to do isn't possible.
Say I have a router socket listening on a port, and I have peers connecting
and disconnecting randomly over TCP. These peers have random identities for
all intents and purposes.
Most of the time, a peer will disconnect "cleanly", meaning the TCP
connection is terminated via FIN or RST packets and ZMQ cleans up the file
descriptor.
However, some of the time, a peer will die silently, for example because
of a network outage or a power outage.
In these cases, the router socket keeps the file descriptor around forever.
I know that the peer is dead because all my peers heartbeat to each other,
and the heartbeats have gone away. I thought that trying to send some data
to a dead peer would tear down that connection, since the underlying TCP
socket would eventually start erroring, but it doesn't. ZMQ must be
dropping my packet before sending it to the underlying socket.
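
The app-side liveness check described above can be sketched roughly as follows. This is only an illustration: the `HeartbeatTable` name, the 10-second timeout, and the injectable clock are assumptions made for the example, not details of the actual application.

```python
import time


class HeartbeatTable:
    """Tracks the last heartbeat time per peer identity (hypothetical helper)."""

    def __init__(self, timeout_s=10.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed value, tune to the heartbeat rate
        self.clock = clock
        self.last_seen = {}  # identity (bytes) -> time of last heartbeat

    def beat(self, identity):
        """Record a heartbeat from the given peer."""
        self.last_seen[identity] = self.clock()

    def reap_dead(self):
        """Remove and return identities whose heartbeats have stopped."""
        now = self.clock()
        dead = [i for i, t in self.last_seen.items() if now - t > self.timeout_s]
        for identity in dead:
            del self.last_seen[identity]
        return dead
```

Calling `reap_dead()` periodically yields the identities that have gone quiet, which is exactly the information that cannot currently be turned into a file descriptor.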
The socket monitor tells me that someone has connected to the router socket
on its bound port with a specific file descriptor, but I've got so many
of these coming in that I can't associate a specific file descriptor with a
specific peer.
TCP keep-alives don't work all that well in raising errors in a dead
connection.
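
For context, the keep-alive knobs in question look roughly like this on a plain TCP socket. libzmq exposes the same settings as the ZMQ_TCP_KEEPALIVE, ZMQ_TCP_KEEPALIVE_IDLE, ZMQ_TCP_KEEPALIVE_INTVL and ZMQ_TCP_KEEPALIVE_CNT socket options. The sketch below uses only Python's standard `socket` module; the function names and the 30/5/3 values are illustrative assumptions, and the tuning constants are platform-specific, so each one is guarded.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=5, probes=3):
    """Turn on TCP keep-alive and, where the platform allows, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants do not exist on every platform, so look each one up.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # the kernel does not support tuning this option


def worst_case_detection_s(idle_s, interval_s, probes):
    """Approximate seconds before an idle dead connection is reported."""
    return idle_s + interval_s * probes
```

With the illustrative values above, an idle dead connection would be noticed after about `worst_case_detection_s(30, 5, 3)` seconds, i.e. 45. Keep-alive probes are only sent on an idle connection, which may be part of why they are unreliable for detecting dead peers here.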
What I know on the app side due to my heartbeats is that peer XYZ is dead.
I'd like to tell the router socket to close the underlying file descriptor.
What I know via the monitor is that I have a bunch of file descriptors
open, but I can't map them to peers. If I could, I'd just call os.close()
on that file descriptor and hopefully ZMQ would handle this gracefully.
Eventually, after a few hours of uptime, my process hits the OS file
descriptor limit and stops accepting new connections at the ZeroMQ level.
I can have the process quit when it detects this, but that forces all the
functioning peers to reconnect and re-do some work, so I'd like to avoid it.
I scanned the previous discussions about it, and there has been mention of
exposing this somehow, but I don't see anything along these lines in the
latest API (I'm looking at the 4.1.2 release).
Any suggestions on how I could work around this?
I'm thinking of extending the socket monitor to have a new event type, like
ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer ID and file
descriptor, but I've not gone through the zmq code enough yet to know how
much work this would be.
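
To make the idea concrete, here is a rough sketch of how an application might consume such events. Everything in it is hypothetical: ZMQ_PEER_CONNECT/DISCONNECT do not exist in libzmq, and the `PeerFdRegistry` class and its method names are invented for illustration. Whether libzmq would tolerate a descriptor being closed underneath it is an open question, so the close function is injectable.

```python
import os


class PeerFdRegistry:
    """Maps router peer identities to file descriptors (hypothetical)."""

    def __init__(self, close_fd=os.close):
        self.close_fd = close_fd
        self.fd_by_peer = {}  # identity (bytes) -> file descriptor (int)

    def on_peer_connect(self, identity, fd):
        """Handle the proposed (non-existent) ZMQ_PEER_CONNECT event."""
        self.fd_by_peer[identity] = fd

    def on_peer_disconnect(self, identity):
        """Handle the proposed (non-existent) ZMQ_PEER_DISCONNECT event."""
        self.fd_by_peer.pop(identity, None)

    def drop_dead_peer(self, identity):
        """Close the descriptor of a peer that the heartbeats declared dead.

        Returns True if a descriptor was known and closed, False otherwise.
        """
        fd = self.fd_by_peer.pop(identity, None)
        if fd is None:
            return False
        self.close_fd(fd)
        return True
```

Once the heartbeats report that peer XYZ is dead, `drop_dead_peer(b"XYZ")` would look up and close its descriptor without disturbing the healthy peers.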
Thanks in advance,
-- Marcin