[zeromq-dev] Cleaning up file descriptors for dead router peers
Marcin Romaszewicz
marcin at brkt.com
Wed Jun 24 18:48:53 CEST 2015
Yes, you can easily reproduce this by pulling a network cable or shutting
the host down before it can do any sort of TCP connection cleanup. I'm
seeing it in AWS when instances get terminated, because they're given so
little time to respond to TERM that connections aren't cleaned up.
The iptables approach which Francis mentioned should work as well.
I'll see if I can come up with a simple example of reproducing this. It
might even be possible to repro this on a single machine simply by
suspending a peer.
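
A minimal sketch of the single-machine idea, under some loud assumptions: it uses plain TCP sockets from the Python standard library rather than ZMQ, it relies on SIGSTOP (so Unix only), and the names `observe_suspended_peer`, `probe` and `CHILD_CODE` are made up for illustration. A suspended process is also only a stand-in for a dead host, since the local kernel stays up and would keep answering TCP keep-alive probes. What it shows is that a suspended peer sends neither FIN nor RST, so the listening side has nothing to react to.

```python
import os
import signal
import socket
import subprocess
import sys

# Hypothetical child peer: connect, send one byte so the server knows it is
# up, then sit idle holding the connection open.
CHILD_CODE = (
    "import socket, sys, time\n"
    "s = socket.create_connection(('127.0.0.1', int(sys.argv[1])))\n"
    "s.sendall(b'x')\n"
    "time.sleep(60)\n"
)


def probe(conn, timeout):
    """Classify what the server side can observe on conn within timeout."""
    conn.settimeout(timeout)
    try:
        data = conn.recv(1)
    except socket.timeout:
        return "silent"   # no data and no FIN/RST: the peer just looks quiet
    except ConnectionError:
        return "closed"   # the connection was reset
    return "closed" if data == b"" else "data"


def observe_suspended_peer():
    """Return (state while the peer is suspended, state after it is killed)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(port)])
    try:
        listener.settimeout(10)
        conn, _ = listener.accept()
        conn.settimeout(10)
        conn.recv(1)  # wait for the hello byte from the child

        # Suspend the peer: its TCP connection stays established.
        os.kill(child.pid, signal.SIGSTOP)
        while_suspended = probe(conn, 0.5)

        # Kill the peer: the kernel closes its socket on process exit.
        child.kill()
        child.wait()
        after_kill = probe(conn, 5.0)

        conn.close()
        return while_suspended, after_kill
    finally:
        if child.poll() is None:
            child.kill()
            child.wait()
        listener.close()


if __name__ == "__main__":
    print(observe_suspended_peer())
```

On Linux I would expect the first element to be "silent" and the second "closed". Swapping the plain sockets for a ROUTER socket and a ZMQ peer should give a closer repro, but that part is untested here.
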
-- Marcin
On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <ph at imatix.com> wrote:
> Do you think there's any way to reproduce this in the lab, e.g.
> killing a peer before it can shut down TCP properly?
>
> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <marcin at brkt.com>
> wrote:
> > Hi All,
> >
> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
> > working around, and I'd love some advice, but I suspect the answer is
> > that what I want to do isn't possible.
> >
> > Say I have a router socket listening on a port, and I have peers
> > connecting and disconnecting randomly over TCP. These peers have random
> > identities for all intents and purposes.
> >
> > Most of the time, a peer will disconnect "cleanly", meaning the TCP
> > connection is terminated via FIN or RST packets, and ZMQ cleans up the
> > file descriptor.
> >
> > However, some of the time, my peer will die silently, effectively due to
> > network outage or power outage or something.
> >
> > In these cases, the router socket keeps the file descriptor around
> > forever. I know that the peer is dead because all my peers heartbeat to
> > each other, and the heartbeats have gone away. I thought that trying to
> > send some data to a dead peer would tear down that connection, since the
> > underlying TCP socket would eventually start erroring, but it doesn't.
> > ZMQ must be dropping my packet before sending it to the underlying
> > socket.
> >
> > The socket monitor tells me that someone has connected to the router
> > socket on its bound port with a specific file descriptor, but I've got
> > so many of these coming in that I can't associate a specific file
> > descriptor with a specific peer.
> >
> > TCP keep-alives don't work all that well at raising errors on a dead
> > connection.
> >
> > What I know on the app side due to my heartbeats is that peer XYZ is
> > dead. I'd like to tell the router socket to close the underlying file
> > descriptor. What I know via the monitor is that I have a bunch of file
> > descriptors open, but I can't map them to peers. If I could, I'd just
> > call os.close() on that file descriptor and hopefully ZMQ would handle
> > this gracefully.
> >
> > Eventually, after a few hours of uptime, my process hits the OS file
> > descriptor limit, and stops receiving new connections on the zeromq
> > level. I can have the process quit when it detects this, but that forces
> > all the functioning peers to reconnect and re-do some work, so I'd like
> > to avoid it.
> >
> > I scanned the previous discussions about it, and there has been mention
> > of exposing this somehow, but I don't see anything along these lines in
> > the latest API (looking at the 4.1.2 release).
> >
> > Any suggestions on how I could work around this?
> >
> > I'm thinking of extending the socket monitor to have a new event type,
> > like ZMQ_PEER_CONNECT/DISCONNECT, which passes back the peer ID and file
> > descriptor, but I've not gone through the zmq code enough yet to know
> > how much work this would be.
> >
> > Thanks in advance,
> > -- Marcin
> >
> >
> >
> > _______________________________________________
> > zeromq-dev mailing list
> > zeromq-dev at lists.zeromq.org
> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev