[zeromq-dev] Cleaning up file descriptors for dead router peers

Marcin Romaszewicz marcin at brkt.com
Wed Jun 24 20:16:56 CEST 2015


Thanks, this would probably solve our problem. However, I'm reluctant to
deploy the bleeding edge from your git repo into our production systems,
even if it does work on my test cluster.

When I detect that a peer is dead with my own heartbeats, why is it that
attempting to send data to the dead peer doesn't force some kind of
connection cleanup or reset? The underlying OS sockets should error out
eventually.
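
A minimal sketch of the kind of app-side heartbeat bookkeeping described
here, assuming each heartbeat message carries the peer's identity. The
class name `PeerLiveness`, its methods, and the timeout value are
hypothetical and are not part of ZeroMQ; the sketch only decides which
identities have gone quiet, it does not touch any socket.

```python
import time
from typing import Dict, List, Optional


class PeerLiveness:
    """Track the last heartbeat time per peer identity (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[bytes, float] = {}

    def heartbeat(self, identity: bytes, now: Optional[float] = None) -> None:
        # Record that a heartbeat arrived from this identity.
        self._last_seen[identity] = time.monotonic() if now is None else now

    def reap(self, now: Optional[float] = None) -> List[bytes]:
        # Return identities silent for longer than the timeout and forget them.
        now = time.monotonic() if now is None else now
        dead = [peer for peer, seen in self._last_seen.items()
                if now - seen > self.timeout_s]
        for peer in dead:
            del self._last_seen[peer]
        return dead
```

This yields the identity of a dead peer, but not the file descriptor of
its connection, which is exactly the mapping that is missing.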

On Wed, Jun 24, 2015 at 10:52 AM, Pieter Hintjens <ph at imatix.com> wrote:

> For what it's worth, we just merged a pull request that adds
> connection heartbeating. It could be fun to see if this solves your
> problem. (In theory it should...)
>
> https://github.com/zeromq/libzmq/pull/1448
>
>
> On Wed, Jun 24, 2015 at 6:48 PM, Marcin Romaszewicz <marcin at brkt.com>
> wrote:
> > Yes, you can easily reproduce this by pulling a network cable or shutting
> > the host down before it can do any sort of TCP connection cleanup. I'm
> > seeing it in AWS when instances get terminated, because they're given so
> > little time to respond to TERM that connections aren't cleaned up.
> >
> > The iptables approach which Francis mentioned should work as well.
> >
> > I'll see if I can come up with a simple example of reproducing this. It
> > might be even possible to repro this on a single machine simply by
> > suspending a peer.
> >
> > -- Marcin
> >
> > On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <ph at imatix.com> wrote:
> >>
> >> Do you think there's any way to reproduce this in the lab, e.g.
> >> killing a peer before it can shut down TCP properly?
> >>
> >> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <marcin at brkt.com>
> >> wrote:
> >> > Hi All,
> >> >
> >> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
> >> > working around, and I'd love some advice, but I suspect the answer is
> >> > that what I want to do isn't possible.
> >> >
> >> > Say I have a router socket listening on a port, and I have peers
> >> > connecting and disconnecting randomly over TCP. These peers have random
> >> > identities for all intents and purposes.
> >> >
> >> > Most of the time, a peer will disconnect "cleanly", meaning the TCP
> >> > connection is terminated via FIN or RST packets, and ZMQ cleans up the
> >> > file descriptor.
> >> >
> >> > However, some of the time, my peer will die silently, effectively due
> >> > to network outage or power outage or something.
> >> >
> >> > In these cases, the router socket keeps the file descriptor around
> >> > forever. I know that the peer is dead because all my peers heartbeat to
> >> > each other, and the heartbeats have gone away. I thought that trying to
> >> > send some data to a dead peer would tear down that connection, since the
> >> > underlying TCP socket would eventually start erroring, but it doesn't;
> >> > zmq must be dropping my packet before sending it to the underlying
> >> > socket.
> >> >
> >> > The socket monitor tells me that someone has connected to the router
> >> > socket on its bound port with a specific file descriptor, but I've got
> >> > so many of these coming in that I can't associate a specific file
> >> > descriptor with a specific peer.
> >> >
> >> > TCP keep-alives don't work all that well in raising errors in a dead
> >> > connection.
> >> >
> >> > What I know on the app side due to my heartbeats is that peer XYZ is
> >> > dead. I'd like to tell the router socket to close the underlying file
> >> > descriptor. What I know via the monitor is that I have a bunch of file
> >> > descriptors open, but I can't map them to peers. If I could, I'd just
> >> > call os.close() on that file descriptor and hopefully ZMQ would handle
> >> > this gracefully.
> >> >
> >> > Eventually, in a few hours of uptime, my process hits the OS file
> >> > descriptor limit, and stops receiving new connections on the zeromq
> >> > level. I can have the process quit when it detects this, but that forces
> >> > all the functioning peers to reconnect and re-do some work, so I'd like
> >> > to avoid it.
> >> >
> >> > I scanned the previous discussions about it, and there has been mention
> >> > of exposing this somehow, but I don't see anything along these lines in
> >> > the latest API (looking at the 4.1.2 release).
> >> >
> >> > Any suggestions on how I could work around this?
> >> >
> >> > I'm thinking of extending the socket monitor to have a new event type,
> >> > like ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer ID and file
> >> > descriptor, but I've not gone through the zmq code enough yet to know
> >> > how much work this would be.
> >> >
> >> > Thanks in advance,
> >> > -- Marcin
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > zeromq-dev mailing list
> >> > zeromq-dev at lists.zeromq.org
> >> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev
> >> >
>
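
The original message mentions the process eventually hitting the OS file
descriptor limit and quitting when it detects this. A rough sketch of how
a process could watch its own headroom before that point, assuming a
Unix-like system that exposes `/proc/self/fd` or `/dev/fd`. The function
names `fd_headroom` and `near_fd_limit` and the 0.9 threshold are
hypothetical; this only reports the problem, it does not free leaked
descriptors.

```python
import os
import resource
from typing import Tuple


def fd_headroom() -> Tuple[int, int]:
    """Return (open descriptor count, soft RLIMIT_NOFILE) for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            # Listing the directory briefly opens one extra descriptor,
            # so the count can be one higher than the steady state.
            return len(os.listdir(fd_dir)), soft
        except OSError:
            continue
    raise RuntimeError("no per-process fd directory found on this platform")


def near_fd_limit(threshold: float = 0.9) -> bool:
    """True when open descriptors reach the given fraction of the soft limit."""
    open_fds, soft = fd_headroom()
    if soft == resource.RLIM_INFINITY:
        return False
    return open_fds >= threshold * soft
```

Calling `near_fd_limit()` periodically would at least allow an alert or a
planned restart rather than a silent stop in accepting connections.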