[zeromq-dev] Cleaning up file descriptors for dead router peers

Marcin Romaszewicz marcin at brkt.com
Fri Jun 26 00:35:21 CEST 2015


I've done some testing of the GitHub head version with the heartbeat code,
and it didn't work for me: file descriptor counts still increase
monotonically. I'm not sure I set up my test scenario properly, though. I
had the following setup.

1 ZMQ_ROUTER socket with
  ZMQ_HEARTBEAT_IVL = 3000 (3 seconds)
  ZMQ_HEARTBEAT_TTL = 30000 (30 seconds)
  ZMQ_HEARTBEAT_TIMEOUT = 30000 (30 seconds)

However, for logistical reasons, the clients connecting to this ZMQ socket
were on ZMQ 4.1.2. Was this a valid test scenario? It would take me a
couple of days to set up the AMIs to test Router(4.2.0) <->
client(4.2.0).
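
As a rough way to watch the file descriptor counts during a test run like
this, here is a minimal Python sketch. It assumes a Unix-like host that
exposes /proc/self/fd (Linux) or /dev/fd (macOS/BSD). The helper names
open_fd_count and fd_headroom are made up for illustration and are not part
of ZMQ. It counts descriptors for the whole process, so it cannot say which
descriptor belongs to which peer.

```python
import os
import resource
import socket


def open_fd_count():
    """Return how many file descriptors this process currently has open."""
    # /proc/self/fd is Linux-specific; /dev/fd is the macOS/BSD fallback.
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # Listing the directory temporarily opens one descriptor of its
            # own, which shows up in the listing, so subtract it.
            return len(os.listdir(fd_dir)) - 1
    raise OSError("no per-process fd directory found on this platform")


def fd_headroom():
    """Return descriptors left before the soft limit, or None if unlimited."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - open_fd_count()


if __name__ == "__main__":
    before = open_fd_count()
    probe = socket.socket()  # any new socket consumes one descriptor
    print("descriptors before/after:", before, open_fd_count())
    probe.close()
    print("headroom below soft limit:", fd_headroom())
```

Logging open_fd_count() periodically from the router process would show
whether the count levels off once heartbeats start reaping dead peers, or
keeps climbing toward the limit.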

Another question:
If I switch a router socket into ZMQ_ROUTER_RAW mode, send it a disconnect
frame (peer identity followed by an empty frame), then switch off RAW mode,
would I be doing something completely unsupported, or is it worth a try? My
tests take a very long time and a lot of work to set up right now, so I'm
reluctant to try something if it's probably a waste of time.

Thanks,
-- Marcin


On Wed, Jun 24, 2015 at 1:01 PM, Pieter Hintjens <ph at imatix.com> wrote:

> The underlying sockets should indeed error out. Presumably the code
> isn't handling this properly.
>
>
> On Wed, Jun 24, 2015 at 8:16 PM, Marcin Romaszewicz <marcin at brkt.com>
> wrote:
> > Thanks, this probably would solve our problem, however, I'm reluctant to
> > deploy the bleeding edge from your git repo into our production systems,
> > even if it does work on my test cluster.
> >
> > When I detect that a peer is dead with my own heartbeats, why is it that
> > attempting to send data to the dead peer doesn't force some kind of
> > connection cleanup or reset? The underlying os sockets should error out
> > eventually.
> >
> > On Wed, Jun 24, 2015 at 10:52 AM, Pieter Hintjens <ph at imatix.com> wrote:
> >>
> >> For what it's worth, we just merged a pull request that adds
> >> connection heartbeating. It could be fun to see if this solves your
> >> problem. (In theory it should...)
> >>
> >> https://github.com/zeromq/libzmq/pull/1448
> >>
> >>
> >> On Wed, Jun 24, 2015 at 6:48 PM, Marcin Romaszewicz <marcin at brkt.com>
> >> wrote:
> >> > Yes, you can easily reproduce this by pulling a network cable or
> >> > shutting the host down before it can do any sort of TCP connection
> >> > cleanup. I'm seeing it in AWS when instances get terminated, because
> >> > they're given so little time to respond to TERM that connections
> >> > aren't cleaned up.
> >> >
> >> > The iptables approach which Francis mentioned should work as well.
> >> >
> >> > I'll see if I can come up with a simple example of reproducing this.
> >> > It might even be possible to repro this on a single machine simply by
> >> > suspending a peer.
> >> >
> >> > -- Marcin
> >> >
> >> > On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <ph at imatix.com>
> >> > wrote:
> >> >>
> >> >> Do you think there's any way to reproduce this in the lab, e.g.
> >> >> killing a peer before it can shut down TCP properly?
> >> >>
> >> >> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz
> >> >> <marcin at brkt.com> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard
> >> >> > time working around, and I'd love some advice, but I suspect the
> >> >> > answer is that what I want to do isn't possible.
> >> >> >
> >> >> > Say I have a router socket listening on a port, and I have peers
> >> >> > connecting and disconnecting randomly over TCP. These peers have
> >> >> > random identities for all intents and purposes.
> >> >> >
> >> >> > Most of the time, a peer will disconnect "cleanly", meaning the
> >> >> > TCP connection is terminated via FIN or RST packets, and ZMQ
> >> >> > cleans up the file descriptor.
> >> >> >
> >> >> > However, some of the time, my peer will die silently, for example
> >> >> > due to a network outage or power outage.
> >> >> >
> >> >> > In these cases, the router socket keeps the file descriptor around
> >> >> > forever. I know that the peer is dead because all my peers
> >> >> > heartbeat to each other, and the heartbeats have gone away. I
> >> >> > thought that trying to send some data to a dead peer would tear
> >> >> > down that connection, since the underlying TCP socket would
> >> >> > eventually start erroring, but it doesn't. ZMQ must be dropping my
> >> >> > packet before sending it to the underlying socket.
> >> >> >
> >> >> > The socket monitor tells me that someone has connected to the
> >> >> > router socket on its bound port with a specific file descriptor,
> >> >> > but I've got so many of these coming in that I can't associate a
> >> >> > specific file descriptor with a specific peer.
> >> >> >
> >> >> > TCP keep-alives don't work all that well in raising errors in a
> >> >> > dead connection.
> >> >> >
> >> >> > What I know on the app side due to my heartbeats is that peer XYZ
> >> >> > is dead. I'd like to tell the router socket to close the
> >> >> > underlying file descriptor. What I know via the monitor is that I
> >> >> > have a bunch of file descriptors open, but I can't map them to
> >> >> > peers. If I could, I'd just call os.close() on that file
> >> >> > descriptor and hopefully ZMQ would handle this gracefully.
> >> >> >
> >> >> > Eventually, after a few hours of uptime, my process hits the OS
> >> >> > file descriptor limit and stops receiving new connections on the
> >> >> > zeromq level. I can have the process quit when it detects this,
> >> >> > but that forces all the functioning peers to reconnect and re-do
> >> >> > some work, so I'd like to avoid it.
> >> >> >
> >> >> > I scanned the previous discussions about it, and there has been
> >> >> > mention of exposing this somehow, but I don't see anything along
> >> >> > these lines in the latest API (looking at the 4.1.2 release).
> >> >> >
> >> >> > Any suggestions on how I could work around this?
> >> >> >
> >> >> > I'm thinking of extending the socket monitor to have a new event
> >> >> > type, like ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer
> >> >> > ID and file descriptor, but I've not gone through the zmq code
> >> >> > enough yet to know how much work this would be.
> >> >> >
> >> >> > Thanks in advance,
> >> >> > -- Marcin
> >> >> >
> >> >> >
> >> >> >
> >> >> > _______________________________________________
> >> >> > zeromq-dev mailing list
> >> >> > zeromq-dev at lists.zeromq.org
> >> >> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev
> >> >> >