[zeromq-dev] Cleaning up file descriptors for dead router peers

Pieter Hintjens ph at imatix.com
Wed Jun 24 22:01:48 CEST 2015


The underlying sockets should indeed error out. Presumably the code
isn't handling this properly.
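
For context on why a silently dead peer can go unnoticed: an idle TCP
connection produces no traffic, so the OS has nothing to fail on unless
keep-alive probes are enabled. Below is a minimal sketch on a plain Python
socket, not on a ZeroMQ socket. The helper name and the timing values are
illustrative assumptions. libzmq exposes the same knobs as the
ZMQ_TCP_KEEPALIVE, ZMQ_TCP_KEEPALIVE_IDLE, ZMQ_TCP_KEEPALIVE_INTVL and
ZMQ_TCP_KEEPALIVE_CNT socket options.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keep-alive probes for a TCP socket (illustrative helper).

    Returns a dict of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}
    # The tuning options are platform specific (all three exist on Linux),
    # so each one is set only if this Python build exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied
```

With these example values, the OS would start probing after 60 seconds of
silence and give up after 5 unanswered probes sent 10 seconds apart, at
which point the socket errors out.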


On Wed, Jun 24, 2015 at 8:16 PM, Marcin Romaszewicz <marcin at brkt.com> wrote:
> Thanks, this probably would solve our problem, however, I'm reluctant to
> deploy the bleeding edge from your git repo into our production systems,
> even if it does work on my test cluster.
>
> When I detect that a peer is dead with my own heartbeats, why is it that
> attempting to send data to the dead peer doesn't force some kind of
> connection cleanup or reset? The underlying OS sockets should error out
> eventually.
>
> On Wed, Jun 24, 2015 at 10:52 AM, Pieter Hintjens <ph at imatix.com> wrote:
>>
>> For what it's worth, we just merged a pull request that adds
>> connection heartbeating. It could be fun to see if this solves your
>> problem. (In theory it should...)
>>
>> https://github.com/zeromq/libzmq/pull/1448
>>
>>
>> On Wed, Jun 24, 2015 at 6:48 PM, Marcin Romaszewicz <marcin at brkt.com>
>> wrote:
>> > Yes, you can easily reproduce this by pulling a network cable or
>> > shutting the host down before it can do any sort of TCP connection
>> > cleanup. I'm seeing it in AWS when instances get terminated, because
>> > they're given so little time to respond to TERM that connections
>> > aren't cleaned up.
>> >
>> > The iptables approach which Francis mentioned should work as well.
>> >
>> > I'll see if I can come up with a simple example of reproducing this. It
>> > might even be possible to reproduce this on a single machine simply by
>> > suspending a peer.
>> >
>> > -- Marcin
>> >
>> > On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <ph at imatix.com> wrote:
>> >>
>> >> Do you think there's any way to reproduce this in the lab, e.g.
>> >> killing a peer before it can shut down TCP properly?
>> >>
>> >> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <marcin at brkt.com>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard
>> >> > time working around, and I'd love some advice, but I suspect the
>> >> > answer is that what I want to do isn't possible.
>> >> >
>> >> > Say I have a router socket listening on a port, and I have peers
>> >> > connecting and disconnecting randomly over TCP. These peers have
>> >> > random identities for all intents and purposes.
>> >> >
>> >> > Most of the time, a peer will disconnect "cleanly", meaning the TCP
>> >> > connection is terminated via FIN or RST packets and ZMQ cleans up
>> >> > the file descriptor.
>> >> >
>> >> > However, some of the time, my peer will die silently, effectively
>> >> > due to network outage or power outage or something.
>> >> >
>> >> > In these cases, the router socket keeps the file descriptor around
>> >> > forever. I know that the peer is dead because all my peers
>> >> > heartbeat to each other, and the heartbeats have gone away. I
>> >> > thought that trying to send some data to a dead peer would tear
>> >> > down that connection, since the underlying TCP socket would
>> >> > eventually start erroring, but it doesn't. ZMQ must be dropping my
>> >> > packet before sending it to the underlying socket.
>> >> >
>> >> > The socket monitor tells me that someone has connected to the
>> >> > router socket on its bound port with a specific file descriptor,
>> >> > but I've got so many of these coming in that I can't associate a
>> >> > specific file descriptor with a specific peer.
>> >> >
>> >> > TCP keep-alives don't work all that well in raising errors in a dead
>> >> > connection.
>> >> >
>> >> > What I know on the app side due to my heartbeats is that peer XYZ
>> >> > is dead. I'd like to tell the router socket to close the underlying
>> >> > file descriptor. What I know via the monitor is that I have a bunch
>> >> > of file descriptors open, but I can't map them to peers. If I
>> >> > could, I'd just call os.close() on that file descriptor and
>> >> > hopefully ZMQ would handle this gracefully.
>> >> >
>> >> > Eventually, in a few hours of uptime, my process hits the OS file
>> >> > descriptor limit and stops receiving new connections on the zeromq
>> >> > level. I can have the process quit when it detects this, but that
>> >> > forces all the functioning peers to reconnect and re-do some work,
>> >> > so I'd like to avoid it.
>> >> >
>> >> > I scanned the previous discussions about it, and there has been
>> >> > mention of exposing this somehow, but I don't see anything along
>> >> > these lines in the latest API (looking at the 4.1.2 release).
>> >> >
>> >> > Any suggestions on how I could work around this?
>> >> >
>> >> > I'm thinking of extending the socket monitor to have a new event
>> >> > type, like ZMQ_PEER_CONNECT/DISCONNECT, which passes back the peer
>> >> > ID and file descriptor, but I've not gone through the zmq code
>> >> > enough yet to know how much work this would be.
>> >> >
>> >> > Thanks in advance,
>> >> > -- Marcin
>> >> >
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > zeromq-dev mailing list
>> >> > zeromq-dev at lists.zeromq.org
>> >> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>> >> >
>> >
>> >
>> >
>
>
>
>
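
For reference, the application-level heartbeat tracking described in the
original message could be sketched roughly as below. This is an assumption
about the shape of such code, not Marcin's implementation: the class and
method names are hypothetical, and it only tells the application which peer
identities have gone silent. It cannot close the router's file descriptor
for that peer, which is exactly the gap discussed in this thread.

```python
import time


class PeerLiveness:
    """Track the last heartbeat seen from each peer identity (hypothetical)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a peer is presumed dead
        self.clock = clock       # injectable time source, useful for testing
        self.last_seen = {}      # peer identity (bytes) -> time of last heartbeat

    def heartbeat(self, identity):
        """Record that a heartbeat just arrived from this identity."""
        self.last_seen[identity] = self.clock()

    def reap(self):
        """Forget and return the identities silent for longer than timeout."""
        now = self.clock()
        dead = [ident for ident, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for ident in dead:
            del self.last_seen[ident]
        return dead
```

A router loop would call heartbeat() with the identity frame of each
incoming heartbeat message and call reap() periodically to learn which
peers to stop sending to.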


