[zeromq-dev] Cleaning up file descriptors for dead router
Marcin Romaszewicz
marcin at brkt.com
Mon Jun 29 20:25:50 CEST 2015
Hi Jonathan,
Your heartbeat code does indeed work in my little test, but I don't know
why it didn't work in the wild for me.
Your code, though, gave me an idea to fix my problem slightly differently
on top of ZMQ 4.1.2. I already have heartbeats going back and forth, and
they propagate some peer information, so I have to send them irrespective
of whether your code sends ZMQ-internal heartbeats. I'm going to do
something similar in the stream engine: if the TCP send returns a size of
0 because the send would block or fail, I'll start a timer, then cancel it
if we ever have a subsequent successful send or receive something. If the
timer goes off, we disconnect. This should fix my problem without two
layers of heartbeats.
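
A minimal sketch of the timer logic described above, kept separate from
libzmq so it can be read on its own. The class name StallWatchdog and its
methods are hypothetical and are not part of the ZMQ stream engine; the
caller is assumed to pass in the current time and to do the actual
disconnect when expired() returns true.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not a libzmq API: tracks whether a connection has gone
// too long without a successful send or receive after a send stalled.
class StallWatchdog {
public:
    using Clock = std::chrono::steady_clock;
    using TimePoint = Clock::time_point;

    explicit StallWatchdog(std::chrono::milliseconds timeout)
        : timeout_(timeout), armed_(false), deadline_() {
        assert(timeout.count() > 0);
    }

    // Call when a TCP send wrote 0 bytes because it would block or failed.
    // The timer is armed only once, so repeated stalls do not push the
    // deadline further out.
    void on_send_stalled(TimePoint now) {
        if (!armed_) {
            armed_ = true;
            deadline_ = now + timeout_;
        }
    }

    // Call after any successful send or after receiving any data.
    // This cancels the pending timer.
    void on_progress() { armed_ = false; }

    // True once the timer has run out; the caller should then disconnect.
    bool expired(TimePoint now) const { return armed_ && now >= deadline_; }

private:
    std::chrono::milliseconds timeout_;
    bool armed_;
    TimePoint deadline_;
};
```

Taking the time as a parameter instead of reading the clock inside the class
is only a choice made here to keep the sketch easy to exercise; a real
change in the stream engine would more likely hook into the I/O thread's
existing timers.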
Once 4.2.0 is stable and tested, I'll move to using your heartbeat stuff
and remove our own heartbeats.
-- Marcin
On Sat, Jun 27, 2015 at 9:06 AM, Jonathan Reams <jbreams at gmail.com> wrote:
> Hi Marcin,
>
> I tried running your test case with the new heartbeats turned on and I saw
> what I think should be the correct behavior. I set the heartbeat interval,
> timeout, and TTL to 500 ms, and less than a second after setting iptables
> to DROP, all the sockets on the peer side went from ESTABLISHED to
> SYN_SENT, indicating that they were trying to reconnect, and all the
> ESTABLISHED sockets on the router side were closed. After flushing the
> INPUT iptables chain, the peers eventually recovered. I put my updated copy
> of your test script here
> https://gist.github.com/jbreams/7f507beff87987afad98. I haven't tried
> this with 4.2.0 talking to 4.1.2, although in your configuration I
> think it would do almost the right thing - I'd expect the router side to
> work fine and the peers to never close their sockets.
>
> Jonathan
>
> On Fri, Jun 26, 2015 at 4:58 PM, Marcin Romaszewicz <marcin at brkt.com>
> wrote:
>
>> Hi All,
>> I've got a trivial bit of code to reproduce this issue on a single host
>> using iptables to simulate network partition.
>> https://s3-us-west-2.amazonaws.com/marcin-zmq-example/zmq_test.cpp
>> The file has comments on how to run the executable, but the short version
>> is that you start a ZMQ_ROUTER listener which accepts connections from
>> other peers, and remembers their identities and pings them every 5
>> seconds.
>> Then, you start a number of peers which connect to this router and start
>> pinging it every few seconds.
>> Once you use the iptables command (also in the comments in the file), the
>> router can't ping the peers, and the peers can't ping the router. The file
>> descriptors and connections remain open forever on both sides.
>> Furthermore, when you undo the iptables block, the connections never come
>> back.
>
>