[zeromq-dev] MDP protocol, detecting dead workers

Gyorgy Szekely hoditohod at gmail.com
Tue Feb 14 16:33:05 CET 2017


Hi,
The implemented protocol (ZMQ-RFC 7/MDP) has application-level mutual
heartbeating between the broker and the worker, and this works fine: each
party detects the other side's death via missing heartbeats. The problem
appears when the worker is assigned a long-running job: heartbeating is
_disabled_ while the job is being processed (as 7/MDP specifies). This
allows the worker to be single-threaded and avoids typical multithreading
issues (e.g. the processing thread hangs while the heartbeating thread keeps
running, leaving the worker in an inconsistent state).
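
To make the gap concrete, here is a minimal Python sketch of a broker-side
liveness check with heartbeats suspended during a job. It is only an
illustration under assumptions: WorkerState, is_expired and the 7.5 second
expiry are hypothetical names and values, not part of 7/MDP or libzmq.

```python
from dataclasses import dataclass

HEARTBEAT_EXPIRY = 7.5  # seconds; hypothetical value chosen for illustration


@dataclass
class WorkerState:
    last_seen: float    # time of the last message received from the worker
    busy: bool = False  # True while a job is assigned to the worker


def is_expired(worker: WorkerState, now: float,
               expiry: float = HEARTBEAT_EXPIRY) -> bool:
    """Broker-side check, assuming heartbeats are suspended during a job."""
    if worker.busy:
        # No heartbeats are expected while a job runs, so silence proves
        # nothing: a crashed busy worker is never reported as expired.
        return False
    return now - worker.last_seen > expiry


idle = WorkerState(last_seen=0.0)
busy = WorkerState(last_seen=0.0, busy=True)
```

With these definitions, an idle worker that has been silent for longer than
the expiry is detected, while a busy worker with the same silence is not,
which is exactly the case that needs another signal.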

When a worker crashes during job processing, my application doesn't notice,
since no messages are flowing (the broker is waiting for the job
result). libzmq does detect it, though, because the socket always gets
closed. My goal is to keep the number of underlying sockets in libzmq and
the Worker-related objects in my application in sync at all times.

I've googled around and found a few libzmq features that would suit my
needs:
- ZMQ_IDENTITY_FD - this was introduced and then removed from the lib
shortly afterwards
- ZMQ_SRCFD - deprecated, but it's exactly what I need!
- "Peer-Address" metadata - the recommended replacement for ZMQ_SRCFD, but
not suitable for my needs

I know fds should be handled with care (monitor events are asynchronous,
fds get reused), but ZMQ_SRCFD solves my problem with the following
ruleset:
1. When a Worker registers (first message over a connection), save the
underlying fd.
2. At the same time, check whether this fd is already in use by another
Worker. If it is, that Worker is dead, since its file descriptor has been
reused.
3. If a Worker's fd has been in the closed state for longer than the
heartbeat expiry time, then the Worker crashed and the fd was not reused
(this information comes from the socket monitor).
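
The ruleset above can be sketched roughly as follows. This is a minimal
Python illustration of the bookkeeping only, not a libzmq integration:
FdRegistry and its methods are hypothetical names, and the sketch assumes
the fd values are supplied by the caller (from ZMQ_SRCFD on the first
message, and from socket monitor events when a connection closes).

```python
from typing import Dict, List, Optional


class FdRegistry:
    """Tracks which Worker owns which fd (hypothetical helper)."""

    def __init__(self, expiry: float) -> None:
        self.expiry = expiry                     # heartbeat expiry, seconds
        self.fd_to_worker: Dict[int, str] = {}   # fd -> worker id
        self.closed_at: Dict[int, float] = {}    # fd -> time it was closed

    def register(self, worker_id: str, fd: int) -> Optional[str]:
        """Rules 1 and 2: save the fd and report a Worker that lost it."""
        previous = self.fd_to_worker.get(fd)
        self.fd_to_worker[fd] = worker_id
        self.closed_at.pop(fd, None)  # the fd is open again
        # A different Worker on the same fd means the old one is dead.
        return previous if previous != worker_id else None

    def fd_closed(self, fd: int, now: float) -> None:
        """Record a close notification (assumed to come from the monitor)."""
        if fd in self.fd_to_worker:
            self.closed_at.setdefault(fd, now)

    def reap(self, now: float) -> List[str]:
        """Rule 3: Workers whose fd stayed closed longer than the expiry."""
        dead = []
        for fd, closed_time in list(self.closed_at.items()):
            if now - closed_time > self.expiry:
                dead.append(self.fd_to_worker.pop(fd))
                del self.closed_at[fd]
        return dead
```

A broker loop would call register() on a Worker's first message,
fd_closed() when the monitor reports a closed connection, and reap()
periodically; every Worker id returned by register() or reap() can then be
dropped from the application state.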

I don't know whether hardcore zeromq users would consider this an ugly
hack, but it looks like a legitimate ZMQ_SRCFD use case to me. It would be
nice if it weren't removed in upcoming versions.
Any feedback is welcome!

Regards,
  Gyorgy



On Mon, Feb 13, 2017 at 10:21 PM, Greg Young <gregoryyoung1 at gmail.com>
wrote:

> I believe the term here is application level heartbeats.
>
> Clients heartbeating to the server should also be supported. Not all
> clients necessarily want similar heartbeat timeouts.
>
> On Mon, Feb 13, 2017 at 4:07 PM, Michal Vyskocil
> <michal.vyskocil at gmail.com> wrote:
> > Hi,
> >
> > You can take inspiration from malamute broker
> > https://github.com/zeromq/malamute
> >
> > There, clients ping the server regularly. MQTT does the same (except
> > that there it's the server that pings clients).
> >
> > Sadly, malamute is vulnerable to the same problem: a received service
> > request may get lost. A solution would be to let the client send the
> > request again after a timeout, but this hasn't been implemented yet.
> >
> > _______________________________________________
> > zeromq-dev mailing list
> > zeromq-dev at lists.zeromq.org
> > https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
>
> --
> Studying for the Turing test