[zeromq-dev] MDP protocol, detecting dead workers

Doron Somech somdoron at gmail.com
Tue Feb 14 21:18:22 CET 2017


Using srcfd is problematic: ZeroMQ handles reconnection itself, so the srcfd
might change.

To solve your problem, I would change the design: keep sending heartbeats
during long jobs and move the worker to a two-thread model.
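
A minimal sketch of the two-thread idea, under some assumptions: the job
runs in a helper thread, while the thread that owns the socket keeps sending
heartbeats. The names run_job_with_heartbeats and send_heartbeat are
hypothetical; send_heartbeat stands in for whatever sends the MDP heartbeat
on the worker's socket. ZeroMQ sockets should not be shared between threads,
so only the calling thread touches the socket here.

```python
import threading


def run_job_with_heartbeats(job, send_heartbeat, interval=1.0):
    """Run job() in a helper thread while the caller keeps heartbeating.

    send_heartbeat is a hypothetical callable that sends one heartbeat on
    the worker's socket; it is only ever called from the calling thread.
    """
    result = {}

    def target():
        try:
            result["value"] = job()
        except Exception as exc:  # carry the failure back to the caller
            result["error"] = exc

    helper = threading.Thread(target=target, daemon=True)
    helper.start()
    while helper.is_alive():
        send_heartbeat()
        helper.join(timeout=interval)  # wait at most one heartbeat interval
    if "error" in result:
        raise result["error"]
    return result["value"]
```

Note that a job that hangs keeps the heartbeats flowing in this sketch, so it
probably still needs a maximum job time on the broker side.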

Alternatively, you can set a maximum time for a job, after which you consider
the worker dead. If it turns out not to be dead, you can handle the
reconnection then.
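
Roughly, the broker could keep a deadline per busy worker. This is only a
sketch: JobDeadlines and its method names are made up for illustration, and
worker identities are assumed to be whatever the broker already uses to
address workers.

```python
import time


class JobDeadlines:
    """Track when each busy worker's job must be finished (hypothetical)."""

    def __init__(self, max_job_seconds, clock=time.monotonic):
        self.max_job_seconds = max_job_seconds
        self.clock = clock
        self.deadlines = {}  # worker identity -> absolute deadline

    def job_started(self, worker_id):
        self.deadlines[worker_id] = self.clock() + self.max_job_seconds

    def job_finished(self, worker_id):
        self.deadlines.pop(worker_id, None)

    def expired(self):
        """Return and forget workers whose job exceeded the maximum time."""
        now = self.clock()
        dead = [w for w, d in self.deadlines.items() if d <= now]
        for w in dead:
            del self.deadlines[w]
        return dead
```

The broker would call expired() from its poll loop and treat the returned
workers as dead; a worker that shows up again later simply re-registers.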

On Feb 14, 2017 17:33, "Gyorgy Szekely" <hoditohod at gmail.com> wrote:

> Hi,
> The implemented protocol (ZMQ-RFC 7/MDP) has application level mutual
> heartbeating between the broker and the worker. And this works fine: both
> parties detect if the other side dies via missing heartbeats. The problem
> appears when the worker is assigned a long-running job: heartbeating is
> _disabled_ while the job is being processed (as 7/MDP specifies). This
> enables the worker to be single threaded, and avoids typical multithreaded
> issues (eg. processing thread hangs, heartbeating thread runs; worker in
> inconsistent state).
>
> When a worker crashes during job processing my application doesn't realize
> this since no messages are flowing (the broker is waiting for the job
> result), but libzmq does detect it, since the socket is always closed. My
> goal is to keep the number of underlying sockets in libzmq always in sync
> with the Worker-related objects in my application.
>
> I've googled around and found a few libzmq features that would suit my
> needs:
> - ZMQ_IDENTITY_FD - this was introduced and removed from the lib shortly after
> - ZMQ_SRCFD - deprecated, but it's exactly what I need!
> - "Peer-Address" metadata, the recommended replacement for ZMQ_SRCFD, but
> not suitable for my needs
>
> I know fd's should be handled with care (monitor events are asynchronous,
> fd's get reused), but ZMQ_SRCFD solves my problem with the following
> ruleset:
> 1. When a Worker registers (first message over a connection), save the
> underlying fd
> - and -
> 2. Check whether this fd is already in use by another Worker; if it is,
> that Worker is dead, since libzmq reused its file descriptor
>
> 3. If a Worker's fd is in closed state for a longer period (heartbeat
> expiry time), then it crashed and the fd was not re-used (get this info
> from monitor)
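
For reference, the three rules above could be sketched like this. It is a
sketch under assumptions, not a recommendation: FdRegistry and its methods
are hypothetical, the fd is assumed to come from ZMQ_SRCFD on the first
message, and fd_closed is assumed to be fed from a socket monitor close
event. It also inherits the caveat that the fd can change on reconnection.

```python
class FdRegistry:
    """Map socket fds to worker identities, following the three rules."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.worker_by_fd = {}  # fd -> worker identity
        self.closed_at = {}     # fd -> time the monitor reported it closed

    def register(self, worker_id, fd):
        """Rules 1 and 2: save the fd; return the worker it displaced."""
        previous = self.worker_by_fd.get(fd)
        self.worker_by_fd[fd] = worker_id
        self.closed_at.pop(fd, None)  # the fd is open again
        return previous if previous != worker_id else None

    def fd_closed(self, fd, now):
        """Record a close event reported by the socket monitor."""
        if fd in self.worker_by_fd:
            self.closed_at.setdefault(fd, now)

    def expire(self, now):
        """Rule 3: return workers whose fd stayed closed past the expiry."""
        dead = []
        for fd, closed_time in list(self.closed_at.items()):
            if now - closed_time >= self.expiry_seconds:
                dead.append(self.worker_by_fd.pop(fd))
                del self.closed_at[fd]
        return dead
```
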
>
> I don't know if hardcore zeromq users would consider this an ugly hack,
> but it looks like a legitimate ZMQ_SRCFD use-case to me. It would be
> nice if it wasn't removed in the upcoming versions.
> Any feedback welcome!
>
> Regards,
>   Gyorgy
>
>
>
> On Mon, Feb 13, 2017 at 10:21 PM, Greg Young <gregoryyoung1 at gmail.com>
> wrote:
>
>> I believe the term here is application-level heartbeats.
>>
>> Clients should also be able to heartbeat to the server. Not all
>> clients necessarily want the same heartbeat timeout.
>>
>> On Mon, Feb 13, 2017 at 4:07 PM, Michal Vyskocil
>> <michal.vyskocil at gmail.com> wrote:
>> > Hi,
>> >
>> > You can take inspiration from the malamute broker
>> > https://github.com/zeromq/malamute
>> >
>> > There, clients ping the server regularly. MQTT does the same (except
>> > that there it is the server that pings the clients).
>> >
>> > Sadly, malamute is vulnerable to the same problem: a received service
>> > request may get lost. A solution would be to let the client send the
>> > request again after a timeout, but this has not been implemented yet.
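
A client-side resend after timeout could look roughly like the sketch below.
The send and recv callables are hypothetical stand-ins for the client's
socket operations; recv is assumed to return None when no reply arrives
within the timeout. This is not how malamute works today.

```python
def request_with_retry(send, recv, request, timeout, retries=3):
    """Send request and wait for a reply, resending after each timeout."""
    for _ in range(retries):
        send(request)
        reply = recv(timeout)  # hypothetical: None means "timed out"
        if reply is not None:
            return reply
    raise TimeoutError("no reply after %d attempts" % retries)
```

Since a resent request may be processed twice, requests would need to be
idempotent or carry an id that lets the service drop duplicates.
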
>> >
>> > _______________________________________________
>> > zeromq-dev mailing list
>> > zeromq-dev at lists.zeromq.org
>> > https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>
>>
>>
>> --
>> Studying for the Turing test