[zeromq-dev] MDP protocol, detecting dead workers
Gyorgy Szekely
hoditohod at gmail.com
Wed Feb 15 11:06:21 CET 2017
Hi,
I assume the following:
When a dealer socket (worker) reconnects to a router socket (broker) due to
a transient network issue (reconnection happens at the libzmq level), the new
connection _always_ gets a new identity in the router socket, and _may_ get
a different file descriptor (the fd might get reused). Workers don't specify
their identity.
Is this correct?
If it is, then I can deal with identity->fd associations fine.
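
A minimal sketch of the identity->fd bookkeeping implied by the ruleset quoted
below. It assumes the broker can read the fd behind the first message of a
connection (e.g. via ZMQ_SRCFD) and learns about closed fds from the socket
monitor. `WorkerTable`, `register`, `fd_closed` and `expire` are names made up
for illustration; none of them is libzmq API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Worker:
    identity: bytes
    fd: int
    closed_at: Optional[float] = None  # set once the monitor reports the fd closed


class WorkerTable:
    """Hypothetical helper that keeps identity <-> fd associations in sync."""

    def __init__(self, expiry: float) -> None:
        self.expiry = expiry  # heartbeat expiry time, in seconds
        self.by_identity: Dict[bytes, Worker] = {}
        self.by_fd: Dict[int, Worker] = {}

    def register(self, identity: bytes, fd: int) -> Optional[bytes]:
        """Rules 1 and 2: save the fd; return the identity evicted by fd reuse, if any."""
        # If this identity registered before, forget its previous fd mapping.
        prev = self.by_identity.get(identity)
        if prev is not None and self.by_fd.get(prev.fd) is prev:
            del self.by_fd[prev.fd]
        # Rule 2: the fd is held by another worker, so that worker is dead.
        evicted = None
        old = self.by_fd.get(fd)
        if old is not None and old.identity != identity:
            del self.by_identity[old.identity]
            evicted = old.identity
        worker = Worker(identity, fd)
        self.by_identity[identity] = worker
        self.by_fd[fd] = worker
        return evicted

    def fd_closed(self, fd: int, now: float) -> None:
        """Record a 'closed' event from the monitor for this fd."""
        worker = self.by_fd.get(fd)
        if worker is not None and worker.closed_at is None:
            worker.closed_at = now

    def expire(self, now: float) -> List[bytes]:
        """Rule 3: drop workers whose fd has been closed for at least `expiry` seconds."""
        dead = [w for w in self.by_identity.values()
                if w.closed_at is not None and now - w.closed_at >= self.expiry]
        for w in dead:
            del self.by_identity[w.identity]
            if self.by_fd.get(w.fd) is w:
                del self.by_fd[w.fd]
        return [w.identity for w in dead]
```

Time is passed in explicitly (`now`) so the table itself never reads a clock.
Since monitor events are asynchronous, a "closed" event may arrive after the
fd has already been reused; the sketch does not try to resolve that race.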
And yes, you're right about the protocol improvements, I'll consider this
option too.
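
The two-thread suggestion quoted below (keep heartbeating during a long job)
could look roughly like this on the worker side. This is a sketch only:
`run_job_with_heartbeat`, `job`, `send_heartbeat` and `interval` are
hypothetical names, and `send_heartbeat` stands in for whatever actually puts
an MDP heartbeat on the wire. libzmq sockets are not thread-safe, so real code
would have to hand the heartbeat over to the thread that owns the socket (for
example through an inproc pair) instead of sending from the helper thread.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def run_job_with_heartbeat(job: Callable[[], T],
                           send_heartbeat: Callable[[], None],
                           interval: float) -> T:
    """Run `job` in the calling thread while a helper thread heartbeats."""
    stop = threading.Event()

    def beat() -> None:
        # Event.wait() returns False on timeout, True once stop is set,
        # so this sends one heartbeat per interval until the job finishes.
        while not stop.wait(interval):
            send_heartbeat()

    helper = threading.Thread(target=beat, daemon=True)
    helper.start()
    try:
        return job()
    finally:
        # Stop heartbeating whether the job returned or raised.
        stop.set()
        helper.join()
```

This has exactly the weakness mentioned in the earlier mail: if `job` hangs,
the heartbeats keep flowing, so a maximum job time on the broker side is still
useful as a backstop.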
Regards,
Gyorgy
On Tue, Feb 14, 2017 at 9:18 PM, Doron Somech <somdoron at gmail.com> wrote:
> Using srcfd is problematic: zeromq handles reconnection and the srcfd might
> change.
>
> To solve your problem I would change the design: keep sending heartbeats
> during a long job and change the worker to a two-thread model.
>
> Alternatively, you can set a maximum time for a job, after which you consider
> the worker dead. If it is not dead, you can handle the reconnection then.
>
> On Feb 14, 2017 17:33, "Gyorgy Szekely" <hoditohod at gmail.com> wrote:
>
>> Hi,
>> The implemented protocol (ZMQ-RFC 7/MDP) has application-level mutual
>> heartbeating between the broker and the worker. And this works fine: both
>> parties detect if the other side dies via missing heartbeats. The problem
>> appears when the worker is assigned a long-running job: heartbeating is
>> _disabled_ while the job is being processed (as 7/MDP specifies). This
>> enables the worker to be single-threaded and avoids typical multithreading
>> issues (e.g. processing thread hangs, heartbeating thread runs; worker in
>> inconsistent state).
>>
>> When a worker crashes during job processing, my application doesn't
>> realize this since no messages are flowing (the broker is waiting for the
>> job result), but libzmq detects it, as the socket is always closed.
>> My goal is to always keep the number of underlying sockets in libzmq in
>> sync with the Worker-related objects in my application.
>>
>> I've googled around and found a few libzmq features that would suit my
>> needs:
>> - ZMQ_IDENTITY_FD - this was introduced and shortly removed from the lib
>> - ZMQ_SRCFD - deprecated, but it's exactly what I need!
>> - "Peer-Address" metadata, the recommended replacement for ZMQ_SRCFD, but
>> not suitable for my needs
>>
>> I know fd's should be handled with care (monitor events are asynchronous,
>> fd's get reused), but ZMQ_SRCFD solves my problem with the following
>> ruleset:
>> 1. When a Worker registers (first message over a connection) save the
>> underlying fd
>> - and -
>> 2. Check whether this fd is in use by another Worker; if it is, that Worker
>> is dead, since libzmq reused its file descriptor
>>
>> 3. If a Worker's fd is in closed state for a longer period (heartbeat
>> expiry time), then it crashed and the fd was not re-used (get this info
>> from monitor)
>>
>> I don't know if this is considered as an ugly hack by hardcore zeromq
>> users, but it looks like a legitimate ZMQ_SRCFD use-case to me. It would be
>> nice if it wasn't removed in the upcoming versions.
>> Any feedback welcome!
>>
>> Regards,
>> Gyorgy
>>
>>
>>
>> On Mon, Feb 13, 2017 at 10:21 PM, Greg Young <gregoryyoung1 at gmail.com>
>> wrote:
>>
>>> I believe the term here is application level heartbeats.
>>>
>>> Clients heartbeating to the server should also be supported. Not all
>>> clients necessarily want the same heartbeat timeout.
>>>
>>> On Mon, Feb 13, 2017 at 4:07 PM, Michal Vyskocil
>>> <michal.vyskocil at gmail.com> wrote:
>>> > Hi,
>>> >
>>> > You can take inspiration from malamute broker
>>> > https://github.com/zeromq/malamute
>>> >
>>> > There, clients ping the server regularly. MQTT does the same (except
>>> > there it is the server that pings clients).
>>> >
>>> > Sadly, malamute is vulnerable to the same problem: a received service
>>> > request may get lost. The solution would be to let the client send the
>>> > request again after a timeout; however, this hasn't been implemented yet.
>>> >
>>> > _______________________________________________
>>> > zeromq-dev mailing list
>>> > zeromq-dev at lists.zeromq.org
>>> > https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>>
>>> --
>>> Studying for the Turing test