[zeromq-dev] How to do reliable fire and forget with a HWM?

Schaller, Edward B schallee at lanl.gov
Fri Sep 2 18:36:12 CEST 2011


Last month I submitted bug https://zeromq.jira.com/browse/LIBZMQ-229 (push/pull + HWM + slow puller + zmq_close = lost message). Recently a comment was posted suggesting that it be closed as "won't fix." I've been meaning to ask about this, so now's the time ;)

The basic problem is that if the puller on a push/pull socket with a HWM is slow, the pusher has no way to know if the message was ever sent before shutting down. Contrary to the documentation, if the HWM on the puller has been reached, shutting down the context will drop the message.

This seems to mean that a simple fire and forget scenario is not reliable with ZMQ if a HWM is set. I have searched and found no way for the pusher to determine from the API whether it is safe to shut down the context. Am I correct in concluding that, when a HWM is set, ZMQ doesn't provide reliable fire and forget? I'm hoping this isn't the case.

The only workaround with push/pull that I have been able to find is to sleep before the context is closed. Aside from being ugly, this doesn't actually work in the general case, because the client has no way to know how long it needs to sleep.

In my situation I'm dealing with 85M messages at a rate of two per second or higher. A producer sends messages to a bufferer, which then sends messages to a consumer. The bufferer writes incoming messages to persistent storage so that the consumer can run asynchronously from the producer. All sides of the two push/pull connections need a HWM set or they will quickly run out of memory at that message rate. When a batch of messages is finished, a final termination message is sent through the two queues. Because of this bug, neither the producer nor the bufferer knows whether it can shut down after pushing the termination message.
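
To make the intended flow concrete, here is a minimal in-process sketch in Python. It is only a model: bounded `queue.Queue` objects stand in for the two HWM-limited push/pull connections, a list stands in for persistent storage, and the names `HWM`, `TERMINATE`, and `run_pipeline` are invented for illustration. No ZeroMQ calls are involved, so it shows the behaviour I expect from the pipeline, not the message loss itself. It also simplifies the bufferer, which here forwards directly instead of letting the consumer read from storage later.

```python
import queue
import threading

HWM = 4           # hypothetical high-water mark used on both hops
TERMINATE = None  # hypothetical termination marker ending a batch


def producer(out_q, count):
    """Push `count` messages, then the termination message."""
    for i in range(count):
        out_q.put(i)  # blocks while the queue is full, like a push at its HWM
    out_q.put(TERMINATE)


def bufferer(in_q, out_q, storage):
    """Record each message, then forward it to the consumer."""
    while True:
        msg = in_q.get()
        if msg is TERMINATE:
            out_q.put(TERMINATE)  # pass the termination message along
            return
        storage.append(msg)  # stand-in for a write to persistent storage
        out_q.put(msg)


def consumer(in_q, received):
    """Pull messages until the termination message arrives."""
    while True:
        msg = in_q.get()
        if msg is TERMINATE:
            return
        received.append(msg)


def run_pipeline(count):
    """Run producer -> bufferer -> consumer and return what each side saw."""
    q1 = queue.Queue(maxsize=HWM)
    q2 = queue.Queue(maxsize=HWM)
    storage, received = [], []
    threads = [
        threading.Thread(target=producer, args=(q1, count)),
        threading.Thread(target=bufferer, args=(q1, q2, storage)),
        threading.Thread(target=consumer, args=(q2, received)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage, received
```

In this model every stage can safely exit once it has handed off the termination message, because the queue outlives the thread. That is exactly the guarantee I cannot get from the real sockets when the context is shut down.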

Switching from push/pull to request/reply would work around this bug. Although this should fix the problem, it moves away from fire and forget and toward a more synchronous system.

Another alternative would be a reverse message flow of "I got message x" messages. Again, this is more complex, but it also suffers from the same bug. These systems run for hours and process tens of thousands of messages. Every queue needs a HWM, which means the puller now has no way to know whether the pusher received the "I got the message" message, for the same reason, and so cannot shut itself down.
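
For reference, the bookkeeping such a reverse flow would need can be sketched as follows. This is again an in-process Python model with bounded queues in place of sockets, and `HWM`, `push_all`, and the use of `None` as the termination payload are all hypothetical. The pusher has to drain acknowledgements while it is sending, otherwise both sides can block on full queues at the same time.

```python
import queue
import threading

HWM = 2  # hypothetical bound applied to both the data and the ack direction


def puller(data_q, ack_q, handled):
    """Handle each message and report "I got message x" back to the pusher."""
    while True:
        msg_id, payload = data_q.get()
        handled.append(msg_id)
        ack_q.put(msg_id)
        if payload is None:  # hypothetical termination marker
            return


def drain_acks(ack_q, unacked):
    """Consume every acknowledgement currently waiting, without blocking."""
    while True:
        try:
            unacked.discard(ack_q.get_nowait())
        except queue.Empty:
            return


def push_all(payloads):
    """Send payloads plus a termination message; return only once all are acked."""
    data_q = queue.Queue(maxsize=HWM)
    ack_q = queue.Queue(maxsize=HWM)
    handled = []
    worker = threading.Thread(target=puller, args=(data_q, ack_q, handled))
    worker.start()

    unacked = set()
    for msg_id, payload in enumerate(list(payloads) + [None]):
        unacked.add(msg_id)
        while True:
            drain_acks(ack_q, unacked)  # keep the ack queue from filling up
            try:
                data_q.put((msg_id, payload), timeout=0.01)
                break
            except queue.Full:
                continue  # puller is slow; drain acks and retry

    while unacked:  # safe-to-shut-down condition for the pusher
        unacked.discard(ack_q.get())
    worker.join()
    return handled
```

Calling `push_all(["a", "b", "c"])` returns only after the puller has acknowledged all three messages and the termination message. The sketch covers the pusher's side only; it does nothing about the concern above that the final acknowledgement could itself be dropped when the puller shuts down.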

Is there another solution that I am missing that doesn't require switching from a simple push/pull message flow?

The problem is more complicated in my situation because each of these components is being implemented by a different organization. The simplicity of asynchronous fire and forget messaging for this flow influenced the design. ZMQ was selected because of its lightweight nature and its performance. This meant that only the message formats needed detailed design. Changes now require large quantities of bureaucratic red tape.

Any suggestions would be greatly appreciated. Thanks!

