[zeromq-dev] fault-tolerance and butterfly example

Павел Гуща pavimus at gmail.com
Thu Mar 19 10:20:31 CET 2009

First, sorry for my bad English; it is not my native language :-)

I plan to develop a fault-tolerant distributed computing cluster with many
running applications and many servers.
0MQ has big advantages in performance, resource usage, and scalability; I
want to discuss fault-tolerance techniques that may be used with 0MQ.
Currently, fault tolerance in 0MQ means that applications try to reconnect
to other applications using the same address and port.
Now take a look at the butterfly example.
Failing some (but not all) instances of component1 and component2 does not
stop the whole system, but there are still some single points of failure
(SPOFs). Take a closer look at the 'intermediate' part of the butterfly
example: when the server running this application goes down, the whole
system stops.
How can we avoid this? My idea:
1) We create two copies of intermediate: the first (intermediate1) creates
exchange INTERMEDIATE_IN1 and queue INTERMEDIATE_OUT1; the second
(intermediate2) creates exchange INTERMEDIATE_IN2 and queue INTERMEDIATE_OUT2.
2) Component1 binds its load-balancing local exchange to both
INTERMEDIATE_IN1 and INTERMEDIATE_IN2.
3) Component2 binds its local queue to both INTERMEDIATE_OUT1 and
INTERMEDIATE_OUT2.

When both intermediate1 and intermediate2 are up, they work in a
load-balancing manner without any problem.
Now what happens when intermediate2 goes down? I think the following
scenario will take place:
1) Component1 has one pipe for INTERMEDIATE_IN1 and one for
INTERMEDIATE_IN2. After intermediate2 crashes, its pipe collects messages
in an internal buffer. When there is no free space left in that buffer, the
load-balancing exchange sends messages only to the pipe of intermediate1.
The system does not stop, and component1 regularly tries to reconnect to
intermediate2.
2) Component2 has one pipe for INTERMEDIATE_OUT1 and one for
INTERMEDIATE_OUT2. After intermediate2 crashes, no messages are received
from intermediate2, only from intermediate1. The system does not stop, and
component2 regularly tries to reconnect to intermediate2.
When the server running intermediate2 is restored, the messages held in
component1's pipe will be flushed and the system will be fully restored.
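This buffering behaviour can be illustrated with a small toy model (plain
Python, not 0MQ's actual implementation; all class and method names here
are invented for the sketch): a load-balancing exchange round-robins over
its pipes, and a pipe whose peer is down accepts messages only while its
internal buffer still has free space.

```python
from collections import deque

class Pipe:
    """Toy model of one outgoing pipe with a bounded internal buffer."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.connected = True
        self.buffer = deque()

    def can_accept(self):
        # A live pipe always accepts; a dead pipe accepts only while
        # its internal buffer still has free space.
        return self.connected or len(self.buffer) < self.capacity

    def send(self, msg):
        if not self.connected:
            self.buffer.append(msg)  # held until the peer comes back
        # a connected pipe delivers immediately in this toy model

    def reconnect(self):
        # The peer is back: flush everything buffered while it was down.
        self.connected = True
        flushed = list(self.buffer)
        self.buffer.clear()
        return flushed

class LoadBalancingExchange:
    """Round-robins messages over the pipes that can still accept them."""
    def __init__(self, pipes):
        self.pipes = pipes
        self.next = 0

    def send(self, msg):
        for _ in range(len(self.pipes)):
            pipe = self.pipes[self.next]
            self.next = (self.next + 1) % len(self.pipes)
            if pipe.can_accept():
                pipe.send(msg)
                return pipe.name
        raise RuntimeError("all pipes full: the sender would block here")

pipe1 = Pipe("intermediate1", capacity=2)
pipe2 = Pipe("intermediate2", capacity=2)
exchange = LoadBalancingExchange([pipe1, pipe2])

pipe2.connected = False  # intermediate2 crashes
routes = [exchange.send("msg%d" % i) for i in range(6)]
# msg1 and msg3 land in intermediate2's buffer; once that buffer is
# full, every later message is routed to intermediate1.

flushed = pipe2.reconnect()  # intermediate2 comes back; buffer is flushed
```

Real 0MQ watermarks and reconnection details differ, of course; this only
mirrors the scenario sketched above.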

But what can we do when we cannot restore intermediate2 on the old server
(we must repair the hardware, which will take a lot of time), but can start
intermediate2 on another server (different IP and port)? As I understand
it, it will start successfully, and zmq_server will store a new pointer to
the global objects of intermediate2. Newly started applications will see
the correct location of intermediate2. Already running applications,
however, must be restarted in any case (because an application does not ask
zmq_server for the new location of a global object).
Restarting all running applications in a big system may be a problem.
Is there a way to shut down an application cleanly (disconnect from message
sources, process all messages in internal buffers, send response messages
to their destinations, and then terminate)?
Without the possibility of a clean shutdown, many messages may be lost when
restarting all running applications, and not only the messages stored in
the internal buffers of component1 instances for intermediate2. In a really
big system this may be a very big problem.
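As far as I know, 0MQ does not offer such a primitive today, but the
sequence described above (stop accepting, drain buffers, respond,
terminate) can be sketched in plain Python; all names here are invented
for illustration:

```python
import queue

def clean_shutdown(inbox, process, disconnect):
    """Drain-then-terminate: stop taking new messages, process what is
    already buffered, and return the responses so the caller can send
    them out before exiting."""
    disconnect()             # 1) disconnect from message sources
    responses = []
    while True:              # 2) process everything still in the buffer
        try:
            msg = inbox.get_nowait()
        except queue.Empty:
            break
        responses.append(process(msg))
    return responses         # 3) caller sends these, then terminates

# Example: three messages are still buffered when shutdown is requested.
inbox = queue.Queue()
for i in range(3):
    inbox.put(i)
responses = clean_shutdown(inbox, lambda m: m * 2, lambda: None)
```

The point is only the ordering: no message that was already accepted is
dropped, because draining happens after the disconnect and before exit.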

Now, my questions:
1) Will my idea with two intermediate applications work?
2) How can I cleanly shut down an application (without losing messages in
internal buffers)?
3) How can I force an application to read the new location of a global
object from zmq_server when the connection to that object is lost?
