[zeromq-dev] weirdness

Andrew Hume andrew at research.att.com
Wed Feb 2 10:22:48 CET 2011

i hate to ask a fuzzy question like this, but i am nearly desperate.

i have a fairly ordinary convoy of processes distributed across 3 linux systems.
there is a central process which has a pub/sub control socket to
every other process, and each of those processes has a push/pull socket back to the
central process. this latter socket contains heartbeats so that the central guy knows who is alive.
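
for concreteness, the liveness bookkeeping on the central side amounts to something like
the sketch below. this is python with no zeromq calls in it; the names LivenessTable,
timeout_s and the (host, pid) key are made up for illustration, and the 5 second timeout
is an assumption, not a value from the real setup.

```python
import time
from collections import defaultdict


class LivenessTable:
    """Remembers when each worker last sent a heartbeat, keyed by (host, pid)."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed value; tune to the heartbeat period
        self.clock = clock           # injectable so the logic can be tested
        self.last_seen = {}

    def record(self, host, pid):
        # Call this for every heartbeat pulled off the push/pull socket.
        self.last_seen[(host, pid)] = self.clock()

    def silent(self):
        # Workers whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(k for k, t in self.last_seen.items()
                      if now - t > self.timeout_s)

    def silent_by_host(self):
        # Group silent workers by host, to see whether one machine went quiet.
        by_host = defaultdict(list)
        for host, pid in self.silent():
            by_host[host].append(pid)
        return dict(by_host)
```

if silent_by_host() comes back with a single host holding every silent pid, that matches
the pattern of one machine dropping out while the others stay healthy.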

so far, so good. i can control these processes; bring them up and make them die
(by sending a quit command over the control socket.) this all works multiple times.
then for no obvious reason, all of the processes on one specific system are
now disconnected from the control process. commands go out to them just fine,
but their heartbeats (sent successfully) don't arrive at the control process.

i think this has to be a zeromq thing; some forwarding thread or process is wedged.
all nodes are identical and run identical software. ordinarily i would just reboot the system in question,
but is there something i can look for before i do so?

