[zeromq-dev] thoughts on pub-sub messaging and "reliability"

Holger Joukl Holger.Joukl at LBBW.de
Fri Oct 17 11:38:26 CEST 2014


Hi,

I realize I'm sometimes confused by terms used by myself and others.

So, in an attempt to clarify my thoughts, first and foremost to myself, by
writing them down here's how I tend to think about pub-sub messaging:

(0) Publish-subscribe (pub-sub) messaging:
- one or many sender(s) publish data on one or more "channels"
- one or many listener(s) subscribe to one or more channels
- communication is asynchronous
- listeners might not receive all messages even if up-and-running when
the messages are being sent (due to network glitches or whatever)

(1) Reliable pub-sub messaging:
- one or many sender(s) publish data on one or more "channels"
- one or many listener(s) subscribe to one or more channels
- communication is asynchronous
- as long as a listener is up-and-running when the messages are being sent
it will
receive all messages from all senders publishing on the channels they've
subscribed to,
in order (order per sender)

(2) Guaranteed/Certified pub-sub messaging:
- one or many sender(s) publish data on one or more "channels"
- one or many listener(s) subscribe to one or more channels
- communication is asynchronous
- as long as a listener is up-and-running when the messages are being sent
it will
receive all messages from all senders publishing on the channels they've
subscribed to,
in order (order per sender)
- even if a listener is *not* up-and-running when messages are being sent a
listener
will still get all the messages, i.e. the missed messages will get
re-delivered,
in order (order per sender); the listener will not receive new messages
until it
has received all missed message
- as a consequence, messages need to get persisted:
  - either each listeners' subscriptions need to get registered when
opening a
subscription, or by predefined configuration, so persisted messages can be
safely
deleted from persistent storage when all registered listeners have received
them
an explicitly acknowledged that fact
  - or all sent messages need to get persisted for a period of time so a
listener
can request missed messages for retransmission

Sometimes the "reliability" distinction is called different "qualities of
service"
(QoS).

Note that I've deliberately ignored any problems that might arise in
pub-sub
communications, e.g. slow consumers in (unreliable) high frequency
scenarios
or the like.

Pub-sub messaging can be implemented over a variety of transports and
protocols:
(UDP) broadcast, multicast, TCP, ...
The transport + protocol used determines the properties of
pub-sub-messaging, e.g.:
- plain UDP broadcast is unreliable
- PGM or NORM multicast is reliable
- a protocol on top of TCP/UDP/reliable multicast is ususally necessary for
guaranteed/certified messaging

Design approaches you encounter in the wild:

- central broker with queues, optionally persistent (e.g. AMQP brokers, JMS
providers, IBM MQ, ...)
  - senders do not know anything about listeners
  - listeners do not know anything about senders
  - the broker is "the rendezvous point" for communication, often called
"message bus"
  - senders connect to the broker (usually TCP)
  - listeners connect to the broker (usually TCP)
  - channels are usually called "topics" (basically a special case of queue
that allows
    for many listeners to receive the topic messages)
  - broker knows registered guaranteed/certified listeners: When all known
listeners
    have retrieved and acknowledged a certain message on a topic this
message will get
    deleted from the queue
  - queue persistence to make queued messages survive broker failure
  - broker might be distributed, i.e. multiple broker working in
cooperation, e.g. for reasons
    of throughput scaling, partitioning of data channels, replication
  - broker is a single point of failure so will normally get clustered and
replicated for
    some notion of cold/hot standby for mission critical communication

- central broker with a persistent commit log or journal (e.g. Apache
Kafka, ZPER, ...)
  - senders do not know anything about listeners
  - listeners do not know anything about senders
  - the broker is "the rendezvous point" for communication
  - senders connect to the broker (usually TCP)
  - listeners connect to the broker (usually TCP)
  - messages sent on the channels are persisted by the broker as sequenced
continous message
    "commit logs" for a configurable period of time
  - listeners can get retransmission of any historic message within the
configured period of
    time
  - listeners are responsible for their state, i.e. which messages they
have already
    processed
  - listener state is basically just a pointer to its current position in
the commit log
   - broker might be distributed, i.e. multiple broker working in
cooperation, e.g. for reasons
    of throughput scaling, partitioning of data channels, replication
  - broker is a single point of failure so will normally get clustered and
replicated for
    some notion of cold/hot standby for mission critical communication

- distributed message bus (e.g. TIB/Rendezvous, http://iris.karalabe.com/
, ...)
  - consists of a network of "micro brokers", sometimes called daemons or
agent or relay process,
    ususally one per host
  - senders connect to a (local) broker (usually IPC or TCP)
  - listeners connect to a (local) broker (usually IPC or TCP)
  - the "network of brokers" is "the rendezvous point" for communication,
i.e. the brokers
    collectively form a "message bus"
  - senders do not know anything about listeners in case of reliable
pub-sub
  - listeners do not know anything about senders in case of reliable
pub-sub
  - senders know about registered listeners in case of guaranteed/certified
pub-sub
  - message persistence is local to the sending clients (called "ledger"
file e.g. in TIB/Rv
    terms); might also be possible local to the local broker but I haven't
seen such an approach
    yet
  - registered listeners need to acknowledge messages with their senders so
they can remove them
    from persistent storage
  - brokers are not single points of failure
  - brokers do not need separate clustering/replication apart from what
needs to be done
    for the mission critical applications running on their host systems
anyway; they are
    basically guarded by the same safety measures taken for the host
system, ranging from none
    to whatever

- somewhat hybrid forms, e.g. brokers that scale out elastically like
http://roq-messaging.org/

Hope no one's annoyed by the
lengthy-not-really-a-question-nor-problem-description
kind of post. Glad to get anyone's thoughts or totally different views or
hints on the
glaringly obvious that I missed.

Best regards
Holger

Landesbank Baden-Wuerttemberg
Anstalt des oeffentlichen Rechts
Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704
Amtsgericht Stuttgart




More information about the zeromq-dev mailing list