[zeromq-dev] Questions to better understand zproto and FileMQ

Mario Steinhoff steinhoff.mario at gmail.com
Sun Jan 31 23:01:26 CET 2016

2016-01-31 15:14 GMT+01:00 Pieter Hintjens <ph at imatix.com>:

> On Thu, Jan 28, 2016 at 6:30 PM, Mario Steinhoff
> <steinhoff.mario at gmail.com> wrote:
> > It would be great if someone could confirm them, or if statements do not
> > hold true, clarify the inner zproto workings.
> Your description is 100% accurate. Well done.
> Feel free to add to the README.txt if you feel you can make it clearer
> to newcomers.

Sure :)

> > With a single client, my implementation works just fine. But when there
> > are multiple clients connected, and a large file is sent to one of the
> > clients, other clients timeout.
> This is an interesting problem. I've not tried this test in the
> original code so perhaps the best is that you study it and figure out
> what's happening. Where is it blocking?

During the last few days, I mitigated the problem by moving the code that
sends heartbeats to the client side and raising the expiry timeout to 30s.

What I currently have is a system that distributes a set of 'records':

- One record is sent in one message, and records are limited in size
- One publisher distributes records to many (or should I say few, expected
to be < 100) subscribers.
- The records are grouped via path-like names and transferred in batches
with a clear start and end, using credit-based flow control.
- A client subscribes to the server for all record sets it's interested in.
- Server and clients calculate MD5 hashes for all record groups and
transmit/check them during subscriptions, so only actual changes are sent.
- For now, clients can subscribe to a server without any form of

So I'd say it's very similar to FileMQ although not exactly the same.
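
To illustrate the MD5 comparison mentioned in the list above, here is a minimal sketch in Python. It is not the actual implementation: the names group_digest and needs_update, the dict-of-bytes record layout, and the length-prefixed hashing are all assumptions made for the example.

```python
import hashlib


def group_digest(records):
    """Return the MD5 hex digest of one record group.

    `records` maps a record name (str) to its payload (bytes). This layout
    is an assumption for the sketch, not the real wire format.
    """
    md5 = hashlib.md5()
    # Sort by name so server and client hash the records in the same order.
    for name in sorted(records):
        name_bytes = name.encode("utf-8")
        payload = records[name]
        # Length-prefix every field so that different splits of the same
        # bytes cannot produce the same digest.
        md5.update(len(name_bytes).to_bytes(4, "big") + name_bytes)
        md5.update(len(payload).to_bytes(4, "big") + payload)
    return md5.hexdigest()


def needs_update(server_records, client_digest):
    """True when the digest sent with a subscription differs from the server's."""
    return group_digest(server_records) != client_digest
```

A client that already holds the same records would send a matching digest, so needs_update returns False and nothing is transferred for that group.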

Today I created a little test case to demonstrate the problem:

1. Set heartbeat interval to 1s, expiry timeout to 3s.
2. Launch a large enough number of 'empty' client processes (clients that
have not received any data during a previous run, 8 clients seem to be
3. Launch the server process.

On the server side, I added logging code to calculate the time it takes to
execute all actions within a poll loop iteration, e.g. execute client FSM,
remove stale connections, monitor server, etc. It also gives me a warning
when that time exceeds one second.
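
The timing instrumentation described above could look roughly like the following Python sketch. The helper name timed_iteration and the injectable clock and warn parameters are assumptions for illustration; the real server is a zproto-generated engine, not Python.

```python
import time

SLOW_ITERATION_SECS = 1.0  # warning threshold taken from the description above


def timed_iteration(actions, clock=time.monotonic, warn=print):
    """Run every (name, callable) pair once and measure how long each takes.

    Returns (total_seconds, {name: seconds}). A warning is emitted when the
    whole iteration exceeds SLOW_ITERATION_SECS.
    """
    timings = {}
    start = clock()
    for name, action in actions:
        t0 = clock()
        action()
        timings[name] = clock() - t0
    total = clock() - start
    if total > SLOW_ITERATION_SECS:
        warn(f"poll loop iteration took {total:.3f}s: {timings}")
    return total, timings
```

Using a monotonic clock avoids bogus readings when the wall clock is adjusted while the server is running.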

On the client side, I added logging code to calculate the time it takes
between heartbeat request and response and a warning if this exceeds 100ms.

Server logfile: http://pastebin.com/raw/rcpZEUwX
Client logfile: http://pastebin.com/raw/19KjirFz (similar to other client

And now we can observe the following behavior:

- Each client starts, connects to the socket, sends hello, waits for response
- Server starts, binds on socket, receives hello from all clients, sends
hello ok to all clients
- Clients receive hello ok, send subscriptions, send credit
- Server receives subscribe from all clients, stores subscription requests,
sends subscribe ok to all clients (subscription *requests*, because mounts
can be added later on)
- Clients receive subscribe ok (no further action required)
- Server receives credit from all clients, but has nothing to send yet
- Server receives a few heartbeats, sends heartbeat ok, everything cool
- Server adds the first mount with record data
- Server finds that there are pending subscription requests and adds them
to the mount
- Server finds that the MD5 sums from the subscription requests differ,
finds it has credits for all clients and starts sending records for that
mount to _all_ clients

In this case, sending the first record set to all clients takes ~7 seconds,
blocking the poll loop.

During this time, the server will queue up heartbeat messages from clients
but cannot send heartbeat ok back. When the expiry timeout on the client
is too low, a client will think the server is gone, expire the connection,
reset its internal state back to connecting and send a hello. After the
blocking action is done, the server still thinks it can send data and
happily sends a payload, while the client expects a hello ok and throws a
protocol error.

So I'd say this is not a problem with the zproto engines per se, all
single-threaded event driven systems will show such behavior when the event
thread is blocked.
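
The interaction between a blocked loop and the client-side expiry timer can be reproduced with a small model. This is a simplified sketch under stated assumptions: the function name is made up, the client only checks for expiry on its heartbeat timer ticks, and the server answers nothing at all while it is blocked.

```python
def simulate_blocked_server(block_secs, heartbeat_interval, expiry_timeout):
    """Model one client while the server loop is blocked from t=0 to t=block_secs.

    Returns (expired_at, queued_heartbeats). expired_at is None when the
    client survives the blocking period.
    """
    queued = 0
    t = heartbeat_interval  # first heartbeat timer tick after the block starts
    while t < block_secs:
        if t >= expiry_timeout:
            # No reply for too long: the client resets and sends hello again.
            return t, queued
        queued += 1  # heartbeat sits in the server's queue, unanswered
        t += heartbeat_interval
    return None, queued


# Settings from the test case above: ~7s block, 1s heartbeat, 3s expiry.
print(simulate_blocked_server(7, 1, 3))   # (3, 2): client expires after 3s
# Same block with the expiry timeout raised to 30s.
print(simulate_blocked_server(7, 1, 30))  # (None, 6): client survives
```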

Possible solutions I can think of:

1. Raise the expiry timeout on client side.

More like a workaround, because blocking still occurs but the clients won't
care anymore.
Will cause timeouts if server actions block longer than the timeout.

2. Change the logic for how I notify clients about changed data.

Currently my server actor receives a change event when a record set has
changed and then sends records to all clients in one go. A better solution
would be to have some sort of internal messaging where the engine would
consume the change event, update its mount, and then send an internal
message for each client to the poll loop. That way, blocking would still
occur but is limited to one client and the time required to send one record

Will cause timeouts if sending a single record set takes longer than the
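
The effect of splitting the work into one internal message per client can be sketched with a simulated clock. Everything here is hypothetical: the function name, the model in which pending heartbeats are answered between work items, and the numbers in the usage lines (7 clients, 1s per send), which are illustrative and not measurements.

```python
from collections import deque


def heartbeat_delays(n_clients, send_cost, heartbeat_times, split):
    """Return how long each heartbeat waits for its reply, in arrival order.

    A change event fires at t=0. With split=False one handler sends to all
    clients in one go. With split=True the handler enqueues one internal
    "send" item per client and the loop serves heartbeats in between.
    """
    if split:
        work = deque([send_cost] * n_clients)
    else:
        work = deque([send_cost * n_clients])
    pending = sorted(heartbeat_times)
    now = 0.0
    delays = []
    while work or pending:
        # Answer every heartbeat that has already arrived (cheap).
        while pending and pending[0] <= now:
            delays.append(now - pending.pop(0))
        if work:
            now += work.popleft()       # blocking send of one work item
        elif pending:
            now = max(now, pending[0])  # idle until the next heartbeat
    return delays


print(heartbeat_delays(7, 1.0, [0.5, 1.5, 2.5], split=False))  # [6.5, 5.5, 4.5]
print(heartbeat_delays(7, 1.0, [0.5, 1.5, 2.5], split=True))   # [0.5, 0.5, 0.5]
```

In this model the worst-case delay drops from the cost of the whole batch to the cost of one send, which matches the limitation noted above: a single slow send still blocks the loop.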

3. Make the server multi-threaded.

Change the server thread to be a proxy that handles client connections and
heartbeats and offloads the data sending logic to producer threads that can
block as long as they want.

Will avoid timeouts completely, because the server engine then only cares
about protocol validation, heartbeating and forwarding messages from
producer threads to clients.
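
As a rough illustration of this split, the Python sketch below uses one proxy loop and one producer thread that share a single event queue. It is a toy model under stated assumptions, not zproto or CZMQ code: run_proxy, the event tuples, and the sleep that stands in for slow data preparation are all invented for the example.

```python
import queue
import threading
import time


def run_proxy(heartbeats, records, block_secs=0.05):
    """Answer heartbeats in the proxy loop while a producer thread does slow work.

    Returns the ordered list of messages the proxy would send to clients.
    """
    events = queue.Queue()  # single inbox for the proxy loop

    def producer():
        time.sleep(block_secs)  # stands in for slow, blocking work
        for record in records:
            events.put(("RECORD", record))
        events.put(("DONE", None))

    # Heartbeats that are already waiting when the producer starts its work.
    for client in heartbeats:
        events.put(("HEARTBEAT", client))

    worker = threading.Thread(target=producer)
    worker.start()

    sent = []
    while True:
        kind, payload = events.get()  # waits only for the next event
        if kind == "DONE":
            break
        if kind == "HEARTBEAT":
            sent.append(("HEARTBEAT-OK", payload))
        else:
            sent.append(("RECORD", payload))  # forward producer output
    worker.join()
    return sent
```

The proxy never calls into the slow code itself, so the time it spends per event stays small no matter how long the producer blocks.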

Then we get something like this on the server side:
