[zeromq-dev] Logging HWM events?

Tom Wilberding tom at wilberding.com
Wed Sep 28 16:47:41 CEST 2011


Hi Martin,

I'm not sure what caused the delay. That is our big mystery.

We have a simple app that does a TCP SUB and simply repeats via TCP PUB
to several clients.

The memory usage of the repeater application had reached 2 GB or so
(it usually hovers around 20 MB in steady state). During this time the
CPU on the server was pegged by the various clients (2 of which
were running on the same host as the repeater and 2 of which were
running on other hosts).

The incoming TCP traffic to the repeater is on the order of 10Mbps and
spikes occasionally. My hypothesis is that we had a large spike of
incoming data and we reached the HWM on the repeater and entered the
exceptional state where we started to write to disk, which would
accelerate the problem. We also may have been using up enough memory
that we were slowing down and making the problem worse. It happened 2
days in a row and we had to hard reboot the machine yesterday because
the CPU was completely pegged.

We've moved the repeater to its own host and it seems much better
behaved today. The CPU usage is very low on both hosts, but the memory
usage on the repeater is currently hovering around 700 MB (I see it
expand and contract, but there is a lot of incoming data right now, so
it must be buffering a fair amount).
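
As a rough cross-check, the memory figures can be converted into
"seconds of input held in memory". This is only a sketch: it assumes the
memory figures above are megabytes and gigabytes, that the input stays
at a steady 10 Mbit/s, and that nothing is drained meanwhile. The
function name is made up for illustration.

```python
def seconds_of_backlog(buffered_bytes, rate_bits_per_s):
    """Time the input stream needs to fill `buffered_bytes` of memory,
    assuming a constant rate and no draining (both are assumptions)."""
    return buffered_bytes * 8 / rate_bits_per_s


RATE = 10e6  # 10 Mbit/s, the typical incoming rate quoted above

for label, size in (("700 MB", 700e6), ("2 GB", 2e9)):
    minutes = seconds_of_backlog(size, RATE) / 60
    print(f"{label}: about {minutes:.1f} minutes of input")
```

Under those assumptions 700 MB is a little over 9 minutes of input and
2 GB is roughly 27 minutes. Spikes above 10 Mbit/s would fill the same
memory faster, so these are not lower bounds.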

What is the HWM for a TCP PUB/SUB? Can I increase this via socket
options?
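
For reference, a minimal sketch of where such a limit would be set. It
assumes the pyzmq binding on libzmq 3.x or later, where the options are
ZMQ_SNDHWM and ZMQ_RCVHWM (the 2.x series exposes a single ZMQ_HWM
option instead). The function names, endpoints and the hwm value are
hypothetical and not taken from the repeater described above.

```python
import zmq


def make_repeater(ctx, upstream, downstream, hwm=100_000):
    """Create the SUB and PUB sockets of a simple repeater.

    `hwm` is counted in messages per connection, not in bytes, and is
    set before connect/bind so it applies to the connections they create.
    """
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.RCVHWM, hwm)      # limit on queued inbound messages
    sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to everything
    sub.connect(upstream)

    pub = ctx.socket(zmq.PUB)
    pub.setsockopt(zmq.SNDHWM, hwm)      # limit per downstream subscriber
    pub.bind(downstream)
    return sub, pub


def forward_one(sub, pub):
    """Block until one message arrives, then republish it unchanged."""
    pub.send_multipart(sub.recv_multipart())
```

A caller would loop over forward_one(sub, pub). With these options a
PUB socket drops messages for a subscriber whose queue has reached the
limit instead of blocking.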

I'm going to gather memory usage stats from the repeater every 30
seconds and try to understand how much memory I need to properly handle
the input traffic and still have capacity for large spikes.
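
A minimal sketch of that sampling, assuming the repeater runs on Linux
so that /proc/<pid>/status is available. The helper names are
hypothetical and the pid has to be supplied by the caller.

```python
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text,
    or None if the field is missing."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def sample_rss(pid, interval_s=30.0, samples=3, sleep=time.sleep):
    """Collect (timestamp, rss_kb) pairs for `pid`, one every
    `interval_s` seconds. Linux only: reads /proc/<pid>/status."""
    readings = []
    for i in range(samples):
        with open(f"/proc/{pid}/status") as f:
            readings.append((time.time(), parse_vm_rss_kb(f.read())))
        if i < samples - 1:
            sleep(interval_s)
    return readings
```

Logging the readings next to the incoming message rate should show how
far the resident size moves during a spike.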

Thanks,
Tom

-----Original Message-----
From: Martin Sustrik <sustrik at 250bpm.com>
To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
Cc: Tom Wilberding <tom at wilberding.com>
Subject: Re: [zeromq-dev] Logging HWM events?
Date: Wed, 28 Sep 2011 16:30:03 +0200

Hi Tom,

> I have an app using PUB/SUB over TCP. Is there a way to log when HWM is
> reached and we have entered into an exceptional state?

There's no in-built mechanism for that.

Also note that there's a separate HWM for each underlying connection, so
even if you saw an "HWM reached" notification it wouldn't be clear
which peer it applies to.

> We had a problem today where it appears that the PUB/SUB link went quiet
> for almost 30 minutes (it is usually processing hundreds or thousands of
> messages per second). Then we saw a slow trickle of very stale (i.e. 30
> minutes old) messages that lasted for another 30 minutes or so before I
> figured out that there was a problem and restarted the client.

Was the delay caused by 0MQ or the application?

In the former case it's a bug and we should try to solve it.

Martin




