[zeromq-dev] Handling OOM

Paul Colomiets paul at colomiets.name
Wed May 18 01:20:47 CEST 2011


Despite a few attempts by Martin to scare me, I'm working on making zeromq
work in out-of-memory conditions. And I really want to discuss semantics of

First assertion pointed to by Martin is:


This is not an assertion and has two sides. One is when the user calls
zmq_msg_init_size(): if no memory can be allocated, the error is simply
propagated to the user. The other is when it's called by the code handling
the network: on OOM the decoder closes the underlying connection (e.g.
somewhere around lines 76-80 in decoder.cpp). This leads to lots of the
errors described later. This behaviour is very convenient if you haven't set
a max message size and a header with a big message size is received from the
network, but for small messages closing the connection seems overkill, and
it provokes all the memory assertions in the reconnection code. Should that
be changed to something more reasonable? What I'm thinking of is to close
the connection if the message size is bigger than some arbitrary fixed value
(e.g. 64Kb or 1Mb; the actual value doesn't matter, but it should somehow
make up for the cost of reconnecting) when max message size is not
specified. When the connection should not be closed, just wait some time (or
wait until some message is sent, which is probably hard and unreasonable to
do).
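
A rough sketch of the size-threshold policy proposed above. This is only an
illustration, not zeromq code: the names oom_action, on_decoder_oom and
oom_close_threshold are hypothetical, the 64 KB value is just one of the
example values mentioned, and max_msg_size == 0 is assumed here to mean "not
specified". The sketch also assumes that, when a limit is configured,
oversized messages are rejected elsewhere, so the remaining failures wait.

```cpp
#include <cstddef>

// Hypothetical illustration only; none of these names exist in zeromq.
enum class oom_action { close_connection, wait_and_retry };

// Arbitrary threshold, used only when no max message size is configured.
constexpr std::size_t oom_close_threshold = 64 * 1024;

// Decide what the decoder does when allocating msg_size bytes fails.
// Assumption: max_msg_size == 0 means "no limit configured".
inline oom_action on_decoder_oom (std::size_t msg_size,
                                  std::size_t max_msg_size)
{
    // No limit and a large message: reconnecting is cheaper than waiting
    // for that much memory, so drop the connection.
    if (max_msg_size == 0 && msg_size > oom_close_threshold)
        return oom_action::close_connection;

    // Small message (or a limit is configured): keep the connection and
    // try the allocation again after some delay.
    return oom_action::wait_and_retry;
}
```

The point of the threshold is that a small message failing to allocate is
likely a transient condition, while a huge declared size with no limit set
is the case where closing the connection is actually useful.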

Next group of assertions:

    poller_base.cpp:52 (bad alloc exception)

These all seem to be related to the reconnection code (there are plenty of
others I've seen before, these are just the ones repeated today), and they
will be partially fixed if reconnection happens more rarely in OOM
conditions. The complexity of fixing them was described in the previous
email: they are either allocations in constructors, or exceptions in
standard classes. Martin, they really do happen in the test programs, so I
*really* need a hint on how to implement the fixes, before I start fixing
them in the wrong way :)

And the last issue today:


This also has two sides. One is when we are receiving messages: as a first
step I plan to turn the assertion into a disconnect, which will then
probably be fixed in a way similar to the solution of the first issue in
this letter.

The other side of this assertion is when the user sends messages. zmq_send
in this case should return -1, setting errno to ENOMEM. This raises an
issue, as people are used to relying on having no errors while sending the
second, third, etc. message parts. But good code has an assert for this case
anyway, so it will fail with the same assertion a bit later in the user
code, which is a good thing. But if the user wants to continue to use the
socket we have three variants:

1. All message parts sent so far are discarded, so user must start from the
2. All message parts sent so far are kept, so the user can retry sending
each message part in a loop
3. All message parts sent so far are discarded, and further parts continue
to be discarded until a part without ZMQ_SNDMORE is sent

It seems (2) is not very useful (there is a chance that memory will be freed
by the IO code, or by other threads, but it can also lead to a deadlock).
The third is quite strange, but something similar is implemented in zeromq
for the case when a connection is dropped in the middle of a multipart
message. I see a good reason for (1). If in python you use:


It's polite to raise MemoryError when getting ENOMEM, and there is no way to
discover after which message part the error happened. So if (1) is
implemented, you can recover from MemoryError gracefully by always starting
in a clean state (and you usually can recover from a memory error in python
by catching the exception). Discarding the message parts sent so far will
also free some resources. But care must be taken to always discard messages
if ENOMEM is encountered. Also, yqueue is probably not suited to being a
backtracking queue. Anyway, my vote is for (1) if it can be implemented.
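
To make the semantics of (1) concrete, here is a toy model. It is not zeromq
code and says nothing about how yqueue would have to change: toy_sender, its
byte budget (standing in for available memory) and all member names are
hypothetical. It only shows the behaviour visible to the caller: a failed
send returns -1 with errno set to ENOMEM and leaves no partial message
behind.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not zeromq code: buffers the parts of a multipart message
// and drops all of them if buffering a part fails (variant 1).
class toy_sender
{
public:
    // budget: maximum number of bytes that may be buffered at once,
    // used here to simulate running out of memory.
    explicit toy_sender (std::size_t budget) : budget_ (budget), used_ (0) {}

    // Returns 0 on success. On simulated OOM returns -1, sets errno to
    // ENOMEM and discards every part buffered so far.
    int send (const std::string &part, bool more)
    {
        if (used_ + part.size () > budget_) {
            pending_.clear ();
            used_ = 0;
            errno = ENOMEM;
            return -1;
        }
        pending_.push_back (part);
        used_ += part.size ();
        if (!more) {
            // Last part: hand over the complete message and reset.
            delivered_.push_back (pending_);
            pending_.clear ();
            used_ = 0;
        }
        return 0;
    }

    std::size_t pending_parts () const { return pending_.size (); }

    const std::vector<std::vector<std::string> > &delivered () const
    {
        return delivered_;
    }

private:
    std::size_t budget_;
    std::size_t used_;
    std::vector<std::string> pending_;
    std::vector<std::vector<std::string> > delivered_;
};
```

With this behaviour the caller never needs to know which part failed: after
any -1/ENOMEM it simply starts the whole message again from the first part.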

