[zeromq-dev] [PATCH] Fixed OOM handling while writing to a pipe

Martin Lucina mato at kotelna.sk
Fri May 20 18:09:18 CEST 2011


sustrik at 250bpm.com said:
> On 05/20/2011 12:30 PM, Pieter Hintjens wrote:
> 
> > My own experience goes strongly against handling OOM in any way except
> > assertion. We explored this quite exhaustively in OpenAMQ and found
> > that returning errors in case of OOM was very fragile. It is not even
> > clear that an application can deal with such errors sanely, since many
> > system calls will themselves fail if memory is exhausted. We tried
> > hard to make this work, and in the end had to choose for "assert" as
> > the only robust answer.
> >
> > It's particularly important for services because most of the time
> > there is a problem that must be raised and resolved, whether it's the
> > too-low default VM size, or the lack of HWMs on queues, or too-slow
> > subscribers, etc.
> >
> > The only exception to assertion, afaics, is for allocation requests
> > that are clearly unreasonable. And even then, assertion seems the
> > right response if these requests are internal. If they're driven by
> > user data (i.e. someone sending a 4GB message to a service), the
> > correct response is detecting over-sized messages and discarding them
> > (and we have this code in 2.2 and 3.0).
> >
> > tl,dr - +1 for asserting on OOM, -1 for returning ENOMEM.
> 
> +1 for asserts
> 
> Still, some heuristics on handling OOM can be used. Say "if you can't 
> allocate engine for a new connection, close the connection". Assert only 
> if closing the connection fails.

I've no idea which is the better approach here; assertions are generally
the easier way out. Having said that, the system malloc(), for example,
does not assert when it cannot allocate memory; it returns NULL and leaves
the decision to the caller.

I'd suggest that a good guideline would be:

1) If it is possible to clearly return ENOMEM to the calling API, do so.
This applies to user-initiated allocations.

2) If not, e.g. the allocation is internal and has no clear "caller", then
an assertion is probably the best option unless there is a clear recovery
path (e.g. drop the connection).
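
To make the two cases concrete, here is a rough sketch in C. It is not
taken from the 0MQ sources: example_msg_t, example_conn_t and both
functions are hypothetical names invented for illustration, and a
connection is assumed to be nothing more than a file descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical message type; NOT the real zmq_msg_t. */
typedef struct {
    void *data;
    size_t size;
} example_msg_t;

/* Case 1: user-initiated allocation. There is a clear caller,
   so report failure as -1 with errno set to ENOMEM. */
int example_msg_init_size(example_msg_t *msg, size_t size)
{
    msg->data = malloc(size ? size : 1);
    if (msg->data == NULL) {
        msg->size = 0;
        errno = ENOMEM;
        return -1;
    }
    msg->size = size;
    return 0;
}

void example_msg_close(example_msg_t *msg)
{
    free(msg->data);
    msg->data = NULL;
    msg->size = 0;
}

/* Hypothetical connection: a descriptor plus an internally allocated engine. */
typedef struct {
    int fd;
    void *engine;
} example_conn_t;

/* Case 2: internal allocation with no caller to report to, but with a
   clear recovery path. On failure the connection is dropped; the assert
   fires only if even that fails. Returns 0 if the engine was attached,
   -1 if the connection was dropped instead. */
int example_attach_engine(example_conn_t *conn, size_t engine_size)
{
    conn->engine = malloc(engine_size);
    if (conn->engine != NULL)
        return 0;

    int rc = close(conn->fd);
    assert(rc == 0);
    (void) rc;              /* silence unused warning under NDEBUG */
    conn->fd = -1;
    return -1;
}
```

The first function leaves the policy decision to the application; the
second keeps the process alive at the cost of one connection, and falls
back to an assertion only when the recovery path itself fails.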

> 
> The obvious question is whether it's good for anything. Even if we are 
> able to recover from this allocation failure, a next one is likely to 
> happen immediately afterwards. And I am not even mentioning that the 
> process is most likely to be in OOM killer's crosshairs at that point.

Minor point:

That assumes the process is out of memory because the whole system is out
of memory; at that point all bets are off.

However, the OOM condition could also be caused by a resource limit set by
the administrator.
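
As an illustration of that second case, the sketch below lowers the
address-space limit of the current process with setrlimit(RLIMIT_AS),
which is roughly what an administrator does with "ulimit -v". The helper
names are made up for this example, and it assumes a platform such as
Linux where RLIMIT_AS is actually enforced.

```c
#include <stdlib.h>
#include <sys/resource.h>

/* Lower the soft address-space limit of this process to `bytes`
   (clamped to the hard limit). Returns 0 on success, -1 on error. */
int example_limit_address_space(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_AS, &rl);
}

/* Returns 1 if a block of `size` bytes can be allocated, 0 otherwise.
   The block is freed again immediately. */
int example_can_allocate(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        return 0;
    free(p);
    return 1;
}
```

With the limit set to, say, 512 MB, a 1 GB request fails even though the
machine as a whole may have plenty of free memory, and smaller
allocations in the same process can still succeed. In that situation
returning ENOMEM gives the application something it can act on.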

-mato


