[zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?

Luca Boccassi luca.boccassi at gmail.com
Fri Nov 25 11:50:29 CET 2016


On Fri, 2016-11-25 at 10:37 +0100, zmqdev wrote:
> * Background
> 
> I have a service that starts workers on demand with fork+exec.
> The requests arrive over zeromq sockets.
> 
> After the fork, before the exec, I close all file descriptors > 2, 
> keeping only stdin/out/err. I then exec the requested program.
> 
> 
> * Problem
> 
> It works. Except that I get some rare core dumps (of the service) with 
> the following assertion failure:
> 
> 	Bad file descriptor (src/epoll.cpp:90)
> 
> and the backtrace:
> 
>      #0  0xf77f5430 in __kernel_vsyscall ()
>      #1  0xf743f1f7 in raise () from /lib/libc.so.6
>      #2  0xf7440a33 in abort () from /lib/libc.so.6
>      #3  0xf7067134 in zmq::zmq_abort(char const*) () from $LIBS/libzmq.so.5
>      #4  0xf7065e6c in zmq::epoll_t::rm_fd(void*) () from $LIBS/libzmq.so.5
>      #5  0xf7068823 in zmq::io_object_t::rm_fd(void*) () from $LIBS/libzmq.so.5
>      #6  0xf70958af in zmq::stream_engine_t::unplug() () from $LIBS/libzmq.so.5
>      #7  0xf7098711 in zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) () from $LIBS/libzmq.so.5
>      #8  0xf7098867 in zmq::stream_engine_t::timer_event(int) () from $LIBS/libzmq.so.5
>      #9  0xf707f972 in zmq::poller_base_t::execute_timers() () from $LIBS/libzmq.so.5
>      #10 0xf7066209 in zmq::epoll_t::loop() () from $LIBS/libzmq.so.5
>      #11 0xf7066467 in zmq::epoll_t::worker_routine(void*) () from $LIBS/libzmq.so.5
>      #12 0xf709d67e in thread_routine () from $LIBS/libzmq.so.5
>      #13 0xf7619b2c in start_thread () from /lib/libpthread.so.0
>      #14 0xf750808e in clone () from /lib/libc.so.6
> 
> This is with zeromq-4.1.4 on RHEL 7.3 x86_64.
> 
> So I wonder: is there some interaction between parent and child?
> 
> 
> * Documentation
> 
> The Guide and the FAQ do not explicitly address the fork+exec case.
> 
> The question has been asked several times on the mailing list in various 
> forms, without a definitive answer (for dummies like me at least).
> 
> 
> * Questions:
> 
> Do I need to zmq_close the sockets in the child?
> Or is zmq_term in the child enough?
> Does closing the file descriptors in the child cause problems in the parent?
> 
> What is the correct way to handle this?

Hi,

I have not dealt with this case personally, so perhaps other folks who
have can chip in.

What I can say is that we have a unit test for this situation:

https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp

In that test, the child explicitly closes the inherited (TCP) socket
before terminating the context, which is in fact what should happen in
all cases.

The parent then can receive messages on the sockets just fine.

Maybe it's a linger issue? By default a socket has 30s of linger grace
period.

Perhaps try setting ZMQ_LINGER to 0 on the socket in the child, then
close the socket and terminate the context before the exec.

Kind regards,
Luca Boccassi