[zeromq-dev] [bug+workaround] socket_t not closed properly, still used

Thijs Terlouw thijsterlouw at gmail.com
Fri Jan 21 11:23:57 CET 2011


> From: Martin Sustrik <sustrik at 250bpm.com>
> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Date: Thu, 20 Jan 2011 07:55:37 +0100
> Subject: Re: [zeromq-dev] [PATCH] Fix handle connection reset during session init
>
>> I've also tried your suggestion with ZMQ_LINGER and set ZMQ_LINGER on
>> all sockets I have. Unfortunately I still have problems. It seems that
>> after rebuilding the zmq_socket, sometimes the zmq_poll function will not
>> notify me of ZMQ_POLLIN events. The same code worked correctly with
>> 2.0.10. I used tcpdump to check if the data is sent on the network to
>> the 'server' and it appears to be ok. The server application code just
>> doesn't receive the event from ZMQ.... so now continue debugging
>> this....
>
> That definitely looks like a bug.
>
> Martin

After 2 hard days of debugging (and learning a lot about the internals
of ZeroMQ) I believe I finally understand the bug!
First of all, it's indeed a ZeroMQ bug.

How to replicate the bug?
- Create a simple Hello World server + client (REQ / REP); the example
from the manual is fine. Make sure you set LINGER=0 on both sides.
- Let the client loop forever, sending requests to the server.
- On every Xth iteration in the client (for example every 5th), close
the socket ( zmq_close() ) and create a new socket, which will be used
in the next iteration.
- Run lsof -p <PID_CLIENT> to see the sockets the client application
has open.
- You would expect 1 socket at a time, or perhaps 2 very briefly, but
not 2 sockets all the time, which is what I see.


I believe this is what is wrong:
- Between the socket_t and the connect_session_t there is a pair of
pipes for communication.
- The socket has a reader + writer, and the session also has a reader +
writer.
- When we want to shut down the socket, all of its children need to
terminate.
- This means the connect_session needs to terminate, but the pipes need
to terminate as well.
- If the pipes do not terminate, the socket never receives the ack it is
waiting for, which prevents it from terminating fully.
- The shutdown procedure of the pipes is this (I believe):

1. writer::terminate() sends a DELIM msg to the reader and calls
flush() (which activates the reader)
2. reader::read() receives the DELIM and calls terminate(), which in
turn calls send_pipe_term(writer)
3. the process_command for this message is invoked via
process_commands() in socket_base_t
4. normally this would deliver it to the writer, which calls
process_pipe_term(), which in turn does a send_pipe_term_ack() to the
reader
5. normally process_pipe_term_ack() would then be invoked in the
reader, which also gets deleted at that point

Problem: unfortunately, since the socket was destructed,
process_commands() no longer gets a chance to run, because the
socket is not used anymore (it's in zombie state). The pipe_term
command from step 2 is therefore never processed, and steps 3-5 never
happen.
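
To make the stall concrete, here is a minimal toy model of the
handshake described above. It is only a sketch: ToySocket, ToyPipe and
the Command enum are hypothetical stand-ins invented for illustration,
not the real ZeroMQ classes, and the real library passes these commands
between threads rather than through a single queue. The point it shows
is that the ack only arrives if somebody calls process_commands().

```cpp
// Toy model only: all names here are hypothetical, not ZeroMQ's own code.
#include <cassert>
#include <deque>

enum class Command { PipeTerm, PipeTermAck };

struct ToyPipe {
    bool delimiter_sent = false;  // step 1: writer queued the delimiter
    bool writer_closed = false;   // step 4: writer handled pipe_term
    bool reader_deleted = false;  // step 5: reader handled pipe_term_ack
};

struct ToySocket {
    std::deque<Command> mailbox;  // commands wait here until processed
    ToyPipe pipe;

    // Steps 1-2: the writer sends the delimiter, the reader sees it and
    // queues a pipe_term command for the writer.
    void start_shutdown() {
        pipe.delimiter_sent = true;
        mailbox.push_back(Command::PipeTerm);
    }

    // Steps 3-5: nothing here happens unless this function is called.
    void process_commands() {
        while (!mailbox.empty()) {
            Command cmd = mailbox.front();
            mailbox.pop_front();
            if (cmd == Command::PipeTerm) {
                pipe.writer_closed = true;
                mailbox.push_back(Command::PipeTermAck);
            } else {
                pipe.reader_deleted = true;
            }
        }
    }

    bool fully_terminated() const {
        return pipe.writer_closed && pipe.reader_deleted && mailbox.empty();
    }
};

// Usage: the handshake stalls until process_commands() runs.
inline void demo_handshake() {
    ToySocket s;
    s.start_shutdown();
    assert(!s.fully_terminated());  // pipe_term is stuck in the mailbox
    s.process_commands();
    assert(s.fully_terminated());   // ack delivered, pipe is gone
}
```

In this model a zombie socket corresponds to a ToySocket on which
start_shutdown() was called but process_commands() never is.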

My workaround:
1. In the context object, in ctx_t::create_socket(int type_), I noticed
a dezombify() call, but it will not clean up the previous zombie if
its shutdown is still in progress.
2. So I added a while(!zombies.empty()) around it, just like in the
terminate() function.
3. This forces the old socket to be cleaned up when a new socket is
created.

ctx_t::dezombify() calls the socket_base_t::dezombify() method, which
calls process_commands(), so we get a chance to get the messages
delivered.
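
The difference between a single dezombify() pass and the loop can be
sketched with a toy context. Again, this is only an illustration under
assumptions: ToyContext, ToyZombie and the pending_commands counter are
hypothetical names, and each pass is assumed to process exactly one
pending command, which is a simplification of what the real code does.

```cpp
// Toy model only: all names here are hypothetical, not ZeroMQ's own code.
#include <cassert>
#include <list>

struct ToyZombie {
    int pending_commands;  // commands still needed to finish shutdown

    // One pass handles one pending command; returns true once the
    // zombie has nothing left to do and can be removed.
    bool dezombify() {
        if (pending_commands > 0) --pending_commands;
        return pending_commands == 0;
    }
};

struct ToyContext {
    std::list<ToyZombie> zombies;

    // A single pass over all zombies, removing the ones that are done.
    void dezombify() {
        for (auto it = zombies.begin(); it != zombies.end();) {
            if (it->dezombify()) it = zombies.erase(it);
            else ++it;
        }
    }

    // Original behaviour as described above: one pass per new socket.
    void create_socket_original() { dezombify(); }

    // Workaround: keep going until no zombie is left.
    void create_socket_workaround() {
        while (!zombies.empty()) dezombify();
    }
};

// Usage: a zombie needing two passes survives the original call.
inline void demo_workaround() {
    ToyContext a;
    a.zombies.push_back(ToyZombie{2});
    a.create_socket_original();
    assert(a.zombies.size() == 1);  // old socket still around

    ToyContext b;
    b.zombies.push_back(ToyZombie{2});
    b.create_socket_workaround();
    assert(b.zombies.empty());      // old socket cleaned up first
}
```

In the toy every pass makes progress, so the loop always ends. In the
real library the loop presumably keeps the caller busy until the other
side has delivered its commands, which is the price of the workaround.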

Perhaps this workaround is not ideal if you want your old socket to
linger longer, but for me it's important that it gets closed.

The consequence of having two sockets open is that, when you load
balance between several sockets, the zombie socket also seems to get
used. So I suspect the *real solution* will be to make sure the zombie
socket doesn't get used anymore. For me, removing the zombie socket is
a good workaround for now.

I hope that with this detailed description, you will be able to debug it easily.

Thijs


