[zeromq-dev] duped sockets and fork

Selim Ciraci ciraci at gmail.com
Thu Sep 19 08:03:24 CEST 2013


Hi,

Here is some more info on the error:

After forking a chain of child processes (child of child of child, and so
on, whose parents are terminated cleanly using zmq_term), zmq_connect
fails. For instance:
pid 1 forks pid 2.
pid 2 connects to the server and does some work.
pid 2 asks pid 1 to terminate; pid 1 terminates (zmq_term() is called).
pid 2 forks pid 3.
pid 3 connects to the server and does some work.
pid 3 asks pid 2 to terminate; pid 2 terminates.
....
pid 10 forks pid 11.
pid 11 tries to connect to the server; zmq_connect fails with EINVAL.
Further tracing of the error shows that the call to getaddrinfo from
tcp_address::resolve_hostname() fails.

Our code passes tcp://localhost:5555 as the address to connect to (the value
does not change; it is a constant string). The connection works in all
child processes until we reach a certain depth. At that point getaddrinfo
on localhost fails with "no address associated with that name". This is
strange, and I don't know what might cause it. I verified the
parameters passed to getaddrinfo and they all look correct.
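
One way to check the resolver outside ZeroMQ is to call getaddrinfo directly from the affected process and record the exact error it reports. The following is a minimal sketch using Python's standard library rather than the libzmq C API; the helper name try_resolve is hypothetical, and the sketch does not reproduce what tcp_address::resolve_hostname() does internally.

```python
import socket


def try_resolve(host, port):
    """Resolve host:port for TCP and return (addresses, error_text).

    Hypothetical helper for illustration; not part of ZeroMQ.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        # socket.gaierror (an OSError subclass) carries the EAI_* code;
        # a plain OSError here means the failure came from errno instead.
        return [], "getaddrinfo failed: [%s] %s" % (exc.errno, exc.strerror)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4] for info in infos], None


if __name__ == "__main__":
    addresses, error = try_resolve("localhost", 5555)
    print(addresses if error is None else error)
```

Running the same call at each depth of the fork chain would show whether the failure depends on the process rather than on the arguments.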

On a side note, I think the sockets inherited from the parents are not
closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't remember).
Moreover, I see that the server (with the router socket) removes the pipes
associated with dead parent ids when the child-child-child..-child process
terminates successfully (i.e., when it calls zmq_term). For the error in
getaddrinfo, I think the system is running out of fds so an fd operation is
failing. I might be wrong though. Any comments?
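
The fd hypothesis could be tested by comparing the number of open descriptors in a process with its RLIMIT_NOFILE soft limit. This is a rough sketch in Python's standard library, assuming a platform that exposes /proc/self/fd or /dev/fd; the names count_open_fds and fd_headroom are hypothetical.

```python
import os
import resource


def count_open_fds():
    """Count this process's open descriptors via its fd directory."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            # Listing the directory briefly opens a descriptor of its own,
            # so the result is typically one above the steady-state count.
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    raise RuntimeError("could not list an fd directory on this platform")


def fd_headroom():
    """Return (open_fds, soft_limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return count_open_fds(), soft_limit


if __name__ == "__main__":
    open_fds, soft_limit = fd_headroom()
    print("open fds: %d, soft limit: %d" % (open_fds, soft_limit))
```

If the open count approaches the soft limit as the chain gets deeper, that would be consistent with descriptors being inherited and never closed.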

Any help is greatly appreciated! The code I'm using is around 250,000 lines
of code, so it is a bit hard to get a test case. But I'm working on it.
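
A reduced test case does not need ZeroMQ to show how descriptors can pile up along a fork chain. The sketch below uses Python's standard library; each generation opens one plain TCP socket as a stand-in for a connection, forks the next generation, and exits, and the deepest process reports how many descriptors it holds. The function names and the leak switch are hypothetical, and this models only the inheritance effect, not ZeroMQ's own behaviour.

```python
import os
import socket


def count_open_fds():
    """Count open descriptors of the calling process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    raise RuntimeError("could not list an fd directory on this platform")


def deepest_fd_count(depth, leak):
    """Fork a chain `depth` generations deep; return the last one's fd count.

    With leak=True every generation leaves its socket open before forking,
    so the descendants inherit it. With leak=False it is closed first.
    """
    read_end, write_end = os.pipe()
    first = os.fork()
    if first == 0:
        try:
            os.close(read_end)
            held = []  # keeps leaked sockets referenced so they stay open
            for _ in range(depth):
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                if leak:
                    held.append(sock)
                else:
                    sock.close()
                if os.fork() != 0:
                    os._exit(0)  # this generation ends; its child carries on
            os.write(write_end, str(count_open_fds()).encode())
        finally:
            os._exit(0)
    os.close(write_end)
    os.waitpid(first, 0)
    report = os.read(read_end, 64)
    os.close(read_end)
    if not report:
        raise RuntimeError("the fork chain ended without reporting a count")
    return int(report)
```

Comparing deepest_fd_count(n, leak=True) with deepest_fd_count(n, leak=False) for growing n shows one extra descriptor per generation when nothing is closed before the fork.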

Best,
Selim Ciraci



On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly <matt.connolly at me.com> wrote:

> There are two types of sockets used by zeromq as far as I understand:
> external connections and internal pipes used to communicate between the io
> threads and the host application.
>
> My patch for zmq_term replaces all of the internal pipes with new ones. This
> allows the termination process to complete without affecting the pipes that
> were inherited from the parent process, which caused asserts in the parent.
>
> Returning EINTR was intended so that terminating the context would behave
> the same as if the process received a signal. (It could be receiving
> signals for other reasons, e.g. a usr signal.)
>
> If there are connected zmq sockets (to some other machine for example)
> then those sockets would also be inherited but I thought they would have
> been closed correctly by the termination process. This may not be working
> right and activity on these sockets between fork and terminate in the child
> may interfere with the parent context's ability to use these sockets.
> Perhaps these sockets are not actually being closed properly and are causing
> this problem.
>
> I'll take a closer look later in the week and see...
>
>
> Regards,
> Matt.
>
> On 17 Sep 2013, at 8:22 am, Selim Ciraci <ciraci at gmail.com> wrote:
>
> Hi Matt,
>
> Another thing, sorry if I'm wrong: zmq_term in the child always
> returns EINTR. This is because most of the socket operations return EINTR
> when pid != getpid(). With your patch the signaler will create a new eventfd
> (correct me if I'm wrong) and then return. It is up to the reaper thread to
> close the sockets, right? But since most operations just return EINTR, I
> wonder if the sockets are really closed after the fork.
>
> Best,
> Selim Ciraci
>
>
> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci <ciraci at gmail.com> wrote:
>
>> Hi Matt,
>>
>> It is not an assertion failure. The problem occurs in connections between
>> router-dealer sockets. The send function in router.cpp returns no route to
>> host because it cannot find the host_id in the outpipes_t. Careful debugging
>> shows that the pipe from the dealer to the router has not actually been
>> established. I put a printf in the xidentify_peer method in router.cpp; as
>> far as I know, the new client ids are inserted into the outpipes_t in this
>> method. The aim is to compare the child process ids with the ids the router
>> socket received. The comparison showed that some child ids went
>> missing (the router socket never received them). I must add that the ids went
>> missing after a parent process terminated, though I need further testing to
>> prove this.
>>
>> Any ideas what might be going wrong here? I'm going to try to implement a
>> simple test case.
>>
>> Thanks,
>> Selim
>>
>>
>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly <matt.connolly at me.com> wrote:
>>
>>> Hi Selim,
>>>
>>> I don’t have any ideas yet about why the parent would stop sending
>>> messages after forking a second child.
>>>
>>> Is it possible to reproduce this in a simple test case?
>>>
>>> And when the no route to host error occurs, is that an assertion? If so,
>>> can you provide a stack trace?
>>>
>>> -Matt
>>>
>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci <ciraci at gmail.com> wrote:
>>>
>>> > Hi Matt,
>>> >
>>> > Thanks for your reply. I actually found out about your patch after
>>> > sending the email. I have updated zmq to head from github and tried it
>>> > with my program. The parent sockets seem to have been closed. But the
>>> > problem is that every now and then I get "no route to host" errors in
>>> > zmq_send. This usually happens when:
>>> > parent forks a child; child calls zmq_term(parent_context), does work,
>>> > and then terminates (closes its context).
>>> > parent in parallel uses parent_context, does work, learns the child
>>> > has terminated, and forks a new child, child2.
>>> > child2 calls zmq_term(parent_context), does work, and then terminates
>>> > (closes its context).
>>> > after child2 terminates, parent cannot receive messages. Even though
>>> > the parent is active, zmq_send in the server fails with no route to host.
>>> >
>>> > I have no idea why this fails. Any ideas what might be causing this?
>>> >
>>> > Best,
>>> > Selim Ciraci
>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>
>>