[zeromq-dev] duped sockets and fork

Selim Ciraci ciraci at gmail.com
Wed Oct 2 10:46:17 CEST 2013


Hi,

The only solution we could find to the leaking-sockets problem is to
destroy the parent context before fork and re-initialize it in the parent
after fork. Sometimes the re-initialization fails in the parent: somehow
the router-dealer connections are not established. We are looking at this
problem now.
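
A minimal sketch of the destroy-before-fork, re-create-after-fork ordering described above. It deliberately does not use libzmq: `demo_ctx_t`, the `demo_ctx_*` helpers and `fork_without_ctx` are hypothetical stand-ins, with a socketpair playing the role of the descriptors a context would own. It only illustrates that a child has nothing to inherit once the parent has closed everything before fork().

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a messaging context: it only owns two fds. */
typedef struct { int fds[2]; } demo_ctx_t;

/* Create the stand-in context. Returns 0 on success, -1 on error. */
int demo_ctx_open(demo_ctx_t *c) {
    return socketpair(AF_UNIX, SOCK_STREAM, 0, c->fds);
}

/* Close every descriptor the stand-in context owns. */
void demo_ctx_close(demo_ctx_t *c) {
    close(c->fds[0]);
    close(c->fds[1]);
    c->fds[0] = c->fds[1] = -1;
}

/* Returns 1 if fd is an open descriptor in the calling process, else 0. */
int fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Tear the context down, fork, then rebuild the context in the parent.
 * Returns how many of the old context fds the child could still see
 * (0 is the expected result), or -1 on error.
 */
int fork_without_ctx(demo_ctx_t *c) {
    int old0 = c->fds[0];
    int old1 = c->fds[1];

    demo_ctx_close(c);               /* nothing left to inherit */

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: report whether the parent's old descriptors leaked. */
        _exit(fd_is_open(old0) + fd_is_open(old1));
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (demo_ctx_open(c) != 0) return -1;   /* re-initialize in the parent */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `fork_without_ctx()` on an open `demo_ctx_t` should return 0 and leave the parent with a fresh, usable context. With a real context the rebuild step is where the failure reported above would show up, so its return value is worth checking.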

Best,
Selim


On Wed, Sep 18, 2013 at 11:03 PM, Selim Ciraci <ciraci at gmail.com> wrote:

> Hi,
>
> Here is some more info on the error:
>
> After forking a child-child-child...child process (whose parents are
> terminated cleanly using zmq_term), zmq_connect fails. For instance:
> pid 1 forks pid 2.
> pid 2 connects to the server and does some work.
> pid 2 asks pid 1 to terminate; pid 1 terminates (zmq_term() is called).
> pid 2 forks pid 3.
> pid 3 connects to the server and does some work.
> pid 3 asks pid 2 to terminate; pid 2 terminates.
> ....
> pid 10 forks pid 11.
> pid 11 tries to connect to the server; zmq_connect fails with EINVAL.
> Further tracing of the error shows that the call to getaddrinfo from
> tcp_address::resolve_hostname() fails.
>
> Our code passes tcp://localhost:5555 as the address to connect (the value
> does not change, it is a constant string). The connection works on all
> child processes, until we reach a certain depth. At that point getaddrinfo
> on localhost fails with "no address associated with that name". This is
> kind of weird. I don't know what might cause this. In fact, I verified the
> parameters passed to getaddrinfo and all seems ok.
>
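One way to narrow this down is to call getaddrinfo directly with the same host and port and print the exact error. The sketch below is an illustration only; `check_resolve` is a hypothetical helper, not part of libzmq. When getaddrinfo returns EAI_SYSTEM the real cause is in errno, which is where a descriptor shortage (EMFILE) could show up. Whether the resolver reports it that way is an assumption and may vary between C libraries.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Resolve host:port the way a TCP client would and report any failure.
 * Returns 0 on success, otherwise the getaddrinfo() error code.
 */
int check_resolve(const char *host, const char *port) {
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    int rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0) {
        if (rc == EAI_SYSTEM) {
            /* The underlying cause is in errno (e.g. EMFILE). */
            perror("getaddrinfo (EAI_SYSTEM)");
        } else {
            fprintf(stderr, "getaddrinfo(%s, %s): %s\n",
                    host, port, gai_strerror(rc));
        }
        return rc;
    }

    freeaddrinfo(res);
    return 0;
}
```

Calling `check_resolve("localhost", "5555")` in the deepest child, just before zmq_connect, would show whether the failure reproduces outside the library.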
> On a side note, I think the sockets inherited from the parents are not
> closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't remember).
> Moreover, I see that the server (with the router socket) removes the pipes
> associated with dead parent ids when the child-child-child..-child process
> terminates successfully (i.e., when it calls zmq_term). For the error in
> getaddrinfo, I think the system is running out of fds so an fd operation is
> failing. I might be wrong though. Any comments?
>
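The fd-exhaustion theory can be checked by counting the entries in /proc/<pid>/fd at each fork depth and comparing the count with the per-process limit. A minimal, Linux-only sketch follows; `count_open_fds` and `fd_soft_limit` are hypothetical helper names.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the open file descriptors of process `pid` by listing
 * /proc/<pid>/fd (Linux-specific). The descriptor used for the listing
 * itself is not counted when inspecting the calling process.
 * Returns -1 if the directory cannot be read.
 */
int count_open_fds(pid_t pid) {
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/fd", (long)pid);

    DIR *dir = opendir(path);
    if (dir == NULL) return -1;

    int self_fd = (pid == getpid()) ? dirfd(dir) : -1;
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.') continue;          /* "." and ".." */
        if (atoi(entry->d_name) == self_fd) continue;   /* our own listing fd */
        count++;
    }
    closedir(dir);
    return count;
}

/* Soft limit on open descriptors for this process, or -1 on error. */
long fd_soft_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return -1;
    return (long)rl.rlim_cur;
}
```

Logging `count_open_fds(getpid())` right after each fork would show whether the count grows with depth and approaches `fd_soft_limit()` by the time getaddrinfo starts failing.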
> Any help is greatly appreciated! The code I'm using is around 250000 lines
> of code, so it is a bit hard to get a test case. But I'm working on it.
>
> Best,
> Selim Ciraci
>
>
> On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly <matt.connolly at me.com> wrote:
>
>> There are two types of sockets used by zeromq as far as I understand:
>> external connections and internal pipes used to communicate between the io
>> threads and the host application.
>>
>> My patch for zmq_term replaces all of the internal pipes with new ones.
>> This allows the termination process to complete without affecting the pipes
>> that were inherited from the parent process, which caused asserts in the
>> parent.
>>
>> Returning EINTR was intended so that terminating the context would behave
>> the same as if the process received a signal. (It could be receiving
>> signals for other reasons, e.g. a USR signal.)
>>
>> If there are connected zmq sockets (to some other machine for example)
>> then those sockets would also be inherited but I thought they would have
>> been closed correctly by the termination process. This may not be working
>> right and activity on these sockets between fork and terminate in the child
>> may interfere with the parent context's ability to use these sockets.
>> Perhaps these sockets are not actually being closed properly and causing
>> this problem.
>>
>> I'll take a closer look later in the week and see...
>>
>>
>> Regards,
>> Matt.
>>
>> On 17 Sep 2013, at 8:22 am, Selim Ciraci <ciraci at gmail.com> wrote:
>>
>> Hi Matt,
>>
>> Another thing, sorry if I'm wrong: zmq_term in the child always
>> returns EINTR. This is because most of the socket operations return EINTR
>> when pid != getpid(). With your patch the signaler will create a new eventfd
>> (correct me if I'm wrong) and then return. It is up to the reaper thread to
>> close the sockets, right? But since most operations just return EINTR, I
>> wonder if the sockets are really closed after the fork.
>>
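The pid != getpid() check mentioned above can be illustrated with a small fork guard. This is a sketch of the general idea only, not libzmq's actual code; `guarded_t`, `guarded_init`, `guarded_op` and `child_sees_eintr` are hypothetical names.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle that remembers which process created it. */
typedef struct { pid_t owner; } guarded_t;

/* Record the creating process as the owner of the handle. */
void guarded_init(guarded_t *g) {
    g->owner = getpid();
}

/*
 * Hypothetical operation that refuses to run in a forked child.
 * Returns 0 in the owning process, -1 with errno = EINTR elsewhere.
 */
int guarded_op(const guarded_t *g) {
    if (g->owner != getpid()) {
        errno = EINTR;
        return -1;
    }
    return 0;
}

/*
 * Fork and report what guarded_op() did in the child:
 * 1 if it failed with EINTR, 0 if it did not, -1 on fork/wait error.
 */
int child_sees_eintr(const guarded_t *g) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        int failed = (guarded_op(g) == -1 && errno == EINTR);
        _exit(failed ? 1 : 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

With a guard like this, every operation on an inherited handle fails early in the child. That is why it matters whether the descriptors behind the handle are closed on some other path.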
>> Best,
>> Selim Ciraci
>>
>>
>> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci <ciraci at gmail.com> wrote:
>>
>>> Hi Matt,
>>>
>>> It is not an assertion failure. The problem occurs in connections between
>>> router-dealer sockets. The send function in router.cpp returns no route to
>>> host because it cannot find the host_id in the outpipes_t. Careful debugging
>>> shows that the pipe from the dealer to the router has actually not been
>>> established. I put a printf in the xidentify_peer method in router.cpp; as
>>> far as I know, new client ids are inserted into the outpipes_t in this
>>> method. The aim was to compare the child process ids with the ids the
>>> router socket received. The comparison showed that some child ids went
>>> missing (the router socket never received them). I must add that the ids
>>> went missing after a parent process terminated, though I need further
>>> testing to prove this.
>>>
>>> Any ideas what might be going wrong here? I'm going to try to implement
>>> a simple test case.
>>>
>>> Thanks,
>>> Selim
>>>
>>>
>>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly <matt.connolly at me.com> wrote:
>>>
>>>> Hi Selim,
>>>>
>>>> I don’t have any ideas yet about why the parent would stop sending
>>>> messages after forking a second child.
>>>>
>>>> Is it possible to reproduce this in a simple test case?
>>>>
>>>> And when the no route to host error occurs, is that an assertion? If
>>>> so, can you provide a stack trace?
>>>>
>>>> -Matt
>>>>
>>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci <ciraci at gmail.com> wrote:
>>>>
>>>> > Hi Matt,
>>>> >
>>>> > Thanks for your reply. I actually found out about your patch after
>>>> > the email. I have updated zmq to head from github and tried it with
>>>> > my program. The parent sockets seem to have closed. But the problem
>>>> > is that every now and then I get "no route to host" errors in
>>>> > zmq_send. This usually happens when:
>>>> > The parent forks a child; the child calls zmq_term(parent_context),
>>>> > does work and then terminates (closes its context).
>>>> > The parent, in parallel, uses parent_context, does work, learns that
>>>> > the child has terminated, and forks a new child, child2.
>>>> > child2 calls zmq_term(parent_context), does work and then terminates
>>>> > (closes its context).
>>>> > After child2 terminates, the parent cannot receive messages. Even
>>>> > though the parent is active, zmq_send in the server fails with no
>>>> > route to host.
>>>> >
>>>> > I have no idea why this fails. Any ideas what might be causing this?
>>>> >
>>>> > Best,
>>>> > Selim Ciraci
>>>>
>>>> _______________________________________________
>>>> zeromq-dev mailing list
>>>> zeromq-dev at lists.zeromq.org
>>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>>
>>>
>>>
>