[zeromq-dev] duped sockets and fork
Erik Aronesty
erik at q32.com
Mon Dec 16 17:18:21 CET 2013
This works before forking: close all sockets directly; don't use the ZMQ
close in the parent or the children:
https://zeromq.jira.com/browse/LIBZMQ-441
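
As a rough illustration of closing an inherited descriptor directly with
close(2) instead of going through the ZMQ API, here is a minimal POSIX C
sketch. The helper name close_in_child_only is hypothetical and not part of
libzmq. It only demonstrates that close(2) in the child leaves the parent's
copy of the descriptor open; it does not establish that this is safe for
descriptors owned by a live ZMQ context.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of libzmq): fork, close `fd` in the child
 * with plain close(2), and report whether the parent's copy survived.
 * Returns 1 if the child closed its copy and the parent's copy is still
 * open, 0 otherwise, and -1 if fork or waitpid failed. */
int close_in_child_only(int fd)
{
    assert(fd >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: drop the inherited descriptor directly. */
        _exit(close(fd) == 0 ? 0 : 1);
    }

    /* Parent: wait for the child, then check our own descriptor. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    int child_closed = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_open = fcntl(fd, F_GETFD) != -1;
    return child_closed && parent_open;
}
```

Calling close_in_child_only() on one end of a pipe created in the parent
should return 1, since each process has its own descriptor table after fork.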
On Wed, Oct 2, 2013 at 4:46 AM, Selim Ciraci <ciraci at gmail.com> wrote:
> Hi,
>
> The only solution we could find to the leaking sockets problem is to
> destroy the parent context before fork and then re-initialize it
> after fork. Sometimes the context initialization fails in the
> parent: somehow the router-dealer connections are not established. We are
> looking at this problem now.
>
> Best,
> Selim
>
>
> On Wed, Sep 18, 2013 at 11:03 PM, Selim Ciraci <ciraci at gmail.com> wrote:
>
>> Hi,
>>
>> Here is some more info on the error:
>>
>> After forking a child-child-child...child process (whose parents are
>> terminated cleanly using zmq_term), zmq_connect fails. For instance:
>> pid 1 forks pid 2.
>> pid 2 connects to the server and does some work.
>> pid 2 asks pid 1 to terminate; pid 1 terminates (zmq_term() is called).
>> pid 2 forks pid 3.
>> pid 3 connects to the server and does some work.
>> pid 3 asks pid 2 to terminate; pid 2 terminates.
>> ....
>> pid 10 forks pid 11.
>> pid 11 tries to connect to the server; zmq_connect fails with EINVAL.
>> Further tracing of the error shows that the call to getaddrinfo from
>> tcp_address::resolve_hostname() fails.
>>
>> Our code passes tcp://localhost:5555 as the address to connect (the value
>> does not change, it is a constant string). The connection works on all
>> child processes, until we reach a certain depth. At that point getaddrinfo
>> on localhost fails with "no address associated with that name". This is
>> kind of weird. I don't know what might cause this. In fact, I verified the
>> parameters passed to getaddrinfo and all seems ok.
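
One way to separate the resolver from ZMQ is to call getaddrinfo on the same
host and port directly in each generation of child process. The sketch below
is an assumption about how such a check could look; resolve_tcp is a
hypothetical helper, and the hints it passes are not necessarily the ones
libzmq uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: try to resolve host/port as a TCP endpoint.
 * Returns 0 on success, or the non-zero getaddrinfo error code, which
 * can be turned into text with gai_strerror(). */
int resolve_tcp(const char *host, const char *port)
{
    assert(host != NULL && port != NULL);

    struct addrinfo hints;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    struct addrinfo *result = NULL;
    int rc = getaddrinfo(host, port, &hints, &result);
    if (rc == 0)
        freeaddrinfo(result);
    return rc;
}
```

Logging gai_strerror(resolve_tcp("localhost", "5555")) right after each fork
would show whether resolution starts failing at a particular depth even when
no ZMQ call is involved.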
>>
>> On a side note, I think the sockets inherited from the parents are not
>> closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't remember).
>> Moreover, I see that the server (with the router socket) removes the pipes
>> associated with dead parent ids when the child-child-child..-child process
>> terminates successfully (i.e., when it calls zmq_term). For the error in
>> getaddrinfo, I think the system is running out of fds so an fd operation is
>> failing. I might be wrong though. Any comments?
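
The fd-exhaustion guess can be checked from inside the process by counting
the entries in /proc/self/fd at each fork depth. This is a Linux-only sketch;
count_open_fds is a hypothetical helper, and it assumes /proc is mounted.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Hypothetical helper (Linux only): count the descriptors currently open
 * in this process by listing /proc/self/fd.
 * Returns the count, or -1 if the directory cannot be opened. */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip "." and ".."; every other entry is a descriptor number. */
        if (entry->d_name[0] == '.')
            continue;
        count++;
    }
    closedir(dir);

    /* The directory stream itself held one descriptor while we counted. */
    assert(count >= 1);
    return count - 1;
}
```

If the count grows with every generation of child and approaches the
RLIMIT_NOFILE soft limit, that would support the idea that inherited
descriptors are never closed.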
>>
>> Any help is greatly appreciated! The code I'm using is around 250,000 lines
>> of code, so it is a bit hard to get a test case. But I'm working on it.
>>
>> Best,
>> Selim Ciraci
>>
>>
>> On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly <matt.connolly at me.com> wrote:
>>
>>> There are two types of sockets used by zeromq as far as I understand:
>>> external connections and internal pipes used to communicate between the io
>>> threads and the host application.
>>>
>>> My patch for zmq_term replaces all of the internal pipes with new ones.
>>> This allows the termination process to complete without affecting the pipes
>>> that were inherited from the parent process, which caused asserts in the
>>> parent.
>>>
>>> Returning EINTR was intended so that terminating the context would
>>> behave the same as if the process received a signal. (It could be receiving
>>> signals for other reasons, e.g. a user signal.)
>>>
>>> If there are connected zmq sockets (to some other machine for example)
>>> then those sockets would also be inherited but I thought they would have
>>> been closed correctly by the termination process. This may not be working
>>> right and activity on these sockets between fork and terminate in the child
>>> may interfere with the parent context's ability to use these sockets.
>>> Perhaps these sockets are not actually being closed properly and causing
>>> this problem.
>>>
>>> I'll take a closer look later in the week and see...
>>>
>>>
>>> Regards,
>>> Matt.
>>>
>>> On 17 Sep 2013, at 8:22 am, Selim Ciraci <ciraci at gmail.com> wrote:
>>>
>>> Hi Matt,
>>>
>>> Another thing, and sorry if I'm wrong: zmq_term in the child always
>>> returns EINTR. This is because most of the socket operations return EINTR
>>> when pid != getpid(). With your patch the signaler will create a new eventfd
>>> (correct me if I'm wrong) and then return. It is up to the reaper thread to
>>> close the sockets, right? But since most operations just return EINTR, I
>>> wonder if the sockets are really closed after the fork.
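
The pid != getpid() check described above can be sketched in isolation. The
code below is not libzmq's implementation; guarded_handle and its functions
are hypothetical names used only to show the pattern of recording the
creating pid and refusing operations with EINTR after a fork.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle that remembers which process created it. */
typedef struct {
    pid_t owner_pid;
} guarded_handle;

void guarded_init(guarded_handle *h)
{
    assert(h != NULL);
    h->owner_pid = getpid();
}

/* Returns 0 in the creating process; in a forked child it does no work,
 * sets errno to EINTR and returns -1. */
int guarded_op(const guarded_handle *h)
{
    assert(h != NULL);
    if (h->owner_pid != getpid()) {
        errno = EINTR;
        return -1;
    }
    return 0;
}

/* Forks and reports whether the child saw the EINTR refusal.
 * Returns 1 if it did, 0 if it did not, -1 if fork or waitpid failed. */
int guarded_op_refused_in_child(const guarded_handle *h)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        int refused = guarded_op(h) == -1 && errno == EINTR;
        _exit(refused ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With a guard like this, any cleanup that is itself routed through guarded
operations never runs in the child, which is the concern raised about
whether the inherited sockets really get closed.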
>>>
>>> Best,
>>> Selim Ciraci
>>>
>>>
>>> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci <ciraci at gmail.com> wrote:
>>>
>>>> Hi Matt,
>>>>
>>>> It is not an assertion failure. The problem occurs in connections between
>>>> router-dealer sockets. The send function in router.cpp returns no route to
>>>> host because it cannot find the host_id in the outpipes_t. Careful debugging
>>>> shows that the pipe from the dealer to the router has not actually been
>>>> established. I put a printf in the xidentify_peer method in router.cpp; new
>>>> client ids are inserted into the outpipes_t in this method, as far as I know.
>>>> The aim is to compare the child process ids with the ids the router
>>>> socket received. The comparison showed that some child ids went
>>>> missing (the router socket never received them). I must add that the ids
>>>> went missing after a parent process terminated, though I need further
>>>> testing to prove this.
>>>>
>>>> Any ideas what might be going wrong here? I'm going to try to implement
>>>> a simple test case.
>>>>
>>>> Thanks,
>>>> Selim
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly <matt.connolly at me.com> wrote:
>>>>
>>>>> Hi Selim,
>>>>>
>>>>> I don’t have any ideas yet about why the parent would stop sending
>>>>> messages after forking a second child.
>>>>>
>>>>> Is it possible to reproduce this in a simple test case?
>>>>>
>>>>> And when the no route to host error occurs, is that an assertion? If
>>>>> so, can you provide a stack trace?
>>>>>
>>>>> -Matt
>>>>>
>>>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci <ciraci at gmail.com> wrote:
>>>>>
>>>>> > Hi Matt,
>>>>> >
>>>>> > Thanks for your reply. I actually found out about your patch
>>>>> > after the email. I have updated zmq to head from github and tried it
>>>>> > with my program. The parent sockets seem to have closed. But the
>>>>> > problem is that every now and then I get "no route to host" errors in
>>>>> > zmq_send. This usually happens when:
>>>>> > parent forks a child; child calls zmq_term(parent_context), does work,
>>>>> >   and then terminates (closes its context).
>>>>> > parent in parallel uses parent_context, does work, learns the child
>>>>> >   has terminated, and forks a new child, child2.
>>>>> > child2 calls zmq_term(parent_context), does work, and then terminates
>>>>> >   (closes its context).
>>>>> > after child2 terminates, the parent cannot receive messages. Even though
>>>>> >   the parent is active, zmq_send in the server fails with no route to host.
>>>>> >
>>>>> > I have no idea why this fails. Any ideas what might be causing this?
>>>>> >
>>>>> > Best,
>>>>> > Selim Ciraci
>>>>>
>>>>> _______________________________________________
>>>>> zeromq-dev mailing list
>>>>> zeromq-dev at lists.zeromq.org
>>>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>>>
>>>>
>>>>