[zeromq-dev] Handling of fd-limit in Java

Johan Ström johan at stromnet.se
Sat Nov 12 11:34:05 CET 2011


Hi Martin, 

On Nov 12, 2011, at 8:40 AM, Martin Sustrik wrote:

> Hi Johan,
> 
>> Now, going out of FDs is expected, but the problem is, whenever I hit
>> the FD limit, I hit an assert:
>> 
>> [junit] Too many open files
>> [junit] rc == 0 (signaler.cpp:330)
>> 
>> And an assert like this yields a total termination of the JVM it
>> seems. Not so nice in an app server context for example.
> 
> What else would you like it to do? Note that the code that fails to open
> a socket is async, i.e. it runs in the background while your application is
> doing some other work.

Ah, of course... that makes it a bit more problematic indeed. I'm not really sure what a good solution would be without making the library much harder to use. I'm not too sure how the internals work, so I'm afraid I cannot be of that much help. Maybe mark the socket as failed and have later calls (connect, bind, recv/send etc.) fail with an error? Maybe not that easy to facilitate.
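
To illustrate the idea (purely hypothetical, not actual libzmq code, and all names here are made up): a resource failure in the background would only set a flag, and the error would surface as a normal return code on the user's next call, e.g.:

// Hypothetical illustration of the suggestion above, not libzmq code.
// A background resource failure is recorded, and reported as a normal
// error return on the user's next call (synchronisation omitted for brevity).
#include <cerrno>

class hypothetical_socket
{
public:
    hypothetical_socket () : failed (false) {}

    //  Called from the I/O thread when e.g. socketpair ()/socket () fails.
    void mark_failed () { failed = true; }

    //  User-facing calls check the flag instead of the library asserting.
    int connect (const char *endpoint)
    {
        if (failed) {
            errno = EMFILE;   //  "Too many open files"
            return -1;
        }
        //  ... normal connect logic would go here ...
        return 0;
    }

private:
    bool failed;
};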

However, I checked the source a bit... From what I can read, a call to zmq_socket would lead straight down to signaler.cpp:330, unless I'm missing something here:

zmq_socket calls ctx->create_socket
create_socket calls socket_base_t::create
create calls (for example) req_t constructor
req_t ultimately inherits from socket_base_t, which has the member 'mailbox_t mailbox'
mailbox_t has the member 'signaler_t signaler'.

Wouldn't that chain automatically call the constructor on all these members, down to signaler_t?
And in signaler_t's constructor, make_fdpair would be called, which brings us to signaler.cpp:330, where the socketpair call fails.
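
To make that concrete, here is a much-simplified sketch of how I read that member chain (not the actual libzmq source, just the structure reduced to a minimum):

// Much-simplified sketch of the constructor chain as I read it;
// not the actual libzmq source.
#include <cassert>
#include <sys/socket.h>

struct signaler_t
{
    signaler_t ()
    {
        int sv [2];
        //  Roughly what make_fdpair does on most Unix systems; when the
        //  process is out of file descriptors this fails, and the assert
        //  fires synchronously inside the zmq_socket call chain.
        int rc = socketpair (AF_UNIX, SOCK_STREAM, 0, sv);
        assert (rc == 0);   //  corresponds to signaler.cpp:330
    }
};

struct mailbox_t
{
    signaler_t signaler;    //  constructed as part of mailbox_t
};

struct socket_base_t
{
    mailbox_t mailbox;      //  constructed as part of socket_base_t
};

struct req_t : socket_base_t
{
    //  constructing req_t constructs socket_base_t, which constructs
    //  mailbox_t, which constructs signaler_t, which calls socketpair ()
};

//  So zmq_socket -> ctx->create_socket -> new req_t runs the whole chain
//  above in the calling thread, before any background work starts.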

Although even if that succeeds, the actual transport socket could of course still fail later in the background thread, and then we're back at square one. But (if my understanding of the code is correct) at least the crash above could be mitigated. Maybe not worth a change though, since the out-of-FD scenario would probably be just as likely to happen later in the background anyway.


> 
>> Any thoughts on alternative ways to handle this kind of situation,
>> such as returning null, letting the caller decide whether to assert or
>> otherwise handle it gracefully?
> 
> Same as above. The problem happens in an async manner. No call to libzmq
> may be happening at the time, so there's nothing to return null from.
> 
>> Also, I've seen some related weird crashes in the same tests: my app
>> opens a REQ socket, connects to the ROUTER, waits for a reply, and then
>> closes it. If I put a heavy load on these sockets, I eventually crash
>> with the following error:
>> 
>> Device or resource busy (mutex.hpp:91)
> 
> Please, do make sure that you are not using the same socket from two threads in parallel. If you are not, the above is definitely a 0MQ bug.

The socket is created and used in a single thread, using inproc to connect to the ROUTER in the other thread.
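
For reference, the pattern is roughly the following, sketched against the 2.x C API; my real test does the equivalent through jzmq, and the endpoint name and loop count here are made up:

#include <zmq.h>
#include <cassert>
#include <cstring>
#include <pthread.h>

static void *router_echo (void *router)
{
    //  Echo every multipart message back to its sender.
    //  ROUTER receives [identity][empty][body] from a REQ peer.
    for (;;) {
        int64_t more = 1;
        size_t more_size = sizeof (more);
        while (more) {
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            if (zmq_recv (router, &msg, 0) != 0) {  //  ETERM: shutting down
                zmq_msg_close (&msg);
                zmq_close (router);
                return NULL;
            }
            zmq_getsockopt (router, ZMQ_RCVMORE, &more, &more_size);
            zmq_send (router, &msg, more ? ZMQ_SNDMORE : 0);
            zmq_msg_close (&msg);
        }
    }
}

int main ()
{
    void *ctx = zmq_init (1);

    //  Bind the ROUTER before the REQ side starts connecting, then hand
    //  the socket over to its own thread (only that thread uses it).
    void *router = zmq_socket (ctx, ZMQ_ROUTER);
    assert (router);
    int rc = zmq_bind (router, "inproc://fd-test");
    assert (rc == 0);
    pthread_t server;
    pthread_create (&server, NULL, router_echo, router);

    //  Heavy churn: create, connect, request/reply, close, repeatedly.
    for (int i = 0; i != 100000; i++) {
        void *req = zmq_socket (ctx, ZMQ_REQ);
        assert (req);                       //  out-of-FDs would show up here
        rc = zmq_connect (req, "inproc://fd-test");
        assert (rc == 0);

        zmq_msg_t request;
        zmq_msg_init_size (&request, 4);
        memcpy (zmq_msg_data (&request), "ping", 4);
        zmq_send (req, &request, 0);
        zmq_msg_close (&request);

        zmq_msg_t reply;
        zmq_msg_init (&reply);
        zmq_recv (req, &reply, 0);
        zmq_msg_close (&reply);

        zmq_close (req);
    }

    zmq_term (ctx);                         //  unblocks the ROUTER thread
    pthread_join (server, NULL);
    return 0;
}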

> 
>> From the code it looks like it's pthread_mutex_destroy which fails.
>> I'll probably go for keeping the socket in a ThreadLocal for now,
>> avoiding the re-creation of the socket, but just wanted to report it
>> anyway.
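
(For what it's worth, the workaround amounts to caching one socket per thread and reusing it instead of recreating it per request. My actual code does this with a Java ThreadLocal through jzmq; the snippet below is just the analogous idea in C++, with a made-up endpoint name.)

//  Analogue of the Java ThreadLocal workaround: one cached REQ socket per
//  thread, reused for every request instead of being recreated each time.
//  (The socket is deliberately never closed here; a real version would
//  close it on thread shutdown.)
#include <zmq.h>
#include <cassert>

void *per_thread_req (void *ctx)
{
    thread_local void *req = NULL;
    if (!req) {
        req = zmq_socket (ctx, ZMQ_REQ);
        assert (req);
        int rc = zmq_connect (req, "inproc://fd-test");
        assert (rc == 0);
    }
    return req;
}
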
> 
> Yes, please. Create a minimal reproducible test case and report the problem via the bug tracker. That way it'll be solved rather than forgotten.

Done!
https://zeromq.jira.com/browse/LIBZMQ-281

Weird thing: when testing that sample out on my FreeBSD box, I managed to get NULL back from zmq_socket with errno indicating it was out of FDs. Never on the Linux box though.
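
So at least there the caller gets a chance to handle it. Something like this is what I'd like to be able to do everywhere (a simplified sketch; the handling itself is just an example):

//  What I saw on FreeBSD: zmq_socket itself returning NULL with errno set,
//  which the caller can handle instead of the whole process aborting.
#include <zmq.h>
#include <cerrno>
#include <cstdio>

void *create_req_checked (void *ctx)
{
    void *req = zmq_socket (ctx, ZMQ_REQ);
    if (!req) {
        if (errno == EMFILE || errno == ENFILE)
            fprintf (stderr, "out of file descriptors: %s\n",
                     zmq_strerror (errno));
        return NULL;        //  let the caller back off and retry later
    }
    return req;
}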


Hope this can help to improve an already great product! :)

Johan

> 
>> 
>> Please reply directly to me, I only get the digests, will be hard to
>> do followup answers with only that.
> 
> You're on cc.
> 
> Martin
> 
