[zeromq-dev] frequent ZeroMQ crashes - how to diagnose?

Nick Kravitz nick at dymcapital.com
Fri Jun 18 18:45:03 CEST 2010


We are a small financial startup using messaging for communication between our various applications.
We chose ZeroMQ because of its speed and flexibility: both the set of languages and the number of distributed systems we use are growing and not yet settled.

Our problem is that ZeroMQ crashes frequently while sending or receiving messages.
We are having trouble diagnosing the cause: sometimes our application runs for hours, sending and receiving hundreds of thousands of messages; sometimes it crashes within a few minutes.

Our diagnosis is complicated by the fact that we are using the Java language bindings, with TCP as the transport between multiple servers (all on the same local network).
Most of the time, the crash simply results in the JVM showing a generic error box with something like "libzmq.dll has experienced an unexpected error; the jvm needs to shut down".

I have traced one of the crashes to the code at the end of this message: the wsa_assert there occasionally fails, but I have no way to find out what the actual error is.

We have also played a bit with blocking versus non-blocking sends and receives.

What should our diagnosis strategy be to chase this difficult bug down?

Thanks in advance,

Nick Kravitz
nick at dymcapital.com

int zmq::tcp_socket_t::read (void *data, int size)
{
    int nbytes = recv (s, (char*) data, size, 0);

    //  If not a single byte can be read from the socket in non-blocking mode
    //  we'll get an error (this may happen during the speculative read).
    if (nbytes == SOCKET_ERROR && WSAGetLastError () == WSAEWOULDBLOCK)
        return 0;

    //  Connection failure.
    if (nbytes == -1 && (
          WSAGetLastError () == WSAENETDOWN ||
          WSAGetLastError () == WSAENETRESET ||
          WSAGetLastError () == WSAECONNABORTED ||
          WSAGetLastError () == WSAETIMEDOUT ||
          WSAGetLastError () == WSAECONNRESET ||
          WSAGetLastError () == WSAECONNREFUSED ||
          WSAGetLastError () == WSAENOTCONN))
        return -1;

    wsa_assert (nbytes != SOCKET_ERROR); // occasionally this assert fails, which causes the JVM to halt

    //  Orderly shutdown by the other peer.
    if (nbytes == 0)
        return -1;

    return (size_t) nbytes;
}
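
One way to find out which error trips the assert might be to capture the error code once, right after recv, and record it before asserting. The following is only a minimal sketch of that idea, not libzmq code: classify_read, describe_unexpected and the read_outcome enum are hypothetical names, and the Winsock error codes are written out numerically (an assumption made so the sketch compiles on any platform; on Windows the WSAE* macros from <winsock2.h> would be used instead).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

//  Numeric values of the Winsock error codes tested in read () above.
//  Spelled out here only so the sketch builds without <winsock2.h>.
namespace wsa
{
    const int ewouldblock  = 10035;  //  WSAEWOULDBLOCK
    const int enetdown     = 10050;  //  WSAENETDOWN
    const int enetreset    = 10052;  //  WSAENETRESET
    const int econnaborted = 10053;  //  WSAECONNABORTED
    const int econnreset   = 10054;  //  WSAECONNRESET
    const int enotconn     = 10057;  //  WSAENOTCONN
    const int etimedout    = 10060;  //  WSAETIMEDOUT
    const int econnrefused = 10061;  //  WSAECONNREFUSED
}

//  Hypothetical result type, not part of libzmq.
enum read_outcome
{
    READ_OK,           //  at least one byte was read
    READ_WOULD_BLOCK,  //  nothing available in non-blocking mode
    READ_PEER_GONE,    //  orderly shutdown or connection failure
    READ_UNEXPECTED    //  the case that currently trips wsa_assert
};

//  Classifies the result of recv () given its return value and the error
//  code captured immediately afterwards.
read_outcome classify_read (int nbytes, int last_error)
{
    assert (nbytes >= -1);  //  recv returns -1 (SOCKET_ERROR) or a byte count

    if (nbytes > 0)
        return READ_OK;
    if (nbytes == 0)
        return READ_PEER_GONE;

    switch (last_error) {
    case wsa::ewouldblock:
        return READ_WOULD_BLOCK;
    case wsa::enetdown:
    case wsa::enetreset:
    case wsa::econnaborted:
    case wsa::etimedout:
    case wsa::econnreset:
    case wsa::econnrefused:
    case wsa::enotconn:
        return READ_PEER_GONE;
    default:
        return READ_UNEXPECTED;
    }
}

//  Builds a log line carrying the raw code, so the failing error can be
//  identified even if the process is torn down right afterwards.
std::string describe_unexpected (int last_error)
{
    char buf [64];
    std::snprintf (buf, sizeof buf,
        "recv failed: unhandled WSA error %d", last_error);
    return std::string (buf);
}
```

On Windows, the intended use would be to store int err = WSAGetLastError () directly after the recv call, pass it to classify_read, and write the describe_unexpected text to a log file before the assert runs whenever the outcome is READ_UNEXPECTED. The numeric code could then be looked up in the Winsock error code list.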
