[zeromq-dev] frequent ZeroMQ crashes - how to diagnose?

Nick Kravitz nick at dymcapital.com
Sat Jun 19 03:15:17 CEST 2010


ZeroMQ crashed today.

This is a Win32 build of both ZMQ and myApp.
myApp had been running fine and had handled several thousand messages when the memcpy line marked below threw the following exception:

"Unhandled exception at 0x6404edd6 (msvcr90d.dll) in myApp.exe: 0xC0000005: Access violation reading location 0xfeeefeee."

The debugger shows the following values:
    buffer      0x00d9b570 "%"          unsigned char *
    pos         2                       unsigned int
    write_pos   0xfeeefeee <Bad Ptr>    unsigned char *
    to_copy     8190                    unsigned int

write_pos looks like a bad pointer.
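
For context on that value: 0xfeeefeee is the pattern the Windows heap writes over freed memory when its debug checks are active, so a pointer holding it was most likely read from memory that had already been freed. Below is a minimal sketch of a helper that recognises this and the other common MSVC debug fill patterns. The name describe_fill_pattern is made up for illustration; it is not part of ZeroMQ or the CRT.

```cpp
#include <cstdint>
#include <string>

//  Hypothetical helper (not part of ZeroMQ or the CRT): maps a 32-bit
//  pointer value to the well-known debug fill pattern it matches, if any.
std::string describe_fill_pattern (std::uint32_t value)
{
    switch (value) {
    case 0xfeeefeeeu:
        return "freed heap memory (HeapFree)";
    case 0xddddddddu:
        return "freed heap memory (debug CRT)";
    case 0xcdcdcdcdu:
        return "uninitialized heap memory (debug CRT)";
    case 0xccccccccu:
        return "uninitialized stack memory";
    case 0xfdfdfdfdu:
        return "guard bytes around a heap block (debug CRT)";
    default:
        return "not a known fill pattern";
    }
}
```

Calling describe_fill_pattern (0xfeeefeee) reports freed heap memory. That would be consistent with the object that holds write_pos having been destroyed before this code ran, although the pattern alone does not say which object that was.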

From encoder.hpp:

                //  If there are no data in the buffer yet and we are able to
                //  fill whole buffer in a single go, let's use zero-copy.
                //  There's no disadvantage to it as we cannot stick multiple
                //  messages into the buffer anyway. Note that subsequent
                //  write(s) are non-blocking, thus each single write writes
                //  at most SO_SNDBUF bytes at once not depending on how large
                //  is the chunk returned from here.
                //  As a consequence, large messages being sent won't block
                //  other engines running in the same I/O thread for excessive
                //  amounts of time.
                if (!pos && !*data_ && to_write >= buffersize) {
                    *data_ = write_pos;
                    *size_ = to_write;
                    write_pos = NULL;
                    to_write = 0;
                    return;
                }

                //  Copy data to the buffer. If the buffer is full, return.
                size_t to_copy = std::min (to_write, buffersize - pos);
=======>        memcpy (buffer + pos, write_pos, to_copy); 
                pos += to_copy;
                write_pos += to_copy;
                to_write -= to_copy;
                if (pos == buffersize) {
                    *data_ = buffer;
                    *size_ = pos;
                    return;
                }
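
To make the failing step easier to reason about in isolation, here is a stripped-down model of the copy branch above. It is an illustration only, not the ZeroMQ encoder: the struct copy_state_t and the function copy_chunk are hypothetical names, and the null check is an addition the original code does not have.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

//  Hypothetical stand-in for the encoder state used by the copy branch.
struct copy_state_t
{
    unsigned char *buffer;           //  destination batch buffer
    std::size_t buffersize;          //  capacity of the buffer
    std::size_t pos;                 //  bytes already stored in the buffer
    const unsigned char *write_pos;  //  next source byte to copy
    std::size_t to_write;            //  source bytes still to copy
};

//  Copies as much pending data as fits into the buffer and advances the
//  state. Returns the number of bytes copied. The null check is an added
//  guard; it cannot detect a dangling (already freed) source pointer.
std::size_t copy_chunk (copy_state_t &s)
{
    if (!s.write_pos || s.pos >= s.buffersize)
        return 0;
    const std::size_t to_copy = std::min (s.to_write, s.buffersize - s.pos);
    std::memcpy (s.buffer + s.pos, s.write_pos, to_copy);
    s.pos += to_copy;
    s.write_pos += to_copy;
    s.to_write -= to_copy;
    return to_copy;
}
```

The debugger values fit this arithmetic. With pos = 2, a to_copy of 8190 is what std::min would yield for an 8192-byte buffer (an assumed size) when at least 8190 bytes remain to be written. The sizes therefore look sane, and only the source pointer write_pos is suspect.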


Hi Nick,

> We are a small financial startup using messaging for communication 
> between our various applications.
> 
> We chose ZeroMQ because of speed and flexibility - both the set of 
> languages and number of distributed systems we have is growing and yet 
> to be determined.

Understood.

> What should our diagnosis strategy be to chase this difficult bug down?

The only problem here seems to be that Windows returns an error we 
didn't expect. The only thing that needs to be done is to find out what 
the error is and add it to the list (the long if statement in the code 
you've sent).

wsa_assert should print the error to stderr -- can you check it in the 
console?
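
As a rough illustration of that kind of check: an error code is compared against a list of codes the caller is prepared to handle, and anything else is printed to stderr before the program aborts. The names is_expected_error and check_socket_error are hypothetical. The constants are plain copies of the Winsock values so that the sketch compiles without <winsock2.h>, and the choice of which codes count as expected is an assumption, not ZeroMQ's actual list.

```cpp
#include <cstdio>
#include <cstdlib>

//  Local copies of Winsock error values (normally from <winsock2.h>).
enum
{
    wsa_econnaborted = 10053,
    wsa_econnreset = 10054,
    wsa_etimedout = 10060
};

//  Hypothetical list of errors treated as an ordinary disconnect.
bool is_expected_error (int code)
{
    return code == wsa_econnaborted
        || code == wsa_econnreset
        || code == wsa_etimedout;
}

//  Returns normally for expected codes; prints any other code to stderr
//  and aborts, so the unexpected value shows up in the console.
void check_socket_error (int code)
{
    if (is_expected_error (code))
        return;
    std::fprintf (stderr, "Unexpected socket error %d (%s:%d)\n",
        code, __FILE__, __LINE__);
    std::abort ();
}
```

On Windows the code passed in would come from WSAGetLastError (). Once the unexpected value is known, handling it means adding one more comparison to is_expected_error.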

Let me know what the error was so that I can fix it in the trunk.

Thanks!
Martin


