[zeromq-dev] are zmq::atomic_ptr_t<> Helgrind warnings known?

Bill Torpey wallstprog at gmail.com
Sat Feb 24 21:49:42 CET 2018


Please see my comments below.  Full disclaimer: I don’t pretend to understand ZMQ internals, and a lot of this information came via trial and error.  YMMV, but I hope this helps.  If you find out more, it would be much appreciated if you post back to this thread — thanks!



> On Feb 24, 2018, at 11:34 AM, Francesco <francesco.montorsi at gmail.com> wrote:
> Actually even building zeromq with -fsanitize=thread I still get a lot of data races about ZMQ atomics:
> WARNING: ThreadSanitizer: data race (pid=25770)
>   Atomic write of size 4 at 0x7d4000017e48 by thread T4:
>     #0 __tsan_atomic32_fetch_add ../../../../gcc-5.3.0/libsanitizer/tsan/tsan_interface_atomic.cc:611 (libtsan.so.0+0x00000005860f)
>     #1 zmq::atomic_counter_t::add(unsigned int) src/atomic_counter.hpp:116 (libzmq.so.5+0x00000001a087)
>     #2 zmq::poller_base_t::adjust_load(int) src/poller_base.cpp:53 (libzmq.so.5+0x00000006d20d)
>     #3 zmq::epoll_t::add_fd(int, zmq::i_poll_events*) src/epoll.cpp:91 (libzmq.so.5+0x00000003c229)
>     #4 zmq::io_object_t::add_fd(int) src/io_object.cpp:66 (libzmq.so.5+0x0000000400ef)
>     #5 zmq::tcp_connecter_t::start_connecting() src/tcp_connecter.cpp:204 (libzmq.so.5+0x0000000abf30)
>     #6 zmq::tcp_connecter_t::process_plug() src/tcp_connecter.cpp:94 (libzmq.so.5+0x0000000ab6cb)
>     #7 zmq::object_t::process_command(zmq::command_t&) src/object.cpp:90 (libzmq.so.5+0x000000059e37)
>     #8 zmq::io_thread_t::in_event() src/io_thread.cpp:85 (libzmq.so.5+0x000000040cb3)
>     #9 zmq::epoll_t::loop() src/epoll.cpp:188 (libzmq.so.5+0x00000003ce6d)
>     #10 zmq::epoll_t::worker_routine(void*) src/epoll.cpp:203 (libzmq.so.5+0x00000003cfb6)
>     #11 thread_routine src/thread.cpp:109 (libzmq.so.5+0x0000000af00b)
>   Previous read of size 4 at 0x7d4000017e48 by thread T2:
>     #0 zmq::atomic_counter_t::get() const src/atomic_counter.hpp:210 (libzmq.so.5+0x000000061d20)
>     #1 zmq::poller_base_t::get_load() src/poller_base.cpp:47 (libzmq.so.5+0x00000006d1c6)
>     #2 zmq::io_thread_t::get_load() src/io_thread.cpp:72 (libzmq.so.5+0x000000040bef)
>     #3 zmq::ctx_t::choose_io_thread(unsigned long) src/ctx.cpp:451 (libzmq.so.5+0x0000000184ff)
>     #4 zmq::object_t::choose_io_thread(unsigned long) src/object.cpp:199 (libzmq.so.5+0x00000005a67f)
>     #5 zmq::socket_base_t::bind(char const*) src/socket_base.cpp:610 (libzmq.so.5+0x00000008bca6)
>     #6 zmq_bind src/zmq.cpp:326 (libzmq.so.5+0x0000000c8735)
>    ….

Since I don’t see any references to mutexes above, I assume that the code does not use mutexes.  Otherwise you would get a warning that looks more like this:

  Previous write of size 8 at 0x7b08000064b8 by thread T5 (mutexes: write M88, write M90, write M87):

TSAN keeps track of the mutexes held at the time of each memory access — if two memory accesses occur on two different threads, but they have at least one mutex in common, TSAN accepts the access as race-free.  If the two accesses don’t have at least one mutex in common, then the access is considered a data race.

If T2 and T4 were application code, this would be a clear violation of ZMQ’s threading rules (assuming “legacy”, non-thread-safe socket types).  For instance, one technique would be to perform initialization (e.g., bind’s) in the main thread, and only after that is done spin up other threads to do processing.  In this case, TSAN wouldn't have any way to know that the application guarantees that accesses cannot result in a race, so TSAN would flag it.  I’ve gotten in the habit of using mutexes to protect code like this even though it should not strictly be needed, just to keep tools like TSAN happy, and also because I don’t know the ZMQ code well enough to be 100% certain that the mutexes are not necessary — better safe than sorry!

This situation is different, though, since T4 is not an application thread — it’s an internal ZMQ worker thread.   So, I think in this case we kind of have to accept that ZMQ is doing the right thing here.

At least, that’s the approach I’ve been taking.  When I instrument my apps and libraries with TSAN I specifically do NOT instrument ZMQ, and I also use the “called_from_lib:libzmq.so” suppression (which is listed as an example for TSAN: https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions <https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions>).

Instrumenting libzmq and/or omitting the suppression causes a LOT of warnings, esp. in the ZMQ worker threads.  So, unless I'm willing to commit the time and effort to go through and investigate each of these warnings, I feel I have little choice but to accept that at this point in its lifetime ZMQ should be race-free for all practical purposes.

FWIW, I’ve done fairly extensive testing, and specifically stress testing, and have yet to find anything that looks like an honest-to-goodness bug in ZMQ.  (Which is not to say that the docs are always clear about what to expect in certain situations ;-)  I did have one problem which appears to have been a bug in epoll, and which was resolved by upgrading Linux, but that’s it.

> or about ypipe:
> WARNING: ThreadSanitizer: data race (pid=25770)
>   Read of size 8 at 0x7d640001e0c0 by thread T4:
>     #0 zmq::ypipe_t<zmq::command_t, 16>::read(zmq::command_t*) src/ypipe.hpp:170 (libzmq.so.5+0x000000047e49)
>     #1 zmq::mailbox_t::recv(zmq::command_t*, int) src/mailbox.cpp:73 (libzmq.so.5+0x000000047793)
>     #2 zmq::io_thread_t::in_event() src/io_thread.cpp:86 (libzmq.so.5+0x000000040cd5)
>     #3 zmq::epoll_t::loop() src/epoll.cpp:188 (libzmq.so.5+0x00000003ce6d)
>     #4 zmq::epoll_t::worker_routine(void*) src/epoll.cpp:203 (libzmq.so.5+0x00000003cfb6)
>     #5 thread_routine src/thread.cpp:109 (libzmq.so.5+0x0000000af00b)
>   Previous write of size 8 at 0x7d640001e0c0 by thread T2 (mutexes: write M500):
>     #0 zmq::ypipe_t<zmq::command_t, 16>::write(zmq::command_t const&, bool) src/ypipe.hpp:85 (libzmq.so.5+0x000000047f05)
>     #1 zmq::mailbox_t::send(zmq::command_t const&) src/mailbox.cpp:62 (libzmq.so.5+0x0000000476dc)
>     #2 zmq::ctx_t::send_command(unsigned int, zmq::command_t const&) src/ctx.cpp:438 (libzmq.so.5+0x000000018420)
>     #3 zmq::object_t::send_command(zmq::command_t&) src/object.cpp:474 (libzmq.so.5+0x00000005c07b)
>     #4 zmq::object_t::send_plug(zmq::own_t*, bool) src/object.cpp:220 (libzmq.so.5+0x00000005a7ef)
>     #5 zmq::own_t::launch_child(zmq::own_t*) src/own.cpp:87 (libzmq.so.5+0x00000006134f)
>     #6 zmq::socket_base_t::add_endpoint(char const*, zmq::own_t*, zmq::pipe_t*) src/socket_base.cpp:1006 (libzmq.so.5+0x00000008e081)
>     #7 zmq::socket_base_t::bind(char const*) src/socket_base.cpp:630 (libzmq.so.5+0x00000008be89)
>     #8 zmq_bind src/zmq.cpp:326 (libzmq.so.5+0x0000000c8735)
>     ...

As mentioned above, instrumenting ZMQ and/or omitting the “called_from_lib:libzmq.so” suppression cause a lot of warnings, and many of them are ZMQ worker threads, and specifically in the “pipe” methods, which handle communication between the ZMQ worker threads and application threads.  I’m forced to take it on faith that this code works correctly — it’s at such a basic level that any bugs would likely have surfaced by now.

BTW, there’s an excellent overview of how this all works at http://zeromq.org/whitepapers:architecture <http://zeromq.org/whitepapers:architecture> — although it’s somewhat old, it appears to still be relatively accurate.

> I tried forcing use of C++11 internal atomics and all the warnings about  zmq::atomic_ptr_t<> disappear (I still get tons about ypipe and "pipe").. 
> maybe it's best to prefer C++11 atomics when available?

I would actually suggest using plain old pthread for synchronization primitives — TSAN’s support for pthread is pretty mature. Support for C++11 is a bit less mature, and has certain constraints (https://clang.llvm.org/docs/ThreadSanitizer.html <https://clang.llvm.org/docs/ThreadSanitizer.html>):

ThreadSanitizer is in beta stage. It is known to work on large C++ programs using pthreads, but we do not promise anything (yet). C++11 threading is supported with llvm libc++. The test suite is integrated into CMake build and can be run with make check-tsan command.

In particular, to use C++11 you also need to use llvm’s libcxx, rather than gnu’s libstdc++.  

Last but not least, based on the line numbers in the stack traces above, it looks like ZMQ is using inline assembler to implement atomic ops, which would not be supported by TSAN in any case.

You might want to try forcing ZMQ to use compiler intrinsics (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>) instead — it looks like that is controlled by the ZMQ_HAVE_ATOMIC_INTRINSICS #define.  That should be set automatically if building using autotools, but for a CMake build you would need to specify that directly on the command line:


> Btw I will try to write a suppression file for ThreadSanitizer and ZMQ... I just start to doubt: is all ZMQ code really race-free? :)

As mentioned above, that is likely to be a huge undertaking.  At this point, I think the only confirmation we have that ZMQ is race-free is based on “black-box” testing.  The good news is that ZMQ appears to be used widely enough that the code is well-tested.  (Of course, failure to find a bug doesn’t prove that there isn’t one, so having a rigorous analysis of ZMQ’s threading behavior would be A Good Thing — but a big job). 

> Thanks,
> Francesco
> 2018-02-24 16:44 GMT+01:00 Francesco <francesco.montorsi at gmail.com <mailto:francesco.montorsi at gmail.com>>:
> Hi Bill, 
> indeed I've found ThreadSanitizer to be more effective i.e., it contains much less false positives compared to Helgrind... still I'm getting several race condition warnings out of libzmq 4.2.3... indeed I actually built only my application code with -fsanitize=thread... do you know if I need to rebuild libzmq with that option as well? I will try to see if it makes any difference!
> Thanks,
> Francesco
> 2018-02-24 15:17 GMT+01:00 Bill Torpey <wallstprog at gmail.com <mailto:wallstprog at gmail.com>>:
> I’m using clang’s Thread Sanitizer for a similar purpose, and just happened to notice that the TSAN docs use ZeroMQ as one of the example suppressions:  https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions <https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions>
> I assume that the reason for suppressing libzmq.so has to do with (legacy) sockets not being thread-safe, so the code may exhibit race conditions that are irrelevant given that the code is not intended to be called from multiple threads.
> FWIW, you may want to check out the clang sanitizers — they have certain advantages over valgrind — (faster, multi-threaded, etc.) if you are able to instrument the code at build time.
>> On Feb 23, 2018, at 6:52 AM, Luca Boccassi <luca.boccassi at gmail.com <mailto:luca.boccassi at gmail.com>> wrote:
>> On Fri, 2018-02-23 at 12:22 +0100, Francesco wrote:
>>> Hi all,
>>> I'm trying to further debug the problem I described in my earlier
>>> mail (
>>> https://lists.zeromq.org/pipermail/zeromq-dev/2018-February/032303.ht <https://lists.zeromq.org/pipermail/zeromq-dev/2018-February/032303.ht>
>>> ml) so
>>> I decided to use Helgrind to find race conditions in my code.
>>> My problem is that apparently Helgrind 3.12.0 is reporting race
>>> conditions
>>> against zmq::atomic_ptr_t<> implementation.
>>> Now I know that Helgrind has troubles with C++11 atomics but by
>>> looking at
>>> the code I see that ZMQ is not using them (note: I do have
>>> defined, so the latter wins!).
>>> In particular Helgrind 3.12.0 tells me that:
>>> ==00:00:00:11.885 29399==
>>> ==00:00:00:11.885 29399== *Possible data race during read of size 8
>>> at
>>> 0xB373BF0 by thread #4*
>>> ==00:00:00:11.885 29399== Locks held: none
>>> ==00:00:00:11.885 29399==    at 0x6BD79AB:
>>> *zmq::atomic_ptr_t<zmq::command_t>::cas*(zmq::command_t*,
>>> zmq::command_t*)
>>> (atomic_ptr.hpp:150)
>>> ==00:00:00:11.885 29399==    by 0x6BD7874:
>>> zmq::ypipe_t<zmq::command_t,
>>> 16>::check_read() (ypipe.hpp:147)
>>> ==00:00:00:11.885 29399==    by 0x6BD7288:
>>> zmq::ypipe_t<zmq::command_t,
>>> 16>::read(zmq::command_t*) (ypipe.hpp:165)
>>> ==00:00:00:11.885 29399==    by 0x6BD6FE7:
>>> zmq::mailbox_t::recv(zmq::command_t*, int) (mailbox.cpp:98)
>>> ==00:00:00:11.885 29399==    by 0x6BD29FC:
>>> zmq::io_thread_t::in_event()
>>> (io_thread.cpp:81)
>>> ==00:00:00:11.885 29399==    by 0x6BD05C1: zmq::epoll_t::loop()
>>> (epoll.cpp:188)
>>> ==00:00:00:11.885 29399==    by 0x6BD06C3:
>>> zmq::epoll_t::worker_routine(void*) (epoll.cpp:203)
>>> ==00:00:00:11.885 29399==    by 0x6C18BA5: thread_routine
>>> (thread.cpp:109)
>>> ==00:00:00:11.885 29399==    by 0x4C2F837: mythread_wrapper
>>> (hg_intercepts.c:389)
>>> ==00:00:00:11.885 29399==    by 0x6E72463: start_thread
>>> (pthread_create.c:334)
>>> ==00:00:00:11.885 29399==    by 0x92F901C: clone (clone.S:109)
>>> ==00:00:00:11.885 29399==
>>> ==00:00:00:11.885 29399== This conflicts with a previous write of
>>> size 8 by
>>> thread #2
>>> ==00:00:00:11.885 29399== Locks held: 1, at address 0xB373C08
>>> ==00:00:00:11.885 29399==    at 0x6BD77F4:
>>> *zmq::atomic_ptr_t<zmq::command_t>::set*(zmq::command_t*)
>>> (atomic_ptr.hpp:90)
>>> ==00:00:00:11.885 29399==    by 0x6BD7422:
>>> zmq::ypipe_t<zmq::command_t,
>>> 16>::flush() (ypipe.hpp:125)
>>> ==00:00:00:11.885 29399==    by 0x6BD6DF5:
>>> zmq::mailbox_t::send(zmq::command_t const&) (mailbox.cpp:63)
>>> ==00:00:00:11.885 29399==    by 0x6BB9128:
>>> zmq::ctx_t::send_command(unsigned int, zmq::command_t const&)
>>> (ctx.cpp:438)
>>> ==00:00:00:11.885 29399==    by 0x6BE34CE:
>>> zmq::object_t::send_command(zmq::command_t&) (object.cpp:474)
>>> ==00:00:00:11.885 29399==    by 0x6BE26F8:
>>> zmq::object_t::send_plug(zmq::own_t*, bool) (object.cpp:220)
>>> ==00:00:00:11.885 29399==    by 0x6BE68E2:
>>> zmq::own_t::launch_child(zmq::own_t*) (own.cpp:87)
>>> ==00:00:00:11.885 29399==    by 0x6C03D6C:
>>> zmq::socket_base_t::add_endpoint(char const*, zmq::own_t*,
>>> zmq::pipe_t*)
>>> (socket_base.cpp:1006)
>>> ==00:00:00:11.885 29399==  Address 0xb373bf0 is 128 bytes inside a
>>> block of
>>> size 224 alloc'd
>>> ==00:00:00:11.885 29399==    at 0x4C2A6FD: operator new(unsigned
>>> long,
>>> std::nothrow_t const&) (vg_replace_malloc.c:376)
>>> ==00:00:00:11.885 29399==    by 0x6BB8B8D:
>>> zmq::ctx_t::create_socket(int)
>>> (ctx.cpp:351)
>>> ==00:00:00:11.885 29399==    by 0x6C284D5: zmq_socket (zmq.cpp:267)
>>> ==00:00:00:11.885 29399==    by 0x6143809:
>>> ZmqClientSocket::Config(PubSubSocketConfig const&)
>>> (ZmqRequestReply.cpp:303)
>>> ==00:00:00:11.885 29399==    by 0x6144069:
>>> ZmqClientMultiSocket::Config(PubSubSocketConfig const&)
>>> (ZmqRequestReply.cpp:407)
>>> ==00:00:00:11.885 29399==    by 0x61684EF: client_thread_main(void*)
>>> (ZmqRequestReplyUnitTests.cpp:132)
>>> ==00:00:00:11.886 29399==    by 0x4C2F837: mythread_wrapper
>>> (hg_intercepts.c:389)
>>> ==00:00:00:11.886 29399==    by 0x6E72463: start_thread
>>> (pthread_create.c:334)
>>> ==00:00:00:11.886 29399==    by 0x92F901C: clone (clone.S:109)
>>> ==00:00:00:11.886 29399==  Block was alloc'd by thread #2
>>> Is this a known (and ignorable) issue with  zmq::atomic_ptr_t<>?
>>> Thanks,
>>> Francesco
>> Yeah I started trying to put together a suppression file but never
>> finished it:
>> https://github.com/bluca/libzmq/commit/fb9ee9da7631f9506cbfcd6db29a284ae6e9651e <https://github.com/bluca/libzmq/commit/fb9ee9da7631f9506cbfcd6db29a284ae6e9651e>
>> Hope to have time to finish working on it eventually (feel free to
>> contribute!) as it's very noisy right now, as it can't know about our
>> lock-free queue implementation without the custom suppression file
>> -- 
>> Kind regards,
>> Luca Boccassi_______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org <mailto:zeromq-dev at lists.zeromq.org>
>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev <https://lists.zeromq.org/mailman/listinfo/zeromq-dev>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org <mailto:zeromq-dev at lists.zeromq.org>
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev <https://lists.zeromq.org/mailman/listinfo/zeromq-dev>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20180224/cf351cd7/attachment.htm>

More information about the zeromq-dev mailing list