[zeromq-dev] Contributing native InfiniBand/RDMA support to 0MQ
Gabriele Svelto
gabriele.svelto at gmail.com
Thu Dec 15 19:34:20 CET 2011
Hi Martin,
2011/12/15 Martin Lucina <martin at lucina.net>:
> Hi Gabriele,
>
> When you say "other RDMA-enabled technologies", does that mean there are
> more interconnect technologies which have adopted the IB verbs API?
Yes, iWARP and RoCE both use the same APIs (though with some limitations).
> There appears to be some confusion / overloading of terms at least in
> the IB world, what I've seen is that a lot of people refer to RDMA,
> when what they actually mean is the native IB verbs API. If I
> understand it correctly RDMA in the strict sense (CPU bypass of writes
> to foreign memory) does not gain you much for small messages, so for
> example SDP will use normal packets for small transfers and RDMA for
> large transfers.
>
> I may have misunderstood, am still coming to grips with the technology.
I shared your confusion when I started dealing with these technologies
because they are not very well documented and some terms are often
used interchangeably. By "RDMA-enabled technologies" I mean any kind
of technology that implements the functionality provided by InfiniBand
verbs. The InfiniBand verbs library supports a number of operations
that can be roughly divided into four categories, and this is what
usually causes the confusion regarding the term:
- Send/receive operations, these work pretty much like datagrams
except that you can do many of them in parallel and the hardware will
deal with them without the need of intervention from the CPU. Both
endpoints must be aware of the operations so the code will look very
similar to what you do with sockets (one side sends, the other
receives) - these are optimal for small messages and it is what I am
going to use for 0MQ
- RDMA read/write operations, with these you can read/write from the
memory of a remote machine without intervention from the machine
itself, think of it as a memcpy() that takes a network address. When
you do such an operation the other side is oblivious to what is going
on
- Atomic operations, these allow you to do things like atomic
compare-and-swap operations inside the memory of a remote machine,
these are used for implementing distributed locks, queues, barriers,
etc... Again you can do them on a remote machine with the other end
being oblivious to what is going on
- Collective operations, these are the fanciest latest additions that
allow the cards to implement 1-to-N, N-to-1 and even N-to-N MPI
operations such as gather without intervention from the hosts,
everything happens on the cards and switches
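
To make the difference between the first three categories concrete,
here is a toy C model. It is only a sketch: it does not use the
ibverbs API, and every name in it (toy_peer, toy_post_recv, toy_send,
toy_rdma_write, toy_rdma_read, toy_cmp_and_swap) is hypothetical. The
"remote machine" is just a struct in local memory, and collective
operations are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a remote host (hypothetical, not an ibverbs type). */
struct toy_peer {
    uint8_t  memory[64];   /* region exposed for one-sided access      */
    uint64_t lock_word;    /* 8-byte word targeted by atomic operations */
    uint8_t *posted_recv;  /* receive buffer posted by the peer, or NULL */
    size_t   posted_len;   /* capacity of the posted receive buffer     */
    size_t   received;     /* bytes delivered by the last send          */
};

/* Two-sided, receiver half: the peer must post a buffer in advance. */
void toy_post_recv(struct toy_peer *peer, uint8_t *buf, size_t len)
{
    assert(peer != NULL);
    peer->posted_recv = buf;
    peer->posted_len = len;
}

/* Two-sided, sender half: fails unless a large enough receive is posted. */
int toy_send(struct toy_peer *peer, const void *buf, size_t len)
{
    if (peer->posted_recv == NULL || len > peer->posted_len)
        return -1;
    memcpy(peer->posted_recv, buf, len);
    peer->received = len;
    peer->posted_recv = NULL;   /* each posted receive is consumed once */
    return 0;
}

/* One-sided write: lands in the peer's memory, the peer does nothing. */
int toy_rdma_write(struct toy_peer *peer, size_t offset,
                   const void *buf, size_t len)
{
    if (offset > sizeof peer->memory || len > sizeof peer->memory - offset)
        return -1;
    memcpy(peer->memory + offset, buf, len);
    return 0;
}

/* One-sided read: fetches from the peer's memory, the peer does nothing. */
int toy_rdma_read(const struct toy_peer *peer, size_t offset,
                  void *buf, size_t len)
{
    if (offset > sizeof peer->memory || len > sizeof peer->memory - offset)
        return -1;
    memcpy(buf, peer->memory + offset, len);
    return 0;
}

/* Atomic compare-and-swap on the peer's word; returns the previous value.
 * In this single-threaded toy the atomicity is trivial. */
uint64_t toy_cmp_and_swap(struct toy_peer *peer,
                          uint64_t expected, uint64_t desired)
{
    uint64_t old = peer->lock_word;
    if (old == expected)
        peer->lock_word = desired;
    return old;
}
```

In this model toy_send() fails until the peer has called
toy_post_recv(), which is the "both endpoints must be aware" part,
while the RDMA and atomic calls only need the peer's memory to exist.
A distributed lock could then be taken with toy_cmp_and_swap(&peer, 0, 1)
and is acquired when the call returns 0.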
> Yes, definitely.
>
> One question; at least for IB, would it not be easier and get us the
> same functionality if we were to add explicit AF_SDP support, by which
> I mean building libzmq with --with-sdp and explicit SDP addressing,
> e.g. sdp://<GUID>. This would allow people to use ZeroMQ without
> dealing with the somewhat messy details of configuring libsdp in
> wrapper (LD_PRELOAD) mode.
I could do that; however, the last time I tested SDP I was not very
satisfied with its performance (albeit that was a long time ago) and
it hasn't got much love from the open source community. It is
available only through OFED AFAIK and many distributions do not ship
it by default; for example, both CentOS 6 and recent OpenSUSE releases
have dropped it. Most major distributions, on the other hand, ship
both the ibverbs and rdmacm libraries, making them readily available
to users.
Gabriele