[zeromq-dev] Contributing native InfiniBand/RDMA support to 0MQ
Martin Lucina
martin at lucina.net
Fri Dec 16 00:05:55 CET 2011
Hi Gabriele,
gabriele.svelto at gmail.com said:
> [...]
> I shared your confusion when I started dealing with these technologies
> because they are not very well documented and some terms are often
> used interchangeably. By "RDMA-enabled technologies" I mean any kind
> of technology that implements the functionality provided by InfiniBand
> verbs. The InfiniBand verbs library supports a number of operations
> which can be roughly divided into four categories, and this is what
> usually causes the confusion regarding the term:
Thanks for summarizing this. I also found the following paper
useful as a step-by-step walkthrough of the ibverbs APIs:
http://arxiv.org/abs/1105.1827
> - Send/receive operations: these work pretty much like datagrams,
> except that you can post many of them in parallel and the hardware
> will handle them without intervention from the CPU. Both endpoints
> must be aware of the operations, so the code will look very
> similar to what you do with sockets (one side sends, the other
> receives) - these are optimal for small messages and this is what I
> am going to use for 0MQ
Makes sense. Later we could look at using RDMA read/write for large
messages.
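For the record, posting such a two-sided send with ibverbs looks roughly
like the sketch below. This assumes a connected queue pair, a registered
memory region and a completion queue all set up elsewhere, and is
obviously not runnable without an RDMA-capable adapter:

```c
/* Sketch: assumes a connected struct ibv_qp *qp, a struct ibv_mr *mr
 * from ibv_reg_mr() covering buf, and a struct ibv_cq *cq, all set up
 * elsewhere. Not runnable without an RDMA-capable adapter. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int send_msg(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = mr->lkey,           /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,              /* echoed back in the completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,    /* two-sided: the peer must have
                                         posted a matching receive */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);  /* returns immediately; the
                                             HCA does the transfer */
}

/* The CPU is only involved again when reaping the completion: */
static int wait_send_done(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                             /* busy-poll; real code would use
                                         ibv_req_notify_cq() events */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```

The receive side is symmetric: ibv_post_recv() with its own SGE, then
poll the CQ for the incoming completion.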
> - RDMA read/write operations, with these you can read/write from the
> memory of a remote machine without intervention from the machine
> itself, think of it as a memcpy() that takes a network address. When
> you do such an operation the other side is oblivious to what is going
> on
> - Atomic operations, these allow you to do things like atomic
> compare-and-swap operations inside the memory of a remote machine,
> these are used for implementing distributed locks, queues, barriers,
> etc... Again you can do them on a remote machine with the other end
> being oblivious to what is going on
> - Collective operations, these are the latest and fanciest additions that
> allow the cards to implement 1-to-N, N-to-1 and even N-to-N MPI
> operations such as gather without intervention from the hosts,
> everything happens on the cards and switches
Neat, the last one is basically PUB/SUB in hardware.
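For reference, the one-sided operations above differ from a plain send
only in the work request you post: a different opcode plus the remote
address and key, which the peer must have handed over out of band
(e.g. over TCP) after its own ibv_reg_mr() call. A sketch, with
remote_addr/remote_rkey as placeholders for those exchanged values;
again not runnable without an RDMA-capable adapter:

```c
/* Sketch of one-sided work requests. Assumes a connected
 * struct ibv_qp *qp and a local struct ibv_mr *mr; remote_addr and
 * remote_rkey are placeholders for values received out of band from
 * the peer. Not runnable without an RDMA-capable adapter. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t) buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* peer's CPU never involved */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr; /* the "network address" */
    wr.wr.rdma.rkey        = remote_rkey; /* from the peer's MR */
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}

/* Remote compare-and-swap on a 64-bit word: if *remote == compare,
 * set it to swap; the old value lands in the local buffer either way. */
static int post_cas(struct ibv_qp *qp, struct ibv_mr *mr,
                    uint64_t *old_val, uint64_t remote_addr,
                    uint32_t remote_rkey, uint64_t compare, uint64_t swap)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t) old_val, .length = sizeof *old_val,
        .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_CMP_AND_SWP,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.atomic.remote_addr = remote_addr; /* must be 8-byte aligned */
    wr.wr.atomic.compare_add = compare;
    wr.wr.atomic.swap        = swap;
    wr.wr.atomic.rkey        = remote_rkey;
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```

A spin on post_cas() with compare=0, swap=1 is the building block for
the distributed locks mentioned above.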
> I could do that however the last time I tested SDP I was not very
> satisfied with its performance (although that was a long time ago) and
> it hasn't got much love from the open source community. It is
> available only through OFED AFAIK, and many distributions do not ship it
> by default, for example both CentOS 6 and recent OpenSUSE releases
> have dropped it. Most major distributions on the other hand ship both
> ibverbs and the rdmacm libraries making them readily available to
> users.
You're right about SDP; I had a look at the state of distribution and
kernel.org support and it does seem that ibverbs is the way to go.
Also, AFAICT SDP is a kernel-space implementation so you don't get the
possible benefits of a kernel bypass.
Martin Sustrik and I have a small test lab set up, so I will try to
get ibverbs working there; we have a couple of machines connected back to
back with Mellanox ConnectX adapters. So far I've managed to get OpenSM,
ibping and IPoIB working, but ibverbs is resisting (cannot find userspace
driver... blah blah).
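For anyone hitting the same error: in my understanding it usually means
libibverbs cannot find the provider (userspace driver) library for the
adapter, which for Mellanox ConnectX is libmlx4. A rough checklist; the
package names below are an assumption and vary by distribution:

```shell
# Each provider installs a .driver file here; if the directory is
# empty, libibverbs has no userspace driver to load.
ls /etc/libibverbs.d/

# Install the ConnectX provider and the verbs utilities
# (Debian/Ubuntu-era names shown as an example only):
# apt-get install libmlx4-1 ibverbs-utils

# Should now list the HCA instead of failing:
ibv_devinfo
```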
-mato