[zeromq-dev] Parallel sort

Oliver Smith oliver at kfs.org
Tue Aug 3 01:59:43 CEST 2010


On 8/2/2010 5:17 PM, Pieter Hintjens wrote:
>> Not infinite :) I think the issue is one of collection; step X creates N
>> units of work that require processing and then resumption...
>>      
> You can plug pipelines into themselves.
>
> I'm wondering where you store the data while it's being sorted.  I'm
> also wondering why you need to parallelize this.  Is the comparison
> particularly expensive?
>
> Before discussing solutions, could you explain more about the use case?
>    
It's in the Subject line :)

I've been thinking about this a lot today (how you might sub-divide 
parallel work into sub-tasks and then perform a subsequent 
reunification). Fundamentally, it is the same principle as packet 
fragmentation and reassembly.

In this case, it is a message (containing a unit of work) being divided 
into two or more subunits, each of which can potentially be executed by 
a separate thread before finally sending an "ack" to the joiner process.

Here's my theoretical model:

INITIATOR:
Initiator dispatches the meta-workload to the worker pool with a static 
ID/token that is recognized as meaning "first workload, singular".
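
To make that concrete, here is a minimal pyzmq sketch of that first 
dispatch. The endpoints, the PUSH/PULL wiring and the stage:depth:id 
token layout are assumptions of mine, not anything 0MQ prescribes:

    import zmq

    ctx = zmq.Context()
    to_workers = ctx.socket(zmq.PUSH)
    to_workers.bind("tcp://127.0.0.1:5555")   # the worker pool PULLs from here

    # Static root token: stage 0, depth 0, "first workload, singular".
    ROOT_TOKEN = b"0:0:root"
    to_workers.send_multipart([ROOT_TOKEN, b"<entire unsorted dataset>"])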

WORKERS:
The first worker receives this load, performs any pre-processing, and 
then divides it into discrete sub-units.

It forwards the sub-units via a multi-part message to a third thread; 
the last part of the multi-part message is mostly a copy of the source 
message that triggered this step (for callback purposes). It can't be an 
exact copy, otherwise you'd just create a loop :) Perhaps the "token" 
contains a "stage" so that this could also serve as a message-passing FSM.
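
Sticking with the sketch above, the splitting step might look like this 
(the split() helper and the stage bump are my guesses at the token 
scheme):

    import zmq

    ctx = zmq.Context()
    work_in = ctx.socket(zmq.PULL)
    work_in.connect("tcp://127.0.0.1:5555")      # work from the initiator
    work_in.connect("tcp://127.0.0.1:5558")      # ... and from the collator
    to_collator = ctx.socket(zmq.PUSH)
    to_collator.connect("tcp://127.0.0.1:5556")  # sub-unit batches go here

    def split(data):
        """Placeholder: divide a workload into two halves."""
        mid = len(data) // 2
        return [data[:mid], data[mid:]]

    token, payload = work_in.recv_multipart()
    stage, depth, ident = token.split(b":")
    # Bump the stage so the trailing "callback" part is not an exact copy.
    parent = b":".join([str(int(stage) + 1).encode(), depth, ident])
    to_collator.send_multipart(split(payload) + [parent])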

COLLATOR:
The collection thread rejoins the multi-part message and assigns each 
piece of work a token (akin to a segment id). It uses the last part of 
the message as a "completion" reference for where to send the completed 
results, and to obtain "depth" information on the recursion.

It then forwards the subunits back to the worker pool.
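
In the same sketch, the collator's bookkeeping could be as little as 
this (the in_flight map and the segment-id counter are illustrative 
assumptions):

    import itertools
    import zmq

    ctx = zmq.Context()
    work_in = ctx.socket(zmq.PULL)
    work_in.bind("tcp://127.0.0.1:5556")      # sub-unit batches from workers
    to_workers = ctx.socket(zmq.PUSH)
    to_workers.bind("tcp://127.0.0.1:5558")   # back out to the worker pool

    seq = itertools.count()
    in_flight = {}        # parent token -> set of outstanding child tokens

    parts = work_in.recv_multipart()
    subunits, parent = parts[:-1], parts[-1]
    depth = int(parent.split(b":")[1]) + 1    # one level deeper than the parent

    children = set()
    for sub in subunits:
        child = b"0:%d:%d" % (depth, next(seq))   # stage:depth:segment id
        children.add(child)
        to_workers.send_multipart([child, sub])
    in_flight[parent] = children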

WORKERS:
Repeat the above process until they actually complete a piece of work. 
When this occurs, they send what amounts to an "ACK" for the source 
token to the COLLATOR.
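
At the leaf level that is no more than the following (the size threshold 
and the ACK framing are invented for the sketch):

    import zmq

    ctx = zmq.Context()
    work_in = ctx.socket(zmq.PULL)
    work_in.connect("tcp://127.0.0.1:5555")   # sub-units from the initiator
    work_in.connect("tcp://127.0.0.1:5558")   # ... and from the collator
    ack_out = ctx.socket(zmq.PUSH)
    ack_out.connect("tcp://127.0.0.1:5557")   # ACKs back to the collator

    token, payload = work_in.recv_multipart()
    if len(payload) <= 4096:                  # small enough to sort directly
        result = bytes(sorted(payload))       # placeholder leaf-level sort
        ack_out.send_multipart([b"ACK", token, result])
    # otherwise: split again, exactly as in the earlier worker sketch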

COLLATOR:
On receiving an ACK for a token, it removes that token from the 
in-flight queue. If this results in no ACKs pending for a given piece of 
work, the "completion" message is forwarded to the worker pool (again, 
this contains the token from which the subunits were created).
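
Continuing the sketch, the ACK handling might look like this (it assumes 
the in_flight map and to_workers socket from the collator sketch above, 
and the "DONE" framing is my own):

    def on_ack(token, in_flight, to_workers):
        """Retire a child token; once its siblings are done, emit completion."""
        for parent, children in list(in_flight.items()):
            if token in children:
                children.discard(token)
                if not children:              # no ACKs pending for this group
                    del in_flight[parent]
                    to_workers.send_multipart([b"DONE", parent])
                break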

WORKERS:
A worker receives the completion message for a group of subunits and 
performs the necessary post-processing. It then ACKs the token in the 
completion message (i.e. the work unit a level above).
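
For a sort, that post-processing is the merge step, e.g. (assuming the 
collator has kept the children's sorted results and handed them over 
with the completion message; the framing is again my own):

    import heapq

    def on_completion(parent_token, sorted_children, ack_out):
        """Merge the already-sorted child results, then ACK the parent token."""
        merged = bytes(heapq.merge(*sorted_children))
        ack_out.send_multipart([b"ACK", parent_token, merged])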

COLLATOR:
When an ACK is received for the parent work, it is forwarded to the 
originating thread, which is presumably sitting in a recv().

This /could/ have some embellishments, such as the 'stage' component of 
the token descriptor.

Also, the COLLATOR could set limits on the number of tokens in flight 
(to increase efficiency, cache hotness, etc). Hence the need to track 
depth: parent tokens stay in flight until their children complete, so 
with a global limit of, say, 4 tokens you would become "stuck" at the 
3rd level of recursion simply because the unfinished parents had 
exhausted the tokens available.

Again this is akin to a networking concept: bandwidth. You would 
presumably want to limit the number of tokens in-flight at any given 
depth to, say, twice the number of worker threads available (to ensure 
that there is always work ready when a thread calls recv()).
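
A rough sketch of that flow control (the limit, counters and deferral 
queue are all assumptions):

    from collections import defaultdict, deque

    N_WORKERS = 4
    LIMIT_PER_DEPTH = 2 * N_WORKERS

    in_flight_at = defaultdict(int)    # depth -> tokens currently outstanding
    deferred = defaultdict(deque)      # depth -> sub-units waiting for a slot

    def try_dispatch(depth, subunit, send):
        """Send a sub-unit if this depth has a free slot, else queue it."""
        if in_flight_at[depth] < LIMIT_PER_DEPTH:
            in_flight_at[depth] += 1
            send(subunit)
        else:
            deferred[depth].append(subunit)

    def on_ack_at_depth(depth, send):
        """An ACK frees a slot; release a deferred sub-unit at the same depth."""
        in_flight_at[depth] -= 1
        if deferred[depth]:
            try_dispatch(depth, deferred[depth].popleft(), send)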

I think this should probably be controlled at the COLLATOR level, 
though. Consider:

Worker #1 determines that it has 64 sub-units to dispatch. If it blocks 
at a send(), then one of your workers is out of action.

Alternatively, if worker #1 can fire off all 64 sub-units into the 
COLLATOR's input, it frees itself to return to the worker pool.

I guess the COLLATOR would have to use poll() so that it can receive 
ACKs independently of work issues. That would interfere with the depth 
handling I mentioned unless it used distinct sockets per depth level... 
which sounds excessive, but you are already paying for the code/memory 
involved, so it has a huge head start over any additional second-tier 
queueing system inside the collator itself...
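
In pyzmq terms that inner loop might look like the following, with the 
endpoint numbers carried over from the earlier sketches; a per-depth 
socket variant would simply mean more poller registrations:

    import zmq

    ctx = zmq.Context()
    work_in = ctx.socket(zmq.PULL)    # sub-unit batches from splitting workers
    work_in.bind("tcp://127.0.0.1:5556")
    ack_in = ctx.socket(zmq.PULL)     # ACKs from workers that finished a unit
    ack_in.bind("tcp://127.0.0.1:5557")

    poller = zmq.Poller()
    poller.register(work_in, zmq.POLLIN)
    poller.register(ack_in, zmq.POLLIN)

    while True:
        events = dict(poller.poll())
        if work_in in events:
            parts = work_in.recv_multipart()  # register tokens, dispatch sub-units
        if ack_in in events:
            ack = ack_in.recv_multipart()     # retire tokens, emit completions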



- Oliver


