[zeromq-dev] PUB/SUB on an epgm socket stops receiving eventually …

Ladan Gharai lgharai at gmail.com
Fri Jun 10 21:55:58 CEST 2011


On Fri, Jun 3, 2011 at 5:43 PM, Steven McCoy <steven.mccoy at miru.hk> wrote:

> On 4 June 2011 03:49, Ladan Gharai <lgharai at gmail.com> wrote:
>
>>
>>
>> On Wed, Jun 1, 2011 at 4:41 PM, Steven McCoy <steven.mccoy at miru.hk> wrote:
>>
>>> On 2 June 2011 04:17, Ladan Gharai <lgharai at gmail.com> wrote:
>>>
>>>> I’ve turned on the OpenPGM trace/debug messages – afaict once the
>>>> epgm receiver sustains “a lot” of packet loss it's just not able to
>>>> start over again
>>>>
>>>
>>> Every time the receiver sees packet loss it closes the socket and
>>> schedules a new socket to be created to reconnect to the PGM stream.
>>>
>>
>>    I am not sure I understand this - do you mean the zmq socket gets a new
>> zmq socket if the ePGM receiver experiences unrecoverable loss? (I don't see
>> any new socket opening; I just see the zmq recv not receiving anymore.)
>>
>
> ZMQ creates a new PGM socket.  PGM is a socket based API beneath ZMQ.
>

I see. But the new PGM socket does not seem to reconnect to the receiver?

Also, could you point out where in the zmq code this happens? (I'd like
to print out an error message or do something once this happens.)


>>>>
>>>> My questions are:
>>>>
>>>>    1.   Is there a way to reset the receiver once this happens?
>>>>
>>>> Reconnects occur with the same engine as TCP reconnects.
>>>
>>>>
>>>>    2. Has anyone experimented with changing the size of the rxw (it
>>>>    currently uses 33333) – and the various timers NAK_RB_IVL, NAK_RPT_IVL and
>>>>    NAK_RDATA_IVL (something akin to TCP tuning)?
>>>>
>>>>
>>> If you find PGM is non-productive you should investigate tightening the
>>> recovery settings so failure is raised sooner rather than later.  The
>>> default settings are friendly towards 10 Mb/s networks and so running at high
>>> speed on 1 Gb/s networks may pose a problem with high data loss.
>>>
>>> For example, drop the retry count for DATA & NCF from the default 50 to
>>> 2.
>>>
>>> ~line 211 in pgm_socket.cpp:
>>>                    nak_data_retries = 2,
>>>                    nak_ncf_retries = 2;
>>>
>>
>>     Yes - this seems the most sensible approach, except now it crashes -
>> Segmentation fault - once it falls into a long series of packet losses.
>>
>
> Can you provide a trace?  A coredump should make it more expedient to
> diagnose the bug.
>

Well, I tried to strip the code down to send you a simple test case, and in
the process realized I had somehow contaminated the OpenPGM code. With a
fresh OpenPGM my application is no longer crashing with the reduced retry
values :)
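
For a rough sense of why the lower retry count raises failure sooner, here
is a minimal sketch. It assumes the time spent on one lost packet in a given
recovery phase is bounded by the retry count times the wait between retries,
ignoring back-off and round trips. The function name recovery_budget_ms and
the 200 ms interval used in the note below are illustrative assumptions, not
values taken from this thread.

```cpp
#include <cassert>

// Rough upper bound, in milliseconds, on how long a receiver keeps retrying
// one lost packet in a single recovery phase before giving up.
// ASSUMPTION: budget = retries * interval; real PGM timing also includes
// random back-off and network delay, so treat this as an estimate only.
long recovery_budget_ms (int retries, long interval_ms)
{
    assert (retries >= 0);
    assert (interval_ms >= 0);
    return static_cast<long> (retries) * interval_ms;
}
```

With an assumed 200 ms interval, recovery_budget_ms (50, 200) gives 10000 ms,
while recovery_budget_ms (2, 200) gives 400 ms, so unrecoverable loss would
be reported much earlier with the reduced values.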

But it seems even more of our loss problems were related to having set
ZMQ_RATE to a rather high number (initially 950 Mbps and then 500 Mbps). I
have now reduced it to 100 Mbps. I am now seeing the following behaviors:




PS: thank you for the link to

https://zeromq.jira.com/browse/LIBZMQ-205
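
On the ZMQ_RATE point above: the option is expressed in kilobits per second,
so 100 Mbps corresponds to a value of 100000. The sketch below converts the
unit and estimates how many packets a receive window would need to hold for
a given rate and recovery interval. The helper names, and the idea that the
window is simply rate times recovery interval divided by packet size, are my
assumptions for illustration, not a statement of how zmq or OpenPGM computes
it.

```cpp
#include <cassert>
#include <cstdint>

// ZMQ_RATE is specified in kilobits per second, so 1 Mbps = 1000.
std::uint64_t mbps_to_zmq_rate (std::uint64_t mbps)
{
    return mbps * 1000;
}

// Estimate the number of packets needed to buffer recovery_ms of traffic.
// 1 kbit/s equals 1 bit/ms, so rate_kbps / 8 is bytes per millisecond.
// ASSUMPTION: window = bytes in flight over the interval / packet size.
std::uint64_t window_packets (std::uint64_t rate_kbps,
                              std::uint64_t recovery_ms,
                              std::uint64_t tpdu_bytes)
{
    assert (tpdu_bytes > 0);
    const std::uint64_t bytes_per_ms = rate_kbps / 8;
    const std::uint64_t packets = bytes_per_ms * recovery_ms / tpdu_bytes;
    // Always leave room for at least one packet.
    return packets == 0 ? 1 : packets;
}
```

As an example, window_packets (40000, 10000, 1500) evaluates to 33333, the
same figure as the rxw size mentioned earlier, although nothing in this
thread says which settings actually produced that number. At
mbps_to_zmq_rate (100) with the same assumed 10 s interval and 1500-byte
packets the estimate is 83333.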


> --
> Steve-o
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>