[zeromq-dev] ZeroMQ cleaning dead sockets

Douglas Alves douglas.alves at ebz.tec.br
Fri Aug 8 21:51:24 CEST 2025


Hello ZeroMQ community,

I’m reaching out for advice and best practices on how to manage inactive 
socket behavior in a high-volume router/dealer environment.


*Context:*

  * We have a ZeroMQ router server (Python + pyzmq) that accepts
    connections from multiple dealer clients.

  * Approximately 200 unique hosts connect daily, each using its own
    identity (hostname). We expect this to grow to 8000 hosts within
    2 months.

  * The server keeps track of active identities using an
    active_identities set, in combination with a client_update_timestamp
    stored in our database to monitor liveness.

  * We use ZMQ_ROUTER_HANDOVER = 1 to allow dealers to reconnect with
    the same identity.
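
To make the setup above concrete, here is a minimal sketch of a router configured this way. It is illustrative only and not copied from our repository: the helper names make_router and receive_once, the timeout, and the endpoint passed in are hypothetical.

```python
import zmq


def make_router(ctx: zmq.Context, endpoint: str) -> zmq.Socket:
    """Create a ROUTER socket that lets a reconnecting peer reuse its identity."""
    router = ctx.socket(zmq.ROUTER)
    # With handover enabled, a new connection presenting an identity that is
    # already in use takes that identity over from the older connection.
    router.setsockopt(zmq.ROUTER_HANDOVER, 1)
    router.bind(endpoint)
    return router


def receive_once(router: zmq.Socket, active_identities: set, timeout_ms: int = 1000):
    """Read one message, if any arrives in time, and record the sender identity."""
    if router.poll(timeout_ms, zmq.POLLIN):
        # A ROUTER socket prepends the sender identity as the first frame.
        identity, *frames = router.recv_multipart()
        active_identities.add(identity)
        return identity, frames
    return None
```

In this sketch active_identities is a plain in-memory set; the database timestamp update would happen wherever receive_once returns a message.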


*Code / Repo (for reference):*

  * Project (open source):
    https://github.com/eBZtec/Workday-Session-Management


*Class that configures/maintains the ZeroMQ queues:*

  * https://github.com/eBZtec/Workday-Session-Management/blob/main/WSM-server/WSM-server-router/src/services/simple_route_server_service.py


*Tests:*

  * We run the application, disconnect a dealer from its current
    network, and reconnect it on a different network. In some cases we
    see unexpected behavior: the same dealer identity ends up connected
    through 2 sockets (both sockets stay "Established" when we run the
    lsof or ss Linux commands). That is our actual problem.

  * In isolated cases the sockets are terminated, but we cannot tell
    why.


*The Problem:*

Over time, we are seeing a growth in inactive sockets: identities that 
the router still accepts messages for, despite the client having 
disconnected or crashed. Since the router will still enqueue messages for 
these identities, this leads to:

  * Memory usage growth
  * Undelivered message buildup
  * File descriptor exhaustion
  * Event loop slowdown and performance degradation

*Mitigations we've tried so far:*

  * Enabled ZMQ_ROUTER_MANDATORY = 1 to detect disconnected identities
    and catch ZMQError(errno=EHOSTUNREACH).
  * Periodically restarted the router context (via socket.close() and
    context.term()) to clear all identity mappings.
  * Used client_update_timestamp to stop sending to stale identities.
  * Considered implementing ping/pong, but want to avoid additional
    message overhead unless necessary.
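
For reference, the first and third mitigations can be sketched roughly as follows. This is a simplified illustration, not our production code: the 300-second threshold and the helper names is_stale and send_if_reachable are hypothetical, and last_update stands in for the client_update_timestamp value read from the database (as a Unix timestamp). It assumes ZMQ_ROUTER_MANDATORY = 1 has already been set on the router.

```python
import time

import zmq

STALE_AFTER_SECONDS = 300  # hypothetical threshold, tune to the real update interval


def is_stale(last_update: float, now: float = None,
             max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Return True if the last update is older than max_age seconds."""
    now = time.time() if now is None else now
    return (now - last_update) > max_age


def send_if_reachable(router: zmq.Socket, identity: bytes, payload: bytes,
                      active_identities: set, last_update: float,
                      max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Send payload to identity unless it is stale or unknown to the router."""
    if is_stale(last_update, max_age=max_age):
        # Do not even try to send to an identity we consider stale.
        active_identities.discard(identity)
        return False
    try:
        router.send_multipart([identity, payload])
        return True
    except zmq.ZMQError as exc:
        # With ROUTER_MANDATORY set, sending to an identity the router has
        # no connection for fails with EHOSTUNREACH instead of being dropped.
        if exc.errno == zmq.EHOSTUNREACH:
            active_identities.discard(identity)
            return False
        raise
```

Note that this only keeps our own bookkeeping clean. It does not remove anything from the router's internal identity table, which is what the questions below are about.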

*Questions for the community:*

  * Is there any way (internal API or safe workaround) to explicitly
    remove an identity from a router socket, without restarting the context?
  * What strategies do you recommend for scaling ROUTER/DEALER setups
    with many thousands of connections per day?
  * Are there architectural recommendations (e.g. moving to another
    pattern or a proxy-based design) that better handle high-churn
    environments?
  * Any experience, advice, or community patterns for keeping ROUTER
    identity mappings under control in large-scale scenarios?

We’d really appreciate any feedback from others who’ve faced similar 
situations.


Thank you in advance!


Best regards,

Douglas Alves

douglas.alves at ebz.tec.br

