We deployed OpenIM version 3.5.1 in our Kubernetes environment for our company’s IM scenarios. During development and operation, we encountered some issues. Here, I’ll document the details of the problems and the resolution process.
Problem Trigger Scenario
We wrote a stress testing program to simulate typical user scenarios:
- Establish connection (Online)
- Send message
- Receive reply
- Disconnect (Offline)
Test settings: 100 concurrent user accounts, with each account continuously repeating the above process to simulate high-frequency online/offline scenarios.
Problem Phenomenon
After the stress testing program ran for a while (usually a few minutes to tens of minutes), the following phenomena began to appear:
- New user connection requests had no response, getting stuck at the WebSocket handshake stage.
- Connected users could not go offline normally.
- Server CPU usage was normal, but the number of connections stopped changing.
- The problem temporarily resolved after restarting the `openim-msggateway` service.
Cause
Conclusion first: The processing of online and offline events caused a deadlock between multiple channels.
Below is the detailed explanation.
To understand this problem, we need to first understand the OpenIM connection link.
“Online” here refers to the OpenIM server accepting a long connection established by the client, which is implemented using WebSocket. Once the long connection is established, the client and server can send messages to each other.
There are two servers on the same client long connection link: openim-msggateway-proxy and openim-msggateway.
openim-msggateway-proxy itself has little business logic; it acts as a load balancer for long connections, facilitating the scaling of openim-msggateway.
The problem appeared in openim-msggateway, which is the core service responsible for handling message sending and receiving over client long connections.
Specifically, the dependency between registerChan and unregisterChan in WsServer caused a deadlock between consumption and production.
Let’s look at the relevant code.
`registerChan` and `unregisterChan` are defined as fields of `WsServer`, and both are initialized as buffered channels with a capacity of 1000.
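A simplified sketch of these definitions (the field names and the 1000-slot capacity follow the description above; `Client` and the constructor are illustrative stand-ins, not OpenIM's actual code):

```go
package main

import "fmt"

// Client stands in for OpenIM's per-connection client object (illustrative).
type Client struct{ userID string }

// WsServer: only the two channels relevant to the deadlock are shown.
type WsServer struct {
	registerChan   chan *Client
	unregisterChan chan *Client
}

// NewWsServer initializes both channels as buffered, capacity 1000.
func NewWsServer() *WsServer {
	return &WsServer{
		registerChan:   make(chan *Client, 1000),
		unregisterChan: make(chan *Client, 1000),
	}
}

func main() {
	ws := NewWsServer()
	fmt.Println(cap(ws.registerChan), cap(ws.unregisterChan)) // 1000 1000
}
```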
When a user establishes a WebSocket connection, the handler writes the new client to `ws.registerChan`.
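A hedged sketch of that path (`onConnect` is a hypothetical name for the post-handshake handler, not OpenIM's actual method):

```go
package main

import "fmt"

// Client stands in for OpenIM's per-connection client object (illustrative).
type Client struct{ userID string }

type WsServer struct{ registerChan chan *Client }

// onConnect runs after a successful WebSocket upgrade: the new client is
// queued onto registerChan for the event loop to process.
func (ws *WsServer) onConnect(c *Client) {
	ws.registerChan <- c // blocks the handshake path once the 1000-slot buffer is full
}

func main() {
	ws := &WsServer{registerChan: make(chan *Client, 1000)}
	ws.onConnect(&Client{userID: "u1"})
	fmt.Println(len(ws.registerChan)) // one client queued
}
```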

When the program starts, it starts a goroutine to consume ws.registerChan and ws.unregisterChan.
Note ⚠️: In Go’s select statement, if multiple cases are ready simultaneously, one will be chosen randomly for execution.
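The single-consumer pattern can be sketched as follows (simplified; `handled` is a demo-only channel standing in for the real `registerClient` call):

```go
package main

import "fmt"

type Client struct{ userID string }

type WsServer struct {
	registerChan   chan *Client
	unregisterChan chan *Client
	handled        chan string // demo-only: signals which client was registered
}

// Run launches the single consumer goroutine: ONE select loop drains both
// channels, so when both have data, Go picks a ready case at random.
func (ws *WsServer) Run() {
	go func() {
		for {
			select {
			case c := <-ws.registerChan:
				ws.handled <- c.userID // stand-in for ws.registerClient(c)
			case <-ws.unregisterChan: // stand-in for ws.unregisterClient(c)
			}
		}
	}()
}

func demo() string {
	ws := &WsServer{
		registerChan:   make(chan *Client, 1000),
		unregisterChan: make(chan *Client, 1000),
		handled:        make(chan string),
	}
	ws.Run()
	ws.registerChan <- &Client{userID: "u1"}
	return <-ws.handled
}

func main() {
	fmt.Println(demo()) // u1
}
```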

Consuming `ws.registerChan` triggers a chain of calls. Finally, in `WsServer.UnRegister`, the old client is written to the `ws.unregisterChan` channel.
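The kick chain can be sketched like this (simplified; the map lookup and method bodies are illustrative, not the actual OpenIM implementation):

```go
package main

import "fmt"

type Client struct{ userID string }

type WsServer struct {
	clients        map[string]*Client
	unregisterChan chan *Client
}

// registerClient (simplified): if the same user already has a connection,
// kick the old one; the kick chain ends in UnRegister.
func (ws *WsServer) registerClient(c *Client) {
	if old, ok := ws.clients[c.userID]; ok {
		ws.UnRegister(old) // "kick old connection"
	}
	ws.clients[c.userID] = c
}

// UnRegister enqueues the kicked client. In the original code this send runs
// on the SAME goroutine that is supposed to drain unregisterChan.
func (ws *WsServer) UnRegister(c *Client) {
	ws.unregisterChan <- c
}

func demo() int {
	ws := &WsServer{
		clients:        map[string]*Client{},
		unregisterChan: make(chan *Client, 1000),
	}
	ws.registerClient(&Client{userID: "u1"})
	ws.registerClient(&Client{userID: "u1"}) // second login kicks the first
	return len(ws.unregisterChan)
}

func main() {
	fmt.Println(demo()) // one kicked client queued
}
```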
Through the above code analysis, we can determine the root cause of the deadlock.
The scenario triggered by the stress testing program is:
- A large number of users repeatedly go online and offline, frequently triggering the “kick old connection” logic.
- Dependencies appear:
  - Every time a new connection is processed, it triggers an operation to disconnect an old connection.
  - The disconnect operation needs to write to `unregisterChan`.
  - Messages in `unregisterChan` and `registerChan` are processed randomly.
- Deadlock forms gradually:
  - When `registerChan` continuously has data, `unregisterChan` may not get a chance to be processed.
  - `unregisterChan` gradually fills up (capacity of 1000).
  - New "kick old connection" operations start blocking on writing to `unregisterChan`.
  - The online requests being processed cannot complete, and subsequent requests start piling up.
  - Eventually, `registerChan` also fills up, and the system completely freezes.
Solution
The reason for the deadlock in this scenario is a circular dependency:
- Consumption of `ws.registerChan` depends on `ws.unregisterChan` having write space.
- Consumption of `ws.unregisterChan` depends on `ws.registerChan` being consumed.
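This circular wait can be reproduced in a minimal, self-contained program (names simplified; an unbuffered `unregisterChan` stands in for the real 1000-slot buffer, which only delays the same blocked state):

```go
package main

import (
	"fmt"
	"time"
)

// simulate runs the buggy single-goroutine pattern and reports whether the
// consumer got stuck writing to unregisterChan.
func simulate() string {
	registerChan := make(chan int)
	unregisterChan := make(chan int) // unbuffered for clarity
	stuck := make(chan struct{})

	// One select loop for both channels, as in the original design.
	go func() {
		for {
			select {
			case id := <-registerChan:
				// "Kick old connection": enqueue the old client for
				// unregistration. The only reader of unregisterChan is this
				// same goroutine, which now blocks on the send below.
				close(stuck)
				unregisterChan <- id
			case <-unregisterChan:
			}
		}
	}()

	registerChan <- 1 // first online event
	<-stuck
	time.Sleep(100 * time.Millisecond) // let the consumer block on its send

	select {
	case registerChan <- 2: // would need the consumer back in its select loop
		return "no deadlock"
	default:
		return "deadlocked: consumer stuck writing to unregisterChan"
	}
}

func main() {
	fmt.Println(simulate())
}
```

The consumer blocks on `unregisterChan <- id` while inside its own `select` case, so nothing can ever drain `unregisterChan` again; with the real 1000-slot buffer, enough kicks reach the same state.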

Core Idea
Separate the processing logic of the two interdependent channels into independent goroutines to eliminate circular waiting.
Improved Processing Pattern
- Before: A single goroutine processed both channels randomly, potentially forming a dependency.
- Now: Two independent goroutines process their respective channels in parallel, without affecting each other.
After this improvement:
- Even if `registerChan` has a large amount of pending data, `unregisterChan` can still be processed in time.
- The circular dependency is eliminated, removing the root cause of the deadlock.
- Concurrent processing capability is improved.

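A minimal sketch of the fixed pattern (simplified names, not the actual patch): each channel gets a dedicated consumer goroutine, so the kick's write to `unregisterChan` can always be drained.

```go
package main

import "fmt"

// simulateFixed runs the same register-triggers-kick workload as before, but
// with one goroutine per channel, and confirms it never blocks.
func simulateFixed() string {
	registerChan := make(chan int)
	unregisterChan := make(chan int)
	done := make(chan struct{})

	// Consumer 1: registrations only. "Kicking" still writes to
	// unregisterChan, but that channel now has its own dedicated consumer.
	go func() {
		for id := range registerChan {
			unregisterChan <- id // no longer self-blocking
		}
	}()

	// Consumer 2: unregistrations only.
	go func() {
		n := 0
		for range unregisterChan {
			n++
			if n == 100 {
				close(done)
			}
		}
	}()

	// 100 online events, each triggering a kick of an old connection.
	for i := 0; i < 100; i++ {
		registerChan <- i
	}
	<-done
	return "processed 100 register+kick cycles without blocking"
}

func main() {
	fmt.Println(simulateFixed())
}
```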
Summary and Reflection
This problem is only triggered in an edge case, namely the same batch of users frequently going online and offline, a scenario we hit by coincidence. Before analyzing the cause, we were completely confused: seeing only the surface symptoms, we thought OpenIM had a major problem. It was difficult to locate. I found where each goroutine was blocked by diverting traffic to my local machine and continuously adding logs to the program. It took almost two days to analyze the problem, but only one minute to write the code that fixed it. Quite a surreal experience.