Solving the Mystery of Warehouse Device Disconnects

In our warehouse, we use handheld devices to verify that we are picking the correct books for each order. Our current devices are Android phones with a barcode scanner. The devices use a VPN (virtual private network) to securely connect to our systems. Several weeks ago, these devices started having connection issues, and the disruptions became more and more frequent.
When one of the devices had a disruption, the user's access to the main system would lock up. The session would have to be reset on the server, and sometimes the device would also have to be reset on the server. The user would then have to reconnect, figure out where they were in the order, and resume. Each connection issue cost 5 to 15 minutes. Sometimes the next disruption would occur before the user was even back in the order. It was quite frustrating.
This particular issue proved to be a challenge to figure out. At first, the disconnects had no obvious correlation. They seemed rather random. We started logging more details about which devices were experiencing the error, when the issue occurred, where they were in the building, what they were picking, etc. We were gathering information so we could look for a common cause.
As the frequency of the disconnects increased, one pattern we started to see was that several of the handheld (RF) devices would drop at the same time. This was not true from the beginning of the issue, but there was a sudden change, and multiple devices disconnecting simultaneously became a common occurrence. This seemed to suggest an issue with a WiFi access point (AP). We started monitoring the APs and quickly found that one of them was having some issues. We swapped that AP with a spare to see if it would make any difference. Things improved a lot. There were still occasional disconnects, but the frequency dropped so much that it seemed fine.
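The "several devices dropping at the same time" pattern can be pulled out of a disconnect log mechanically. The following is a minimal sketch, not the tooling we actually used: the function name, the 30-second window, the device IDs, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_simultaneous_drops(events, window_seconds=30, min_devices=2):
    """Group disconnect events into bursts and keep bursts that hit several devices.

    events: iterable of (timestamp, device_id) tuples (hypothetical log format).
    A burst is a run of events in which each event occurs within
    window_seconds of the previous one.
    """
    window = timedelta(seconds=window_seconds)
    bursts = []
    current = []
    for ts, device in sorted(events):
        # A gap longer than the window closes the current burst.
        if current and ts - current[-1][0] > window:
            bursts.append(current)
            current = []
        current.append((ts, device))
    if current:
        bursts.append(current)

    result = []
    for burst in bursts:
        devices = sorted({device for _, device in burst})
        if len(devices) >= min_devices:
            # Report the burst start time and the distinct devices involved.
            result.append((burst[0][0], devices))
    return result


# Invented sample data: three devices drop within seconds of each other,
# then one device drops on its own later in the morning.
sample_events = [
    (datetime(2024, 1, 8, 9, 0, 5), "rf-03"),
    (datetime(2024, 1, 8, 9, 0, 12), "rf-07"),
    (datetime(2024, 1, 8, 9, 0, 20), "rf-01"),
    (datetime(2024, 1, 8, 11, 30, 0), "rf-03"),
]
drops = find_simultaneous_drops(sample_events)
```

Bursts that involve several devices point toward shared infrastructure such as an AP, while isolated single-device drops point toward the device or its VPN session.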
The improvement lasted only a few days. The following week, the disconnects became frequent again. This time, however, the issues were different enough that they did not seem to be related to a WiFi AP. The devices were remaining connected to the WiFi but having to reconnect to the VPN. Also, devices were disconnecting one at a time again rather than all at once. It was beginning to look like the WiFi AP was not the original cause of the problem, and that the AP was an additional problem that had a similar symptom.
With the data pointing more toward the VPN this time, we went to check the VPN logs. We found that we could not access the VPN server admin interface. Upon further checking, we found we were in the middle of an ongoing brute-force attack on the VPN server. We got a list of all IP addresses involved in the attack and blocked them at the firewall. The attack stopped, and VPN services returned to normal. The volume of brute-force traffic had been too much for the server and had slowed things down to the point of not working. We reconfigured settings on the VPN server to help prevent a similar issue in the future. With the attack stopped and VPN services back to normal, the warehouse device disconnects again became less frequent, but they were still happening.
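Building the list of attacking addresses is essentially a counting exercise. Here is a minimal sketch, assuming the authentication log has already been reduced to (source IP, succeeded) pairs. That input shape, the function name, and the threshold are hypothetical, and the addresses come from the reserved documentation ranges rather than from our logs.

```python
from collections import Counter


def suspicious_sources(attempts, threshold=20):
    """Return source IPs with at least `threshold` failed logins, busiest first.

    attempts: iterable of (source_ip, succeeded) tuples, a hypothetical
    simplification of whatever the VPN server's authentication log records.
    """
    failures = Counter(ip for ip, succeeded in attempts if not succeeded)
    return [ip for ip, count in failures.most_common() if count >= threshold]


# Invented sample data using documentation-only address ranges.
sample_attempts = (
    [("203.0.113.5", False)] * 50    # heavy brute-force source
    + [("198.51.100.9", False)] * 3  # a few mistyped passwords
    + [("192.0.2.10", True)] * 5     # normal successful logins
)
block_list = suspicious_sources(sample_attempts)
```

The threshold keeps a legitimate user with a few mistyped passwords off the block list. The right value depends on how noisy normal traffic is.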
We continued logging information with each occurrence. After replacing the one WiFi AP, we had ordered replacements for the rest, as all of the AP units were over five years old. When the new WiFi APs arrived, we replaced the least-used AP just to test and make sure the newest model AP wasn’t going to cause new problems. A few days passed with the new AP running, and things were still fine. Then one Monday, things quickly went from normal to worse than ever. Since I had the new APs, I worked to get the rest of the old APs swapped out. This had no impact on the issues this time.
We had been monitoring the VPN server more closely because of the brute-force attack but decided to triple-check that there was not a different type of attack happening. What we found were odd error messages in the admin interface. It looked like a configuration issue. Researching this, we found that an upgrade over the weekend had failed. We have multiple systems in a VPN cluster, and one of them was out of sync with the others. Not only had the upgrade failed, but the system had also not sent any alert about the problem. We worked on getting the one VPN server upgraded, and once that was complete, the errors went away and we could access the admin interface again. While getting the VPN cluster back in sync was needed, it was not related to the warehouse device disconnect issue. The VPN server was working fine. The warehouse devices were just reconnecting for no obvious reason.
We turned our focus to the warehouse device that was having the disconnect issue most frequently. It happened to be the oldest RF device we had in use. I had a new device available, so I replaced the oldest device with a new model. The new model worked great. The other devices continued to have disconnect issues. We kept logging information and investigating possible causes. As we gathered more and more information, we started seeing that the older the device, the more frequent the disconnects, and the newer the device, the less frequent the disconnects. The one new device was working well, with none of the recurring disconnects.
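The age pattern becomes easier to see when the logged disconnects are tallied per device and sorted by device age. This is a minimal sketch with hypothetical names and invented numbers, not our actual data.

```python
from collections import Counter


def disconnects_by_age(disconnect_device_ids, device_age_years, days_observed):
    """Return (age_years, device_id, disconnects_per_day) rows, oldest device first.

    disconnect_device_ids: one device ID per logged disconnect.
    device_age_years: mapping of device ID to age in years.
    """
    counts = Counter(disconnect_device_ids)
    rows = [
        (age, device, counts.get(device, 0) / days_observed)
        for device, age in device_age_years.items()
    ]
    return sorted(rows, reverse=True)


# Invented sample data: a six-year-old, a three-year-old, and a new device.
sample_ages = {"rf-01": 6, "rf-04": 3, "rf-09": 0}
sample_log = ["rf-01"] * 12 + ["rf-04"] * 4
age_report = disconnects_by_age(sample_log, sample_ages, days_observed=4)
```

Devices with no logged disconnects still appear in the report with a rate of zero, which matters here because the new device's clean record was the clue.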
We had not forgotten that the VPN server software was updated to a new version right when the disconnect issues became worse. We kept going back to that, but all the documentation and technical information available said the new server version was compatible with the client software on the warehouse devices. Since the client devices could connect to the VPN server and work as expected, it appeared the updated VPN server was compatible. However, as we continued to collect information, we were seeing that there would be a VPN reconnect at the time of the issue, but there was no WiFi issue at the same time. It looked like the VPN connection just reset for no obvious reason.
As the data was pointing to an issue with the VPN, we decided we had to set up a different VPN platform to see if we could confirm whether the VPN was the issue. We did some searching for other VPN options that would meet our needs and work on the devices we have. We found an option called WireGuard. We set up a WireGuard server and configured some test devices to use it, made sure that everything worked as we needed for our systems, and then reconfigured a couple of the warehouse devices in use on a daily basis to use the new VPN. We did one older device and one somewhat newer device. Both worked well. They did not have the frequent disconnect issues. The other devices still using the original VPN software were still having frequent issues. Each day, we moved more devices over to the new VPN, and by the third day, we moved all devices to the new VPN. The frequent connection issues were resolved.
As things go, after we had decided the issue had to be with the original VPN software and we had tested and moved the devices to WireGuard, we found this updated information about the original VPN software:
For security purposes, servers renegotiate encryption keys at regular intervals. An older Android client may have a bug that prevents it from properly handling this key refresh, leading to a disconnection.
This was it. Our warehouse devices use the “older Android client” because a newer one is not available. The one new device where we could run the newest Android client was not having the issue. While this issue has been present for a while and only occasionally caused a problem, the recent update to the VPN server appears to have made this issue happen more frequently.
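In hindsight, a key-refresh bug suggests a check that could be run against the logged data: if sessions tend to end close to a whole multiple of the server's renegotiation interval, the key refresh is a likely trigger. The following is a minimal sketch of that idea and not something we ran. The one-hour interval, the tolerance, and the sample session lengths are assumptions for illustration, since the real interval depends on the VPN server's configuration.

```python
def fraction_near_rekey(session_lengths_s, rekey_interval_s=3600, tolerance_s=60):
    """Fraction of sessions that ended within tolerance_s of a rekey time.

    session_lengths_s: how long each VPN session lasted, in seconds.
    rekey_interval_s: assumed renegotiation interval (hypothetical default).
    """
    if not session_lengths_s:
        return 0.0
    hits = 0
    for length in session_lengths_s:
        # Distance from the nearest whole multiple of the interval.
        remainder = length % rekey_interval_s
        distance = min(remainder, rekey_interval_s - remainder)
        # Skip sessions that ended before the first rekey could have happened.
        reached_first_rekey = length >= rekey_interval_s - tolerance_s
        if reached_first_rekey and distance <= tolerance_s:
            hits += 1
    return hits / len(session_lengths_s)


# Invented sample data: three sessions end near a one-hour boundary, one does not.
sample_lengths = [3605, 7190, 1800, 3590]
near_rekey = fraction_near_rekey(sample_lengths)
```

A fraction well above what random disconnect times would produce points toward the renegotiation step, while a low fraction points elsewhere.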
Short term, we will continue to use the WireGuard VPN, as it is working well with the older Android devices that we have. If we are able to update the devices in use in the warehouse with newer Android devices, then we can evaluate which VPN solution is better for us, as both VPN solutions work great on newer devices.
The warehouse crew was very patient while these issues were happening. It would have been great to have figured this out before our busiest shipping season, but unfortunately, we did not get this resolved until the end of August. A couple of weeks into the new setup, the new VPN tunnel was still working well for the warehouse.