_lhermann’s Twitter Archive—№ 5,117

This morning I'm waking up to this in my inbox: Primary server went down at 5:48 AM 😱😱😱
Permalink On twitter.com ❤️ 40 Favorites 2023 Jan 11 Mood 0

…in reply to @_lhermann
Ten minutes later my MongoDB cluster auto-scaled to a bigger instance. So I'm thinking: What in the world is happening? 🤔🤔🤔
Permalink On twitter.com ❤️ 2 Favorites 2023 Jan 11 Mood 0

…in reply to @_lhermann
So first I'm checking my server stats. CPU, disk I/O, network traffic: everything is spiking at around the same time. I'm thinking DDoS attack. Stagetimer under assault. The enemy at the gate. I'm starting to sweat. 😨😨😨
Permalink On twitter.com ❤️ 2 Favorites 2023 Jan 11 Mood -5 🙁

…in reply to @_lhermann
So I investigate further. I have to get to the bottom of this now. The database metrics show a grotesque spike in update operations around the same time. You can see when this instance shut down to switch over to a more powerful shard. Maxing out at 1.34k updates per second.
Permalink On twitter.com ❤️ 2 Favorites 2023 Jan 11 Mood +2 🙂

…in reply to @_lhermann
It's now 9:50 AM and I haven't eaten breakfast. There's only one place that can give me clear answers: The server logs. I see tons of incoming socket connections. And there it is. A room with 150 simultaneous connections. A picture starts to emerge.
On twitter.com ❤️ 3 Favorites 2023 Jan 11 Mood +1 🙂

…in reply to @_lhermann
When I check the room, everything becomes clear. A government agency used Stagetimer to run an event and probably shared the link. Hundreds of people started to connect at once, stressing the primary server into slower response times.
Permalink On twitter.com ❤️ 6 Favorites 2023 Jan 11 Mood +2 🙂

…in reply to @_lhermann
The load balancer detects a timeout and switches over to the failover server. All socket connections get disconnected at once and connect to the failover server, spiking requests.
Permalink On twitter.com ❤️ 4 Favorites 2023 Jan 11 Mood 0
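One common way to soften this kind of reconnect stampede is to have clients wait a randomized, growing delay before reconnecting, so they don't all hit the failover server in the same instant. The sketch below shows exponential backoff with full jitter. It is not taken from Stagetimer's code: the function name `reconnect_delay` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a randomized wait (in seconds) before reconnect attempt `attempt`.

    Hypothetical helper: `base` and `cap` are illustrative values, not
    settings from any real deployment. The ceiling doubles with each attempt
    up to `cap`, and the actual delay is drawn uniformly from [0, ceiling]
    ("full jitter") so that many clients spread out over time.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


if __name__ == "__main__":
    # 150 clients that all lost their connection at the same moment:
    # on their third retry they are spread over a 4-second window
    # instead of reconnecting simultaneously.
    delays = sorted(reconnect_delay(3) for _ in range(150))
    print(f"earliest retry after {delays[0]:.2f}s, latest after {delays[-1]:.2f}s")
```

Passing `rng` as a parameter keeps the function easy to test with a fixed value. Many socket client libraries ship their own reconnection settings, so in practice this would more likely be a configuration change than hand-written code.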

…in reply to @_lhermann
The good news? My latest infrastructure work paid off. DB auto-scaling and automatic server failover prevented any downtime longer than 60 seconds. And when I woke up, everything was humming along calmly. Not even a complaint email in our inbox 💪💪💪
Permalink On twitter.com ❤️ 13 Favorites 2023 Jan 11 Mood 0

…in reply to @_lhermann
But now I have to do performance week:
1. Remove the insane number of DB updates on socket connects
2. Find a way to ease the spike in socket connections on server failover
3. Probably switch servers to real load-balancing
4. And avoid having to pay $0.18/h for DB 💀
Permalink On twitter.com ❤️ 12 Favorites 2023 Jan 11 Mood -4 🙁
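For the first item on that list, one possible approach is to stop writing to the database on every socket connect and instead coalesce changes in memory, flushing them at most once per interval. The thread does not say what the updates actually contain, so this sketch assumes they track a per-room connection count. The class `ConnectionCountBuffer`, the callback `write_fn`, and the one-second interval are all hypothetical.

```python
import time
from collections import defaultdict


class ConnectionCountBuffer:
    """Coalesce per-room connection changes into at most one write per interval.

    Hypothetical sketch: `write_fn(room_id, delta)` stands in for whatever
    database update the server would otherwise run on every connect.
    """

    def __init__(self, write_fn, flush_interval=1.0, clock=time.monotonic):
        self._write_fn = write_fn
        self._flush_interval = flush_interval
        self._clock = clock
        self._pending = defaultdict(int)  # room_id -> net change since last flush
        self._last_flush = clock()

    def on_connect(self, room_id):
        self._pending[room_id] += 1
        self._maybe_flush()

    def on_disconnect(self, room_id):
        self._pending[room_id] -= 1
        self._maybe_flush()

    def _maybe_flush(self):
        # Only flush once the interval has elapsed since the previous flush.
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        # One write per room with a non-zero net change, then reset the buffer.
        for room_id, delta in self._pending.items():
            if delta != 0:
                self._write_fn(room_id, delta)
        self._pending.clear()
        self._last_flush = self._clock()
```

With this shape, 150 connects arriving within the same second would result in a single write of `+150` rather than 150 separate updates. The trade-off is that the stored count can lag by up to one interval, and a real server would also need a timer to flush when no further events arrive.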