_lhermann’s Twitter Archive, № 5,117

          1. This morning I'm waking up to this in my inbox: Primary server went down at 5:48 AM 😱😱😱
        1. …in reply to @_lhermann
          Ten minutes later my MongoDB cluster auto-scaled to a bigger instance. So I'm thinking: What in the world is happening? 🤔🤔🤔
      1. …in reply to @_lhermann
        So first I'm checking my server stats. CPU, disk I/O, network traffic, everything is spiking at around the same time. I'm thinking DDoS attack. Stagetimer under assault. The enemy at the gate. I'm starting to sweat. 😨😨😨
    1. …in reply to @_lhermann
      So I investigate further. I have to get to the bottom of this now. The database metrics show a grotesque spike in update operations around the same time. You can see when this instance shut down to switch over to a more powerful shard. Maxing out at 1.34k updates per second.
  1. …in reply to @_lhermann
    It's now 9:50 AM and I haven't eaten breakfast. There's only one place that can give me clear answers: The server logs. I see tons of incoming socket connections. And there it is. A room with 150 simultaneous connections. A picture starts to emerge.
    1. …in reply to @_lhermann
      When I check the room, everything becomes clear. A government agency used Stagetimer to run an event and probably shared the link. Hundreds of people started to connect at once, stressing the primary server into slower response times.
      1. …in reply to @_lhermann
        The load balancer detects a timeout and switches over to the failover server. All socket connections get disconnected at once and connect to the failover server, spiking requests.
        1. …in reply to @_lhermann
          The good news? My latest infrastructure work paid off. DB auto-scaling and automatic server failover prevented any downtime longer than 60 seconds. And when I woke up, everything was humming along calmly. Not even a complaint email in our inbox 💪💪💪
          1. …in reply to @_lhermann
            But now I have to do performance week:
            1. Remove the insane number of DB updates on socket connects
            2. Find a way to ease the spike in socket connections on server failover
            3. Probably switch servers to real load-balancing
            4. And avoid having to pay $0.18/h for DB 💀
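
One common way to approach item 2 (easing the reconnect spike when every socket drops at once on failover) is exponential backoff with random jitter on the client side. The thread does not say which approach was chosen, so the following is only a minimal sketch under assumptions: the function names, the 0.5 s base delay, and the 30 s cap are hypothetical, and the 150 clients simply mirror the room size mentioned above.

```python
import random
from typing import List, Optional


def reconnect_delay(
    attempt: int,
    base: float = 0.5,   # hypothetical base delay in seconds
    cap: float = 30.0,   # hypothetical upper bound in seconds
    rng: Optional[random.Random] = None,
) -> float:
    """Return a "full jitter" backoff delay for the given retry attempt.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that were disconnected at the same instant spread their
    reconnects over a window instead of arriving together.
    """
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def simulate_reconnects(clients: int = 150, attempt: int = 3, seed: int = 0) -> List[float]:
    """Simulate the reconnect delays of many clients dropped at once."""
    rng = random.Random(seed)  # seeded only to make the simulation repeatable
    return sorted(reconnect_delay(attempt, rng=rng) for _ in range(clients))


if __name__ == "__main__":
    delays = simulate_reconnects()
    # With attempt=3 the window is 0.5 * 2**3 = 4 seconds wide.
    print(f"{len(delays)} clients reconnect between "
          f"{delays[0]:.2f}s and {delays[-1]:.2f}s after the failover")
```

With these assumed values, 150 clients on their fourth attempt would reconnect somewhere inside a 4-second window rather than in the same instant, which trades a slightly longer reconnect for a flatter load curve on the failover server.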