The future of the internet is static! AKA how we scaled to 10 million users in a minute without crashing

We usually define our company as a pure tech agency, but most of our customers treat us like IT firefighters: they call us when the s**t is hitting the fan and the house is burning.

Introduction

Almost a year ago, I got one of those middle-of-the-night phone calls that I'm ambivalent about:

  • I love them because they mean a rush, and I'm an intellectual adrenaline junkie
  • I hate them because my wife wants to kill me

Back on track: a company was launching a new project, a fundraising platform for a charity, in a few hours, and they were going to fail because their system was not scalable. Why? Let's see.

The website had 6 pages:

  • home page, where people land; it should also show how much has been raised
  • donation page, where people can choose how much they want to give
  • success page, where people land after a successful payment
  • error page, where people land after an unsuccessful payment
  • contact page, which I think is self-explanatory
  • about us page

As you can see, of all these pages, only a few are dynamic:

  • Home page, with the amount raised
  • Success and error pages

The original website

The entire website was based on PHP, with a MySQL database, sessions stored in MySQL, and a lot of content that was dynamic for no reason (all translations, etc.).

Of course, we helped as much as we could by:

  • moving sessions to Redis
  • modifying the code to make it more efficient
  • adding a DB cache
  • spinning up new servers to get a bigger cluster

Unfortunately, even with some big improvements, it still could not handle the load, and on top of that, it cost a fortune to host!

So this year, they asked us to deliver a brand new solution, and we did!

Kalvad's Power

We had some metrics from the previous launch, so we knew what to expect.

We decided not to follow the same programming pattern at all:

  • We hate PHP: we think it's outdated and that nobody should use it in 2021
  • We love the planet Earth, so we don't want to spend more energy and money than required
  • We wanted something fast enough to answer in less than 10 ms, even under high traffic

Changing paradigm

Our first idea was to change the paradigm: no more big PHP cluster, welcome Hugo!

Hugo is one of the most popular open-source static site generators. With its amazing speed and flexibility, Hugo makes building websites fun again.

Why? Because we didn't need the big guns:

  • the donation amount could just be a separate API
  • the payment platform that we used (SmartDubai) was amazing, as it's based only on redirects
  • you want to reduce your security exposure: why have more dynamic content when you can just generate some HTML, CSS and JS, with no server-side code to exploit?
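To make the first point concrete, here is a minimal sketch of what "the donation amount as a separate API" can look like. It is an illustration only, written with Python's standard library and not taken from the production code; the `/total` route, the `raised` field, the `STATE` counter and the port are all made-up names. The idea is simply that the static pages stay static and ask a tiny JSON endpoint for the one number that changes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter; a real service would read the total
# from a database or update it from payment callbacks.
STATE = {"raised": 0}


def total_payload() -> bytes:
    """Serialize the current total as a small JSON document."""
    return json.dumps({"raised": STATE["raised"]}).encode("utf-8")


class TotalHandler(BaseHTTPRequestHandler):
    """Answers GET /total (a made-up route) and nothing else."""

    def do_GET(self):
        if self.path != "/total":
            self.send_error(404)
            return
        body = total_payload()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # The static pages are served from another origin,
        # so allow cross-origin reads of this endpoint.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server without starting it."""
    return HTTPServer(("127.0.0.1", port), TotalHandler)
```

Calling `make_server().serve_forever()` starts it, and the home page then only needs a few lines of JavaScript to fetch `/total` and display the number. Everything else on the site can be plain generated files.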

We still needed an API, so we chose an amazing language and framework: Elixir and Phoenix.

Elixir is a dynamic, functional language for building scalable and maintainable applications.
Elixir leverages the Erlang VM, known for running low-latency, distributed, and fault-tolerant systems. Elixir is successfully used in web development, embedded software, data ingestion, and multimedia processing, across a wide range of industries.

Why didn't we choose Go, Rust, or Java?

  • Elixir has a very clear syntax for a functional programming language (Hello Rust)
  • Erlang/OTP and its VM are, in our opinion, among the most impressive pieces of software in our industry
  • Phoenix is inspired by Ruby on Rails, so the syntax is very easy and you can go from prototype to production very fast
  • OTP is super reliable, even under heavy load

Architecture

This diagram represents the default configuration, but since we deploy all our applications on Clever Cloud, we had auto-scaling in place (we were able to go up to 40 servers per cluster, each with 16 CPUs and 32 GB of RAM).

Load Testing

We wanted to prove the performance of the system, so we ran a load test with an amazing tool: Locust (an article explaining how we use it is already in progress and should be released Soon™).

Long story short: we were able to handle 10 million users during our load test, each making one request per second on the home page, without any downtime.

Real Numbers

Static Website (Hugo)

  • We got 22 million visitors in the first 10 minutes.
  • We got over 85 million unique visitors.
  • We got 7.934 billion requests.
  • The average response time was 4 ms.
  • The cost of the static hosting for the 45 days was 22 USD.

API (Elixir)

On the Elixir side, during the entire 45 days of the fundraiser, we saw:

  • some attacks (yes, some people go after charity websites), with no harm done
  • 132 million requests
  • only 2 HTTP 500 errors (detected through Sentry and fixed in 10 minutes)
  • an average time per request of 243 ms (as most requests had to communicate with the payment gateway)
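To put these totals in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above for the static site and the API; the variable names are ours, and the results are plain averages over the 45 days, so they say nothing about the actual peaks.

```python
SECONDS = 45 * 24 * 60 * 60        # length of the campaign in seconds

static_requests = 7_934_000_000    # requests served by the static site
static_cost_usd = 22               # static hosting bill for the 45 days
early_visitors = 22_000_000        # visitors in the first 10 minutes
api_requests = 132_000_000         # requests handled by the API
api_errors = 2                     # HTTP 500 responses

# Average load over the whole campaign
static_avg_rps = static_requests / SECONDS
api_avg_rps = api_requests / SECONDS

# Arrival rate during the launch rush (visitors, not requests)
early_visitors_per_s = early_visitors / (10 * 60)

# Hosting cost and reliability, normalized
usd_per_million = static_cost_usd / (static_requests / 1_000_000)
requests_per_error = api_requests // api_errors

print(f"static site: {static_avg_rps:,.0f} requests/s on average")
print(f"API: {api_avg_rps:,.0f} requests/s on average")
print(f"launch: {early_visitors_per_s:,.0f} visitors/s in the first 10 minutes")
print(f"cost: {usd_per_million:.4f} USD per million static requests")
print(f"errors: one HTTP 500 per {requests_per_error:,} API requests")
```

That works out to roughly 2,040 requests per second on the static site and about 34 on the API on average, close to 36,700 new visitors per second during the first ten minutes, less than 0.003 USD per million static requests, and one HTTP 500 for every 66 million API requests.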

Conclusion

We could have launched a 200-node Kubernetes cluster to solve the original issue, like most people do these days, but I love Earth, I love efficiency, and I hate fixing my own issues.

Furthermore, we, at Kalvad, think that elegance is important, even with code and infrastructure, and our mantra is clear: Excelsior.

If you have a problem, if no one else can help, and if you can find them, maybe you can hire the Kalvad-Team.
