As you may know if you are familiar with our blog, most people consider us IT's firefighters (and don't get me wrong, I'm not a big fan of middle-of-the-night phone calls).

Usually, the main reason for middle-of-the-night calls is infrastructure, but from time to time, people call us about security issues.

It can be at the code or infrastructure level, but every time, it has a big impact! So we wanted to share how we handle security on our side!

This week is about how we handle the security of our infrastructure. Code security is coming in a second article.

Infrastructure and Network

Let me tell you a story:

"Mister Loïc, I'm sorry, but I am sure that our security was not compromised: we have a CDN with a firewall, an internal firewall, a WAF, a SIEM and an antivirus on all servers."

That's true, it's a great setup, and you probably spent a lot of money on it, but how do you configure it? The answer is always the same: by hand!

Yes, you heard me right: BY HAND!

Rare video of an admin configuring a whole network by hand!

This means that you have no peer review process and no automatic validation. Long story short: no security! Furthermore, instead of focusing on efficient security policies and protocols, these teams focus on buying the Rolls-Royce of software at every level, without checking interoperability! Finally, instead of relying on knowledge or understanding, they rely on marketing material!

The issue, in that case, was that a Jenkins instance was exposed without a password, without SIEM coverage, and without a WAF, because they had lost the main password! They chose to temporarily deactivate the security, but forgot to turn it back on!

But how do we do it at Kalvad to avoid this kind of issue?

Choosing your software!

We are very proud to defend open-source software, and we heavily use and contribute to it!

How do we select our software?

  • We read the code! How can you check the quality better than this?
  • We check the build method: a well-documented build process means that people do not struggle to build it, which means that more people contribute!
  • We require an API: we don't adopt any software that does not have one! It could be a REST API, or something else (socket, WebSocket, gRPC), but we want a way to control it remotely!
  • We check the issues: does it have a lot of CVEs with a CVSS score of 10?

So here are our choices.

Router and Security: OPNsense

OPNsense is based on HardenedBSD, a hardened version of FreeBSD. Obviously, it covers what you expect from any professional router:

  • Firewall
  • MultiWAN
  • VPN (including IPSec and WireGuard)
  • Hardware failover through CARP (comparable to VRRP, or to Cisco's HSRP)
  • OSPF and BGP
  • SD-WAN

It also has some more interesting features, like:

  • IDS with Suricata, including integration with Proofpoint
  • Web Filtering
  • ntopng
  • Full API

Finally, the biggest advantage so far: the price!

The DEC3860 can handle up to 2.5 Gbps of VPN throughput and 17 Gbps of firewall throughput, for around 1,900 USD!

All of this allows us to control and manage our routers through an Infrastructure as Code setup: we store the configuration as JSON, then we parse it and apply it through the OPNsense API!
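
To make the idea concrete, here is a minimal sketch of the "JSON in, API calls out" flow. The JSON layout, the alias names, and the `build_requests` helper are hypothetical, and the two endpoint paths and the payload shape are assumptions to verify against the API documentation of your OPNsense version. The sketch only builds the list of calls; it does not send anything.

```python
import json

# Hypothetical desired state, as it could be stored in Git and peer-reviewed.
CONFIG_JSON = """
{
  "aliases": [
    {"name": "monitoring_hosts", "type": "host", "content": ["10.0.0.10", "10.0.0.11"]},
    {"name": "office_net", "type": "network", "content": ["192.168.10.0/24"]}
  ]
}
"""


def build_requests(config: dict) -> list[dict]:
    """Translate the desired state into an ordered list of API calls."""
    calls = []
    for alias in config.get("aliases", []):
        calls.append({
            "method": "POST",
            # Assumed endpoint: check it against your OPNsense version.
            "path": "/api/firewall/alias/addItem",
            "json": {"alias": {
                "enabled": "1",
                "name": alias["name"],
                "type": alias["type"],
                # Assumption: multiple values are newline-separated.
                "content": "\n".join(alias["content"]),
            }},
        })
    if calls:
        # Assumed endpoint that applies the pending alias changes.
        calls.append({
            "method": "POST",
            "path": "/api/firewall/alias/reconfigure",
            "json": {},
        })
    return calls


plan = build_requests(json.loads(CONFIG_JSON))
for call in plan:
    print(call["method"], call["path"], call["json"])
```

Because the plan is plain data, it can be reviewed in a merge request and validated automatically before anything touches the router, which is exactly what a hand-made configuration cannot offer.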

We are currently working on an Ansible module, but it has been delayed by our involvement in other open-source projects!

DNS and DNS protection

Pi-hole is so far the best software we have encountered if you don't want to manage your DNS by yourself: you set it up on a VM, you configure your DHCP server to hand out the Pi-hole's IP as the DNS server, and it's done! You will:

  • have network-wide protection: no more telemetry leaking confidential data!
  • block in-app advertisements: you reduce your attack surface!
  • improve network performance, as you reduce network usage
  • be able to control it remotely, since an API is available

How do you make it redundant? Very simple!

If, like us, you use the API to control it, you just deploy a second Pi-hole and push the same configuration to it!
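
A minimal sketch of that replication step, assuming the state of each instance has already been fetched into a dictionary of domain sets. The `plan_sync` helper, the list names, and the domains are hypothetical, and the calls that read from and write to each instance are left out because they depend on the API of your Pi-hole version.

```python
def plan_sync(
    primary: dict[str, set[str]],
    replica: dict[str, set[str]],
) -> dict[str, dict[str, list[str]]]:
    """Compute what to add to and remove from the replica, per list name."""
    plan = {}
    for name in sorted(primary.keys() | replica.keys()):
        wanted = primary.get(name, set())
        current = replica.get(name, set())
        to_add = sorted(wanted - current)
        to_remove = sorted(current - wanted)
        if to_add or to_remove:
            plan[name] = {"add": to_add, "remove": to_remove}
    return plan


# Hypothetical state of the two instances.
primary = {
    "deny": {"ads.example.com", "telemetry.example.net"},
    "allow": {"intranet.example.org"},
}
replica = {
    "deny": {"ads.example.com", "old.example.com"},
    "allow": set(),
}

sync_plan = plan_sync(primary, replica)
print(sync_plan)
```

An empty plan means the two instances already agree, so the same function doubles as a drift check that can run on a schedule.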

WAF

The WAF is an antiquated technology that was created to help stem the rise of application security vulnerabilities, which had been overwhelming organizations with their frequency of discovery.

PCI requirement 6.6 states that you have to either have a WAF in place or do a thorough code review on every change to the application. Because most organizations prefer to solve problems the quick and dirty way, they chose to put a WAF in place. But let's be clear: if you change your application, you need to change the configuration of the WAF! In the end, ditch your WAF, it's useless, and focus on reviewing your code!

SIEM

How many times have I seen a SIEM with rules like:

  • Excessive firewall denies
  • Multiple login failures
  • Excessive firewall accepts
  • Outbound communication to known malicious sites

The problem is that these rules generate a lot of useless alerts, which I would classify as policy violations, instead of pointing at the real security issues.

Our best recommendation is to ditch it, and have a traditional log system, with monitoring alerts on top!

We highly recommend Alerta, and raising alerts only when an attack actually succeeds!

It will make your alerting pipeline less busy, and it is going to let your team focus on real issues!
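
As an illustration of "alert only on success", here is a minimal sketch that ignores failed logins on their own and raises an alert only when a login succeeds right after a burst of failures. The event format, the threshold, and the `successful_bruteforce_alerts` helper are hypothetical; the alert dictionary is modeled on Alerta's alert fields (`resource`, `event`, `environment`, `severity`, `service`, `text`), which you should check against your Alerta version before posting anything.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 5  # assumed value: tune it to your environment


def successful_bruteforce_alerts(events: list[dict]) -> list[dict]:
    """Return one alert per login success that follows a burst of failures."""
    failures = defaultdict(int)
    alerts = []
    for event in events:  # events are assumed to be in chronological order
        key = (event["source_ip"], event["user"])
        if event["outcome"] == "failure":
            failures[key] += 1
        elif event["outcome"] == "success":
            if failures[key] >= FAILURE_THRESHOLD:
                alerts.append({
                    "resource": event["host"],
                    "event": "BruteForceSucceeded",
                    "environment": "Production",
                    "severity": "critical",
                    "service": ["ssh"],
                    "text": (
                        f"{event['user']} logged in from {event['source_ip']} "
                        f"after {failures[key]} failed attempts"
                    ),
                })
            failures[key] = 0  # a success resets the counter
    return alerts


# Hypothetical log events: only the first source ends with a successful login
# after enough failures to be suspicious.
events = (
    [{"host": "lb-0", "source_ip": "203.0.113.7", "user": "root", "outcome": "failure"}] * 6
    + [{"host": "lb-0", "source_ip": "203.0.113.7", "user": "root", "outcome": "success"}]
    + [{"host": "lb-1", "source_ip": "198.51.100.4", "user": "deploy", "outcome": "failure"}] * 10
    + [
        {"host": "lb-2", "source_ip": "192.0.2.9", "user": "loic", "outcome": "failure"},
        {"host": "lb-2", "source_ip": "192.0.2.9", "user": "loic", "outcome": "success"},
    ]
)

alerts = successful_bruteforce_alerts(events)
for alert in alerts:
    print(alert["severity"], alert["resource"], alert["text"])
```

The ten failed attempts against lb-1 stay in the logs for later analysis, but nobody gets paged for them.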

Maintaining your infra

Once this is set up, the hardest part is maintaining it. How many times have I seen an infrastructure checked at creation, then never updated afterwards?

To be honest, that's where I wanted to go: when you control the software, you control the release cycle, so you control your security.

A security system is like a chain: it's only as strong as its weakest link!

So we want to be able to update, improve, and automate our systems while reducing the human risk, and here comes the solution!

Our maintenance system is built around two pieces of software:

  • Nautobot
  • Ansible

Nautobot

Nautobot is a fork of NetBox that we use to maintain our whole inventory and to store additional information about our systems.

We use it as our source of truth, so you can easily understand what is where, how it is connected, who has access, etc.

It also has a very nice feature: it can provide an inventory to Ansible, so Ansible can read the information (like IP, user, software, version, etc.) directly from it.

It also provides a REST API that you can work with, especially to check the version of your appliances and software.

Let me give you an example: we have an infrastructure with 3 HAProxy servers. These 3 HAProxy servers are stored inside the inventory as lb-0, lb-1, and lb-2.

Every day, when we launch our Ansible playbook, the HAProxy configuration is rebuilt from scratch, then the version of HAProxy is sent to Nautobot. This means that if one day we want to check which version is deployed, we just look in the Nautobot web UI. Through the Nautobot API, we can also easily do a full audit of our infrastructure and check whether a CVE is still present.
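
A minimal sketch of the "send the version to Nautobot" step, written as a standalone Python helper rather than as a playbook task. It assumes that the load balancers are stored as devices and that the version lives in a custom field named `haproxy_version`; that field name, the `build_version_update` helper, the URL, the token, and the device UUID are all placeholders. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_version_update(
    base_url: str, token: str, device_id: str, version: str
) -> urllib.request.Request:
    """Prepare (but do not send) a PATCH that records the deployed version."""
    # Hypothetical custom field: create it in Nautobot before using it.
    body = {"custom_fields": {"haproxy_version": version}}
    return urllib.request.Request(
        url=f"{base_url}/api/dcim/devices/{device_id}/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


request = build_version_update(
    base_url="https://nautobot.example.com",  # placeholder URL
    token="changeme",  # placeholder API token
    device_id="00000000-0000-0000-0000-000000000000",  # placeholder UUID for lb-0
    version="2.6.9",  # illustrative version string
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the update; keeping the two steps separate makes the payload easy to inspect in a dry run.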

Ansible

Ansible, even if we are starting to like it less and less, is the most powerful tool so far to deploy software and configuration. As mentioned earlier, to build infrastructure, we prefer to use Terraform, but to configure servers, switches, and VMs, Ansible is the tool to use!

As discussed earlier, it integrates very well with Nautobot, but it also has some extra capabilities. Let me give you an example:

Behind our 3 load balancers, we have 6 RabbitMQ nodes that we want to share between all applications. How do you manage user creation? Vhost creation? Most people will just have one admin account, log in with it, and create the new vhost and the new user. On our side, we prefer to store it in a different way: we have a list of accounts following this format:

[
{"vhost": "loic", "account": "loic", "email": "loic@kalvad.com", "reset_password": false},
...
]

Our system is going to create a vhost "loic" and an account "loic", then it will send an email with an autogenerated password to "loic@kalvad.com", asking the person to change the password.

If one day the password is leaked, you just change reset_password to true, and we are going to reset the password for the user.
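
A minimal sketch of how such a list could be turned into actions. The `plan_account` helper, the second account, and the full permissions on the account's own vhost are assumptions made for illustration; the paths follow the RabbitMQ management HTTP API (`/api/vhosts`, `/api/users`, `/api/permissions`), and nothing is sent here, neither the HTTP calls nor the email.

```python
import json
import secrets
from urllib.parse import quote

# The first entry comes from the format above; the second one is hypothetical.
ACCOUNTS_JSON = """
[
  {"vhost": "loic", "account": "loic", "email": "loic@kalvad.com", "reset_password": false},
  {"vhost": "billing", "account": "billing", "email": "billing@example.com", "reset_password": true}
]
"""


def plan_account(entry: dict, existing_users: set[str]) -> list[dict]:
    """List the steps needed to bring one account to the desired state."""
    vhost = quote(entry["vhost"], safe="")
    user = quote(entry["account"], safe="")
    # Declaring a vhost that already exists is harmless, so always include it.
    steps = [{"method": "PUT", "path": f"/api/vhosts/{vhost}", "json": {}}]
    is_new = entry["account"] not in existing_users
    if is_new or entry["reset_password"]:
        password = secrets.token_urlsafe(24)  # autogenerated, never stored in Git
        steps.append({
            "method": "PUT",
            "path": f"/api/users/{user}",
            "json": {"password": password, "tags": ""},
        })
        steps.append({
            "action": "email",
            "to": entry["email"],
            "reason": "new account" if is_new else "password reset",
        })
    # Assumption: each account gets full rights on its own vhost only.
    steps.append({
        "method": "PUT",
        "path": f"/api/permissions/{vhost}/{user}",
        "json": {"configure": ".*", "write": ".*", "read": ".*"},
    })
    return steps


plans = {
    entry["account"]: plan_account(entry, existing_users={"billing"})
    for entry in json.loads(ACCOUNTS_JSON)
}
for account, steps in plans.items():
    for step in steps:
        # Print the step without leaking the generated password.
        print(account, step.get("method", step.get("action")), step.get("path", step.get("to")))
```

After a reset has been applied, the flag would need to go back to false in the list, otherwise the password would be regenerated on every run.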

You can imagine the same scenario for your HP switch, where you configure the VLANs according to your needs and security policies.

Emergency Scenario

The web is on fire, Apache HTTPD has 2 major CVEs in 1 day. How do we handle it?

First of all, we receive an alert from multiple mailing lists and RSS feeds (we mostly rely on Arch Linux, so it's the main source of truth for us). We check which versions are impacted, then we prepare a small script against our Nautobot inventory to check whether we have vulnerable versions. The answer is yes, we have 25 vulnerable servers. So how do we react? We just connect to our Ansible Semaphore and run the Apache upgrade playbook against all Apache servers. Problem solved. Total time of the operation: 15 minutes!
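
The "small script" could look like the following minimal sketch. It assumes the inventory has already been exported from Nautobot as a list of records; the record layout, the host names, the version numbers, and the `find_vulnerable` helper are illustrative only.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.4.51' into (2, 4, 51) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_vulnerable(inventory: list[dict], software: str, fixed_in: str) -> list[str]:
    """Return the hosts running `software` in a version older than `fixed_in`."""
    fixed = parse_version(fixed_in)
    return sorted(
        host["name"]
        for host in inventory
        if host["software"] == software and parse_version(host["version"]) < fixed
    )


# Hypothetical export of the inventory.
inventory = [
    {"name": "web-0", "software": "apache", "version": "2.4.49"},
    {"name": "web-1", "software": "apache", "version": "2.4.51"},
    {"name": "web-2", "software": "apache", "version": "2.4.9"},
    {"name": "lb-0", "software": "haproxy", "version": "2.4.0"},
]

vulnerable = find_vulnerable(inventory, "apache", fixed_in="2.4.51")
print(",".join(vulnerable))
```

Comparing tuples of integers matters here: as plain strings, "2.4.9" would sort after "2.4.51" and web-2 would be missed. The comma-separated output can be passed to the `--limit` option of `ansible-playbook` to patch only the affected hosts.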

Let's imagine that we didn't have all of our tools: we don't get the alert on time, so we don't know that we need to patch! We don't have an inventory, so we don't know which servers are running Apache, and in which version. Finally, we start to SSH into every server and upgrade them one by one. Total time of the operation: 3 days!

Conclusion

Of course, we could not cover the entirety of how we manage our infrastructure at scale, but by starting to use these tools, methods, and software, you should improve your security a lot!