Why and How to Implement Seccomp Policies

Explaining the use cases and how to implement an under-discussed Kubernetes feature

Oct 10, 2024

If you look up seccomp profiles in the Kubernetes dictionary, you’d see one line: big scary. Despite being scary, seccomp profiles are also arguably the single most impactful security control you can implement in a cluster. At a high level, it’s a big security payoff that comes with real implementation challenges, which is why it’s somewhat rare to see. Seccomp profiles restrict the system calls that a container can make. In other words, they function as a process-level allowlist, explicitly stating which kernel functionality a container can use.

The above image shows some common syscalls your container might make while it’s running. If you think about these syscalls, you may already be getting some ideas for security: does your app need to start new processes? Does it need to read files? The security possibilities are complex and nearly endless.

In this article, we’ll walk through seccomp profiles, their use cases, and how to implement and maintain them. I worked with ARMO Security on this article, because they’re making the process a lot easier by using their eBPF agent to generate these profiles on the fly - so we’ll also discuss that approach and what makes it neat.

WTF is Seccomp?

Seccomp profiles come in three flavors: RuntimeDefault, Localhost, and Unconfined. These flavors tell the container runtime where to find the seccomp profile to apply, and therefore which syscalls the container is restricted to. The profile type is set in the securityContext of your deployment YAML, and we’ll go through each below with an example implementation. Here’s a quick TL;DR on each:

RuntimeDefault is an attempt to block some super high-risk syscalls that are also very uncommon. It’s meant to be a good baseline for blocking stuff.
Localhost is customizing a seccomp profile to your heart’s content. That’s easier with a tool like ARMO, but harder if you’re trying it yourself.
Unconfined is the same as no seccomp - it’s not blocking anything.

Runtime Default

securityContext:
  seccompProfile:
    type: RuntimeDefault

The first profile type, RuntimeDefault, uses the seccomp profile provided by your container runtime, usually either containerd or Docker. This default profile blocks a number of dangerous but rarely needed syscalls, such as kexec_load (loading a new kernel) and, on older kernels, ptrace, which is usually only used for debugging or spying on other processes. The full list of blocked syscalls is here.

The idea is for this profile to be broadly applied across your cluster, which is why it’s the example in the official Kubernetes documentation.
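If you want to roll RuntimeDefault out broadly, one option is to patch manifests before they are applied. The following is only a minimal Python sketch of that idea: the function name with_runtime_default and the sample Deployment are hypothetical, and it assumes the manifest has already been loaded into a dict (for example from YAML or JSON).

```python
import copy

def with_runtime_default(deployment):
    """Return a copy of a Deployment manifest with a pod-level RuntimeDefault profile."""
    patched = copy.deepcopy(deployment)
    pod_spec = patched["spec"]["template"]["spec"]
    security_context = pod_spec.setdefault("securityContext", {})
    # Keep any profile the workload owner already chose (e.g. Localhost).
    security_context.setdefault("seccompProfile", {"type": "RuntimeDefault"})
    return patched

# Hypothetical Deployment without any seccomp settings.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo"},
    "spec": {"template": {"spec": {"containers": [{"name": "app", "image": "nginx"}]}}},
}

patched = with_runtime_default(deployment)
print(patched["spec"]["template"]["spec"]["securityContext"])
```

The sketch deliberately leaves existing seccompProfile settings alone, so workloads that already use a stricter Localhost profile are not downgraded.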

Localhost

securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: my-custom-profile.json

As is usually the case, the second we deviate from the paved path, things immediately get super complicated, but it’s also where the security juice lives. Let’s think about what the Localhost type is doing: it tells the kubelet and container runtime to load the profile from the node’s file system, and that profile states which syscalls the container can make.

Where localhost fetches the profile from

This requires your seccomp profile to be deployed across all of your nodes so the kubelet (the agent that runs containers on each node) can find it. The localhostProfile path is relative to the kubelet’s seccomp directory, which by default is

/var/lib/kubelet/seccomp/

Typically you would deploy this via a CRD, as I have in this example.
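To make the lookup concrete, here is a minimal Python sketch of how a relative localhostProfile value maps to a file on the node. It assumes the kubelet runs with its default root directory, so profiles live under /var/lib/kubelet/seccomp/. The helper name resolve_profile_path is hypothetical; it only mirrors the lookup and is not kubelet code.

```python
import posixpath

# Assumption: the kubelet uses its default root directory, /var/lib/kubelet.
SECCOMP_ROOT = "/var/lib/kubelet/seccomp"

def resolve_profile_path(localhost_profile, root=SECCOMP_ROOT):
    """Return the node path that a relative localhostProfile value points at."""
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile must be relative to the seccomp root")
    path = posixpath.normpath(posixpath.join(root, localhost_profile))
    # Refuse values such as "../../etc/passwd" that escape the seccomp root.
    if not path.startswith(root + "/"):
        raise ValueError("localhostProfile escapes the seccomp root")
    return path

print(resolve_profile_path("my-custom-profile.json"))
# /var/lib/kubelet/seccomp/my-custom-profile.json
```

If the file is missing at that path on the node where the pod gets scheduled, the container will fail to start, which is why the profile has to be on every node first.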

Unconfined

securityContext:
  seccompProfile:
    type: Unconfined

Unconfined is super easy to implement because it doesn’t block anything. This is the default behavior on most clusters, and it will slowly be considered insecure practice (if seccomp gains steam).

Practicality

Now that we understand the three types of seccomp profiles, we can assess how realistic it is to implement them.

First, what’s the risk of staying Unconfined? It all depends on the runtime protection you’re using in the cluster (if any). If you’re not using anything, then this is pretty risky, because you have zero visibility into an attacker pivoting throughout your cloud environment. If you have an EDR, the gap comes down to your comfort with that provider. It will certainly stop some really obvious attacks like uploading malware, but it’s very unlikely to stop an active attacker. Furthermore, a key issue is that eBPF-based responses can still arrive post-exploit - something Accuknox details really well in this blog.

How eBPF can be too slow to respond to exploits from Accuknox

The problem is that moving beyond Unconfined is Sunday scaries level scary. If a developer can’t manage their resources, they definitely won’t know every syscall their application makes.

This is the idea behind RuntimeDefault - if we can’t do least-privilege syscalls per container, can we at least move the needle? While it’s a nice idea, unfortunately it doesn’t fully fix the issue that makes blocking things frightening. RuntimeDefault should work without issues, but confirming that in advance is really hard to test. This is exacerbated because no one is really happy with the state of their staging cluster - so they won’t trust that it doesn’t break everything.

So, do security teams need to just throw in the towel, or is there a better way?

eBPF to the Rescue?

If only there was a tool for detecting what syscalls a container was making to the kernel…wait… that’s exactly what eBPF does! ARMO is using its friendly neighborhood eBPF agent to create seccomp profiles for your containers. Let’s go through what using it is like, and whether it breaks my application.

First, Generating my Seccomp Profiles

The ARMO agent automatically generates seccomp profiles based on the syscalls it observes the app making. The agent also does anomaly-based container detections, so I’m getting value whether or not I use this functionality.
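To give a feel for what "generate a profile from observed syscalls" means, here is a minimal Python sketch. It is not ARMO’s implementation: the build_profile function and the observed list are hypothetical, and it assumes x86_64 nodes. The output uses the standard seccomp profile JSON fields (defaultAction, architectures, syscalls).

```python
import json

def build_profile(observed_syscalls):
    """Build an allowlist-style seccomp profile from observed syscall names."""
    return {
        # Any syscall that is not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        # Assumption: x86_64 nodes; ARM nodes would need e.g. SCMP_ARCH_AARCH64.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                # Deduplicate and sort so the generated file is stable.
                "names": sorted(set(observed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }

# Hypothetical observation data; a real workload makes far more syscalls.
observed = ["read", "write", "openat", "close", "read", "futex", "epoll_wait"]
profile = build_profile(observed)
print(json.dumps(profile, indent=2))
```

The quality of a profile built this way depends entirely on how much of the app’s behavior was observed: any code path that never ran during observation will be blocked later.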

Second, Copying and Deploying the Files

Example files in this Pull Request

Add the localhostProfile to tell the container where to look for the profile. As an aside, it’s easy to see how you could template this with Helm into per-workload policies based on the profile names.

No that AWS access key isn’t real…unless…

Add the seccomp profile itself so it gets deployed to the nodes. This profile says to return an error (SCMP_ACT_ERRNO) for any syscall by default, and to allow only the explicitly declared syscalls.

The default action tells it to return an error when a syscall that isn’t allowed happens
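The default-deny behavior can be modeled in a few lines of Python. This is a simplified sketch, not how the kernel evaluates filters: it matches on syscall name only and ignores architectures and argument rules, and the profile contents are a hypothetical example rather than the one from the pull request.

```python
def action_for(profile, syscall):
    """Return the action a profile applies to a syscall name (arguments ignored)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # Nothing matched, so the profile's default action applies.
    return profile["defaultAction"]

# Hypothetical allowlist profile in the standard seccomp JSON layout.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        }
    ],
}

print(action_for(profile, "read"))    # SCMP_ACT_ALLOW
print(action_for(profile, "ptrace"))  # SCMP_ACT_ERRNO
```

With SCMP_ACT_ERRNO as the default, a blocked syscall fails with an error instead of killing the process, so the app may log a permission error rather than crash outright.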

Run helm upgrade or deploy normally

Third, Validating Your Changes

There are a few ways we can validate that our change worked.

Checking for the seccomp profile:

kubectl get pod [pod-name] -o json | grep seccomp
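Grep only tells you that the word seccomp appears somewhere in the output. If you want to know which profile type each container actually ends up with, a small script over the same JSON can help. The following is a minimal Python sketch under stated assumptions: effective_seccomp is a hypothetical helper, the sample pod is made up, and it only looks at regular containers (not init or ephemeral containers).

```python
def effective_seccomp(pod):
    """Map each container name to the seccomp profile type set for it in the spec."""
    spec = pod.get("spec") or {}
    pod_level = (spec.get("securityContext") or {}).get("seccompProfile")
    result = {}
    for container in spec.get("containers", []):
        # A container-level seccompProfile overrides the pod-level one.
        own = (container.get("securityContext") or {}).get("seccompProfile")
        chosen = own or pod_level
        result[container["name"]] = chosen["type"] if chosen else None
    return result

# Hypothetical pod, shaped like the output of `kubectl get pod -o json`.
pod = {
    "spec": {
        "securityContext": {
            "seccompProfile": {
                "type": "Localhost",
                "localhostProfile": "my-custom-profile.json",
            }
        },
        "containers": [
            {"name": "app"},
            {
                "name": "sidecar",
                "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            },
        ],
    }
}

print(effective_seccomp(pod))
# {'app': 'Localhost', 'sidecar': 'RuntimeDefault'}
```

A value of None means nothing is set in the spec, so the container falls back to whatever the node defaults to (typically Unconfined). To run this against a live pod, you could load the kubectl JSON output with json.load instead of using the inline sample.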

Trying a malicious command

Note: There’s a decent chance you’ll also break the ability to exec /bin/bash in a container when you deploy seccomp

Check ARMO

Success! It’s implemented!

Fourth, Monitoring Pro Tips

This is the hardest part of seccomp in my opinion: updating the profile when a new feature starts making syscalls the profile doesn’t allow. The most established way of doing this right now is the seccomp-notifier, which is a bit nerdy to try and implement at scale. Something like this will be helpful, but I wonder how often your existing application is really making new syscalls in the first place. Having a GUI like ARMO definitely makes this tracking easier, but this is still the greatest barrier to adoption.
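Whatever tool collects the data, the core check is a set difference: which syscalls does the new version of the app make that the deployed profile doesn’t allow? Here is a minimal Python sketch of that comparison. The function names and sample data are hypothetical, and it assumes you already have a list of observed syscall names from somewhere (an eBPF agent, audit logs, or a staging run).

```python
def allowed_syscalls(profile):
    """Collect every syscall name that the profile explicitly allows."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed

def would_be_blocked(profile, observed):
    """Return observed syscalls that the profile does not allow, sorted."""
    return sorted(set(observed) - allowed_syscalls(profile))

# Hypothetical deployed profile.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat", "close"], "action": "SCMP_ACT_ALLOW"}
    ],
}

# Hypothetical syscalls seen from a new release that now opens network connections.
observed_in_new_release = ["read", "write", "openat", "close", "socket", "connect"]

print(would_be_blocked(profile, observed_in_new_release))
# ['connect', 'socket']
```

Running a check like this in staging before a release gives you a list of syscalls to review, rather than finding out from a broken container in production.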

Conclusion

Okay, seccomp is super nerdy and scary, but it’s the best preventative security you could possibly do for a container. That said, the hard part going forward will be monitoring: how do you tell if a new feature is causing an issue with a container’s profile? In my opinion, seccomp is worth doing for high-risk workloads where you absolutely cannot afford a breach. Similarly, I don’t think hand-writing seccomp profiles is a scalable strategy, so something like ARMO applying eBPF data to the use case makes a ton of sense for generating them easily.

Rahul Jadhav

Nov 4

Great article! As a KubeArmor maintainer, I'd like to point out one more quirk of seccomp that led us to look at LSMs (Linux Security Modules) in place of seccomp.

One of the defining limitations of seccomp is that it has very limited access to syscall parameters. Imagine having to block delete/unlink only on a certain folder, rather than blocking unlink() in general. Or allowing the chmod() syscall to be executed only as part of process XYZ but not anything else.

Seccomp does not allow dereferencing these parameters, which keeps the rules very high level, i.e., blocking at the syscall level. While this works for a few syscalls, it won't work in the general case. For example, imagine a popular scenario where you want to allow execution of specific binaries and block all others. Seccomp lets you block execve() altogether but won't allow granular policies.

One of the goals for KubeArmor was to prevent Remote Command Execution attempts ... We could not handle it with seccomp. This is why we chose LSMs over seccomp for KubeArmor.
