Why and How to Implement Seccomp Policies
Explaining the use cases and how to implement an under-discussed Kubernetes feature
If you look up seccomp profiles in the Kubernetes dictionary, you’d see one line: big scary. Despite being scary, seccomp profiles are also one of the strongest preventative controls you can implement in a cluster. At a high level, it’s a big security payoff that comes with real implementation challenges, which is why it’s somewhat rare to see. Seccomp profiles restrict the system calls that a container can make. In other words, they function as a process-level allow list, explicitly stating which kernel functionality a container can use.
The above image shows some common syscalls your container might make while it’s running. If you think about these syscalls, you may already be getting some ideas for security: does your app need to start new processes? Does it need to read files? The security possibilities are complex and nearly endless.
In this article, we’ll cover seccomp profiles, their use cases, and how to implement and maintain them. I worked with ARMO Security on this article, because they’re making the process a lot easier by using their eBPF agent to generate these profiles on the fly - so we’ll also discuss that approach and what makes it neat.
WTF is SecComp?
Seccomp profiles come in three flavors: RuntimeDefault, Localhost, and Unconfined. The flavor tells the kubelet and the container runtime where to find the seccomp profile to apply, and therefore which syscalls the container is restricted to. You set it in the securityContext of your deployment YAML, and we’ll go through each below with an example implementation. Here’s a quick TL;DR on each:
RuntimeDefault is an attempt to block some super high-risk syscalls that are also very uncommon. It’s meant to be a good baseline for blocking stuff.
Localhost is customizing a seccomp profile to your heart’s content. That’s easier with a tool like ARMO, but harder if you’re trying it yourself.
Unconfined is the same as no seccomp - it’s not blocking anything.
Runtime Default
securityContext:
  seccompProfile:
    type: RuntimeDefault
The first profile type, RuntimeDefault, uses the seccomp profile provided by your container runtime, such as containerd or Docker. This default profile blocks a number of dangerous, rarely needed syscalls, such as kexec_load and reboot. Note that ptrace, usually only used for debugging or spying on specific processes, is blocked by Docker’s default profile only on kernels older than 4.8. The full list of blocked syscalls is here.
The idea is for this profile to be broadly applied across all your nodes and workloads, which is why it’s the example in the official Kubernetes documentation.
Localhost
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: my-custom-profile.json
As is usually the case, the second we deviate from the paved path, things immediately get super complicated, but it’s also where the security juice lives. Let’s think about what the Localhost type is doing: it tells the kubelet to load the profile from the node’s file system, and that profile states which syscalls the container can make.
This requires your seccomp profile to be deployed on every node where the pod might be scheduled, so the kubelet (the agent that runs containers on each node) can find it. The localhostProfile path is relative to the kubelet’s seccomp root directory, which by default is
/var/lib/kubelet/seccomp/
Typically you would deploy this via a CRD, as I have in this example.
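To make the path handling concrete, here is a minimal Python sketch of how a localhostProfile value maps onto the node’s file system. It assumes the default kubelet root shown above; the function name resolve_localhost_profile is hypothetical and is not part of any Kubernetes API. It only mirrors the rule that the path must be relative and must stay inside the seccomp directory.

```python
import posixpath

# Assumption: the kubelet uses its default seccomp root directory.
KUBELET_SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def resolve_localhost_profile(localhost_profile, root=KUBELET_SECCOMP_ROOT):
    """Return the on-node path a relative localhostProfile would map to."""
    if not localhost_profile:
        raise ValueError("localhostProfile must not be empty")
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile must be a relative path")
    # Collapse "a/../b" style segments before checking for escapes.
    normalized = posixpath.normpath(localhost_profile)
    if normalized == ".." or normalized.startswith("../"):
        raise ValueError("localhostProfile must stay inside the seccomp root")
    return posixpath.join(root, normalized)


print(resolve_localhost_profile("my-custom-profile.json"))
print(resolve_localhost_profile("profiles/audit.json"))
```

A quick check like this is handy in CI: if the file you ship to the nodes and the path in your manifest don’t resolve to the same place, the container will fail to start rather than run unprotected.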
Unconfined
securityContext:
  seccompProfile:
    type: Unconfined
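Unconfined is super easy to implement: it doesn’t block anything. This is the default behavior when no seccompProfile is set (unless the kubelet has been configured to default to RuntimeDefault), and it will slowly come to be considered insecure practice (if seccomp gains steam).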
Practicality
Now that we understand the three types of seccomp profiles, we can assess how realistic it is to implement them.
First, what’s the risk of using Unconfined? It all depends on the runtime protection you’re using in the cluster (if any). If you’re not using anything, then this is pretty risky, because you have zero visibility into an attacker pivoting throughout your cloud environment. If you have an EDR, the gap comes down to your comfort with that provider. Certainly it will stop some really obvious attacks like uploading malware, but it’s very unlikely to stop an active attacker. Furthermore, a key issue is that eBPF-based responses can still happen post-exploit - something Accuknox details really well in this blog.
The problem is that moving beyond Unconfined is Sunday scaries level scary. If a developer can’t manage their resources, they definitely won’t know every syscall their application makes.
This is the idea behind RuntimeDefault: if we can’t do least-privilege syscalls for every container, can we at least move the needle? It’s a nice idea, but unfortunately it doesn’t fully fix what makes blocking things frightening. RuntimeDefault should work without issues for most workloads, but proving that in advance is really hard to test. This is exacerbated because no one is really happy with the state of their staging cluster, so they won’t trust that the change doesn’t break everything.
So, do security teams need to just throw in the towel, or is there a better way?
eBPF to the Rescue?
If only there were a tool for detecting which syscalls a container makes to the kernel… wait… that’s exactly what eBPF can do! ARMO, your friendly neighborhood eBPF agent, is using that agent to create seccomp profiles for your containers. Let’s go through what using it is like, and whether it breaks my application.
First, Generating my Seccomp Profiles
The ARMO agent automatically generates seccomp profiles based on the syscalls it observes from the app. The agent also does anomaly-based container detections, so I’m getting value whether or not I use this functionality.
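To give a feel for what “generate a profile from observed syscalls” means, here is a minimal Python sketch. This is not ARMO’s implementation, just an illustration of the idea: take a list of syscall names you observed (the list below is made up), de-duplicate it, and emit a profile in the JSON layout that container runtimes accept. The function name build_profile and the single x86_64 architecture are assumptions for the example.

```python
import json


def build_profile(observed_syscalls, architectures=("SCMP_ARCH_X86_64",)):
    """Build a deny-by-default seccomp profile from observed syscall names."""
    names = sorted(set(observed_syscalls))  # de-duplicate, stable ordering
    return {
        # Anything not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [
            {"names": names, "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Hypothetical observations; a real list would come from your tracing agent.
observed = ["read", "write", "openat", "read", "close", "exit_group"]
profile = build_profile(observed)
print(json.dumps(profile, indent=2))
```

The quality of a profile like this depends entirely on how complete the observation window was: a code path that never ran while you were watching will not be in the allow list.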
Second, Copying and Deploying the Files
Example files in this Pull Request
Add the localhostProfile to tell the kubelet where to look for the profile. As an aside, it’s easy to see how you could helm this into per-workload policies based on the profile names.
Add the seccomp profile itself so it gets deployed to the nodes. This profile says to return an error (SCMP_ACT_ERRNO) for any syscall by default, and to allow only the explicitly declared syscalls.
Run helm upgrade or deploy normally.
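If the “error by default, allow what’s listed” logic feels abstract, this small Python sketch models the decision for a single syscall name. It’s a simplification under stated assumptions: the profile is a tiny made-up example, the helper action_for is hypothetical, and it ignores argument filters and architecture handling that real profiles can include.

```python
def action_for(profile, syscall):
    """Return the action a simplified profile would take for one syscall."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the profile's default applies.
    return profile["defaultAction"]


# Hypothetical, deliberately tiny profile for illustration.
example_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    ],
}

for name in ["read", "ptrace", "mount"]:
    print(name, "->", action_for(example_profile, name))
```

Here read is allowed because it is listed, while ptrace and mount fall through to the default and get an error back.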
Third, Validating Your Changes
There are a few ways we can validate that our change worked.
Checking for the seccomp profile on the pod:
kubectl get pod [pod-name] -o json | grep seccomp
Trying a malicious command
Note: There’s a decent chance you’re also breaking the ability to exec /bin/bash into a container when you deploy seccomp.
Check ARMO
Success! It’s implemented!
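The grep check above is fine for one pod, but it doesn’t tell you which container the profile applies to. As a rough sketch, the Python below walks the pod JSON and reports the profile each container ends up with, relying on the fact that a container-level seccompProfile overrides the pod-level one. The function name effective_seccomp and the sample pod are made up for illustration, and init containers are left out to keep it short.

```python
import json


def effective_seccomp(pod):
    """Map each container name to the seccompProfile set for it in the spec."""
    spec = pod.get("spec", {})
    pod_level = (spec.get("securityContext") or {}).get("seccompProfile")
    result = {}
    for container in spec.get("containers", []):
        own = (container.get("securityContext") or {}).get("seccompProfile")
        # Container-level settings win; otherwise fall back to the pod level.
        result[container["name"]] = own or pod_level
    return result


# Hypothetical, trimmed-down output of `kubectl get pod [pod-name] -o json`.
sample_pod = json.loads("""
{
  "spec": {
    "containers": [
      {
        "name": "app",
        "securityContext": {
          "seccompProfile": {
            "type": "Localhost",
            "localhostProfile": "my-custom-profile.json"
          }
        }
      },
      {"name": "sidecar"}
    ]
  }
}
""")

for name, seccomp in effective_seccomp(sample_pod).items():
    print(name, "->", seccomp)
```

A value of None means nothing was set in the spec for that container, so whatever the cluster default is will apply. To run it against a live pod, you could pipe the kubectl output into a script that calls json.load on standard input instead of using the sample.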
Fourth, Monitoring Pro Tips
This is the hardest part of seccomp in my opinion: updating the profile when a new feature starts making syscalls it doesn’t allow. The most normalized way of doing this right now is the seccomp-notifier, which is a bit nerdy to try and implement at scale. Something like this will be helpful, but I wonder how often your existing application is really making new syscalls in the first place. Having a GUI like ARMO definitely makes this tracking easier, but this is still the greatest barrier to adoption.
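One low-tech way to think about this monitoring problem is as a diff: compare the syscalls your profile allows with the syscalls a new build actually makes. The Python sketch below does exactly that. The helper names and the sample data are hypothetical, and it assumes you already have a list of observed syscalls from somewhere, for example an eBPF agent or a staging run with a logging profile (defaultAction set to SCMP_ACT_LOG).

```python
def allowed_syscalls(profile):
    """Collect every syscall name the profile explicitly allows."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed


def profile_drift(profile, observed):
    """Return observed syscalls that the profile would not allow, sorted."""
    return sorted(set(observed) - allowed_syscalls(profile))


# Hypothetical profile currently deployed to the nodes.
deployed = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write", "close"], "action": "SCMP_ACT_ALLOW"}],
}

# Hypothetical syscalls seen from a new release in staging.
seen_in_staging = ["read", "write", "socket", "connect", "close", "socket"]

print(profile_drift(deployed, seen_in_staging))
```

An empty list means the new build stays within the existing profile. Anything else is a syscall to review before the release goes out: either add it to the profile or ask why the app suddenly needs it.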
Conclusion
Okay, seccomp is super nerdy and scary, but it’s some of the strongest preventative security you can apply to a container. That said, the hard part going forward will be monitoring: how do you tell if a new feature is causing an issue with a container’s profile? In my opinion, seccomp is worth doing for high-risk workloads where you absolutely cannot afford a breach. Similarly, I don’t think writing seccomp profiles by hand is a scalable strategy, so something like ARMO applying eBPF data to the use case makes a ton of sense for generating the profiles easily.
Great article! As a KubeArmor maintainer, I'd like to point out one more quirk of seccomp that led us to look at LSMs (Linux Security Modules) in place of seccomp.
One of the defining limitations of seccomp is that it cannot inspect what syscall parameters point to, such as file paths. Imagine having to block delete/unlink only on a certain folder, rather than blocking unlink() in general. Or allowing the chmod() syscall only when it is executed as part of process XYZ and nothing else.
Seccomp does not allow dereferencing these parameters, which makes the rules very high level, i.e., block at the syscall level. While this works for a few syscalls, it won't work for the general scenario. For example, imagine a popular scenario where you want to allow execution of specific binaries and block all others. Seccomp will allow one to block execve() altogether but won't allow granular policies.
One of the goals for KubeArmor was to prevent Remote Command Execution attempts ... We could not handle it with seccomp. This is why we chose LSMs over seccomp for KubeArmor.