Introduction
At Ginkgo we ensure that actions taken on software running in production are recorded and auditable. This is generally good practice and there are compliance regimes that require this level of logging and auditability.
We also want to enable our software engineering teams to easily troubleshoot their production applications. When running applications in our Kubernetes (k8s) clusters, we can use standard RBAC (Role-Based Access Control) and the cluster audit logs to capture actions taken on cluster resources and ensure adherence to these best practices and policies.
This blog will explain how we used OPA Gatekeeper policies to resolve the tension between engineers wanting to execute shell commands in running containers when troubleshooting and our need to capture those actions in the K8s cluster audit logs.
The Problem with kubectl exec and Auditability
While we hope to provide all the visibility a software developer could want with our observability tooling, sometimes instrumentation is missing. Developers understandably want the ability to execute commands within running containers when under pressure to quickly resolve a production issue.
Kubernetes provides an exec API, which allows for executing shell commands within a Pod container.
Unfortunately, once an interactive shell session is initiated, any commands issued within the container are no longer captured by the K8s audit logs. The audit logs record who issued an exec on which pod container, and that’s it.
Using standard RBAC resources we can deny any exec command entirely, but developers would feel the loss of that capability. What we really want is to prevent interactive shell sessions, to which the audit logs are blind. Standard RBAC resources are not able to differentiate between interactive and non-interactive exec calls. With non-interactive exec commands, the shell commands are captured in the audit logs. While it may slow developers to have to construct individual exec commands, they can get the troubleshooting capabilities they need while satisfying logging and auditability constraints.
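To illustrate the all-or-nothing nature of RBAC here, a sketch of a Role (names are illustrative, not our actual configuration) that grants read access to Pods but denies exec entirely by omitting the pods/exec subresource:

```yaml
# Illustrative Role: grants read access to Pods and their logs but omits
# the pods/exec subresource, so every `kubectl exec` is denied outright.
# RBAC cannot distinguish interactive from non-interactive exec.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader        # hypothetical name
  namespace: team-app     # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
# Allowing exec at all (interactive or not) would require adding:
# - apiGroups: [""]
#   resources: ["pods/exec"]
#   verbs: ["create"]
```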
What is OPA Gatekeeper and How Does it Solve the Problem?
Open Policy Agent (OPA) is an open-source policy engine. OPA Gatekeeper is built on top of OPA to provide K8s-specific policy enforcement features. The Software Developer Acceleration (SDA) team at Ginkgo is responsible for operating Elastic Kubernetes Service (EKS) clusters, and was already considering implementing OPA Gatekeeper for a few cluster policy enforcement use cases.
When concerns about allowing non-interactive exec arose, we thought OPA Gatekeeper might provide a solution. We stumbled upon a Gatekeeper GitHub issue which described our exact use case. This issue suggested that it should be feasible to implement an OPA Gatekeeper constraint to act on the PodExecOptions, which determine whether an exec is interactive or not.
OPA Gatekeeper uses the OPA policy engine to enforce policy in K8s by defining Custom Resource Definitions: ConstraintTemplates and Constraints. Those resources integrate with K8s admission controllers to reject API calls and resources which violate a constraint. K8s admission controls are implemented using validating and mutating webhooks.
OPA Gatekeeper also provides a library of ConstraintTemplates for many common policy use cases. Unfortunately, preventing interactive exec is not among the ConstraintTemplates already implemented in the community library.
SDA set up OPA Gatekeeper and then started experimenting and learning how to craft ConstraintTemplates and Constraints based on the examples in the library. OPA policies are expressed in Rego, a Domain-Specific Language (DSL), so this required some learning by members of the SDA team.
Enabling Gatekeeper Webhooks to Validate exec Operations
The first challenge we faced was ensuring that the OPA Gatekeeper ValidatingWebhookConfiguration could validate exec operations. Validating webhook rules match on the following API request attributes:
- operations
- apiGroups
- apiVersions
- resources
- scope
To act on exec calls, the webhook must include the pods/exec subresource in its resources, and it must include CONNECT in its operations. We discovered that the released Helm chart for OPA Gatekeeper at the time only specified the CREATE and UPDATE operations, omitting the CONNECT operation. After we modified our OPA Gatekeeper install to add the CONNECT operation, our constraints were able to act upon exec calls.
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
  namespace: gatekeeper-system
webhooks:
  rules:
  - apiGroups:
    - '*'
    apiVersions:
    - '*'
    operations:
    - CREATE
    - UPDATE
    - CONNECT
    resources:
    - '*'
    - pods/ephemeralcontainers
    - pods/exec
    - pods/log
    - pods/eviction
    - pods/portforward
    - pods/proxy
    - pods/attach
    - pods/binding
    - deployments/scale
    - replicasets/scale
    - statefulsets/scale
    - replicationcontrollers/scale
    - services/proxy
    - nodes/proxy
    - services/status
```
ConstraintTemplates and Constraints
ConstraintTemplates contain policy violation rules, which can then be used by multiple different Constraints.
The PodExecOption which determines whether an exec is interactive is the stdin option. In the following ConstraintTemplate, the Rego rule reviews the PodExecOptions object passed to it from the Constraint to determine whether stdin is true or false. If true, the request violates the Constraint.
```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyinteractiveexec
  namespace: gatekeeper-system
spec:
  crd:
    spec:
      names:
        kind: K8sDenyInteractiveExec
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdenyinteractiveexec
      violation[{"msg": msg}] {
        input.review.object.stdin == true
        msg := sprintf("Interactive exec is not permitted in production constrained environments. REVIEW OBJECT: %v", [input.review])
      }
```
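Rego's rule syntax can be opaque on first read, so here is a rough Python equivalent of the violation rule above, for intuition only (the function name and sample dictionaries are ours; Gatekeeper itself evaluates only the Rego):

```python
def check_interactive_exec(review: dict) -> list:
    """Mimic the Rego `violation` rule: Gatekeeper hands the admission
    review to the rule as `input.review`; a violation fires whenever the
    PodExecOptions object has stdin set to true."""
    violations = []
    if review.get("object", {}).get("stdin") is True:
        violations.append(
            "Interactive exec is not permitted in production "
            "constrained environments. REVIEW OBJECT: %s" % review
        )
    return violations

# An interactive exec request (stdin: true) trips the rule...
interactive = {"object": {"kind": "PodExecOptions", "stdin": True, "tty": True}}
# ...while a non-interactive exec request passes untouched.
non_interactive = {"object": {"kind": "PodExecOptions", "stdin": False}}
```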
The Constraint determines the objects to which the specified ConstraintTemplate should be applied and any enforcement action to take.
SDA provides namespaces for teams operating applications in the EKS clusters. Namespaces containing applications subject to constraints are labeled.
The following Constraint applies the K8sDenyInteractiveExec ConstraintTemplate above to the PodExecOptions object. It also uses a namespaceSelector to apply the ConstraintTemplate only in namespaces bearing the label. The default enforcement action is to deny.
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyInteractiveExec
metadata:
  name: k8sdenyinteractiveexec
  namespace: gatekeeper-system
spec:
  match:
    kinds:
    - apiGroups:
      - ""
      kinds:
      - PodExecOptions
    namespaceSelector:
      matchExpressions:
      - key: <label to constrain the environment goes here>
        operator: In
        values:
        - "true"
    scope: Namespaced
```
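Opting a namespace into the constraint is then just a matter of carrying the label the namespaceSelector matches on. A sketch (the namespace name is illustrative; the label key is whichever one you chose for the Constraint):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-app   # illustrative namespace name
  labels:
    # must be the same key the Constraint's namespaceSelector matches on
    <label to constrain the environment goes here>: "true"
```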
Once this Constraint was in place, we tested by issuing kubectl exec commands against some test Pods in the labeled namespace, with and without the stdin option (-i).
```
% kubectl exec -it test-679bdcc64b-gnjll -- /bin/bash
Error from server (Forbidden): admission webhook "validation.gatekeeper.sh" denied the request: [k8sdenyinteractiveexec] Interactive exec are not permitted in production constrained environment. REVIEW OBJECT: {"dryRun": false, "kind": {"group": "", "kind": "PodExecOptions", "version": "v1"}, "name": "test-679bdcc64b-gnjll", "namespace": "default", "object": {"apiVersion": "v1", "command": ["/bin/bash"], "container": "efs-csi-test-deployment-nginx", "kind": "PodExecOptions", "stdin": true, "stdout": true, "tty": true}, "oldObject": null, "operation": "CONNECT", "options": null, "requestKind": {"group": "", "kind": "PodExecOptions", "version": "v1"}, "requestResource": {"group": "", "resource": "pods", "version": "v1"}, "requestSubResource": "exec", "resource": {"group": "", "resource": "pods", "version": "v1"}, "subResource": "exec"}
```
```
% kubectl exec test-679bdcc64b-gnjll -- echo foo
foo
```
(Feature photo by Nikola Knezevic on Unsplash)