Gold Fig - Making sure the things that aren’t supposed to work ... don’t work — negative testing your cloud infrastructure and app

When getting cloud infrastructure set up and functioning, developers will go through a slew of steps before it reaches production. This could include infrastructure as code (IaC), static analysis tests, and ensuring that analytics and monitoring are properly instrumented. While many of these pieces fit within the regular software development lifecycle in the form of code reviews and integration tests against dev or staging environments, they all end up being only proxies for what is actually deployed to production.

CCA 3.0 Wikimedia.org Commons File:Gates_without_wall.jpg

The devops view picks up at the ground reality of production, which is where the “monitoring” responsibilities fall. However, if the engineering effort that went into creating the artifact was focused on working functionality (e.g. a serverless function being able to access a database), what mechanism exists to ensure that least privilege is being enforced throughout? Or, further, that access is denied to the outside world? Or, even more generally, if something ought to explicitly not work, how is that verified? A huge amount of engineering time goes into tests and monitoring around availability and reliability, but there’s a dearth of information on negative testing of production infrastructure. We thought it would be useful to start a resource on areas where startups can begin their negative testing efforts against their production environments.

Test to make sure you can’t log in with no password or a wrong password - Your application’s production front door should be tested for things like end users being successfully authenticated with no password or the wrong password. Many years ago, Dropbox had an incident where a bug caused this to occur in production.
Ensure that objects in a global namespace (e.g. storage buckets, REST endpoints, database-as-a-service endpoints, disk snapshots, Docker images, etc.) are not publicly accessible - Have an external check that periodically polls a known URL that ought to always return access denied. While the data source may be configured to be available only to an internal app or process, modern cloud providers have a slew of offerings that live in a globally accessible namespace. In addition to proper policies, access control lists, and public access blocks, it is prudent to instrument an end-to-end check of these endpoints in the form of an attempted retrieval. Bonus: Ensure that application-specific access controls around groups or permissions are properly enforced by checking for canary objects at known URLs.
Ensure network endpoints are inaccessible - Security groups can be notoriously difficult to set up and get right. If the surface area on the public internet is required but constrained by IP addresses, ensure that external checks are done to verify that ports are not actually open and visible. A good practice is to use a completely different account or provider to conduct the check to preclude false negatives due to potential network or access overlaps.
Ensure IAM policies are not overly permissive - Modern applications and their menagerie of microservices mean IAM policies have to be threaded narrowly and just right. With aggressive deadlines to ship, engineers will frequently reach for more and more permissive settings in order to get things to just work. Ensuring a wildcard hasn’t made its way into a resource’s access policy document is a worthwhile check. A first step would be to ensure the resource isn’t accessible via an unauthenticated request from the public internet. A worthwhile next step would be to ensure that it also isn’t accessible from any other account within the cloud provider. For example, ensure that other accounts within the cloud provider can’t read from or write to your SQS queues and SNS topics.
Ensure a principal can’t delete the whole resource - Using a canary resource, ensure principals can’t delete the entire resource. Many cloud providers offer internally maintained IAM policies to make developers’ lives easy. However, these premade policies are typically overprovisioned. For example, AWS’s AmazonS3FullAccess gives principals the ability to delete whole buckets. While this might be an intended effect for some types of applications, more often than not principals, especially programmatic ones, shouldn’t be able to delete some portion of the infrastructure.
Ensure private devops tools and repositories are in fact private - Instrument checks to ensure your private git repos or issue trackers always return access denied or not found. While sites like GitHub now have UI guardrails to make it obvious that something out of the ordinary is about to occur, having a check to ensure that private repos remain private is a good safety net to have. This is also a good check for things like internal Maven, npm, or other internal dependency systems.
Ensure write-once-read-many data stores do not allow mutations - Have a check that attempts to delete or edit objects that are not meant to be mutated. Logging data stores that capture audit or security information are only as useful as their integrity. Check to ensure that the data stores holding things like CloudTrail, security, and other important logs are in fact immutable. Bonus: also do similar checks for hashes of said log files or cryptographically signed or time-stamped files.
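The external "should always be denied" checks described above (public objects, private repos, internal package registries) can be sketched as a small poller. This is a minimal sketch, not a definitive implementation: the function names, the set of status codes treated as "denied", and the example URLs are all assumptions made for illustration. Note that `urllib` follows redirects, so a private page that redirects to a public login page will report 200; such endpoints need a stricter check.

```python
import urllib.error
import urllib.request

# Status codes treated as "access correctly denied".
# This set is an assumption; adjust it for each service you check.
DENIED_STATUSES = {401, 403, 404}


def is_denied(status: int) -> bool:
    """Return True if the status code means the request was refused."""
    return status in DENIED_STATUSES


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Make an unauthenticated GET and return the HTTP status code.

    Network-level failures (DNS, connection errors) are not caught here,
    so a broken check fails loudly instead of silently passing.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; the code is what we want.
        return err.code


def find_exposed(urls, fetch=fetch_status):
    """Return (url, status) pairs for every URL that was NOT denied."""
    exposed = []
    for url in urls:
        status = fetch(url)
        if not is_denied(status):
            exposed.append((url, status))
    return exposed
```

Run from outside your own network and accounts, a call such as `find_exposed(["https://example-bucket.example.com/canary.txt"])` (a hypothetical canary URL) should return an empty list; anything else is worth an alert.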
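For the network endpoint item above, a port check run from an outside vantage point might look like the following. It is a rough sketch under stated assumptions: the function names are hypothetical, and a plain TCP connect is only one way to probe. Timeouts, refusals, and DNS failures are all reported as "not open" here, so pair it with a known-open control port if you need to tell a closed port from a broken probe.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def unexpectedly_open(host: str, ports, timeout: float = 3.0):
    """Return the subset of ports that accepted a connection."""
    return [port for port in ports if port_is_open(host, port, timeout)]
```

As the text suggests, running this from a different account or provider avoids the checker inheriting network access that the public internet does not have.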
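The wildcard check from the IAM item above can be approximated offline by scanning a policy document. The sketch below assumes an AWS-style JSON policy (`Statement`, `Effect`, `Principal`, `Action`, `Resource`, `Condition`); the function names and the flagging rules are illustrative assumptions, not an official linter. It flags `Allow` statements with a wildcard principal and no condition, actions of `*` or `service:*`, and a resource of `*`. A `Resource` of `*` is sometimes legitimate, so treat findings as prompts for review rather than verdicts.

```python
import json


def _as_list(value):
    """Policy fields may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _principal_is_wildcard(principal) -> bool:
    """True for "Principal": "*" or any {"AWS": "*"}-style entry."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return any("*" in _as_list(v) for v in principal.values())
    return False


def find_wildcards(policy_json: str):
    """Return (statement_index, field) pairs for wildcard Allow grants."""
    policy = json.loads(policy_json)
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        # A wildcard principal is only flagged when no Condition narrows it.
        if _principal_is_wildcard(statement.get("Principal")) and "Condition" not in statement:
            findings.append((index, "Principal"))
        actions = _as_list(statement.get("Action"))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append((index, "Action"))
        if "*" in _as_list(statement.get("Resource")):
            findings.append((index, "Resource"))
    return findings
```

A static scan like this complements, rather than replaces, an actual attempted request from an unauthenticated client or a second account.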

What are some of your go-to negative testing practices? Drop us a note, we’d love to hear them!
