Hunting for Sensitive Data in Public Amazon Images (AMI)

Reading Time: 9 minutes

Table of Contents

  1. Background
  2. Building the Automation
  3. Results
  4. Attribution
  5. Remediation
  6. Final Thoughts
  7. Credits


Background

A while back, I watched a 2019 DEFCON talk by Ben Morris on exposed EBS volumes on AWS. EBS volumes are attachable block storage devices, essentially virtual disks. Here is the short problem statement the talk covers (watch it if you haven’t; it’s both funny and informative):

  • EBS volumes can be set to Public (they aren’t, by default) and of course, plenty are public.
  • This allows anyone with an AWS account to mount these volumes, and snoop around for data.
  • This resulted in interesting findings.

I have been toying for quite some time with the idea of running a similar exercise against Public Amazon Machine Images (AMIs). An AMI can serve as a backup of a system, or as a template to create copies of it in a desired state.

Public AMIs are not a new concern, but since I haven’t been able to find a blog post or other research around AMIs (if you know of one, please reach out!), I took a stab at it to see what is exposed out there via AMIs. In all honesty, I expected it to be bad, given the number of images available.

A feature of AMIs is the ability to share them with others, such as specific AWS accounts (share by ID) or share it with the community, and this is where our exploration starts.

It is important to note that Public AMIs are a search away, and can be deployed at will by anyone. This is publicly available data and is not some crazy black magic. If you wanted to explore public AMIs, you could go to EC2 -> AMI -> and select Public Images in the filter.

Building the Automation

Architecture and Collection Plan

To get things started, I planned how to carry out this assessment from an automation standpoint. I had a rough idea of what was required, but I didn’t quite know how to approach this in an efficient manner. My initial plan was:

  1. Pull Public AMIs of interest programmatically
  2. Create an Instance out of the AMIs
  3. Connect to the Instances
  4. Run some logic
  5. Store Results
  6. Destroy the Images
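The six steps above can be sketched as a single loop. Everything in this sketch is stubbed (no real AWS or SSH calls are made), and all function names are hypothetical:

```ruby
# Hypothetical skeleton of the six-step flow; every step is a stub
# standing in for the real aws-sdk / SSH calls, and all names are made up.
RESULTS = {}

def pull_candidate_amis
  ['ami-11111111', 'ami-22222222'] # step 1: would be describe_images with filters
end

def deploy(ami_id)
  "i-0#{ami_id.delete_prefix('ami-')}" # step 2: would be create_instances
end

def inspect_host(instance_id)
  { instance: instance_id, history: '', ssh_keys: '' } # steps 3-4: SSH in, run logic
end

def store(ami_id, data)
  RESULTS[ami_id] = data # step 5: would be a MySQL INSERT
end

def destroy(instance_id)
  true # step 6: would be terminate_instances
end

pull_candidate_amis.each do |ami_id|
  instance = deploy(ami_id)
  store(ami_id, inspect_host(instance))
  destroy(instance)
end
```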

This seemed like a trivial task, but there was one hurdle, and that’s scale. Since AMIs are a regional resource, each AWS region (eu-west-1, us-east-1, etc.) has its own set of AMIs.
As of this writing, sampling the London AWS region (eu-west-2) shows it contains almost 60,000 Public AMIs.

For this proof of concept, a pretty simple architecture was used, which consisted of a MySQL node, a Redis node, and a copy of the collector written in Ruby.

Here is a simplified diagram of the infrastructure: 

A cluster of EC2 instances was used to deploy and process the AMIs, MySQL stored the collected data, and the Redis cache made processing servers aware of what had already been deployed, allowing them to pick up any remaining unprocessed AMIs.

Setting Billing Limits on AWS

There’s going to be some cost associated with inspecting potentially 1000s of AMIs, and since research is a lot of trial and error, I had to implement some safety valves.

AWS Budgets

AWS allows you to set a cost-based budget to limit how much you can spend on cloud services. In case something goes wrong with the automation and results in many spun-up services, this should be an effective limiter.

If you run experiments on AWS that involve potentially creating a lot of resources, I highly recommend setting an upper limit to avoid draining your wallet.

AMI Baselining

The process starts with obtaining candidate AMIs and some metadata about them, such as:

  • Region name
  • AMI ID
  • Account ID (or AMI owner ID)
  • Deployed Instance IP

Given the sheer volume of public AMIs on AWS, I didn’t want to scan 60K AMIs per region. Instead, there was filtering in place to shrink the scope:

  1. AMIs named with interesting keywords (e.g. “Production”, “Staging”, “Internal”, “Private”, “Jenkins”, “DB”, “Database”, etc.)
  2. AMIs not created by Amazon themselves (Owner != Amazon)
  3. Linux based Images
  4. Recently registered AMIs (2020)

Here is an example of how to pull AMIs using AWS SDK (Ruby’s aws-sdk gem):

amis = @ec2.describe_images({
  filters: [
    {
      name: "name",
      values: ["*prod*", "*backup*", "*staging*", "*Testing*"] # .. snip ..
    },
    {
      name: "is-public",
      values: ["true"]
    }
  ]
})

Having these filters in place, a minimal list of AMIs to scan was produced. You can customize your own filters to match your needs, of course.

It’s worth mentioning that images with interesting contents may not necessarily follow any predictable naming convention, so there’s a chance to miss quite a bit with this approach, but that was okay for this experiment.
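The same name-based pre-filtering can also be applied client-side once image metadata is in hand. A minimal sketch (the keyword list mirrors the filters above and is purely illustrative):

```ruby
# Case-insensitive check for "interesting" AMI names; the keyword list
# is illustrative, not exhaustive.
KEYWORDS = %w[prod backup staging testing internal private jenkins db database].freeze

def interesting_name?(ami_name)
  name = ami_name.downcase
  KEYWORDS.any? { |kw| name.include?(kw) }
end
```

For example, `interesting_name?('acme-Prod-web')` matches, while a stock Amazon image name typically does not.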

AMI Deployment

Deploying an AMI is a pretty straightforward process: you take an AMI ID and call create_instances on it, along with the EC2 properties of your choice. Below is a sample of the Ruby code used. When launching an EC2 instance, the most important thing is to set delete_on_termination for any block devices, so you don’t end up with dangling disks after you terminate an instance.

    instances = @ec2.create_instances({
      image_id: ami_id,
      min_count: 1,
      max_count: 1,
      key_name: @config['connection']['pem_name'],
      instance_type: 't2.small',
      block_device_mappings: [{
        device_name: '/dev/sda1', # root device name varies per AMI
        ebs: {
          delete_on_termination: true,
          volume_size: 100
        }
      }]
    })

AMIs that were deployed successfully were then registered in Redis, to avoid checking them more than once.
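The deduplication relies on Redis set semantics (SADD returns 1 only for new members). Here is a sketch of that logic with an in-memory Set standing in for Redis; with the real redis gem this would be a `sadd` call against the shared cache:

```ruby
require 'set'

# In-memory stand-in for the Redis set that tracks processed AMIs.
# With the redis gem: redis.sadd('deployed_amis', ami_id) == 1
DEPLOYED = Set.new

# Returns true only the first time an AMI is claimed, mimicking SADD.
def claim_ami?(ami_id)
  DEPLOYED.add?(ami_id) ? true : false
end
```

A worker that fails to claim an AMI simply skips it, which is what let multiple collectors share one queue.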

EC2 Connection

After the instance is deployed, an SSH connection to the EC2 instance needs to be made; alternatively, the disk that gets created can be mounted for inspection (similar to the approach Ben Morris took in his DEFCON talk). However, in my opinion, there’s an advantage to connecting to a fully functional instance over mounting the disk: the connection provides access to fully configured apps, so I could inspect them without needing to tweak configuration or system files.

From what I’ve seen on AWS, it is very common for AMIs to allow any of the following users to SSH in with the PEM key you create on AWS:

  • ubuntu
  • ec2-user
  • centos
  • root

The automation flow would loop over these users until one successfully authenticates. It would then attempt to elevate (sudo) permissions to start the data inspection process.
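That username loop can be sketched as follows. Here `try_ssh` is any callable that returns a connection object or nil; in the real tool it would wrap a Net::SSH connection attempt using the PEM key (names are illustrative):

```ruby
# Candidate SSH usernames commonly accepted by public AMIs.
CANDIDATE_USERS = %w[ubuntu ec2-user centos root].freeze

# Returns the first username that authenticates, or nil if none do.
# try_ssh stands in for a real SSH attempt (e.g. Net::SSH.start with a key).
def find_ssh_user(host, try_ssh)
  CANDIDATE_USERS.each do |user|
    conn = try_ssh.call(host, user)
    return user if conn
  end
  nil
end
```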

Data Hunting

We are getting to the most important part of this research, which is what to scan for and where. Data can be in both conventional and non-conventional places around the system (dedicated data volumes, custom home folders, etc.). In the majority of cases the AMIs followed the filesystem hierarchy standard.

For the data collection, I had decided that reinventing the wheel is unnecessary. Instead, I used a few Linux oriented security tools and guidelines to zoom in on potential targets.

Leveraging existing tools

There are a few tools and guidelines out there focused on Linux, I used their logic to narrow down what to collect and which commands to run. I hoped this would find more interesting data that may be useful for a later inspection:

  • LinEnum – shell script that attempts to find misconfigurations that could allow escalation to root
  • g0tmi1k privesc guide – Privilege Escalation guide
  • Lynis – a very comprehensive Linux security auditing tool

Leveraging Other techniques

Ransomware has logic built in to find sensitive and valuable files to encrypt. Some of this logic was used as a model for the collection techniques, such as which file extensions to grab. I found a curated list of file extensions used by ransomware over time here.

In addition, the following data was also collected when it was identified:

  • Home folders of all users (root and non-privileged users)
  • bash_history, mysql_history, rediscli_history
  • Copy of Databases
  • Cache Keys
  • Webserver contents
  • id_rsa, id_dsa, id_ed25519 and id_ecdsa SSH private keys
  • Configuration files of various services
  • AWS Credentials
  • Processes and Environment Variables
  • Service specific commands, e.g. Redis: echo "KEYS *" | redis-cli and MySQL: mysql -u root -e "show databases"

… and more.
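Stripped to its essence, collection like the above is just running a list of labeled commands and keeping the output. A stubbed sketch (the command list is illustrative, and `runner` stands in for executing over the SSH session):

```ruby
# Maps a label to the shell command that produces it; illustrative only.
COLLECTORS = {
  'history'  => 'cat ~/.bash_history',
  'ssh_keys' => 'cat ~/.ssh/id_rsa 2>/dev/null',
  'services' => 'netstat -tlnp'
}.freeze

# runner is any callable that executes a command and returns its output;
# in the real tool it would run the command over the SSH connection.
def collect(runner)
  COLLECTORS.each_with_object({}) do |(label, cmd), out|
    out[label] = runner.call(cmd)
  end
end
```

The resulting hash maps cleanly onto the one-column-per-technique schema described later.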

My scripts used a YAML file that outlined what to collect, here is an example of what my configuration looked like:
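A hypothetical sketch of such a YAML layout (the keys and commands here are illustrative, not the actual config):

```yaml
# Illustrative excerpt only -- see the scanner repository for the real config.
collect:
  commands:
    history: "cat ~/.bash_history"
    processes: "ps aux"
  files:
    - "~/.ssh/"
    - "/etc/passwd"
```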

You can clone the scanner repository from here and modify the config file to match your needs.

Archiving and Searching

A key aspect of this exercise was the searchability and archival of data. In order to quickly search this data later, I used a MySQL column per collection technique, but only for high-signal commands. Here’s a sample schema:

    +-----------------+----------+------+-----+---------+-------+
    | Field           | Type     | Null | Key | Default | Extra |
    +-----------------+----------+------+-----+---------+-------+
    | ami_id          | char(50) | NO   | PRI | NULL    |       |
    | aws_region      | char(50) | NO   |     | NULL    |       |
    | hostname        | text     | YES  |     | NULL    |       |
    | users           | text     | YES  |     | NULL    |       |
    | groups          | text     | YES  |     | NULL    |       |
    | shadow          | text     | YES  |     | NULL    |       |
    | history         | text     | YES  |     | NULL    |       |
    | ssh_keys        | text     | YES  |     | NULL    |       |
    | last_users      | text     | YES  |     | NULL    |       |
    | motd            | text     | YES  |     | NULL    |       |
    | services        | text     | YES  |     | NULL    |       |
    | processes       | text     | YES  |     | NULL    |       |
    | uname           | text     | YES  |     | NULL    |       |
    | os              | text     | YES  |     | NULL    |       |
    | crons           | text     | YES  |     | NULL    |       |
    | aws_creds       | text     | YES  |     | NULL    |       |
    +-----------------+----------+------+-----+---------+-------+

This approach allowed queries such as:
SELECT shadow FROM data – List password hashes.
SELECT mysql_showdb FROM data WHERE services LIKE "%3306%" – List of databases where MySQL is actively listening.

Example of grabbing hashes and database names

Or even better, get all AWS creds where they exist:
SELECT aws_region, aws_creds, ami_id FROM data WHERE aws_creds IS NOT NULL

Example of grabbing AWS credentials

In some cases not everything was suitable for storing in a database, so I also collected and compressed key files and folders on the deployed system, then programmatically SCP’d them back to my collectors. The result was that each EC2 instance had a tarball archive associated with it. This also let me go back and look at raw files whenever a SQL query returned interesting findings:

    -rw-r--r-- 1 root root      30720 Dec 28 05:15 ami-e******-collect.tar
    -rw-r--r-- 1 root root     399360 Dec 28 05:15 ami-e******.tar
    -rw-r--r-- 1 root root      30720 Dec 27 17:50 ami-f******-collect.tar
    -rw-r--r-- 1 root root     399360 Dec 27 17:50 ami-f******.tar 
    -rw-r--r-- 1 root root      30720 Dec 28 05:57 ami-z******-collect.tar
    -rw-r--r-- 1 root root     399360 Dec 28 05:57 ami-z******.tar    

There’s naturally a lot of garbage when casting such a wide net of data collection, so I ended up dedicating a data volume for these archives to ensure I didn’t run out of space during the process.


Results

The aforementioned data structure made it relatively easy to identify and isolate the more sensitive AMIs from the 1000s that were processed. Considering the amount of data, it was likely something interesting would come up. Here are some of the findings:

  • AMIs configured with AWS credentials
  • Webapps with their complete configurations
  • Dockerfiles with credentials
  • Python and bash scripts with credentials
  • SQL connection strings in various places
  • WordPress credentials
  • History files full of API keys
  • Private SSH keys
  • Certificates
  • Environment variables with secrets
  • Penetration testing boxes
  • Security agents configurations (e.g. Tanium, etc.)
  • Git repositories
  • Django deployments
  • Fully configured Jenkins servers

The severity of most of these findings is pretty obvious, so I don’t want to spend too much time on these. I would like, however, to spend a minute on some of the less obvious items.

Security Agents

Security agents such as EDRs and AVs tend to have unique configurations per customer (such as the call-home endpoint). This helps attribute the AMI to a company, and could potentially be used as an attack vector in itself.

Penetration Testing Boxes

Many pentest tools drop logs, configurations, runtime state and other data into hidden folders under ~/. Make sure you sanitize your box before exposing your images publicly; you want to avoid spilling internal data onto the public internet.


Attribution

There is no doubt that attribution was the most difficult aspect of this exercise. However, in some cases I was able to identify the company or individuals behind the AMI by:

  • Looking at Public SSH key comments in ~/.ssh/authorized_keys, or RFC4716 3.3.2 Comment Header
  • Hosts File
  • Let’s Encrypt Certbot commands in bash_history
  • Log files
  • Nginx & Apache configuration files
  • git commit logs
  • Configuration files of installed security agents (EDRs, AVs, etc.), weirdly enough

Reaching out to some of these companies for disclosure was a painful process which wasn’t always successful.


Remediation

Checking whether you have any AMIs exposed is extremely simple and can be done from both the Console and the AWS CLI.

AMI Console

  1. Go to your AWS Console -> EC2 -> AMI
  2. Filter by Owned by me
  3. Visibility Column -> Sort by Public

If any of your images are public, consider them compromised, then remove their visibility by doing the following:

  1. Select Action
  2. Select Modify Image Permissions
  3. Toggle the setting to Private
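The same permission change can be scripted. Here is a sketch of the parameter hash for EC2’s ModifyImageAttribute API, which removes the public ("all" group) launch permission; the commented call assumes the aws-sdk-ec2 gem:

```ruby
# Builds the parameters for EC2 ModifyImageAttribute that strip the
# public ("all" group) launch permission from an AMI.
def make_private_params(ami_id)
  {
    image_id: ami_id,
    launch_permission: {
      remove: [{ group: 'all' }]
    }
  }
end

# With a configured client (aws-sdk-ec2 gem) this would be:
#   Aws::EC2::Client.new.modify_image_attribute(make_private_params('ami-12345678'))
```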


AWS CLI

If you have aws-cli configured, you can simply run the following for each of your regions:

root@host:~# aws ec2 describe-images --filters Name=is-public,Values=true --owners self
{
    "Images": []
}

You want Images to return an empty array, which indicates all of your images are private (or you have no images at all in the queried region).

Final Thoughts

The AMI registry is an ocean of data that just sits there waiting to be grabbed. From an attacker’s perspective, it is a pretty stealthy reconnaissance method, since in general not a single packet is sent to the owner of the image (unless it calls home?). Luckily, the remediation is fairly straightforward. If you are interested in the proof of concept code that aided this research, you can find it on GitHub.


Credits

Credits to my two friends for their technical review and proofreading of this research!

Thanks for reading!
