Base container image identification API

Josie Anugerah and Eve Martin-Jones, Open Source Insights Team

We’re excited to launch our experimental base container image identification API in deps.dev! The base image identification API lets clients look up a published container image by its chain ID. We have 730k unique images indexed so far (and counting).

The new OSV-Scanner v2 release leverages this API to associate security vulnerabilities with specific base images in a container image scan.

What is a base image?

A container image is “a standard unit of software that packages up code and all its dependencies so that the application runs quickly and reliably from one computing environment to another.” All container images begin with a base image that the image author builds upon.

Images are typically created from dockerfiles, which contain a series of commands that build the image when executed. A short dockerfile might look like:


FROM alpine:latest
ADD my-program /
CMD ["/my-program"]

Fig 1. A short example dockerfile.

In this example, the FROM command sets the base image for this image to be alpine:latest. The ADD command copies the binary my-program to the image’s filesystem. The CMD command sets “/my-program” as the command to be executed when running a container from this image.

Base images like alpine:latest allow common instructions to be shared between image authors. Rather than rewriting the instructions needed to set up an alpine environment, any image author can use the FROM command to include the base alpine image. Base images are themselves specified as a dockerfile, that is, a series of executable instructions. For example, the instructions for alpine:latest might look like:


ADD alpine-minirootfs-3.21.2-x86_64.tar.gz /
CMD ["/bin/sh"]

Fig 2. An example base image dockerfile.

To build an image, the container build tool executes the commands in the dockerfile. First, it executes the commands specified in the base image, followed by the custom commands in the image. When building the example image above, the following commands would be executed:


ADD alpine-minirootfs-3.21.2-x86_64.tar.gz /
CMD ["/bin/sh"]
ADD my-program /
CMD ["/my-program"]

Fig 3. The effective commands run for the short dockerfile in Fig 1. The FROM line is replaced by the base image commands.

The output of the build process is a series of filesystem layers (each command produces roughly one layer), along with additional metadata files such as the image manifest and config files, which provide an ordered list of the filesystem layers, each identified by its hash. For example, the manifest file produced by building the example image might look something like:


{
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "layers": [
    { "digest": "sha256:<add-alpine-layer-hash>" },
    { "digest": "sha256:<add-my-program-layer-hash>" }
  ]
}

Fig 4. The layers produced by building the dockerfile in Fig 1. Instructions that don’t produce a filesystem diff (e.g. most CMD instructions) are commonly excluded from the manifest.

Together, these build outputs constitute a container image which can be distributed by a container registry. However, building an image is a lossy process. The original instructions used to build the image (including what base image was used) are not included in the layers and files the build outputs. These instructions and the base image information are therefore also not present in the images that get distributed.

How does knowing the base image help?

Not knowing the base image makes analyzing images for vulnerabilities difficult. While we’ve been using a simple image as an example, in practice, many images are built from multiple base images chained together. If you don’t know which of these images introduced a vulnerability, it’s difficult to remediate that vulnerability.

For example, suppose a user downloaded our example image from a container registry and ran it through a vulnerability scanner like OSV-Scanner. They may find that the second layer introduces a vulnerable package to the filesystem. But if they don’t know whether that layer was created from the instructions in alpine:latest or the instructions in the example dockerfile, it will be difficult to remove the vulnerable package. Knowing where exactly a vulnerability was introduced is crucial to remediating it.

How does the container base image API work?

As mentioned above, an image is made up of filesystem layers, a manifest file and a configuration file. The manifest and config files don’t explicitly tell us what the base image of an image is. However, they do provide an ordered, content-addressed list of the layers.
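As a rough illustration of that ordered list, the sketch below reads the layer diff IDs out of an image config document. It assumes the config follows the OCI image config layout, where the list lives under rootfs.diff_ids; the function name layer_diff_ids and the example hashes are made up for illustration.

```python
import json


def layer_diff_ids(config_json: str) -> list[str]:
    """Return the ordered diff IDs listed in an OCI-style image config."""
    config = json.loads(config_json)
    rootfs = config.get("rootfs", {})
    # Assumption: the config uses the OCI layout, where rootfs.type is "layers".
    if rootfs.get("type") != "layers":
        raise ValueError(f"unsupported rootfs type: {rootfs.get('type')!r}")
    # Order matters: the first entry is the bottom-most (base) layer.
    return list(rootfs.get("diff_ids", []))


# Hypothetical config with placeholder hashes, for illustration only.
example_config = json.dumps({
    "architecture": "amd64",
    "os": "linux",
    "rootfs": {
        "type": "layers",
        "diff_ids": ["sha256:" + "a" * 64, "sha256:" + "b" * 64],
    },
})

for index, diff in enumerate(layer_diff_ids(example_config)):
    print(index, diff)
```

Nothing in this list says where the base image ends and the author's own layers begin, which is the gap the API is meant to fill.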

A layer is an archive of added and deleted files. Each layer can be content-addressed by its diff ID (a hash of the decompressed archive). A sequence of layers in a particular order can be referred to using a chain ID.
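To make the two identifiers concrete, here is a minimal sketch of how they can be computed, assuming the chain ID definition from the OCI image specification: the chain ID of a single layer is its diff ID, and each further layer is folded in by hashing the previous chain ID, a space, and the next diff ID. The function names are illustrative and are not taken from deps.dev.

```python
import hashlib


def diff_id(uncompressed_layer: bytes) -> str:
    """Diff ID: SHA-256 of the decompressed layer archive."""
    return "sha256:" + hashlib.sha256(uncompressed_layer).hexdigest()


def chain_id(diff_ids: list[str]) -> str:
    """Chain ID for an ordered list of diff IDs (OCI image spec definition)."""
    if not diff_ids:
        raise ValueError("at least one diff ID is required")
    # The chain ID of a single layer is just its diff ID.
    chain = diff_ids[0]
    for next_diff in diff_ids[1:]:
        # Fold in each subsequent layer: hash "<previous chain ID> <diff ID>".
        data = f"{chain} {next_diff}".encode("ascii")
        chain = "sha256:" + hashlib.sha256(data).hexdigest()
    return chain


# Placeholder layer contents, for illustration only.
base_layer = diff_id(b"contents of the alpine layer")
app_layer = diff_id(b"contents of the my-program layer")

print(chain_id([base_layer]))
print(chain_id([base_layer, app_layer]))
```

Because each step hashes the previous result, the same layers in a different order yield a different chain ID.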

The container base image API works by accepting an ordered sequence of layers in the form of a chain ID and matching it against the chain IDs of popular base images that deps.dev has indexed. The API returns any image repositories that the chain ID matched.

By producing the chain ID for every prefix of layers for your image and looking them up in the QueryContainerImages API, you can identify what base image(s) your image could be based on.
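The prefix lookup could be sketched roughly as follows. Several details here are assumptions rather than facts from this post: the request path under api.deps.dev/v3alpha, the "results" and "repository" field names in the response, and the choice to treat an HTTP 404 as "no match". Check the deps.dev API documentation and the example code in the deps.dev repo before relying on any of them. The helper names are hypothetical.

```python
import hashlib
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Assumed endpoint path for QueryContainerImages; verify against the API docs.
API_BASE = "https://api.deps.dev/v3alpha/querycontainerimages/"


def prefix_chain_ids(diff_ids: list[str]) -> Iterator[str]:
    """Yield the chain ID of every prefix of the layer list, shortest first."""
    chain = None
    for diff in diff_ids:
        if chain is None:
            chain = diff  # a single layer's chain ID is its diff ID
        else:
            data = f"{chain} {diff}".encode("ascii")
            chain = "sha256:" + hashlib.sha256(data).hexdigest()
        yield chain


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body; a 404 is treated as no match."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: unknown chain IDs return 404
            return {}
        raise


def find_base_images(
    diff_ids: list[str],
    fetch: Callable[[str], dict] = fetch_json,
) -> dict[int, list[str]]:
    """Map the number of leading layers to the repositories that matched them."""
    matches: dict[int, list[str]] = {}
    for depth, chain in enumerate(prefix_chain_ids(diff_ids), start=1):
        body = fetch(API_BASE + urllib.parse.quote(chain, safe=""))
        # Assumed response shape: {"results": [{"repository": "..."}]}.
        repos = [r["repository"] for r in body.get("results", []) if r.get("repository")]
        if repos:
            matches[depth] = repos
    return matches
```

Calling find_base_images with the ordered diff IDs of an image returns, for each matching prefix length, the repositories reported for that prefix. The fetch parameter is there so the lookup can be exercised without network access.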

How can I use this?

The easiest way to try this out is by using the new OSV-Scanner v2 release (blog) which leverages QueryContainerImages to scan containers for vulnerabilities and attribute them to layers.


osv-scanner scan image <image-name>

Example output from the osv-scanner scan image command.

If you just want to use the API to see what base image(s) your image depends on, there is example code in the deps.dev repo that computes the chain IDs from an OCI image tarball and prints any base images reported by our API.

We’re also exploring improvements to the API, such as ordering repository results by popularity and including tags in the result. This is an experimental API available in v3alpha, so we’d love to hear any feedback you have. You can file a GitHub issue at github.com/google/deps.dev or email us at depsdev@google.com.