Enumerating dependents using BigQuery
Open Source Software (OSS) allows developers to share reusable code across projects, teams and organizations. As a result, many thriving ecosystems of interdependent OSS packages have developed, in which most packages depend on other OSS packages to function.
We compute a full set of transitive dependencies for each version of each package, and we call this the “dependency graph”. This data is available through our website, API, and BigQuery dataset. We also compute the inverse of these dependency graphs, providing the full set of versions that depend on any given version, and we call these “dependents”.
The set of packages that depend on a given package is useful for a number of reasons. This blog post demonstrates how to fetch all the dependents of a package within the deps.dev dataset using BigQuery.
Why do we need dependents?
There are various uses for the set of dependents of a package.
For example, the number of dependents, both direct and indirect, may indicate the level of interest in and adoption of a package. Well-known packages such as react or gopkg.in/yaml.v3 have tens of thousands of dependent packages published in their respective package management systems. Sorting packages by dependent count can help identify some of the most critical packages within OSS ecosystems.
Additionally, when a vulnerability is discovered in a package, its set of dependent packages is highly valuable. It provides insight into the scope of the vulnerability across an ecosystem. In some cases the dependents of the affected package may need to act to help propagate a fix through the software supply chain to end users. Access to dependent sets provides a means to identify such packages.
Finally, OSS maintainers can benefit from being able to identify the many consumers of their packages and better understand how and where their packages are used. For example, this information may help prioritize future work on a package.
Let’s dive into some BigQuery examples. These samples will select the packages that depend on gopkg.in/yaml.v3, but it is easy to adapt them for any other package. Currently the full set of dependents for a given package can only be accessed via BigQuery.
All dependent versions
Our first example fetches a list of all versions of all packages tracked by deps.dev that depend on the Go package gopkg.in/yaml.v3 version v3.0.1.
SELECT Dependent.System, Dependent.Name, Dependent.Version
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND Version = 'v3.0.1';
Currently dependent versions
The previous query fetches all versions of all packages that depend on gopkg.in/yaml.v3 v3.0.1. Multiple versions of some packages will often be included in the result set, so the number of rows returned does not correspond to the number of unique dependent packages.
Additionally, a package may have required gopkg.in/yaml.v3 version v3.0.1 at some time in the past, but may since have removed or updated that dependency requirement.
To select unique packages that currently depend on gopkg.in/yaml.v3 version v3.0.1, we can filter the result set to include only the versions that are the newest release of their package.
SELECT Dependent.System, Dependent.Name, Dependent.Version
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND Version = 'v3.0.1'
  AND DependentIsHighestReleaseWithResolution;
This query can easily be adjusted to find all packages whose highest release depends on any version of gopkg.in/yaml.v3.
SELECT DISTINCT Dependent.System, Dependent.Name
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND DependentIsHighestReleaseWithResolution;
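To see how far the row count and the package count can diverge, the following sketch reports both for v3.0.1 in a single query. It uses only the columns that appear in the queries above. The output column names (DependentVersions, DependentPackages, CurrentDependentPackages) are illustrative, and the query assumes that every dependent belongs to the same system as the package being queried, so that Dependent.Name alone identifies a package.

```sql
-- Compare the number of dependent versions with the number of unique
-- dependent packages for gopkg.in/yaml.v3 v3.0.1.
-- Assumption: all dependents are Go packages, so Dependent.Name is unique
-- per package. The output column names are illustrative only.
SELECT
  -- One row per dependent version.
  COUNT(*) AS DependentVersions,
  -- Unique packages with at least one version that depends on v3.0.1.
  COUNT(DISTINCT Dependent.Name) AS DependentPackages,
  -- Unique packages whose highest release depends on v3.0.1.
  COUNT(DISTINCT IF(DependentIsHighestReleaseWithResolution, Dependent.Name, NULL))
    AS CurrentDependentPackages
FROM
  `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE
  System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND Version = 'v3.0.1';
```

Because the query returns a single row of aggregates, it is a convenient way to check whether a raw row count would overstate the number of packages involved.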
Direct or indirect dependents only
The result sets returned by all the queries provided so far include both direct and indirect dependents of gopkg.in/yaml.v3. To find the packages that import gopkg.in/yaml.v3 directly we can make use of the MinimumDepth column of the Dependents table.
This column contains the minimum depth of the dependency in the corresponding dependency graph. It is a minimum depth because there may be multiple paths to a dependency.
A depth of 1 indicates a direct dependency. A depth greater than 1 indicates an indirect dependency.
The following query selects all unique packages that currently depend on any version of gopkg.in/yaml.v3 directly.
SELECT DISTINCT Dependent.System, Dependent.Name
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND DependentIsHighestReleaseWithResolution
  AND MinimumDepth = 1;
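A related sketch groups the current dependents by MinimumDepth, which shows how they split between direct (depth 1) and indirect (depth greater than 1). The PackageCount alias is illustrative, and the query assumes that Dependent.Name alone identifies a package within the Go system.

```sql
-- Count unique packages whose highest release depends on any version of
-- gopkg.in/yaml.v3, broken down by the minimum depth of that dependency.
-- Assumption: all dependents are Go packages, so Dependent.Name is unique
-- per package. The PackageCount alias is illustrative only.
SELECT
  MinimumDepth,
  COUNT(DISTINCT Dependent.Name) AS PackageCount
FROM
  `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE
  System = 'GO'
  AND Name = 'gopkg.in/yaml.v3'
  AND DependentIsHighestReleaseWithResolution
GROUP BY
  MinimumDepth
ORDER BY
  MinimumDepth;
```

Replacing the GROUP BY with a filter such as MinimumDepth > 1 would instead select only the indirect dependents.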
Limitations of these queries
The queries provided in this post come with some caveats that are worth considering.
The data does not include closed source dependents
The dataset only includes software that has been published to one of the dependency management systems tracked by deps.dev. Consequently, query results cannot include closed source software that depends on a given package.
A package may see widespread use in proprietary applications, but this popularity will not necessarily be reflected in the number of publicly available dependents.
Other dependency resolutions are possible
It is common practice across most dependency management systems to allow libraries to specify a range of compatible versions for each of their dependencies. As a result, most OSS packages can have their dependency requirements met by many different dependency graphs.
The context in which a library is used can determine the exact dependencies that will be installed.
To compute the dependency and dependent relations in the deps.dev dataset, a single dependency graph is resolved for each version of every package tracked by deps.dev. The resolved dependencies should be similar to those obtained when installing the dependencies of a package with native tooling in a clean workspace on a Linux machine. The same applies to API calls that return dependencies, such as GetDependencies.
It is important to note that different dependency resolutions are possible.
Securing the software supply chain is essential, and understanding the complex interrelationships between OSS packages is a key part of that.
This post has shown some of the ways the deps.dev dataset can be used to achieve this goal. We are excited to see what you can do with this dataset.