Post-Advisory Exposure

James Wetter and Nicky Ringland, Open Source Insights Team

How can a user of open source software (OSS) assess their risk of exposure to a future vulnerability when taking on a new dependency?

Vulnerabilities will always find their way into software, and in an ideal world those vulnerabilities will be fixed in a reasonable amount of time. This is a critical factor for building trust between OSS maintainers and the users of their software.

This blog post looks at the events around the remediation of a vulnerability, and a few ways that trust can be established between maintainers and users of OSS. In particular, we examine how often OSS packages remediate known vulnerabilities and whether their users were left exposed after the vulnerability was publicly disclosed.

An ideal remediation

next-auth is an npm package that provides tools to help implement authentication for the web development framework Next.js. next-auth is popular, with almost 200,000 weekly downloads according to npm. Recently an advisory was published detailing a critical vulnerability in the next-auth package. Due to this vulnerability, an attacker could potentially gain access to another user’s account.

Fortunately for the users of next-auth, the reporter of the vulnerability and package maintainer practiced coordinated vulnerability disclosure. As a result a fixed version of next-auth was already available when this advisory was published. Both versions 4.10.3 and 3.29.10 include a patch remediating the vulnerability.

The advisory itself contains a brief timeline of key events. The vulnerability was discovered by Socket, and privately disclosed to the maintainers of next-auth on the 26th of July. The maintainers acknowledged the private disclosure within 1 hour, and released remediating versions on the 1st of August. Two days later, an advisory disclosing the vulnerability was published. The time between private disclosure and the release of a fix, the time to remediation, was approximately 5 working days.

The events surrounding a coordinated vulnerability disclosure.

This situation is ideal. Both the private disclosure of the vulnerability and rapid response of the package maintainers meant that the two most recent major versions both had patched versions available for users before the publication of the advisory.

By the time the advisory was published, most users of the next-auth package would be able to move to a patched version immediately with little effort. This virtually eliminated the post-advisory exposure time for the many users of the package.

What can go wrong?

Things don’t always work out as well as this, though. There are a few ways in which the process could go awry such as the discovery of a zero-day exploit, or a vulnerability in an unmaintained package.

A zero-day exploit

The events surrounding a hypothetical zero day exploit.

A zero-day exploit occurs when a vulnerability is being actively exploited by the time the package maintainers become aware of the issue. In these situations it may be better to publish an advisory before the maintainers have developed a patch, in order to raise awareness as quickly as possible. This was the case for the well-publicized remote code injection vulnerability in the popular log4j library.

In this scenario it is not reasonable to expect the maintainers to remediate the vulnerability before the advisory is published: increased awareness is a higher priority. As a result, users of the package will be exposed to a publicly known vulnerability until a remediation is made available, or until they remove their dependency on the affected package.

An unmaintained package

The events surrounding a failed coordinated vulnerability disclosure for a hypothetical unmaintained package.

When a vulnerability is discovered in a package that is no longer maintained, there will be no response to private disclosure, leaving the reporter no choice but to publish an advisory without a fix available.

An example of this situation is the once popular npm package parsejson. Its most recent release has an unremedied, high severity vulnerability that was publicly disclosed in 2018. But the package hasn’t seen a new release since 2016. Its GitHub repository has been archived and clearly states that it is no longer maintained. Worryingly, the package is still widely used: npm reports that the package still gets almost 250,000 weekly downloads.

It’s clear that users of OSS should not introduce new dependencies on an unmaintained package like parsejson. Existing users should remove such dependencies from their libraries and applications as quickly as they can. But it can be hard for a developer to know when one of their dependencies is no longer maintained or less actively maintained. Signals to help identify changes in the maintenance status are critical.

What usually happens after an advisory?

For our discussion here, we consider a package to have remediated an advisory when it has a release that

  1. is not affected by the advisory, and
  2. has a greater version number than all affected releases.

The semantics of versions and releases differ between systems. For example, PyPI uses PEP 440, while npm uses semantic versioning.
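As a rough sketch, this definition can be expressed in a few lines of Python. The version parsing here is a deliberate simplification (dotted integers only); a real implementation would need full PEP 440 or semver semantics, including pre-release versions.

```python
def parse(version):
    # Simplified parsing: dotted integers only ("1.4.2" -> (1, 4, 2)).
    # Real ecosystems need PEP 440 (PyPI) or semver (npm) semantics.
    return tuple(int(part) for part in version.split("."))

def is_remediated(all_releases, affected_releases):
    """A package has remediated an advisory when some release
    (1) is not affected by the advisory, and
    (2) has a greater version than every affected release."""
    if not affected_releases:
        return True  # nothing to remediate
    unaffected = set(all_releases) - set(affected_releases)
    highest_affected = max(parse(v) for v in affected_releases)
    return any(parse(v) > highest_affected for v in unaffected)

# next-auth's 4.x line: 4.10.3 fixes the advisory affecting earlier releases.
releases = ["4.10.1", "4.10.2", "4.10.3"]
affected = ["4.10.1", "4.10.2"]
print(is_remediated(releases, affected))  # True
```

Under this definition, a package whose only unaffected releases are lower than some affected release is still unremediated.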

This definition of remediation means that if the greatest major version of a package has a fix available, the package is considered to have remediated the vulnerability even if lesser major versions remain affected. There is more to be said about packages that have multiple major versions, each of which may be fixed independently, but we will leave a discussion of the nuance of vulnerabilities and multiple major versions for another time.

Clearance rates

First let’s take a look at how many known vulnerabilities are remediated.

Across every supported package management system, we see that most package maintainers do respond to vulnerability advisories affecting their packages.

There is considerable variation between ecosystems. The lower clearance rate seen in the Cargo ecosystem is expected. Within that ecosystem, there is a practice of publishing an advisory that a package is unmaintained, such as this advisory and this advisory. Such advisories are not expected to be remediated, but publishing them helps raise awareness of the package’s unmaintained state amongst its users.

Taking a closer look at individual packages, the clearance rate of vulnerabilities gives an indication of the health of the package, and the consequent risk of using it. Some packages have a very high number of known vulnerabilities in older versions, but every one of those vulnerabilities has been remediated.

Such packages are healthy and well maintained, and their high clearance rates are a good indication of that.
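A package's clearance rate can be computed directly from its advisory history. The record shape and advisory IDs below are hypothetical placeholders; real advisory data would come from a database such as OSV or GitHub Advisories.

```python
def clearance_rate(advisories):
    """Fraction of a package's known advisories that have been remediated.
    `advisories` is a list of dicts with a boolean 'remediated' field
    (an illustrative shape, not a real database schema)."""
    if not advisories:
        return None  # no advisories: no signal either way
    fixed = sum(1 for a in advisories if a["remediated"])
    return fixed / len(advisories)

history = [
    {"id": "GHSA-aaaa", "remediated": True},   # placeholder IDs
    {"id": "GHSA-bbbb", "remediated": True},
    {"id": "GHSA-cccc", "remediated": False},
]
print(f"{clearance_rate(history):.0%}")  # 67%
```

A high rate over many advisories, as with the healthy packages above, is a stronger signal than a perfect rate over one or two.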

Post-advisory exposure time

Now let’s consider how long users are exposed to a known vulnerability without a fix. That is, the interval between the publication of an advisory and the publication of a release to remediate it. We call this the post-advisory exposure time.

The PyPI, Cargo and npm packaging systems expose the publication times for each version. Using this data we can examine the post-advisory exposure time.
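For instance, using the next-auth timeline from earlier (with the year assumed to be 2022 from context), the metric can be computed directly from the two publication timestamps; a zero or negative value indicates a successful coordinated disclosure.

```python
from datetime import datetime

def post_advisory_exposure_days(advisory_published, fix_released):
    """Days between advisory publication and the remediating release.
    Zero or negative means a fix was already available when the
    advisory went public (a successful coordinated disclosure)."""
    return (fix_released - advisory_published).days

# next-auth: fix released August 1, advisory published two days later.
advisory = datetime(2022, 8, 3)
fix = datetime(2022, 8, 1)
print(post_advisory_exposure_days(advisory, fix))  # -2
```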

The post-advisory exposure time across the PyPI, Cargo and npm ecosystems for all the vulnerabilities in the dataset.

At a glance these graphs paint a positive picture. Each ecosystem appears healthy, with the majority of vulnerabilities disclosed in an advisory being remediated very quickly. This demonstrates that security is a priority for most maintainers.

But it should be noted that vulnerabilities where coordinated disclosure was successful will have zero post-advisory exposure time (or even negative time!). In npm and PyPI almost 60% of the vulnerabilities in our database were remediated before the publication of the corresponding advisory. Cargo has a much lower percentage, around 16%; more on that shortly.

Let’s direct our attention to cases that did not see a coordinated vulnerability disclosure. The following histograms show the post-advisory exposure time, excluding successfully coordinated disclosures.

The post-advisory exposure time across the PyPI, Cargo and npm ecosystems for the vulnerabilities in the dataset that didn't have a coordinated disclosure.

In all three systems, many vulnerabilities are remediated within 30 days of advisory publication. This includes many zero-day exploits, such as log4shell, that were fixed as quickly as possible, even without the more ideal option of a coordinated vulnerability disclosure.

In the case of Cargo, the number resolved in the first 30 days is a staggering 70% of all vulnerabilities remediated after advisory publication. This is because many maintainers choose to release the remediation on the same day the advisory is published, resulting in non-zero but very brief post-advisory exposure time.

The long tail of vulnerabilities with significant post-advisory exposure time is a valuable signal on the health of the corresponding packages. For developers taking on new dependencies, knowing that they will not be left exposed for long periods of time is critical to their security posture. For existing users of a dependency, being aware of changes to future remediation likelihood of potential vulnerabilities is equally important.

Currently it is hard to know how a given package has previously performed according to this metric. Ideally this information would be easily accessible, allowing potential and existing users to make informed decisions about their dependencies.

Mean time to remediation

The number of known vulnerabilities that a package maintainer has remediated in the past can be used to help build trust between maintainers and users of OSS. Additionally, the length of time users of a package were left exposed to known, unremedied vulnerabilities in the past can provide a more detailed characterization of a package maintainer’s response.

In addition to these signals, Mean Time to Remediation (MTTR) has been proposed as a useful indicator of the quality of a package’s maintenance.

However, the available data about advisories rarely contains timestamps for critical events in the remediation process. For example, most advisory databases, including GitHub Advisories and OSV, do not provide a timestamp field for the private disclosure of the vulnerability or the maintainer’s acknowledgement. And while some advisory write-ups do include an event timeline, these are quite rare.

These missing timestamps make it impossible to compute the time that elapsed between a maintainer being notified of a vulnerability and the release of a remediation, so for now MTTR remains a hypothetical metric.
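If those timestamps were recorded, computing MTTR itself would be trivial. The sketch below assumes a hypothetical list of (private disclosure, fix release) timestamp pairs, which is exactly the data that is missing in practice; the next-auth advisory above is a rare case that records both dates (year assumed to be 2022).

```python
from datetime import datetime

def mean_time_to_remediation(events):
    """Mean days between private disclosure and fix release across a
    package's past advisories. Requires per-advisory timestamps for both
    events -- the fields most advisory databases do not provide."""
    deltas = [(fix - disclosed).days for disclosed, fix in events]
    return sum(deltas) / len(deltas)

events = [(datetime(2022, 7, 26), datetime(2022, 8, 1))]
print(mean_time_to_remediation(events))  # 6.0 calendar days
```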


Vulnerabilities are an inevitable part of software development. The code reuse and efficiency gains made possible by OSS broaden the potential impact of vulnerabilities.

But cooperation between parties that discover vulnerabilities and package maintainers reduces the time that users are left exposed to publicly known vulnerabilities. Thanks to the hard work of OSS maintainers, there is no post-advisory exposure for the majority of vulnerabilities in our advisory database.

Developers should still prepare for less ideal outcomes. Every dependency they introduce increases the risk of exposure to future vulnerabilities. The clearance rate and post-advisory exposure time for past advisories can provide users of OSS assurance about the quality of maintenance their dependencies receive. While past performance may not always predict future behavior, it can be used as a valuable signal to help make informed decisions.

After the Advisory

Nicky Ringland and James Wetter, Open Source Insights Team

This blog is based on a presentation given by Nicky Ringland at Google Open Source Live.

Open source software powers the world. Open source libraries allow developers to build things faster, organizations to be more nimble, and all of us to be more productive.

But dependencies bring complexity. Popular open source packages are often used directly or indirectly by a significant portion of the packages within an ecosystem. As a result, a vulnerability in a popular package can have a massive impact across an entire ecosystem.

Different software ecosystems have different conventions for specifying dependency requirements and different algorithms for resolving them. We will take a look at a couple of high-profile incidents that illustrate some of these differences.

The amplification of vulnerability impact

To measure the potential impact of a vulnerability, we can look at how many dependents it has: that is, how many other packages use a version affected by the vulnerability. We can get a view of an ecosystem by looking at all package versions that are affected, either directly or indirectly, by a vulnerability.

First off: packages that are directly affected. At the time of writing, across all supported packaging systems, over 200 thousand package versions (0.4%) are directly named as vulnerable by a known advisory.

In contrast, almost 15 million package versions (33%) are affected only indirectly, by having an affected package in their dependency graph. That’s two orders of magnitude difference!
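The indirect impact can be measured by walking reverse dependency edges from the directly affected versions. The sketch below uses a toy in-memory graph with made-up package names; the real data set is vastly larger, but the idea is the same.

```python
from collections import deque

def transitively_affected(reverse_deps, directly_affected):
    """Return every package version that is only indirectly affected,
    found by breadth-first search over reverse dependency edges.
    `reverse_deps` maps a package version to the versions depending
    on it (an illustrative in-memory graph)."""
    seen = set(directly_affected)
    queue = deque(directly_affected)
    while queue:
        pkg = queue.popleft()
        for dependent in reverse_deps.get(pkg, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - set(directly_affected)  # keep indirectly affected only

# Hypothetical names: one vulnerable library pulls in three more victims.
reverse_deps = {
    "vuln-lib@1.0": ["middleware@2.1"],
    "middleware@2.1": ["app-a@1.0", "app-b@3.2"],
}
print(len(transitively_affected(reverse_deps, ["vuln-lib@1.0"])))  # 3
```

Even in this tiny graph, one directly affected version produces three indirectly affected ones; at ecosystem scale that multiplier reaches two orders of magnitude.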

Dependencies magnify the impact of vulnerabilities.

That underscores just how hard it is to fix a vulnerability in an ecosystem. When a package explicitly named by an advisory publishes a fix for the issue, the story is far from over. Many users of the packaging ecosystem will still be at risk, because they depend on vulnerable versions of the package deep within their dependency graphs. Fixing the directly affected package is often only the tip of the iceberg.

Addressing vulnerabilities in your dependencies

There are several ways an application maintainer could mitigate a vulnerability affecting one of their dependencies. Let’s be kind to our hypothetical maintainer and consider a simple dependency tree with two layers of dependencies.

An application may have direct and indirect dependencies.

If this maintainer is lucky, they depend on the affected package directly. That means as soon as the affected package publishes a fixed version they can update their project or application to depend on the fixed version.

A vulnerability in a direct dependency.

But if the vulnerable package is among their indirect dependencies the situation could be much more complex.

In the best case scenario, the intermediate packages already depend on the patched version.

A vulnerability in an indirect dependency.

If this is not the case, our hypothetical maintainer may still have a course of action. To update to the fixed version of the indirect dependency the maintainer may be able to specify the fixed version as a minimum for the entire dependency graph. For this to work, however, the fixed version of the affected package and its direct dependents must be compatible. If not, the maintainer may have to wait for a new release of the intermediate dependent.

An indirect dependency can often have its version pinned, but incompatible intermediate dependencies may be a problem.

Another alternative is to remove the dependency on the affected package. But this often involves considerable effort; you would never have added a dependency without good reason, right?

A vulnerability in a dependency can be mitigated by removing that dependency.

In practice, dependency trees are rarely so simple and clean. Usually they are complex, interconnected graphs. Just take a look at the dependency graphs for popular frameworks and tools like express or kubernetes.

The dependency graph of a version of express.

These complex graphs can make remediating a vulnerability far more difficult than the simple examples given above. There may be many paths through which a fix must propagate before it gets to you. Or, in order to remove a dependency, you might need to remove a significant portion of your dependency graph.

For example, consider the many paths by which one package depends on a vulnerable version of log4j:

Various paths through a dependency graph to a vulnerable version of log4j.

With this in mind, perhaps you can imagine why it often takes a long time for a patched version of a popular package to roll out to the ecosystem.

log4shell in the Maven Central ecosystem

On December 9th last year, over 17,000 of the Java packages available from Maven Central were impacted by the log4j vulnerabilities, known as log4shell, resulting in widespread fallout across the software industry. The vulnerabilities allowed an attacker to perform remote code execution by exploiting the insecure JNDI lookups feature exposed by the logging library log4j. This exploitable feature was present in multiple versions, and was enabled by default in many versions of the library. We wrote about this incident shortly after it occurred in a previous blog.

A new version of log4j with the vulnerability patched (albeit with a few false starts due to incomplete fixes) was available almost immediately. So once that patched version was published, had the ecosystem freed itself of log4shell? Unfortunately not. Part of what makes fixing log4shell hard is Java’s conventions for specifying dependency requirements, and Maven’s dependency resolution algorithm itself.

In the Java ecosystem, it’s common practice to specify “soft” version requirements. That is, the dependent package specifies a single version for each dependency, which is usually the version that ends up being used. (The dependency resolution algorithm may choose a different version under certain rare conditions, for example when a different version is already present in the graph.) While it is possible to specify ranges of suitable versions, this is unusual. More than 99% of dependency requirements in the Maven Central ecosystem are specified using soft requirements.

Here’s where Maven’s dependency resolution algorithm comes in. Since a specific version is specified almost all the time, that is almost always the version the resolver picks. So if a newer version with an important bug fix is released, it won’t be picked up automatically. It usually requires explicit action by the maintainer to update the dependency requirements to a patched version.

As a result of the pervasive use of 'soft' dependency requirements, old, vulnerable versions of a package may continue to be depended on long after the release of a patched version.

In this case, consumers of any one of the 17,000-odd packages affected by the log4j vulnerabilities would likely still depend on an affected version of log4j, even after the first fix was published. Ideally the maintainers of the roughly 4,000 packages that directly depend on log4j would promptly release a new version of their package that explicitly requires a fixed version of log4j. Then the maintainers of packages that depend on those packages can update their version requirements, then the maintainers of the packages above them, and so on. There are methods to pin the version of indirect dependencies, which accelerates this process, but many consumers rely on the default behavior of their tools.

It’s been over six months since the log4j advisory was disclosed. How well has the underlying fix to log4shell propagated throughout the ecosystem? A little less than a week after the disclosure, around 13% of affected packages had remediated the issue by releasing a new version. Ten days after disclosure this number had risen to around 25%. Now, several months on, around 40% of the affected packages have remediated the problem. Considering how widespread the problem was, and the complexity of the dependencies between packages, this is amazing progress, but there’s clearly a lot more to go.

Default versions: new or old?

Package managers differ in which versions they choose to install by default. For example, systems like Maven or Go err on the side of choosing earlier matching versions, while npm and Pip tend to choose later versions. This design choice can have a big impact on how a fix rolls out or, conversely, how quickly an exploit can propagate.

Choosing the earlier versions has the benefit of stability; dependency graphs remain stable whether you install today or tomorrow, even if new versions are released. The downside is that the consumer must be conscientious in updating their dependencies when security issues arise.

Choosing the later versions has the benefit of currency; you get the latest fixes automatically just by reinstalling. The downside here is that your dependencies can change underfoot, sometimes in dramatic and unexpected ways.
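The trade-off can be contrasted with a deliberately simplified resolver sketch. This is not any particular package manager's algorithm (Go, for instance, uses minimal version selection over declared minimums), just an illustration of the two policies over a set of versions that satisfy a requirement.

```python
def resolve(candidates, prefer="latest"):
    """Pick one version from those satisfying a requirement.
    Maven- and Go-style resolvers lean toward the earliest (stable)
    choice; npm and pip lean toward the latest (current). Versions
    are simplified to dotted integers."""
    key = lambda v: tuple(int(x) for x in v.split("."))
    ordered = sorted(candidates, key=key)
    return ordered[0] if prefer == "earliest" else ordered[-1]

satisfying = ["1.2.0", "1.2.5", "1.3.1"]
print(resolve(satisfying, prefer="earliest"))  # 1.2.0: stable, but may miss fixes
print(resolve(satisfying, prefer="latest"))    # 1.3.1: current, but can change underfoot
```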

With this in mind, if log4shell had occurred in the npm or PyPI ecosystems the story would have been quite different. In these ecosystems, packages typically ask for the most recent compatible versions of their dependencies.

The conventions of npm mean fixes can propagate throughout the ecosystem automatically.

Looking at the dependency requirements across all versions of all packages in npm we find around three quarters use the caret (^) or tilde (~) allowing a new patch or minor version of the dependency to be automatically selected when available. When adhering to semantic versioning, this means that many users will use the newest release with a compatible API by default.
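The caret and tilde semantics can be sketched as follows. This is a minimal illustration that ignores the edge cases real tools like node-semver handle, such as 0.x majors and pre-release tags.

```python
def satisfies(version, requirement):
    """Minimal semver range check for caret and tilde requirements.
    Simplified: no 0.x special cases, no pre-release tags."""
    v = tuple(int(x) for x in version.split("."))
    if requirement[0] in "^~":
        op, spec = requirement[0], requirement[1:]
    else:
        op, spec = "", requirement
    base = tuple(int(x) for x in spec.split("."))
    if v < base:
        return False
    if op == "^":              # same major: ^1.2.3 -> >=1.2.3 <2.0.0
        return v[0] == base[0]
    if op == "~":              # same minor: ~1.2.3 -> >=1.2.3 <1.3.0
        return v[:2] == base[:2]
    return v == base           # a bare version is an exact pin

print(satisfies("1.4.2", "^1.2.3"))  # True: picked up automatically
print(satisfies("2.0.0", "^1.2.3"))  # False: breaking major, needs a manual update
print(satisfies("1.5.0", "~1.4.0"))  # False: tilde only allows patch bumps
```

This is why a patched 1.x release of a dependency reaches most npm users on their next install, while a fix published as 2.0.0 does not.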

This practice would likely have been a substantial benefit in remediating a log4shell-like event, where a vulnerability is discovered in widely used versions of a popular package.

But as we shall see, sometimes we really, really don’t want to use the latest version.

The case of colors

In early January 2022, the developer of the popular npm packages colors and faker intentionally published several releases containing breaking changes. These were picked up rapidly because the npm resolution algorithm prefers recent releases, and because the JavaScript norm is to use dependency requirements that automatically allow new compatible versions.

The conventions of npm mean new malicious code can propagate throughout the ecosystem rapidly.

At the time of the incident, more than 100,000 packages’ most recent releases depended on a version of colors, and around half of them had a dependency on a problematic version. The following graph shows the dependency flow in the ecosystem over the 72 hours in which the incident unfolded.

The number of packages dependent on different versions of colors.

About half the packages depending on colors remained unaffected throughout the incident because they depended on earlier versions. But for the other half, the exact version of colors in use changed rapidly and widely, depending on when their dependencies were resolved.

The first problematic version was 1.4.44-liberty-2. Due to version naming conventions this isn’t considered a stable version and as a result it wasn’t depended on by many packages.

A few hours later version 1.4.1 was released, and almost all packages using the 1.4 minor version immediately began to depend on this problematic version. Several hours later, 1.4.2 was released, and again most packages affected by the incident immediately depended on this new problematic version. After a few more hours npm stepped in and removed all the bad versions of colors, at which point all dependents moved back to safe versions.

The incident spread through the ecosystem rapidly, but the response of maintainers was just as fast. Between the initial release of bad versions and their removal from npm, a period of less than 72 hours, nearly half of all affected packages were able to mitigate the issue. A small number of packages, about 4% of those affected, removed their dependency on colors entirely, seen as a drop in the total number of dependent packages. Many more, 40% of those affected, pinned the version of colors to a safe one. This can be seen in the gradual increase of packages depending on an unaffected 1.4.x version.

Interestingly, this rapid mitigation was the work of very few people. Just over 1% of the affected packages actually made a release during this period, yet their work resulted in 43% of the total affected packages mitigating the issue. The same open dependency requirements that allowed the issue to spread rapidly also enabled its rapid mitigation.

Every dependency is a trust relationship

The colors and log4shell incidents were very similar in terms of wide-reaching impact, but quite different in onset and response. In the case of log4shell, a new vulnerability was discovered in old and widely used versions, resulting in a need for dependents to move to a new release of the package. In the case of colors, a new release introduced breaking changes. This resulted in an initial automated surge to the problematic version, followed by a concerted effort for dependents to move to an older release of the package.

While the widespread use of open dependency constraints in npm led to the rapid and widespread impact of the colors incident, it was also helpful in its mitigation. Conversely, Maven’s approach of favoring stability made resolving log4shell difficult, but also makes Maven much less susceptible to a colors-type incident. Neither approach is obviously superior, just different.

While there is no silver bullet solution, there are best practices that consumers, maintainers, and packaging system developers can observe to reduce risk. Always understand your dependencies and why they were chosen, and always make sure your dependency requirements are well maintained.

Understanding the Impact of Apache Log4j Vulnerability

James Wetter and Nicky Ringland, Open Source Insights Team

More than 17,000 Java packages, amounting to over 4% of the Maven Central repository (the most significant Java package repository), have been impacted by the recently disclosed log4j vulnerabilities (1, 2), with widespread fallout across the software industry.1 The vulnerabilities allow an attacker to perform remote code execution by exploiting the insecure JNDI lookups feature exposed by the logging library log4j. This exploitable feature was enabled by default in many versions of the library.

This vulnerability has captivated the information security ecosystem since its disclosure on December 9th because of both its severity and widespread impact. As a popular logging tool, log4j is used by tens of thousands of software packages (known as artifacts in the Java ecosystem) and projects across the software industry. Users’ lack of visibility into their dependencies and transitive dependencies has made patching difficult; it has also made it difficult to determine the full blast radius of this vulnerability. Using Open Source Insights, a project to help understand open source dependencies, we surveyed all versions of all artifacts in the Maven Central Repository to determine the scope of the issue in the open source ecosystem of JVM based languages, and to track the ongoing efforts to mitigate the affected packages.

How widespread is the log4j vulnerability?

As of December 16, 2021, we found that over 17,000 of the available Java artifacts from Maven Central depend on the affected log4j code. This means that more than 4% of all packages on Maven Central have at least one version that is impacted by this vulnerability.1 (These numbers do not encompass all Java packages, such as directly distributed binaries, but Maven Central is a strong proxy for the state of the ecosystem.)

As far as ecosystem impact goes, 4% is enormous. The average ecosystem impact of advisories affecting Maven Central is 2%, with the median less than 0.1%.

Maven ecosystem affected by the vulnerability

Direct dependencies account for around 3,500 of the affected artifacts, meaning that at least one of their versions depends directly on an affected version of log4j-core as described in the CVEs. The majority of affected artifacts come from indirect dependencies (that is, the dependencies of one’s own dependencies), meaning log4j is not explicitly defined as a dependency of the artifact, but gets pulled in as a transitive dependency.

Direct dependencies vs. indirect dependencies

What is the current progress in fixing the open source JVM ecosystem?

We counted an artifact as fixed if it had at least one affected version and has released a greater, unaffected stable version (according to semantic versioning). An artifact affected by log4j is considered fixed if it has updated to 2.16.0 or removed its dependency on log4j altogether.

At the time of writing, nearly five thousand of the affected artifacts have been fixed. This represents a rapid response and mammoth effort both by the log4j maintainers and the wider community of open source consumers. That leaves over 12,000 artifacts affected, many of which are dependent on another artifact to patch (the transitive dependency) and are likely blocked.

An example of affected artifact blocked by an intermediate dependency

Why is fixing the JVM ecosystem hard?

Most artifacts that depend on log4j do so indirectly. The deeper the vulnerability is in a dependency chain, the more steps that may be required for it to be fixed. The following diagram shows a histogram of how deeply an affected log4j package (core or api) first appears in consumers’ dependency graphs. For more than 80% of the packages, the vulnerability is more than one level deep, with a majority affected five levels down (and some as many as nine levels down). These packages may require fixes throughout all parts of the tree, starting from the deepest dependencies first.
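The depth at which an affected package first appears can be found with a breadth-first search over the dependency graph. The package names below are made up for illustration.

```python
from collections import deque

def first_occurrence_depth(deps, root, target):
    """Depth at which `target` first appears in `root`'s dependency
    graph (a direct dependency is depth 1), via breadth-first search.
    `deps` maps each package to its direct dependencies."""
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        pkg, depth = queue.popleft()
        if pkg == target:
            return depth
        for dep in deps.get(pkg, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return None  # target not in the graph

# Hypothetical chain: the vulnerable library sits three levels down.
deps = {
    "my-service": ["web-framework"],
    "web-framework": ["config-lib"],
    "config-lib": ["log4j-core"],
}
print(first_occurrence_depth(deps, "my-service", "log4j-core"))  # 3
```

A depth of three means at least two intermediate packages must each publish a fixed release before my-service can pick up the patch by default.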

Depth of log4j dependencies

Another difficulty is caused by ecosystem-level choices in the dependency resolution algorithm and requirement specification conventions.

In the Java ecosystem, it’s common practice to specify “soft” version requirements — exact versions that are used by the resolution algorithm if no other version of the same package appears earlier in the dependency graph. Propagating a fix often requires explicit action by the maintainers to update the dependency requirements to a patched version.

This practice is in contrast to other ecosystems, such as npm, where it’s common for developers to specify open ranges for dependency requirements. Open ranges allow the resolution algorithm to select the most recently released version that satisfies dependency requirements, thereby pulling in new fixes. Consumers can get a patched version on the next build after the patch is available, which propagates up the dependencies quickly. (This approach is not without its drawbacks; pulling in new fixes can also pull in new problems.)

How long will it take for this vulnerability to be fixed across the entire ecosystem?

It’s hard to say. We looked at all publicly disclosed critical advisories affecting Maven packages to get a sense of how quickly other vulnerabilities have been fully addressed. Less than half (48%) of the artifacts affected by those vulnerabilities have been fixed, so we might be in for a long wait, likely years.

But things are looking promising on the log4j front. After less than a week, around 25% of affected artifacts have been fixed. This, more than any other stat, speaks to the massive effort by open source maintainers, information security teams and consumers across the globe.

Where to focus next?

Thanks and congratulations are due to the open source maintainers and consumers who have already upgraded their versions of log4j. As part of our investigation, we pulled together a list of 500 affected packages with some of the highest transitive usage. If you are a maintainer or user helping with the patching effort, prioritizing these packages could maximize your impact and unblock more of the community.

We encourage the open source community to continue to strengthen security in these packages by enabling automated dependency updates and adding security mitigations. Improvements such as these could qualify for financial rewards from the Secure Open Source Rewards program.

You can explore your package dependencies and their vulnerabilities by using Open Source Insights.

  1. When this blog post was initially published, the counts included all packages dependent on either log4j-core or log4j-api, as both were listed as affected in the CVE. The numbers have been updated to count only packages dependent on log4j-core.

Introducing PyPI Support for Open Source Insights

Paul Mathews and the Open Source Insights Team

We’re pleased to announce that Open Source Insights now has support for Python packages hosted on the Python Package Index (PyPI). That means we have over 300k—and counting—Python packages for your perusal, from boto3 to pandas.

Where does the data come from?

We use PyPI’s RSS Feeds to stay abreast of new and updated packages, with an occasional full sync from the Simple Repository API. For each package version, we fetch metadata from the JSON API and analyze it to resolve its dependencies, determine the license, and so on.

Dependency resolution is complex in any language, and Python is no exception. Sometimes you might see an error message about a particular version of a package. The most common reason for this is packages that only provide a source distribution, which specifies its dependencies in a setup.py script that is hard to run safely and may not even be deterministic. This is not a problem with wheels, as they do not require executing arbitrary Python code to discover their dependencies. Of course there are any number of other things that can go wrong, and Python has a long history of packaging formats, so if you find anything not working as expected, don’t hesitate to get in touch.
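To illustrate why wheels are easier to analyze: a wheel ships its dependency list as static Requires-Dist lines in its METADATA file, so they can be read without executing anything. A minimal sketch, using an invented METADATA fragment:

```python
# An illustrative METADATA fragment as it might appear inside a wheel.
# The package name and requirements are made up for this example.
METADATA = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Requires-Dist: requests (>=2.0)
Requires-Dist: numpy ; python_version >= "3.7"
"""

def requires_dist(metadata):
    """Collect the raw Requires-Dist values from wheel METADATA text."""
    prefix = "Requires-Dist:"
    return [line[len(prefix):].strip()
            for line in metadata.splitlines()
            if line.startswith(prefix)]

print(requires_dist(METADATA))
# ['requests (>=2.0)', 'numpy ; python_version >= "3.7"']
```

No setup.py needs to run; the dependency declarations are plain text, which is what makes wheel analysis both safe and deterministic.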

Where do the dependencies come from?

We periodically resolve the full dependencies of every package version we know about. In pip terms, the graph we show for version 1.0.0 of package a consists of the packages that would be installed by running pip install a==1.0.0 in a clean environment with recent versions of setuptools and wheel available.

These graphs depend on the versions of both Python and pip, as well as the operating system, CPU architecture, and so on. It’s not uncommon for packages to publish different wheels for various combinations of all of these, each with its own metadata and potentially distinct dependencies. Currently we perform resolution as if we were running pip 21.1.3 with Python 3.9 on an x86_64 manylinux compatible platform, with more combinations on the way. We think it’s an accurate reproduction, but if you see anything unexpected, please let us know!
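One place this platform specificity is visible is in wheel filenames, which encode the Python, ABI, and platform tags a wheel targets. The toy matcher below (simplified tag rules and invented filenames; not pip's actual compatibility logic) sketches the idea:

```python
def wheel_tags(filename):
    """Split a simple wheel filename into (python tag, abi tag, platform tag).
    Ignores optional build tags for simplicity."""
    stem = filename[:-len(".whl")]
    *_, py, abi, plat = stem.split("-")
    return py, abi, plat

def compatible(filename, py="cp39", plats=("manylinux1_x86_64", "any")):
    """Toy check: accept wheels built for our Python tag (or pure-Python
    "py3" wheels) on one of our accepted platform tags."""
    p, _, plat = wheel_tags(filename)
    return p in (py, "py3") and plat in plats

print(compatible("numpy-1.21.0-cp39-cp39-manylinux1_x86_64.whl"))  # True
print(compatible("numpy-1.21.0-cp39-cp39-win_amd64.whl"))          # False
```

A resolver pinned to one environment, as described above, only ever considers the wheels whose tags match that environment, which is why the resulting graphs can differ across platforms.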

What’s next

We’re excited to add PyPI to our set of supported language ecosystems, and especially keen to start digging into the data and doing some comparative analysis. From our first look, there are plenty of interesting things to uncover, for instance:

  • 4 of the 5 most depended-on packages are all dependencies of the 6th most depended-on package: requests
  • more than half of all package versions on PyPI have zero dependencies, compared to ≈15-25% across Go, npm, Cargo and Maven
  • this small package has one of the lowest ratios of direct to indirect dependents we’ve seen across all package ecosystems.

We’re also working on improving our license recognition and figuring out how to show the difference that enabling various extras makes to a package’s dependency graph.

So slither on in and start exploring! We’ll keep digging into the data and keep you posted on what we discover!

Introducing Open Source Insights

Nicky Ringland and the Open Source Insights team

Modern software is more than just some lines of code checked into a repository. To build almost any program, one must also install packages from other developers. These external dependencies are critical components of today’s software environment, and tooling has been created to make it easy to install dependencies and update them as required. As a result, the past few years have witnessed a phenomenal growth in the open source ecosystem as well as a marked increase in the average number of dependencies for a given package. Meanwhile, many of these packages are being changed—fixed, expanded, updated—regularly.

The rate of change is significant. Our analysis shows that roughly 15% of the packages in npm see changes to their dependency sets each day, while a majority of the change is in packages that are widely used.

This activity affects not just your own software, and not even just the software you call upon, but the entire set of your software’s dependencies, which may be much larger than those listed explicitly by your project. It is common to see one package use a handful of other packages that in turn have a hundred or more dependencies of their own. Many of the most commonly used packages in open source have large dependency trees that will be pulled in by the installation process.
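The gap between the handful of dependencies a project lists and the full set it actually pulls in can be seen with a small reachability computation over a toy graph (all package names invented):

```python
def transitive_deps(graph, package):
    """All packages reachable from `package` through dependency edges,
    excluding the package itself."""
    seen, stack = set(), list(graph.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

# `my-app` declares one dependency, but installs six packages in total.
graph = {
    "my-app": ["framework"],
    "framework": ["http", "logging"],
    "http": ["url-parser", "tls"],
    "logging": ["formatter"],
}
print(sorted(transitive_deps(graph, "my-app")))
# ['formatter', 'framework', 'http', 'logging', 'tls', 'url-parser']
```

In real ecosystems the same effect is just larger: one declared dependency routinely fans out into dozens or hundreds of transitive ones.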

Today’s software is therefore built upon a constantly-changing foundation, and keeping track of that churn is challenging. Your package changes, your dependencies change, their dependencies change, and so on. Even the most diligent developers struggle to keep up beyond letting the tooling download updates to all the dependencies from time to time. Tooling helps manage the updates, but cannot guarantee what the right update is, or when the time is right to apply it.

It’s easy to miss important problems deep in the dependencies, such as security vulnerabilities, license conflicts, or other issues. The tools just do what they are told, and if a nested dependency has an issue, it will be installed regardless. Systems have been compromised or exploited via dependencies that acquired malicious changes that went undetected, sometimes for long periods.

The Open Source Insights project aims to help. It collects information about open source projects—source code, licenses, releases, vulnerabilities, owners, and more—and gathers it into a single, accessible location. Its interfaces let developers and project owners see the full dependency graph of their projects and use it to track release activity, vulnerabilities, licenses, and other information about the components, regardless of how deeply they are nested inside the dependencies.

In short, all the information about a package is connected to all the other packages that depend upon it, and Insights shows the connections. For instance, if your code depends on a package that has a security vulnerability, even if that vulnerability is in a package 10 dependency hops away in a package that you don’t even know about, the Insights page for your package will tell you about it.

The Insights project also helps developers see the importance of their project by showing the projects that depend on them—their dependents. Even a small project is important if a large number of other projects depend on it, either directly or through transitive dependencies.

This blog

To build Open Source Insights, we dove deep into the fundamentals of several different package management systems, collected and organized the metadata of millions of packages, and implemented our own bug-for-bug compatible semver parsers, constraint matchers, and dependency resolvers.

Along the way, we’ve learnt about the wider problem space, and the varied challenges that await the unsuspecting programmer. The tools are changing, inconsistent, and often poorly understood. Many package management systems were not designed with today’s security concerns in mind. Semver constraints are not formally specified, and are implemented differently by different package managers. There is no widespread agreement on foundational questions such as whether it is better to select the newest or oldest matching version, whether the ability to “pin” versions is a good or bad thing (and what that even means), or whether it is good or bad (or even possible) to include multiple versions of the same package in a given program.

In future articles we will explore these issues in detail, comparing the approaches of various communities, so as to contribute to the conversations that push the open source ecosystem forward. We hope that we can converge, as an industry, on some fundamental “good ideas” in the space of managing software dependencies.