What's in a name?

Eve Martin-Jones, Google Open Source Security Team

August 5, 2024

We’re pleased to announce that deps.dev has extended support for querying package versions by their upstream identifiers in our BigQuery dataset. This blog post explores the problem of multiple identifiers for package versions and describes how supporting additional identifiers allows deps.dev users to query for versions more easily.

Knowing what name and version string to use when referring to an open source package can be more difficult than you might expect. Different ecosystems have different rules that specify how you should refer to a package version. For example, in Go, module names are case sensitive, while in npm, package names must be lowercase today (but not historically).

In many open source ecosystems, there are multiple valid identifiers that map to the same version of a package. Let’s take a look at the PyPI flask-babel package to see how this plays out in practice. Say I wanted to depend on this package from my own Python project. What name should I use to refer to it?

Diving into the metadata for the latest version of the package (4.0.0 at the time of writing), I can see that the name given in the PKG-INFO file for the flask_babel-4.0.0.tar.gz release is flask-babel:

A screenshot of the PKG-INFO file for the flask_babel-4.0.0.tar.gz release with the 'Name' field highlighted

This is consistent with the name on the pypi.org package page.

However, if we look at the PKG-INFO file for an older release (2.0.0), we can see the package referred to by a different name, Flask-Babel:

A screenshot of the PKG-INFO file for the Flask-Babel-2.0.0.tar.gz release with the 'Name' field highlighted

At this point we’ve seen this package referred to by two different names: flask-babel and Flask-Babel. Which one is correct?

In fact, they both are. According to the PyPI name normalization rules, a package name “should be lowercased with all runs of the characters ., -, or _ replaced with a single - character”. By those rules, both flask-babel and Flask-Babel normalize to the same string: flask-babel (as do many other names, like FLASK-BABEL, flask_babel or FLASK._-_.babel).
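The rule quoted above is short enough to sketch in a few lines of Python. This is a minimal illustration of the normalization as described, not the code that PyPI or deps.dev actually runs; the function name normalize_name is an assumption made for this example.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase the name and collapse each run of '.', '-' or '_' into one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


# The spellings mentioned in this post.
spellings = ["flask-babel", "Flask-Babel", "FLASK-BABEL", "flask_babel", "FLASK._-_.babel"]

# Every spelling collapses to the same normalized string.
normalized = {normalize_name(s) for s in spellings}
print(normalized)  # {'flask-babel'}
```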

So in PyPI there is no single “correct” name for the flask-babel package. Any name that normalizes to that string can be used with the pip tooling to install that package. For example, the three following commands are equivalent:

pip install flask-babel
pip install Flask-Babel
pip install FLASK._-_.babel

Similarly, if I want to depend on flask-babel from my own pyproject.toml, the following three statements are equivalent:

dependencies = ["flask-babel"]
dependencies = ["Flask-Babel"]
dependencies = ["FLASK._-_.babel"]

This is also true of version strings. While 1.0.0.0, 1.00.0.0 and 1.0.0 are different strings, they all refer to the same version according to the version specifiers section of the Python Packaging User Guide.
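To see why those three strings are equivalent, here is a deliberately simplified sketch that only handles plain dotted numeric versions. It parses each segment as an integer (so 00 becomes 0) and drops trailing zero segments before comparing. The helper name release_key is an assumption for this example, and the sketch ignores epochs, pre-releases, post-releases and local versions; a library such as packaging implements the full rules.

```python
def release_key(version: str) -> tuple[int, ...]:
    """Return a comparable key for a plain dotted numeric version string.

    Simplification: only release segments like "1.0.0" are supported.
    """
    parts = [int(p) for p in version.strip().split(".")]
    # Trailing zero segments do not change which version is meant.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


versions = ["1.0.0.0", "1.00.0.0", "1.0.0"]
keys = {release_key(v) for v in versions}
print(len(keys) == 1)  # True: all three strings name the same version
```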

Allowing users to refer to packages and versions under multiple identifiers can limit typosquatting attacks (if a user accidentally types Flask_Babel, they’ll still get the expected package version). But it can also make some types of analysis tricky. For example, if we want to know every version of flask-babel that exists, we need to look at all the releases whose name normalizes to that string. Similarly, if we want to know every package that depends on flask-babel, we need to search the dependencies of every other PyPI package for any string that normalizes to flask-babel.

To make these aggregations sensible and efficient, it often makes sense to store information keyed by a package’s normalized name/version. That way, we can easily aggregate metadata and dependents/dependencies across package versions without having to normalize the data each time.
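As a rough sketch of what keying by normalized name buys us, the snippet below groups release records under one key per package. The record layout is hypothetical and chosen for illustration; the two flask-babel records use the names and versions seen earlier in this post, and the example_pkg records are made up.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercase the name and collapse each run of '.', '-' or '_' into one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Hypothetical (upstream name, version) records as release metadata might give them.
releases = [
    ("Flask-Babel", "2.0.0"),
    ("flask-babel", "4.0.0"),
    ("Example_Pkg", "1.0"),  # made-up package for illustration
    ("example.pkg", "1.1"),  # same made-up package, different spelling
]

# Normalize once at write time, so reads are a plain dictionary lookup.
versions_by_name = defaultdict(list)
for upstream_name, version in releases:
    versions_by_name[normalize_name(upstream_name)].append(version)

print(versions_by_name["flask-babel"])  # ['2.0.0', '4.0.0']
```

Normalizing when the data is stored means later queries never have to re-normalize every row to find all versions of a package.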

For platforms that serve data about open source packages, this underlying normalization is generally hidden from users. For example, pypi.org redirects different valid spellings of a package name to the right underlying package (see flask-babel, Flask-Babel and FLASK.-.babel). Similarly, the deps.dev API normalizes package names in user requests so that multiple identifiers can be used to refer to the same package (see flask-babel, Flask-Babel and FLASK.-.babel).

This approach works fine in places where name normalization can be performed on the fly by the server (like a website or API). However, normalization can become tricky in places where packages need to be keyed by a single name, like a BigQuery dataset. While normalization is necessary in these cases (because of the aggregation requirements previously mentioned), it can be surprising to users, especially if the normalized name differs from the name the package is commonly known by.

Let’s look at a case where the normalized name of a package differs from its name on pypi.org: the Pygments package. Suppose you’re looking at the Pygments package on pypi.org and run the following query:

SELECT * FROM deps-dev-insights.v1.PackageVersionsLatest
WHERE System="PYPI" AND Name="Pygments";

The query returns no results, so it’s reasonable to assume that deps.dev doesn’t know about the package. What’s actually happening is that the deps.dev BigQuery dataset keys that package by its normalized name, pygments.

This is a problem because it’s fairly common for the normalized name to differ from the name given in the package metadata (this is true for 75,558 PyPI packages, or 13.33%).

For that reason, we’ve introduced a new UpstreamIdentifiers column in BigQuery that contains the pre-normalized name and version strings. Using this column, we can query by any upstream name/version string that a package version uses to refer to itself:

SELECT * FROM deps-dev-insights.v1.PackageVersionsLatest
WHERE System="PYPI" AND "Pygments" IN UNNEST(UpstreamIdentifiers.PackageName);

There are a few caveats. Firstly, it’s possible that not all versions of a package will be returned by this query. Only those versions that refer to themselves by that name will appear in the results.

Secondly, not every string that normalizes to a name will be included in this UpstreamIdentifiers column. There are many possible strings that normalize to e.g. pygments and enumerating all of them isn’t particularly useful. Only identifiers that are encountered upstream during a package version refresh are included.
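Both caveats can be illustrated with a small in-memory stand-in for the table. The row layout below is a hypothetical simplification, not the real BigQuery schema; the names and versions are the flask-babel examples from earlier in this post, where version 2.0.0 calls itself Flask-Babel and version 4.0.0 calls itself flask-babel.

```python
# Hypothetical rows: each is keyed by normalized name and also records the
# upstream names that this specific version was seen to use for itself.
rows = [
    {"name": "flask-babel", "version": "2.0.0", "upstream_names": ["Flask-Babel"]},
    {"name": "flask-babel", "version": "4.0.0", "upstream_names": ["flask-babel"]},
]


def versions_by_upstream_name(rows, upstream_name):
    """Versions whose recorded upstream identifiers include the given name."""
    return [r["version"] for r in rows if upstream_name in r["upstream_names"]]


def versions_by_normalized_name(rows, name):
    """Versions keyed under the given normalized name."""
    return [r["version"] for r in rows if r["name"] == name]


# Caveat 1: only versions that refer to themselves by this name are returned.
print(versions_by_upstream_name(rows, "Flask-Babel"))  # ['2.0.0']

# Caveat 2: a valid spelling never seen upstream matches nothing.
print(versions_by_upstream_name(rows, "FLASK._-_.babel"))  # []

# Querying by the normalized name still returns every version.
print(versions_by_normalized_name(rows, "flask-babel"))  # ['2.0.0', '4.0.0']
```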

Despite these caveats, we hope that this additional column will allow our users to more easily map the identifiers they might see upstream to our BigQuery data. The upstream identifiers are also available via the v3alpha API. Additionally, deps.dev provides a Go package for parsing, ordering and matching versions as defined by Semantic Versioning 2.0.0 that supports extensions and quirks implemented by a number of package management systems. It can be found at github.com/google/deps.dev/util/semver.

If you have any questions, feedback or feature requests, you can reach us at depsdev@google.com, or by filing an issue on our GitHub repo.