Different Styles of Dependency Management

Recently I was hacking out a quick Python script. Everything was fine until I needed to import one common third-party library, and then boom! I was dropped head first into the messy chain of Python dependencies, dynamic typing, virtual environments, and conflicts. I couldn’t even install the necessary library with pip. I’d just get an unintelligible hash of error messages and stack traces.

So off to DuckDuckGo I went to remind myself how one actually builds and packages real world Python programs that do more than print Hello World! It wasn’t pretty, and it took me quite a while to understand it. This was actually harder for me than it might have been for a newbie because I was so invested in the way Java manages dependencies. I’ve spent years working with dependency management in Java at a depth most developers never sink to. I’m an Apache Maven committer. I’ve rewritten most of their documentation about dependencies. I’ve written tools to analyze Java jars, dependency trees, and classpaths. I wrote or edited most of Google’s Java Library Best Practices. I’ve debugged many problems caused by Java class loaders. So I’ve got a pretty good understanding of how Java dynamically links third party dependencies.

But Python? Python doesn’t work like that. And not everything else does either.

In Java there is a JDK and there is a classpath, and they are two different things. The JDK is usually pointed to by $JAVA_HOME. Dependencies beyond the core Java library are usually controlled by some sort of build file, whether that’s a pom.xml, a build.gradle, or a Bazel BUILD file. The build tool adds the jar files for these libraries to the -classpath command line argument when it invokes javac or java. You can also do this manually if you’re not using a build tool, and there are variations, but roughly that’s how it works. Every project has its own classpath.
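To make the classpath mechanics concrete, here’s a rough sketch of the kind of command line a build tool might assemble. It’s written in Python only because that’s the language I was fighting with. The function name, the jar directory, and the target/classes default are all hypothetical, and real build tools resolve versions and transitive dependencies before they ever get to this step.

```python
import os
from pathlib import Path


def build_java_command(jar_dir, main_class, classes_dir="target/classes"):
    """Return the argv a build tool might use to launch a Java program.

    jar_dir, main_class, and classes_dir are hypothetical names used
    only for this sketch.
    """
    # Collect every jar in the project's dependency directory, in a stable order.
    jars = sorted(str(p) for p in Path(jar_dir).glob("*.jar"))
    # Classpath entries are joined with ':' on Unix and ';' on Windows.
    classpath = os.pathsep.join([classes_dir, *jars])
    return ["java", "-classpath", classpath, main_class]
```

The point is that the list of jars belongs to this one invocation. A different project gets a different list, and nothing installed globally changes.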

You can also put jars inside the jre/lib/ext directory, in which case java adds them to the classpath of every program that the VM runs or compiles. However, using jre/lib/ext was always discouraged, and the mechanism was removed entirely in Java 9, so almost no one does that anymore. I don’t think I’ve done that since the 2000s.

Python doesn’t work like that. Programs and projects don’t have their own classpath. Instead they take everything that is installed in the Python environment and only that. This leads to conflicts between programs when they need different versions of dependencies. This is particularly bad because a combination of dynamic typing and community culture means that Python libraries and runtimes introduce API breaking changes far more frequently than Java libraries and runtimes do. We can argue about why that’s the case, but I don’t think there’s any dispute that Java APIs, on average, are more stable than Python APIs.
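You can see this for yourself with nothing but the standard library. The sketch below, which assumes Python 3.8 or later for importlib.metadata, lists the import search path and every distribution visible to whichever interpreter runs it. The function name is my own invention.

```python
import sys
from importlib import metadata


def installed_distributions():
    """Map each distribution visible to this interpreter to its version."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is incomplete
            found[name] = dist.version
    return found


if __name__ == "__main__":
    print("import search path:")
    for entry in sys.path:
        print("  ", entry)
    for name, version in sorted(installed_distributions().items()):
        print(f"{name}=={version}")
```

Every script run by the same interpreter sees this same set of packages. There is no per-program list to consult.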

This problem got much worse when OS vendors like Apple began bundling Python with their systems and even writing system tools in Python instead of C and bash. Now installing a new version of a Python library or the Python runtime wouldn’t just break all your other Python programs. It could brick your operating system too.

The solution Python came up with was not to introduce a per-program classpath like Java did. Instead it created virtual environments. Unlike classpaths, virtual environments are not tied to a project or a program. They are independent and often not even committed to the source repository, any more than a JDK would be. You can run the same program in different virtual environments, and use one virtual environment to build and run multiple unrelated programs.

Virtual environments can also include different versions of Python. One virtual environment can use Python 2 while another uses Python 3. You can have one virtual environment for data science and another for web scraping. In Java terms, it’s like we have multiple different JDK installs on our disk, put all dependencies into jre/lib/ext, and then select from the different JDKs with different libraries installed by changing $JAVA_HOME before compiling and running. Python’s a little cleaner than that because it’s designed to work this way, but roughly that’s what it’s doing.
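Here’s a minimal sketch of that idea using the standard library’s venv module. The two function names are hypothetical. It creates a bare environment, with pip deliberately left out so that it runs quickly and offline, and reads back the pyvenv.cfg file that records which Python installation the environment was built from.

```python
import sys
import venv
from pathlib import Path


def in_virtual_environment():
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


def create_environment(env_dir):
    """Create a bare virtual environment and return its pyvenv.cfg settings."""
    # with_pip=False keeps the sketch fast and avoids any network access.
    venv.create(env_dir, with_pip=False)
    settings = {}
    for line in (Path(env_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings
```

The home entry in the returned settings plays roughly the role $JAVA_HOME plays in the analogy above: it names the base installation, and the environment directory holds the libraries layered on top of it.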

Traditional statically linked languages like C and C++ are somewhat easier to manage because the libraries are linked into the final binary. Nonetheless differences between C compilers and system libraries have led to incredibly complicated systems like autoconf designed to enable portable code to be written and compiled. And even that doesn’t work all the time.

More recently developers have begun using Docker containers to treat the operating system itself, as well as its complete file system, as just another dependency of the project. This reduces concern about which versions of which shared libraries are installed. Code can be written against specific versions that match what’s installed in the Docker container. You can also rely on the locations of other binaries you want to invoke and files you want to read. You know what other processes are and aren’t running, and which ports are already occupied. There are few unexpected environmental differences at runtime. What’s installed on the bare metal doesn’t matter so much. The downside of using Docker is that there’s extra overhead to test and run a program, though given the increasing speed and power of our laptops that matters less and less every year.

I have no idea how Rust, Go, and Ruby manage their libraries. I should figure that out. I do know that the Ruby project I’m working on now uses Docker, so maybe that’s the way things are going.

Monorepos and the associated build tools used at Big Tech companies like Meta and Google avoid many of these problems by settling on a single version of every library and runtime across the company. When a dependency is upgraded, it’s upgraded for everyone at once in a single commit. This works really well, but it doesn’t fit a more heterogeneous environment that spans many organizations, since you can’t update everyone at once or expect everyone to agree on a particular version of Java or Python.

I’m not sure which mode of managing dependencies I prefer. They all have their strengths and weaknesses. Some of the differences are results of longstanding arguments about static versus dynamic typing and static versus dynamic linking. And to some extent, which you prefer depends on the sort of work you’re doing. Compiled GUI applications on Windows are not JavaScript programs running in a web browser, which in turn are not multi-server distributed databases. Different languages fit different jobs. Though I do note that for Python I find myself wishing that there were a common way to specify my dependencies that I can include in the source repo of my project, and that would be automatically picked up by the build. That is, something like Maven’s pom.xml or Gradle’s build.gradle. As far as I can tell, neither virtualenv nor conda offers that.
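As a rough illustration of what I’m wishing for, here’s a sketch that checks a set of pinned versions against whatever the current environment has installed. The function name and the idea of a pins mapping committed next to the source are my own assumptions, not an existing convention, and unlike a pom.xml this only reports mismatches. It doesn’t install or resolve anything.

```python
from importlib import metadata


def check_pins(pins):
    """Compare hypothetical pinned versions against what is installed.

    pins maps distribution names to exact version strings. The result maps
    each mismatched name to (wanted, installed), where installed is None
    if the package is missing from the environment.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

An empty result means the environment matches the pins. Anything else is a list of things to fix before the program can be trusted to run.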
