We all write shitty code.

This seems like a stupidly bold statement from a random person, especially when this person very likely writes shittier stuff than most others. After all, there's a well-known joke across the industry - that Jeff Dean does not write buggy code, and when he does, it's the compiler's fault. So at least we have one counterexample.

Common sense says masters don't write shit, and Linus Torvalds would very likely nod along.

Well, technically speaking, before we criticize this nonsensically bold statement, we need to define what shit is. And here, I think shitty code is anything that does not live up to expectations. Buggy code is shitty - because it doesn't produce consistently correct results when people expect it to. However, in the real world, we tend to operate at a much higher level - we care more about contracts, functionality, and behavior, and less about the exact implementation. For example, when we use an API service, we tend to care about what features it provides, the QPS it can support, and how easily we can use it - in essence, our expectations. We probably won't care whether it runs on bare metal or some public cloud, or whether memory is managed manually, by the compiler, or by garbage collection.

Does that then mean all services/software that fail to match expectations started with shitty code? Not necessarily. Windows 9x was great when it was first introduced and enabled a whole industry. Yet as things evolved, a lot of its design issues were exacerbated and it eventually got completely replaced by the NT kernel. I'm not trying to defend 9x or claim it was a well-designed masterpiece by any means - it had notorious kernel safety issues after all - I'm just using it as an example to show that even if some software had its highlights, other factors could expose the shittiness inside over time - it's just a matter of time. What I believe is that no matter how well systems were designed and implemented at the beginning, they deteriorate over time. We cannot prevent things from turning into shit - we can only slow it down.

What makes things more complicated is that, oftentimes, there are way too many factors that affect how we value software:

At a high level, users' expectations may change, overall industry trends may evolve, or alternative software may leap forward.

Under the microscope, experienced maintainers may leave, tools and development environments may change, or resource priorities may simply shift and the project may go untended for years.

For example, in the cloud migration trend, a lot of companies are trying to ditch Oracle databases in favor of AWS/Azure. E-commerce businesses turn to Salesforce rather than costly, heavyweight SAP systems. It's most likely not a problem with the latter systems per se - in fact Oracle is still irreplaceable in certain areas if you are looking for stable, state-of-the-art single-machine throughput from a traditional RDBMS. It's more about expectation management - if you need to tighten a screw, you need a screwdriver, not a hammer, no matter how good the hammer is, because it won't do the work you expect it to.

With that said, I'm not claiming that engineers have very little control. Quite the contrary: the point I'm trying to make is that despite the many external factors that can accelerate software deterioration, there's still something engineers can control to slow it down, all other things being equal, which is to try to maintain a high standard of shit.

While this does sound very meta, the core ideas I believe in, in practice, are continuous delivery and following best practices.

Continuous Delivery

Continuous delivery allows for faster code iteration, earlier error detection and recovery, and more agile responses to ever-changing requirements. This is a very broad topic in practice and easier said than done, yet it's not an unsolvable problem. At the very basic layer, continuous delivery has three fundamental parts - a scalable VCS that lets people continuously evolve the entire code base, a CI/CD platform for reliable shipment, and a mechanism to automatically disable dead/unmaintained code.

There are plenty of good examples in industry: Google and Facebook have put enormous effort into scaling their VCS platforms to support their monolithic repos. This is not nonsense - while monolithic repos may not work for everyone, engineering-wise it's the most cost-effective way at scale to save time handling dependency problems and to force people to move forward together via atomic, large-scale code mods.

Amazon, on the other hand, offers Apollo internally so engineers can automate the reliable shipment of software to its global fleet. All this automation, at the end of the day, exists to eliminate the software rot caused by human error and laziness.

The third part - eliminating dead/unmaintained software - is often overlooked, and a lot of people/companies have stumbled here. The basic idea is simple - since all software deteriorates over time, the only constant is change, and the only way to avoid forgetting how to do something is to keep doing it. In practice, at the very minimum, this means setting a deadline after which the binary is not allowed to run, in order to force people to rebuild and redeploy, even if no logic has changed. This is indeed hard in reality, because when you build infrastructure-level software for a global hardware fleet, updates are costly and may introduce errors that could otherwise have been avoided, let alone for commercial software, where business contracts prevent things from being upgraded. So people sit on their high horses and say, "if it ain't broke, don't fix it." This is true in most cases. But those people are often also the ones who fall harder when something crucial that hasn't been maintained for years eventually breaks. In practice, this is less a technical issue than one of engineering culture, which essentially comes down to how much time/money the company/team is willing to spend on things that have no short-term tangible value yet prove valuable when the team needs that strength at an unexpected time.
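To make the "binary expiry" idea a bit more concrete, here is a minimal sketch under some assumptions: the BUILD_TIMESTAMP constant and the 90-day limit are hypothetical, and in a real setup the timestamp would be stamped into the artifact by the CI pipeline rather than hardcoded.

```python
import sys
import time

# Hypothetical: a timestamp injected into the artifact at build time,
# e.g. by the CI pipeline via a generated source file or env variable.
# Hardcoded here only for illustration.
BUILD_TIMESTAMP = 1_700_000_000  # seconds since epoch

# Hypothetical policy: binaries older than 90 days refuse to start,
# forcing a rebuild/redeploy even if no logic changed.
MAX_BINARY_AGE_DAYS = 90


def check_binary_freshness() -> None:
    age_days = (time.time() - BUILD_TIMESTAMP) / 86_400
    if age_days > MAX_BINARY_AGE_DAYS:
        sys.stderr.write(
            f"binary is {age_days:.0f} days old "
            f"(limit {MAX_BINARY_AGE_DAYS}); rebuild and redeploy\n"
        )
        sys.exit(1)


if __name__ == "__main__":
    check_binary_freshness()
    # ... normal startup continues here ...
```

The exact mechanism matters less than the effect: nobody gets to forget a service exists, because the service itself refuses to keep limping along unattended.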

Best Practices

The other part - following best practices - is also culture-related. It's often hard to measure, and it's tempting to violate when you're in a rush. A more alarming fact is that it's hard to keep it a habit for a team: the culture takes a long time to build and maintain, yet it's very easy to crush because violations are viral - once someone demonstrates the short-term benefit of cutting a corner and doesn't get penalized, others follow and eventually things get out of control. What's worse, when you ask why they don't follow the practices, they might push back and ask why their work isn't valued, since they saved time by cutting corners yet still delivered the product features, leaving more time to find the next feature to build.

The thing is, at the very beginning, there were no so-called best practices. But every kind of freedom has a price. People gradually found that certain patterns lead to fewer bugs and cleaner, easier-to-maintain code - in other words, code that "rots" more slowly. Even though all software eventually fails if it isn't maintained, at least it fails in a less hilarious way. Unfortunately, a lot of developers nowadays tend to ignore that in favor of pragmatism, mainly because following best practices tends to have no immediate benefit - unless one has some kind of OCD.

There are various ways to tackle this in practice. Google, for example, strives to maintain a centralized standard of coding practices. It operates in a model somewhat like the GPL license - you get the proof that you can write good-quality code from people who already have that proof. In the ideal world, that leads to everyone operating on the same standard in the same language. This also works for small startups, through code review, when you only have so many engineers. It basically operates on a psychological level: when people violate the practices, you feel morally superior and can call out the violation publicly in code review, which in practice leads to more people being willing to follow the majority.

Everything comes with a trade-off, though. In practice this leads to a slight efficiency penalty, because people may have to fix the minutiae (e.g. switching to multithreading in a non-critical code path because in theory you should and other people think so) instead of focusing only on the higher, directional level (i.e. "I can ship my stuff as long as it appears to be working at that moment"). Product-oriented companies like Facebook, on the other hand, tend to give engineers trust and a certain amount of freedom. As a result, different teams/orgs can have different standards and slightly different preferences. Amazon pushes that even further, as each team/org owns separate repos. And when things get bad to a certain degree, they bring out the big lawn mower (fastmod, for example) to fix all similar issues in an atomic way, as sketched below (this is also why being able to automate the elimination of dead code is important for big companies). Some people disagree with this approach, but it does have a valuable point: enforcing individual behavior cannot scale with a fast-changing environment. Practicing large-scale automation matters more than training individuals to stick to "best practices" and do everything right - simply because nobody can over time. That way, people spend more time "solving real-world problems" than "thinking about whether some piece of code could be refactored for reuse". After all, what's the value of the latter in a performance evaluation if it's not measurable?
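For a sense of what the "lawn mower" looks like, here is a toy sketch of the idea - mechanically rewriting a pattern across a whole repo in one sweep so it can land as a single atomic change. fastmod itself is a separate command-line tool; this Python version, with its hypothetical legacy_fetch/fetch_v2 names, is only an illustration of the principle.

```python
import pathlib
import re

# Hypothetical deprecated API call and its replacement.
OLD_CALL = re.compile(r"\blegacy_fetch\(")
NEW_CALL = "fetch_v2("


def rewrite_repo(root: str) -> int:
    """Rewrite every match under `root` and return the number of files changed."""
    changed = 0
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text()
        new_text = OLD_CALL.sub(NEW_CALL, text)
        if new_text != text:
            path.write_text(new_text)
            changed += 1
    return changed


if __name__ == "__main__":
    print(f"rewrote {rewrite_repo('.')} files")
```

The payoff is that one person (or one tool run) cleans up thousands of call sites at once, instead of hoping every individual engineer remembers to do it.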

Unfortunately, not everyone thinks that way. A lot of people, especially in the open source community, believe that before you become a pure problem solver, you are an engineer who writes shit. And every piece of shit can fall apart in a different way. The idea of best practices is to let other people follow the logic of your shit more easily, require less context switching, and locate problems faster when things break. This is why people came up with design patterns. To the problem solvers' point, they are useless, because they don't address a single real-world problem, at least not directly. Others would counter that they let you write shit with low coupling and high cohesion. From my perspective, the value of following such best practices is to let the shit break, if eventually it must, in a less laughable way.

There are examples of such failures - not long ago, Facebook updated its server code, which led to massive app-launch crashes in apps including Spotify, TikTok, and Google, due to a "simple" type mismatch error (this GitHub issue reveals some technical details). People laughed at that, not because it's yet another critical bug, but because it's an example of the price of ignoring best practices, and hence the software broke in an interesting way. Some people claim this calls for better testing and type checking. IMHO, that's more of a high-horse solution, because there's always something you'll mis-test in large-scale software. Over time, if no issues happen, people get complacent and bored, so issues emerge again and similar situations start all over. General industry best practice tells us that the only way to fundamentally prevent such issues is to keep the interface in sync. There are various ways to achieve that, one being using an IDL to define it (Thrift, Protobuf, you name it). I'm not a mobile developer, so I'm sure there are reasons/trade-offs behind why no such thing was used in this case, some of which might not be purely technical. But regardless, the point is that these practices are not new. They came from people who failed before us - coding to interfaces is not a new idea; it's been around for decades in the OOP world. Nothing fails immediately when you ignore these practices, but eventually you join the other group and advocate for them.
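To make "keep the interface in sync" slightly more concrete, here is a minimal sketch of the principle, not the actual SDK or incident code - the SessionInfo schema and its fields are hypothetical. The point is that the shape of a server response is declared once, and every payload is validated against that declaration at the boundary, so a type mismatch surfaces as one explicit, handleable error instead of a crash deep inside the client.

```python
from dataclasses import dataclass, fields


@dataclass
class SessionInfo:
    # Hypothetical schema for a server response; in practice this would be
    # generated from an IDL (Thrift, Protobuf, ...) shared by client and
    # server, so the two cannot silently drift apart.
    user_id: str
    is_active: bool


class SchemaMismatch(Exception):
    pass


def parse_session(payload: dict) -> SessionInfo:
    # Check every declared field's presence and type at the boundary.
    kwargs = {}
    for f in fields(SessionInfo):
        if f.name not in payload:
            raise SchemaMismatch(f"missing field: {f.name}")
        value = payload[f.name]
        if not isinstance(value, f.type):
            raise SchemaMismatch(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
        kwargs[f.name] = value
    return SessionInfo(**kwargs)
```

With something like this in place, a payload such as {"user_id": "42", "is_active": "yes"} is rejected with a clear SchemaMismatch at parse time rather than blowing up somewhere far away, much later.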

All software decays. The idea of software engineering, at the end of the day, is not to prevent people from writing shit. But if we keep trying to maintain a high standard of shit, or at least to get closer to it, people will be more willing to accept the fact that things will eventually break. And when that happens, more people will be willing to help out - and this is how communities are formed, not just around end results, but around reputation and expectation.

After all, we should make each shit do one thing, and do it well.