Spraying our bike shed with some new colors

About a year and a half ago, Loggi's backend codebase consisted mostly of a large Django application and several "nanoservices" written in a wide range of stacks that never got traction: Rust, Elixir, C++, Node.js, you name it.

We felt that our Django stack served a large class of applications very well, but the company was starting to write software that did not fit that preferred stack's paradigm, and we needed a new, different and well-maintained stack in our toolbelt.

So, as decisions small, medium and large tend to start, we wrote a design doc, which I am now sharing in case it helps those facing similar dilemmas. It has been 18 months, and it feels like the experiment spawned by this document worked. Our team grew from 40 to 200 engineers, and our Django stack remains strong in its focus. But many teams with different needs adopted the new stack, unlocking new idioms while still interoperating successfully with the rest of the company.

Without further ado, here is our initial analysis, and how we decided to settle on Kotlin as our second backend language, the JVM as our runtime, Micronaut as the overall framework for it, NoSQL databases as the primary storage, and protocol buffers as our Esperanto.

A pre-covid day of work in the Loggi Tower

Goals

Propose an additional programming stack for the Loggi ecosystem, focused on non-crud applications.
Describe several provisional sub-decisions that cover the relevant dimensions for the new stack and are deemed good enough.
Define the communication surfaces for the new stack.

Non-Goals

Replace the existing Python/Django/Pgsql stack that currently powers most of the company.
Impose the described sub-decisions related to the new stack.

Overview

Currently Loggi mostly relies on a classical Python/Django/Pgsql CRUD architecture to power its operations. Since a lot of the needs of the company fit that paradigm well, attempts to change that status quo have gotten little traction.

That stack, however, suffers from a few downsides, which have become more prominent as the classes of applications in the company become more diversified, a natural result of the company's growth and the technical complexity of the solutions it aspires to implement.

We propose an additional programming stack which aims to better serve non-CRUD applications by improving on the following intrinsic limitations of the Python/Django stack: a) lack of a mature type system, b) poor multi-core/concurrency support.

It is important to note that neither of these downsides is simply a bad thing. They are part of trade-offs that enable some very desirable traits. For example, the dynamically typed nature of Python was a big enabler for the Django ORM layer, one of the main reasons for the framework's success. Also, the existence of the maligned Global Interpreter Lock (GIL) in CPython means a whole class of bugs is less likely in our codebase.

Nonetheless, the other side of the equation is equally real. Codebases with much less dynamic data models, such as our national deliveries integration code, benefit little from the flexibility provided by this stack, and pay a penalty in developer productivity due to the friction involved in exposing stricter data schemas. Others, like our address parsing service, require high parallelism under both CPU-bound and I/O-bound loads, and again the solutions in the Python/Django stack are far-fetched.

In the detailed design below we discuss some of the dimensions where the proposed Kotlin/Micronaut/Nosql stack differs from the Python/Django/Pgsql one, and also how it fulfills the expectations we believe any newly introduced framework should. We also discuss why we pick gRPC/protocol buffers as the basis for interoperability between stacks.

Detailed Design

Multicore

In the Python/Django/Pgsql stack we currently have live, usage of multi-core is accomplished through multiple processes. This approach has merit, but it results in complex settings involving memory usage, blocking primitives, inbound throughput/latency and backend capacity.

It goes without saying that some form of tuning will exist no matter which technical choices are made, since ultimately it is simply a side effect of not having infinite resources. In fact, process-based concurrency is the 12-factor recommendation. But due to the difficulty of achieving reasonable concurrency in the Python ecosystem, we should adopt a more natural framework when this property is necessary.
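As an illustration of what in-process multi-core usage looks like on the JVM, here is a minimal sketch using only the Kotlin and Java standard libraries. The prime-counting workload and the function names are hypothetical, chosen only to have something CPU-bound to split; they are not part of the original design doc.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import kotlin.math.sqrt

// Hypothetical CPU-bound task: count primes in [from, until) by trial division.
fun countPrimes(from: Int, until: Int): Int =
    (from until until).count { n ->
        n >= 2 && (2..sqrt(n.toDouble()).toInt()).none { n % it == 0 }
    }

// Split the range into chunks and count them on a fixed pool of threads
// that all live in the same process and share the same heap.
fun countPrimesParallel(
    limit: Int,
    workers: Int = Runtime.getRuntime().availableProcessors()
): Int {
    val pool = Executors.newFixedThreadPool(workers)
    try {
        val chunk = (limit + workers - 1) / workers
        val tasks = (0 until workers).map { i ->
            Callable { countPrimes(i * chunk, minOf((i + 1) * chunk, limit)) }
        }
        // invokeAll blocks until every chunk has been processed.
        return pool.invokeAll(tasks).map { it.get() }.sum()
    } finally {
        pool.shutdown()
    }
}

fun main() {
    println(countPrimesParallel(100_000))
}
```

The point of the sketch is that the workers are threads inside one process, so there is a single heap and a single set of settings to tune, rather than one per worker process.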

Type system

Although we frame this discussion around the presence of a strong type system, the best way to see this is as the amount of static analysis benefit we get in each ecosystem. Python is quite lacking in this domain. Although PEP 8 style checkers and mypy improve things a bit, and other static analysis and code smell tools also bring some benefit, those are mostly afterthoughts for the language, and struggle to deliver a benefit similar to what you get from statically typed JVM languages, and from Kotlin in particular.

Besides the traditional type support, we highlight the first class support for strongly-typed protocol buffers messages, and the syntax sugar and type system support for preventing nullability problems in Kotlin.
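To make the nullability point concrete, here is a small sketch. The `Recipient` and `Address` types are hypothetical and exist only for illustration; the operators shown (`?.` and `?:`) are standard Kotlin.

```kotlin
// Hypothetical domain types, used only for illustration.
data class Address(val street: String, val complement: String?)
data class Recipient(val name: String, val address: Address?)

// The compiler rejects `recipient.address.complement` because `address` may be null.
// The safe-call (?.) and elvis (?:) operators make the fallback explicit.
fun complementOrDefault(recipient: Recipient): String =
    recipient.address?.complement ?: "no complement"

fun main() {
    val withAddress = Recipient("Ana", Address("Rua A, 10", "apt 42"))
    val withoutAddress = Recipient("Bia", null)
    println(complementOrDefault(withAddress))
    println(complementOrDefault(withoutAddress))
}
```

A missing address or a missing complement both end up in the same explicit fallback, and forgetting to handle either case is a compile error rather than a runtime exception.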

Again, it is important to highlight that a strong type system is not a benefit per se, and the programming community has yet to settle on the trade-offs between code defects and productivity across the whole spectrum of options that exist in this world. But for the scope of this discussion, it does seem that Kotlin positions itself in a rather different camp, for which there is perceived demand in the company.

https://twitter.com/01k/status/1067788059989684224

Notice this is one of the camps where the Node.js ecosystem did not fare very well in our informal comparisons, although it offers great performance through its async-based model, despite not exposing multiple cores to users. Among the more strongly typed options in that ecosystem, only TypeScript has significant traction, and its JavaScript interoperability is not nearly as smooth as Kotlin's interoperability with Java.

GRPC/Protocol Buffers

There are multiple alternatives for communicating between services in the same stack, or between different stacks. REST is probably the most popular choice nowadays, and has ubiquitous support. At the frontend/backend edge, GraphQL is growing. Both technologies are used at Loggi. On the async front, the company also uses Kafka and AMQP (for Celery).

Each of those offers different properties. gRPC is most often chosen in the microservices backend scenario and is highly optimized for performance, leveraging HTTP/2 natively and a highly efficient binary format through protocol buffers. REST focuses on the notion of resources, and hence offers rich caching capabilities deployed across the internet. GraphQL has chosen the query as its central concept, and its niche is client/server communication where both endpoints are controlled by the same company. The async protocols, Kafka and AMQP, are often connected to a CQRS infrastructure.

In our decision, we start with the premise that GRPC is the most efficient choice for backend/backend communication. Furthermore, we emphasize the ubiquitous support of protocol buffers as a schema language, with good schema evolution semantics, which makes it a good choice for serialization in a diverse set of environments.
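To give a feel for why the protocol buffers wire format is compact, the sketch below hand-encodes a single integer field using the base-128 varint scheme that protocol buffers use for integers. It is only an illustration under simplifying assumptions (non-negative values, one field); real code would use the generated classes from the protobuf library, and the function names here are hypothetical.

```kotlin
// Encode a non-negative value as a base-128 varint: 7 bits per byte, least
// significant group first, with the high bit set on every byte except the last.
fun encodeVarint(value: Long): ByteArray {
    require(value >= 0) { "this sketch only handles non-negative values" }
    val out = ArrayList<Byte>()
    var rest = value
    while (rest >= 0x80) {
        out.add(((rest and 0x7F) or 0x80).toByte())
        rest = rest ushr 7
    }
    out.add(rest.toByte())
    return out.toByteArray()
}

// A varint field is its tag (field number shifted left by 3, or-ed with
// wire type 0) followed by the varint-encoded value.
fun encodeVarintField(fieldNumber: Int, value: Long): ByteArray {
    val wireTypeVarint = 0L
    val tag = (fieldNumber.toLong() shl 3) or wireTypeVarint
    return encodeVarint(tag) + encodeVarint(value)
}

// Render bytes as space-separated hex for inspection.
fun hex(bytes: ByteArray): String = bytes.joinToString(" ") { "%02x".format(it) }

fun main() {
    // Field number 1 holding the value 150 takes three bytes: 08 96 01
    println(hex(encodeVarintField(1, 150)))
}
```

The field name never appears on the wire, only its number, which is part of what makes renaming fields a safe schema change as long as the numbers stay stable.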

Given that premise, we claim that there is strong interoperability between protocol buffers/grpc and the other choices, and given its low overhead, we conclude that building the other protocols on top of it instead of the other way around is the right choice. That choice is supported by existing prior art, covering reasonably well all the use cases.

As supporting material, see Lyft's talk on using protocol buffers/gRPC. For REST interoperability, we can leverage simple annotations in protocol buffers, consumed by Envoy Proxy; for GraphQL, we can leverage Rejoiner; and for Kafka, we could adopt Kafka Pixy. Furthermore, gRPC and Envoy are at the core of Istio and the core tracing and observability toolset it provides.

Finally, protocol buffers 3.0 do provide a very good escape hatch through their JSON representation, whenever we believe gRPC, its adapters or protocol buffers are inappropriate.

Object-Relational Mapping

The Micronaut framework has first-class support for ORM. It is arguable whether it is at the same level as Django's, but since it is not a goal to compete on this front, we simply reiterate that as of now we do not allow code in the Kotlin/Micronaut/Nosql stack to use its ORM facilities, avoiding analysis paralysis and fragmentation in our codebase.

Research is not useful if you never get to action

Besides avoiding conflict with existing codebases, this decision has the goal of sparking more interest and developing more expertise in NoSQL development in the company. Traditional SQL applications with very high-level ORM abstractions do have programmer productivity benefits, but often lead to applications that scale poorly and are hard to manage in production, since developers are shielded from too many details. Our main codebase, Loggi-Web, is a classic example of this problem. In the so-called NewSQL databases, scalability is less of an issue, but latency, poor data modeling and cost are still concerns.

Reactive programming

The Reactive Manifesto has declared a series of (arguably too) bold characteristics that modern software should aim to have. Micronaut declares first-class support for many of those characteristics, notably non-blocking I/O. Without proper support across your library set, the goals declared in the manifesto are rarely achieved.

Even with proper support, it is often hard for a codebase to stay faithful to the principles of the manifesto, due to the cognitive overload of the ideas and the extra cost it imposes on programmers. Nonetheless, this stack puts us in a good position to try, and partial success can still be useful.
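As a rough illustration of the non-blocking style, here is a sketch that composes two asynchronous lookups with `CompletableFuture` from the Java standard library. The `fetchOrigin` and `fetchDestination` functions are hypothetical stand-ins for remote calls, and this is not Micronaut's own reactive API, just the underlying idea.

```kotlin
import java.util.concurrent.CompletableFuture

// Hypothetical remote lookups. In a real service these would be
// non-blocking client calls rather than trivial computations.
fun fetchOrigin(orderId: Int): CompletableFuture<String> =
    CompletableFuture.supplyAsync { "warehouse-$orderId" }

fun fetchDestination(orderId: Int): CompletableFuture<String> =
    CompletableFuture.supplyAsync { "customer-$orderId" }

// Combine both results with a callback instead of blocking a thread on each call.
fun describeRoute(orderId: Int): CompletableFuture<String> =
    fetchOrigin(orderId).thenCombine(fetchDestination(orderId)) { from, to -> "$from -> $to" }

fun main() {
    // join() blocks only here, at the edge of the program, to print the result.
    println(describeRoute(7).join())
}
```

The cognitive cost mentioned above shows up quickly: every function in the call chain now returns a future, and a single blocking call anywhere in it quietly undoes the benefit.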

A truly reactive codebase is as common as a dog who can sit like this

Tooling support

Kotlin is made by JetBrains, probably the best authors of full-blown IDEs after Microsoft, and that means its compiler toolchain has first-class support for editor integration. Besides the native integration in IntelliJ, also from JetBrains, it has decent support in Vim, VS Code, and even Emacs.

Micronaut is written by OCI, the authors of Grails, and a notably developer-conscious company. For example, a nice touch in Micronaut is that it offers scaffolding through the mn command line tool, which is easy to install on macOS and Linux.

Developer excitement

Since being adopted by Google for Android development, Kotlin has been rising steadily. It is the fastest growing language on GitHub and the second most loved programming language according to Stack Overflow. Its horizontal versatility, supporting backend programming, Android and other experimental targets like the browser and iOS, may not be practical, but it is certainly exciting.

Micronaut goodnesses

As for Micronaut, although quite new, it packs a punch. Solving some long-standing pains of the most popular JVM framework, Spring Boot, while keeping its (arguably) good idioms, has earned it a lot of goodwill, as can be seen on Twitter. The framework is pretty modern, and compares well and even favorably to many other choices. Within the JVM world, we notice it does not suffer from the slowness and hard-to-debug characteristics of Spring Boot due to runtime reflection (it still suffers from spooky action at a distance due to the use of dependency injection idioms). It is not tied to a build system that is struggling, like Play Framework. It does not have a high memory footprint, like Dropwizard.

It is also relevant that the framework is deemed cloud native, and has some interesting features on that front. In particular, we can highlight environment detection, Jaeger tracing, service discovery and serverless functions. Notice that since we plan to use Istio, some of the features offered are redundant.

Other features worth mentioning are the strong support for configuration with environment variable overrides and reloadable components, fast startup and graceful shutdown support, out-of-the-box support for many popular storage layers (Kafka, Cassandra, MongoDB, Redis, Postgres), and command line support through picocli.
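The idea of configuration with environment overrides can be sketched in a few lines. This is a hand-rolled illustration, not Micronaut's configuration API: the `resolveSetting` function and the rule that maps a key such as `server.port` to a variable named `SERVER_PORT` are assumptions made for this example.

```kotlin
// Resolve a setting by checking an environment-style map first and falling
// back to packaged defaults. The key-to-variable naming rule (upper case,
// dots and dashes turned into underscores) is an assumption of this sketch.
fun resolveSetting(
    key: String,
    defaults: Map<String, String>,
    env: Map<String, String> = System.getenv()
): String? {
    val envName = key.uppercase().replace('.', '_').replace('-', '_')
    return env[envName] ?: defaults[key]
}

fun main() {
    val defaults = mapOf("server.port" to "8080", "cache.ttl-seconds" to "60")
    // A fake environment is passed in so the example does not depend on the machine it runs on.
    val fakeEnv = mapOf("SERVER_PORT" to "9090")
    println(resolveSetting("server.port", defaults, fakeEnv))
    println(resolveSetting("cache.ttl-seconds", defaults, fakeEnv))
}
```

Passing the environment in as a map keeps the lookup a pure function, which makes the override behavior easy to test without touching real environment variables.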

Library availability

Kotlin has great compatibility with Java, in both directions. And Java has the largest set of mature libraries in the enterprise and open source world. For example, gRPC was deemed a key library for this stack, and Google publishes an official Java library.

The Rust ecosystem, although vibrant, does not have nearly the same level of support. The Golang community is a bit more mature on that front, but also has not reached the same level of consolidation.

This was perhaps the most compelling reason to land in a JVM stack as our first alternative stack.

Testability

The JUnit/Mockito combo is a very mature and powerful testing strategy. The extra syntax sugar and support that Kotlin demands is already available, and Micronaut provides a convenience bridge library so tests can easily tap into the dependency injection graph.

Performance

The Kotlin/Micronaut/Nosql stack builds on top of the JVM, one of the fastest VM-based environments. It is still behind ahead-of-time compiled languages, like C/C++ or Rust, and JVM structures are notoriously memory hungry. But those limitations are well known and the ecosystem offers good alternatives when necessary, like the fastutil library for compact data structures or the many javacpp-presets for fast (potentially GPU) math operations.
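To show what "compact data structures" means in practice, here is a minimal sketch of a growable list backed by a primitive `IntArray`. It is a hand-rolled illustration of the general idea behind primitive collections, not fastutil's API; the `CompactIntList` class is hypothetical.

```kotlin
// A growable list of ints backed by a primitive IntArray, so elements are
// stored inline rather than as boxed java.lang.Integer objects, which is
// what a MutableList<Int> holds on the JVM.
class CompactIntList(initialCapacity: Int = 16) {
    private var data = IntArray(maxOf(1, initialCapacity))
    var size = 0
        private set

    fun add(value: Int) {
        // Double the backing array when it is full.
        if (size == data.size) data = data.copyOf(data.size * 2)
        data[size++] = value
    }

    operator fun get(index: Int): Int {
        if (index !in 0 until size) throw IndexOutOfBoundsException("index $index, size $size")
        return data[index]
    }

    fun sum(): Long {
        var total = 0L
        for (i in 0 until size) total += data[i]
        return total
    }
}

fun main() {
    val list = CompactIntList()
    for (i in 1..1000) list.add(i)
    println("size=${list.size} first=${list[0]} sum=${list.sum()}")
}
```

Each element costs four bytes in the backing array, whereas a boxed collection pays for an object reference plus an `Integer` object per element, which is where much of the memory hunger comes from.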

Notice that arguably Node and Golang perform better on I/O heavy loads, or even in general loads. And the fact that we are using Kotlin brings some additional concerns with hidden costs. Nonetheless, this is a trade-off which seems acceptable.

Early adoption risks

We are making two early adoption moves in the proposed stack. One is the Kotlin language, the other is the Micronaut framework. We believe the first is mitigated by the large body of production Kotlin code created since Google declared it an official language for Android development. The fact that we are using it in the backend makes things slightly different, but does not seem enough to justify vetoing the adoption.

The second move, the adoption of the Micronaut framework, is definitely riskier, since its first stable release was a little over a month before this stack was proposed. Micronaut is backed by the same company as the well-established (but shrinking) Grails framework, and it should be the default application context for Grails 4.0, which is a second path for production adoption.

Nonetheless, there are real risks that we will be hit by bugs due to the framework's lack of maturity, and the framework may even be discontinued. Despite those risks, the modern feature set offered by Micronaut is arguably unmatched, except maybe by Spring Boot. If worst comes to worst, migration to Spring Boot would be painful, but possible.

The other frameworks we looked at closely were Spring Boot, Spark Java, Vert.x (all three with Kotlin support), Scala Play, Dropwizard and Express.js. We did not analyze solutions in the Golang ecosystem in depth, for the reasons discussed in the library availability section, but it is worth noting that the ecosystem is very friendly to gRPC in case we eventually need interoperability.

Spring Boot was the second contender, but it is large, complex and slow (to boot), due to its ancient J2EE origins. Its lifecycle management and annotation-centric style are arguably a great productivity boost for a team of diverse skill levels, and are well captured by Micronaut. Since Micronaut was built exactly to compete with Spring Boot on modern application architectures, it is no surprise that it drops the bad parts and keeps the good ones.

Spark Java and Vert.x were interesting options, but the former does not cover much other than the HTTP server API, and much of the value we get from Micronaut is in the consistent backend APIs, cloud integration and lifecycle management. Vert.x offers much of this, but ties us too closely to its internal event loop model. Dropwizard has a better surface, being a great library for monoliths, including health checks and configuration, but it is memory hungry and slow to start. On the other side of the spectrum, Scala Play and its microservices cousin Lagom cover too much surface, much of which we intend to leverage Envoy Proxy for, and are highly tied to Akka, which is notoriously difficult to master, and SBT, which is in rather poor shape.

In the end, Micronaut offered most of what we needed and brought less burden than the others, and the risk seemed worth it. The fact that it is a modern framework that relies on existing mature libraries for almost everything outside its small core decreases our chances of hitting hard-to-solve bugs.

Wrapping it up

A year and a half later, the design doc above seems to have aged well. Kotlin continues to be strong, Micronaut is solid and advancing faster every quarter, Istio has gotten more traction, and gRPC/protocol buffers are everywhere. Developers got exposed to a whole new set of idioms, and multiple new, different applications were successfully brought to production.

Each company or group of developers certainly can and should approach this problem in a different way, and will definitely arrive at different conclusions. But often the journey matters more for success than where you want to land, so hopefully the points we brought here can be useful for others in their own unique scenarios.

If you are looking for a new job opportunity, we are hiring new Loggers to work in Brazil and Portugal. Check out our open positions here!