Skip to main content

· 3 min read
Alvaro Jose

The singleton pattern has got a bad reputation over the years due to be widely overused in the incorrect use cases. With the proliferation of microservices, have APIs become the new singleton?

The Problem

APIs, or application programming interfaces, have become a ubiquitous part of modern software development. They allow different systems and applications to communicate with one another, enabling the creation of complex, interconnected systems that can share data and functionality. However, there has been a growing concern that APIs are being overused, leading to a proliferation of unnecessarily complex and fragile systems that are difficult to maintain and scale.

One reason for the perceived overuse of APIs is the ease with which they can be implemented. With the abundance of API management tools and frameworks available, it is relatively straightforward to expose a set of functionality as an API and make it available to other systems. This has led to a proliferation of APIs, many of which are redundant or unnecessary, adding unnecessary complexity to the overall system.

Another issue is the lack of standardization in the API ecosystem. Each API is typically designed to meet the specific needs of the system it was created for, resulting in a wide variety of different designs and conventions. This can make it difficult for developers to understand and use APIs from other systems, as they may have to learn and adapt to new conventions and patterns each time they encounter a new API.

In addition to these issues, the reliance on APIs can also lead to fragile systems that are difficult to maintain and scale. When multiple systems are tightly coupled through APIs, a change to one system can have cascading effects on others, leading to unexpected behavior and potential failures. This can make it difficult to make changes or updates to a system without the risk of breaking something else.

There are also concerns about the security of APIs. As they allow systems to communicate with one another, they can also provide a potential entry point for attackers to gain access to sensitive data or functionality. Properly securing APIs can be a complex and time-consuming task, and if not done correctly, can lead to significant vulnerabilities.

The Solution

So, what can be done to address these issues? One solution is to use APIs more judiciously, carefully evaluating whether an API is truly necessary before implementing it. This can help reduce the overall complexity of the system and make it easier to maintain and scale.

It's also important to adopt API design standards and guidelines, which can help ensure that APIs are consistent and easy to understand and use. Finally, proper API security practices should be implemented to protect against potential vulnerabilities.

· 4 min read
Alvaro Jose

The Context

Cloud and infrastructure as code have revolutionized our industry. They allowed us to be able to procure infrastructure in a simple, adaptable way.
This allowed us to move from writing huge monolithic applications to write microservices that interact between them.
One of the most accepted definition of a microservice can be expressed as:

A self-contained portion of code that does not share resources with other services, can be deployed independently, and should be easy to rewrite in a small portion of time.

This sounds great when we talk about individual parts of a software projects. Nevertheless, when thinking about systems and how they operate, There is a point to make about granularity as software does never work fully isolated. It requires interactions with other systems to fulfill their purpose.

Most of the monolithic applications of the past had an issue of being over-engineered to allow changes that might never happen.

Could that also happen with microservices?

The Issues

Clarity Of The Domain

When a system grows too much in small pieces, it becomes more and more complex to understand the big picture.
When pieces are too small, domain events start becoming exchange of information in between nodes of a network. All this removes cohesion on the knowledge over the domain of a system, making it difficult to grasp the real intention and capabilities of concepts and actors across a system.

Babel tower Issue

The more parts a system has, the less heterogeneous it becomes. This at the same time translates into a more complex environment with more integrations, frameworks and bigger learning curves that affects delivery. There need to be a balance of when and where in a system a new technology is added. Decisions must be based on needs and not on preferences.

Implicit runtime dependencies

The more a system get split, the more dependency on certain node it will have. This tends to cause more dependencies in between the pieces of your infrastructure-based puzzle where you start having god infrastructure points that become single point of failure, or you have a chain of dependent infra that need to be deployed in a go or certain order.

Hidden Complexity

The more your microservice environment grows, the more it requires a growing support infrastructure for monitoring, alerting and other services not used as part of the main system. This normally is a separate effort which has its cost. The more a system grows, those hidden complexities become a dependency for all the nodes in the system, making it a complex task to evolve and change those dependencies.

Why… if YAGNI

One of the main ideas of microservices was to be able to validate assumptions fast. Before bootstrapping new services or infrastructure, there is a need to ask ourselves about the existence of a service or infrastructure that contains the domain knowledge required for the experiment in the current ecosystem. If we are not careful, experiments won't be experiments. They will be MVPs, where domain knowledge is re-implemented, just for having it as a standalone node on the system.

Repeating Yourself

When we create pieces of code that are independent, there is always a certain level of bootstrapping that is required and repeated in each node of our systems. This will cause not only a set of duplicated code, but also has a development time cost attached to it. Bootstrapping a project in a high granularity system can be complex to standardize.

Microservices, the cloud, and infrastructure as a service have definitely revolutionized our industry, nevertheless as in everything there is a need for balance. Making sure we use the right tool for the job, and we don't over-engineer things, not only at a code level but also at infrastructure level, as everything has a cost.

Conclusion

In conclusion, a macro infrastructure due to microservice obsession can lead to increased complexity and overhead costs, as well as challenges in making changes and updates to the system. While microservices can offer benefits such as increased scalability and flexibility, it is important for organizations to carefully consider their specific needs and choose the right level of granularity for their architecture.

· 2 min read
Alvaro Jose

I have observed quite a few articles lately that elaborate on issues with TDD. Nevertheless, they focused on the first letter but miss the focus of the other two letters.

Not A Testing Strategy

If you take anything out of this article, please think about this quote:

If TDD was about testing it would have been called TDT (test driven testing).

The fact that we do test upfront in TDD does not mean at all that there is a direct relationship with a testing strategy, and as many preach, unit testing is not enough to create robust software.

A Design Strategy

TDD is actually a Design Strategy, this is why the 2 last letter are for driven development. This means that your final code is being moved by those tests and not the other way around.

The design that TDD will move you towards to is minimalistic. Reducing the tendency of overengineering solutions when you don't need them. This brings a reducing time to market, by reducing the accidental complexity.

When doing TDD most developers have the complexity of letting go their egos, the problem when people fight against the practices is because they think to know better. Nevertheless, it tends to generate waste because most code optimizations tend to be premature and most extensibility points will never be modified.

There are places where TDD does not fit, for example while investigating a technology through a spike or PoC because in these cases, the person is exploring knowledge not generating value. In other cases, TDD allows you to bring value in the shortest way possible.

Conclusion

If you are an experienced developer, do not discard TDD because you think you know better, allow it to challenge you. If you are a new developer, learn from the different ways of doing things and understand the value, don't take articles at face value.

· 2 min read
Alvaro Jose

As we develop a product over time, changes need to be made as we need to accommodate new functionality. As most of our systems don't run isolated, and we have clients that used them (ex. public API), We have to keep compatibility at least on a temporary basis. How do we achieve this?

Versions

A common practice is to have different versions for the multiple clients. While simple, it also requires significant effort to maintain as whenever an issue or bug is spotted, multiple places are affected, meaning there are more possibility of side effects.
It also makes it more difficult to make a case for clients to migrate from one to the other due to the contract changes.

This affect mostly negatively the next DORA 4 metrics:

  • Lead Time for change

Versionless: Expand & Contract

As the name says, this strategy intents to have only one state of truth and not a multitude of them. Versionless has been heavily adopted as a principle by GraphQL, for example.
We can achieve this in any code base by implementing a strategy for parallel changes called Expand & Contract, it's call this way due to the phases code goes through. Let's see for example we want to migrate from using one field value to a similar field with a more complex representation.

  • Expand: We add the new 'field' to the existing contract, and add the code to support this strategy on the existing code.
  • Contract: We monitor the usage of the old 'field' to understand when it is possible to deprecate, at that point we remove the old code.

With this, we have a clean source code that we can evolve indefinitely as required by the business.

This affect the next DORA 4 metrics:

  • ✔️ Lead Time for change

· 3 min read
Alvaro Jose

I have already written some other post on this topic. I will go straight to the point on comparing Git Flow (a legacy strategy that most companies use) and Trunk-Based Development.

Gitflow: The Bad & The Ugly

Why do I call it the bad and the ugly? Because it does not allow you to achieve Continuous Deployment.
The idea is that every developer works isolated on their branch, validate on their branch and ask through a merge request to add their code to the X stage branch.


There are multiple issues with this:

  • Code does not exist isolated, we don't deploy isolated code, so the isolated test is not valid as it will require retesting.
  • The peer review process happens at the end, causing a very slow feedback loop. Having to rewrite code that could be avoided.
  • The more time the branch lives, the more it diverges from the original behavior and the more complex it is to merge.
  • Merging can cause complex conflicts that require revalidation, and it might have side effect in other features.
  • As there needs to be validations of the merges, it's normal to have multiple environments that give a false sense of security, increases the $ cost and increases the lead time.
  • Egos and preferences become part of the review process, as it has become an 'accepted' practice that the 'experts' or 'leads' do the reviews.

All of this is red tape to go through is a problem that makes delivery slower, and create a lack of ownership mentality farther away from what happen to the individual branch.

This affects mostly negatively, most of DORA 4 metrics:

  • Deployment frequency
  • Lead Time for change
  • Mean Time To Recovery

Is there a simpler and better way to collaborate on code way?

Trunk-Based Development: The Good

What happens if we all commit to the same branch.

Most of the expressed issues are solved, in this scenario by:

  • Code is never isolated, as we all push code to the same place.
  • Teams that do this practices also practice pair programming, making the peer review process is continuous and synchronous.
  • As individuals push multiple times a day, merge conflicts are non-existent or small.
  • Does not require revalidation, as validation is a continuous stream in the single environment.
  • No ego environment tent to appear as there is no centralize approver of code, so it's not a matter of preference but a team effort and ownership.

As we have seen before, having unfinished code does not need to affect users, as it is common practice to use feature flags and/or branching by abstraction.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery

Conclusion

Simplicity is king. Having a simpler structure enables speed and quality of delivery, as it allow teams to work closely, take shared ownership and act faster related to a smaller change.

· 2 min read
Alvaro Jose

Before we enable code for our clients, we need to test and validate it does what is expected. This could be an entire series of its own (please let me know if you want one), so I will keep it on a high level.

Testing

I could probably spend hours sharing different types of testing strategies and where and why to use them.
In reality, the most important thing, is to make sure we use the correct ratio of the different types of tests, as it will highly affect the time and location of your testing.

This ratio has always been shown as a pyramid with:

  • Unit test: validate individual pieces of logic that are isolated.
  • Integration test: validates interactions with multiple parts of your system or other systems.
  • Integrated test: They test the system as a whole.

Tests are divided in these layers because there is a cost in time and complexity.

This affect the next DORA 4 metrics:

  • ✔️ Change Failure Rate

Validation

Validation differs from testing as it's the confirmation that the behavior is what the user expected, for now, humans are the only ones that can discern this.
As we have seen in the previous chapter, the recommendation is to do this in production, so you get:

  • Get real behaviors of interactions with other systems
  • Get real performance

This affect the next DORA 4 metrics:

  • ✔️ Change Failure Rate

· 3 min read
Alvaro Jose

Now that we know where our code lives, we need to make sure our users get access to the features. For this, we need to get our code to the environment we want to deploy to, and control the rollout (if you are not a big bang release fan).

Blue/Green Deployment: Getting to prod with 0 downtime

What is this?, The concept is simple, we have a set of machines (ex. blue) where we currently have our app running, and we want to deploy. The intent is to create a new set of machines (ex. green) where our new version of the code will run. We would like to validate as much as possible (ex. automated e2e tests) that this new version is up to par with the previous one before moving the traffic and destroy the previous version.

You can see the process in the next graph:

With this, we are trying to achieve a 0 downtime while deploying a new version of our code. This is critical for teams that practice continuous deployment, as you want to avoid having systems down as you deploy multiple times a day.

Enabling feature access to users

there are multiple ways to enable access to users, in between them:

Big Bang Releases

This is the plug and pray solution. Pushing the code and expecting it to work as it's enabled for all users. This is a very dangerous strategy as your blast radius is all your users.

Canary Releases

This is a practice that comes from the mining industry, The idea was the next one:

If a canary is in the same place where humans are inside the mine, when there is a problem with the breathable air it will be the first one to perish.

If we translate this to software, the idea is to have deployed the changes only to one or a few servers. With this, we can monitor this canary instances and act if any issue happens, we reduce the blast radius of issues to only the users who go through that server.

This affect the next DORA 4 metrics:

  • ✔️ Change Failure Rate

This approach provides us a way to reduce the blast radius from a big bang release. Nevertheless, it does not help us to prevent or act faster upon a bug in our code.

Feature Flag Releases

To improve upon the canary release strategy, we can move towards feature flags.

Feature Flags are hiding our code behind a 'flag' this can help decide if the code is enabled or disabled, as in the next image.

There are a multitude of services, libraries & SDKs that allow you to create flags in your code. They help by:

  • Decouple activation of features from the release pipeline.
  • Solving incidents in a matter of seconds.
  • Do a controlled rollout. For example:
    • Enable only for team.
    • Enable for X% of the traffic.
    • Enable for users in a specific country.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

· 3 min read
Alvaro Jose

Our services need to run somewhere, so our users can access it. It's a very common practices to have multiple environments like dev, staging, and prod. Is this actually a good practices?

CI vs. CD vs. CD

when people talk about continuous integration, delivery and deployment, they normally talk about it as a whole.

Nevertheless, let's reflect why these are 3 different practices. As they are steps in a journey, you can do one and not the next one.

  • Continuous integration: allows making reproducible states of the code in multiple places.
  • Continuous Delivery: Now that it's reproducible, it needs to be marked as potentially deployable and provide the ability to deploy it.
  • Continuous Deployment: Delivers the code to your clients and not only to your team as you commit.

The trap of Multiple Environments

As you can imagine, with the previous definition of CI/CD, having multiple environments will never allow you to achieve Continuous Deployment.

The intent of having multiple environments is to reduce change failure rate, are we actually achieving this with the practices? The answer is normally not due to:

  • A non-production environment will never be the same as a production.
    • Different data
    • Different performance
    • Different security practices
    • Etc…
  • Stress and ownership of moving things to production
  • Accumulation of code in lower environments (meaning more bugs).
  • Longer feedback loop.
  • Continuous misalignment due to development cycles in between different teams.

As you can see, this makes a fake sense of safety, but it does not affect positively the change failure rate.

This affects mostly negatively, most of DORA 4 metrics:

  • Deployment frequency
  • Lead Time for change
  • Mean Time To Recovery
  • 〰️ Change Failure Rate

Achieving Continuous Deployment, Only prod, is it so crazy?

How can a team Continuous deployment? The answer tends to be simple, making every commit go to production and testing in it.
Be aware this does not mean to have our users experience possible bugs or see test data, as we can hide functionalities behind toggles, headers, or parameters that allow access to only the development team. As we will see in future installments of this series.

An example strategy is the one in the next diagram.

This allows us to keep only one environment that discriminates in between test and non-test data that can be clean periodically, while it provides the real environment with the real behavior. With this, we solved:

  • Real performance & behavior.
  • Continuous alignment with other teams.
  • Smaller feedback cycles.
  • Control of rollout.
  • Smaller $ cost.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • 〰️ Change Failure Rate

Conclusion

There is no one size fit all, but modern practices tend to go towards simplicity and fast feedback loops. There are many practices involved on this simplicity that enables us to feel comfortable with only production environments. We will talk about them on this series.

· 3 min read
Alvaro Jose

When we talk about observability, we talk about:

Capability of developers to understand the health and status of their application.

We don't want users or clients to be the ones noticing something is wrong. For this, there are multiple tools that fall under the observability category.

Tools

Alarms

This is the first line of defense against issues, the intent is to get notified if any potential issue arises.
The intent of this is to provide a notification if any parameter of our application is out of range (ex. to many 5xx).

This allows us to use our mental bandwidth to focus in creating value and not continuously check if the parameters are in range.

This affect the next DORA 4 metrics:

  • ✔️ Mean Time To Recovery

Metrics

As the name says, this is a set of measurements we track from our code, it allows us to understand the health of individual parts of our system.

This metrics are shown in dashboards that allow us to visually understand what is happening. We can divide metrics dashboards in 2 types:

  • Status: It will give us a really fast overview of the health of the system.
  • Details: It will not tell us what is wrong, but will provide more detailed information to dig deeper into a specific area.

It's important to not mix this 2 together, as they have different purposes. Like with alarms, it helps focus our mental bandwidth in the correct place.

As you see in the previous image, the left represents a detail dashboard that makes it difficult to know on a single view if there is an issue. For this, as in the image on the right, we have a status dashboard that in a single glance we can spot where to look next.

This affect the next DORA 4 metrics:

  • ✔️ Mean Time To Recovery

Logs

This is the lower level you want to go. It should tell you where in the code is your issue, so you can go and fix it.

When thinking about logging, it is significant not log everything. Due to the added noise that this can bring.

This affect the next DORA 4 metrics:

  • ✔️ Mean Time To Recovery

Example

let's get practical on how would this work.

  • Implement your service
  • Create metrics and send them to your metrics system (ex. Datadog, Grafana)
  • Create logs and send them to your logging system (ex. Datadog, Kibana, CloudWatch).
  • Create dashboards:
    • Single Status dashboard. Use only simple boxes with green and red backgrounds that represent in one view the health of your system & subsystems.
    • Multiple Detail dashboards. Create a dashboard for each subsystem with as much data as necessary to understand where the issue is, so you can later pinpoint the root cause in your logs.
  • Create alarms based on the status dashboard boxes.
  • Connect your notification system (ex. Opsgenie, PagerDuty, Slack channel) to the created alarms, so you get push notifications as something goes wrong.

· 4 min read
Alvaro Jose

When we start our journey towards continuous integration & delivery, the first thing to take in count is the mentality. There are a few of them that will make or break our intent. Let's see the most important and also some practices.

Mentality

You build it, you run it

create a DevOps culture, not a Devs vs Ops

This mentality is the idea that the same people who develop the software re in charge to maintain it in good health by observing it.

For many years, this was not the case. Operations & development were handled by different teams. This caused a dystopian situation where each group had a different goal:

  • Devs: deliver as fast as possible. By pushing code to production without observing the side effects of it.
  • Ops: keep system stability.

With the 'you build it, you run it' mentality, devs focus on their service or work, while Ops becomes a product team that focus on providing the correct tooling for Developers.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Embrace Ownership in Failure Culture

the problem is not breaking things, is the inability to recover from it

Normally, developers feel they need a safety net to feel comfortable to introduce changes to production, this tends to translate in delegating the ownership to others trough peer review or other validation step.
This lack of ownership have massive effects on the capacity to recover and the gates that code needs to go through, affecting the feedback cycle.

To improve this failure culture is necessary to promote this behavior, having no blame reduces the amount of stress people go through.

If something fails is not an issue of the individual but of the process itself.

Imagine that every commit goes to production, changes will be so small that fixing or rolling back can be done in minutes or seconds. At the same time, developers will be able to create the correct tooling to feel more comfortable with this practice.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Be a Boy Scout

Don’t continue the same path if you think something can be done better

As individuals, need to bring change to our products. If we see any new practice, tool, services… that can support the work of the team, bring it forward. Don't shy away because the team is currently doing it.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Learn & Adapt

Not everything is solved in the same way, don't follow:

If your only tool is a hammer then every problem looks like a nail

For this, learn and take your time for it. When you have a new problem, as it's possible, you don't have the correct tool in your toolbox.

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Practices

Firefighter Role

The firefighter role is a rotating role inside the team. They are responsible for being the first responder to incidents and helping solve them.
At the same time, to make sure this person does not suffer from cognitive load due to context switching, this person is not involved on the normal pair rotation and development tasks.
In exchange, they focus during the week in improving the specific tooling of the project (ex. DB migration tooling).

This affect the next DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

On Call Rotation

As the development team is also in charge of running the service, some of them will require after working hour support. On call is just this, the disposition of team members to take care of their services around the clock.
This tends to sound bad, but there are ways to not make this suck. I can't express it better than Chris Ford has already done in this page.

This affect the next DORA 4 metric:

  • ✔️ Mean Time To Recovery

Conclusion

These are the starting point to feel comfortable running things in production without the concern that any issue is a catastrophic thing. Failing is not an issue, the important part is to be able to recover as soon as possible from any problem that arises.