23 posts tagged with "programming"

· 3 min read
Alvaro Jose

I have already written other posts on this topic, so I will go straight to the point and compare Git Flow (a legacy strategy that most companies use) with Trunk-Based Development.

Gitflow: The Bad & The Ugly

Why do I call it the bad and the ugly? Because it does not allow you to achieve Continuous Deployment.
The idea is that every developer works in isolation on their own branch, validates on that branch, and asks through a merge request to add their code to the branch for the X stage.

There are multiple issues with this:

  • Code does not exist in isolation and we don't deploy isolated code, so a test run in isolation is not valid and the change has to be retested after merging.
  • The peer review process happens at the end, causing a very slow feedback loop and rewrites that could have been avoided.
  • The longer the branch lives, the more it diverges from the original behavior and the more complex it is to merge.
  • Merging can cause complex conflicts that require revalidation, and it might have side effects on other features.
  • As the merges need to be validated, it's normal to have multiple environments, which give a false sense of security, increase the $ cost, and increase the lead time.
  • Egos and preferences become part of the review process, as it has become an 'accepted' practice that the 'experts' or 'leads' do the reviews.

All this red tape makes delivery slower and creates a lack of ownership: developers stop feeling responsible for what happens beyond their individual branch.

This negatively affects most of the DORA 4 metrics:

  • Deployment frequency
  • Lead Time for change
  • Mean Time To Recovery

Is there a simpler and better way to collaborate on code?

Trunk-Based Development: The Good

What happens if we all commit to the same branch?

In this scenario, most of the issues above are solved:

  • Code is never isolated, as we all push code to the same place.
  • Teams that follow this practice usually also practice pair programming, which makes the peer review process continuous and synchronous.
  • As individuals push multiple times a day, merge conflicts are non-existent or small.
  • No revalidation is required, as validation is a continuous stream in a single environment.
  • Ego-driven environments tend not to appear, as there is no central approver of code, so it's not a matter of preference but of team effort and ownership.

As we have seen before, having unfinished code does not need to affect users, as it is common practice to use feature flags and/or branching by abstraction.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
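
To make the idea of keeping unfinished code on trunk more concrete, here is a minimal sketch of branching by abstraction combined with a flag. All names (`PriceCalculator`, `LegacyCalculator`, `NewCalculator`, the `new_calculator` flag) are made up for illustration and do not come from any real project.

```python
from typing import Protocol


class PriceCalculator(Protocol):
    """Abstraction that callers depend on instead of a concrete implementation."""

    def total(self, amounts: list[float]) -> float: ...


class LegacyCalculator:
    """Existing behavior that users rely on today."""

    def total(self, amounts: list[float]) -> float:
        return sum(amounts)


class NewCalculator:
    """Work in progress: committed to trunk, but not active for users yet."""

    def total(self, amounts: list[float]) -> float:
        # Hypothetical new rule: round the total to 2 decimals.
        return round(sum(amounts), 2)


def get_calculator(flags: dict[str, bool]) -> PriceCalculator:
    # The flag decides which implementation sits behind the abstraction.
    if flags.get("new_calculator", False):
        return NewCalculator()
    return LegacyCalculator()


legacy = get_calculator({})
new = get_calculator({"new_calculator": True})
```

Callers only know about `PriceCalculator`, so the new implementation can be pushed in small commits while the legacy one keeps serving users.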


Simplicity is king. A simpler structure enables speed and quality of delivery, as it allows teams to work closely together, take shared ownership, and act faster on smaller changes.

· 2 min read
Alvaro Jose

Before we enable code for our clients, we need to test and validate that it does what is expected. This could be an entire series of its own (please let me know if you want one), so I will keep it at a high level.


I could probably spend hours sharing different types of testing strategies, and where and why to use them.
In reality, the most important thing is to use the correct ratio between the different types of tests, as it will strongly affect how long your testing takes and where it happens.

This ratio has always been shown as a pyramid with:

  • Unit tests: validate individual pieces of logic in isolation.
  • Integration tests: validate the interactions between multiple parts of your system or with other systems.
  • Integrated tests: validate the system as a whole.

Tests are divided into these layers because each one has a different cost in time and complexity.

This affects the following DORA 4 metrics:

  • ✔️ Change Failure Rate
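
As a rough illustration of why the layers differ in cost, the sketch below puts a unit test next to an integration test. The functions `apply_discount` and `save_price` are hypothetical, and an in-memory SQLite database stands in for a real external system.

```python
import sqlite3


def apply_discount(price: float, percent: float) -> float:
    """Pure logic: the target of a unit test."""
    return price * (1 - percent / 100)


def save_price(conn: sqlite3.Connection, name: str, price: float) -> None:
    """Touches a database: the target of an integration test."""
    conn.execute("INSERT INTO prices (name, price) VALUES (?, ?)", (name, price))
    conn.commit()


def test_unit_apply_discount() -> None:
    # Fast and isolated: no I/O, no other component involved.
    assert apply_discount(200.0, 50) == 100.0


def test_integration_save_price() -> None:
    # Slower and more complex: needs a database (here an in-memory one).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (name TEXT, price REAL)")
    save_price(conn, "book", apply_discount(200.0, 50))
    row = conn.execute("SELECT name, price FROM prices").fetchone()
    assert row == ("book", 100.0)
    conn.close()


test_unit_apply_discount()
test_integration_save_price()
```

The unit test needs nothing but the function itself, while the integration test has to set up and tear down a dependency, which is where the extra time and complexity come from.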


Validation differs from testing: it is the confirmation that the behavior is what the user expected. For now, humans are the only ones who can discern this.
As we have seen in the previous chapter, the recommendation is to do this in production, so you get:

  • Real behavior in the interactions with other systems
  • Real performance

This affects the following DORA 4 metrics:

  • ✔️ Change Failure Rate

· 3 min read
Alvaro Jose

Now that we know where our code lives, we need to make sure our users get access to the features. For this, we need to get our code to the environment we want to deploy to, and control the rollout (if you are not a big bang release fan).

Blue/Green Deployment: Getting to prod with 0 downtime

What is this? The concept is simple: we have a set of machines (ex. blue) where our app is currently running, and we want to deploy. The intent is to create a new set of machines (ex. green) where the new version of the code will run. We want to validate as much as possible (ex. with automated e2e tests) that this new version is on par with the previous one before moving the traffic and destroying the previous version.

You can see the process in the next graph:

With this, we are trying to achieve zero downtime while deploying a new version of our code. This is critical for teams that practice continuous deployment, as you want to avoid having systems down while you deploy multiple times a day.
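
The switch described above can be sketched in a few lines. This is only a toy model: `Router` is a hypothetical stand-in for a load balancer, and the `is_healthy` callback stands in for whatever validation (ex. automated e2e tests) you run against the new environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str     # ex. "blue" or "green"
    version: str  # version of the code running on these machines


class Router:
    """Hypothetical stand-in for a load balancer pointing at one environment."""

    def __init__(self, live: Environment) -> None:
        self.live = live

    def deploy(self, candidate: Environment,
               is_healthy: Callable[[Environment], bool]) -> bool:
        # Validate the new environment before it receives any user traffic.
        if not is_healthy(candidate):
            return False  # traffic stays on the old environment
        # Move the traffic in one step; the old environment can now be destroyed.
        self.live = candidate
        return True


router = Router(Environment("blue", "1.0"))
switched = router.deploy(Environment("green", "1.1"), is_healthy=lambda env: True)
rejected = router.deploy(Environment("blue", "1.2"), is_healthy=lambda env: False)
```

Users only ever see a fully validated environment, and a failed validation leaves the running version untouched.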

Enabling feature access to users

There are multiple ways to enable access for users, among them:

Big Bang Releases

This is the plug and pray solution. Pushing the code and expecting it to work as it's enabled for all users. This is a very dangerous strategy as your blast radius is all your users.

Canary Releases

This is a practice that comes from the mining industry. The idea was the following:

If a canary is in the same place where humans are inside the mine, when there is a problem with the breathable air it will be the first one to perish.

If we translate this to software, the idea is to deploy the changes to only one or a few servers. We can then monitor these canary instances and act if any issue happens, which reduces the blast radius of issues to only the users who go through those servers.

This affects the following DORA 4 metrics:

  • ✔️ Change Failure Rate

This approach gives us a way to reduce the blast radius compared to a big bang release. Nevertheless, it does not help us prevent a bug in our code or act faster upon it.
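
One possible way to picture canary routing is the sketch below. It assumes routing is decided per user with a stable hash; the server names and the `pick_server` function are invented for the example, and real setups usually do this in the load balancer rather than in application code.

```python
import hashlib


def pick_server(user_id: str, stable: list[str], canary: list[str],
                canary_percent: int) -> str:
    """Route a fixed share of users to the canary servers, the rest to stable."""
    # Hash the user id so the same user always lands in the same bucket (0-99).
    digest = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16)
    bucket = digest % 100
    pool = canary if bucket < canary_percent else stable
    return pool[digest % len(pool)]


stable_servers = ["stable-1", "stable-2", "stable-3"]
canary_servers = ["canary-1"]

routes = [pick_server(f"user-{i}", stable_servers, canary_servers, 5)
          for i in range(1000)]
canary_share = routes.count("canary-1") / len(routes)
```

With `canary_percent=5`, roughly 5% of users go through the canary, so a bug in the new version only reaches that share of the traffic.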

Feature Flag Releases

To improve upon the canary release strategy, we can move towards feature flags.

Feature flags hide our code behind a 'flag' that decides whether the code is enabled or disabled, as in the next image.

There is a multitude of services, libraries & SDKs that allow you to create flags in your code. They help by:

  • Decoupling the activation of features from the release pipeline.
  • Solving incidents in a matter of seconds.
  • Doing a controlled rollout. For example:
    • Enable only for the team.
    • Enable for X% of the traffic.
    • Enable for users in a specific country.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate
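
The rollout rules for feature flags listed above (team only, X% of traffic, specific country) could look roughly like the sketch below. It is a hand-rolled illustration, not the API of any particular flag service or SDK; `Flag`, `User` and their fields are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class User:
    id: str
    country: str
    is_team_member: bool = False


@dataclass
class Flag:
    name: str
    enabled: bool = True    # kill switch: False disables the feature for everyone
    team_only: bool = False
    countries: set[str] = field(default_factory=set)  # empty set = no country rule
    percent: int = 100      # share of the remaining users, 0-100

    def is_enabled_for(self, user: User) -> bool:
        if not self.enabled:
            return False
        if self.team_only:
            return user.is_team_member
        if self.countries and user.country not in self.countries:
            return False
        # Stable bucket per (flag, user) so a user does not flip between requests.
        key = f"{self.name}:{user.id}".encode("utf-8")
        return int(hashlib.sha256(key).hexdigest(), 16) % 100 < self.percent


new_checkout = Flag(name="new_checkout", team_only=True)
dev = User(id="u1", country="ES", is_team_member=True)
customer = User(id="u2", country="DE")
```

Setting `enabled=False` is what makes recovery a matter of seconds: the code stays deployed, but no user reaches it.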

· 3 min read
Alvaro Jose

Our services need to run somewhere so our users can access them. It's a very common practice to have multiple environments like dev, staging, and prod. Is this actually a good practice?

CI vs. CD vs. CD

When people talk about continuous integration, delivery, and deployment, they normally talk about them as a whole.

Nevertheless, let's reflect on why these are 3 different practices. As they are steps in a journey, you can do one and not the next one.

  • Continuous Integration: allows making reproducible states of the code in multiple places.
  • Continuous Delivery: Now that it's reproducible, it needs to be marked as potentially deployable and provide the ability to deploy it.
  • Continuous Deployment: Delivers the code to your clients and not only to your team as you commit.

The trap of Multiple Environments

As you can imagine, with the previous definition of CI/CD, having multiple environments will never allow you to achieve Continuous Deployment.

The intent of having multiple environments is to reduce the change failure rate. Are we actually achieving this with the practice? The answer is normally no, due to:

  • A non-production environment will never be the same as production.
    • Different data
    • Different performance
    • Different security practices
    • Etc…
  • Stress and ownership of moving things to production
  • Accumulation of code in lower environments (meaning more bugs).
  • Longer feedback loop.
  • Continuous misalignment due to development cycles in between different teams.

As you can see, this creates a false sense of safety, but it does not positively affect the change failure rate.

This negatively affects most of the DORA 4 metrics:

  • Deployment frequency
  • Lead Time for change
  • Mean Time To Recovery
  • 〰️ Change Failure Rate

Achieving Continuous Deployment, Only prod, is it so crazy?

How can a team achieve Continuous Deployment? The answer tends to be simple: make every commit go to production and test it there.
Be aware that this does not mean our users have to experience possible bugs or see test data, as we can hide functionality behind toggles, headers, or parameters that give access only to the development team, as we will see in future installments of this series.

An example strategy is the one in the next diagram.

This allows us to keep only one environment that discriminates between test and non-test data, where test data can be cleaned periodically, while providing the real environment with the real behavior. With this, we get:

  • Real performance & behavior.
  • Continuous alignment with other teams.
  • Smaller feedback cycles.
  • Control of rollout.
  • Smaller $ cost.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • 〰️ Change Failure Rate
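
A tiny sketch of how a single environment could tell test data apart from real data is shown below. The header name `X-Test-Request` and the `OrderStore` class are assumptions made up for this example; the original strategy does not prescribe a specific mechanism.

```python
from dataclasses import dataclass

TEST_HEADER = "X-Test-Request"  # hypothetical header name used by the dev team


@dataclass
class Order:
    item: str
    is_test: bool


class OrderStore:
    def __init__(self) -> None:
        self.orders: list[Order] = []

    def create(self, item: str, headers: dict[str, str]) -> Order:
        # Mark the record as test data when the team's header is present.
        order = Order(item=item, is_test=headers.get(TEST_HEADER) == "true")
        self.orders.append(order)
        return order

    def visible_to_users(self) -> list[Order]:
        return [order for order in self.orders if not order.is_test]

    def purge_test_data(self) -> int:
        # Periodic cleanup: drop test records, keep the real ones.
        before = len(self.orders)
        self.orders = self.visible_to_users()
        return before - len(self.orders)


store = OrderStore()
store.create("router", headers={})
store.create("router", headers={TEST_HEADER: "true"})
removed = store.purge_test_data()
```

Because the test record lives in the same environment as real ones, it exercises the real behavior, yet it never shows up for users and can be removed at any time.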


There is no one size fits all, but modern practices tend to go towards simplicity and fast feedback loops. Many practices contribute to this simplicity and enable us to feel comfortable with only a production environment. We will talk about them in this series.

· 3 min read
Alvaro Jose

When we talk about observability, we talk about:

Capability of developers to understand the health and status of their application.

We don't want users or clients to be the ones noticing something is wrong. For this, there are multiple tools that fall under the observability category.



Alarms are the first line of defense against issues. The intent is to get notified when any parameter of our application is out of range (ex. too many 5xx responses).

This allows us to use our mental bandwidth to focus on creating value instead of continuously checking whether the parameters are in range.

This affects the following DORA 4 metrics:

  • ✔️ Mean Time To Recovery
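
As a minimal sketch of an "out of range" check, the example below raises an alarm when the share of 5xx responses in a window of requests crosses a threshold. The 5% threshold and the function names are illustrative assumptions; in practice this rule would live in your monitoring system rather than in application code.

```python
def error_rate(status_codes: list[int]) -> float:
    """Share of responses that are server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alarm(status_codes: list[int], threshold: float = 0.05) -> bool:
    # The threshold is an assumed example value; tune it per service.
    return error_rate(status_codes) > threshold


healthy_window = [200] * 99 + [500]        # 1% errors
degraded_window = [200] * 90 + [503] * 10  # 10% errors
```

Here `should_alarm(healthy_window)` stays quiet while `should_alarm(degraded_window)` fires, so nobody has to watch the numbers by hand.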


As the name says, metrics are a set of measurements we track from our code. They allow us to understand the health of individual parts of our system.

These metrics are shown in dashboards that allow us to visually understand what is happening. We can divide metrics dashboards into 2 types:

  • Status: It will give us a really fast overview of the health of the system.
  • Details: It will not tell us what is wrong, but will provide more detailed information to dig deeper into a specific area.

It's important not to mix these 2 together, as they have different purposes. Like with alarms, it helps focus our mental bandwidth on the correct place.

As you can see in the previous image, the left side shows a detail dashboard, which makes it difficult to know from a single view whether there is an issue. For that we have a status dashboard, as in the image on the right, where at a single glance we can spot where to look next.

This affects the following DORA 4 metrics:

  • ✔️ Mean Time To Recovery


Logs are the lowest level you want to go to. They should tell you where in the code your issue is, so you can go and fix it.

When thinking about logging, it is important not to log everything, due to the noise this adds.

This affects the following DORA 4 metrics:

  • ✔️ Mean Time To Recovery
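
One common way to avoid logging everything is to use log levels. The sketch below uses Python's standard `logging` module; the `payments` logger name and the `charge` function are hypothetical.

```python
import logging

logging.basicConfig(format="%(levelname)s %(name)s %(message)s")
logger = logging.getLogger("payments")  # hypothetical service name
logger.setLevel(logging.WARNING)        # anything below WARNING is dropped


def charge(order_id: str, amount: float) -> bool:
    # DEBUG detail is filtered out at WARNING level, so it adds no noise.
    logger.debug("charging order %s for %.2f", order_id, amount)
    if amount <= 0:
        # Log the failure with enough context to locate the problem in the code.
        logger.error("invalid amount for order %s: %.2f", order_id, amount)
        return False
    return True


ok = charge("order-1", 10.0)
failed = charge("order-2", -1.0)
```

Only the failing call produces a log line, which keeps the logs focused on what you need when tracking down an issue.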


Let's get practical on how this would work.

  • Implement your service
  • Create metrics and send them to your metrics system (ex. Datadog, Grafana)
  • Create logs and send them to your logging system (ex. Datadog, Kibana, CloudWatch).
  • Create dashboards:
    • Single Status dashboard. Use only simple boxes with green and red backgrounds that represent in one view the health of your system & subsystems.
    • Multiple Detail dashboards. Create a dashboard for each subsystem with as much data as necessary to understand where the issue is, so you can later pinpoint the root cause in your logs.
  • Create alarms based on the status dashboard boxes.
  • Connect your notification system (ex. Opsgenie, PagerDuty, Slack channel) to the created alarms, so you get push notifications as something goes wrong.

· 4 min read
Alvaro Jose

When we start our journey towards continuous integration & delivery, the first thing to take into account is the mindset. A few mindsets will make or break our intent. Let's look at the most important ones, and also at some practices.


You build it, you run it

Create a DevOps culture, not a Devs vs Ops one

This mentality is the idea that the same people who develop the software are in charge of keeping it in good health by observing it.

For many years, this was not the case. Operations & development were handled by different teams. This caused a dystopian situation where each group had a different goal:

  • Devs: deliver as fast as possible, pushing code to production without observing its side effects.
  • Ops: keep system stability.

With the 'you build it, you run it' mentality, devs focus on their service or work, while Ops becomes a product team that focuses on providing the correct tooling for developers.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Embrace Ownership in Failure Culture

The problem is not breaking things, it is the inability to recover from it

Normally, developers feel they need a safety net to be comfortable introducing changes to production. This tends to translate into delegating ownership to others through peer review or other validation steps.
This lack of ownership has massive effects on the capacity to recover and on the gates that code needs to go through, affecting the feedback cycle.

To improve this, it is necessary to promote a blameless failure culture, as having no blame reduces the amount of stress people go through.

If something fails, it is not an issue of the individual but of the process itself.

Imagine that every commit goes to production, changes will be so small that fixing or rolling back can be done in minutes or seconds. At the same time, developers will be able to create the correct tooling to feel more comfortable with this practice.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Be a Boy Scout

Don’t continue the same path if you think something can be done better

As individuals, we need to bring change to our products. If we see any new practice, tool, or service that can support the work of the team, we should bring it forward. Don't shy away just because the team is currently doing it differently.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

Learn & Adapt

Not everything is solved in the same way. Don't fall into:

If your only tool is a hammer then every problem looks like a nail

For this, take your time to learn. When you have a new problem, it's possible that you don't yet have the correct tool in your toolbox.

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate


Firefighter Role

The firefighter role is a rotating role inside the team. They are responsible for being the first responder to incidents and helping solve them.
At the same time, to make sure this person does not suffer from cognitive load due to context switching, they are not involved in the normal pair rotation and development tasks.
In exchange, they focus during the week on improving the specific tooling of the project (ex. DB migration tooling).

This affects the following DORA 4 metrics:

  • ✔️ Deployment frequency
  • ✔️ Lead Time for change
  • ✔️ Mean Time To Recovery
  • ✔️ Change Failure Rate

On Call Rotation

As the development team is also in charge of running the service, some services will require support outside working hours. On call is just this: the availability of team members to take care of their services around the clock.
This tends to sound bad, but there are ways to make it not suck. I can't express it better than Chris Ford has already done on this page.

This affects the following DORA 4 metric:

  • ✔️ Mean Time To Recovery


These are the starting points to feel comfortable running things in production without worrying that any issue will be catastrophic. Failing is not the issue; the important part is being able to recover as soon as possible from any problem that arises.

· 3 min read
Alvaro Jose

This is a series I am really looking forward to writing. I have been doing this presentation for the last 3 years in multiple places.

Am I Crazy?

The answer is no. Most of the things you will see in this series come from practices derived from Extreme Programming, which show how to build quality and value into products. So bear with me for some time.


A few years ago, I read the book Accelerate, which is derived from the analysis of the State of DevOps reports that are published on a regular basis.

The book does not speak only about technology; it also speaks about communication, organization, etc., and how these affect the effectiveness of teams & companies. I recommend reading the entire book.

4 key metrics

Nevertheless, most people (erroneously) summarize this book with the next table.

It compares what are called the 4 key metrics and provides a performance classification for teams & companies (this classification has evolved since 2017).

What do these 4 key metrics mean?

  • Deployment frequency: how often the team deploys to production.
  • Lead Time for change: how much time a story takes to get to production.
  • Mean Time To Recovery: how fast we can solve a production issue.
  • Change Failure Rate: how frequently we break things in production.
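
To make the four definitions more tangible, here is a small sketch that computes them from a list of deployment records. The `Deployment` structure is an assumption for the example, and lead time is approximated as commit-to-production time rather than the full life of a story. It also assumes at least one deployment in the period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production issue?
    recovered_at: Optional[datetime] = None  # when the issue was resolved


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    # Assumes a non-empty list of deployments for the period.
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]

    def mean_minutes(deltas: list[timedelta]) -> float:
        if not deltas:
            return 0.0
        return sum(deltas, timedelta()).total_seconds() / 60 / len(deltas)

    return {
        "deployments_per_day": len(deployments) / period_days,
        "lead_time_minutes": mean_minutes(lead_times),
        "time_to_recovery_minutes": mean_minutes(recoveries),
        "change_failure_rate": len(failures) / len(deployments),
    }


start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(minutes=30)),
    Deployment(start, start + timedelta(minutes=50), failed=True,
               recovered_at=start + timedelta(minutes=60)),
]
metrics = four_key_metrics(history, period_days=1)
```

For this made-up history the result is 2 deployments per day, a 40 minute lead time, a 10 minute time to recovery and a change failure rate of 0.5.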

All these metrics help teams understand their feedback cycle and stability. In the case of the team I currently work with:

  • Deployment Frequency: once per commit to trunk (we do trunk-based development), which ends up translating to a few times per day.
  • Lead Time for change: below 1h. Using feature flags, we can activate a feature as soon as the code is deployed by the CI/CD.
  • Mean Time To Recovery: in minutes. We can activate and deactivate feature flags on the fly if any of the code breaks, and we have good observability and alarming, so we are the first ones to notice.
  • Change Failure Rate: we don't optimize for this, as MTTR is more important for us (I will explain why later). Nevertheless, we have had only 2 minor production issues in the last year, so we are way below 1%. Our CI/CD validations help a lot with this.

The intent of this series is to share the Extreme Programming practices that we use to reach the elite classification of DORA 4.

Note of Caution

As this Twitter thread shows, this is not one size fits all: the challenges of one team are not the challenges of another. There is no silver bullet or common root cause, and each team should use these metrics to track its improvements in an unbiased way. For this reason, the 4 key metrics do not mean anything at company level and should not be used to compare teams.


In the following installments, I will walk backwards: from having something in production and keeping it running in a healthy, stress-free manner, up to the coding techniques that enable Trunk-Based Development.

· 5 min read
Alvaro Jose

In our previous installments, we discussed the smells that can happen when splitting microservices, and the strategies that exist to make them as independent as possible. But how do we define boundaries? How do we define the process that our microservice is in charge of?

Event Storming

Event storming is a technique that is part of DDD. But what is event storming? The definition on Wikipedia is:

A workshop-based method to quickly find out what is happening in the domain of a software program. The business process is "stormed out" as a series of domain events.

This process is run with stickies on a physical or digital board during a session, and requires the 'experts' on the process to be present to provide the context (what/whom/how). The outcome is an understanding of the business process, not the technical one, so that we can separate it into different steps with clear responsibilities.

Step-By-Step Guide

Let's work through an example of how a company sets up our internet connection.

Prepare a board and the people for the session

Event storming requires people to share a common view and to brainstorm and discuss it. This process takes time into account as a dimension and has multiple types of stickies that can be used.
You can see an example board in the next image:

Regarding the stickies, each color has a specific meaning[1]:

  • Events (orange): Represent the factual events and anything that is relevant to a domain expert.
  • Commands (blue): These are requests to do something. They can originate from a user or system or by another event.
  • System (pink): These represent systems involved in the domain. They may issue commands or receive commands along with triggering events.
  • User (yellow): These are human users involved in the process. They may be a single person or a department/team.
  • Aggregate (tan): This is the first level of categorization and can be thought of as the “thing” that a group of events operates on.
  • Read Model (green): This represents data that may be critical for a user or system to decide.
  • Policy (gray): These represent standards or rules that may need to be executed, such as rules for a compliance policy.

Define the Events of your system

Events are the most important information on our board. They represent facts about the process and help encapsulate the knowledge of the 'experts'.
As we mentioned before, time is a significant dimension. A process always happens over a period of time, so organizing these 'things' that happen on a timeline is a good way to start.

In our example, you can see in the previous image that we go from checking coverage, to creating a user, to creating a contract, and to connecting our user to the network.

Identify the Systems involved (Optional)

The intent of this step is to identify the existing systems and their interdependency. When we discuss systems, they can be internal or external.

In our example, everything starts with the website, but soon enough it becomes apparent that most of the process is taken care of by the monolith.

This step is optional if you have a greenfield project. Nevertheless, I highly recommend it if you are splitting a monolith.

Add the Actors

These are real people who are part of the process. They tend to be the starting point of a chain of events or, in a manual process we are trying to automate, the executors of an individual step.

In our case, the user is the one starting the process, but there needs to be a technician doing the last steps manually.

Connect the dots with Commands

Now we are left with events that are done by someone and take effect in parts of our system. But we are missing the cause and effect that made the process look this way.

Commands provide exactly this: a command is a specific action or decision that pushes our system in a certain direction.

Commands can be positive or negative actions, causing bifurcation and showing different cases that our system needs to cope with.

Define Bounded Context

Now we are left to define where each of the sub-processes that make up our system starts and ends. This is done by drawing an enclosure around the related stickies and naming it with a noun + verb, as it's a sub-process and it evokes action.

Now you have a set of separate actions that can become their own microservices and provide part of the process independently.

Create Capabilities Matrix (Optional)

Now, with the bounded context, we can start defining the capabilities of our services. This is straightforward to express in a matrix.

  • Network Management
    • Check coverage
    • Enable Network
    • 3rd party Hardware management integration
  • User Management
    • Create User
    • User Email Verification
  • Contract Management
    • Create Contract
    • User Email Verification
    • 3rd party digital signature integration

Devise your Goal Architecture (Optional)

Knowing our current architecture, it's good to think about where we want to go.
This is not only a technical challenge but also an organizational one, due to Conway's law. If we want to be successful in splitting a monolith, our communication, meaning the structure of the teams involved, needs to resemble this target state.

Define a plan on how to split the Monolith (Optional)

A change as big as the one shown in the previous image can be overwhelming for an organization and create paralysis and doubts. It's always good to split the problem into steps, so that progress is visible and every step leaves us in a better state. This will improve morale.


· 3 min read
Alvaro Jose

In the previous installment of this series, we discussed the pitfalls that could happen when we split a monolith into microservices. Specifically, we talked about creating what are called microliths.

Assuming you have followed the recommendations on designing your domains correctly, today we are going to elaborate on patterns to remove that synchronous communication between 'microservices'. This will help our services become more resilient.

The Patterns

Circuit Breakers

The simplest solution we can go for is called a circuit breaker. As the name implies, it is just a piece of code that, after multiple failed requests to a downstream service, fails silently instead of calling it again, giving the services a chance to resume their normal behavior.

What are we solving and what are we leaving unsolved:

  • ✔️ We don’t fail continuously if some other service fails.
  • ❌ We silently don’t finish the entire process requested.
  • ❌ We require the whole chain of dependencies to be called.
  • ❌ We force other services to scale to our needs.
  • ❌ Data is mutable, so errors will be propagated and not solvable.
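
A bare-bones circuit breaker might look like the sketch below. It is a simplified illustration: the `CircuitBreaker` class and its `max_failures` parameter are made up for the example, and real implementations usually add a timeout after which the downstream service is tried again ("half-open" state).

```python
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing downstream service after repeated failures."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, func: Callable[[], str],
             fallback: Optional[str] = None) -> Optional[str]:
        if self.is_open:
            # Open circuit: skip the downstream call and fail silently.
            return fallback
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback
        self.failures = 0  # a success resets the failure count
        return result

    def reset(self) -> None:
        # Manual reset; a real breaker would typically retry after a timeout.
        self.failures = 0


attempts: list[int] = []


def flaky_service() -> str:
    attempts.append(1)
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(max_failures=3)
results = [breaker.call(flaky_service, fallback="cached") for _ in range(5)]
```

Out of the five calls, only the first three reach the failing service; the last two are answered with the fallback without touching it, which is also why the process is silently left unfinished.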

Outbox Pattern

The next level in solving our microlith issue is to decouple our services using Pub/Sub to exchange models between services.
Our service will consume and store the information necessary to run the process locally, and will broadcast the resulting models. This means there will always be strong consistency in the outbox, and eventual consistency in the service database (if it exists).

What are we solving and what are we leaving unsolved:

  • ✔️ We don’t fail continuously if some other service fails.
  • ✔️ We always finish our process and promise the rest will be done.
  • ✔️ We just require our service to do what we promise.
  • ✔️ Fast services will be fast, and slow services can go slow.
  • ❌ Data is mutable, so errors will be propagated and not solvable.
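
The sketch below shows one common reading of the outbox pattern: the business change and the outgoing message are written in the same local transaction, and a separate relay publishes pending messages later. SQLite, the table layout, and the `publish` callback (a stand-in for a real Pub/Sub client) are all assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE contracts (id TEXT PRIMARY KEY, user_id TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)"
    )


def create_contract(conn: sqlite3.Connection, contract_id: str, user_id: str) -> None:
    # One transaction: the business row and the outgoing message are
    # committed together, or neither is.
    with conn:
        conn.execute("INSERT INTO contracts VALUES (?, ?)", (contract_id, user_id))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("contract.created", json.dumps({"id": contract_id, "user_id": user_id})),
        )


def relay(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Publish pending outbox rows; a row is marked only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
create_contract(conn, "c-1", "u-1")

sent: list[tuple[str, dict]] = []
first_run = relay(conn, lambda topic, message: sent.append((topic, message)))
second_run = relay(conn, lambda topic, message: sent.append((topic, message)))
```

Our service finishes its own work even if no consumer is available at that moment; the relay delivers the promise later, at whatever pace the other services can handle.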

Event Sourcing

The last level is event sourcing. The idea is to use the events that generated a specific state and not use the calculated state that a service can provide us.

This allows higher resilience due to the immutability of the data. In this case, calculation issues of the past can be solved, as we can reprocess the entire set of events that took us to a certain state.
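
As a minimal sketch of the event sourcing idea, the example below derives the state by replaying an immutable list of events. The account domain, the event names, and the fee rule are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event names)
    amount: int  # amount in cents


def balance(events: list[Event]) -> int:
    """Rebuild the current state by replaying every event in order."""
    total = 0
    for event in events:
        if event.kind == "deposited":
            total += event.amount
        elif event.kind == "withdrawn":
            total -= event.amount
    return total


def balance_with_fee(events: list[Event], fee: int) -> int:
    """A corrected calculation applied to the same, unchanged history."""
    withdrawals = sum(1 for event in events if event.kind == "withdrawn")
    return balance(events) - withdrawals * fee


log = [Event("deposited", 1000), Event("withdrawn", 300), Event("deposited", 50)]
current = balance(log)
recalculated = balance_with_fee(log, fee=10)
```

Because the events themselves never change, fixing a calculation mistake means replaying the same log with the corrected logic instead of patching stored state.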

Conclusion and follow-ups

These are some of the patterns that can make our services more independent and resilient. Nevertheless, each of them has a different complexity, which also affects the complexity of our code. For this reason, we need to make sure we use the correct tool for the job.

· 4 min read
Alvaro Jose

The Monolith

At this point, we have all encountered the big service that jumpstarted the business. It's always good to find it or to know it existed: it shows that there was an intent not to solve every architectural problem before we even knew we had a business.

Nevertheless, it tends to outgrow itself and become more of a pain than a solution. Some of these pains are:

  • We all work on the same code base, and conflicts and side effects start to happen.
  • You need to release the entire solution, even if different teams have different cycles.
  • There are code freezes to go through validation cycles.
  • It scales as a whole, not only the portion that has an increase in traffic.

Microservices were created because of these pains: to give each team/domain the independence to create focused solutions for a business that has already been validated.

The Microservices

Let's start with a definition of a microservice:

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. These services are owned by small, self-contained teams.


Everything sounds like flowers and happiness when we talk about microservices. Nevertheless, do microservices solve the entire issue by themselves?

Have you encountered the following cases in a microservice architecture?

  • Before we release a new version, we need to sync deploys with another team.
  • Our application was down, but it was not our issue.
  • Our service was working and scaling fine until team X started using us.
  • And more…

What is happening?


The smells mentioned before are caused by what Jonas Bonér calls microliths, a great word for what is happening here.
Even if we think these are 'independent' services, synchronous communication can cause side effects we don't want:

  • There can be cascading events between your services.
  • Your domain boundaries are not clear because you don’t own the entire process.
  • Slow services are forced to scale by the requirements of faster services.
  • There is additional latency on the network calls.

What got lost in translation?

Having microliths comes from multiple misconceptions we have. Some of them are:

Domains != Resources

Every so often, when we divide the monolith, we think of domains as resources. This comes from how we have normally divided APIs and DBs: we think about splitting what already exists rather than about extracting the processes being carried out.

When thinking about a microservice, we should think about what part of the process it is trying to solve. This will help us define good boundaries for our bounded context.

When we think in terms of a process, data is secondary. The process will require different pieces of existing data to fulfill its capabilities, and it is OK for it to own its copy of what it needs to fulfill its mission.

Independence != Single Source

A single source of data does not mean independence: whenever your software requires complementary data, it will have to acquire it from somewhere else, which means a direct dependency. This also affects boundaries, as you must enter another team's domain.

If you strive for independence, copy the information you require for your process, even if it exists somewhere else.

Fast != Synchronous

Humans think that a direct response is always faster than sending out a message. While occasionally this is true, in microservices this could start a cascade of synchronous calls from one service to the next one, leaving our users in a timeout limbo.

Consider whether your system really requires calling others directly, or whether you can message them to start their process.

Resilience != Complete

Making sure the entire process has been completed is normally confused with resilience. Resilience only refers to the capability to complete the process.
If we have well-defined contracts between our pieces, we don't need to finish things synchronously; we can promise our users that things will happen and let our services do their work at their own speed.

Conclusion and follow-ups

Are we doomed?

The answer is no, we are not doomed! We can design our services with the correct division using some DDD tooling and also use the correct tools to decouple our microservices.
Let's talk about this in the next chapters of this series.