Enforcing Resilience in Spring Boot with Resilience4J

An important property of modern web apps is resilience. In simple terms, resilience refers to the ability of a system or feature to fail gracefully without impacting the entire system. In the context of web apps, we want to ensure that the entire system will not go down if a remote service, such as a database or API server, fails or is slow.

By designing our web apps with resilience in mind, we can improve their availability and reliability, providing a better experience for users and reducing the risk of costly downtime.

There is also a video course that demonstrates how to use Resilience4J in Spring Boot Microservices.

What is Resilience4J?

Resilience4J is a lightweight, easy-to-use fault tolerance library designed for Java 8 and functional programming. It is inspired by Netflix Hystrix, but it adds a simpler and more flexible API, improved support for reactive programming, and several other resilience patterns, such as bulkheads, rate limiters, and retry mechanisms.

Resilience4J provides developers with a set of tools and patterns that they can use to build more resilient and fault-tolerant applications. By using Resilience4J, developers can better handle failures and errors in their applications, reduce the risk of downtime, and provide a better experience for their users.

Resilience Modules Provided by Resilience4J

Resilience4J provides several modules. Each one represents a specific resilience pattern, can be used independently, and has its own dependency that you include in your project as needed. When you need all of the Resilience4J modules, you can import the Spring Boot starter provided by Resilience4J (resilience4j-spring-boot2 in the pom.xml further down), which brings in the dependencies required for all of the modules.

To use a specific Resilience4J module, you need to add the corresponding dependency to your project. For example, if you want to use the Circuit Breaker module, you need to include the resilience4j-circuitbreaker dependency in your project. Similarly, if you want to use the Retry module, you need to include the resilience4j-retry dependency.

You can refer to the Resilience4J documentation or the Maven Central Repository to find the specific dependency you need for each module. The documentation also provides examples and usage instructions for each module.

Circuit Breaker

When the failure rate of calls to a service reaches a configured threshold, the Circuit Breaker trips, and subsequent calls to that service are either rejected immediately (fail-fast) or handled by a fallback. The Circuit Breaker operates as a finite state machine.

The Circuit Breaker has three states:

CLOSED: In the normal state, all requests to the service are allowed to pass through, and their results are recorded.
OPEN: If the failure threshold is reached, the Circuit Breaker trips and enters the OPEN state. In this state, all requests to the service are rejected immediately, and a fallback is executed if one is configured.
HALF-OPEN: After a configured wait time has passed, the Circuit Breaker transitions to the HALF-OPEN state, in which a limited number of trial requests are allowed to pass through. Once those requests complete, the failure rate is evaluated again: if it is below the threshold, the Circuit Breaker transitions back to the CLOSED state; otherwise, it returns to the OPEN state.
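The three states above can be sketched in plain Java. This is a simplified, hypothetical illustration and not Resilience4J's implementation: the class name SimpleCircuitBreaker, its constructor parameters, and the injectable clock are all assumptions made for the example, and the failure rate is evaluated over fixed batches of calls rather than a sliding window.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified, hypothetical count-based circuit breaker; NOT Resilience4J's implementation.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int minimumCalls;         // calls recorded before the failure rate is evaluated
    private final double failureThreshold;  // percentage, for example 60.0
    private final long waitInOpenMillis;    // how long to stay OPEN before trying again
    private final int halfOpenCalls;        // trial calls evaluated in HALF_OPEN
    private final LongSupplier clockMillis; // injectable clock (an assumption, handy for tests)

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int minimumCalls, double failureThreshold, long waitInOpenMillis,
                         int halfOpenCalls, LongSupplier clockMillis) {
        this.minimumCalls = minimumCalls;
        this.failureThreshold = failureThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clockMillis = clockMillis;
    }

    synchronized State state() {
        return state;
    }

    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!tryAcquire()) {
            return fallback.get(); // OPEN: fail fast without touching the service
        }
        try {
            T result = remoteCall.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    // Decides whether a call may pass; moves OPEN -> HALF_OPEN once the wait time has elapsed.
    private synchronized boolean tryAcquire() {
        if (state == State.OPEN) {
            if (clockMillis.getAsLong() - openedAt < waitInOpenMillis) {
                return false;
            }
            state = State.HALF_OPEN;
            calls = 0;
            failures = 0;
        }
        return true;
    }

    // Records an outcome and re-evaluates the state once enough calls have been seen.
    private synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // a late result from a call that started before the breaker tripped
        }
        calls++;
        if (failed) {
            failures++;
        }
        int required = (state == State.HALF_OPEN) ? halfOpenCalls : minimumCalls;
        if (calls < required) {
            return;
        }
        double failureRate = 100.0 * failures / calls;
        if (failureRate >= failureThreshold) {
            state = State.OPEN;
            openedAt = clockMillis.getAsLong();
        } else {
            state = State.CLOSED;
        }
        calls = 0;
        failures = 0;
    }
}
```

For example, new SimpleCircuitBreaker(10, 60.0, 5000, 10, System::currentTimeMillis) would roughly mirror the cb-instanceA settings used later in this tutorial.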

Bulkhead

The Bulkhead pattern isolates calls to each remote service so that one slow service cannot exhaust the resources of the whole system. In the thread pool variant, a separate thread pool is assigned to each remote service that is called. When a remote call is made, a thread is taken from the pool to execute the call. If no thread is available in the pool, the call is queued until a thread becomes available. If the queue is full, the call is rejected and a fallback is executed.

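Alternatively, semaphores can be used to implement this pattern instead of thread pools: a semaphore caps the number of concurrent calls, which keep running on the caller's thread. Thread pools give stronger isolation at the cost of extra threads, while semaphores are lighter. In Resilience4J, the @Bulkhead annotation uses the semaphore type unless the thread pool type is requested explicitly.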
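The semaphore variant is small enough to sketch with the JDK alone. The class below is a hypothetical illustration (the name SemaphoreBulkhead and its methods are assumptions, not Resilience4J's API): it allows a fixed number of concurrent calls and runs the fallback immediately when no permit is free.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead; a sketch of the pattern, not Resilience4J's API.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        // Do not wait for a permit: when the bulkhead is full, fall back straight away
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release(); // always hand the permit back, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The call runs on the caller's own thread, which is the main difference from the thread pool variant described above.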

Retry

In the Retry pattern, when a remote call to a service fails, a configured number of retries are attempted before the failure is reported back to the calling code. A fallback can also be provided if the call fails even after all the retries are attempted. The retry logic can be configured to wait for a certain amount of time between retries and to increase the wait time for subsequent retries, to allow the remote service to recover.
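As a rough illustration of the retry logic described above, here is a minimal JDK-only sketch with a wait time that grows after each failed attempt. The class RetrySketch, the method callWithRetry, and its parameters are hypothetical names chosen for this example; Resilience4J's own Retry module is configured differently, as shown later in the application.yml.

```java
import java.util.function.Supplier;

// Hypothetical retry helper with exponential backoff; not Resilience4J's API.
class RetrySketch {
    static <T> T callWithRetry(Supplier<T> remoteCall, int maxAttempts,
                               long initialWaitMillis, double multiplier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long wait = initialWaitMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return remoteCall.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(wait);
                    wait = (long) (wait * multiplier); // wait longer before the next attempt
                }
            }
        }
        throw last; // every attempt failed: report the last failure to the caller
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

With maxAttempts set to 3, an initial wait of 1000 ms, and a multiplier of 2.0, a call that keeps failing is attempted three times with waits of 1 s and 2 s in between before the exception reaches the caller.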

Rate Limiter

The Rate Limiter pattern limits the rate at which remote calls are made to a service. A configured number of calls is allowed per refresh period (for example, per second), and any further calls wait for the next period to begin, up to a configured timeout. A fallback can be provided in case a call to the remote service is rejected for exceeding the call rate limit. This pattern is useful for preventing overload on remote services and ensuring that they can handle requests without experiencing downtime.
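The idea of "a fixed number of permits per refresh period" can be sketched as follows. This is a hypothetical fixed-window limiter written only for illustration (FixedWindowRateLimiter and its parameters are assumed names); unlike the behaviour described above, it rejects a call immediately instead of waiting for the next period.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window rate limiter; a sketch of the idea, not Resilience4J's implementation.
class FixedWindowRateLimiter {
    private final int limitForPeriod;      // permits available in each period
    private final long refreshPeriodNanos; // length of one period
    private final LongSupplier nanoClock;  // injectable clock (an assumption, handy for tests)
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long refreshPeriodNanos, LongSupplier nanoClock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.nanoClock = nanoClock;
        this.windowStart = nanoClock.getAsLong();
    }

    // Returns true if the call may proceed, false if the limit for this period is used up.
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        if (now - windowStart >= refreshPeriodNanos) {
            windowStart = now; // a new period begins: refresh the permits
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be System::nanoTime, and a caller would run a fallback whenever tryAcquire() returns false.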

Time Limiter

The Time Limiter (timeout) pattern limits the amount of time spent on a remote call to a service. If a call exceeds the configured timeout, the caller stops waiting, the call can optionally be cancelled, and a fallback method is executed. This pattern is useful for preventing long-running calls from impacting system performance or availability. It is analogous to the timeout when making HTTP calls, where a request is terminated if it takes too long to respond.
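A minimal way to picture this with the JDK is to run the call asynchronously and wait for the result only up to a deadline. The helper below is a hypothetical sketch (TimeLimiterSketch and callWithTimeout are assumed names, not Resilience4J's API).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical timeout helper; a sketch of the pattern, not Resilience4J's API.
class TimeLimiterSketch {
    static <T> T callWithTimeout(Supplier<T> remoteCall, long timeoutMillis, Supplier<T> fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(remoteCall);
        try {
            // Wait for the result, but no longer than the timeout
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; CompletableFuture does not interrupt the running task
            future.cancel(true);
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the remote call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

Note that cancelling a CompletableFuture only stops the caller from waiting. The underlying task keeps running until it finishes on its own, which is worth keeping in mind when the remote call holds resources.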

Aspect Order

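When you apply multiple patterns to a service call, they execute in a specific order. According to the Resilience4J documentation, the aspects are nested as Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) ), so the default order, from the innermost aspect (closest to the method) to the outermost, is: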

Bulkhead
Time Limiter
Rate Limiter
Circuit Breaker
Retry
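To make the nesting concrete, the sketch below wraps a call in five named layers in the order listed above and records which layer is entered first. It uses no Resilience4J code: AspectOrderSketch and its methods are hypothetical, and the layers only log their names, assuming the default order where Bulkhead is innermost and Retry is outermost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of decorator nesting; the layers only record their names.
class AspectOrderSketch {
    // Wraps a call so that entering the named layer is recorded before delegating inward.
    static <T> Supplier<T> layer(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    static List<String> traceDefaultOrder() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> {
            trace.add("method");
            return "result";
        };
        // Wrap innermost first, so the last layer added (Retry) ends up outermost
        String[] innerToOuter = {"Bulkhead", "TimeLimiter", "RateLimiter", "CircuitBreaker", "Retry"};
        for (String name : innerToOuter) {
            call = layer(name, trace, call);
        }
        call.get();
        return trace;
    }

    public static void main(String[] args) {
        // prints [Retry, CircuitBreaker, RateLimiter, TimeLimiter, Bulkhead, method]
        System.out.println(traceDefaultOrder());
    }
}
```

Because Retry is the outermost layer, a retried call passes through the Circuit Breaker, Rate Limiter, Time Limiter, and Bulkhead again on every attempt.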

Creating Specifications for a Module

Resilience4J provides two ways to create specifications for any of the above modules: through the application.yml file or through a Customizer bean definition. The bean definition overrides the specifications in application.yml. Below is an example of defining some specifications for the Circuit Breaker pattern. The Hands-on-code section shows how to create specifications for the other modules.

Using Application.yml

resilience4j:
  circuitbreaker:
    instances:
      cb-instanceA:
        failure-rate-threshold: 60  #Failure rate, in percent, at or above which the Circuit Breaker moves from the CLOSED to the OPEN state
        wait-duration-in-open-state: 5000  #Time, in milliseconds, that the Circuit Breaker stays in the OPEN state before moving to HALF-OPEN
        permitted-number-of-calls-in-half-open-state: 10  #Number of trial calls allowed while the Circuit Breaker is HALF-OPEN
        minimum-number-of-calls: 10  #Number of calls that must be recorded before the failure rate is calculated; kept small here for testing

Using Bean Definition

    @Bean
    public CircuitBreakerConfigCustomizer circuitBreakerConfigCustomizer() {
        // Customizes the configuration of the "cb-instanceB" circuit breaker instance
        return CircuitBreakerConfigCustomizer.of("cb-instanceB",
                builder -> builder.minimumNumberOfCalls(10)
                        .permittedNumberOfCallsInHalfOpenState(15));
    }

Hands-on-code

Below is a simple project that demonstrates the use of each pattern. The specifications and attributes for each module are mostly self-explanatory. We will define the specifications through the application.yml.

Maven’s pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.4.4</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.resilience</groupId>
    <artifactId>demo</artifactId>
    <version>1.0.0</version>
    <name>Resilience4J Demo</name>
    <description>Demo project for Resilience4J</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>io.github.resilience4j</groupId>
            <artifactId>resilience4j-spring-boot2</artifactId>
            <version>1.7.0</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-aop</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
        </dependency>

    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

Resilience4J's annotations rely on Spring AOP, which is why the spring-boot-starter-aop dependency is included.

Application.yml

resilience4j:
  circuitbreaker:
    instances:
      cb-instanceA:
        failure-rate-threshold: 60  #Failure rate, in percent, at or above which the Circuit Breaker moves from the CLOSED to the OPEN state
        wait-duration-in-open-state: 5000  #Time, in milliseconds, that the Circuit Breaker stays in the OPEN state before moving to HALF-OPEN
        permitted-number-of-calls-in-half-open-state: 10  #Number of trial calls allowed while the Circuit Breaker is HALF-OPEN
        minimum-number-of-calls: 10  #Number of calls that must be recorded before the failure rate is calculated; kept small here for testing
  ratelimiter:
    instances:
      rl-instanceA:
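        limit-refresh-period: 200ns #The period after which the permits are refreshed. 200ns is extremely short, so this limit is practically never reached; use a longer period (for example 1s) to see calls being rejected.
        limit-for-period: 40 #The maximum number of calls permitted within one limit-refresh-period
        timeout-duration: 3s #The maximum time a call waits for a permit before it is rejected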
  thread-pool-bulkhead:
    instances:
      tp-instanceA:
        queue-capacity: 2 #The number of calls which can be queued if the thread pool is saturated
        core-thread-pool-size: 4 #The Number of available threads in the Thread Pool.
  timelimiter:
    instances:
      tl-instanceA:
        timeout-duration: 2s # The max amount of time a call can last
        cancel-running-future: false #do not cancel the Running Completable Futures After TimeOut.
  retry:
    instances:
      re-instanceA:
        max-attempts: 3
        wait-duration: 1s #The fixed time to wait between retry attempts
        retry-exceptions: #The List Of Exceptions That Will Trigger a Retry
          - java.lang.RuntimeException
          - java.io.IOException

Service Class

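import java.util.concurrent.CompletableFuture;

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;

@Service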
public class DemoService {

    @CircuitBreaker(name = "cb-instanceA",fallbackMethod = "cbFallBack")
    public String circuitBreaker() {
        return cbRemoteCall();
    }

    private String cbRemoteCall() {
        double random = Math.random();
        // Fails roughly 70% of the time
        if (random <= 0.7) {
            throw new RuntimeException("CB Remote Call Fails");
        }
        return "CB Remote Call Executed";
    }

    public String cbFallBack(Exception exception) {
       return String.format("Fallback Execution for Circuit Breaker. Error Message: %s\n",exception.getMessage());
    }

    @RateLimiter(name = "rl-instanceA")
    public String rateLimiter() {
        return "Executing Rate Limited Method";
    }

    @TimeLimiter(name = "tl-instanceA")
    public CompletableFuture<String> timeLimiter() {
        return CompletableFuture.supplyAsync(this::timeLimiterRemoteCall);
    }

    private String timeLimiterRemoteCall() {
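        // Roughly half of the calls return immediately; the rest sleep for 3s and exceed the 2s timeout
        double random = Math.random();
        if (random < 0.5) {
            return "Executing Time Limited Call...";
        } else {
            try {
                System.out.println("Delaying Execution");
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so that callers can still observe the interruption
                Thread.currentThread().interrupt();
            }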
        }
        return "Exception Will be Raised";
    }

    @Retry(name = "re-instanceA")
    public String retry() {
        return retryRemoteCall();
    }

    private String retryRemoteCall() {
        //will fail 80% of the time
        double random = Math.random();
        if (random <= 0.8) {
            throw new RuntimeException("Retry Remote Call Fails");
        }

        return  "Executing Retry Remote Call";
    }

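    // A thread pool bulkhead runs the call on its own pool, so the method must return a CompletableFuture
    @Bulkhead(name = "tp-instanceA", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<String> bulkHead() {
        return CompletableFuture.completedFuture("Executing Bulk Head Remote call");
    }
}

Controller Class

import java.util.concurrent.CompletableFuture;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/resilience")
public class DemoController {
    private final DemoService demoService;

    public DemoController(DemoService demoService) {
        this.demoService = demoService;
    }

    @GetMapping("/cb")
    public String circuitBreaker() {
        return demoService.circuitBreaker();
    }

    @GetMapping("/bulkhead")
    public CompletableFuture<String> bulkhead() {
        return demoService.bulkHead();
    }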

    @GetMapping("/tl")
    public CompletableFuture<String> timeLimiter() {
        return demoService.timeLimiter();
    }

    @GetMapping("/rl")
    public String rateLimiter() {
        return demoService.rateLimiter();
    }

    @GetMapping("/retry")
    public String retry() {
        return demoService.retry();
    }
}
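When calling the /resilience/retry endpoint, it helps to know how often a failure should still reach the caller. In the service class, retryRemoteCall fails roughly 80% of the time, and re-instanceA allows 3 attempts. Assuming the attempts fail independently of each other, the chance that all of them fail is 0.8 × 0.8 × 0.8 = 0.512, so about half of the requests still end in an error. The small helper below (RetryOdds is a hypothetical name used only for this calculation) does the same arithmetic.

```java
// Hypothetical helper for the arithmetic above; assumes attempts fail independently.
class RetryOdds {
    // Probability that every one of maxAttempts attempts fails.
    static double allAttemptsFail(double perCallFailureRate, int maxAttempts) {
        return Math.pow(perCallFailureRate, maxAttempts);
    }

    public static void main(String[] args) {
        // prints 0.512
        System.out.println(String.format("%.3f", allAttemptsFail(0.8, 3)));
    }
}
```

Raising max-attempts lowers this number, at the cost of a longer worst-case response time because of the wait between attempts.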

Conclusion

Beginners often overlook resilience. I advise you to explore the specifications for each module to see how each one can be used effectively.

Hope this tutorial has helped you. To learn more, check out the Spring Cloud tutorials page.

Frequently Asked Questions

Can Resilience4J be used with other frameworks besides Spring Boot?
Yes, Resilience4J can be used with other frameworks besides Spring Boot. It can be integrated with other popular frameworks like Micronaut, Quarkus, and Vert.x.

How does Resilience4J compare to other fault tolerance libraries like Hystrix and Sentinel?
Resilience4J is a lightweight fault tolerance library that offers similar functionality to other libraries like Hystrix and Sentinel. However, Resilience4J is designed to be more flexible and configurable, allowing developers to choose which modules to use and how to apply them.

Can Resilience4J be used in a reactive programming environment?
Yes, Resilience4J provides support for reactive programming and can be used in reactive programming environments. It offers features like reactive streams support, reactive retry, and reactive circuit breakers.

How does Resilience4J handle edge cases like slow responses and network timeouts?
Resilience4J provides modules like Time Limiter and Retry that can handle edge cases like slow responses and network timeouts. The Time Limiter module can be used to limit the duration of a method call, while the Retry module can be used to retry a failed method call a specified number of times.

Does Resilience4J provide any metrics or monitoring capabilities?
Yes, Resilience4J provides a metrics module that can be used to collect and monitor various metrics related to fault tolerance, such as circuit breaker status, bulkhead capacity, and retry attempts. These metrics can be exposed to monitoring systems like Prometheus and Grafana.