FusionReactor Observability & APM

Troubleshoot

Blog / Info

Customers

About Us

Installation

Configure

Troubleshoot

Blog / Info

Customers

About Us

[frs-271] Understanding FusionReactor’s Timeout Protection

Introduction

We're sometimes asked about the timings for FusionReactor's Timeout Protection, part of the Crash Protection system. The idea behind this crash protection is that it places a hard upper limit on the length of time requests can run. You might find, however, that requests take a few seconds longer to be killed than you originally thought. Here's why.

Kill Strategies

After the time limit is reached, three strategies are employed by the Request Assassin to kill the request:

  • Soft Kill
    If the request makes any output to the page whatsoever, it will be killed. This is done by intercepting the call which writes data back to the network. The Request Assassin waits for 2 seconds for the request to write output, before trying the next strategy.
  • Interrupt
    We interrupt the thread in which the request runs. If the request is performing an interruptible operation (a call which can throw InterruptedException; sleeping, for instance), the exception will be raised. The Request Assassin waits for 1 second for the thread to process the interrupt – if it does so at all – before trying the final strategy.
  • Thread Stop
    The request is forcefully stopped by Java's thread stop mechanism. Although this isn't completely reliable – threads performing native operations will not be stopped, for instance – it works most of the time, and in fact there is no other way to kill a thread.

The Request Assassin

In addition to these delays, the Request Assassin doesn't run all the time: it wakes up every second, and checks whether there are any threads it should examine. If there are any, it starts to process them, employing the strategies above. If a strategy succeeds, it moves on to the next candidate. If not, it tries the next strategy, until the list of strategies is exhausted.

Because FusionReactor attempts to place as little load on the target system as possible, the candidates for kill are queued up. This is an additional source of delay – previous kill operations may have to be processed before the Request Assassin gets to the request of interest.

Worked Example

Let's look at a worked example, with Timeout Protection set to 1 second.

[frs-271] Understanding FusionReactor's Timeout Protection

In the diagram, you can see (starting from the top timeline):

  • The flow of a request (blue).
  • The best possible timing of a crash protection (red).
  • The worst possible timing of a crash protection (red).

Best Possible Scenario

  • The Request Assassin wakes up at time T1 – the exact instant the request becomes eligible for kill. It immediately sets the Soft-Kill flag, and starts the soft kill delay, 2 seconds.
  • At time T3, the soft-kill delay expires, and the request is still alive, having not performed output. The thread is interrupted, and the Request Assassin waits for a further second.
  • At time T4, the request is still alive, so the Request Assassin performs a thread kill.

The request then dies, marked by the first red "X" in its timeline. If the request was performing native operations (i.e. using part of the Java language that's written in C), it will be killed as soon as it returns. One common native operation is blocking socket communication: this is performed in Java by a C library.

Worst Possible Scenario

The second timeline is much the same as the first, except the Request Assassin ran at time T0.9r, i.e. an instant before the request became eligible for kill at T1. The next time the Request Assassin runs is at time T2 (actually an instant before that), meaning the whole timeline starts 1 second later, at time T2, and completing with the kill at time T5.

Conclusion

We showed in this small article where the (up to) 3-second overhead in Timeout Protection comes from: a combination of soft-kill and interrupt delays, which allow us to try to stop a request without resorting to a hard kill.

Issue Details

Type: DevNet
Issue Number: FRS-271
Components: Crash Protection
Environment:
Resolution: Fixed
Last Updated: 22/Sep/11 3:36 PM
Affects Version:
Fixed Version: 4.0.0
Server:
Platform:
Related Issues: