[frs-310] Changes to FusionReactor Timeout Protection in 5.0.0

Background

FusionReactor 5.0.0's Guard, Metro and Transit engines represent a significant evolution in how we manage request tracking, metrics storage and system survivability.

Transit: Transaction Tracking

Transactions are now managed independently of their content.

Where previously we had request tracking as an embedded concern, with impacts throughout the product, this has been rationalized and modularized into Transit – a generic transaction tracking engine.

The engine itself manages and tracks transactions, independent of what they actually represent. Using this new paradigm, we were able to rearchitect the way we track both web requests and JDBC statements, and the engine is versatile: users can use FRAPI, annotations or pointcuts to define their own transactions, which will be automatically tracked by FusionReactor.

Metro: Metrics Storage

The new Metro Engine is an evolution of FusionReactor's metrics tracking.

We looked at how FusionReactor was being used to track various statistics (memory, request runtime, CPU occupancy and so on) and created a separate engine just to handle these values. Metro provides us with abstractions for data series, and takes care of aggregating data into lower-resolution time series.
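As a rough illustration of what aggregating into a coarser series can mean, the sketch below averages each group of fine-grained samples into a single coarser sample. It is only a sketch under assumptions: SeriesRollup and downsample are hypothetical names, and averaging is one possible aggregation among several; none of this is taken from Metro itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these names are not part of FusionReactor or Metro.
final class SeriesRollup {

    /** Averages each consecutive group of `factor` samples into one coarser sample. */
    static List<Double> downsample(List<Double> samples, int factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        List<Double> coarse = new ArrayList<>();
        for (int start = 0; start < samples.size(); start += factor) {
            // The last group may be shorter than `factor`.
            int end = Math.min(start + factor, samples.size());
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += samples.get(i);
            }
            coarse.add(sum / (end - start));
        }
        return coarse;
    }
}
```

For example, rolling five one-second samples up with a factor of 2 yields three samples: two averages of full pairs and one for the leftover value.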

FusionReactor's probes, which measured these values and stored them piecemeal, have been rewritten as Metro samplers, which store their values in a central in-memory database. This can be queried very quickly to provide data for display, and decision support for the Guard system.

Guard: Crash Protection

Guard helps keep systems up using a rule engine, and interfaces into Metro and Transit.

Guard is the consolidation of FusionReactor's Crash Protection into a rule-and-gate based system. Periodic rules update Guard with pertinent data about how the system is doing. Guard then makes decisions about whether the system needs to be throttled using queues, or even to reject requests to help ensure survivability.

This is done using Transit's Transaction Assassin (previously Crash Protection's Timeout Protection), as well as Transit Transaction Gating (previously Crash Protection's Request Protection and Memory Protection). Gating allows us to queue requests while a given condition clears, or to reject them outright. Guard uses Metro to obtain metric data to support its decisions.
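The queue-or-reject behaviour of gating can be pictured with a small sketch. This is not Guard's implementation: ConditionGate, Decision and the polling interval are all hypothetical, and the condition is supplied by the caller (for instance "free memory is below a threshold"). A wait budget of zero models rejecting outright.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only: none of these names come from FusionReactor.
final class ConditionGate {
    enum Decision { ADMIT, REJECT }

    private final BooleanSupplier conditionActive; // e.g. "free memory below threshold"
    private final long maxWaitMillis;              // 0 means reject outright

    ConditionGate(BooleanSupplier conditionActive, long maxWaitMillis) {
        this.conditionActive = conditionActive;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Holds the caller while the condition is active; rejects once the wait budget is spent. */
    Decision enter() {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (conditionActive.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return Decision.REJECT;
            }
            try {
                Thread.sleep(10); // the request is "queued" while the condition persists
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return Decision.REJECT;
            }
        }
        return Decision.ADMIT;
    }
}
```

A request handler would call enter() before doing any work and return an error response on REJECT.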

On the surface, you won't see much difference in Crash Protection, but inside FusionReactor these three evolutions allow us more flexibility and scope for future enhancements.

Timeout Protection – Changes

Pre-5.0.0

In versions prior to FusionReactor 5.0.0, Crash Protection's Timeout Protection operated in two stages: Soft Kill and Hard Kill.

Soft Kill

When a request times out, a flag is set on its output stream, which is how the request writes back to the browser. The next time the request attempts to write to the page, the flag causes it to exit immediately. The thread can then carry on running other requests.
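The soft-kill idea can be sketched as an output stream wrapper that checks a flag on every write. This is a minimal sketch under assumptions, not FusionReactor's code: KillableOutputStream and requestKill are hypothetical names, and throwing an IOException is just one way to make the request unwind.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration of the soft-kill idea; not FusionReactor's implementation.
final class KillableOutputStream extends OutputStream {
    private final OutputStream delegate;
    // volatile: the flag is set by a watcher thread and read by the request thread.
    private volatile boolean killRequested;

    KillableOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    /** Called from another thread when the request has exceeded its timeout. */
    void requestKill() {
        killRequested = true;
    }

    @Override
    public void write(int b) throws IOException {
        // The flag is only ever checked here, so a request that never writes never sees it.
        if (killRequested) {
            throw new IOException("request timed out (soft kill)");
        }
        delegate.write(b);
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

Because the check lives in write(), the sketch also shows the limitation of the approach: a request that produces no output is never interrupted.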

Hard Kill

If the request isn't outputting anything – for instance, it may be stuck in a loop waiting for the database – the soft kill flag, though set, will never be checked.

In this case, Crash Protection proceeds to a hard kill: the thread is forcibly stopped.

Monitors and Locks

Monitors

When a hard kill occurs, any VM monitors held by the thread are immediately and automatically released. Monitors are acquired by threads using the Java synchronized keyword, and prevent two threads entering the same block of code at the same time. They're used to ensure that things happen in a given order, and that multiple requests can't change the same data at the same time.

Because these monitors are managed by the virtual machine, they are automatically released when a thread is killed.

Locks

With the advent of Java 1.5, Sun introduced a second kind of lock, the Ownable Synchronizer. These types, which are located in the java.util.concurrent.locks package, function broadly the same way as monitors, but are much more flexible. Instead of using the synchronized language keyword, these locks are Java objects, and can be stored and manipulated much more easily. Whereas the monitor acquired by synchronized was managed by the virtual machine, Locks are managed by the Java code itself.

However, because Locks are just normal objects, the virtual machine platform provides no automatic unlocking for them, and a Lock can remain in a locked state even after its owning thread has been killed. Other request threads that need the Lock then wait indefinitely for a release that will never occur: a hang.
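The difference between the two kinds of lock can be seen in a small experiment. The sketch below does not kill a thread; instead it simulates abrupt termination with an uncaught exception, and it deliberately omits the usual try/finally around the Lock so that no unlock code runs. AbandonedLockDemo and its methods are hypothetical names used only for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration: abrupt termination is simulated with an uncaught exception.
final class AbandonedLockDemo {

    /** Runs the task on a new thread that dies with an exception, and waits for it to finish. */
    private static void runAndDie(Runnable task) {
        Thread worker = new Thread(task, "doomed-worker");
        worker.setUncaughtExceptionHandler((t, e) -> { /* keep the demo output quiet */ });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
    }

    /** The VM releases the monitor as the exception leaves the synchronized block. */
    static boolean monitorReleasedAfterDeath() {
        final Object monitor = new Object();
        runAndDie(() -> {
            synchronized (monitor) {
                throw new IllegalStateException("simulated abrupt termination");
            }
        });
        synchronized (monitor) { // would block forever if the monitor were still held
            return true;
        }
    }

    /** Nothing releases the Lock object, so it stays locked after its owner is gone. */
    static boolean lockStillHeldAfterDeath() {
        final ReentrantLock lock = new ReentrantLock();
        runAndDie(() -> {
            lock.lock(); // deliberately no try/finally: no unlock code runs
            throw new IllegalStateException("simulated abrupt termination");
        });
        // tryLock() returns immediately; a plain lock() here would hang.
        return lock.isLocked() && !lock.tryLock();
    }
}
```

Both methods return true: the monitor is free again once its owner has died, while the ReentrantLock remains locked with no thread left to release it.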

5.0.0 and Later

Soft Kill

Soft kill is now managed as a Guard rule segment, but is logically identical to its pre-5.0.0 counterpart.

Hard Kill

Hard kill has evolved into the Transaction Assassin, and has been enhanced along the way.

Because killing a thread which owns Locks could cause the system to hang, the Transaction Assassin now checks whether the thread holds any Locks before killing it, and refuses to kill the thread if it does.
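One way such a check could be written is with the standard java.lang.management API, which can report the ownable synchronizers a thread currently holds. This is an assumption about a possible approach, not a description of how the Transaction Assassin is implemented; HeldLockInspector and safeToHardKill are hypothetical names.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration; FusionReactor's actual check may work differently.
final class HeldLockInspector {

    /** Returns the ownable synchronizers held by the thread (empty if none, dead, or unsupported). */
    static LockInfo[] heldLocks(Thread thread) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isSynchronizerUsageSupported()) {
            return new LockInfo[0];
        }
        // Arguments: thread ids, include locked monitors, include locked synchronizers.
        ThreadInfo[] infos = threads.getThreadInfo(new long[] { thread.getId() }, false, true);
        ThreadInfo info = infos[0]; // null when the thread is not alive
        return info == null ? new LockInfo[0] : info.getLockedSynchronizers();
    }

    /** A thread is only considered safe to stop when it holds no ownable synchronizers. */
    static boolean safeToHardKill(Thread thread) {
        return heldLocks(thread).length == 0;
    }
}
```

Note that some threads hold synchronizers as part of normal operation (thread pool workers, for example), so a real check would need to decide which of the reported locks matter.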

In making this design decision, we had to balance the utility of the Transaction Assassin against the risk of leaving locked Lock objects in the system. Since FusionReactor is a production monitor, we always err on the side of safety. We'd rather have a "stuck" request and a stable system than a "killed" request and a system which may become unstable later.

This change also applies to kills initiated from the user interface and from FRAPI.

Display Changes

If Transit refuses to perform a hard kill because the thread holds Locks, you'll see:

  • A new short annotation in the request list.
  • A new tab in Request Details.

Both the annotation and the tab are called "Transit – Locks". The tab contains information about the locks which were discovered, to aid in troubleshooting.

All locks are displayed in the tab, each represented by the output of its Java toString() method. Some locks provide more information than others, though in all cases you should see the name of the class implementing the lock.
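To give a feel for what such a description can look like, the sketch below prints a standard ReentrantLock while it is held. LockDescriptions is a hypothetical helper, and the exact text shown in the tab may differ; the point is only that the class name leads the string and that some lock classes append extra detail.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of lock descriptions; the tab's actual output may differ.
final class LockDescriptions {

    /** Describes a lock via its toString() method. */
    static String describe(Object lock) {
        return String.valueOf(lock);
    }

    /** Returns the description of a ReentrantLock while the calling thread holds it. */
    static String lockedReentrantLockDescription() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            // Class name, '@', a hash code, then the owner, e.g. "...[Locked by thread <name>]".
            return describe(lock);
        } finally {
            lock.unlock();
        }
    }
}
```

A lock class that does not override toString() would show only the class name and hash code, which is why some entries carry more information than others.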

Future

Naturally, we believe the best solution would be to find a way to release these locks and then allow the hard kill to proceed: no "stuck" request, and no instability due to locked Lock objects. We're looking at ways to make this possible, though we are proceeding carefully; whatever we do must have the stability and performance of the system as its core concern. We hope to have a solution in place for a future version.

Issue Details

Type: Technote
Issue Number: FRS-310
Components: Crash Protection
Environment:
Resolution: Fixed
Last Updated: 20/Jul/13 2:39 AM
Affects Version:
Fixed Version: 5.0.0
Server:
Platform:
Related Issues: