In a previous blog post, I demonstrated how we made parallel UI test execution possible in our test environment for Miradore Online. This was not supported by default in Microsoft’s unit test framework, and according to Microsoft, it shouldn’t even be done. Still, it was something we needed. For us, the justification for that project was reducing the execution time of our UI test set, which had already grown to over 70 minutes. When we started running the tests in parallel, the execution time dropped to under 20 minutes. In that case the problem was clear and always reproducible. Such is not the case with this blog post’s topic: how to deal with random UI test failures.

For us developers to be able to trust and depend on automated tests, they have to be robust. There should be no false alarms. Anyone who has worked with UI test automation knows that there are lots of moving parts and that making the tests foolproof is a pain. Selenium, the UI test library we use, does not work like a human. Instead, it does everything as fast as it possibly can. In the test code, it’s not always possible to know when a component has finished loading and can be interacted with. This means that you often have to implement explicit waits to ensure that the page is in a proper state before letting the test continue. Explicit waits, however, bring a problem of their own: to keep the execution time as short as possible, you don’t want to make the waits any longer than they have to be. This in turn means that the wait times you have tweaked to be just enough sometimes aren’t, and then the test fails. A number of things can affect load times, which correlate directly with the needed wait times: there may be temporary network issues, the load on the test server may vary, first initializations after a fresh deployment can cause longer load times, and so forth. What all of these have in common is that they are external issues that are hard or downright impossible to tackle within the scope of a UI test.
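To make the explicit-wait problem concrete, here is a minimal sketch of such a wait using Selenium’s WebDriverWait in C#. The element id and the timeout are made up for illustration; they are not from our actual test code.

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class WaitExample
{
  public static void ClickWhenReady(IWebDriver driver)
  {
    // Poll for up to 10 seconds until the (hypothetical) button is both
    // visible and enabled, then click it. If the page happens to load slower
    // than that, Until() throws and the test fails - exactly the kind of
    // "random" failure described above.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

    IWebElement button = wait.Until(drv =>
    {
      IWebElement element = drv.FindElement(By.Id("saveButton"));
      return (element.Displayed && element.Enabled) ? element : null;
    });

    button.Click();
  }
}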

So what do you do when the failures are seemingly random and there is no way to “just fix the test that fails”? To me, the obvious answer is to retry the failed test, since it will most likely pass just fine on a second attempt. One option would be to configure the automated test environment to rerun all failed tests after a test set completes, but this approach has its annoyances. The biggest one is that the tests would first be flagged as failed and would remain so until the system finishes rerunning them and flags them as ok. A better option is to retry a failed test immediately in the test code, so it never even gets flagged as failed. The problem with this approach is that it requires adding retry logic to every single test method, resulting in lots of copy-paste boilerplate code. There just has to be a better way…

…And fortunately, there is! Aspect-oriented programming to the rescue. There are a number of different AOP libraries available for .NET, but perhaps the most comprehensive and well-known one is PostSharp. I’m not going to summarize here what AOP means; those interested or unfamiliar with the subject can research it on their own. Instead, I’ll tell you how AOP, and PostSharp in particular, can help us with the issue at hand. PostSharp allows us to intercept method calls and add custom code around the intercepted methods. We use this functionality to create an interception class that contains the test retry logic, which would otherwise have to be added to each test method separately. Then we add an attribute matching the interception class to every test method we want retry functionality in, and we are done! That really is all that’s needed, and here is a concrete code example to prove it.

using System;
using PostSharp.Aspects;

[Serializable]
public class RetryAspect : MethodInterceptionAspect
{
  // How many times a failed test is retried before it is finally deemed failed
  private const int MAX_RETRIES = 2;

  public override void OnInvoke(MethodInterceptionArgs args)
  {
    int retryCount = 0;

    bool success = false;

    do
    {
      try
      {
        // Execute the original, intercepted test method
        args.Proceed();
        success = true;
        
        if (retryCount > 0)
        {
          Console.WriteLine(args.Instance.GetType().Name + "." + args.Method.Name +
            " passed after " + retryCount + " retries");
        }
      }
      catch
      {
        if (retryCount < MAX_RETRIES)
        {
          // Put everything back to a clean state before the next attempt
          ServerTestBase testBase = (ServerTestBase)args.Instance;
          testBase.TestCleanup();
          testBase.TestInitialize();
        }
        else
        {
          Console.WriteLine(args.Instance.GetType().Name + "." + args.Method.Name +
            " failed even after " + retryCount + " retries");
          throw;
        }
      }
    }
    while (!success && retryCount++ < MAX_RETRIES);
  }
}

So let’s step through this. Our RetryAspect class extends MethodInterceptionAspect, which is part of PostSharp and allows us to intercept method calls. The class contains a retry loop with a try-catch, in which args.Proceed() executes the original test method we’ve intercepted. A failed test always throws an assertion exception, so we can safely assume that if no exception has been thrown, the test has succeeded. The TestCleanup() and TestInitialize() methods called in the catch block are the same ones that run after and before every normal test, respectively. So when a test fails, we clean up and re-initialize it to put everything in exactly the same state at the beginning of every retry round. We can access the current test class instance via args.Instance. To actually make use of this interception class, all it takes is to add an attribute to the test methods as follows.

[RetryAspect]
[TestMethod]
public void AddLocation()
{
  string location = TestUtils.RandomString(10);

  // Navigate to locations view and open wizard
  ClickLink("Locations");
  ClickLink("Add location");
  ...

With the RetryAspect class and the above attribute, AddLocation() is retried up to two times before it is deemed failed. Not bad. Compare this to manually adding the logic of the RetryAspect class to every single test method…
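To make that comparison concrete, here is a rough sketch, purely hypothetical and not from our code base, of what such manual retry logic could look like inside a single test method. It assumes the test class inherits from ServerTestBase so that TestCleanup() and TestInitialize() can be called directly. Now picture this wrapper copy-pasted around the steps of every test:

[TestMethod]
public void AddLocation_WithManualRetry()
{
  const int maxRetries = 2;

  for (int attempt = 0; ; attempt++)
  {
    try
    {
      // ... the actual test steps would go here ...
      break;
    }
    catch
    {
      // Out of retries: let the exception propagate and fail the test
      if (attempt >= maxRetries)
      {
        throw;
      }

      // Otherwise clean up, re-initialize and go around the loop again
      TestCleanup();
      TestInitialize();
    }
  }
}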

Our RetryAspect in action. No more random fails.

Some may argue that the random failures are an indication of test smell, which, to be honest, is true. I admit that this is a compromise, but to me it feels much worse to use extra-long explicit waits after every action to try to tackle these random failures, when even that doesn’t always work. Or, even worse, to ignore the random failures altogether and let developers get used to the red or yellow light, no longer treating it as an issue that needs to be dealt with right away, slowly eroding all trust in the test automation.

Where we too often were for no real reason, before dealing with this issue

Where we are now

So what do you think? Have you solved similar problems in some other, perhaps more elegant way? I’m interested in how others have tackled this issue. Please tweet your comments to me @makinjes.

Jesse Mäkinen

Software Engineer at Miradore Ltd
Jesse Mäkinen has been a software engineer at Miradore since 2011. His main focus at Miradore has been mobile platforms. Jesse holds a B.Eng. from Saimaa University of Applied Sciences.