Challenge
Recently our Sitecore CD app was crashing 70+ times per day, and it took us roughly 4 weeks to pinpoint the issue. I thought I'd share what we learned, so that you don't have to invest the same time finding the root cause and, if you are proactive, you can avoid such issues in your own application!
Solution
It took us 3.5 weeks to collect a crash dump because of Azure PaaS gotchas, and then 0.5 weeks to fix the issue and deploy it to PROD. I've already written a post about those 3.5 weeks of struggle to capture a crash dump here: https://www.kiranpatils.com/2021/08/23/a-tale-of-crashing-web-app-and-failing-crash-dump-collection-tools/. If you are curious about the dump collection challenges, please read that post; otherwise it's not mandatory.
High-level application architecture
- Built on Sitecore 9.3 with SXA
- Hosted on Azure Web App (PaaS)
- P2V2 * 4 (Scaled to 4 instances)
- Multi-site solution using Sitecore hosting 12 websites
- Serving ~6 M Requests/Day
To begin our troubleshooting, we had the following stack trace from Proactive Crash Monitoring (PCM). We had no raw dump file, as PCM deletes it.
========================================================
Dump Analysis for w3wp__app-cd__6f6d__PID__8612__Date__07_20_2021__Time_02_28_56PM__20__ntdll!ZwTerminateProcess_pd0mdwk00000N.dmp
========================================================
Thread 5272
ExitCode 800703E9
ExitCodeString COR_E_STACKOVERFLOW
Managed Exception = System.StackOverflowException:
CallStack - Managed Exception
========================================================
callstack - Crashing Thread
========================================================
FaultingExceptionFrame
HelperMethodFrame
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1+<Invoke>d__5[[System.__Canon, mscorlib]].MoveNext()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
System.Threading.Tasks.AwaitTaskContinuation.RunCallback(System.Threading.ContextCallback, System.Object, System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.Task.FinishContinuations()
System.Threading.Tasks.Task.Finish(Boolean)
System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].TrySetException(System.Object)
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].SetException(System.Exception)
Microsoft.Owin.Mapping.MapMiddleware+<Invoke>d__3.MoveNext()
HelperMethodFrame
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
Microsoft.Owin.Mapping.MapMiddleware+<Invoke>d__3.MoveNext()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
System.Threading.Tasks.AwaitTaskContinuation.RunCallback(System.Threading.ContextCallback, System.Object, System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.Task.FinishContinuations()
System.Threading.Tasks.Task.Finish(Boolean)
System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].TrySetException(System.Object)
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].SetException(System.Exception)
Microsoft.Owin.Mapping.MapMiddleware+<Invoke>d__3.MoveNext()
HelperMethodFrame
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
// Same Stack trace repeats 5 times - removed to save your scroll wheel :-)
As you can see from the above stack trace, all assemblies are either System or Microsoft. We shared this information with Microsoft support, and they asked us to reach out to Sitecore. The Sitecore support team asked us to double-check our OWIN-related customizations, which we had already shared with them, and they could not pinpoint any issue with our OWIN customization related to Federated Authentication either.
As nothing looked suspicious to us and the code had not changed for 6 months, we asked Sitecore to help us understand this stack trace better, and they started checking with their internal OWIN expert. In the meantime, I also opened a GitHub issue here: https://github.com/aspnet/AspNetKatana/issues/424. There, Chris Ross helped us understand the internals of the above stack trace, and what we understood is that “Microsoft.Owin.Mapping.MapMiddleware+<Invoke>d__3.MoveNext” is where the mapped OWIN methods get invoked.
Meanwhile, we were able to capture a crash dump along with the raw dump file, and analyzing that dump helped us pinpoint the exact method.
Simplified method details:
namespace SCBasics.Foundation.Security.Pipelines.OwinInitialize
{
    public class HandleIbeUrl : InitializeProcessor
    {
        public override void Process(InitializeArgs args)
        {
            if (!this.Settings.FederatedAuthenticationEnabled())
                return;

            HandleExternalIbeUrl(args);
        }

        protected void HandleExternalIbeUrl(InitializeArgs args)
        {
            Assert.ArgumentNotNull(args, "args");
            args.App.Map(string.Concat(Settings.IdentityProcessingPathPrefix().EnsureTrailingSlash(), "externalibeurl"), SessionStateBehavior.Required, app => app.Run(context =>
            {
                // Some custom code throwing an exception when input is wrong - can crash your Sitecore app!
                return Task.CompletedTask;
            }));
        }
    }
}
<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <owin.initialize>
        <processor type="SCBasics.Foundation.Security.Pipelines.OwinInitialize.HandleIbeUrl, SCBasics.Foundation.Security" resolve="true"/>
      </owin.initialize>
    </pipelines>
  </sitecore>
</configuration>
Once we had steps to reproduce the issue, we tried them on all environments. Interestingly, we could reproduce the issue on QA and PROD only; the UAT app did not crash. We compared everything (bin, configs, etc.) but found no difference. Sitecore was also unable to explain this (it seems to be something related to “Sitecore.Owin.Pipelines.Initialize.SetGlobalExceptionHandler, Sitecore.Owin”). But as the crash affected a couple of environments (most importantly PROD), we added a fix, and the fix was super simple: wrapping that code in a try..catch, in other words, handling the exception gracefully!
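To make the try..catch idea concrete, here is a minimal sketch of one way to guard an OWIN handler. It is an illustration under assumptions, not our exact production code: SafeOwinHandler is a hypothetical helper name (it is not part of Sitecore or Katana), and logging via System.Diagnostics.Trace and answering with HTTP 400 are assumed choices. It needs the Microsoft.Owin package, which a Sitecore 9.3 solution with Federated Authentication already references.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Owin;

namespace SCBasics.Foundation.Security.Pipelines.OwinInitialize
{
    // Hypothetical helper: wraps an OWIN handler so that no exception
    // escapes from it into the OWIN pipeline.
    public static class SafeOwinHandler
    {
        public static Func<IOwinContext, Task> Wrap(Func<IOwinContext, Task> inner)
        {
            if (inner == null) throw new ArgumentNullException(nameof(inner));

            return async context =>
            {
                try
                {
                    // Catches exceptions thrown synchronously by the handler
                    // as well as exceptions surfaced by the returned task.
                    await inner(context);
                }
                catch (Exception ex)
                {
                    // Assumption: log and answer with 400 (bad input).
                    // Swap in your own logger and status code as needed.
                    Trace.TraceError("externalibeurl handler failed: {0}", ex);
                    context.Response.StatusCode = 400;
                }
            };
        }
    }
}

// Usage inside HandleExternalIbeUrl, replacing the unguarded app.Run call:
//
//   app => app.Run(SafeOwinHandler.Wrap(context =>
//   {
//       // custom code that may throw on wrong input
//       return Task.CompletedTask;
//   }))
```

Putting the try..catch directly inside the lambda works just as well; a small wrapper like this only becomes handy once you map more than one custom endpoint.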
It’s still a mystery to us why the app crashes because of exceptions at this level. Theoretically it shouldn’t, but in the real world it does! And that’s what I (and most developers) love about our profession: every challenge is a new puzzle, irrespective of your experience! (Experience surely helps!)
Have a healthy and stable Sitecore application!