Improving CosmosDB Test Automation Reliability with Retry Logic
You will find that the Cosmos DB Emulator fails randomly, for no apparent reason, while doing simple things like getting an instance of a container or creating the database. In the world of cloud, it’s important to handle transient faults: errors that are not repeatable or consistent in when they appear.
They might look like this:
Business.Tests.BuildingTests.CreateNewBuilding [3s 650ms]
Error Message:
Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: 500 Substatus: 0 Reason: (Microsoft.Azure.Documents.DocumentClientException: Unknown server error occurred when processing this request.ActivityId: b55227fb-15d1-4506-a685-c6b7751271c3, Microsoft.Azure.Documents.Common/2.9.2, {"RequestStartTimeUtc":"2020-03-22T04:41:43.6841102Z","RequestEndTimeUtc":"2020-03-22T04:41:44.0570433Z","RequestLatency":"00:00:00.3729331","IsCpuOverloaded":false,"NumberRegionsAttempted":1,"ResponseStatisticsList":[],"AddressResolutionStatistics":[{"StartTime":"2020-03-22T04:41:43.6842253Z","EndTime":"2020-03-22T04:41:44.0570433Z","TargetEndpoint":"https://192.168.231.161:8081/dbs/mydb/colls"}],"SupplementalResponseStatistics":[],"FailedReplicas":[],"RegionsContacted":[],"ContactedReplicas":[]}, Windows/10.0.14393 cosmos-netstandard-sdk/3.4.2 at Microsoft.Azure.Cosmos.GatewayStoreClient.ParseResponseAsync(HttpResponseMessage responseMessage, JsonSerializerSettings serializerSettings, DocumentServiceRequest request) at Microsoft.Azure.Cosmos.GatewayStoreClient.InvokeAsync(DocumentServiceRequest request, ResourceType resourceType, Uri physicalAddress, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)).
Stack Trace:
at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()
at Microsoft.Azure.Cosmos.CosmosResponseFactory.ProcessMessageAsync[T](Task`1 cosmosResponseTask, Func`2 createResponse)
at Microsoft.Azure.Cosmos.DatabaseCore.CreateContainerIfNotExistsAsync(ContainerProperties containerProperties, Nullable`1 throughput, RequestOptions requestOptions, CancellationToken cancellationToken)
at Common.DataAccess.BaseDataAccess.GetContainerAsync() in D:a1sCommonDataAccessBaseDataAccess.cs:line 27
at DataAccess.BaseCrudDataAccess`1.CreateAsync(T entity) in D:a1sCommonDataAccessBaseCrudDataAccess.cs:line 24
at Business.BuildingRepository.CreateAsync(BuildingDetail entity) in D:a1sLocation.BusinessBuildingRepository.cs:line 88
at Business.Tests.BuildingTests.CreateNewBuilding() in D:a1sLocation.Business.TestsBuildingTests.cs:line 26
I’ve handled these faults with some simple retry logic that I found in a great post on Stack Overflow and modified to support async operations (DoAsync and DoAsync<T>).
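A minimal sketch of what the non-returning DoAsync helper could look like. The class name Retry, the parameter names, the default of three attempts, and the fixed delay between attempts are all assumptions for illustration, not the exact code from the Stack Overflow post:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper; declared partial so other overloads can live in separate files.
public static partial class Retry
{
    // Runs the async action, waiting retryInterval before each retry.
    // Returns on the first success; throws once every attempt has failed.
    public static async Task DoAsync(
        Func<Task> action,
        TimeSpan retryInterval,
        int maxAttemptCount = 3)
    {
        var exceptions = new List<Exception>();

        for (int attempt = 0; attempt < maxAttemptCount; attempt++)
        {
            try
            {
                if (attempt > 0)
                {
                    // Only delay before a retry, never before the first attempt.
                    await Task.Delay(retryInterval);
                }

                await action();
                return;
            }
            catch (Exception ex)
            {
                // Remember the failure and try again.
                exceptions.Add(ex);
            }
        }

        // Surface every failure so the test log shows what went wrong each time.
        throw new AggregateException(exceptions);
    }
}
```

Catching every Exception keeps the sketch short; in real code it may be worth retrying only on faults you believe are transient.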
Then the one that has a return type:
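A sketch of the value-returning variant, under the same assumptions (hypothetical Retry class, fixed delay, three attempts by default). It differs only in returning the result of the first successful attempt:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper; declared partial so other overloads can live in separate files.
public static partial class Retry
{
    // Runs the async function, waiting retryInterval before each retry,
    // and returns the value produced by the first attempt that succeeds.
    public static async Task<T> DoAsync<T>(
        Func<Task<T>> action,
        TimeSpan retryInterval,
        int maxAttemptCount = 3)
    {
        var exceptions = new List<Exception>();

        for (int attempt = 0; attempt < maxAttemptCount; attempt++)
        {
            try
            {
                if (attempt > 0)
                {
                    // Only delay before a retry, never before the first attempt.
                    await Task.Delay(retryInterval);
                }

                return await action();
            }
            catch (Exception ex)
            {
                // Remember the failure and try again.
                exceptions.Add(ex);
            }
        }

        // Every attempt failed: report all of the collected exceptions.
        throw new AggregateException(exceptions);
    }
}
```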
Example usage, in the Cosmos DB setting, is this:
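As a rough illustration, the calls that appear in the stack trace above (creating the database and the container inside GetContainerAsync) could be wrapped like this. The field names, the one-second interval, the five attempts, and the shape of BaseDataAccess are assumptions; the nested Retry class is a compact stand-in for the retry helper so the snippet compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class BaseDataAccess
{
    private readonly CosmosClient _client;
    private readonly string _databaseId;
    private readonly string _containerId;
    private readonly string _partitionKeyPath;

    public BaseDataAccess(CosmosClient client, string databaseId,
                          string containerId, string partitionKeyPath)
    {
        _client = client;
        _databaseId = databaseId;
        _containerId = containerId;
        _partitionKeyPath = partitionKeyPath;
    }

    public async Task<Container> GetContainerAsync()
    {
        var interval = TimeSpan.FromSeconds(1); // assumed delay between attempts

        // Retry database creation, which is one of the calls that fails at random.
        Database database = await Retry.DoAsync<Database>(
            async () => (await _client.CreateDatabaseIfNotExistsAsync(_databaseId)).Database,
            interval, 5);

        // Retry container creation, the call that failed in the trace above.
        return await Retry.DoAsync<Container>(
            async () => (await database.CreateContainerIfNotExistsAsync(
                _containerId, _partitionKeyPath)).Container,
            interval, 5);
    }

    // Compact stand-in for the retry helper (hypothetical signature).
    private static class Retry
    {
        public static async Task<T> DoAsync<T>(
            Func<Task<T>> action, TimeSpan retryInterval, int maxAttemptCount)
        {
            var exceptions = new List<Exception>();
            for (int attempt = 0; attempt < maxAttemptCount; attempt++)
            {
                try
                {
                    if (attempt > 0) await Task.Delay(retryInterval);
                    return await action();
                }
                catch (Exception ex)
                {
                    exceptions.Add(ex);
                }
            }
            throw new AggregateException(exceptions);
        }
    }
}
```

Both wrapped calls are "create if not exists" operations, so repeating them after a failure is safe; a non-idempotent write would need more care before being retried blindly.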
Just adding some simple retry logic around my Cosmos DB data access code drastically improved the reliability of my automated tests.