Solution Design - Ordering System

This is a Rosetta Code post.

Story

You need to come up with a Solution Design for a fault-tolerant Ordering System. Orders are placed by the Admin user on behalf of Customers. The customer does not log in to anything; they place the order by email or telephone, but their details are known to the company. Orders are shipped via an external company called Bobs Post, but their API is unstable and can go down for days at a time. There is no other shipper we can use.

Task

Design a system that allows an Admin User to place orders for Customers. The actual items (what they order) don't matter. The focus is on Solution Design for a fault-tolerant system where the Customer is notified when the item(s) are shipped.

Solutions

Conceptual Thinking

First Draft

First, let's understand the problem using an unconventional Sticky Note representation.

First draft to understand a simple flow

Second Draft

Remember that Bobs Post has an unstable API. We can offload responsibility from the BFF (Backend for Frontend) and introduce some retry logic using the Asynchronous Request-Reply pattern and queues. Additionally, Polly can be used to retry HTTP requests.
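Polly is a .NET library, so the sketch below only illustrates the general idea of retrying a flaky call with exponential backoff. It is written in Python, and the names `ShipperUnavailable` and `retry_with_backoff` are hypothetical, not part of Polly or any Bobs Post client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ShipperUnavailable(Exception):
    """Hypothetical error raised when the Bobs Post API cannot be reached."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ShipperUnavailable with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ShipperUnavailable:
            if attempt == max_attempts:
                # Retries exhausted: let the caller (or the queue) deal with it.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In-process retries like this only cover short outages; longer ones are what the queue is for.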

Second Draft, move workload Async

Now add a Worker process that will process the order request. If a message is not deleted and has been received the maximum receive count times, it is pushed to the configured Dead Letter Queue (DLQ).
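The receive-count behaviour can be sketched with an in-memory queue. This is a simplified model, not a real queue client: `QueueWithDlq`, `Message`, and `process_next` are hypothetical names, and a real broker tracks receive counts and visibility itself.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    receive_count: int = 0


class QueueWithDlq:
    """In-memory stand-in for a queue with a configured dead letter queue."""

    def __init__(self, max_receive_count: int) -> None:
        self.max_receive_count = max_receive_count
        self.messages: Deque[Message] = deque()
        self.dead_letters: List[Message] = []

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def process_next(self, handler: Callable[[str], None]) -> bool:
        """Receive one message; return True if the handler succeeded."""
        message = self.messages.popleft()
        message.receive_count += 1
        try:
            handler(message.body)
            return True  # Success: the message is deleted (not put back).
        except Exception:
            if message.receive_count >= self.max_receive_count:
                # Received too many times without being deleted: dead-letter it.
                self.dead_letters.append(message)
            else:
                # Not deleted, so it becomes available for another receive.
                self.messages.append(message)
            return False
```

With `max_receive_count=2`, a message whose handler always fails is put back once and lands in `dead_letters` on the second failure.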

Second Draft, understand the Async workload. It's doing too much!

Third Draft

There is still a problem with the design because Bobs Post is unstable: if any of the requests fail and the retry count for Polly is exhausted, the process will fall over and, when the message is retried, steps that already succeeded may be duplicated.

We can try to address this by adding a key/value database like Dynamo to keep track of the process. Additionally, the responsibility of the process can be delegated to several workers and queues. The caveat is that this brings complexity, so only introduce it when needed.

Conditional logic would be added so that if the data already exists in Dynamo, that step is skipped. For example, if we already have the user data, don't call the User API.
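A minimal sketch of that conditional logic, using a plain dictionary as a stand-in for the key/value database. The function name `run_step_once` and the step name `user` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict

# order_id -> {step name -> stored result}; a dict stands in for the database.
Store = Dict[str, Dict[str, str]]


def run_step_once(store: Store, order_id: str, step: str, action: Callable[[], str]) -> str:
    """Run `action` only if this step has no stored result for the order."""
    record = store.setdefault(order_id, {})
    if step in record:
        # The data already exists, so skip the external call.
        return record[step]
    result = action()
    # Persist the result so a retried message does not repeat the step.
    record[step] = result
    return result
```

For example, calling `run_step_once(store, "order-1", "user", fetch_user)` twice invokes the hypothetical `fetch_user` only once. If `action` raises, nothing is stored, so the step runs again on the next attempt.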

  • Third Draft: The main focus here is to place the order with Bobs Post (status=Scheduled)

Third Draft. Break it down, check and update `Scheduled` status

  • Third Draft: The main focus here is to check for updates at Bobs Post (status=Shipped)
  • If no update is available, just queue another message to check for shipped (WARNING: this can cause an infinite loop). So potentially just don't delete the message and rely on the DLQ.
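The "don't delete and rely on the DLQ" option can be sketched as a handler that only tells the worker whether to delete the message. `Outcome`, `check_shipped`, and `get_status` are hypothetical names; `get_status` stands in for whatever call reads the status from Bobs Post.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DELETE = "delete"  # Work is done: remove the message from the queue.
    LEAVE = "leave"    # Do nothing: the queue redelivers, then dead-letters.


def check_shipped(order_id: str, get_status: Callable[[str], str]) -> Outcome:
    """Decide what to do with a 'check for shipped' message."""
    try:
        status = get_status(order_id)
    except ConnectionError:
        # Bobs Post is unreachable: leave the message for a later receive.
        return Outcome.LEAVE
    # No new message is queued here, so the number of checks is bounded
    # by the queue's maximum receive count instead of looping forever.
    return Outcome.DELETE if status == "Shipped" else Outcome.LEAVE
```

The trade-off is that the maximum receive count and redelivery delay now decide how long the system keeps checking before the message reaches the DLQ.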

Third Draft. Break it down, check and update `Shipped` status

  • Third Draft: The main focus here is to notify the User that their item(s) have shipped
  • Emails are a common thing to send, so a flow (Event Queue) may already exist where the request can be sent

Third Draft - Notification

Fourth Draft

The DLQs can be redriven.

  • Another worker can poll each DLQ and re-drive
  • The process could be manual
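A redrive worker could look roughly like the sketch below, which moves messages from a DLQ back to its source queue. Both queues are plain in-memory deques here, and `redrive` and its `limit` parameter are hypothetical.

```python
from collections import deque
from typing import Deque


def redrive(dead_letters: Deque[str], source_queue: Deque[str], limit: int) -> int:
    """Move up to `limit` messages from the DLQ back to the source queue."""
    moved = 0
    while dead_letters and moved < limit:
        # Oldest dead letter first, appended to the back of the source queue.
        source_queue.append(dead_letters.popleft())
        moved += 1
    return moved
```

Capping each run with `limit` avoids flooding the workers while Bobs Post may still be recovering. The same function could be triggered on a schedule or by hand.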

Finally, once you agree with your team(s), draw it as a technical sequence diagram. Also see the Miro Template.

Final Thoughts

The process above is not without fault. The default SQS message retention period is 4 days, so if Bobs Post is down for longer than this and the maximum retry period is exhausted, the system will fall over. Potentially the flow could be SPA -> BFF -> SQS -> Dynamo, with workers that progress the status and call the APIs based on the state of the Dynamo database record.
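A worker that progresses an order based on its stored record might be sketched as below. The statuses `Scheduled` and `Shipped` come from the drafts above; `Placed`, `Notified`, `NEXT_STATUS`, and `advance` are assumptions made for illustration.

```python
from typing import Callable, Dict

Record = Dict[str, str]

# Each status and the one that follows it; "Notified" is terminal.
NEXT_STATUS = {"Placed": "Scheduled", "Scheduled": "Shipped", "Shipped": "Notified"}


def advance(record: Record, actions: Dict[str, Callable[[Record], bool]]) -> str:
    """Try to move the order one step forward; return the resulting status."""
    current = record["status"]
    target = NEXT_STATUS.get(current)
    if target is None:
        return current  # Terminal status: nothing left to do.
    # The action for the target status calls the relevant API and
    # returns True only when that step has really completed.
    if actions[target](record):
        record["status"] = target
    return record["status"]
```

Because the record holds the state, a worker can run `advance` repeatedly, even after a long outage, without depending on a message surviving in the queue.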

My suggestion is to be pragmatic and solve problems when they become problems. If you add complexity early, you could solve a problem that doesn't exist, and the potential gains are lost in the complexity.

References