Refactoring in Mule

In my first entry I will cover how I approach refactoring existing integrations and how to use a cache to optimize the performance of an API.

Business case:

In one of our integrations we push product data to an SFTP location to be picked up by a B2B platform. It is a relatively simple integration when you look at the destination end, but the source end is a bit more complicated. We need to pick up product data from an ERP system, and which ERP system that is depends on the market (or country, if you like). We have two ERP systems globally, but they hold only some of the information required: only what usually sits in ERP, like the logistics master data that drives business processes. There is also marketing data, like categories and brands, that we keep in a separate database (let us call it EDW). We need to combine the two to feed the sales system.

All the interfaces were built two years ago. While moving them to the cloud we looked for optimizations, because processing for some countries took up to 60 minutes to complete. vCores in the cloud cost money, so better performance converts directly into savings.

Approach:

First I looked into the structure of the applications and APIs. It turned out that the bottleneck is not the ERP system, which we query in bulk, but the auxiliary database (EDW) that we pull marketing data from. The root cause of the lag is a poor design: the processing layer calls the system layer API per product (per product code and country, to be exact), and the system layer just queries the database to respond. In total the request-response time goes up to 5-8 seconds.

Honestly, the best approach would be to redesign the EDW API to query data in batches, but then you would need to refactor all dependent applications. And there are a lot of them, because each country has its own app. I obviously did not want to refactor multiple apps to fix this architectural flaw; I would have been stuck for weeks. Instead I preferred to make an isolated change in the EDW API, which is a single global endpoint.

Pretty easy, was my first thought: just add a cache on the database query in the EDW API. But will that really help? Hm, a cache basically helps only when you make repeated calls for the same data, and in product master data there are no duplicates. No gain here.

OK, what if I make a huge cache with all the data we have in EDW? I could feed the cache in an auxiliary flow every hour and always pick up data from the cache rather than query the database. Wait, but how many rows do we have in EDW? 2.4M? Quick math: one product is roughly 1 kB, so that makes about 2.4 GB. It will not work in the cloud, and I would not risk it on-prem either…

Final design:

The final solution is based on the usage pattern I could see on the EDW API. It is queried by single product and country, but once a country is queried you can expect that all of its products (or most of them) will be queried subsequently. So when the first product of a country is requested, I query the database and feed the cache with all the products for that country.

The cache must be kept within certain limits to avoid out-of-memory issues, but Mule has configuration options to fine-tune it. In fact, all the components required come with Mule and no custom code is needed.

Gain:

Request-response time went down from 5 s to 20-30 ms. The first product for a country takes longer (around 8 s, since that call rebuilds the cache), but it pays off…

Caveat:

If the usage pattern changes and API calls become random with respect to country, performance can end up even worse than before, because a single product lookup may then trigger a full country load.

Solution walkthrough:

The original flow was invoked by the APIkit Router and looked pretty simple. There was a database call for product data, and the response was generated depending on the length of the result. When the product was missing in EDW an error response was returned; otherwise the input product JSON was enriched with the EDW data.

Original flow

The EDW database query is marked in yellow.

After the redesign the flow did not change much. Intentionally, I changed nothing except replacing the database call with the cache logic.

Flow refactored

I encapsulated the cache logic in a separate sub-flow.

Product fetching bit with cache refresh logic

As you can see, I am using an Object Store for my cache. Remember that I chose to cache products per country, so I need logic that checks whether the country cache is valid or whether I need to feed it with fresh database query results.

Check for country cache existence

The first step is checking whether we have a cache entry for the country. Product data goes into the cache under a {country}:{productCode} key, and whenever I rebuild a country cache I add one extra entry with a {country}:generic key. If this key exists, then the country cache exists, so in this step I only check for the presence of this key.
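In XML this check could look roughly like the sketch below (Mule 4 Object Store connector syntax; the store name, variable names and the feeding flow name are my placeholders, not the ones in the screenshots):

<os:contains objectStore="edwProductCache"
             key="#[vars.country ++ ':generic']"
             target="countryCacheExists" />
<choice>
    <!-- no flag entry: country cache is missing or expired, rebuild it -->
    <when expression="#[not vars.countryCacheExists]">
        <flow-ref name="feed-country-cache" />
    </when>
    <!-- flag entry found: the country cache is considered valid -->
    <otherwise>
        <logger level="DEBUG" message="Country cache hit for #[vars.country]" />
    </otherwise>
</choice>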

Let us have a look at the Object Store configuration.

Object Store Configuration

I decided on a 20-minute TTL for the product cache and, on top of that, a limit on the number of entries to prevent memory issues. 20 minutes is more than enough to process one country file, and a 50,000-entry limit results in roughly 50 MB of cache (at about 1 kB per product). That fits well in a 0.1 vCore (micro) worker.
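For illustration, such a store could be declared more or less like this (a sketch with the Mule 4 Object Store connector; the name is a placeholder and the attribute names differ in older runtimes):

<os:object-store name="edwProductCache"
                 persistent="false"
                 maxEntries="50000"
                 entryTtl="20"
                 entryTtlUnit="MINUTES"
                 expirationInterval="1"
                 expirationIntervalUnit="MINUTES" />

The maxEntries limit is what keeps the cache at roughly 50 MB; entryTtl takes care of the 20-minute expiry.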

When the country cache is empty, it is time to query all the country data and feed the cache. That logic is encapsulated in a separate sub-flow.

Cache feeding

The query for the count is only for testing here and will stay until UAT; it adds no value to the response. The products are queried in the second database call. The query result is then transformed into a Java object (a collection of rows as LinkedHashMap objects) and the Object Store entry for the country is created. Finally the products (LinkedHashMap objects) are added to the Object Store one by one under a {country}:{productCode} key.

Note that the country flag is added before the product cache is built. This way the first entry that expires for a country will be the country flag, and that is my intention: once the flag is gone, the next request for that country rebuilds the whole cache before individual product entries start expiring.

Adding a flag cache entry for a country
Adding a product cache entry inside For Each scope.
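Put together, the feeding sub-flow could be sketched roughly as below (Mule 4 XML; the sub-flow name, database config, table and column names are placeholders, the real queries are in the screenshots above):

<sub-flow name="feed-country-cache">
    <!-- query all products for the country in one go -->
    <db:select config-ref="EDW_Database_Config">
        <db:sql>SELECT * FROM products WHERE country = :country</db:sql>
        <db:input-parameters>#[{ country: vars.country }]</db:input-parameters>
    </db:select>

    <!-- make sure the rows are plain Java maps before caching them -->
    <ee:transform>
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/java
---
payload]]></ee:set-payload>
        </ee:message>
    </ee:transform>

    <!-- the country flag goes in first, so it is also the first entry to expire -->
    <os:store objectStore="edwProductCache" key="#[vars.country ++ ':generic']">
        <os:value>#['cached']</os:value>
    </os:store>

    <!-- each product row goes in under its {country}:{productCode} key -->
    <foreach collection="#[payload]">
        <os:store objectStore="edwProductCache"
                  key="#[vars.country ++ ':' ++ payload.productCode]">
            <os:value>#[payload]</os:value>
        </os:store>
    </foreach>
</sub-flow>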

Time to get back to our primary sub-flow that gets the product data.

We are back in the flow at the Set Payload element labelled “empty array list”. That will be the response going back from this sub-flow. We only replaced the database query with our sub-flow, so the invoking flow still expects a collection.

At this point we are sure that the country cache is valid, so we can get the product data from the cache.

Get product data from Object Store

The Object Store value goes into the “product” flow variable. If the product does not exist in the cache, the variable is set to the string “none”.
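Sketched in XML, that part could look like this (again Mule 4 syntax with placeholder names; the default value is what produces the “none” marker):

<!-- start with an empty collection, mimicking an empty database result -->
<set-payload value="#[[]]" />

<!-- fetch the product from the cache, falling back to 'none' when it is missing -->
<os:retrieve objectStore="edwProductCache"
             key="#[vars.country ++ ':' ++ vars.productCode]"
             target="product">
    <os:default-value>#['none']</os:default-value>
</os:retrieve>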

When the product exists (i.e. the product flow variable is not “none”), a script component adds it to the collection. Otherwise the collection stays empty, which is compatible with the invoking flow's expectations. We are just mimicking the database select query here.

Adding the product data object to the collection
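A rough XML equivalent of that last choice (sketched here with Set Payload instead of the script component shown in the screenshot; the behaviour is the same):

<choice>
    <!-- product found in the cache: wrap it in a one-element collection -->
    <when expression="#[vars.product != 'none']">
        <set-payload value="#[[vars.product]]" />
    </when>
    <!-- product missing: leave the empty collection as the response -->
    <otherwise>
        <logger level="DEBUG"
                message="No cache entry for #[vars.country]:#[vars.productCode]" />
    </otherwise>
</choice>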

Summary

The solution I presented uses an in-memory Object Store as a cache layer and is aligned with the API usage pattern. It also shows how to make an isolated change that does not affect the applications depending on the API and does not require many changes to the existing API logic.

All components used here are out-of-the-box MuleSoft components: Choice router, Object Store, Set Payload, DataWeave, Script, Logger, and the Database connector. All but one are available in the Community edition; to use DataWeave you need the Enterprise edition.

Be aware that with any cache you trade speed for accuracy (or truth, if you like). If you have fast-changing data you should not use one, or only with very short TTLs. Master data is usually a good candidate for caching, but always ask your business people how often they change it.
