Friday, 1 December 2017

OSB: Disable Chunked Streaming Mode recommendation

Intro

Over the past few weeks I got involved in a document generation performance issue. It had been going on for several months, maybe even years, but it remained quite unclear what the actual issue was.

We often got complaints that document generation from the front-end application (based on Siebel) was taking very long. End users often hit the button several times, but with no luck. When we asked further, it turned out that no document had appeared in the content management system (Oracle UCM/WCC) either. So we concluded that it wasn't so much a performance issue as an exception somewhere along the document generation process. Since we had upgraded BI Publisher to 12c, it was assumed that it might have something to do with that. But we did not find any problems with BI Publisher itself. There was also an issue with Siebel itself, but that is outside the scope of this article.

The investigation

First, the retry interval of the particular Business Service in OSB was decreased from 60 seconds to 10, and the perceived performance improved: with a shorter retry interval, OSB retries a failed request sooner. But of course this did not solve the underlying problem.

As service developers we are often quite casual about retries. We just make up some settings; a fairly common default is an interval of 30 seconds and a retry count of 3. But we should actually think about this and figure out what the possible failures could be and what a sensible retry setting would be. For instance: is it likely that the remote system is down? What are the SLAs for bringing it back up again? If system startup takes 10 minutes, then a retry count of 3 and an interval of 30 seconds makes no sense: the retries are exhausted long before the system is up again. But of course, in our case settings that are sensible for a system outage would cause delays that are far too long. We apparently needed to cater for network issues instead.
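To make that reasoning concrete, here is a minimal sketch of the arithmetic. The function names are my own and it assumes each retry fails immediately, so the only time that passes is the retry interval itself; it is an illustration, not how OSB computes anything internally.

```python
def retry_window_seconds(retry_count: int, retry_interval_s: float) -> float:
    """Time between the first failure and the last retry attempt.

    Assumes every attempt fails immediately, so only the waits count.
    """
    return retry_count * retry_interval_s


def retries_can_outlast(outage_s: float, retry_count: int, retry_interval_s: float) -> bool:
    """True if the last retry happens at or after the end of the outage."""
    return retry_window_seconds(retry_count, retry_interval_s) >= outage_s


# The example from the text: 3 retries, 30 seconds apart, against a 10 minute startup.
window = retry_window_seconds(3, 30)
covered = retries_can_outlast(10 * 60, 3, 30)
print(f"retry window: {window:.0f}s, covers a 600s outage: {covered}")
```

With these numbers the retries are used up after 90 seconds, well short of the 600 seconds the system needs to come back.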

Last week our sysadmins encountered network failures, so they changed the load balancer in front of BI Publisher to get the chunks/packets of one request routed to the same BI Publisher node. I found SocketReadTimeOuts in the log files. A query on the Siebel database, plotted in Excel, showed lots of requests in the 1-15 seconds range, but also some points in ranges around 40 seconds and 80 seconds. We wondered where these came from.

The Connection and Read Timeout settings on the Business Service were set to 30 seconds. So I figured the 40 and 80 seconds ranges could have something to do with the retry interval of 10 seconds added to a timeout of 30 seconds.
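The hypothesis can be written down as a small calculation. This is a sketch under my own assumptions: every failed attempt runs into the full read timeout, OSB then waits the full retry interval, and the final successful attempt adds its own (short) duration on top. The function name is hypothetical.

```python
def minimum_duration_s(failed_attempts: int,
                       read_timeout_s: float = 30,
                       retry_interval_s: float = 10) -> float:
    """Lower bound on the total request duration after a number of timed-out attempts.

    Each failed attempt costs one read timeout plus one retry interval;
    the duration of the final, successful attempt comes on top of this.
    """
    return failed_attempts * (read_timeout_s + retry_interval_s)


for failures in range(3):
    print(f"{failures} timed-out attempt(s): at least {minimum_duration_s(failures):.0f}s")
```

One timed-out attempt gives a floor of 40 seconds and two give 80 seconds, which matches the clusters in the plot.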

I soon found out that Chunked Streaming Mode was enabled on the Business Service in OSB. This is a setting we have struggled with a lot; several issues we encountered were blamed on it. Just as a helpdesk employee would ask whether you have restarted your system, on OSB questions I would ask you about this setting first... Actually, I did so for this case, long before I got actively involved.

Chunked Streaming Mode explained

Let's start with a diagram:

In this diagram you'll see that OSB is fronted by a load balancer. But since 12c the Oracle HTTP Server (OHS) is part of the WebLogic infrastructure. Following the Enterprise Deployment Guide, we added an OHS to the WebLogic Infrastructure domain as a co-located OHS instance. And since OSB as well as the Service Provider (in our case BI Publisher) are clustered, the OHS will load balance the requests.

Now, chunked transfer encoding is part of the HTTP/1.1 specification. It is an improvement that allows the receiver to process the data chunk by chunk, right after each chunk is read. But in most (of our) cases a chunk on its own is meaningless, since a SOAP request/XML document needs to be parsed as a whole.
The load balancer also processes the chunks as separate entities. So, by default, it will route the first one to the first endpoint and the next one to the next endpoint. Thus each Service Provider (SP) managed server gets an incomplete message, and therefore a so-called Bad Request. This happens with big requests, where for instance a report is requested together with the complete content. Then chances are that the request is split up into chunks.
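To show what chunking does to a message on the wire, here is a minimal sketch of the HTTP/1.1 chunked body format (hexadecimal size line, data, CRLF, terminated by a zero-size chunk). The helper names, the tiny chunk size, and the sample envelope are my own illustration; real clients pick their own chunk sizes and this ignores trailers.

```python
import xml.etree.ElementTree as ET


def encode_chunked(body: bytes, chunk_size: int) -> bytes:
    """Encode a body as an HTTP/1.1 chunked message body."""
    out = bytearray()
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        # Each chunk: size in hex, CRLF, the data, CRLF.
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # zero-size chunk marks the end of the body
    return bytes(out)


def decode_chunked(wire: bytes) -> bytes:
    """Reassemble the original body from a chunked message body."""
    body = bytearray()
    pos = 0
    while True:
        line_end = wire.index(b"\r\n", pos)
        size = int(wire[pos:line_end].split(b";")[0], 16)  # drop chunk extensions
        pos = line_end + 2
        if size == 0:
            break
        body += wire[pos:pos + size]
        pos += size + 2  # skip the data and its trailing CRLF
    return bytes(body)


soap = b"<Envelope><Body><doc>report</doc></Body></Envelope>"
wire = encode_chunked(soap, 16)

# A single chunk is not a well-formed XML document...
try:
    ET.fromstring(soap[:16])
except ET.ParseError:
    print("first chunk alone cannot be parsed")

# ...but the reassembled body is.
print(ET.fromstring(decode_chunked(wire)).tag)
```

The point of the sketch is only that the receiver must see all chunks of a request before an XML parser can do anything useful with it.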

But although the sysadmins adapted the SP load balancer, and although I was involved in the BI Publisher 12c setup, even I forgot about the BIP 12c OHS! Even when the load balancer tries to keep the chunks together, the OHS will mess with them again. Conversely, if the load balancer did not keep them together, the OHS instances could still route them to the correct end node.

The Solution

So for all those Service Bus developers amongst you, I'd like you to memorize two concepts: "Chunked Streaming Mode" and "disable", the latter in combination with the former, of course.
In short: remember to disable Chunked Streaming Mode in every SOAP/HTTP-based Business Service, especially for services that send potentially large requests, for instance document check-in services on content/document management systems.

The proof of the pudding

After some discussion, and not being able to test it on the Acceptance Test environment due to rebuilds, we decided to change this in production (which I would not recommend, at least not right away).

And this was the result:


This picture shows that in the first half of the day plenty of requests were retried at least once, and several even twice. Notice the request durations around 40 seconds (30 seconds read timeout + 10 seconds retry interval) and 80 seconds. But since 12:45, when we disabled Chunked Streaming Mode, we haven't seen any timeout exceptions any more. I hope the end users are happy now.

Or how a simple setting can throw a spanner in the works, and how difficult it is to get such a simple change into production. Personally I think it's a pity that Chunked Streaming Mode is enabled by default, since in most cases it causes problems, while only in rare cases might it provide some performance improvement. I think you should have to justify enabling it, instead of having to actively disable it.
