Intro
These past weeks I got involved in a document generation performance issue. It had been running for several months, maybe even years, but it stayed quite unclear what the actual issue was.

Often we got complaints that document generation from the front-end application (based on Siebel) was taking very long. End users often hit the button several times, but with no luck: it did not mean that a document appeared in the content management system (Oracle UCM/WCC). So we concluded that it wasn't so much a performance issue, but an exception somewhere along the document generation process. Since we had upgraded BI Publisher to 12c, it was figured that it might have something to do with that, but we did not find any problems with BI Publisher itself. There was also an issue with Siebel itself, but that is out of the scope of this article.
The investigation
First, on OSB the retry interval of the particular Business Service was decreased from 60 seconds to 10, and the performance improved: since the retry interval was shorter, OSB does a retry on shorter notice. But of course this did not solve the actual problem.

As service developers we are often quite casual about retries. We make up some settings; quite default is an interval of 30 seconds and a retry count of 3. But we should actually think about this and figure out what the possible failures could be and what a sensible retry setting would be. For instance: is it likely that the remote system is out of order? What are the SLAs for getting it back up again? If the system startup takes 10 minutes, then a retry count of 3 and an interval of 30 seconds makes no sense: the retries are done long before the system is up again. But of course, in our case, settings that are sensible for a system outage would cause delays that are far too long. We apparently needed to cater for network issues.
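As a back-of-the-envelope illustration, using only the example numbers above (not our actual production settings), a tiny sketch like this shows why such default retry settings are spent long before a restarted system is back:

```java
// Back-of-the-envelope sketch (example numbers from the text, not actual production settings):
// do the default retry settings even outlive a realistic system restart?
public class RetryBudget {
    public static void main(String[] args) {
        int retryCount = 3;             // "quite default" retry count
        int retryIntervalSec = 30;      // "quite default" retry interval in seconds
        int systemStartupSec = 10 * 60; // assumed time to get the remote system back up

        // Ignoring the duration of each failed attempt itself, the last retry fires
        // roughly retryCount * retryIntervalSec after the first failure.
        int lastRetryAfterSec = retryCount * retryIntervalSec;

        System.out.printf("Last retry fires ~%d s after the first failure; system is back after ~%d s.%n",
                lastRetryAfterSec, systemStartupSec);
        if (lastRetryAfterSec < systemStartupSec) {
            System.out.println("All retries are spent long before the system is up again.");
        }
    }
}
```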
Last week our sysadmins encountered network failures, so they changed the load balancer in front of BI Publisher to get the chunks/packets of one request routed to the same BI Publisher node. I found SocketReadTimeOuts in the log files. And a query was done on the Siebel database and plotted in Excel, showing lots of requests in the 1-15 second range, but also clusters around 40 seconds and 80 seconds. We wondered why that was.
The Connection and Read Timeout settings on the Business Service were set to 30 seconds. So I figured the 40 and 80 second ranges could have something to do with the retry interval of 10 seconds added to a timeout of 30 seconds.
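A minimal sketch of that arithmetic, assuming each failed attempt burns the full read timeout before OSB waits out the retry interval:

```java
// Minimal sketch: how the 40 and 80 second buckets follow from the settings,
// assuming each failed attempt burns the full read timeout before OSB waits out the retry interval.
public class DurationBuckets {
    public static void main(String[] args) {
        int readTimeoutSec = 30;   // Read Timeout on the Business Service
        int retryIntervalSec = 10; // retry interval after it was lowered to 10 seconds

        for (int failedAttempts = 1; failedAttempts <= 2; failedAttempts++) {
            // Overhead added before the attempt that finally succeeds (or the request is given up).
            int overheadSec = failedAttempts * (readTimeoutSec + retryIntervalSec);
            System.out.printf("%d timed-out attempt(s): durations cluster around %d+ seconds%n",
                    failedAttempts, overheadSec);
        }
    }
}
```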
I soon found out that on the OSB Business Service, Chunked Streaming Mode was enabled. This is a setting we have struggled with a lot; several issues we encountered were blamed on this one. Just as a helpdesk employee would ask whether you have restarted your system, for OSB questions I would ask about this setting first... Actually, I did so for this case, long before I got actively involved.
Chunked Streaming Mode explained
Let's start with a diagram:

Now, chunked transfer encoding is part of the HTTP 1.1 specification. It is an improvement that allows a receiver to process the data chunk by chunk, right after each chunk is read. But in most (of our) cases a chunk on its own is meaningless, since a SOAP request/XML document needs to be parsed as a whole.
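OSB's Chunked Streaming Mode maps onto this same HTTP 1.1 mechanism. As a plain-Java illustration (this is not OSB code, and the endpoint URL is made up), this is roughly what a chunked POST looks like from the client side:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Plain-Java illustration of a chunked POST (not OSB code; the endpoint URL is made up).
public class ChunkedPostExample {
    public static void main(String[] args) throws Exception {
        byte[] soapRequest = "<soapenv:Envelope>...</soapenv:Envelope>"
                .getBytes(StandardCharsets.UTF_8);

        URL url = new URL("http://bip.example.local/xmlpserver/services/PublicReportService");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // With chunked streaming the request goes out as "Transfer-Encoding: chunked":
        // a series of size-prefixed chunks (e.g. "400\r\n<1024 bytes>\r\n ... 0\r\n\r\n")
        // and no Content-Length header, so no single chunk contains the whole SOAP document.
        conn.setChunkedStreamingMode(1024);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(soapRequest);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```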
The load balancer also processes the chunks as separate entities. So, by default, it will route the first one to the first endpoint and the next one to another. Each SP Managed Server thus gets an incomplete message and therefore a so-called Bad Request. This happens with big requests, where for instance a report is requested together with its complete content; then chances are that the request is split up into chunks.
But although the sysadmins adapted the SP load balancer, and although I was involved in the BI Publisher 12c setup, even I forgot about the BIP 12c OHS! So even when the load balancer tries to keep the chunks together, the OHS will still mess with them. And actually, if the load balancer did not keep them together, the OHS instances could reroute them again to the correct end node.
The Solution
So, for all the Service Bus developers amongst you, I'd like you to memorize two concepts: "Chunked Streaming Mode" and "disable", the latter in combination with the first, of course. In short: remember to disable Chunked Streaming Mode on every SOAP/HTTP based Business Service, especially for services that send potentially large requests, for instance document check-in services on content/document management systems.
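In OSB itself the switch is simply the Chunked Streaming Mode setting on the Business Service's HTTP transport. For comparison, here is a plain-Java sketch of what "disabled" amounts to on the wire (again, not OSB code, and the endpoint URL is made up):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Plain-Java sketch of the non-chunked equivalent (not OSB code; the endpoint URL is made up).
public class NonChunkedPostExample {
    public static void main(String[] args) throws Exception {
        byte[] soapRequest = "<soapenv:Envelope>...</soapenv:Envelope>"
                .getBytes(StandardCharsets.UTF_8);

        URL url = new URL("http://ucm.example.local/cs/idcplg");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Fixed-length streaming: the Content-Length is known up front and the body is sent
        // as one complete entity, so intermediaries never see loose chunks to route separately.
        conn.setFixedLengthStreamingMode(soapRequest.length);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(soapRequest);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```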
The proof of the pudding
After some discussion, and not being able to test it on the Acceptance Test environment due to rebuilds, we decided to change this in production (I would not recommend that, at least not right away). And this was the result:
This picture shows that during the first half of the day plenty of requests were retried at least once, and several even twice. Notice the request durations around 40 seconds (30 seconds read timeout + 10 seconds retry interval) and around 80 seconds. But since 12:45, when we disabled Chunked Streaming Mode, we don't see any timeout exceptions any more. I hope the end users are happy now.
Or: how a simple setting can throw a spanner in the works, and how difficult it is to get such a simple change into production. Personally I think it's a pity that Chunked Streaming Mode is enabled by default, since in most cases it causes problems, while only in rare cases might it provide some performance improvement. I think you should have to justify enabling it, instead of actively needing to disable it.
Hey there,
Interesting findings about chunked streaming and OSB. We actually hit an opposite issue, where we had to enable chunking to have things work.
When OSB called out to a service running on WebSphere with the IBM JVM, and did so using HTTPS+JSSE, it would not work unless chunked streaming was enabled. If we used HTTP, it was fine. It may just be a quirk of the IBM JVM when using HTTPS though as we have not had to enable chunking otherwise.
Sean
Hi,
Good point. Thanks for taking the time to post this. Sorry for my late response.
I do think Chunked Streaming Mode is an important feature for cases like this. But I also think it should be disabled by default, since with the reference architecture as proposed by the Enterprise Deployment Guide it does more harm than good.
Regards,
martien
Great post!
I was looking for an explanation of Chunked Streaming Mode behavior, and by far this is one of the best! I agree with you: by default this feature should be disabled. Our recommendation is always to disable it, and for sure the performance always improves, but I didn't have a good explanation to justify this point. Now I have it.
Regards,
Leo
Hi,
Is it safe to say this setup is a good standard when using OSB?
As far as I know, disabling chunked streaming mode will introduce a new error, which is double invocation. And to eliminate double invocation, you should set QoS to Exactly-Once.
However, in this kind of setup, you should always consider the TAT after the transaction enters OSB.
The reason is that it will hold the thread while the business service is processing, specifically if you're on the Route component/node. See http://jaredsoablogaz.blogspot.com/2013/01/choosing-between-route-service-callout.html
Please do correct me if I'm wrong or if there is a missing piece in my statement.
Thanks,
Joemar
Referring to https://blogs.oracle.com/reynolds/following-the-thread-in-osb, I believe you should conclude that with a Route node the response is handled in another thread, so a Route would not hold on to a thread. I'm not sure what you mean by TAT. And I don't see the double invocation error. It's just that by not chunking you send the request as one complete body, and so prevent it from being sent as different parts that are handled by the infrastructure as separate, incomplete messages.
Thanks for the information. In my case this scenario worked. We have multiple application endpoints configured in Oracle Service Bus for which chunked streaming mode is enabled by default. We observed timeouts when chunked streaming was enabled and the target system was a mainframe. As we do not have control over the configuration of the mainframe system, and the systems are owned by a third party, we observed a lot more stability after disabling chunked streaming mode.