I am assuming there is an issue of some kind at the server farm and/or database, but I don’t really know where to start to pin it down. I guess it’s going to need some heavy-duty logging on the servers.
If anyone can suggest any tools to help diagnose this in a VMware/IIS 7.5 environment, or suggest any other reasons why this might be happening, that would be great.
These types of issues are always fun to get to the bottom of…
Do you have any way of tracing the request back to the server that fulfilled it, so you can isolate whether it’s just one server, one DC, etc.?
What sort of server monitoring do you have in place at the moment, i.e. do you chart individual server performance, DB performance, etc.?
One tool I have found handy for looking at Win / IIS performance is PAL (http://pal.codeplex.com/) - basically you collect a whole stack of perfmon counters and then analyse them offline - but of course you need to be collecting data while one of these blips happens.
I’d build a methodical approach to solving this - map out all the steps / parts that are involved in serving a request, prioritise the list, then measure / analyse each one in turn, and if it’s not the culprit move on to the next priority.
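To make the "measure each step" idea concrete, here is a minimal sketch of a per-step timer. It is only an illustration in Python, not anything specific to IIS: the `StepTimer` class and the step names used with it are hypothetical, and in a real farm the equivalent timings would come from whatever logging the application already has.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named step (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.durations = {}

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def ranked(self):
        # Slowest step first, so the list doubles as a priority order.
        return sorted(self.durations.items(), key=lambda kv: kv[1], reverse=True)
```

Wrapping each stage of a request in `with timer.step("db"):` (or similar) and logging `timer.ranked()` gives a rough ordering of where the time goes for that request.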
Hope your brain has recovered after Velocity and WebPerfDays!
Thanks for the feedback.
We do have full instrumentation on the DB, but unfortunately not yet for each webserver.
We are going to add a debug HTTP header so that we can identify in WPT which server in the farm served each request, which will help going forward.
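Once the header is in place, something along these lines could group the test results by server. This is just a sketch: the header name `X-Served-By` is an assumption (we have not settled on a name), and the input is assumed to be a list of (response headers, time to first byte in ms) pairs pulled out of the test results by hand or by script.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical header name; use whatever the farm actually emits.
SERVER_HEADER = "x-served-by"


def ttfb_by_server(samples, header=SERVER_HEADER):
    """Group time-to-first-byte samples by the server named in the debug header.

    `samples` is an iterable of (headers_dict, ttfb_ms) pairs.
    """
    buckets = defaultdict(list)
    for headers, ttfb_ms in samples:
        # HTTP header names are case-insensitive, so normalise before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        buckets[lowered.get(header, "unknown")].append(ttfb_ms)
    return {
        server: {"count": len(v), "mean_ms": mean(v), "max_ms": max(v)}
        for server, v in buckets.items()
    }
```

If the slow responses all land in one bucket, that points at a single server; if they are spread evenly, the cause is more likely shared (database, cache, network).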
However, we have come up with a theory on this one. There is a ‘hot deals’ area on our home page that is IP sensitive. The data underlying this is cached, but only with a relatively short TTL. So occasionally a call to the home page causes the cache to be refreshed, and the underlying SQL query for that takes a nasty 0.7 seconds, which coincides with the extra time I am seeing on the HTML load.
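A toy simulation of the theory, to show the pattern it would produce. Everything here is an assumption apart from the 0.7 second query time: the 60 second TTL and 0.2 second base page time are placeholders, the cache is modelled as a single entry rather than one per IP range, and `TTLCache`, `FakeClock` and `simulate` are made-up names.

```python
class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class TTLCache:
    """Single-entry cache that reloads synchronously once the TTL has expired."""

    def __init__(self, ttl_seconds, loader, clock):
        self.ttl = ttl_seconds
        self.loader = loader
        self.clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        if self._expires_at is None or self.clock() >= self._expires_at:
            # The unlucky request pays for the refresh.
            self._value = self.loader()
            self._expires_at = self.clock() + self.ttl
        return self._value


def simulate(request_times, ttl_seconds=60.0, query_seconds=0.7, base_seconds=0.2):
    """Return the simulated HTML time for each request arrival time (seconds)."""
    clock = FakeClock()

    def slow_query():
        clock.advance(query_seconds)  # stands in for the slow SQL query
        return "hot deals"

    cache = TTLCache(ttl_seconds, slow_query, clock)
    latencies = []
    for arrival in request_times:
        clock.now = max(clock.now, arrival)
        start = clock()
        cache.get()
        clock.advance(base_seconds)  # rest of the page generation
        latencies.append(round(clock() - start, 3))
    return latencies
```

With requests arriving at 0, 10, 30 and 70 seconds, only the first and last pay the extra 0.7 seconds, which is the same "occasional slow HTML" shape as the blips in the tests.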
Now I just have to prove it, and tune up that query.
There are other things we will be doing across the site to eke out some more perceived speed - it will be interesting to see what difference it makes to bounce and conversion rates. First of all I need to get a better RUM solution in place - I only have GA Site Speed at the moment. Probably going to give LogNormal a go.