A couple of weeks ago I joined a discussion between engineers and customer-facing technologists within EMC and VMware. There was some confusion around a claim by EMC with respect to Transparent Page Sharing (TPS). There exists an EMC paper that hints at disabling TPS. The astute Michael Webster thought this contradicted best practices I provided when leading VMware’s performance technical marketing team. Michael was correct, so I decided to jump in and see what I could learn.
First, here is a quote from the EMC paper on Oracle best practices:
For Databases and some other workloads, [TPS] can degrade performance if that [sic] memory actually changes frequently, as is true with Oracle memory regions, such as buffer caches.
This comment, somewhat vague and not fully supported, indirectly suggests that TPS should be disabled for memory-intensive applications. I had never heard such claims before, so I emailed my old friends on VMware’s performance engineering team.
VMware’s outbound team told me they worked with the EMC authors and raised this issue in the original draft. As I had suspected, VMware has never seen, and could not explain, a TPS performance penalty in configurations like this.* And the feature’s design suggests that this slowdown should not be possible. VMware maintains its position that TPS should not be disabled. But the excellent and thorough folks at EMC, also subjecting their configurations to some rigor, stand by their claim. With two expert teams in disagreement, how could we broker a compromise?
Barring a large recommitment of time by team EMC to re-run the experiments, or to run new tests at VMware’s request, the compromise was to tone down the wording in the paper. The paper does not say TPS should be disabled, as you can see above. It is true that a best practice of disabling TPS might be deduced from this claim. But EMC has not written it, and VMware would not support it if it were committed to print.
When I was still on the VMware performance team I ran into cases like this with several vendors and partners. I can tell you that 95% of the time the vendors or partners had made a mistake in methodology. But in one notable case a partner identified an issue unknown to VMware. It is because of this that VMware defends its position on performance but encourages experimentation and disagreement. Groups like EMC’s Enterprise Solution Group and, in this case, EMC IT do a tremendous amount of research on vSphere, and the world is a better place for it. We all want this to continue.
But in this case time and resources did not permit more experiments to drive consensus to unanimity. So the written compromise respected the work and position of both parties. EMC published the paper, and recently Michael Webster published an awesome summary of a presentation on EMC’s IT journey that discusses VMware’s position on this TPS research.
As a final note, remember this. Siblings bicker. But family is family.
(*) There are two cases where TPS could impact performance. But neither should have been present in this case:
- In a server running flat-out at 100% utilization, the nominal CPU usage of TPS (less than 1% of a single core) might be measurable. We have never seen this cost appear in end-user throughput or response time, because it is nearly impossible to drive CPUs to 100%: with any IO on the server at all there is always some wait time and a total CPU load below 100%. But, in theory, it is possible.
- Once memory is overcommitted, large memory pages are broken into small pages for page sharing. In this case, the small pages backing the guest large pages will underperform the original configuration. To stop this behavior and guarantee the integrity of large pages, you need only set memory reservations. Do not disable TPS.
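To make the mechanism concrete, here is a toy model of content-based page sharing, the general technique behind TPS. This is an illustrative sketch of my own, not VMware’s implementation: pages with identical contents are backed by one copy, and a write to a shared page breaks the share (copy-on-write). It shows why a frequently rewritten region like an Oracle buffer cache simply yields little sharing, rather than a slowdown.

```python
import hashlib

class SharedMemory:
    """Toy content-based page sharing: identical pages share one backing copy."""

    def __init__(self):
        self.backing = {}   # content hash -> reference count (one backing copy each)
        self.pages = {}     # guest page id -> current content

    def map_page(self, page_id, content):
        """Map a guest page; pages with identical contents share one copy."""
        h = hashlib.sha256(content).hexdigest()
        self.pages[page_id] = content
        self.backing[h] = self.backing.get(h, 0) + 1

    def write_page(self, page_id, new_content):
        """A guest write triggers copy-on-write, breaking any existing share."""
        old = hashlib.sha256(self.pages[page_id]).hexdigest()
        self.backing[old] -= 1
        if self.backing[old] == 0:
            del self.backing[old]
        self.map_page(page_id, new_content)

    def pages_saved(self):
        """Physical pages reclaimed = mapped pages minus distinct backing copies."""
        return len(self.pages) - len(self.backing)

mem = SharedMemory()

# Ten zero-filled guest pages (a common case across idle VMs):
# all ten collapse onto one backing copy, reclaiming nine pages.
for i in range(10):
    mem.map_page(i, b"\x00" * 4096)
print(mem.pages_saved())  # 9

# A busy buffer cache rewrites its pages constantly; each write breaks
# the share, so almost nothing stays reclaimed.
for i in range(10):
    mem.write_page(i, bytes([i + 1]) * 4096)
print(mem.pages_saved())  # 0
```

The point of the sketch: when page contents churn, sharing opportunities evaporate on their own, so the remaining cost is only the background scan overhead described in the first bullet above.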