The overhead that remains still includes several aspects of the original sources of added complexity, but we have seen that much of it has been reduced or minimized in some fashion. One aspect which has only been partially reduced is the domain transfer, which still requires a virtual memory context switch.
An interesting aspect of this part of the overhead is that it is caused as much by the processor itself as by the transfer of control: the architecture of the machine determines how many registers must be saved and how the page tables must be exchanged. By contrast, very little has to be done to record in the thread control block which addressing domain the thread is operating in. Eliminating this portion of the context switch overhead would allow another increase in efficiency.
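As a rough illustration of how little per-thread state is involved, a thread control block might simply carry an identifier for the domain the thread is currently executing in. The structure and field names below are hypothetical and not taken from any particular kernel.

\begin{verbatim}
/* Hypothetical sketch of a thread control block: crossing a domain boundary
 * amounts to updating one field of per-thread state, in contrast to the
 * register saves and page-table exchange of a full context switch. */
typedef int domain_id;

struct thread_control_block {
    unsigned long saved_registers[32];   /* processor state saved on a switch */
    void         *kernel_stack;          /* per-thread kernel stack pointer */
    domain_id     current_domain;        /* addressing domain the thread runs in */
};
\end{verbatim}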
If the addressing domain and the execution thread are considered two distinct entities, it is possible to switch them independently. We have already seen that the domain may be switched independently of the execution thread by allowing a single thread to operate across domain boundaries.
Consider a multiprocessor system with one addressing domain mapped to each physical processing unit instead of to a ``process.'' For each addressing domain (process), a number of threads may be running. When a thread makes an RPC call, it traps into the kernel to request validation and transfer to another addressing domain to execute the call. Up to this point, the operation is exactly as before. However, instead of performing a virtual memory context switch on the current processor, the thread is added to the load of the processor which is mapped to the required addressing domain. On return, the thread is moved back to the original processor.
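The dispatch decision can be sketched as follows. This is a simplified, hypothetical illustration of the idea; the function and structure names are invented, and it is not the actual LRPC kernel path.

\begin{verbatim}
#include <stdbool.h>

typedef int domain_id;
typedef int cpu_id;

#define NO_CPU (-1)

struct thread {
    domain_id current_domain;   /* addressing domain the thread runs in */
    cpu_id    home_cpu;         /* processor the call originated on */
};

/* Kernel helpers assumed to exist for this sketch only. */
extern bool   validate_call(struct thread *t, domain_id target);
extern cpu_id cpu_for_domain(domain_id d);          /* NO_CPU if not resident */
extern void   enqueue_on_cpu(struct thread *t, cpu_id c);
extern void   switch_vm_context(domain_id d);       /* the expensive path */

/* Dispatch an RPC from thread t into the target domain.  If another
 * processor is already mapped to the target domain, migrate the thread to
 * it; only otherwise pay for a virtual memory context switch locally. */
int rpc_dispatch(struct thread *t, domain_id target, cpu_id this_cpu)
{
    if (!validate_call(t, target))
        return -1;                       /* binding or validation failure */

    t->home_cpu = this_cpu;              /* remember where to return to */
    t->current_domain = target;

    cpu_id dest = cpu_for_domain(target);
    if (dest == NO_CPU || dest == this_cpu) {
        /* Target domain not resident elsewhere: ordinary context switch. */
        switch_vm_context(target);
    } else {
        /* Add the thread to the load of the processor already mapped to
         * the target domain; no page-table reload on this processor. */
        enqueue_on_cpu(t, dest);
    }
    return 0;
}
\end{verbatim}

The point of the sketch is only that the expensive page-table switch is taken when no other processor already holds the target domain; otherwise the thread simply migrates.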
Such a scheme looks good on paper, but haven't we introduced new overhead? Transferring the thread to a new processor means that that processor's local cache may not hold the pages the thread needs, and the thread must be started with the A-stack that was used on the originating processor, for which it is quite reasonable to assume the destination cache holds nothing.
But note also that the domain crossing coincides with the processor switch, and the destination processor has already been active in the required domain. Its cache is therefore already warm for the target domain, which is exactly what is needed; only the A-stack has to be transferred to the new processor. Moreover, any needed data that is not in the cache of the second processor is most likely absent from the cache of the first processor as well. Hence, the second processor's cache is most likely in much better condition for operation in the target domain.
In most multiprocessor systems, it is not reasonable to assume that each addressing domain can be mapped to a dedicated processor, and doing so may not be useful for little-used domains. Dynamic mapping is typically supported and should be used to balance the load on each processor. However, when a domain is currently active on a processor, additional threads entering that domain should be placed on that processor as well, to avoid unnecessary virtual memory context switches. As an additional benefit, cache invalidation is reduced and the value of caching increases.
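One way to express such a placement policy, again as a hypothetical sketch rather than the actual implementation, is to prefer a processor that is already mapped to the target domain and to fall back to the least-loaded processor, with the attendant context switch, only when none is.

\begin{verbatim}
#define NCPUS 8

typedef int domain_id;
typedef int cpu_id;

struct cpu_state {
    domain_id loaded_domain;   /* domain whose page tables are installed */
    int       runnable;        /* length of this processor's run queue */
};

static struct cpu_state cpus[NCPUS];

/* Choose a processor for a thread entering `target': a processor already
 * mapped to that domain wins outright; otherwise the least-loaded processor
 * is chosen and must perform a full virtual memory context switch. */
cpu_id place_thread(domain_id target)
{
    cpu_id best = 0;

    for (cpu_id c = 0; c < NCPUS; c++) {
        if (cpus[c].loaded_domain == target)
            return c;                    /* domain resident: no switch needed */
        if (cpus[c].runnable < cpus[best].runnable)
            best = c;                    /* track the least-loaded fallback */
    }
    return best;                         /* caller pays for the domain switch */
}
\end{verbatim}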
The implementation of the LRPC system takes advantage of these ideas for performance improvement, but the details are involved and need not be repeated in their entirety here. For a detailed discussion and measurements of the performance, see the paper (Bershad et al.).