Wednesday, July 20, 2011

VMX and SMM – Dual monitor mode


Dual-monitor mode involves two monitors: the executive monitor and the SMM monitor. The executive monitor is analogous to the vmx-root hypervisor that exists outside of SMM. The SMM monitor is a special hypervisor that operates only in SMM. Under the dual-monitor treatment, SMIs cause vmexits, and the exit information is recorded in a separate vmcs called the SMM transfer vmcs. This enables the SMM monitor to take control of vmexits caused by SMI# assertion.

Normal  vmx transitions (without dual monitor):
1.        vmxon executed by the executive monitor.
2.       vmptrld is executed.  The guest vmcs is then initialized via vmwrites.
3.       vmlaunch is executed. The guest virtual machine is launched.
4.       A vmexit from the guest traps back to the executive monitor. The executive monitor reads exit_reason, exit_qual and other vmcs fields to extract more details on the vmexit. After handling the vmexit, the executive monitor resumes the guest by executing vmresume.
5.       If an SMI# arrives while the guest is running, vmx is turned off and the processor enters SMM. Upon RSM, the processor returns to the vmx guest.

Dual Monitor vmx transitions:
0.       Enable Dual Monitor [see the section on enabling dual monitor below].
1.       A. vmxon executed by the executive monitor.
         B. Dual monitor treatment is activated [see the section on activating dual monitor below].
2.       vmptrld is executed.  The guest vmcs is then initialized via vmwrites.
3.       vmlaunch is executed. The guest virtual machine is launched.
4.       A vmexit from the guest traps back to the executive monitor. The executive monitor reads exit_reason, exit_qual and other vmcs fields to extract more details on the vmexit. After handling the vmexit, the executive monitor resumes the guest by executing vmresume.
5.       If an SMI# arrives while the guest is running, a SMM VMexit occurs. Control is transferred to the SMM monitor (instead of the executive monitor). The SMM monitor handles the SMM vmexit by reading the relevant fields in the SMM transfer vmcs. After handling the SMM vmexit, it resumes the guest by executing vmresume.
Steps 1A, 2, 3 and 4 are identical for normal and dual-monitor vmx transitions.
The only difference between Normal vmx transitions and the Dual monitor vmx transitions is in the handling of SMI. In the dual monitor case, there is a new vmexit (SMM VMexit) that traps to the SMM monitor. All other vmexits continue to trap to the executive monitor.

When the machine is in the SMM monitor, it is considered to be in SMM. An SMM VMexit is one that begins outside of SMM and ends in SMM. This means that SMM VMexits are also accompanied by the SMI_ACK special cycle. Similarly, a vmresume from the SMM monitor that resumes the guest is also accompanied by a SMI_ACK special cycle (since this vmresume takes the machine from SMM to outside of SMM).


The next two sections cover step 0 and 1B of Dual Monitor vmx transitions.


Enabling Dual Monitor Treatment:
Intel provides an msr (msr 0x9b, IA32_SMM_MONITOR_CTL) for this. Bit 0 of this msr is the valid bit. Bits 31:12 hold the physical address (4K aligned) of the monitor segment (also called MSEG), whose code initializes the SMM transfer vmcs. This msr can be written only in SMM. Here is sample code:
mov ecx, 0x9B
mov eax, 0x00009001
xor edx, edx ; bits 63:32 are reserved. Clear edx
wrmsr
rsm  ; get out of SMM

The valid bit is set to 1.  Bits 31:12 = 0x9 – This implies that the physical address of the MSEG segment is 0x9000. Note that the above code snippet must run in SMM (SMI handlers that are dual-monitor aware may add the above code to initialize the MSEG). 
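The bit layout of msr 0x9b described above can also be sketched in C. This is only an illustration of the encoding: the macro and function names are made up for this post (they do not come from any Intel header), and the helpers only compute values. The actual write still has to be done with wrmsr from inside SMM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
#define SMM_MONITOR_CTL_VALID      (1u << 0)     /* bit 0: valid bit        */
#define SMM_MONITOR_CTL_MSEG_MASK  0xFFFFF000u   /* bits 31:12: MSEG base   */

/* Build the low 32 bits of the msr from a 4K-aligned MSEG base. */
static inline uint32_t smm_monitor_ctl_encode(uint32_t mseg_base, int valid)
{
    assert((mseg_base & ~SMM_MONITOR_CTL_MSEG_MASK) == 0); /* must be 4K aligned */
    return mseg_base | (valid ? SMM_MONITOR_CTL_VALID : 0u);
}

/* Extract the MSEG physical base from the low 32 bits of the msr. */
static inline uint32_t smm_monitor_ctl_mseg_base(uint32_t msr_low)
{
    return msr_low & SMM_MONITOR_CTL_MSEG_MASK;
}

/* Return 1 if the valid bit is set. */
static inline int smm_monitor_ctl_is_valid(uint32_t msr_low)
{
    return (msr_low & SMM_MONITOR_CTL_VALID) != 0;
}
```

With an MSEG at 0x9000, smm_monitor_ctl_encode(0x9000, 1) gives 0x00009001, which is the value loaded into eax in the assembly above.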

A sample MSEG header looks like the one shown below. In our example, the header is at physical address 0x9000 (what we wrote in msr 0x9b).

revision_identifier       dd 0    ; offset 0
smm_monitor_features      dd 0    ; offset 4
gdtr_limit                dd      ; offset 8
gdtr_baseoffset           dd      ; offset 12
cs_sel                    dd      ; offset 16
eip_offset                dd      ; offset 20
esp_offset                dd      ; offset 24
cr3_offset                dd      ; offset 28

The format of the MSEG_HEADER above matches the one described in Table 26.10 (vol 3b, System Management Mode, chapter 26). Note: table numbers vary between versions of the Intel manual, but the table is always found in the SMM chapter.
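For readers who prefer C, the same header can be written as a struct of eight 32-bit fields. This is a sketch: the struct and function names are invented here, and the helper assumes (as I read the manual) that the entry point is the MSEG base plus eip_offset. Check the SMM chapter before relying on that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the MSEG header; field names follow the listing above. */
struct mseg_header {
    uint32_t revision_identifier;   /* offset 0  */
    uint32_t smm_monitor_features;  /* offset 4  */
    uint32_t gdtr_limit;            /* offset 8  */
    uint32_t gdtr_base_offset;      /* offset 12 */
    uint32_t cs_sel;                /* offset 16 */
    uint32_t eip_offset;            /* offset 20 */
    uint32_t esp_offset;            /* offset 24 */
    uint32_t cr3_offset;            /* offset 28 */
};

/* Assumption: the monitor starts executing at MSEG base + eip_offset. */
static inline uint32_t mseg_entry_linear(uint32_t mseg_base,
                                         const struct mseg_header *h)
{
    assert(h != NULL);
    return mseg_base + h->eip_offset;
}
```

The struct is 32 bytes with no padding on common compilers, so it can be overlaid on the first 32 bytes of the MSEG.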


Activating Dual Monitor Treatment:
After enabling the dual-monitor treatment, software can activate it by executing the vmcall instruction. This execution of vmcall is in VMX_ROOT mode. (Execution of vmcall in vmx_non_root mode always causes a vmexit. Execution of vmcall in vmx_root mode thus has a special meaning: it activates the dual-monitor treatment.) Here is sample code that accomplishes this:

; enable cr4.vmxe (bit 13); this value also sets cr4.pse (bit 4)
mov eax, 0x00002010
mov cr4, eax
; do vmxon
VMXON [vmxon_ptr]
jbe fail
;load smm transfer vmcs pointer
vmclear [vmcs_smm_ptr]
jbe fail
vmptrld [vmcs_smm_ptr]
jbe fail
; now do vmcall
vmcall


When the processor executes the vmcall instruction in vmx_root mode, it internally does the following:
vmcall_flow:
if (vmx_root) {
    if (dual_monitor_active) {
        Perform SMM_VMEXIT;
    } else if (SMM_MONITOR_CTL_VALID) {
        Activate_Dual_Monitor_SMM_VMexit;
    } else {
        vmcall fails;
    }
}

In the above code snippet, SMM_MONITOR_CTL_VALID comes directly from bit 0 of the SMM_MONITOR_CTL_MSR (msr 0x9b). If these conditions are not met, vmcall fails. Also note that vmcall performs additional checks on the SMM transfer vmcs which are not discussed here. For those details, reading the manual (vol 3b) is recommended, as is the vmcall pseudo-code provided in vol 2b (under the vmx instructions).

In the process of ‘Activate_Dual_Monitor_SMM_VMexit’, the processor does the following:
a.      Enters SMM (issues a SMI_ACK bus cycle).
b.      Reads the MSEG revision identifier (offset 0). If it does not match the revision identifier supported by the processor, then VMCALL fails. [The MSEG revision id supported by the processor is obtained by a rdmsr of IA32_VMX_MISC_MSR (msr 0x485, bits 63:32).]
c.      Reads the MSEG features field and performs checks on that field.
d.      After all checks pass, the processor starts executing instructions from the RIP indicated in the eip_offset field of the MSEG.
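The decision made by vmcall in vmx_root, together with the revision check in step b, can be modeled in a few lines of C. This is a simplified sketch of the flow described in this post, not the full pseudo-code from the manual: the enum and function names are hypothetical, and the checks on the SMM transfer vmcs and on the features field are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for this model. */
enum vmcall_result {
    VMCALL_SMM_VMEXIT,            /* dual monitor already active      */
    VMCALL_ACTIVATE_DUAL_MONITOR, /* activation proceeds              */
    VMCALL_FAIL                   /* a precondition is not met        */
};

/* MSEG revision id supported by the cpu: bits 63:32 of msr 0x485. */
static inline uint32_t mseg_revision_supported(uint64_t vmx_misc)
{
    return (uint32_t)(vmx_misc >> 32);
}

/* Model of vmcall executed in vmx_root (other checks omitted). */
static inline enum vmcall_result
vmcall_in_root(int dual_monitor_active, uint32_t smm_monitor_ctl_low,
               uint64_t vmx_misc, uint32_t mseg_header_revision)
{
    if (dual_monitor_active)
        return VMCALL_SMM_VMEXIT;
    if (!(smm_monitor_ctl_low & 1u))          /* valid bit of msr 0x9b */
        return VMCALL_FAIL;
    if (mseg_header_revision != mseg_revision_supported(vmx_misc))
        return VMCALL_FAIL;                   /* step b */
    return VMCALL_ACTIVATE_DUAL_MONITOR;
}
```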

Sample MSEG code:
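mov eax, 0x11ff              ; value for the vm-entry controls
mov ebx, VMX_ENTRY_CONTROLS
vmwrite ebx, eax             ; ebx = field encoding, eax = value

mov eax, 0x008B              ; present, type 0xB (busy TSS)
mov ebx, VMX_GUEST_TR_ATTR
vmwrite ebx, eax

mov ebx, VMX_EXIT_INSTR_LEN
vmread eax, ebx              ; eax = instruction length recorded by the vmexit

mov ebx, VMX_GUEST_RIP
vmread ebx, ebx              ; ebx = rip saved by the SMM vmexit

add eax, ebx                 ; eax = saved rip + length (skip the vmcall)
mov ebx, VMX_GUEST_RIP
vmwrite ebx, eax

vmlaunch                     ; vmentry that returns from SMM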

The code above does only the bare minimum (in reality, it would initialize the entire SMM_VMCS): it initializes the entry_controls, updates the guest_rip and does a vmlaunch. Where does this vmlaunch take the machine? The answer: to VMX_ROOT. This is a special type of VMentry that takes the machine back to VMX_ROOT; Intel calls it a 'VMentry that returns from SMM'. Remember that the machine performed a SMM_Vmexit when VMCALL was executed in VMX_ROOT mode, so this VMLAUNCH in the MSEG code takes us back to VMX_ROOT. At this point, we are in the executive monitor. This completes step 1B in the dual monitor flow.

Thursday, January 6, 2011

VMX and System Management Mode - Part 1

There are two different modes of operation of VMX within SMM:
1. Normal mode
2. Dual monitor mode


Normal Mode:

Under Normal mode, an SMI# assertion causes the processor to turn off vmx and enter SMM. Upon RSM, the processor automatically re-enables VMX if it was in either VMX-ROOT or VMX-GUEST prior to the SMI#. Because the processor turns off VMX, CR4.VMXE is treated as a reserved bit and must be 0 in the CR4 image loaded during RSM.

Algorithmically,

if(smi){
    if(vmx_root or vmx_guest){
        save cr4.vmxe internally;
        if(vmx_root) internal_state = vmx_root;
        if(vmx_guest) internal_state = vmx_guest;
        turn_off_vmx;
    }
    save cr4 to smm_ram;
}

during rsm:

if(rsm){
    read cr4_val from smm_ram;
    if(cr4_val.vmxe==1) jump_to_shutdown;
    retrieve internal_vmxe;    // the cr4.vmxe value saved internally at SMI
    cr4 <- cr4_val | (internal_vmxe<<13);
    read internal_state;
    if(internal_state==vmx_root) put_cpu_in_vmx_root;
    if(internal_state==vmx_guest) put_cpu_in_vmx_guest;
}


Notice the jump_to_shutdown during RSM. Since the processor saves CR4.VMXE internally on SMI, the value saved in SMRAM for CR4.VMXE is always 0. During RSM, the CR4 value is first loaded from SMRAM and bit 13 is checked. It must be 0; if not, the cpu will jump to shutdown. The processor then retrieves the value of VMXE from an internal register and updates CR4 with this value. The state of the processor (whether it was in vmx-root, vmx-guest or normal ia32 operation) is also retrieved, and the cpu is put in that state after the completion of RSM.
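The SMI and RSM algorithms above can be turned into a small C model to make the shutdown case concrete. This is a toy model, not a description of real hardware internals: the struct and function names are hypothetical, and it follows the statement that the CR4 image in SMRAM always has VMXE = 0.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_VMXE (1u << 13)

enum vmx_state { VMX_OFF, VMX_ROOT, VMX_GUEST };

/* Hypothetical model of the state involved in the default treatment. */
struct cpu_model {
    uint32_t cr4;               /* architectural CR4                      */
    enum vmx_state state;       /* current vmx state                      */
    int saved_vmxe;             /* internal copy of cr4.vmxe              */
    enum vmx_state saved_state; /* internal copy of the vmx state         */
    uint32_t smram_cr4;         /* CR4 image in SMRAM (handler can edit)  */
    int shutdown;               /* set when RSM detects a bad CR4 image   */
};

/* SMI delivery: remember vmxe and state internally, then turn vmx off. */
static inline void model_smi(struct cpu_model *c)
{
    c->saved_vmxe = (c->cr4 & CR4_VMXE) != 0;
    c->saved_state = c->state;
    c->cr4 &= ~CR4_VMXE;
    c->state = VMX_OFF;
    c->smram_cr4 = c->cr4;      /* image in SMRAM has VMXE = 0 */
}

/* RSM: reject an image with VMXE = 1, otherwise restore vmxe and state. */
static inline void model_rsm(struct cpu_model *c)
{
    if (c->smram_cr4 & CR4_VMXE) {
        c->shutdown = 1;
        return;
    }
    c->cr4 = c->smram_cr4 | (c->saved_vmxe ? CR4_VMXE : 0u);
    c->state = c->saved_state;
}
```

If an SMM handler sets bit 13 in the saved CR4 image between model_smi and model_rsm, the model ends with shutdown set, which mirrors the jump_to_shutdown in the pseudo-code.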

This process is the default treatment of SMIs with VMX.

Notes on System Management Mode [SMM]

SMM:
SMM [System Management Mode] is an operating mode entered through the assertion of the SMI# pin. Upon detecting an SMI#, the processor saves its state in SMRAM [the base address of the SMRAM is obtained from an internal SMBASE register; the reset value of the SMBASE register is 0x30000]. The processor saves several architectural values into the SMRAM (such as the values of CR0, CR3 and CR4) when it enters SMM. To exit SMM, software executes the RSM (resume) instruction. During RSM, the processor reloads the architectural state from SMRAM and returns to the state it was in prior to the SMI#.
Here is a loosely defined algorithm for entering and exiting SMM:
1. Processor is executing a task (say T).
2. SMI# is detected by the processor.
3. Processor saves all information pertaining to task T in the SMRAM. It issues the SMI_ENTER_ACK bus cycle and enters SMM.
4. Processor executes code from the SMM space [starting at SMBASE + 0x8000, which is 0x38000 with the reset value of SMBASE].
5. When it executes the RSM instruction, the processor reloads the prior architectural state from SMRAM, then issues the SMI_EXIT_ACK bus cycle and exits SMM.
6. Processor resumes executing the task T.

During step 5, while the processor loads architectural state, it performs a few checks on the state being loaded:
1. It checks the reserved bits of CR4.
2. It checks the CR0 register for illegal combinations, for example CR0.PG=1 with CR0.PE=0, or CR0.NW=1 with CR0.CD=0.
If the checks above fail, then the processor enters shutdown.
[Note: there may be additional checks performed. CR0 and CR4 values in SMRAM should be left untouched by the SMM handler. These checks exist to make sure that the handler does not modify values to put the processor in an incompatible state after the execution of RSM].
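The two CR0 combinations mentioned above are easy to express as a predicate. The sketch below covers only those two cases; the function name is made up, and as the note says, the processor may perform more checks than this.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE (1u << 0)    /* protection enable */
#define CR0_NW (1u << 29)   /* not write-through */
#define CR0_CD (1u << 30)   /* cache disable     */
#define CR0_PG (1u << 31)   /* paging            */

/* Return 1 if cr0 avoids the two illegal combinations named in the text. */
static inline int cr0_combination_legal(uint32_t cr0)
{
    if ((cr0 & CR0_PG) && !(cr0 & CR0_PE))
        return 0;           /* paging without protected mode */
    if ((cr0 & CR0_NW) && !(cr0 & CR0_CD))
        return 0;           /* NW=1 with CD=0 */
    return 1;
}
```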

Monday, October 4, 2010

Software injection into V86 guest with interrupt redirection - What must be the IDT VECTOR INFO?

The following observation was made while launching a V86 guest on Intel Merom. As part of vmlaunch or vmresume, a software interrupt is injected into the V86 guest (the entry interruption info field reads 0x800004vv, where vv is the vector number). The V86 virtual machine has:

a.  GUEST_RFLAGS.VM = 1 (indicating the guest is in V86 mode).
b. CR4.VME=1 (enables interrupt redirection provided the redirection bitmap says so in TSS).
c. The exception_bitmap in the guest is configured to vmexit on a #PF.

At the end of vmlaunch, the software interrupt is injected. The guest is in V86 mode and has CR4.VME=1. The cpu consults the TSS to read the interrupt redirection bitmap. The TSS page is not present and the cpu takes a #PF. The guest is configured to vmexit on #PF. After the vmexit, use vmread to read the following vmcs fields:
a. Exit reason (reads 0)
b. Exit Interruption Info (0x80000B0E - indicates a #PF)
c. IDT Vector  Info (reads 0)
d. Exit Qualification (0x - address that caused #PF).

The interesting part of the above results is the value of idt-vector-info. The idt-vector-info should have read 0x800004vv (vv = vector), since the vmexit was encountered in the process of injecting an event. This behavior appears to violate what is stated in vol 3b.
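A vmexit handler that wants to detect this situation has to decode both fields. The sketch below decodes the common layout (vector in bits 7:0, type in bits 10:8, error-code valid in bit 11, valid in bit 31). The struct and function names are hypothetical, and raw values are passed in rather than read with vmread so the example stays self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded view of an interruption-info style field. */
struct event_info {
    uint8_t vector;           /* bits 7:0  */
    uint8_t type;             /* bits 10:8 */
    int     error_code_valid; /* bit 11    */
    int     valid;            /* bit 31    */
};

static inline struct event_info decode_event_info(uint32_t raw)
{
    struct event_info e;
    e.vector = (uint8_t)(raw & 0xFFu);
    e.type = (uint8_t)((raw >> 8) & 0x7u);
    e.error_code_valid = (int)((raw >> 11) & 1u);
    e.valid = (int)((raw >> 31) & 1u);
    return e;
}
```

Decoding 0x80000B0E gives vector 14 (#PF), type 3 and a valid error code, while decoding the observed idt-vector-info value of 0 gives valid = 0, which is the surprising part.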

Monday, April 26, 2010

Injecting software interrupt into a V86 guest

To inject an interrupt or exception into a guest, a hypervisor uses the ENTRY_INTERRUPTION_INFO field in the vmcs. For example, if there is a need to inject a #GP exception into the guest as part of vmentry, the entry_interruption_info field would look like this: 0x80000B0D.

1. Bits 7:0 of this field represent the vector (0x0D, which is vector 13).
2. Bits 10:8 indicate the type (in this case type = 0x3, which is a hardware exception).
3. Bit 11 is the error-code valid bit, which is set in the example above.
4. Bit 31 is the valid bit for the ENTRY_INTERRUPTION_INFO field.

To inject a software interrupt (say vector 0x8), the hypervisor would program the entry_interruption_info field as follows: 0x80000408 (type=0x4 and vector=0x8). If the guest is in V86 mode (GUEST_RFLAGS[VM]=1), the processor behaves according to Table 15.2, Intel SDM, vol 3A.
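The two example values can be produced by a small helper that packs the fields listed above. This is a sketch; the enum and function names are made up for illustration, and only the fields discussed in this post are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Event types for bits 10:8 (names are hypothetical). */
enum event_type {
    EVENT_EXT_INTERRUPT = 0,
    EVENT_NMI           = 2,
    EVENT_HW_EXCEPTION  = 3,
    EVENT_SW_INTERRUPT  = 4
};

/* Pack vector, type and error-code valid bit, and set the valid bit (31). */
static inline uint32_t make_entry_intr_info(uint8_t vector,
                                            enum event_type type,
                                            int deliver_error_code)
{
    assert(((unsigned)type & ~0x7u) == 0);   /* type is a 3-bit field */
    return 0x80000000u
         | (deliver_error_code ? (1u << 11) : 0u)
         | ((uint32_t)type << 8)
         | vector;
}
```

make_entry_intr_info(13, EVENT_HW_EXCEPTION, 1) yields 0x80000B0D, and make_entry_intr_info(0x08, EVENT_SW_INTERRUPT, 0) yields 0x80000408.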


Given below is a summary of the processor behavior during normal software-interrupt execution in V86 and during an event injection into a V86 guest:

1. EFLAGS.VM = 1, CR4.VME = 1, EFLAGS.IOPL = 3
=> In this case the bit in the redirection bitmap of the TSS is consulted.
=> If the bit in the redirection bitmap = 0, the software interrupt is redirected to the 8086-style handler.
=> If the bit in the redirection bitmap = 1, the software interrupt is directed to the protected-mode handler.

2. EFLAGS.VM = 1, CR4.VME = 1, EFLAGS.IOPL < 3
=> In this case the bit in the redirection bitmap of the TSS is consulted.
=> If the bit in the redirection bitmap = 0, the software interrupt is redirected to the 8086-style handler. Notice that this is the same behavior as with EFLAGS.IOPL = 3. The difference is in the value of eflags pushed on the stack: here the IOPL of the eflags image is forced to 3 and the value of VIF is copied to IF.
=> Normal behavior: if the bit in the redirection bitmap = 1, the interrupt is directed to the #GP handler.
=> During VMX event injection: if the bit in the redirection bitmap = 1, the processor will *NOT* raise #GP due to IOPL < CPL.

3. EFLAGS.VM = 1, CR4.VME = 0, EFLAGS.IOPL = 3
=> Normal behavior: interrupt directed to the protected-mode handler (no #GP).
=> During event injection: same as above.

4. EFLAGS.VM = 1, CR4.VME = 0, EFLAGS.IOPL < 3
=> Normal behavior: interrupt directed to the #GP handler.
=> During event injection: no #GP can occur due to IOPL < CPL. The behavior will be the same as with IOPL = 3.
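The four cases can be condensed into one decision function. This is a sketch of the table as summarized in this post, with hypothetical names. One assumption is labeled in the code: for case 2 with the redirection bit set, the post only says that no #GP occurs during injection, and the model assumes the interrupt then goes to the protected-mode handler, as it would with IOPL = 3.

```c
#include <assert.h>

/* Hypothetical outcome codes for a software interrupt in V86 mode. */
enum v86_outcome {
    TO_8086_HANDLER,   /* redirected to the 8086-style handler  */
    TO_PMODE_HANDLER,  /* directed to the protected-mode handler */
    RAISE_GP           /* #GP because IOPL < CPL                 */
};

/* vme: CR4.VME, iopl: EFLAGS.IOPL (0..3), redir_bit: bit in the TSS
 * redirection bitmap, injected: 1 for vmx event injection, 0 for INT n. */
static inline enum v86_outcome
v86_softint_outcome(int vme, unsigned iopl, int redir_bit, int injected)
{
    assert(iopl <= 3);
    if (vme && redir_bit == 0)
        return TO_8086_HANDLER;      /* cases 1 and 2, bit = 0 */
    if (iopl == 3)
        return TO_PMODE_HANDLER;     /* cases 1 (bit = 1) and 3 */
    /* IOPL < 3: normal execution faults, injection does not.
     * Assumption: the injected interrupt is handled as with IOPL = 3. */
    return injected ? TO_PMODE_HANDLER : RAISE_GP;
}
```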

Summary:
From the above discussion, there will be no #GP due to IOPL < CPL during the injection of a software interrupt into a V86 guest. If the hypervisor wants this #GP to occur, it needs to inject a #GP directly into the guest instead of a software interrupt. This can be achieved by programming the entry_interruption_info field to 0x80000B0D.

Monday, October 19, 2009

VMEXIT on INVLPG

A boundary case observed on Intel Merom:

(a) The virtual-machine is configured to vmexit on INVLPG (bit 9 of the PROCESSOR_EXECUTION_CONTROLS is 1).

(b) The virtual-machine has GS BASE = 0xFFFF8000_00000000

(c) Virtual machine executes: invlpg [gs:0-1]

(d) Execution of invlpg causes vmexit.

(e) The linear address of the invlpg operand is recorded in the exit-qualification. Upon a vmread of EXIT_QUALIFICATION the value obtained is:
=> FFFF7FFF_FFFFFFFF


Notice that the value recorded is a non-canonical address, i.e. bits 63:48 are not all equal to bit 47. This is the only case I have encountered where a non-canonical address shows up in the exit-qualification.

The only explanation I can come up with for this behavior is that INVLPG, unlike other instructions, does not fault in 64-bit mode with a non-canonical operand. According to the instruction spec, INVLPG morphs into a NOP in such cases.

When a vmexit handler for INVLPG is written, this case must be taken into consideration (i.e. a non-canonical address might show up in the exit-qualification field).
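A handler can guard against this with a canonical-address test before using the exit-qualification. The sketch below assumes 48-bit linear addresses (as on Merom); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if va is canonical for 48-bit linear addresses, i.e.
 * bits 63:47 are either all zeros or all ones. */
static inline int is_canonical48(uint64_t va)
{
    uint64_t upper = va >> 47;            /* 17 bits: 63..47 */
    return upper == 0 || upper == 0x1FFFFull;
}
```

For the case above, GS base 0xFFFF8000_00000000 plus a displacement of -1 wraps to 0xFFFF7FFF_FFFFFFFF, and is_canonical48 returns 0 for that value.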

Saturday, July 25, 2009

A full blown initialization of VMCS - Assembly code

The code below will outline the general steps prior to executing a VMLAUNCH or VMRESUME.
Prior to looking at the assembly code, here is a step-by-step description of what is being done:

The reader must know that:
A) this code will run only in ring0.
B) paging is already enabled in CR0 (bit 31).

(1) First enable VMXE (bit 13) in CR4. Make sure that the processor supports VMX by executing CPUID (leaf 1, ecx[5]).

(2) Initialize the revision-id (msr 0x480, bits 31:0) in the vmxon region and in the guest-vmcs region.

(3) Execute VMXON with the pointer to the vmxon region. If the BIOS has not set bits 0 and 2 of FEATURE_CONTROL_MSR (msr 0x3a), this will fail.

(4) Execute VMCLEAR with the pointer to the guest-vmcs region.

(5) Execute VMPTRLD with the pointer to the guest-vmcs region.

(6) Now initialize the guest-vmcs:
(a) First initialize the vmx controls. These include the following controls:
1. PIN_BASED
2. PROC_BASED
3. ENTRY_CONTROLS
4. EXIT_CONTROLS

(b) Next initialize the host-state and guest-state.

(c) Now do vmlaunch. If VMLAUNCH is successful, then the processor will start executing code
from the GUEST_CS:GUEST_RIP value specified in the VMCS.
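The preconditions in steps (1) and (3), and the usual way of choosing control values in step (6a), can be sketched in C. The names are hypothetical and the functions take raw register and msr values as arguments instead of executing cpuid or rdmsr, so the example stays self-contained. The adjustment rule follows the layout of the VMX capability msrs: the low 32 bits report controls that must be 1, the high 32 bits report controls that may be 1.

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_CONTROL_LOCK              (1ull << 0)
#define FEATURE_CONTROL_VMXON_OUTSIDE_SMX (1ull << 2)

/* Step (1): CPUID leaf 1, ecx bit 5 reports VMX support. */
static inline int cpuid_has_vmx(uint32_t cpuid1_ecx)
{
    return (int)((cpuid1_ecx >> 5) & 1u);
}

/* Step (3): VMXON needs bits 0 and 2 of msr 0x3a to be set. */
static inline int vmxon_permitted(uint64_t feature_control)
{
    uint64_t need = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_VMXON_OUTSIDE_SMX;
    return (feature_control & need) == need;
}

/* Step (6a): fit a desired control value to a capability msr. */
static inline uint32_t vmx_adjust_controls(uint32_t desired, uint64_t cap_msr)
{
    uint32_t must_be_one = (uint32_t)cap_msr;          /* bits 31:0  */
    uint32_t may_be_one  = (uint32_t)(cap_msr >> 32);  /* bits 63:32 */
    return (desired | must_be_one) & may_be_one;
}
```

A hypervisor would call vmx_adjust_controls once per control field (pin-based, processor-based, entry, exit) with the matching capability msr instead of hard-coding constants.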


Here comes the code:
////////////////////////////////////////////////////
mov eax, cr4
bts eax, 13
mov cr4, eax

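mov ecx, 0x480          ; msr 0x480
rdmsr                   ; eax = bits 31:0, holds the revision-id
mov edx, [vmxon_ptr]
mov [edx], eax          ; revision-id into the vmxon region
mov edx, [guest_ptr]
mov [edx], eax          ; revision-id into the guest-vmcs region

VMXON [vmxon_ptr]
jbe fail

vmclear [guest_ptr]
jbe fail

vmptrld [guest_ptr]
jbe fail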


call initialize_vmx_controls
call initialize_vmx_host_guest_state
call do_vmlaunch

;ideally a hypervisor would read the VMX-MSRS
; to determine what values to write.
initialize_vmx_controls:
mov ebx, ENTRY_CONTROLS ;0x4012
mov eax, 0x11ff
vmwrite ebx, eax
mov ebx, PIN_CONTROLS; 0x4000
mov eax, 0x1f
vmwrite ebx, eax
mov ebx, PROC_CONTROLS ; 0x4002
mov eax, 0x0401E9F2
vmwrite ebx, eax
mov ebx, EXIT_CONTROLS ; 0x400C
mov eax, 0x36dff
vmwrite ebx, eax
ret


initialize_vmx_host_guest_state:
mov eax, cr3
mov ebx, HOST_CR3 ;0x6C02
mov edx, GUEST_CR3 ;0x6802
VMWRITE EBX,EAX
mov eax, pdebase_guest
VMWRITE EDX,EAX

mov ebx, HOST_RSP ;0x6c14
mov eax, tos ;top-of-stack
vmwrite ebx, eax

mov ebx, HOST_CR0 ; 0x6C00
mov eax, cr0
vmwrite ebx, eax
mov ebx, GUEST_CR0 ;0x6800
vmwrite ebx, eax
mov ebx, HOST_CR4 ; 0x6C04
mov eax, cr4
vmwrite ebx, eax
mov ebx, GUEST_CR4; 0x6804
vmwrite ebx, eax
mov ebx, HOST_CS_SEL ; 0x0c02
mov eax, cs
vmwrite ebx, eax
mov ebx,HOST_DS_SEL ; 0x0c06
mov eax, ds
vmwrite ebx, eax
mov ebx, HOST_SS_SEL ; 0x00000c04
mov eax, 0x18
vmwrite ebx, eax
mov ebx, HOST_TR_SEL; 0x00000c0c
mov eax, 0x18
vmwrite ebx, eax
mov ebx, GUEST_TR_SEL ;0x0000080e
mov eax, 0x18
vmwrite ebx, eax
mov ebx, GUEST_TR_ATTR ;0x00004822
mov eax, 0x8b
vmwrite ebx, eax
mov ebx, GUEST_TR_LIMIT ;0x0000480e
mov eax, 0xff
vmwrite ebx, eax
mov ebx, GUEST_LDTR_ATTR ;0x00004820
mov eax, 0x00010000
vmwrite ebx, eax
mov ebx, GUEST_SS_ATTR ;0x00004818
mov eax, 0xc093
vmwrite ebx, eax
mov ebx, GUEST_DS_ATTR ;0x0000481a
mov eax, 0xc093
vmwrite ebx, eax
mov ebx, GUEST_ES_ATTR ;0x00004814
mov eax, 0xc093
vmwrite ebx, eax
mov ebx, GUEST_FS_ATTR ;0x0000481c
mov eax, 0xc093
vmwrite ebx, eax
mov ebx, GUEST_GS_ATTR ;0x0000481e
mov eax, 0xc093
vmwrite ebx, eax
mov ebx, GUEST_SS_LIMIT ;0x00004804
mov eax, 0xffffffff
vmwrite ebx, eax
mov ebx, GUEST_DS_LIMIT ;0x00004806
vmwrite ebx, eax
mov ebx, GUEST_ES_LIMIT ;0x00004800
vmwrite ebx, eax
mov ebx, GUEST_FS_LIMIT ;0x00004808
vmwrite ebx, eax
mov ebx, GUEST_GS_LIMIT ;0x0000480a
vmwrite ebx, eax
mov ebx, LINK_PTR_FULL ;0x00002800
vmwrite ebx, eax
mov ebx, LINK_PTR_HIGH ;0x00002801
vmwrite ebx, eax
mov ebx, GUEST_GDTR_BASE ;0x00006816
mov eax, gdt32t
vmwrite ebx, eax
mov ebx, HOST_GDTR_BASE ;0x00006c0c
vmwrite ebx, eax
mov ebx, GUEST_CS_LIMIT ;0x00004802
mov eax, 0xffffffff
vmwrite ebx, eax
mov ebx, GUEST_CS_ATTR ;0x00004816
mov eax, 0xc09b
vmwrite ebx, eax
mov ebx, GUEST_RSP ;0x0000681c
mov eax, tos
vmwrite ebx, eax
mov ebx, GUEST_IDTR_BASE ;0x00006818
mov eax, idt32t
vmwrite ebx, eax
mov ebx, HOST_IDTR_BASE ;0x00006c0e
vmwrite ebx, eax
mov ebx, GUEST_CS_SEL ;0x00000802
mov eax, guest_sel
vmwrite ebx, eax
mov ebx, GUEST_CS_BASE ;0x00006808
mov eax, guest_base
vmwrite ebx, eax
mov ebx, GUEST_RIP ;0x0000681e
mov eax, 0
vmwrite ebx, eax
mov ebx, HOST_RIP ;0x00006c16
mov eax, after_vmexit
vmwrite ebx, eax
mov ebx, GUEST_RFLAGS ;0x00006820
mov eax, 2
vmwrite ebx, eax
mov ebx, EXCEPTION_BITMAP ;0x4004
mov eax,0xdeadfeef
vmwrite ebx, eax
ret

do_vmlaunch:
VMLAUNCH

after_vmexit:
;read EXIT_REASON and figure out what caused the vmexit.

///////////////////////////////////////////////////////////////

The HOST_RIP is where control is transferred after a vmexit. The hypervisor can determine the appropriate course of action by reading the vmexit fields from the vmcs.
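As a starting point for the code at after_vmexit, the exit reason can be split into its parts. The sketch below uses hypothetical names and takes the raw 32-bit value that a vmread of EXIT_REASON would return. It relies on the basic exit reason being in bits 15:0 and on bit 31 indicating a vmentry failure; the three named constants are the basic reasons relevant to the posts above.

```c
#include <assert.h>
#include <stdint.h>

/* A few basic exit reasons (bits 15:0 of the exit reason field). */
#define EXIT_REASON_EXCEPTION_OR_NMI 0u
#define EXIT_REASON_INVLPG           14u
#define EXIT_REASON_VMCALL           18u

/* Basic exit reason: bits 15:0. */
static inline uint16_t exit_basic_reason(uint32_t exit_reason)
{
    return (uint16_t)(exit_reason & 0xFFFFu);
}

/* Bit 31 is set when the vmexit is due to a failed vmentry. */
static inline int exit_is_entry_failure(uint32_t exit_reason)
{
    return (int)((exit_reason >> 31) & 1u);
}
```

A handler would typically switch on exit_basic_reason() and then read the exit qualification or the interruption-info fields, depending on the reason.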