Wednesday, July 17, 2013

A simple linux driver for vmlaunch


 The idea behind the driver is to demonstrate a real example of how to initialize the Virtual Machine Control Structure (VMCS) and how to use Intel VT instructions to launch a virtual machine. The driver launches a guest (virtual machine) with vmlaunch, executes one instruction (which causes a vmexit) and then returns to the host. For the vmlaunch instruction to execute successfully, a large amount of cpu state (host and guest) must be initialized, all of which is done by this driver. The driver also takes a simple approach in setting up the guest state by making it mirror the host state. This makes the design much simpler - for instance, the guest does not need its own CR3; it shares it with the host. Inline assembly is used generously throughout the driver.

The driver source code (64-bit) is located here:

https://github.com/vishmohan/vmlaunch


The sequence leading to the launch of a virtual machine is as follows:

1. Check to make sure the cpu supports VMX.
2. Check to see if the bios has enabled vmxon in the FEATURE_CONTROL_MSR (msr 0x3A).
3. vmxon.
4. vmptrld
5. Initialize guest vmcs
6. vmlaunch
7. Guest code executes, causes a vmexit
8. Back to the host.

Below is some discussion of the code - I have provided some code snippets for clarity.

The starting point is the function vmxon_init( ) :

1. First execute cpuid (leaf 1). Bit 5 of value returned in ecx indicates support for vmx. If cpuid indicates support for vmx then the code continues with normal execution. If vmx support is not indicated, the code exits.

asm volatile("cpuid\n\t"
       :"=c"(cpuid_ecx)
       :"a"(cpuid_leaf)
       :"%rbx","%rdx");
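The bit test in step 1 can be sketched as a small helper. This is a user-space illustration; cpu_supports_vmx is a hypothetical name, not part of the driver:

```c
#include <stdint.h>

/* CPUID leaf 1 reports VMX support in ECX bit 5.
 * Hypothetical helper that tests that bit in a saved ECX value. */
static int cpu_supports_vmx(uint32_t cpuid_ecx)
{
    return (cpuid_ecx >> 5) & 1;
}
```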

2. Read the feature_control_msr and look for the lock bit (bit 0) and the vmxon bit (bit 2). Both bits must be set - if not, exit.

asm volatile("rdmsr\n"
                :"=a"(msr3a_value)
                :"c"(feature_control_msr_addr)
                :"%rdx"
               );
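The bit checks on the feature control msr can be sketched like this (hypothetical helper, not part of the driver):

```c
#include <stdint.h>

/* IA32_FEATURE_CONTROL (msr 0x3A): bit 0 is the lock bit, bit 2
 * enables vmxon outside SMX operation. Both must be set for the
 * driver to proceed. */
static int feature_control_allows_vmxon(uint64_t msr3a_value)
{
    return (msr3a_value & 0x1) && (msr3a_value & 0x4);
}
```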


3. Call allocate_vmxon_region(). This allocates a 4k region for vmxon.

static void allocate_vmxon_region(void) {
   vmxon_region = kmalloc(MYPAGE_SIZE,GFP_KERNEL);
}


4. Set up the revision id in the vmxon region. The call to vmxon_setup_revid() reads the revision id, which is then copied into the vmxon region. A rdmsr of VMX_BASIC_MSR (msr 0x480) returns the revision id.

vmxon_setup_revid();
memcpy(vmxon_region, &vmx_rev_id, 4); //copy revision id to vmxon region


static void vmxon_setup_revid(void){
   asm volatile("rdmsr\n"
       :"=a"(vmx_rev_id)   //bits 30:0 of VMX_BASIC_MSR hold the revision id
       :"c"(vmx_msr_addr)
       :"%rdx");
}
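The revision id extraction can be modeled in plain C as below; the sample msr value in the test is made up, only the masking matters:

```c
#include <stdint.h>

/* VMX_BASIC_MSR (msr 0x480): bits 30:0 hold the VMCS revision
 * identifier (bit 31 is always 0). */
static uint32_t vmcs_revision_id(uint64_t vmx_basic_msr)
{
    return (uint32_t)(vmx_basic_msr & 0x7fffffff);
}
```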


5. Turn on the VMXE bit in CR4 (bit 13). This enables the virtual machine extensions; execution of vmxon will #UD without CR4.VMXE set. The function turn_on_vmxe() accomplishes this.

static void turn_on_vmxe(void) {
   asm volatile("movq %cr4, %rax\n"
           "bts $13, %rax\n"
           "movq %rax, %cr4\n"
          );
  }

6. Now execute vmxon by calling the function do_vmxon(). If vmxon fails for any reason, the CF or ZF in rflags will be set - the code checks for this case and captures the flags for debug (using pushfq and popq below).

static void do_vmxon(void) {
   asm volatile (MY_VMX_VMXON_RAX
                 : : "a"(&vmxon_phy_region), "m"(vmxon_phy_region)
                 : "memory", "cc");
   asm volatile("jbe vmxon_fail\n");
   vmxon_success = 1;
   asm volatile("jmp vmxon_finish\n"
                "vmxon_fail:\n"
                "pushfq\n");
   asm volatile("popq %0\n"
                :"=m"(rflags_value) //capture rflags for debug
                :
                :"memory");
   vmxon_success = 0;
   asm volatile("vmxon_finish:\n");
}
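The rflags check that follows a VMX instruction can be expressed as a pure helper (hypothetical name): CF=1 means VMfailInvalid, ZF=1 means VMfailValid (an error number is then available in the VM-instruction error field).

```c
#include <stdint.h>

#define RFLAGS_CF (1ull << 0)  /* VMfailInvalid */
#define RFLAGS_ZF (1ull << 6)  /* VMfailValid   */

/* Returns nonzero if a saved RFLAGS value indicates that the previous
 * VMX instruction (vmxon, vmptrld, ...) failed. */
static int vmx_insn_failed(uint64_t rflags)
{
    return (rflags & (RFLAGS_CF | RFLAGS_ZF)) != 0;
}
```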


If vmxon executes successfully, the cpu is now in vmx root operation. INIT# and A20M# are blocked, CR4.VMXE cannot be cleared, and CR0.PG/PE are fixed to 1 in vmx root operation. Note: CR4.VMXE can be cleared after the execution of vmxoff, which takes the machine out of vmx root operation.


7. Next allocate the vmcs region for the guest and the other data structures that are used in vmx non-root operation (I/O bitmaps, MSR bitmaps, etc.). This is done by allocate_vmcs_region().

static void allocate_vmcs_region(void) {
   vmcs_guest_region  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   io_bitmap_a_region =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   io_bitmap_b_region =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   msr_bitmap_region  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   virtual_apic_page  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);

   //Initialize data structures
   memset(vmcs_guest_region, 0, MYPAGE_SIZE);
   memset(io_bitmap_a_region, 0, MYPAGE_SIZE);
   memset(io_bitmap_b_region, 0, MYPAGE_SIZE);
   memset(msr_bitmap_region, 0, MYPAGE_SIZE);
   memset(virtual_apic_page, 0, MYPAGE_SIZE);

}
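One thing to be careful about: the vmxon and vmcs pointers handed to the cpu must be 4K-aligned physical addresses. kmalloc of a page-sized object is page aligned on common slab configurations, but a defensive check costs nothing. A sketch (in the driver the check would apply to the __pa() result):

```c
#include <stdint.h>

/* Nonzero when an address meets the 4K alignment requirement for the
 * vmxon region, vmcs region and the bitmap pages. */
static int is_4k_aligned(uint64_t phys_addr)
{
    return (phys_addr & 0xfff) == 0;
}
```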


8. Populate the guest vmcs region with the same revision id as the one used for vmxon region.

memcpy(vmcs_guest_region, &vmx_rev_id, 4); //copy revision id to vmcs region


9. Execute vmptrld. Check rflags (ZF and CF) to make sure vmptrld executed successfully.

static void do_vmptrld(void) {
   asm volatile (MY_VMX_VMPTRLD_RAX
                 : : "a"(&vmcs_phy_region), "m"(vmcs_phy_region)
                 : "cc", "memory");
   asm volatile("jbe vmptrld_fail\n");
   vmptrld_success = 1;
   asm volatile("jmp vmptrld_finish\n"
                "vmptrld_fail:\n"
                "pushfq\n");
   asm volatile("popq %0\n"
                :"=m"(rflags_value) //capture rflags for debug
                :
                :"memory");
   vmptrld_success = 0;
   asm volatile("vmptrld_finish:\n");
}


10. Now it's time to initialize the guest vmcs. This is accomplished by initialize_guest_vmcs(). It is advisable to keep the initialization of the guest vmcs consistent with the field encodings in Appendix B, Vol 3c. This will avoid skipping initialization of important fields and will save a great deal of headache when debugging a vmlaunch failure due to invalid guest state.

static void initialize_guest_vmcs(void){
    initialize_16bit_host_guest_state();
    initialize_64bit_control();
    initialize_64bit_host_guest_state();
    initialize_32bit_control();
    initialize_naturalwidth_control();
    initialize_32bit_host_guest_state();
    initialize_naturalwidth_host_guest_state();
}

Note:
  1. The latest Intel manuals define newer fields - this code initializes only the fields supported by the earliest processors with VT support. For example, VPID is not in the initialization section since not all processors support VPID.

2. All fields are expanded and written individually rather than iterated through a loop [for ease of debug]. Intel's vmentry checks are detailed, and any initialization issue here that causes a vmentry failure will be less painful to debug with this code.

initialize_16bit_host_guest_state( ):
This function takes care of initializing the 16 bit guest and host states.

A sample initialization of the host and guest ES selector is given below:

  field = VMX_HOST_ES_SEL;
  field1 = VMX_GUEST_ES_SEL;
  asm ("movw %%es, %%ax\n"
                 :"=a"(value)
        );
   do_vmwrite16(field,value);
   do_vmwrite16(field1,value);

A sample initialization of the host and guest TR selector is given below:

   field = VMX_HOST_TR_SEL;
   field1 = VMX_GUEST_TR_SEL;
   asm("str %%ax\n" : "=a"(value));
   do_vmwrite16(field,value);
   do_vmwrite16(field1,value);

initialize_64bit_control( ):   
This function takes care of initializing the 64 bit controls.

A sample initialization of the IO bitmaps is given below:

   field = VMX_IO_BITMAP_A_FULL;
   io_bitmap_a_phy_region = __pa(io_bitmap_a_region);
   value = io_bitmap_a_phy_region;
   do_vmwrite64(field,value);

   field = VMX_IO_BITMAP_B_FULL;
   io_bitmap_b_phy_region = __pa(io_bitmap_b_region);
   value = io_bitmap_b_phy_region;
   do_vmwrite64(field,value);

 initialize_64bit_host_guest_state( ):
This function takes care of initializing the 64 bit host/guest state.

 field = VMX_VMS_LINK_PTR_FULL;
 value = 0xffffffffffffffffull;
 do_vmwrite64(field,value);
 field = VMX_GUEST_IA32_DEBUGCTL_FULL;
 value = 0;
 do_vmwrite64(field,value);

initialize_32bit_control( ):
32 bit controls are initialized here:

   field = VMX_PIN_VM_EXEC_CONTROLS;
   value = 0x1f;
   do_vmwrite32(field,value);

Ideally this code should read the pin-based controls msr (msr 0x481) to find the allowed-0 and allowed-1 settings and then initialize this field. The earliest cpus supporting vmx implemented external-interrupt exiting (bit 0) and NMI exiting (bit 3); the other bits set to 1 in this value are the must-be-1 bits. Hence the value 0x1f.
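That msr-driven initialization can be sketched as follows. By convention, the low 32 bits of each VMX capability msr are the allowed-0 settings (must-be-1 bits) and the high 32 bits are the allowed-1 settings; the msr value in the test is made up for illustration:

```c
#include <stdint.h>

/* Adjust a desired control value against a VMX capability msr such as
 * the pin-based controls msr (msr 0x481): force the must-be-1 bits on
 * and the must-be-0 bits off. */
static uint32_t adjust_vmx_control(uint32_t desired, uint64_t ctl_msr)
{
    uint32_t allowed0 = (uint32_t)ctl_msr;         /* must be 1 */
    uint32_t allowed1 = (uint32_t)(ctl_msr >> 32); /* may be 1  */
    return (desired | allowed0) & allowed1;
}
```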


initialize_naturalwidth_control( ):

The CR0 and CR4 guest host mask are initialized below:

   field = VMX_CR0_MASK;
   value = 0;
   do_vmwrite64(field,value);
   field = VMX_CR4_MASK;
   value = 0;
   do_vmwrite64(field,value);

initialize_32bit_host_guest_state( ):

It initializes  32 bit guest/host state and a few of the natural width fields. Here are a few examples:

Initializing the AR bytes is a 2-step process - first find the access rights of the segment using the lar instruction, then arrange the access rights to match the guest access-rights format described in the chapter Virtual Machine Control Structures, vol 3c (Chapter 24, Table 24.2 in the June 2013 manual).

  asm ("movw %%cs, %%ax\n"
         : "=a"(sel_value));
   asm("lar %%eax,%%eax\n" :"=a"(usable_ar) :"a"(sel_value));
   usable_ar = usable_ar>>8;
   usable_ar &= 0xf0ff; //clear bits 11:8

   field = VMX_GUEST_CS_ATTR;
   do_vmwrite32(field,usable_ar);
   value = do_vmread(field);
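The shift-and-mask step can be isolated into a helper (hypothetical name) and checked with a made-up lar result:

```c
#include <stdint.h>

/* lar places the access-rights bytes in bits 23:8 of its destination.
 * The VMCS guest AR format wants them in bits 15:0 with the reserved
 * bits 11:8 cleared. */
static uint32_t lar_to_vmcs_ar(uint32_t lar_result)
{
    uint32_t ar = lar_result >> 8;
    ar &= 0xf0ff; /* clear bits 11:8 */
    return ar;
}
```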





This code also initializes the GDTR base (a natural width field) along with its limit (a 32 bit field) for convenience. The same process is repeated for the IDTR and TR.

  asm("sgdt %0\n" :"=m"(gdtb));
   value = gdtb&0x0ffff;
   gdtb = gdtb>>16; //base

   if((gdtb>>47&0x1)){
     gdtb |= 0xffff000000000000ull;
   }
   field = VMX_GUEST_GDTR_LIMIT;
   do_vmwrite32(field,value);
   field = VMX_GUEST_GDTR_BASE;
   do_vmwrite64(field,gdtb);
   field = VMX_HOST_GDTR_BASE;
   do_vmwrite64(field,gdtb);
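The limit/base split and the sign extension above can be modeled in plain C (the packed input value in the test is made up):

```c
#include <stdint.h>

/* The snippet above packs the sgdt result into one integer: limit in
 * the low 16 bits, base above it (truncated to 48 bits), so the base
 * must be sign-extended from bit 47 to be canonical. */
static uint16_t gdt_limit(uint64_t packed)
{
    return (uint16_t)packed;
}

static uint64_t gdt_base(uint64_t packed)
{
    uint64_t base = packed >> 16;
    if ((base >> 47) & 1)
        base |= 0xffff000000000000ull; /* sign-extend from bit 47 */
    return base;
}
```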






initialize_naturalwidth_host_guest_state( ):

Initializes the natural width guest and host states.

As the snippets below show, the host and guest CR0, CR3 and CR4 are identical.

   field =  VMX_HOST_CR0;
   field1 = VMX_GUEST_CR0;
   asm ("movq %%cr0, %%rax\n"
                 :"=a"(value)
        );
   do_vmwrite64(field,value);
   do_vmwrite64(field1,value);

   field =  VMX_HOST_CR3;
   field1 = VMX_GUEST_CR3;
   asm ("movq %%cr3, %%rax\n"
                 :"=a"(value)
        );
   do_vmwrite64(field,value);
   do_vmwrite64(field1,value);

   field =  VMX_HOST_CR4;
   field1 = VMX_GUEST_CR4;
   asm ("movq %%cr4, %%rax\n"
                 :"=a"(value)
        );
   do_vmwrite64(field,value);
   do_vmwrite64(field1,value);


11. The last piece of initialization is the guest and host rip:

    After a vmexit the cpu transfers control to the host at the label 'vmexit_handler'.
   //host rip (0x6c16 is the VMCS field encoding for HOST_RIP)
   asm ("movq $0x6c16, %rdx");
   asm ("movq $vmexit_handler, %rax");
   asm ("vmwrite %rax, %rdx");



   After a vmentry the cpu transfers control to the guest at the label 'guest_entry_point'.
   //guest rip (0x681e is the VMCS field encoding for GUEST_RIP)
   asm ("movq $0x681e, %rdx");
   asm ("movq $guest_entry_point, %rax");
   asm ("vmwrite %rax, %rdx");

12. Finally the vmlaunch:

   asm volatile (MY_VMX_VMLAUNCH);
   asm volatile("jbe vmexit_handler\n");
   asm volatile("nop\n"); //will never get here

   asm volatile("guest_entry_point:"); //after vmlaunch, execution starts here
   asm volatile(MY_VMX_VMCALL);        //vmcall causes a vmexit with exit reason 0x12
   asm volatile("ud2\n");              //will never get here
   asm volatile("vmexit_handler:\n");  //after vmexit, execution resumes here


13. If the launch completes successfully, then the guest executes vmcall and vmexits.

14. After the vmexit, the vmexit_handler takes over, reads the exit reason and prints a message.

   asm volatile("vmexit_handler:\n");
   field_1 = VMX_EXIT_REASON;
   value_1 = do_vmread(field_1);
   asm volatile("sti\n");
  


15. When the driver module is removed (rmmod), the vmxon_exit() function is called. It does the following: (a) executes vmxoff, (b) turns off CR4.VMXE, (c) deallocates the vmcs region and other data structures, (d) deallocates the vmxon region.


static void vmxon_exit(void) {
   if(vmxon_success==1) {
         do_vmxoff();
     vmxon_success = 0;
   }
   save_registers();
   turn_off_vmxe();
   restore_registers();
   deallocate_vmcs_region();
   deallocate_vmxon_region();
}


The deallocation routines just free all the allocated memory regions:



static void deallocate_vmxon_region(void) {
   if(vmxon_region){
       kfree(vmxon_region);
   }
}









Wednesday, July 20, 2011

VMX and SMM – Dual monitor mode


Dual monitor mode involves two monitors: the executive monitor and the SMM monitor. The executive monitor is analogous to the vmx-root hypervisor that exists outside of SMM. The SMM monitor is a special hypervisor that operates only in SMM. Under the dual monitor treatment, SMIs cause vmexits, and this information is recorded in a separate vmcs called the SMM transfer vmcs. This enables the SMM monitor to assume control of vmexits caused by SMI# assertion.

Normal vmx transitions (without dual monitor):
1. vmxon is executed by the executive monitor.
2. vmptrld is executed. The guest vmcs is then initialized via vmwrites.
3. vmlaunch is executed. The guest virtual machine is launched.
4. A vmexit from the guest traps back to the executive monitor. The executive monitor reads the exit_reason, exit_qual and a number of vmcs fields to extract more details on the vmexit. After handling the vmexit, the executive monitor resumes the guest by executing vmresume.
5. If there is a SMI# in the guest, vmx is turned off and the processor enters SMM. Upon RSM, the processor returns to the vmx guest.

Dual monitor vmx transitions:
0. Enable dual monitor treatment [see section on enabling dual monitor below].
1. A. vmxon is executed by the executive monitor.
   B. Dual monitor treatment is activated [see section on activating dual monitor below].
2. vmptrld is executed. The guest vmcs is then initialized via vmwrites.
3. vmlaunch is executed. The guest virtual machine is launched.
4. A vmexit from the guest traps back to the executive monitor. The executive monitor reads the exit_reason, exit_qual and a number of vmcs fields to extract more details on the vmexit. After handling the vmexit, the executive monitor resumes the guest by executing vmresume.
5. If there is a SMI# in the guest, a SMM vmexit occurs. Control is transferred to the SMM monitor (instead of the executive monitor). The SMM monitor handles the SMM vmexit by reading the relevant fields in the SMM transfer vmcs. After handling the SMM vmexit, it resumes the guest by executing vmresume.

Steps 1A, 2, 3 and 4 are identical for normal and dual-monitor vmx transitions.
The only difference between normal vmx transitions and dual monitor vmx transitions is in the handling of SMI. In the dual monitor case, there is a new vmexit (SMM vmexit) that traps to the SMM monitor. All other vmexits continue to trap to the executive monitor.

When the machine is in the SMM monitor, it is considered to be in SMM. A  SMM VMexit is one that begins outside of SMM and ends in SMM. This means that SMM VMexits are also accompanied by the SMI_ACK special cycle. Similarly, a vmresume from a SMM monitor that resumes the guest is also accompanied by a SMI_ACK special cycle(since this vmresume takes the machine from SMM to outside of SMM).


The next two sections cover step 0 and 1B of Dual Monitor vmx transitions.


Enabling Dual Monitor Treatment:
Intel provides a new msr (msr 0x9b – the SMM_MONITOR_CTL msr) for this. Bit 0 of this msr is the valid bit. Bits 31:12 hold the physical address (4K aligned) of the monitor segment (also called MSEG) that initializes the SMM transfer vmcs. This msr can be written only in SMM mode. Here is a sample code:
mov ecx, 0x9B
mov eax, 0x00009001
xor edx, edx ; bits 63:32 are reserved. Clear edx
wrmsr
rsm  ; get out of SMM

The valid bit is set to 1 and bits 31:12 = 0x9, which implies that the physical address of the MSEG segment is 0x9000. Note that the above code snippet must run in SMM (SMI handlers that are dual-monitor aware may add the above code to initialize the MSEG).
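Composing that msr value from an MSEG physical address can be written as a helper (hypothetical name):

```c
#include <stdint.h>

/* SMM_MONITOR_CTL (msr 0x9b): bit 0 = valid, bits 31:12 = physical
 * address of the 4K-aligned MSEG. */
static uint32_t smm_monitor_ctl(uint32_t mseg_phys, int valid)
{
    return (mseg_phys & 0xfffff000u) | (valid ? 1u : 0u);
}
```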

A sample MSEG header looks like the one shown below. In our example, the header is at physical address 0x9000 (what we wrote in msr 0x9b).

revision_identifier     dd 0
smm_monitor_features    dd 0
gdtr_limit              dd
gdtr_baseoffset         dd
cs_sel                  dd
eip_offset              dd
esp_offset              dd
cr3_offset              dd

The format of the MSEG_HEADER above matches the one described in Table 26.10 (vol 3b, System Management Mode, chapter 26). Note: depending on the version of the Intel manual, table numbers may vary – but it will be found in the SMM chapter regardless of the manual version.
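The same header expressed as a C struct (field names follow the listing above; each entry is a 32-bit "dd"):

```c
#include <stdint.h>
#include <stddef.h>

/* MSEG header: eight consecutive 32-bit fields, 32 bytes total. */
struct mseg_header {
    uint32_t revision_identifier;  /* offset 0  */
    uint32_t smm_monitor_features; /* offset 4  */
    uint32_t gdtr_limit;           /* offset 8  */
    uint32_t gdtr_baseoffset;      /* offset 12 */
    uint32_t cs_sel;               /* offset 16 */
    uint32_t eip_offset;           /* offset 20 */
    uint32_t esp_offset;           /* offset 24 */
    uint32_t cr3_offset;           /* offset 28 */
};
```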


Activating Dual Monitor Treatment:
After enabling the dual-monitor treatment, software can activate it by executing vmcall instruction.  This execution of vmcall is in VMX_ROOT mode.  (Execution of vmcall in vmx_non_root mode always causes a vmexit. Vmcall execution in vmx_root mode thus has a special meaning – to activate dual monitor treatment).  Here is a sample code that accomplishes this:

; enable cr4.vmxe
mov eax, 0x00002010
mov cr4, eax
; do vmxon
VMXON [vmxon_ptr]
jbe fail
;load smm transfer vmcs pointer
vmclear [vmcs_smm_ptr]
jbe fail
vmptrld [vmcs_smm_ptr]
jbe fail
; now do vmcall
vmcall


When the processor executes the vmcall instruction in vmx_root mode, it internally does the following:
vmcall_flow:
if (vmx_root) {
    if (dual_monitor_active) {
        perform SMM_VMexit;
    } else if (SMM_MONITOR_CTL_VALID) {
        Activate_Dual_Monitor_SMM_VMexit;
    }
}

In the above code snippet, SMM_MONITOR_CTL_VALID comes directly from bit 0 of the SMM_MONITOR_CTL_MSR (msr 0x9b). If these conditions are not met, vmcall fails. Also note that vmcall performs additional checks on the SMM transfer vmcs which are not discussed here; for those details, reading vol 3b of the manual is recommended. [Also looking at the vmcall pseudo-code provided in Vol 2b (under vmx instructions) is recommended.]

In the process of 'Activate_Dual_Monitor_SMM_VMexit', the processor does the following:
a. Enters SMM (issues a SMI_ACK bus cycle).
b. Reads the MSEG revision identifier (offset 0). If it does not match the revision identifier supported by the processor, then VMCALL fails. [The MSEG revision id supported by the processor is obtained by a rdmsr of IA32_VMX_MISC_MSR (msr 0x485 – bits 63:32).]
c. Reads the MSEG features field and performs checks on that field.
d. After all checks pass, the processor starts executing instructions from the RIP indicated in the eip_offset field of the MSEG.

Sample MSEG code:
mov eax, 0x11ff
mov ebx, VMX_ENTRY_CONTROLS
vmwrite ebx, eax

mov eax, 0x008B
mov ebx, VMX_GUEST_TR_ATTR
vmwrite ebx, eax
 
mov ebx, VMX_EXIT_INSTR_LEN
vmread eax, ebx

mov ebx, VMX_GUEST_RIP
vmread ebx, ebx

add eax, ebx
mov ebx, VMX_GUEST_RIP
vmwrite ebx, eax

vmlaunch

The code above does only the bare minimum (in reality, it would initialize the entire SMM_VMCS) – it initializes the entry controls, updates the guest_rip and does a vmlaunch. Where does this vmlaunch take the machine? The answer: back to VMX_ROOT. This is a special type of VMentry that takes the machine back to VMX_ROOT – Intel calls it a 'VMentry that returns from SMM'. Remember that the machine performed a SMM_VMexit when VMCALL was executed in VMX_ROOT mode – so this VMLAUNCH in the MSEG code takes us back to VMX_ROOT. At this point, we are in the executive monitor. This completes step 1B in the dual monitor flow.

Thursday, January 6, 2011

VMX and System Management Mode - Part 1

There are two different modes of operation of VMX within SMM:
1. Normal mode
2. Dual monitor mode


Normal Mode:

Under normal mode, a SMI# assertion causes the processor to turn off vmx and enter SMM. Upon RSM, the processor automatically re-enables VMX if it was in either VMX-ROOT or VMX-GUEST prior to the SMI#. Since the processor turns off VMX, CR4.VMXE is treated as a reserved bit and must be 0 during RSM.

Algorithmically,

if (smi) {
    if (vmx_root or vmx_guest) {
        save cr4.vmxe internally;
        if (vmx_root)  internal_state = vmx_root;
        if (vmx_guest) internal_state = vmx_guest;
        turn_off_vmx;
    }
    save cr4 to smm_ram;
}

during rsm:

if (rsm) {
    read cr4_val from smm_ram;
    if (cr4_val.vmxe == 1) jump_to_shutdown;
    retrieve internal cr4.vmxe;
    cr4 <- cr4_val | (cr4.vmxe << 13);
    read internal_state;
    if (internal_state == vmx_root)  put_cpu_in_vmx_root;
    if (internal_state == vmx_guest) put_cpu_in_vmx_guest;
}


Notice the jump_to_shutdown during RSM. Since the processor saves CR4.VMXE internally during SMM, the value saved in SMRAM for CR4.VMXE is always 0. During RSM, the CR4 value is first loaded from SMRAM and bit 13 is checked. It must be 0 – if not, the cpu will jump to shutdown. The processor then retrieves the value of VMXE from an internal register and updates CR4 with it. The state of the processor (whether it was in vmx-root, vmx-guest or normal ia32 operation) is also retrieved, and the cpu is put in that state after the completion of RSM.
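The RSM treatment of CR4.VMXE described above can be modeled as a pure function. The -1 return standing in for shutdown is an illustration, not anything architectural:

```c
#include <stdint.h>

#define CR4_VMXE (1ull << 13)

/* Model of RSM's CR4 handling: the SMRAM image must have VMXE = 0
 * (else shutdown); the internally saved VMXE bit is then merged back. */
static int64_t rsm_cr4(uint64_t smram_cr4, int saved_vmxe)
{
    if (smram_cr4 & CR4_VMXE)
        return -1; /* shutdown */
    return (int64_t)(smram_cr4 | (saved_vmxe ? CR4_VMXE : 0));
}
```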

This process is the default treatment of SMIs with VMX.

Notes on System Management Mode [SMM]

SMM:
SMM [System Management Mode] is an operating mode entered through the assertion of the SMI# pin. The processor, upon detecting a SMI#, saves the processor state in SMRAM [the base address of the SMRAM is obtained from an internal SMBASE register; the reset value of the SMBASE register is 0x30000]. The processor saves several architectural values into the SMRAM (like the values of CR0, CR3, CR4 etc) when it enters SMM. To exit SMM, software executes the RSM (resume) instruction. During the RSM instruction, the processor reloads the architectural state from SMRAM and returns to the state it was in prior to the SMI#.
Here is a loosely defined algorithm for entering and exiting SMM:
1. Processor is executing a task (say T).
2. SMI# is detected by the processor.
3. Processor saves all information pertaining to task T in the SMRAM. It issues a SMI_ENTER_ACK bus cycle and enters SMM.
4. Processor executes code from the SMM space [starting at address 0x38000].
5. When it executes the RSM instruction, the processor reloads the prior architectural state from SMRAM, issues a SMI_EXIT_ACK bus cycle and exits SMM.
6. Processor resumes executing the task T.

During step 5, while the processor loads the architectural state, it performs a few checks on the state being loaded:
1. It checks the reserved bits of CR4.
2. It checks the CR0 register for illegal combinations, e.g. CR0.PG=1 and CR0.PE=0, or CR0.CD=0 and CR0.NW=1.
If the checks above fail, then the processor enters shutdown.
[Note: there may be additional checks performed. The CR0 and CR4 values in SMRAM should be left untouched by the SMM handler. These checks exist to make sure that the handler does not modify values that would put the processor in an incompatible state after the execution of RSM.]

Monday, October 4, 2010

Software injection into V86 guest with interrupt redirection - What must be the IDT VECTOR INFO?

The following observation was made while launching a V86 guest on Intel Merom. As part of vmlaunch or vmresume, a software interrupt is injected into the V86 guest (the entry interruption info field reads 0x800004vv, where vv is the vector number). The V86 virtual machine has:

a.  GUEST_RFLAGS.VM = 1 (indicating the guest is in V86 mode).
b. CR4.VME=1 (enables interrupt redirection provided the redirection bitmap says so in TSS).
c. The exception_bitmap in the guest is configured to vmexit on a #PF.

At the end of vmlaunch, the software interrupt is injected. The guest is in V86 mode and has CR4.VME=1. The cpu consults the TSS to read the interrupt redirection bitmap. The TSS page is not present and the cpu takes a #PF. The guest is configured to vmexit on #PF. After the vmexit, a vmread of the following vmcs fields gives:
a. Exit reason (reads 0)
b. Exit Interruption Info (0x80000B0E - indicates a #PF)
c. IDT Vector  Info (reads 0)
d. Exit Qualification (0x - address that caused #PF).

Something interesting in the above results is the value of the idt-vector-info. The idt-vector-info should have read 0x800004vv (vv = vector), since the vmexit was encountered in the process of injecting an event. This behavior appears to violate what is stated in vol 3b.

Monday, April 26, 2010

Injecting software interrupt into a V86 guest

To inject an interrupt or exception into a guest, a hypervisor uses the ENTRY_INTERRUPTION_INFO field in the vmcs. For example, if there is a need to inject a #GP exception into the guest as part of vmentry, the entry_interruption_info field would look like this: 0x80000B0D.

1. Bits 7:0 of this field represent the vector (0x0D, which is vector 13).
2. Bits 10:8 indicate the type (in this case type = 0x3, which is a hardware exception).
3. Bit 11 is the error-code valid bit, which is set in the example above.
4. Bit 31 is the valid bit for the ENTRY_INTERRUPTION_INFO field.

To inject a software interrupt (say vector 0x8), the hypervisor would program the entry_interruption_info field as follows: 0x80000408 (type = 0x4 and vector = 0x8). If the guest is in V86 mode (GUEST_RFLAGS[VM]=1), the processor behaves according to Table 15.2, Intel SDM, vol 3A.
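The field layout above can be captured in a small encoder (hypothetical helper), which reproduces both example values:

```c
#include <stdint.h>

/* Build an entry_interruption_info value: bits 7:0 vector, bits 10:8
 * type (3 = hardware exception, 4 = software interrupt), bit 11 =
 * error code valid, bit 31 = valid. */
static uint32_t entry_intr_info(uint8_t vector, uint8_t type,
                                int error_code, int valid)
{
    return (uint32_t)vector
         | ((uint32_t)(type & 0x7) << 8)
         | (error_code ? (1u << 11) : 0)
         | (valid ? (1u << 31) : 0);
}
```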


Given below is a summary of the processor behavior during normal software-interrupt execution in V86 and during an event injection into a V86 guest:

1. EFLAGS.VM = 1 , CR4.VME=1, EFLAGS.IOPL=3
=> In this case the bit in the redirection bitmap of the TSS is consulted.
=> if bit in the redirection bitmap=0, the software interrupt is redirected to x86 style handler.
=> if bit in the redirection bitmap=1, the software interrupt is redirected to protected-mode handler.

2.  EFLAGS.VM = 1 , CR4.VME=1, EFLAGS.IOPL<3
=> In this case the bit in the redirection bitmap of the TSS is consulted.

=> if bit in the redirection bitmap=0, the software interrupt is redirected to x86 style handler. Notice that this is the same behavior as with EFLAGS.IOPL=3. The difference is in the value of eflags pushed on the stack. Here the IOPL of the eflags image is forced to 3 and the value of VIF is copied to IF.

Normal behavior:  if bit in the redirection bitmap=1, the interrupt is directed to a #GP handler.
During VMX event injection:  if bit in the redirection bitmap = 1, the processor will *NOT*  #GP due to IOPL < CPL.

3. EFLAGS.VM = 1 , CR4.VME=0, EFLAGS.IOPL=3
=> Normal behavior: Interrupt directed to a protected mode handler (No #GP).
=> During event injection: Same as above.

4. EFLAGS.VM = 1 , CR4.VME=0, EFLAGS.IOPL<3
=> Normal behavior: Interrupt directed to a #GP handler .
=> During Event Injection:  No #GP can occur due to IOPL< CPL. The behavior will be the same as with IOPL=3.

Summary:
From the above discussion, it is clear that there will be no #GP due to IOPL < CPL during the injection of a software interrupt into a V86 guest. If the hypervisor wants this #GP to occur, it needs to inject a #GP directly into the guest instead of a software interrupt. This can be achieved by programming the entry_interruption_info field to 0x80000B0D.

Monday, October 19, 2009

VMEXIT on INVLPG

A boundary case observed on Intel Merom:

(a) The virtual-machine is configured to vmexit on INVLPG(bit 9 of the PROCESSOR_EXECUTION_CONTROLS is 1).

(b) The virtual-machine has GS BASE = 0xFFFF8000_00000000

(c) Virtual machine executes: invlpg [gs:0-1]

(d) Execution of invlpg causes vmexit.

(e) The address of invlpg is recorded in exit-qualification. Upon a vmread of EXIT_QUALIFICATION the value obtained is:
=> FFFF7FFF_FFFFFFFF


Notice that the value recorded is a non-canonical address, i.e. address[63:48] != address[47]. This is the only case I have encountered where a non-canonical address shows up in the exit-qualification.
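The canonicality check (bits 63:48 must be copies of bit 47) can be written down directly; the exit-qualification value above fails it while its neighbors pass:

```c
#include <stdint.h>

/* A 64-bit address is canonical when bits 63:48 are copies of bit 47,
 * i.e. the top 17 bits are all 0 or all 1. */
static int is_canonical(uint64_t addr)
{
    uint64_t top17 = addr >> 47;
    return top17 == 0 || top17 == 0x1ffff;
}
```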

The only explanation I can come up with for this behavior is that INVLPG, unlike other instructions, does not fault in 64-bit mode with a non-canonical operand. According to the instruction spec, INVLPG morphs into a NOP in such cases.

When a vmexit handler for INVLPG is written, this case must be taken into consideration (i.e. a non-canonical address might show up in the exit-qualification field).