proc: iommu: flush the iotlb during shootdowns

IOMMUs are like remote cores.  When we need a generic TLB shootdown, we
also need to shoot down the IOTLB for any IOMMUs where the process's
address space is loaded.

If you don't use device assignment, this is a no-op.  If you do, it's
pretty expensive: 800 ns or so.

Part of the reason it is worse than a regular IPI shootdown is that we
have to wait for a response from the IOMMU hardware.  With IPIs, we know
that the core that was using the page table (e.g. userspace or a syscall
that wasn't in IRQ context) immediately enters the handler and stops
using the page table.  In short, we stop the processor from processing
and using the table.  Then the actual flush happens.  With the IOMMU, we
can't stop the device from processing, so we have to wait until the
flush completes.
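
A toy model of that wait, for illustration only.  None of these names
exist in the tree: struct toy_iommu, struct toy_proc, and the toy_*
functions are hypothetical, and the "hardware" is simulated inline by a
countdown.  The point is just the shape: one flush-and-wait per IOMMU
the process is loaded on, and nothing at all if there are none.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_IOMMUS 4

/* Hypothetical stand-in for one IOMMU.  'done' models a status word that
 * the hardware writes when an invalidation completes. */
struct toy_iommu {
	volatile uint32_t done;
	unsigned int hw_latency;	/* polls before the fake HW finishes */
	unsigned int nr_flushes;
};

/* Hypothetical process: the IOMMUs where its address space is loaded. */
struct toy_proc {
	struct toy_iommu *iommus[TOY_MAX_IOMMUS];
	size_t nr_iommus;
};

/* Submit a flush and spin until the (simulated) hardware says it is done.
 * Returns how many times we polled. */
static unsigned int toy_iotlb_flush_wait(struct toy_iommu *im)
{
	unsigned int remaining = im->hw_latency;
	unsigned int polls = 0;

	im->done = 0;
	while (!im->done) {
		/* Simulated progress; a real device runs on its own. */
		if (remaining == 0)
			im->done = 1;
		else
			remaining--;
		polls++;
	}
	im->nr_flushes++;
	return polls;
}

/* Flush every IOMMU the process uses; no devices assigned means no work. */
static unsigned int toy_proc_iotlb_flush(struct toy_proc *p)
{
	unsigned int polls = 0;

	for (size_t i = 0; i < p->nr_iommus; i++)
		polls += toy_iotlb_flush_wait(p->iommus[i]);
	return polls;
}
```

The caller cannot do anything useful during the spin, which is where the
cost comes from in this model.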

Just like IPIs, these flushes are relatively expensive, and any tricks
we want to do to amortize or batch up TLB shootdowns will apply equally
to the IOTLB, e.g. "defer reuse / freeing of pages and PTEs".
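
A rough sketch of what "defer freeing" could look like, assuming a
fixed-size batch.  This is not implemented by this commit, and every
name here (struct toy_batch, the toy_* helpers, TOY_BATCH_MAX) is
made up for the example.  Pages that were already unmapped sit in the
batch, one shootdown covers all of them, and only then are they freed.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 8

struct toy_batch {
	void *pages[TOY_BATCH_MAX];
	size_t nr_pages;
	unsigned int nr_shootdowns;	/* combined TLB + IOTLB rounds */
	unsigned int nr_freed;
};

/* Stand-in for one combined TLB and IOTLB shootdown. */
static void toy_shootdown(struct toy_batch *b)
{
	b->nr_shootdowns++;
}

/* Stand-in for returning a page to the allocator. */
static void toy_free_page(struct toy_batch *b, void *page)
{
	(void)page;
	b->nr_freed++;
}

/* One shootdown covers every page deferred since the last drain.  Pages
 * are freed only after the flush, so no stale translation can reach a
 * page that has been reused. */
static void toy_batch_drain(struct toy_batch *b)
{
	if (!b->nr_pages)
		return;
	toy_shootdown(b);
	for (size_t i = 0; i < b->nr_pages; i++)
		toy_free_page(b, b->pages[i]);
	b->nr_pages = 0;
}

/* Defer the free of an already-unmapped page; drain when the batch fills. */
static void toy_batch_defer_free(struct toy_batch *b, void *page)
{
	assert(b->nr_pages < TOY_BATCH_MAX);
	b->pages[b->nr_pages++] = page;
	if (b->nr_pages == TOY_BATCH_MAX)
		toy_batch_drain(b);
}
```

With a batch of 8, the flush cost is paid once per 8 pages instead of
once per page, at the price of holding the pages a little longer.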

Signed-off-by: Barret Rhoden <brho@cs.berkeley.edu>
diff --git a/kern/arch/x86/intel-iommu.h b/kern/arch/x86/intel-iommu.h
index 5131e03..37500b9 100644
--- a/kern/arch/x86/intel-iommu.h
+++ b/kern/arch/x86/intel-iommu.h
@@ -77,6 +77,7 @@
 void iommu_unassign_all_devices(struct proc *p);
 void __iommu_device_assign(struct pci_device *pdev, struct proc *proc);
 void __iommu_device_unassign(struct pci_device *pdev, struct proc *proc);
+void proc_iotlb_flush(struct proc *p);
 
 /*
  * VT-d hardware uses 4KiB page size regardless of host page size.
diff --git a/kern/src/dma.c b/kern/src/dma.c
index 60c1dda..155a6d6 100644
--- a/kern/src/dma.c
+++ b/kern/src/dma.c
@@ -245,10 +245,6 @@
 	struct proc *p = da->data;
 
 	munmap(p, (uintptr_t)obj, amt);
-
-	/* TODO: move this to munmap */
-	extern void proc_iotlb_flush(struct proc *p);
-	proc_iotlb_flush(p);
 }
 
 static void *user_addr_to_kaddr(struct dma_arena *da, physaddr_t uaddr)
diff --git a/kern/src/process.c b/kern/src/process.c
index f02f245..0fa18ac 100644
--- a/kern/src/process.c
+++ b/kern/src/process.c
@@ -2019,6 +2019,7 @@
 			tlbflush();
 	}
 	spin_unlock(&p->proc_lock);
+	proc_iotlb_flush(p);
 }
 
 /* Helper, used by __startcore and __set_curctx, which sets up cur_ctx to run a