 kernel_messages.txt
 Barret Rhoden
 2010-03-19
 Updated 2012-11-14

 This document explains the basic ideas behind our "kernel messages" (KMSGs) and
 some of the arcane bits behind the implementation.  These were formerly called
 active messages, since they were an implementation of the low-level hardware
 messaging.

 Overview:
 --------------------------------
 Our kernel messages are just work that is shipped remotely, delayed in time, or
 both.  They currently consist of a function pointer and a few arguments.  Kernel
 messages of a given type will be executed in order, with guaranteed delivery.

 Initially, they were meant to be a way to immediately execute code on another
 core (once interrupts are enabled), in the order in which the messages were
 sent.  This is insufficient (and wasn't what we wanted for the task,
 incidentally).  We simply want to do work on another core, but not necessarily
 instantly.  And not necessarily on another core.

 Currently, there are two types, distinguished by which list they are sent to per
 core: immediate and routine.   Routine messages are often referred to as RKMs.
 Immediate messages will get executed as soon as possible (once interrupts are
 enabled).  Routine messages will be executed at convenient points in the kernel.
 These points include when the kernel is about to pop back to userspace
 (proc_restartcore()) and when it is idling in smp_idle().  Routine messages are
 necessary when their function does not return, such as __launch_kthread.  They
 should also be used if the work is not worth fully interrupting the kernel (an
 IPI will still be sent, but the work will be delayed).  Finally, they should be
 used if their work could affect currently executing kernel code (like a
 syscall).

 For example, some older KMSGs such as __startcore used to not return and would
 pop directly into user space.  This complicated the KMSG code quite a bit.
 While these functions now return, they still can't be immediate messages.  Proc
 management KMSGs change the cur_ctx out from under a syscall, which can lead to
 a bunch of issues.

 Immediate kernel messages are executed in interrupt context, with interrupts
 disabled.  Routine messages are only executed from places in the code where the
 kernel doesn't care if the functions don't return or otherwise cause trouble.
 This means RKMs aren't run in interrupt context in the kernel (or if the kernel
 code itself traps).  We don't have a 'process context' like Linux does; instead,
 it's more of a 'default context'.  That's where RKMs run, and they run with IRQs
 disabled.

 RKMs can enable IRQs, or otherwise cause IRQs to be enabled.  __launch_kthread
 is a good example: it runs a kthread, which may have had IRQs enabled.

 With RKMs, there are no concerns about the kernel holding locks or otherwise
 "interrupting" its own execution.  Routine messages are a little different than
 just trapping into the kernel, since the functions don't have to return and may
 result in clobbering the kernel stack.  Also note that this behavior is
 dependent on where we call process_routine_kmsg().  Don't call it somewhere you
 need to return to.

 An example of an immediate message would be a TLB_shootdown.  Check current,
 flush if applicable, and return.  It doesn't harm the kernel at all.  Another
 example would be certain debug routines.
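
 The two message types described in this overview can be sketched in plain C.
 This is a user-space simulation, not the kernel's code: every name ending in
 _sim, the kmsg_type_sim enum, and the three-argument layout are assumptions
 made for illustration.  It only shows the shape of the idea: a function
 pointer plus a few arguments, two FIFO lists per core, and different drain
 points for each list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CORES_SIM 2

/* Hypothetical type flag; the real flag and its values are not shown here. */
enum kmsg_type_sim { KMSG_IMMEDIATE_SIM, KMSG_ROUTINE_SIM };

typedef void (*kmsg_fn_sim_t)(long a0, long a1, long a2);

/* A message is just deferred work: a function pointer and a few arguments. */
struct kmsg_sim {
	struct kmsg_sim *next;
	kmsg_fn_sim_t fn;
	long a0, a1, a2;
};

struct kmsg_list_sim {
	struct kmsg_sim *head, *tail;
};

/* Two lists per core, one for each message type. */
static struct core_lists_sim {
	struct kmsg_list_sim immediate, routine;
} core_lists[NR_CORES_SIM];

/* Append at the tail so messages of one type run in the order sent. */
static int send_kmsg_sim(int dst, kmsg_fn_sim_t fn, long a0, long a1, long a2,
                         enum kmsg_type_sim type)
{
	struct kmsg_sim *m;
	struct kmsg_list_sim *l;

	assert(dst >= 0 && dst < NR_CORES_SIM);
	m = malloc(sizeof(*m));
	if (!m)
		return -1;
	m->next = NULL;
	m->fn = fn;
	m->a0 = a0;
	m->a1 = a1;
	m->a2 = a2;
	l = (type == KMSG_IMMEDIATE_SIM) ? &core_lists[dst].immediate
	                                 : &core_lists[dst].routine;
	if (l->tail)
		l->tail->next = m;
	else
		l->head = m;
	l->tail = m;
	return 0;
}

/* Run every message on a list, oldest first. */
static void drain_sim(struct kmsg_list_sim *l)
{
	struct kmsg_sim *m;

	while ((m = l->head)) {
		kmsg_fn_sim_t fn = m->fn;
		long a0 = m->a0, a1 = m->a1, a2 = m->a2;

		l->head = m->next;
		if (!l->head)
			l->tail = NULL;
		free(m);	/* release the message before running its work */
		fn(a0, a1, a2);
	}
}

/* Stand-in for the IPI handler: only immediate messages run here. */
static void handle_ipi_sim(int core)
{
	drain_sim(&core_lists[core].immediate);
}

/* Stand-in for a convenient point, such as right before idling. */
static void process_routine_sim(int core)
{
	drain_sim(&core_lists[core].routine);
}

/* Example work function: records its first argument. */
static long seen_sim[8];
static size_t nr_seen_sim;

static void record_sim(long a0, long a1, long a2)
{
	(void)a1;
	(void)a2;
	seen_sim[nr_seen_sim++] = a0;
}
```

 If a routine message is sent before an immediate one, the immediate message
 still runs first when handle_ipi_sim() fires; the routine messages wait until
 process_routine_sim() is called and then run in the order they were sent.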

 History:
 --------------------------------
 KMSGs have a long history tied to process management code.  The main issues were
 related to which KMSG functions return and which ones mess with local state (like
 clobbering cur_ctx or the owning_proc).  Returning was a big deal because you
 can't just arbitrarily abandon a kernel context (locks or refcnts could be held,
 etc).  This is why immediates must return.  Likewise, there are certain
 invariants about what a core is doing that shouldn't be changed by an IRQ
 handler (which is what an immed message really is).  See all the old proc
 management commits if you want more info (check for changes to __startcore).

 Other Uses:
 --------------------------------
 Kernel messages will also be the basis for the alarm system.  At bottom, it is
 just a way of expressing work that needs to be done.  That being said, the
 k_msg struct will probably receive a timestamp field, among other things.
 Routine messages will also replace the old workqueue, which hasn't really been
 used in 40 months or so.

 To Return or Not:
 --------------------------------
 Routine k_msgs do not have to return.  Immediate messages must.  The distinction
 is in how they are sent (send_kernel_message() will take a flag), so be careful.

 To retain some sort of sanity, the functions that do not return must adhere to
 some rules.  At some point they need to end in a place where they check routine
 messages or enable interrupts.  Simply calling smp_idle() will do this.  The
 idea behind this is that routine messages will get processed once the kernel is
 able to (at a convenient place).
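
 The rule for non-returning functions can be illustrated with a small
 user-space sketch.  It uses setjmp/longjmp to imitate abandoning the current
 stack; that is an assumption made for the simulation, not how the kernel does
 it, and all names ending in _sim are hypothetical.  The sketch also dequeues
 each message before running it, which is one plausible way to stay correct
 when the function never comes back.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

#define MAX_RKM_SIM 8

typedef void (*rkm_fn_sim_t)(long arg);

struct rkm_sim {
	rkm_fn_sim_t fn;
	long arg;
};

static struct rkm_sim rkm_q[MAX_RKM_SIM];
static size_t rkm_head, rkm_tail;
static jmp_buf idle_top;
static long ran_sim[MAX_RKM_SIM];
static size_t nr_ran_sim;

static void post_rkm_sim(rkm_fn_sim_t fn, long arg)
{
	assert(rkm_tail - rkm_head < MAX_RKM_SIM);
	rkm_q[rkm_tail % MAX_RKM_SIM] = (struct rkm_sim){ fn, arg };
	rkm_tail++;
}

static void note_ran_sim(long arg)
{
	assert(nr_ran_sim < MAX_RKM_SIM);
	ran_sim[nr_ran_sim++] = arg;
}

/* Stand-in for smp_idle(): abandons the caller's stack, never returns. */
static void smp_idle_sim(void)
{
	longjmp(idle_top, 1);
}

static void returns_normally_sim(long arg)
{
	note_ran_sim(arg);
}

static void never_returns_sim(long arg)
{
	note_ran_sim(arg);
	smp_idle_sim();		/* end where routine messages get checked */
}

/* The convenient place: keep running routine messages until none remain. */
static void idle_loop_sim(void)
{
	setjmp(idle_top);	/* non-returning messages land back here */
	while (rkm_head != rkm_tail) {
		struct rkm_sim m = rkm_q[rkm_head % MAX_RKM_SIM];

		rkm_head++;	/* dequeue first: m.fn() may not come back */
		m.fn(m.arg);
	}
}
```

 Posting returns_normally_sim, never_returns_sim, and returns_normally_sim in
 that order and then calling idle_loop_sim() runs all three in order.  The
 message behind the non-returning one is not lost, because the non-returning
 function ends in the place that checks routine messages.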

 Missing Routine Messages:
 --------------------------------
 It's important that the kernel always checks for routine messages before leaving
 the kernel, either to halt the core or to pop into userspace.  There is a race
 involved with messages getting posted after we check the list, but before we
 pop/halt.  To cover that window, the sender also sends an IPI.  This IPI will
 force us back into the kernel at some point in the code before
 process_routine_kmsg(), thus keeping us from missing the RKM.

 In the future, if we know the kernel code on a particular core is not attempting
 to halt/pop, then we could avoid sending this IPI.  This is the essence of the
 optimization in send_kernel_message() where we don't IPI ourselves.  A more
 formal/thorough way to do this would be useful, both to avoid bugs and to
 improve cross-core KMSG performance.
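
 The race and the role of the IPI can be walked through with a deterministic,
 single-threaded simulation.  Everything here is an assumption made for
 illustration: the names ending in _sim, the counter standing in for the
 routine list, and the ipi_pending flag standing in for a latched interrupt
 that prevents the halt.  The in_window callback injects a sender at exactly
 the point between the check and the halt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SIM_CORES 2

struct sim_core {
	int pending_rkms;	/* stands in for the routine message list */
	bool ipi_pending;	/* a latched IPI not yet taken */
	int handled;		/* routine messages processed so far */
	int halts;		/* times the core really halted */
};

static struct sim_core sim_cores[NR_SIM_CORES];

/* Post a routine message; skip the IPI when sending to ourselves. */
static void send_rkm_sim(int src, int dst)
{
	assert(dst >= 0 && dst < NR_SIM_CORES);
	sim_cores[dst].pending_rkms++;
	if (src != dst)
		sim_cores[dst].ipi_pending = true;
}

static void process_routine_sim(int core)
{
	while (sim_cores[core].pending_rkms > 0) {
		sim_cores[core].pending_rkms--;
		sim_cores[core].handled++;
	}
}

/* Returns true if the core halted; a pending IPI pulls it back instead. */
static bool try_halt_sim(int core)
{
	if (sim_cores[core].ipi_pending) {
		sim_cores[core].ipi_pending = false;
		return false;
	}
	sim_cores[core].halts++;
	return true;
}

/* Check for messages, then halt.  in_window runs once between the two. */
static void idle_once_sim(int core, void (*in_window)(void))
{
	for (;;) {
		process_routine_sim(core);
		if (in_window) {
			in_window();	/* a sender races in right here */
			in_window = NULL;
		}
		if (try_halt_sim(core))
			return;
	}
}

/* Core 1 posts to core 0 after core 0 has already checked its list. */
static void late_sender_sim(void)
{
	send_rkm_sim(1, 0);
}
```

 With late_sender_sim as the callback, the first halt attempt is refused
 because of the pending IPI, the loop goes back to process_routine_sim(), and
 the late message is handled before the core halts.  A self-send leaves
 ipi_pending clear, which is safe in this model only because the sending core
 is still running and will reach its own check before halting.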

 IRQ Trickiness:
 --------------------------------
 You cannot enable interrupts in the handle_kmsg_ipi() handler, either in the
 handler's own code or in the body of any immediate kmsg.  Since we send the EOI
 before running the handler (on x86), another IPI could cause us to reenter the
 handler, which would spin on the lock the previous context is holding (nested
 IRQ stacks).  Using irqsave locks is not sufficient, since they assume IRQs are
 not turned on in the middle of their operation (such as in the body of an
 immediate kmsg).

 Other Notes:
 --------------------------------
 Unproven hunch, but the main performance bottleneck with multiple senders and
 receivers of k_msgs will be the slab allocator.  We use the slab so we can
 dynamically create the k_msgs: we can pass them around easily, delay them
 easily (alarms), and, most importantly, we can't deadlock by running out of
 room in a static buffer.

 Architecture Dependence:
 --------------------------------
 Some details will differ, based on architectural support.  For instance,
 immediate messages can be implemented with true active messages.  Other systems
 with maskable IPI vectors can use a different IPI for routine messages, and that
 interrupt can get masked whenever we enter the kernel (note, that means making
 every trap gate an interrupt gate), and we unmask that interrupt when we want to
 process routine messages.

 However, given the main part of kmsgs is arch-independent, I've consolidated all
 of it in one location until we need to have separate parts of the implementation.