Golang System Calls: Syscall and RawSyscall

The Go source defines the system call entry points as follows:

func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)
func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err Errno)
func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)
func RawSyscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err Errno)

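The difference between Syscall and RawSyscall is that Syscall calls the runtime's enter-syscall and exit-syscall functions at its start and end. Syscall therefore cooperates with the scheduler and will not stall the rest of the program, whereas RawSyscall never notifies the runtime and so may stall it. In general Syscall is the right choice; RawSyscall is best reserved for calls that cannot block.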

func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)

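In the Go version discussed here, Syscall is defined in src/syscall/asm_linux_amd64.s. It is written in assembly and wraps the underlying Linux system call. It takes four parameters: trap is the system call number (not an interrupt signal), and a1, a2, a3 are the arguments of the corresponding kernel call.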

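Example: calling the low-level ioctl function from Go

Pass syscall.SYS_IOCTL as trap; the system call number SYS_IOCTL selects the Linux ioctl call.
The remaining three parameters of Syscall, a1, a2 and a3, correspond to the three arguments of ioctl. The man page gives the signature: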

int ioctl(int d, int request, ...);

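The first argument, d, is a file descriptor created by open or socket; here it is a socket.
The second argument, request, is a device-dependent request code that selects the operation to perform on that descriptor.
The third argument points to a block of memory whose layout usually depends on the operation selected by request.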

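The overall procedure is:
1 Create a socket with socket
2 Initialize a struct ifconf and/or struct ifreq
3 Call ioctl to perform the appropriate SIOC* operation
4 Read the information returned in the struct ifconf and/or struct ifreq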

First, call the low-level socket function to create the socket. On Linux, the man page gives its usage:

int socket(int domain, int type, int protocol);

Here domain is the protocol family, type is the socket type, and protocol selects a specific protocol constant.
Values of domain include:

AF_INET IPv4
AF_INET6 IPv6
AF_ROUTE routing socket
...

Values of type include:

SOCK_STREAM byte-stream socket
SOCK_DGRAM datagram socket
SOCK_RAW raw socket
...

Values of protocol include:

IPPROTO_IP IP
IPPROTO_TCP TCP
IPPROTO_UDP UDP
...

So on Linux, a socket is created in C like this:

fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);

Translated into a Go system call, this becomes:

fd, _, err := syscall.RawSyscall(syscall.SYS_SOCKET, syscall.AF_INET, syscall.SOCK_DGRAM, syscall.IPPROTO_IP)

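At this point fd is the newly created socket.
We pass it as the first argument of int ioctl(int d, int request, ...);. For the second argument, request, we pass SIOCETHTOOL to query ethtool information.
SIOCETHTOOL is defined as a macro in the kernel headers: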

#define SIOCETHTOOL     0x8946

The third argument is the address of a struct ifreq.
struct ifreq is laid out as follows:

struct ifreq {
    char ifr_name[IFNAMSIZ];
    union {
        struct sockaddr ifru_addr;
        struct sockaddr ifru_dstaddr;
        struct sockaddr ifru_broadaddr;
        struct sockaddr ifru_netmask;
        struct sockaddr ifru_hwaddr;
        short           ifru_flags;
        int             ifru_metric;
        caddr_t         ifru_data;
    } ifr_ifru;
};
#define ifr_addr        ifr_ifru.ifru_addr
#define ifr_broadaddr   ifr_ifru.ifru_broadaddr
#define ifr_hwaddr      ifr_ifru.ifru_hwaddr

Putting it together, the ioctl call in C on Linux is:

fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
ioctl(fd, SIOCETHTOOL, &ifreq);

In Go:

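// Fragment from inside a function returning error; needs the syscall and unsafe imports.
// SIOCETHTOOL is the request code 0x8946 from the C macro.
const SIOCETHTOOL = 0x8946

fd, _, err := syscall.RawSyscall(syscall.SYS_SOCKET, syscall.AF_INET, syscall.SOCK_DGRAM, syscall.IPPROTO_IP)
if err != 0 {
    return err // err is already a syscall.Errno
}

// ifreq is a Go value mirroring the C struct ifreq; fd is passed as returned (a uintptr).
_, _, ep := syscall.Syscall(syscall.SYS_IOCTL, fd, SIOCETHTOOL, uintptr(unsafe.Pointer(&ifreq)))
if ep != 0 {
    return ep
}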

Interaction with the scheduler

Only the source of Syscall and RawSyscall is listed here:

//Syscall
TEXT ·Syscall(SB),NOSPLIT,$0-56
    CALL    runtime·entersyscall(SB)
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok
    MOVQ    $-1, r1+32(FP)
    MOVQ    $0, r2+40(FP)
    NEGQ    AX
    MOVQ    AX, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET
ok:
    MOVQ    AX, r1+32(FP)
    MOVQ    DX, r2+40(FP)
    MOVQ    $0, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET

//RawSyscall
TEXT ·RawSyscall(SB),NOSPLIT,$0-56
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok1
    MOVQ    $-1, r1+32(FP)
    MOVQ    $0, r2+40(FP)
    NEGQ    AX
    MOVQ    AX, err+48(FP)
    RET
ok1:
    MOVQ    AX, r1+32(FP)
    MOVQ    DX, r2+40(FP)
    MOVQ    $0, err+48(FP)
    RET

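The implementations of Syscall and RawSyscall are typical, and the main difference between them is easy to see:
Syscall calls runtime·entersyscall(SB) when it enters the system call and runtime·exitsyscall(SB) when the system call finishes, so the runtime is notified on both entry and exit.

runtime·entersyscall and runtime·exitsyscall are implemented in proc.go (proc.c in the older C runtime). When runtime·entersyscall announces the system call, it loosens the binding between the g's M and its P, so that the P can be picked up by another M and continue running other g's, which improves efficiency.

So if user code uses RawSyscall for a blocking system call, it may block other g's. RawSyscall exists only to save the cost of two calls into the runtime when the system call is certain not to block.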
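The effect can be observed with a small Linux experiment, sketched below with made-up names. With GOMAXPROCS set to 1, one goroutine blocks in read(2) on an empty pipe through Syscall, and the calling goroutine still gets to run and write the data that unblocks it. The sketch deliberately uses Syscall only; swapping in RawSyscall here is the pattern the paragraph above warns against.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
	"unsafe"
)

// readWhileOthersRun (hypothetical helper) blocks one goroutine in read(2)
// via Syscall while the caller keeps running on the only P.
func readWhileOthersRun() (string, error) {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	type result struct {
		data string
		err  error
	}
	done := make(chan result, 1)
	go func() {
		buf := make([]byte, 16)
		// Blocks until data arrives; the runtime was told via entersyscall.
		n, _, errno := syscall.Syscall(syscall.SYS_READ,
			uintptr(fds[0]), uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
		if errno != 0 {
			done <- result{err: errno}
			return
		}
		done <- result{data: string(buf[:n])}
	}()

	// Give the reader time to block, then unblock it from this goroutine.
	time.Sleep(50 * time.Millisecond)
	if _, err := syscall.Write(fds[1], []byte("ping")); err != nil {
		return "", err
	}
	r := <-done
	return r.data, r.err
}

func main() {
	fmt.Println(readWhileOthersRun())
}
```

The sleep is only there to make it likely that the reader is already inside the system call when the write happens; the function returns the same data either way.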

runtime·entersyscall and runtime·exitsyscall are also where system calls interact with the scheduler; their source is analysed below.

Runtime support

In the older C runtime, runtime·entersyscall and runtime·exitsyscall are implemented in src/pkg/runtime/proc.c.

They are declared in src/pkg/runtime/runtime.h:

void runtime·entersyscall(void);
void runtime·entersyscallblock(void);
void runtime·exitsyscall(void);

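The definitions in proc.c do not all match these declarations. runtime·exitsyscall is defined as declared, void runtime·exitsyscall(void), but the two entry functions are defined with an extra parameter:

void ·entersyscall(int dummy) { ... }
void ·entersyscallblock(int dummy) { ... }

That is, runtime·entersyscall and runtime·entersyscallblock are declared without arguments, while their definitions, ·entersyscall and ·entersyscallblock, take dummy. The address of dummy is what the functions use to obtain the caller's PC and SP.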

runtime·entersyscall

Now back to the implementation: what special handling does the runtime perform before entering a system call? The function is analysed below in three parts:

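The first part uses the address of dummy to find the caller's PC and SP, and save stores them in g->sched.pc and g->sched.sp. They are then copied into syscallpc and syscallsp, the stack bounds are recorded in syscallstack and syscallguard, and the goroutine's status is set to Gsyscall. If these values are inconsistent, the runtime throws "entersyscall":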

#pragma textflag NOSPLIT
void
·entersyscall(int32 dummy)
{
    // Disable preemption because during this function g is in Gsyscall status,
    // but can have inconsistent g->sched, do not let GC observe it.
    m->locks++;
    // Leave SP around for GC and traceback.
    save(runtime·getcallerpc(&dummy), runtime·getcallersp(&dummy));
    g->syscallsp = g->sched.sp;
    g->syscallpc = g->sched.pc;
    g->syscallstack = g->stackbase;
    g->syscallguard = g->stackguard;
    g->status = Gsyscall;
    if(g->syscallsp < g->syscallguard-StackGuard || g->syscallstack < g->syscallsp) {
        // runtime·printf("entersyscall inconsistent %p [%p,%p]\n",
        //     g->syscallsp, g->syscallguard-StackGuard, g->syscallstack);
        runtime·throw("entersyscall");
    }

The second part wakes sysmon if it is currently sleeping, so that sysmon can keep an eye on the system call:

    if(runtime·atomicload(&runtime·sched.sysmonwait)) { // TODO: fast atomic
        runtime·lock(&runtime·sched);
        if(runtime·atomicload(&runtime·sched.sysmonwait)) {
            runtime·atomicstore(&runtime·sched.sysmonwait, 0);
            runtime·notewakeup(&runtime·sched.sysmonnote);
        }
        runtime·unlock(&runtime·sched);
        save(runtime·getcallerpc(&dummy), runtime·getcallersp(&dummy));
    }

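The third part clears the M's mcache, detaches the M from its P (m->p->m = nil), and puts the P into Psyscall status. If a garbage collection is waiting to stop the world, the P moves straight from Psyscall to Pgcstop. Finally, g->stackguard0 = StackPreempt ensures that any stack-split check made in this state calls morestack, which throws: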

    m->mcache = nil;
    m->p->m = nil;
    runtime·atomicstore(&m->p->status, Psyscall);
    if(runtime·sched.gcwaiting) {
        runtime·lock(&runtime·sched);
        if (runtime·sched.stopwait > 0 && runtime·cas(&m->p->status, Psyscall, Pgcstop)) {
            if(--runtime·sched.stopwait == 0)
                runtime·notewakeup(&runtime·sched.stopnote);
        }
        runtime·unlock(&runtime·sched);
        save(runtime·getcallerpc(&dummy), runtime·getcallersp(&dummy));
    }
    // Goroutines must not split stacks in Gsyscall status (it would corrupt g->sched).
    // We set stackguard to StackPreempt so that first split stack check calls morestack.
    // Morestack detects this case and throws.
    g->stackguard0 = StackPreempt;
    m->locks--;
}

Note that after each runtime·lock(&runtime·sched) / runtime·unlock(&runtime·sched) pair, save is called again so that the recorded PC and SP are fresh.

runtime·entersyscallblock

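Unlike ·entersyscall, ·entersyscallblock does not wait for sysmon to take the P away; it gives the P up itself.

The first part is nearly identical to ·entersyscall: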

#pragma textflag NOSPLIT
void
·entersyscallblock(int32 dummy)
{
    P *p;
    m->locks++; // see comment in entersyscall
    // Leave SP around for GC and traceback.
    save(runtime·getcallerpc(&dummy), runtime·getcallersp(&dummy));
    g->syscallsp = g->sched.sp;
    g->syscallpc = g->sched.pc;
    g->syscallstack = g->stackbase;
    g->syscallguard = g->stackguard;
    g->status = Gsyscall;
    if(g->syscallsp < g->syscallguard-StackGuard || g->syscallstack < g->syscallsp) {
        // runtime·printf("entersyscall inconsistent %p [%p,%p]\n",
        //     g->syscallsp, g->syscallguard-StackGuard, g->syscallstack);
        runtime·throw("entersyscallblock");
    }

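The second part releases the P and hands it off, so that the P either continues running work on another M or becomes idle (Pidle):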

    p = releasep();
    handoffp(p);
    if(g->isbackground) // do not consider blocked scavenger for deadlock detection
        incidlelocked(1);
    // Resave for traceback during blocked call.
    save(runtime·getcallerpc(&dummy), runtime·getcallersp(&dummy));
    g->stackguard0 = StackPreempt; // see comment in entersyscall
    m->locks--;
}

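The syscall package uses runtime·entersyscall, whereas runtime·entersyscallblock is used inside the runtime itself.

For example, runtime·entersyscallblock is called by bool runtime·notetsleepg(Note *n, int64 ns), which lets a goroutine sleep on a Note until the Note is signalled or the timeout expires.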

This mechanism is used in several places in the runtime, for example in the timer module, which may be covered in detail later.

runtime·exitsyscall

The main job of this function is to recover from the syscall state. Its structure is fairly clear and consists of two steps:

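First, exitsyscallfast tries to obtain a P for the goroutine without going through the scheduler. If it returns true, the goroutine simply keeps running; if it returns false, runtime·exitsyscall falls through to the second step.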

// The goroutine g exited its system call.
// Arrange for it to run on a cpu again.
// This is called only from the go syscall library, not
// from the low-level system calls used by the runtime.
#pragma textflag NOSPLIT
void
runtime·exitsyscall(void)
{
    m->locks++; // see comment in entersyscall
    if(g->isbackground) // do not consider blocked scavenger for deadlock detection
        incidlelocked(-1);
    if(exitsyscallfast()) {
        // There's a cpu for us, so we can run.
        m->p->syscalltick++;
        g->status = Grunning;
        // Garbage collector isn't running (since we are),
        // so okay to clear gcstack and gcsp.
        g->syscallstack = (uintptr)nil;
        g->syscallsp = (uintptr)nil;
        m->locks--;
        if(g->preempt) {
            // restore the preemption request in case we've cleared it in newstack
            g->stackguard0 = StackPreempt;
        } else {
            // otherwise restore the real stackguard, we've spoiled it in entersyscall/entersyscallblock
            g->stackguard0 = g->stackguard;
        }
        return;
    }
    m->locks--;

Second, when exitsyscallfast fails, runtime·mcall is used to switch to the scheduler:

    // Call the scheduler.
    runtime·mcall(exitsyscall0);
    // Scheduler returned, so we're allowed to run now.
    // Delete the gcstack information that we left for
    // the garbage collector during the system call.
    // Must wait until now because until gosched returns
    // we don't know for sure that the garbage collector
    // is not running.
    g->syscallstack = (uintptr)nil;
    g->syscallsp = (uintptr)nil;
    m->p->syscalltick++;
}

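exitsyscall0 is run through runtime·mcall. Once the goroutine has been scheduled again, syscallstack and syscallsp are cleared and the P's syscalltick is incremented.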
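The fast and slow exit paths can be pictured with a toy model of the P status word. This is only an illustration: the type toyP and its methods are invented here, and the retake step is an assumption about how a monitor thread might reclaim a P stuck in a system call. It is not the runtime's code, and the real exitsyscallfast does more, such as looking for another idle P.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical status values; the names echo the runtime's, nothing more.
const (
	pIdle int32 = iota
	pRunning
	pSyscall
)

// toyP stands in for the runtime's P; only the status word is modelled.
type toyP struct{ status int32 }

// enterSyscall marks the P as being inside a system call.
func (p *toyP) enterSyscall() { atomic.StoreInt32(&p.status, pSyscall) }

// retake models a monitor taking the P away from a long system call.
func (p *toyP) retake() bool {
	return atomic.CompareAndSwapInt32(&p.status, pSyscall, pIdle)
}

// exitSyscallFast models the fast path: in this toy it succeeds only if
// nobody retook the P while the system call was in progress.
func (p *toyP) exitSyscallFast() bool {
	return atomic.CompareAndSwapInt32(&p.status, pSyscall, pRunning)
}

func main() {
	short := &toyP{}
	short.enterSyscall()
	fmt.Println("short syscall, fast path:", short.exitSyscallFast()) // true

	long := &toyP{}
	long.enterSyscall()
	long.retake() // the monitor got there first
	fmt.Println("long syscall, fast path:", long.exitSyscallFast()) // false
}
```

Both sides use compare-and-swap on the same word, so exactly one of them wins: either the returning goroutine keeps its P, or the P has already moved on and the goroutine must go back through the scheduler.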

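A few remarks

Go introduces the M and P concepts, gives threads executing a syscall special treatment, and decouples M from P where appropriate, mainly to increase concurrency and to reduce the problems caused by frequent or long-blocking syscalls. Keep in mind, though, that this mechanism has a cost of its own: migrating work between threads can hurt cache and TLB performance, for example.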

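Not every system call goes through ·entersyscall.

Some low-level syscalls inside the runtime, such as all the low-level lock operations (on Linux these use the futex mechanism), are issued directly: the corresponding Lock/Unlock operations invoke the underlying system call with no additional bookkeeping, mainly to keep low-level code fast.

In the syscall package, RawSyscall likewise bypasses runtime·entersyscall and runtime·exitsyscall.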