Zero-copy network transmission with vmsplice
First, understand that the code presented here is a silly hack: it requires four system calls for each transmitted packet, so it will never be fast. I post it here because the code can be helpful in other situations where you want to use vmsplice.
Second, note that splice support is a protocol-specific feature. Raw sockets support it as of recent kernels. I believe 2.6.36 has it, but would not be surprised if it is lacking in 2.6.34, for instance (please leave a comment if you know when it was introduced).
The basic idea is to send data to a network socket without copying, using vmsplice(). However, the vmsplice syscall will only splice into a pipe, not a network socket. Thus, the data first has to be appended to a pipe and then moved to the socket with a splice() call. One extra complication is that vmsplice() works on entire pages (as it relies on memory protection mechanisms). In this example, we transport a single packet per page, which means that we have to flush the rest of the page contents to /dev/null. It is not impossible to fill a page with multiple packets and then splice() them to the network; this indeed sounds much more worthwhile.
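To make the two-step path concrete, here is a minimal sketch of "vmsplice() into a pipe, then splice() out of it". The helper name send_via_pipe is hypothetical and not part of the original project. It passes flags 0 to vmsplice() rather than SPLICE_F_GIFT, so the buffer does not need to be page-aligned; the vmsplice(2) man page says that without the gift flag a later splice() with SPLICE_F_MOVE has to copy, so treat this as an illustration of the plumbing only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: push len bytes from buf into an existing pipe
 * with vmsplice(), then move them from the pipe to out_fd with splice().
 * Returns the number of bytes moved, or -1 on error.
 * Assumes len fits in the pipe, otherwise vmsplice() may block. */
static ssize_t send_via_pipe(int pipefd[2], int out_fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t in, out;

    /* step 1: user memory -> pipe (flags 0: no gift, no alignment needed) */
    in = vmsplice(pipefd[1], &iov, 1, 0);
    if (in < 0)
        return -1;

    /* step 2: pipe -> destination descriptor (a socket in this article) */
    out = splice(pipefd[0], NULL, out_fd, NULL, (size_t)in, SPLICE_F_MOVE);
    if (out != in)
        return -1;

    return out;
}
```

Any descriptor that accepts splice() output can stand in for out_fd when experimenting, for example the write end of a second pipe.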
On to the code. I lifted this code from another project, which always stored one packet per page, not necessarily page-aligned. That is most definitely not a requirement of splicing. In general, try to pack multiple packets into a page and have the first be page-aligned.
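The multiple-packets-per-page variant could look roughly like the sketch below. It is not from the original project: the function name transmit_packed_page and its parameters are hypothetical, and it assumes the packets sit back to back starting at offset 0 of a page-aligned page, and that each splice() call to the socket becomes one packet (the same assumption the single-packet code makes). For n packets this costs one vmsplice(), n splice() calls, and at most one more splice() for the tail, instead of four calls per packet.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical sketch: gift one page holding npkts contiguous packets
 * to the pipe, splice each packet to sockfd, then flush the unused
 * tail to nullfd. Returns 0 on success, -1 on failure. On failure
 * the pipe may still hold unread data. */
static int transmit_packed_page(int pipefd[2], int sockfd, int nullfd,
                                void *page, const size_t *pkt_len,
                                size_t npkts)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    struct iovec iov = { .iov_base = page, .iov_len = page_size };
    size_t used = 0, tail, i;

    for (i = 0; i < npkts; i++)
        used += pkt_len[i];
    if (used > page_size)
        return -1;

    /* one vmsplice() for the whole page */
    if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) != (ssize_t)page_size)
        return -1;

    /* one splice() per packet */
    for (i = 0; i < npkts; i++) {
        if (splice(pipefd[0], NULL, sockfd, NULL, pkt_len[i],
                   SPLICE_F_MOVE) != (ssize_t)pkt_len[i])
            return -1;
    }

    /* one splice() for whatever is left of the page */
    tail = page_size - used;
    if (tail > 0 && splice(pipefd[0], NULL, nullfd, NULL, tail,
                           SPLICE_F_MOVE) != (ssize_t)tail)
        return -1;

    return 0;
}
```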
    #define _GNU_SOURCE     // for splice() and vmsplice()
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    // tx_splicefd, tx_nullfd and tx_rawsockfd are file-scope variables
    // that are set up in do_init()

    /// transmit a packet using splice
    static int do_transmit(void *page, int pkt_offset, int pktlen)
    {
        struct iovec iov[1];
        int ret, len_tail;

        // send page to kernel pipe
        iov[0].iov_base = page;
        iov[0].iov_len = getpagesize();
        ret = vmsplice(tx_splicefd[1], iov, 1, SPLICE_F_GIFT);
        if (ret != getpagesize()) {
            fprintf(stderr, "vmsplice()\n");
            return 1;
        }

        // splice unused headspace to /dev/null (because our packet is not aligned)
        ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, pkt_offset, SPLICE_F_MOVE);
        if (ret != pkt_offset) {
            fprintf(stderr, "splice() header\n");
            return 1;
        }

        // splice or sendfile packet to tx socket
        ret = splice(tx_splicefd[0], NULL, tx_rawsockfd, NULL, pktlen, SPLICE_F_MOVE);
        if (ret != pktlen) {
            fprintf(stderr, "splice() main\n");
            return 1;
        }

        // splice unused tailspace to /dev/null
        len_tail = getpagesize() - pktlen - pkt_offset;
        ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, len_tail, SPLICE_F_MOVE);
        if (ret != len_tail) {
            fprintf(stderr, "splice() footer\n");
            return 1;
        }

        return 0;
    }
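The function expects `page` to point at one whole page, since it hands exactly getpagesize() bytes to vmsplice() with SPLICE_F_GIFT, and the man page requires gifted data to be page-aligned in both address and length. The original post does not show how that page is obtained, so the following is only a sketch under my own assumptions: alloc_packet_page is a hypothetical helper, and the memcpy() stands in for whatever produces the packet (in a real zero-copy design the packet would be written or received directly into the page).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one fresh, page-aligned anonymous page and
 * place pktlen bytes of pkt at pkt_offset inside it.
 * Returns the page, or NULL if the packet does not fit or mmap() fails. */
static void *alloc_packet_page(const void *pkt, size_t pktlen,
                               size_t pkt_offset)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *page;

    if (pkt_offset > page_size || pktlen > page_size - pkt_offset)
        return NULL;

    /* anonymous mappings are page-aligned and zero-filled */
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return NULL;

    memcpy((char *)page + pkt_offset, pkt, pktlen);
    return page;
}
```

A caller would then pass the result on, along the lines of do_transmit(page, 64, pktlen), and is responsible for releasing the mapping with munmap(). A fresh mapping per gifted page is used here because vmsplice(2) says the application may not modify gifted memory afterwards.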
This code makes use of one pipe and two other file descriptors. tx_splicefd is a regular pipe, tx_nullfd is an open file handle to /dev/null and tx_rawsockfd is a raw IP socket. They were created as follows:
    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <linux/if_ether.h>     // ETH_P_IP
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /// IP address in host byte order: 127.0.0.1 (the connect() target below)
    #define CONF_TXHOST_HBO ((127 << 24) + 1)

    static int tx_splicefd[2], tx_nullfd, tx_rawsockfd;

    /// open a RAW or UDP socket for retransmission
    /// @return 0 on success, -1 on failure
    static int do_init(void)
    {
        struct sockaddr_in saddr;

        // open tx socket
        tx_rawsockfd = socket(PF_INET, SOCK_RAW, IPPROTO_RAW);
        if (tx_rawsockfd < 0) {
            perror("socket() tx");
            return -1;
        }

        // configure raw socket
        memset(&saddr, 0, sizeof(saddr));
        saddr.sin_family = AF_INET;
        saddr.sin_port = htons(ETH_P_IP);   // kept from the original; not meaningful as a port on a raw socket
        saddr.sin_addr.s_addr = htonl(CONF_TXHOST_HBO);
        if (connect(tx_rawsockfd, (struct sockaddr *)&saddr, sizeof(saddr))) {
            perror("connect() tx");
            return -1;
        }

        // when splicing, have to first send to kernel pipe, then to tx socket.
        // also, unwanted data must be flushed to /dev/null

        // create pipe to splice to kernel
        if (pipe(tx_splicefd)) {
            perror("pipe() tx");
            return -1;
        }

        // open /dev/null for splicing trash
        tx_nullfd = open("/dev/null", O_WRONLY);
        if (tx_nullfd < 0) {
            perror("open() /dev/null");
            return -1;
        }

        return 0;
    }