Zero-copy network transmission with vmsplice
First, understand that the presented code is a silly hack: it requires four system calls for each transmitted packet, so it will never be fast. I post it here because the code can be helpful in other situations where you want to use vmsplice.
Second, note that splice support is a protocol-specific feature. Raw sockets support it as of recent kernels. I believe 2.6.36 has it, but would not be surprised if it is lacking in 2.6.34, for instance (please leave a comment if you know when it was introduced).
The basic idea is to send data to a network socket without copying, using vmsplice(). But the vmsplice syscall will only splice into a pipe, not a network socket. Thus, the data first has to be appended to a pipe and then has to be moved to the socket using a splice() call. One extra complication is that vmsplice() with SPLICE_F_GIFT, as used here, works on entire pages: gifted data must be page-aligned in both address and length. In this example, we transport a single packet per page, which means that we have to flush the rest of the page contents to /dev/null. It is not impossible to fill a page with multiple packets and then splice() them to the network -- this indeed sounds much more worthwhile.
On to the code. I lifted this code from another project which always stored one packet per page, not necessarily page-aligned. That is most definitely not a requirement of splicing. In general, try to pack multiple packets into a page and have the first one be page-aligned.
/// transmit a packet using splice
static int
do_transmit(void *page, int pkt_offset, int pktlen)
{
    struct iovec iov[1];
    int ret, len_tail;

    // send page to kernel pipe
    iov[0].iov_base = page;
    iov[0].iov_len = getpagesize();
    ret = vmsplice(tx_splicefd[1], iov, 1, SPLICE_F_GIFT);
    if (ret != getpagesize()) {
        fprintf(stderr, "vmsplice()\n");
        return 1;
    }

    // splice unused headspace to /dev/null (because our packet is not aligned)
    ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, pkt_offset, SPLICE_F_MOVE);
    if (ret != pkt_offset) {
        fprintf(stderr, "splice() header\n");
        return 1;
    }

    // splice or sendfile packet to tx socket
    ret = splice(tx_splicefd[0], NULL, tx_rawsockfd, NULL, pktlen, SPLICE_F_MOVE);
    if (ret != pktlen) {
        fprintf(stderr, "splice() main\n");
        return 1;
    }

    // splice unused tailspace to /dev/null
    len_tail = getpagesize() - pktlen - pkt_offset;
    ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, len_tail, SPLICE_F_MOVE);
    if (ret != len_tail) {
        fprintf(stderr, "splice() footer\n");
        return 1;
    }

    return 0;
}
This code makes use of one pipe and two other file descriptors. tx_splicefd is a regular pipe, tx_nullfd is an open file handle to /dev/null and tx_rawsockfd is a raw IP socket. They were created as follows:
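// headers and globals; these belong at the top of the file, before do_transmit()
#define _GNU_SOURCE             // for splice() and vmsplice()
#include <arpa/inet.h>          // htonl(), htons()
#include <fcntl.h>              // open(), splice(), vmsplice()
#include <linux/if_ether.h>     // ETH_P_IP
#include <netinet/in.h>         // struct sockaddr_in, IPPROTO_RAW
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>            // struct iovec
#include <unistd.h>

/// IP address the tx socket connects to (127.0.0.1), in host byte order
#define CONF_TXHOST_HBO ((127 << 24) + 1)

static int tx_splicefd[2], tx_nullfd, tx_rawsockfd;

/// open the raw tx socket, the splice pipe and /dev/null
// @return 0 on success, -1 on failure
static int
do_init(void)
{
    struct sockaddr_in saddr;

    // open tx socket (IPPROTO_RAW: the packet must carry its own IP header)
    tx_rawsockfd = socket(PF_INET, SOCK_RAW, IPPROTO_RAW);
    if (tx_rawsockfd < 0) {
        perror("socket() tx");
        return -1;
    }

    // configure raw socket
    memset(&saddr, 0, sizeof(saddr));
    saddr.sin_family = AF_INET;
    saddr.sin_port = htons(ETH_P_IP);
    saddr.sin_addr.s_addr = htonl(CONF_TXHOST_HBO);
    if (connect(tx_rawsockfd, (struct sockaddr *) &saddr, sizeof(saddr))) {
        perror("connect() tx");
        return -1;
    }

    // when splicing, have to first send to kernel pipe, then to tx socket.
    // also, unwanted data must be flushed to /dev/null

    // create pipe to splice to kernel
    if (pipe(tx_splicefd)) {
        perror("pipe() tx");
        return -1;
    }

    // open /dev/null for splicing trash
    tx_nullfd = open("/dev/null", O_WRONLY);
    if (tx_nullfd < 0) {
        perror("open() /dev/null");
        return -1;
    }

    return 0;
}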