Zero-copy network transmission with vmsplice

This post completes a set that also includes asynchronous reading with PACKET_RX_RING and asynchronous writing with PACKET_TX_RING. In this post, we look at sending packets out over a raw socket in zero-copy fashion.

First, understand that the presented code is a silly hack: it requires four system calls for each transmitted packet (one vmsplice() and three splice() calls), so it will never be fast. I post it here because the code can be helpful in other situations where you want to use vmsplice.


Second, note that splice support is a protocol-specific feature. Raw sockets support it as of recent kernels. I believe 2.6.36 has it, but would not be surprised if it is lacking in 2.6.34, for instance (please leave a comment if you know when it was introduced).


The basic idea is to use vmsplice() to send data to a network socket without copying. However, the vmsplice syscall can only splice into a pipe, not into a network socket. The data therefore first has to be appended to a pipe and then moved to the socket with a splice() call. One extra complication is that vmsplice() works on entire pages (as it relies on memory protection mechanisms). In this example, we transport a single packet per page, which means that we have to flush the rest of the page contents to /dev/null. It is not impossible to fill a page with multiple packets and then splice() them to the network; this indeed sounds much more worthwhile.
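The two-step path described above can be sketched in isolation. This is a minimal illustration, not the code from the original project: the helper name splice_buffer_to_fd is hypothetical, it passes no SPLICE_F_GIFT flag (so the buffer does not have to be page-aligned), and it assumes the buffer is small enough to fit in the pipe in one go.

```c
#define _GNU_SOURCE   // for vmsplice() and splice()
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/// hypothetical helper: hand a user buffer to a pipe, then move it to outfd
//  @return number of bytes moved to outfd, or -1 on failure
static ssize_t
splice_buffer_to_fd(int pipefd[2], int outfd, void *buf, size_t len)
{
  struct iovec iov = { .iov_base = buf, .iov_len = len };
  ssize_t queued;

  assert(buf != NULL && len > 0);

  // step 1: attach the user memory to the write end of the pipe
  queued = vmsplice(pipefd[1], &iov, 1, 0);
  if (queued < 0)
    return -1;

  // step 2: move whatever was queued from the read end to the destination
  return splice(pipefd[0], NULL, outfd, NULL, (size_t) queued, SPLICE_F_MOVE);
}
```

Here outfd can be any descriptor that accepts splice() output, for instance /dev/null or a second pipe. Whether a given socket type accepts it depends on the protocol and kernel version.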

On to the code. I lifted this code from another project, which always stored one packet per page, not necessarily page-aligned. That is most definitely not a requirement of splicing. In general, try to pack multiple packets into a page and have the first one be page-aligned.
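As a brief aside on that packing advice, here is one way the bookkeeping could look. It is only a sketch under assumptions: struct tx_page and both functions are hypothetical names, and the packets are copied into the page with memcpy() purely for illustration (a real zero-copy path would generate them in place).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/// hypothetical bookkeeping for one page that holds several packets
struct tx_page {
  char   *base;   // page-aligned start of the buffer
  size_t  size;   // capacity in bytes (one page)
  size_t  used;   // bytes filled so far
};

/// allocate one page-aligned page
//  @return 0 on success, -1 on failure
static int
tx_page_init(struct tx_page *pg)
{
  void *mem;

  pg->size = (size_t) getpagesize();
  pg->used = 0;
  if (posix_memalign(&mem, pg->size, pg->size))
    return -1;
  pg->base = mem;
  return 0;
}

/// append a packet directly behind the previous one
//  @return the packet's offset in the page, or -1 if it does not fit
static long
tx_page_append(struct tx_page *pg, const void *pkt, size_t pktlen)
{
  size_t off = pg->used;

  assert(pg->used <= pg->size);
  if (pktlen > pg->size - pg->used)
    return -1;
  memcpy(pg->base + off, pkt, pktlen);
  pg->used += pktlen;
  return (long) off;
}
```

The first packet lands at offset 0, so it is page-aligned, and the offsets returned for later packets are what a transmit routine would need to locate each one. The code lifted from the original project follows.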

 
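#define _GNU_SOURCE   // for vmsplice() and splice(); must precede all includes
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/// transmit a packet using splice
//  @param page  page-aligned buffer of getpagesize() bytes
//  @return 0 on success, 1 on failure
static int
do_transmit(void *page, int pkt_offset, int pktlen)
{
  struct iovec iov[1];
  ssize_t ret;
  int len_tail;

  // send page to kernel pipe
  iov[0].iov_base = page;
  iov[0].iov_len = getpagesize();

  // SPLICE_F_GIFT needs data that is page-aligned in both address and length;
  // the page must not be modified after it has been gifted
  ret = vmsplice(tx_splicefd[1], iov, 1, SPLICE_F_GIFT);
  if (ret != getpagesize()) {
    fprintf(stderr, "vmsplice()\n");
    return 1;
  }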

  // splice unused headspace to /dev/null (because our packet is not aligned)
  ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, pkt_offset, SPLICE_F_MOVE);
  if (ret != pkt_offset) {
    fprintf(stderr, "splice() header\n");
    return 1;
  }

  // splice or sendfile packet to tx socket
  ret = splice(tx_splicefd[0], NULL, tx_rawsockfd, NULL, pktlen, SPLICE_F_MOVE);
  if (ret != pktlen) {
    fprintf(stderr, "splice() main\n");
    return 1;
  }

  // splice unused tailspace to /dev/null
  len_tail = getpagesize() - pktlen - pkt_offset;
  ret = splice(tx_splicefd[0], NULL, tx_nullfd, NULL, len_tail, SPLICE_F_MOVE);
  if (ret != len_tail) {
    fprintf(stderr, "splice() footer\n");
    return 1;
  }
  return 0;
}
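do_transmit() treats any short splice() as an error. For the two calls that discard data to /dev/null, a retry loop is a gentler alternative. The sketch below shows one possible shape; splice_all is a hypothetical helper, not part of the original project. It is not meant for the call that feeds the socket, where a partial transfer would presumably split the packet.

```c
#define _GNU_SOURCE   // for splice()
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/// hypothetical helper: move exactly len bytes from a pipe to fd_out
//  @return 0 on success, -1 on error or if the pipe ran dry early
static int
splice_all(int fd_in, int fd_out, size_t len)
{
  assert(fd_in >= 0 && fd_out >= 0);

  while (len > 0) {
    ssize_t ret = splice(fd_in, NULL, fd_out, NULL, len, SPLICE_F_MOVE);

    if (ret < 0) {
      if (errno == EINTR)
        continue;       // interrupted by a signal: retry
      return -1;
    }
    if (ret == 0)
      return -1;        // no data left and no writer on the pipe
    len -= (size_t) ret;
  }
  return 0;
}
```

With this helper, the header and footer steps would each become a single call such as splice_all(tx_splicefd[0], tx_nullfd, pkt_offset).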

do_transmit() makes use of one pipe and two other file descriptors: tx_splicefd is a regular pipe, tx_nullfd is an open file descriptor for /dev/null, and tx_rawsockfd is a raw IP socket. They were created as follows:
 
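#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/// destination IP address (127.0.0.1) in host byte order
#define CONF_TXHOST_HBO ((127U << 24) + 1)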

static int tx_splicefd[2], tx_nullfd, tx_rawsockfd;

/// open a RAW or UDP socket for retransmission
//  @return 0 on success, -1 on failure
static int
do_init(void)
{
  struct sockaddr_in saddr;

  // open tx socket
  tx_rawsockfd = socket(PF_INET, SOCK_RAW, IPPROTO_RAW);
  if (tx_rawsockfd < 0) {
    perror("socket() tx");
    return -1;
  }

  // configure raw socket: connect() sets the default destination address
  memset(&saddr, 0, sizeof(saddr));
  saddr.sin_family      = AF_INET;
  saddr.sin_port        = 0;  // raw(7): ignored when sending, should be 0
  saddr.sin_addr.s_addr = htonl(CONF_TXHOST_HBO);

  if (connect(tx_rawsockfd, (struct sockaddr *) &saddr, sizeof(saddr))) {
    perror("connect() tx");
    return -1;
  }

  // when splicing, have to first send to kernel pipe, then to tx socket.
  // also, unwanted data must be flushed to /dev/null
  // create pipe to splice to kernel
  if (pipe(tx_splicefd)) {
    perror("pipe() tx");
    return -1;
  }

  // open /dev/null for splicing trash
  tx_nullfd = open("/dev/null", O_WRONLY);
  if (tx_nullfd < 0) {
    perror("open() /dev/null");
    return -1;
  }

  return 0;
}
