Asynchronous packet socket reading with PACKET_RX_RING

2016 update

This post is quite old by now. For a more recent example, take a look at github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c

Since Linux 2.6.2x, processes can read network packets asynchronously using a packet socket ring buffer. When a process sets the PACKET_RX_RING socket option (at level SOL_PACKET) on a packet socket, the kernel allocates a ring buffer to hold packets. It then copies every packet that the caller would otherwise have had to fetch with read() into this ring. The caller maps the ring into its virtual memory by calling mmap() on the packet socket and from then on can read packets without issuing any system calls. It signals the kernel that it has finished processing a packet by setting a status value in a header structure that precedes each packet. Once the caller has processed all outstanding packets, it can block by issuing a select() or poll() on the packet socket.

This snippet shows how to set up a packet socket with a ring buffer:

 

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>

#include <arpa/inet.h>
#include <netinet/if_ether.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/stat.h>

#include <linux/if_packet.h>

/// The number of frames in the ring
//  This number is not set in stone. Nor are block_size, block_nr or frame_size
#define CONF_RING_FRAMES          128

/// Nominal offset of data from start of frame
//  The kernel reports the actual offsets per frame in tp_mac and tp_net
#define PKT_OFFSET      (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + \
                         TPACKET_ALIGN(sizeof(struct sockaddr_ll)))

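/// (unimportant) macro for loud failure
//  No trailing semicolon after while(0), so the macro acts as one statement;
//  "%s" keeps msg from being interpreted as a format string
#define RETURN_ERROR(lvl, msg) \
  do {                    \
    fprintf(stderr, "%s", msg); \
    return lvl;            \
  } while(0)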

/// Initialize a packet socket ring buffer
//  @param ringtype is one of PACKET_RX_RING or PACKET_TX_RING
static char *
init_packetsock_ring(int fd, int ringtype)
{
  struct tpacket_req tp;
  char *ring;

  // tell kernel to export data through mmap()ped ring
  tp.tp_block_size = CONF_RING_FRAMES * getpagesize();
  tp.tp_block_nr = 1;
  tp.tp_frame_size = getpagesize();
  tp.tp_frame_nr = CONF_RING_FRAMES;
  if (setsockopt(fd, SOL_PACKET, ringtype, (void*) &tp, sizeof(tp)))
    RETURN_ERROR(NULL, "setsockopt() ring\n");

  // open ring
  ring = mmap(0, tp.tp_block_size * tp.tp_block_nr,
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  // mmap() signals failure with MAP_FAILED, not NULL
  if (ring == MAP_FAILED)
    RETURN_ERROR(NULL, "mmap()\n");

  return ring;
}

/// Create a packet socket. If param ring is not NULL, the buffer is mapped
//  @param ring will, if set, point to the mapped ring on return
//  @return the socket fd
static int
init_packetsock(char **ring, int ringtype)
{
  int fd;

  // open packet socket
  fd = socket(PF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
  if (fd < 0)
    RETURN_ERROR(-1, "Root privileges are required\nsocket() rx. \n");

  if (ring) {
    *ring = init_packetsock_ring(fd, ringtype);

    if (!*ring) {
      close(fd);
      return -1;
    }
  }

  return fd;
}

static int
exit_packetsock(int fd, char *ring)
{
  if (munmap(ring, CONF_RING_FRAMES * getpagesize())) {
    perror("munmap");
    return 1;
  }

  if (close(fd)) {
    perror("close");
    return 1;
  }

  return 0;
}

/// Example application that opens a packet socket with rx_ring
int
init_main(int argc, char **argv)
{
  char *ring;
  int fd;

  fd = init_packetsock(&ring, PACKET_RX_RING);
  if (fd < 0)
    return 1;

  // TODO: add processing. See next snippet.

  if (exit_packetsock(fd, ring))
    return 1;

  return 0;
}
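
The ring dimensions chosen in init_packetsock_ring() can be changed, but not freely: the kernel's packet_mmap documentation lists a few constraints on struct tpacket_req, and setsockopt() fails when they are violated. The following is a minimal sketch of a pre-flight check under those documented rules. The helper names ring_req_valid() and ring_req_bytes() are hypothetical and not part of the original source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#include <linux/if_packet.h>

/// Check a ring request against the documented constraints
//  (hypothetical helper, not from the original packet-rx-ring.c)
static bool
ring_req_valid(const struct tpacket_req *req)
{
  unsigned int page = (unsigned int) getpagesize();
  unsigned int frames_per_block;

  // a block must be a non-empty multiple of the page size
  if (req->tp_block_size == 0 || req->tp_block_size % page)
    return false;

  // a frame must be larger than the headers and TPACKET_ALIGNMENT aligned
  if (req->tp_frame_size <= TPACKET_HDRLEN ||
      req->tp_frame_size % TPACKET_ALIGNMENT)
    return false;

  // frames never span blocks, so at least one frame must fit in a block
  frames_per_block = req->tp_block_size / req->tp_frame_size;
  if (frames_per_block == 0)
    return false;

  // the frame count must match the block layout exactly
  return req->tp_frame_nr == frames_per_block * req->tp_block_nr;
}

/// Length to pass to mmap() and munmap() for a valid request
static size_t
ring_req_bytes(const struct tpacket_req *req)
{
  assert(ring_req_valid(req));
  return (size_t) req->tp_block_size * req->tp_block_nr;
}
```

With the values used above (one block of CONF_RING_FRAMES pages, one page per frame) the check passes, and ring_req_bytes() returns the same length that the snippet hands to mmap() and munmap().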

This snippet shows how to process packets at runtime using the packet ring. The first function reads a single packet from the ring, the second updates the header in the ring to release the frame back to the kernel.
 
static int rxring_offset;

/// Blocking read, returns a single packet (from packet ring)
static void *
process_rx(const int fd, char *rx_ring)
{
  struct tpacket_hdr *header;
  struct pollfd pollset;
  int ret;

  // fetch a frame
  header = (void *) rx_ring + (rxring_offset * getpagesize());
  assert((((unsigned long) header) & (getpagesize() - 1)) == 0);

  // TP_STATUS_USER means that the process owns the packet.
  // When a slot does not have this flag set, the frame is not
  // ready for consumption.
  while (!(header->tp_status & TP_STATUS_USER)) {

    // if none available: wait on more data
    pollset.fd = fd;
    pollset.events = POLLIN;
    pollset.revents = 0;
    ret = poll(&pollset, 1, -1 /* negative means infinite */);
    if (ret < 0) {
      if (errno != EINTR)
        RETURN_ERROR(NULL, "poll()\n");
      return NULL;
    }
  }

  // check data
  if (header->tp_status & TP_STATUS_COPY)
    RETURN_ERROR(NULL, "skipped: incomplete packet\n");
  if (header->tp_status & TP_STATUS_LOSING)
    fprintf(stderr, "dropped packets detected\n");

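  // return encapsulated packet; the kernel stores the offset of the
  // network-layer data, relative to the frame start, in tp_net
  return ((void *) header) + header->tp_net;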
}

// Release the slot back to the kernel
static void
process_rx_release(char *rx_ring)
{
  struct tpacket_hdr *header;

  // clear status to grant to kernel
  header = (void *) rx_ring + (rxring_offset * getpagesize());
  header->tp_status = 0;

  // update consumer pointer; the mask only wraps correctly while
  // CONF_RING_FRAMES is a power of two
  rxring_offset = (rxring_offset + 1) & (CONF_RING_FRAMES - 1);
}
This code was copied from a project that required two separate functions. In most cases, you will want to read, process, and release a frame in a single loop. I'm not particularly proud of using a global variable for the current ring offset. Download the complete source code as packet-rx-ring.c
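
As a rough illustration of the single-loop variant, the sketch below keeps the ring position in a small struct instead of a global and hands each ready frame to a callback before releasing it. All names here (struct rx_ring, rx_handler, rx_ring_drain, count_bytes) are hypothetical and not taken from the original source. It assumes the TPACKET_V1 frame layout used above, and it uses the GCC/Clang builtin __sync_synchronize() as a memory barrier, which the original code does not do.

```c
#include <assert.h>
#include <stddef.h>

#include <linux/if_packet.h>

/// Ring state (hypothetical; replaces the global rxring_offset)
struct rx_ring {
  char *base;               // start of the mmap()ped ring
  unsigned int frame_size;  // tp_frame_size used at setup
  unsigned int frame_nr;    // tp_frame_nr used at setup
  unsigned int next;        // index of the next frame to inspect
};

typedef void (*rx_handler)(const void *pkt, unsigned int len, void *arg);

/// Example handler: adds the captured length to a running total
static void
count_bytes(const void *pkt, unsigned int len, void *arg)
{
  (void) pkt;
  *(unsigned long *) arg += len;
}

/// Read, process and release every frame that user space currently owns
//  @return the number of frames passed to fn
static unsigned int
rx_ring_drain(struct rx_ring *r, rx_handler fn, void *arg)
{
  unsigned int count = 0;

  assert(r->frame_nr > 0);

  for (;;) {
    struct tpacket_hdr *hdr = (struct tpacket_hdr *)
        (r->base + (size_t) r->next * r->frame_size);

    // stop at the first frame that the kernel still owns
    if (!(hdr->tp_status & TP_STATUS_USER))
      break;

    // skip truncated frames; captured data starts at offset tp_mac
    if (!(hdr->tp_status & TP_STATUS_COPY)) {
      fn((const char *) hdr + hdr->tp_mac, hdr->tp_snaplen, arg);
      count++;
    }

    // finish all accesses to the frame before handing it back
    __sync_synchronize();
    hdr->tp_status = TP_STATUS_KERNEL;

    r->next = (r->next + 1) % r->frame_nr;
  }

  return count;
}
```

A caller would typically alternate between rx_ring_drain() and a poll() on the socket for POLLIN, as process_rx() does above, so that it only sleeps when the ring holds no ready frames. Using a modulo instead of a mask means frame_nr does not have to be a power of two.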

Comments

  1. Hi Willem,

    Thanks for the great example! It got me jump-started on fast raw socket programming.

    I ran into the following weird result: I have two NICs that are looped back, eth2 and eth3.
    When I use pkt-gen to send packets on eth2 and packet-rx-ring to receive them, it seems that every packet sent by pkt-gen is received twice by packet-rx-ring: once on eth2 and once on eth3.

    Thanks
    Jim

  2. That's right, Jim. Packet sockets intercept both incoming and outgoing packets.

    1. How can I avoid that, i.e., outgoing packets being caught in the rx_ring buffer?

      I am using 1 tx_ring & rx_ring. Send a packet- {0, 0, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0} - thru TX. Then "PACKET_OUTGOING" is caught in rx_ring.

      Is there a socket option to avoid this ?

    2. Hi guys, I have set TPACKET_V2, but when I receive a packet that has a VLAN ID set, the VLAN is always 0! Could you help me?

  3. Hi Willem,

    I guess you're not reading these replies anymore, but the link to the complete source code now redirects to some travel page. Any chance of an updated link?

    1. Thanks for the heads-up. Apparently I was not being notified, indeed. I believe I solved it.

  4. Found the source code here:
    https://github.com/vieites4/rawsockets/blob/master/docs/snippets/packet-rx-ring.c

