Linux Networking Deep Dive - Part 5: Advanced Features

Welcome to Part 5 of our comprehensive Linux networking series! In this installment, we dive deep into the most advanced networking features available in modern Linux systems. We’ll explore eBPF (extended Berkeley Packet Filter) for programmable packet processing, XDP (eXpress Data Path) for ultra-high performance networking, and sophisticated optimization techniques that can dramatically improve network performance.

These technologies represent the cutting edge of Linux networking, enabling everything from high-frequency trading systems to cloud-native network functions. Understanding these advanced features is essential for building next-generation network applications and infrastructure.

What You’ll Learn

By the end of this post, you’ll understand:

  • eBPF Architecture: How the eBPF virtual machine enables safe, efficient kernel programming
  • XDP Implementation: Ultra-fast packet processing before the network stack
  • Performance Optimization: Advanced techniques for maximum network throughput
  • Real-world Applications: How these technologies power modern network infrastructure

eBPF: Revolutionizing Kernel Programming

The eBPF Virtual Machine

Extended Berkeley Packet Filter (eBPF) has transformed Linux networking by providing a safe, efficient way to run programs in kernel space. Unlike traditional kernel modules, eBPF programs are verified for safety and can be loaded dynamically without rebooting the system.

The core eBPF program structure contains essential metadata and execution context:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
struct bpf_prog {
u16 pages; /* Number of allocated pages */
u16 jited:1, /* Program is JIT compiled */
jit_requested:1, /* Program was requested to be JITed */
gpl_compatible:1, /* Is filter GPL compatible? */
cb_access:1, /* Program accesses skb control block */
dst_needed:1, /* Program needs dst entry */
blinded:1, /* Was blinded */
is_func:1, /* Program is a bpf function */
kprobe_override:1; /* Do we override a kprobe? */
enum bpf_prog_type type; /* Type of BPF program */
u32 len; /* Number of filter blocks */
u32 jited_len; /* Size of jited insns in bytes */
unsigned int (*bpf_func)(const void *ctx, const struct bpf_insn *filter);
struct bpf_insn insnsi[0]; /* Instructions for interpreter */
};

This structure encapsulates everything needed to execute an eBPF program safely and efficiently, including JIT compilation status, program type, and the actual instruction sequence.

Program Verification and Safety

One of eBPF’s key innovations is its comprehensive verification system that ensures programs are safe before execution. The verifier performs static analysis to prevent infinite loops, out-of-bounds memory access, and other potentially dangerous operations.

The verification process includes several critical safety checks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
bpf_user_fixup_func_t fixup_func_proto)
{
struct bpf_verifier_env *env;
int ret = -EINVAL;

/* Create verification environment */
env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
env->prog = *prog;
env->ops = bpf_verifier_ops[env->prog->type];

/* Verify program safety */
ret = do_check(env);
if (ret < 0)
goto skip_full_check;

/* Apply network-specific fixups for XDP/TC programs */
if (env->prog->type == BPF_PROG_TYPE_XDP ||
env->prog->type == BPF_PROG_TYPE_SCHED_CLS) {
ret = fixup_bpf_calls(env);
}

/* Check stack depth and optimize */
if (!ret && env->prog->aux->tested_insn_cnt == env->prog->len)
ret = opt_subreg_zext_lo32_rnd_hi32(env, attr);

return ret;
}

JIT Compilation for Performance

eBPF programs can be Just-In-Time compiled to native machine code for maximum performance. The JIT compiler translates eBPF instructions to optimized x86-64 assembly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *prog_buf,
int oldproglen, struct jit_context *ctx)
{
struct bpf_insn *insn = bpf_prog->insnsi;

for (i = 0; i < insn_cnt; i++, insn++) {
u8 opcode = BPF_OP(insn->code);

switch (insn->code) {
case BPF_ALU | BPF_ADD | BPF_X:
case BPF_ALU64 | BPF_ADD | BPF_X:
/* Emit x86 ADD instruction */
maybe_emit_mod(&prog, dst_reg, src_reg,
BPF_CLASS(insn->code) == BPF_ALU64);
EMIT2(0x01, add_2reg(0xC0, dst_reg, src_reg));
break;

/* ... more instruction encodings ... */
}
}

return proglen;
}

This JIT compilation enables eBPF programs to achieve near-native performance while maintaining safety guarantees.

XDP: Ultra-High Performance Packet Processing

XDP Architecture and Hook Points

XDP (eXpress Data Path) represents the fastest packet processing path in Linux, operating at three different levels depending on hardware capabilities and performance requirements:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Packet Processing Pipeline:
┌─────────────────┐
│ Hardware │
└─────────┬───────┘

┌─────────▼───────┐
│ XDP_HW │ ← Hardware offload (SmartNICs)
└─────────┬───────┘

┌─────────▼───────┐
│ NIC Driver │
└─────────┬───────┘

┌─────────▼───────┐
│ XDP_DRV │ ← Driver mode (fastest software)
└─────────┬───────┘

┌─────────▼───────┐
│ sk_buff_alloc │
└─────────┬───────┘

┌─────────▼───────┐
│ XDP_SKB │ ← Generic mode (fallback)
└─────────────────┘

XDP Program Structure

XDP programs operate on a minimal packet representation that provides direct access to packet data without the overhead of sk_buff allocation:

1
2
3
4
5
6
7
8
9
struct xdp_buff {
void *data; /* Packet data start */
void *data_end; /* Packet data end */
void *data_meta; /* Metadata area */
void *data_hard_start; /* Beginning of frame */
unsigned long handle; /* Driver-specific handle */
struct xdp_rxq_info *rxq; /* RX queue info */
u32 frame_sz; /* Total frame size */
};

XDP Return Actions

XDP programs must return one of five actions that determine packet fate:

1
2
3
4
5
6
7
enum xdp_action {
XDP_ABORTED = 0, /* Drop packet and trace exception */
XDP_DROP, /* Drop packet silently */
XDP_PASS, /* Allow packet to proceed normally */
XDP_TX, /* Transmit packet from same interface */
XDP_REDIRECT, /* Redirect to another interface/CPU */
};

Practical XDP Example: DDoS Mitigation

Here’s a real-world XDP program for basic DDoS mitigation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

/* Rate limiting map */
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 10000);
__type(key, __u32);
__type(value, __u64);
} rate_map SEC(".maps");

SEC("xdp")
int ddos_mitigation(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data;
struct iphdr *ip;
__u32 src_ip;
__u64 *count, new_count = 1;
__u64 current_time = bpf_ktime_get_ns();

/* Bounds checking for Ethernet header */
if ((void *)(eth + 1) > data_end)
return XDP_ABORTED;

/* Only process IPv4 packets */
if (eth->h_proto != __constant_htons(ETH_P_IP))
return XDP_PASS;

ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_ABORTED;

src_ip = ip->saddr;
count = bpf_map_lookup_elem(&rate_map, &src_ip);

if (count) {
/* Check if rate limit exceeded (1000 packets/second) */
if (*count > 1000) {
return XDP_DROP;
}
__sync_fetch_and_add(count, 1);
} else {
bpf_map_update_elem(&rate_map, &src_ip, &new_count, BPF_ANY);
}

return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Traffic Control with eBPF

TC eBPF Integration

Traffic Control (TC) eBPF programs provide programmable packet classification and action execution at both ingress and egress points:

1
2
3
4
5
6
7
8
9
10
struct cls_bpf_prog {
struct bpf_prog *filter;
struct list_head link;
struct tcf_result res;
bool exts_integrated;
bool offloaded;
u32 gen_flags;
unsigned int handle;
struct tcf_proto *tp;
};

Advanced Traffic Shaping Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
/* Traffic shaping with eBPF */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1024);
__type(key, __u32);
__type(value, __u64);
} bandwidth_map SEC(".maps");

SEC("classifier")
int traffic_shaper(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct ethhdr *eth = data;
struct iphdr *ip;
__u32 src_ip;
__u64 *bytes_count, packet_size;

/* Parse headers with bounds checking */
if ((void *)(eth + 1) > data_end)
return TC_ACT_OK;

if (eth->h_proto != __constant_htons(ETH_P_IP))
return TC_ACT_OK;

ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return TC_ACT_OK;

src_ip = ip->saddr;
packet_size = data_end - data;
bytes_count = bpf_map_lookup_elem(&bandwidth_map, &src_ip);

if (bytes_count) {
/* Check bandwidth limit (1MB/sec = 1048576 bytes) */
if (*bytes_count > 1048576) {
return TC_ACT_SHOT; /* Drop packet */
}
__sync_fetch_and_add(bytes_count, packet_size);
} else {
bpf_map_update_elem(&bandwidth_map, &src_ip, &packet_size, BPF_ANY);
}

return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

eBPF Maps: High-Performance Data Structures

Map Types and Use Cases

eBPF maps provide various data structures optimized for different use cases:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
/* Hash map for O(1) lookups */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10000);
__type(key, __u32);
__type(value, struct connection_info);
} connection_map SEC(".maps");

/* Per-CPU maps for lock-free statistics */
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
__uint(max_entries, 1000);
__type(key, __u32);
__type(value, struct stats);
} stats_map SEC(".maps");

/* LRU map for automatic memory management */
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 5000);
__type(key, struct flow_key);
__type(value, __u64);
} flow_cache SEC(".maps");

Lock-Free Performance with Per-CPU Maps

Per-CPU maps eliminate lock contention in high-throughput scenarios:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
struct stats {
__u64 packets;
__u64 bytes;
__u64 dropped;
};

SEC("xdp")
int packet_counter(struct xdp_md *ctx)
{
__u32 key = 0;
struct stats *value;
__u32 packet_size = ctx->data_end - ctx->data;

/* Each CPU has its own copy of the map entry */
value = bpf_map_lookup_elem(&stats_map, &key);
if (value) {
/* No locks needed - each CPU operates on its own data */
value->packets++;
value->bytes += packet_size;
}

return XDP_PASS;
}

Performance Optimization Strategies

Hardware-Level Optimizations

Multi-Queue Configuration

Modern network interfaces support multiple hardware queues to distribute processing across CPU cores:

1
2
3
4
5
6
7
8
# Check current queue configuration
ethtool -l eth0

# Set number of queues to match CPU cores
ethtool -L eth0 combined 8

# Verify RSS (Receive Side Scaling) configuration
ethtool -x eth0

Hardware Offloads

Enable hardware acceleration features to reduce CPU overhead:

1
2
3
4
5
6
7
8
9
10
# Enable key hardware offloads
ethtool -K eth0 rx-checksumming on
ethtool -K eth0 tx-checksumming on
ethtool -K eth0 scatter-gather on
ethtool -K eth0 tcp-segmentation-offload on
ethtool -K eth0 generic-segmentation-offload on
ethtool -K eth0 large-receive-offload on

# Check offload status
ethtool -k eth0 | grep ": on"

Interrupt and CPU Affinity

IRQ Balancing

Optimize interrupt handling for network-intensive workloads:

1
2
3
4
5
6
7
8
9
10
11
12
13
# Disable automatic IRQ balancing
systemctl stop irqbalance
systemctl disable irqbalance

# Find network device IRQs
grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':'

# Set IRQ affinity to specific CPUs
echo 2 > /proc/irq/24/smp_affinity_list
echo 4,6 > /proc/irq/25/smp_affinity_list

# Use CPU bitmask for complex configurations
echo f0 > /proc/irq/26/smp_affinity # CPUs 4-7

NUMA Optimization

Consider NUMA topology for optimal memory access:

1
2
3
4
5
6
7
8
# Check NUMA configuration
numactl --hardware

# Find network device NUMA node
cat /sys/class/net/eth0/device/numa_node

# Bind application to same NUMA node
numactl --cpunodebind=0 --membind=0 your_network_application

Kernel Parameter Tuning

Socket Buffer Configuration

1
2
3
4
5
6
7
8
9
10
11
12
# Increase socket buffer sizes
echo 'net.core.rmem_max = 536870912' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 536870912' >> /etc/sysctl.conf
echo 'net.core.rmem_default = 65536' >> /etc/sysctl.conf
echo 'net.core.wmem_default = 65536' >> /etc/sysctl.conf

# TCP buffer tuning
echo 'net.ipv4.tcp_rmem = 4096 65536 536870912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 536870912' >> /etc/sysctl.conf

# Apply changes
sysctl -p

Network Core Parameters

1
2
3
4
5
6
7
8
9
10
11
# Increase network device queue length
echo 'net.core.netdev_max_backlog = 5000' >> /etc/sysctl.conf

# Increase maximum connections
echo 'net.core.somaxconn = 65535' >> /etc/sysctl.conf

# Enable TCP window scaling
echo 'net.ipv4.tcp_window_scaling = 1' >> /etc/sysctl.conf

# Optimize TCP congestion control
echo 'net.ipv4.tcp_congestion_control = bbr' >> /etc/sysctl.conf

Application-Level Optimizations

Zero-Copy Networking

Leverage technologies like AF_XDP for zero-copy packet processing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
/* AF_XDP socket setup for zero-copy */
#include <linux/if_xdp.h>
#include <bpf/xsk.h>

struct xsk_socket_info {
struct xsk_ring_cons rx;
struct xsk_ring_prod tx;
struct xsk_umem_info *umem;
struct xsk_socket *xsk;
};

int setup_xdp_socket(const char *interface, int queue_id)
{
struct xsk_socket_config xsk_cfg = {};
struct xsk_umem_config umem_cfg = {};

/* Configure zero-copy mode */
xsk_cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
xsk_cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
xsk_cfg.xdp_flags = XDP_FLAGS_DRV_MODE; /* Driver mode */
xsk_cfg.bind_flags = XDP_ZEROCOPY; /* Zero-copy */

/* Setup UMEM for packet buffers */
umem_cfg.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
umem_cfg.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
umem_cfg.frame_size = 2048;
umem_cfg.frame_headroom = 0;

return xsk_socket__create(&xsk_info.xsk, interface, queue_id,
umem, &xsk_info.rx, &xsk_info.tx, &xsk_cfg);
}

Monitoring and Debugging

eBPF Program Introspection

1
2
3
4
5
6
7
8
9
10
11
# List all loaded eBPF programs
bpftool prog list

# Show detailed program information
bpftool prog show id 123 --pretty

# Dump JIT compiled code
bpftool prog dump jited id 123

# Monitor program performance
bpftool prog profile id 123 duration 10 cycles instructions

XDP Statistics and Monitoring

1
2
3
4
5
6
7
8
# Monitor XDP statistics
ip link show dev eth0 | grep -A 10 xdp

# Check XDP program performance
cat /proc/net/xdp_stats

# Monitor packet drops and errors
ethtool -S eth0 | grep -E "(drop|error|miss)"

Performance Profiling

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Profile network application with perf
perf record -g -e cpu-cycles,instructions,cache-misses \
-- timeout 60 network_application

perf report --call-graph=graph,0.5,caller

# Monitor network performance with eBPF
bpftrace -e '
tracepoint:skb:kfree_skb {
@drops[str(args->protocol)] = count();
}
interval:s:5 {
print(@drops);
clear(@drops);
}'

Real-World Applications

Load Balancing with XDP

XDP enables extremely fast load balancing directly in the kernel:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 4);
__type(key, __u32);
__type(value, __u32);
} backend_servers SEC(".maps");

SEC("xdp")
int xdp_load_balancer(struct xdp_md *ctx)
{
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
struct ethhdr *eth = data;
struct iphdr *ip;
__u32 hash, server_idx = 0;
__u32 *server_ip;

/* Parse packet headers */
if ((void *)(eth + 1) > data_end)
return XDP_ABORTED;

if (eth->h_proto != htons(ETH_P_IP))
return XDP_PASS;

ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_ABORTED;

/* Simple hash-based load balancing */
hash = ip->saddr ^ ip->daddr;
server_idx = hash % 4;

server_ip = bpf_map_lookup_elem(&backend_servers, &server_idx);
if (server_ip) {
/* Rewrite destination IP for load balancing */
ip->daddr = *server_ip;

/* Recalculate checksum */
ip->check = 0;
ip->check = ip_checksum(ip);
}

return XDP_TX; /* Send back out same interface */
}

Container Network Security

eBPF enables fine-grained container network security policies:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1000);
__type(key, struct flow_key);
__type(value, __u8);
} allowed_flows SEC(".maps");

struct flow_key {
__u32 src_ip;
__u32 dst_ip;
__u16 src_port;
__u16 dst_port;
__u8 protocol;
};

SEC("classifier")
int container_firewall(struct __sk_buff *skb)
{
struct flow_key key = {};
__u8 *allowed;

/* Extract flow information */
if (parse_packet(skb, &key) < 0)
return TC_ACT_OK;

/* Check if flow is allowed */
allowed = bpf_map_lookup_elem(&allowed_flows, &key);
if (!allowed)
return TC_ACT_SHOT; /* Drop unauthorized traffic */

return TC_ACT_OK;
}

Looking Forward

These advanced Linux networking features represent the foundation for next-generation network infrastructure. From cloud-native networking to edge computing, eBPF and XDP are enabling new possibilities in network programming and performance optimization.

Key trends to watch:

  • Hardware acceleration: Integration with SmartNICs and DPUs
  • Cloud networking: Service mesh acceleration and container networking
  • Security: Real-time threat detection and mitigation
  • Observability: Deep network visibility and tracing

In our next post, we’ll explore the practical tools and scripts that leverage these advanced features for real-world network debugging and performance testing.

Series Navigation


This post is part of the comprehensive Linux Networking Deep Dive series. Each part builds upon previous concepts while exploring advanced topics in depth.