

wl-webrtc Architecture Design

Date: 2026-04-03
Status: Draft
Author: Sisyphus (AI-assisted design)


1. Overview

1.1 Problem Statement

Build a low-latency Wayland screen sharing server that captures the desktop via GPU, encodes with hardware acceleration (VAAPI/Vulkan), and streams to a browser for remote viewing and eventual remote control.

1.2 Goals

  • Glass-to-glass latency < 50ms on LAN
  • GPU-accelerated pipeline: capture + encode entirely on GPU, only encoded bitstream crosses to CPU
  • Browser-only client: no native app installation required
  • Single binary deployment: embedded web UI, no external dependencies
  • Linux Wayland only: no cross-platform abstraction needed
  • Annex B streaming: the encoder outputs Annex B (start-code-prefixed) NAL units. The h264_metadata bitstream filter (repeat_sps=1 repeat_pps=1) injects SPS/PPS in-band before every IDR frame, and the browser decodes in Annex B mode via WebCodecs. Do NOT use repeat_headers=1: it is a libx264-only option and does not exist for h264_vaapi.

1.3 Non-Goals (Phase 1)

  • Multi-client support (Phase 2)
  • Audio streaming (Phase 3)
  • Remote input injection (Phase 2)
  • Firefox support (Phase 3 — WebRTC fallback)
  • Adaptive bitrate (Phase 3)

1.4 Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| Screen capture | wayland-client + DMA-BUF | Zero-copy GPU capture via DMA-BUF |
| GPU encoding | FFmpeg (ffmpeg-next) | VAAPI/Vulkan H.264/HEVC hardware encoding |
| Transport | wtransport (WebTransport over HTTP/3) | Full HTTP/3 + WebTransport protocol, built on quinn + rustls |
| Browser decode | WebCodecs VideoDecoder | Direct decode control, no MSE buffering |
| Web UI | axum + rust-embed | Single binary, compile-time embedded static files |
| Event loop | mio | Proven with Wayland file descriptor callbacks |
| Async runtime | tokio | Required by wtransport, also powers axum |
| Sync/async bridge | async_channel | Sync try_send() on the sender plus async recv() on the receiver, bridging mio → tokio |

1.5 Transport Decision: Why Not WebRTC

WebRTC was evaluated and rejected as the primary transport for this use case:

| Factor | WebRTC (webrtc-rs) | WebTransport + WebCodecs |
|---|---|---|
| Glass-to-glass latency | 30-110ms (unavoidable 20-60ms jitter buffer) | 12-38ms (no jitter buffer) |
| Rust ecosystem | webrtc-rs v0.20.0-alpha, mid-rewrite | wtransport, production-grade, built on quinn |
| Protocol overhead | ICE/DTLS/SRTP/SDP, designed for P2P NAT traversal | QUIC + TLS 1.3, server-to-client, simpler |
| Decode control | Browser controls jitter buffer, cannot opt out | Application controls every frame decode |
| GPU data path | Sample { data: Bytes }, must copy to CPU | Same copy, but shorter pipeline |
| Browser support | All browsers | Chrome/Edge only (Firefox lacks WebCodecs) |

Transport library choice: We use the wtransport crate (v0.7) instead of raw quinn + h3. The browser's WebTransport API requires a full HTTP/3 server implementing the WebTransport-over-HTTP/3 extension (IETF draft-ietf-webtrans-http3, which builds on the HTTP Datagrams and Capsule Protocol of RFC 9297). Raw QUIC is NOT sufficient: there is no browser API for raw QUIC connections. The wtransport crate provides the complete protocol stack (HTTP/3 + WebTransport) built on top of quinn 0.11 and rustls 0.23, with support for datagrams, unidirectional streams, and bidirectional streams.

WebRTC will be added as a Phase 3 fallback for Firefox compatibility.


2. Architecture

2.1 Thread Model

┌─────────────────────────────────────────────────────────────┐
│                      wl-webrtc process                       │
│                                                              │
│  Main Thread (mio event loop)                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Wayland event queue dispatch                        │   │
│  │  Screen capture (DMA-BUF, zero-copy from compositor) │   │
│  │  GPU encode (FFmpeg VAAPI/Vulkan, sync calls)        │   │
│  │  State machine transitions                           │   │
│  │  FPS limiting                                        │   │
│  └──────────────────────┬───────────────────────────────┘   │
│                         │                                    │
│              async_channel::bounded::<EncodedFrame>(16)      │
│                         │                                    │
│  Tokio Runtime Thread Pool (2+ threads)                      │
│  ┌──────────────────────▼───────────────────────────────┐   │
│  │  wtransport WebTransport server                      │   │
│  │  HTTP/3 + WebTransport session management            │   │
│  │  Frame distribution to connected clients             │   │
│  │  axum HTTP server (Web UI + control API)             │   │
│  │  rust-embed static file serving                      │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Design rationale:

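  • Capture + encode on main thread: GPU encoding is a synchronous call of roughly 3-8ms per frame, well inside the 16.7ms frame interval at 60fps (33.3ms at 30fps), so the mio event loop still returns to Wayland event dispatch between frames. Keeping the GPU pipeline on one thread also avoids cross-thread synchronization.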
  • wtransport on tokio: wtransport is built on quinn and tokio. axum requires tokio. Both coexist naturally. Both the WebTransport server and the HTTP static file server share the same tokio runtime.
  • async_channel::bounded(16): a capacity of 16 frames is about 267ms of buffer at 60fps (16 / 60 s), enough to absorb transport jitter without excessive latency. The mio thread sends with try_send(), which returns Err(TrySendError::Full(_)) when the channel is full; the main loop logs and discards that frame. Dropping is standard practice in real-time streaming, where newer frames are more valuable than stale ones, and it keeps the mio event loop responsive for Wayland event dispatch. Do NOT call send_blocking() on the mio thread: it would stall the capture pipeline whenever the transport consumer falls behind.
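The drop-on-full policy above can be sketched with the standard library alone. This sketch uses std::sync::mpsc::sync_channel as a stand-in for async_channel::bounded (whose Sender::try_send likewise fails with TrySendError::Full or TrySendError::Closed). EncodedFrame is reduced to a sequence number, and offer_frame, Offer and frame_channel are hypothetical names, not existing project code.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the real EncodedFrame: only a sequence number.
#[derive(Debug, Clone, PartialEq)]
pub struct EncodedFrame {
    pub seq: u64,
}

/// What happened to a frame offered to the transport channel.
#[derive(Debug, PartialEq, Eq)]
pub enum Offer {
    Sent,
    DroppedFull,
    ReceiverGone,
}

/// Never blocks: a full channel means the transport is behind, so the frame
/// being offered is dropped instead of stalling the Wayland event loop.
pub fn offer_frame(tx: &SyncSender<EncodedFrame>, frame: EncodedFrame) -> Offer {
    match tx.try_send(frame) {
        Ok(()) => Offer::Sent,
        Err(TrySendError::Full(dropped)) => {
            eprintln!("transport behind, dropping frame {}", dropped.seq);
            Offer::DroppedFull
        }
        Err(TrySendError::Disconnected(_)) => Offer::ReceiverGone,
    }
}

/// Same capacity as the design's bounded(16) channel.
pub fn frame_channel() -> (SyncSender<EncodedFrame>, Receiver<EncodedFrame>) {
    sync_channel(16)
}
```

Once the consumer takes one frame, a slot frees up and the next try_send succeeds again, so the loop never waits on the transport.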

2.2 Module Dependency Graph

                        ┌──────────┐
                        │ main.rs  │  entry point, CLI, orchestration
                        └──┬──┬──┬─┘
                           │  │  │
              ┌────────────┘  │  └────────────┐
              ▼               ▼               ▼
       ┌──────────────┐ ┌──────────┐   ┌──────────────┐
       │ state.rs     │ │ avhw.rs  │   │ transport.rs │
       │ StateMachine │ │ HW ctx   │   │ QUIC server  │
       │ CaptureSource│ │          │   │ Sessions     │
       └──┬────────┬──┘ └─────┬────┘   └──────────────┘
          │        │          │
     ┌────┘        │          │
     ▼             ▼          ▼
┌─────────┐ ┌────────────┐ ┌──────────┐
│cap_wlr_ │ │cap_ext_    │ │filter.rs │
│screen   │ │image_copy  │ │ crop/    │
│copy     │ │            │ │ scale/   │
└─────────┘ └────────────┘ │transpose │
                           └──────────┘
  ┌────────────┐  ┌──────────────┐
  │transform.rs│  │signaling.rs  │
  │ coordinate │  │ axum + embed │
  │ transform  │  │ Web UI serve │
  └────────────┘  └──────────────┘
  ┌────────────┐
  │fps_limit.rs│
  └────────────┘

Dependency layers (bottom-up):

  1. transform.rs, fps_limit.rs — leaf modules, zero internal dependencies
  2. avhw.rs, filter.rs — FFmpeg wrapper layer
  3. cap_wlr_screencopy.rs, cap_ext_image_copy.rs — capture backends, depend on state + avhw
  4. state.rs — state machine + CaptureSource trait
  5. transport.rs, signaling.rs — network layer
  6. main.rs — orchestration

2.3 Project File Structure

wl-webrtc/
├── Cargo.toml
├── README.md
├── src/
│   ├── main.rs                # ~300 lines — CLI, startup, orchestration
│   ├── state.rs               # ~600 lines — State<S>, EncConstructionStage, InFlightSurface
│   ├── avhw.rs                # ~450 lines — FFmpeg HW device/frame contexts
│   ├── filter.rs              # ~200 lines — FFmpeg video filter graph
│   ├── cap_wlr_screencopy.rs  # ~170 lines — wlr-screencopy backend
│   ├── cap_ext_image_copy.rs  # ~240 lines — ext-image-copy-capture backend
│   ├── transform.rs           # ~220 lines — coordinate transforms
│   ├── fps_limit.rs           # ~130 lines — VRR-aware frame rate limiter
│   ├── transport.rs           # ~400 lines — QUIC/WebTransport server
│   ├── signaling.rs           # ~200 lines — axum HTTP + WebSocket control
│   └── nalu.rs                # ~150 lines — Annex B NAL unit splitting, framing protocol
├── static/
│   ├── index.html             # Web UI shell
│   ├── player.js              # WebCodecs decoder + Canvas renderer
│   └── style.css              # Minimal styling
└── protocols/                 # Wayland protocol XML files

3. Data Flow

3.1 Zero-Copy Capture Pipeline

GPU Frame Pool ─alloc()→ HW Surface
                          ↓
               av_hwframe_map → DMA-BUF fd
                          ↓
               zwp_linux_dmabuf → WlBuffer (fd shared)
                          ↓
               Compositor writes directly to GPU Surface
                          ↓
               FFmpeg VAAPI/Vulkan encode (GPU-internal)
                          ↓
               AVPacket.data (Annex B with 00 00 00 01 start codes)
                          ↓               ← GPU→CPU copy via vaMapBuffer (unavoidable)
               Bytes::from(Vec<u8>) wrapper
                          ↓
                async_channel::Sender::try_send(EncodedFrame)  // sync, never blocks the main thread

3.2 Transport Pipeline

async_channel::Receiver::recv().await → EncodedFrame
                          ↓
                Frame byte-splitting at MTU boundaries (not NAL-aligned)
                          ↓
                ┌─ Keyframe → QUIC reliable stream (guaranteed delivery)
                └─ Delta frame → QUIC datagram (unreliable, low latency)
                          ↓
               wtransport send (reliable stream write or datagram, over quinn)
                          ↓
               Browser: incomingUnidirectionalStreams (keyframes) + datagrams.readable (deltas)
                          ↓
               Frame reassembly (if fragmented)
                          ↓
               WebCodecs VideoDecoder.decode(EncodedVideoChunk)
                          ↓
               Canvas.drawImage(VideoFrame)
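The byte-splitting step at the top of this pipeline can be sketched as follows. The sketch assumes the 17-byte header from section 6.2 and a hypothetical 1100-byte payload budget per datagram (the usable size depends on path MTU and QUIC/HTTP/3 overhead). The function name fragment() and the constant values are illustrative. The sketch picks the single-datagram types (0x03/0x04) when a frame fits in one fragment and the fragment types (0x01/0x02) otherwise.

```rust
/// 1 (type) + 4 (frame_id) + 8 (pts_us) + 2 (seq_num) + 2 (total) bytes.
pub const HEADER_LEN: usize = 17;
/// Hypothetical payload budget per datagram; tune against the real limit.
pub const MAX_PAYLOAD: usize = 1100;

/// Split one encoded frame at fixed byte boundaries (not NAL-aligned).
/// Every fragment repeats frame_id and pts_us; seq_num/total drive reassembly.
/// Keyframe fragments (0x01) are meant for the reliable stream, deltas for datagrams.
pub fn fragment(is_key: bool, frame_id: u32, pts_us: i64, data: &[u8]) -> Vec<Vec<u8>> {
    // An empty frame still produces one (empty) fragment.
    let chunks: Vec<&[u8]> = if data.is_empty() {
        vec![data]
    } else {
        data.chunks(MAX_PAYLOAD).collect()
    };
    let total = chunks.len();
    assert!(total <= u16::MAX as usize, "frame needs more than 65535 fragments");

    // 0x03/0x04 = complete in one fragment, 0x01/0x02 = one of several.
    let kind: u8 = match (is_key, total == 1) {
        (true, true) => 0x03,
        (false, true) => 0x04,
        (true, false) => 0x01,
        (false, false) => 0x02,
    };

    chunks
        .iter()
        .enumerate()
        .map(|(seq, chunk)| {
            let mut dgram = Vec::with_capacity(HEADER_LEN + chunk.len());
            dgram.push(kind);
            dgram.extend_from_slice(&frame_id.to_be_bytes());
            dgram.extend_from_slice(&pts_us.to_be_bytes());
            dgram.extend_from_slice(&(seq as u16).to_be_bytes());
            dgram.extend_from_slice(&(total as u16).to_be_bytes());
            dgram.extend_from_slice(chunk);
            dgram
        })
        .collect()
}
```

A 2500-byte delta frame, for example, becomes three fragments of 1100, 1100 and 300 payload bytes, all tagged 0x02 with total = 3.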

3.3 Latency Budget

| Stage | Latency | Notes |
|---|---|---|
| Wayland capture (KMS/dmabuf) | 1-3ms | Zero-copy from compositor |
| GPU encode (VAAPI H.264) | 3-8ms | Synchronous, main thread |
| vaMapBuffer CPU copy | <1ms | Unavoidable GPU→CPU |
| async_channel | <0.1ms | In-process |
| QUIC datagram (LAN) | 1-10ms | LAN transit, including network latency |
| WebCodecs decode | 2-5ms | Browser hardware decode |
| Canvas render | 1-2ms | requestAnimationFrame |
| Total (LAN) | 9-29ms | Well under the 50ms target |
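Summing the per-stage bounds is a quick consistency check on the total. The sketch below transcribes the rows into microseconds and reads "<1ms" as 0-1000 µs and "<0.1ms" as 0-100 µs; that reading is an assumption, not part of the table. It yields about 8.0-29.1 ms. The table's 9 ms floor corresponds to counting the vaMapBuffer copy as a full 1 ms. Only the listed stages are included.

```rust
/// (stage, min µs, max µs), transcribed from the latency budget table.
const STAGES: [(&str, u32, u32); 7] = [
    ("Wayland capture", 1_000, 3_000),
    ("GPU encode", 3_000, 8_000),
    ("vaMapBuffer copy", 0, 1_000),  // "<1ms"
    ("async_channel", 0, 100),       // "<0.1ms"
    ("QUIC datagram (LAN)", 1_000, 10_000),
    ("WebCodecs decode", 2_000, 5_000),
    ("Canvas render", 1_000, 2_000),
];

/// Sum the bounds into (best case, worst case) glass-to-glass totals in µs.
pub fn budget_us() -> (u32, u32) {
    STAGES
        .iter()
        .fold((0, 0), |(lo, hi), &(_, min, max)| (lo + min, hi + max))
}
```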

3.4 EncodedFrame Structure

use bytes::Bytes;
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FrameType {
    Keyframe,
    Delta,
}

#[derive(Clone)]
struct EncodedFrame {
    data: Bytes,           // Annex B NALUs with start codes (cheap refcounted clone)
    pts_us: i64,           // Presentation timestamp (microseconds, for WebCodecs)
    duration: Duration,    // Frame duration for timestamp calculation
    frame_type: FrameType, // Keyframe or Delta (matches transport framing)
    width: u32,            // Frame width (may differ from capture on ROI)
    height: u32,           // Frame height
}

Timestamp convention: pts_us is in microseconds (not nanoseconds), matching WebCodecs' EncodedVideoChunk.timestamp requirement. The server tracks a monotonic PTS starting from 0, incrementing by 1_000_000 / fps per frame.
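In integer arithmetic, incrementing by a precomputed 1_000_000 / fps truncates 16,666.67 µs to 16,666 µs at 60fps. Timestamps then drift by about 40 µs per second, or 144 ms per hour. A minimal sketch, assuming a fixed frame rate and using hypothetical helper names, derives each PTS from the frame index instead and compares the two approaches:

```rust
/// PTS in microseconds for frame `frame_index` at a fixed `fps`, computed
/// from the index so rounding error never accumulates.
pub fn pts_us(frame_index: u64, fps: u32) -> i64 {
    assert!(fps > 0, "fps must be non-zero");
    // u128 intermediate: frame_index * 1_000_000 cannot overflow.
    ((frame_index as u128 * 1_000_000) / fps as u128) as i64
}

/// The accumulate-a-truncated-step approach, for comparison.
pub fn pts_us_accumulated(frame_index: u64, fps: u32) -> i64 {
    let step = 1_000_000 / fps as i64; // 16_666 at 60 fps (truncated)
    frame_index as i64 * step
}
```

Both start at 0 and agree on the first frame; after 60 frames the index-based value is exactly 1 s while the accumulated one is 40 µs short.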


4. State Machine

4.1 EncConstructionStage

         App start
             │
             ▼
    ┌──────────────────┐
    │ ProbingOutputs   │  Discover Wayland outputs,
    └────────┬─────────┘  collect geometry info
             │ All outputs probed
             ▼
    ┌──────────────────┐
    │ EverythingButFmt │  HW device ctx created,
    └────────┬─────────┘  encoder initialized
             │ negotiate_format()
             ▼
    ┌──────────────────┐
    │    Streaming     │───┐  Active capture + encode + transport
    │                  │◄──┘  Format changed (loops back to Streaming)
    └──┬─────────────┬─┘
       │             ▲
       │ Output      │ Same output
       │ disconnected│ reconnects
       ▼             │
    ┌──────────────────┐
    │  OutputWentAway  │  Keep enc + transport,
    └──────────────────┘  drop capture objects

    An Intermediate transient variant exists at every transition arrow (mem::replace)

Key design choice: Streaming state holds both EncState (encoding pipeline) AND TransportState (active WebTransport sessions). On OutputWentAway, both are preserved — only capture objects are discarded.
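The mem::replace transition pattern can be sketched as follows. The payload types are hypothetical placeholders (the real variants carry FFmpeg and Wayland handles), and only the OutputWentAway round trip is shown. Moving the old state out by value lets its fields be reused without cloning. Intermediate is observable only if a transition panics midway.

```rust
use std::mem;

/// Hypothetical placeholders for the real encoder/transport/capture state.
#[derive(Debug)]
pub struct EncState { pub width: u32, pub height: u32 }
#[derive(Debug)]
pub struct TransportState { pub sessions: usize }
#[derive(Debug)]
pub struct CaptureObjects;

#[derive(Debug)]
pub enum EncConstructionStage {
    ProbingOutputs,
    EverythingButFmt { enc: EncState },
    Streaming { enc: EncState, transport: TransportState, capture: CaptureObjects },
    OutputWentAway { enc: EncState, transport: TransportState },
    /// Only exists while a transition is in progress.
    Intermediate,
}

impl EncConstructionStage {
    /// Output disconnected: keep encoder + transport, drop capture objects.
    pub fn on_output_removed(&mut self) {
        // Take the current state by value, leaving the transient placeholder.
        let old = mem::replace(self, EncConstructionStage::Intermediate);
        *self = match old {
            EncConstructionStage::Streaming { enc, transport, capture } => {
                drop(capture); // Wayland objects are invalid now
                EncConstructionStage::OutputWentAway { enc, transport }
            }
            other => other, // not streaming: nothing to do
        };
    }

    /// Same output reappeared: new capture objects, reused encoder + transport.
    pub fn on_output_returned(&mut self, capture: CaptureObjects) {
        let old = mem::replace(self, EncConstructionStage::Intermediate);
        *self = match old {
            EncConstructionStage::OutputWentAway { enc, transport } => {
                EncConstructionStage::Streaming { enc, transport, capture }
            }
            other => other,
        };
    }
}
```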

4.2 InFlightSurface

None → AllocQueued → Allocd(Frame) → CopyQueued { surface, drm_map, frame, buffer } → None

4-state enum with assert!(matches!(...)) runtime guards. RAII cleanup on each state transition. Single-frame-in-flight constraint prevents buffer exhaustion.
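A reduced sketch of this lifecycle follows, assuming placeholder Frame and WlBuffer handles; the real CopyQueued variant also carries surface and drm_map. Each method asserts the expected predecessor state. std::mem::take moves the payload out, so old handles are dropped (RAII) exactly at the transition.

```rust
/// Hypothetical placeholder handles for this sketch.
#[derive(Debug)]
pub struct Frame(pub u32);
#[derive(Debug)]
pub struct WlBuffer(pub u32);

#[derive(Debug, Default)]
pub enum InFlightSurface {
    #[default]
    None,
    AllocQueued,
    Allocd(Frame),
    CopyQueued { frame: Frame, buffer: WlBuffer },
}

impl InFlightSurface {
    /// None → AllocQueued: only one frame may be in flight at a time.
    pub fn queue_alloc(&mut self) {
        assert!(matches!(self, InFlightSurface::None), "frame already in flight: {:?}", self);
        *self = InFlightSurface::AllocQueued;
    }

    /// AllocQueued → Allocd(frame).
    pub fn alloc_done(&mut self, frame: Frame) {
        assert!(matches!(self, InFlightSurface::AllocQueued), "unexpected state: {:?}", self);
        *self = InFlightSurface::Allocd(frame);
    }

    /// Allocd → CopyQueued: the compositor now writes into the frame's buffer.
    pub fn queue_copy(&mut self, buffer: WlBuffer) {
        // mem::take moves the payload out and leaves `None` behind.
        match std::mem::take(self) {
            InFlightSurface::Allocd(frame) => *self = InFlightSurface::CopyQueued { frame, buffer },
            other => panic!("queue_copy in state {:?}", other),
        }
    }

    /// CopyQueued → None: release the wl_buffer and hand the frame to the encoder.
    pub fn copy_done(&mut self) -> Frame {
        match std::mem::take(self) {
            InFlightSurface::CopyQueued { frame, buffer } => {
                drop(buffer); // RAII: buffer released at the transition
                frame
            }
            other => panic!("copy_done in state {:?}", other),
        }
    }
}
```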

4.3 TransportSessionState (new)

┌───────────┐     connect      ┌───────────┐     disconnect     ┌───────────┐
│ Listening │ ──────────────→  │ Active    │ ──────────────→    │ Closed    │
│ (quinn    │                  │ (sending  │                    │ (cleanup) │
│  endpoint)│                  │  frames)  │                    │           │
└───────────┘                  └───────────┘                    └───────────┘

Multiple sessions can be Active simultaneously (Phase 2). Phase 1 supports exactly one.


5. Design Patterns

The architecture employs several established software design patterns for managing complexity:

| # | Pattern | Usage in wl-webrtc |
|---|---|---|
| 1 | Strategy Trait + Generic State | CaptureSource trait with CapWlrScreencopy / CapExtImageCopy backends |
| 2 | Polymorphic Enum State Machine | EncConstructionStage: 5 variants with type-safe transitions |
| 3 | Type-Safe Frame Lifecycle | InFlightSurface: 4-state enum with runtime guards |
| 4 | Pin<Box> Self-Referential | Vulkan device context, for self-referential FFmpeg structs |
| 5 | Independent Thread Pipe | tokio runtime replaces mpsc audio thread; same atomic flag pattern |
| 6 | VRR-Aware Frame Rate Control | FpsLimit<T>: one-frame-buffer delay for correct drop decisions |
| 7 | Generic Dispatch 3-Layer | Wayland protocol dispatch, generic event handling |
| 8 | Three-Stage Safe Construction | Incremental resource acquisition with partial state rollback |
| 9 | Hot-Plug Auto-Recovery | OutputWentAway: preserve encoder/transport, rebuild capture |
| 10 | Zero-Copy GPU Pipeline | DMA-BUF capture + GPU-internal encode, minimal CPU involvement |

6. Transport Protocol Design

6.1 WebTransport Connection Setup

Server generates self-signed TLS certificate (via wtransport built-in rcgen support)
  → wtransport::Endpoint::server(server_config, addr)
  → Browser: new WebTransport("https://server:PORT/wt")
  → wtransport handles full HTTP/3 + WebTransport handshake internally
  → Session established (datagrams + streams available)

Transport library: We use wtransport crate (v0.7) which provides a complete WebTransport-over-HTTP/3 server implementation built on top of quinn 0.11 and rustls 0.23. This handles all protocol details (HTTP/3 SETTINGS, CONNECT method with :protocol = webtransport, session management, datagram framing per RFC 9297). Raw quinn or h3 would require building this protocol stack manually.

6.2 Frame Framing Protocol

QUIC datagrams have a practical MTU of ~1200 bytes. A 1080p H.264 frame is typically 10KB-200KB. Application-level framing:

Datagram format:
┌──────────┬──────────┬──────────┬──────────┬──────────┬─────────────┐
│ type (1) │ frame_id │ pts_us   │ seq_num  │ total    │ payload     │
│          │ (4 bytes)│ (8 bytes)│ (2 bytes)│ (2 bytes)│ (variable)  │
└──────────┴──────────┴──────────┴──────────┴──────────┴─────────────┘

type:
  0x01 = Keyframe fragment (sent via reliable stream, not datagram)
  0x02 = Delta frame fragment (sent via datagram)
  0x03 = Keyframe complete (small enough for single datagram)
  0x04 = Delta frame complete
  0x10 = Codec config (SPS/PPS for H.264, VPS/SPS/PPS for HEVC)

pts_us: Presentation timestamp in microseconds (i64, big-endian).
        Passed directly to WebCodecs EncodedVideoChunk.timestamp.
        For fragmented frames, every fragment carries the same pts_us.
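The 17-byte header can be encoded and decoded with the standard library alone. This sketch assumes all multi-byte fields are big-endian; the format above states this only for pts_us. The field is named kind because type is a Rust keyword, and the constant and method names are illustrative.

```rust
use std::convert::TryInto;

/// Fragment type values from the table above.
pub const KEYFRAME_FRAGMENT: u8 = 0x01;
pub const DELTA_FRAGMENT: u8 = 0x02;
pub const KEYFRAME_COMPLETE: u8 = 0x03;
pub const DELTA_COMPLETE: u8 = 0x04;
pub const CODEC_CONFIG: u8 = 0x10;

/// 1 (type) + 4 (frame_id) + 8 (pts_us) + 2 (seq_num) + 2 (total) bytes.
pub const HEADER_LEN: usize = 17;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FragmentHeader {
    pub kind: u8, // `type` is a Rust keyword
    pub frame_id: u32,
    pub pts_us: i64,
    pub seq_num: u16,
    pub total: u16,
}

impl FragmentHeader {
    /// Append the header in network (big-endian) byte order.
    pub fn encode(&self, out: &mut Vec<u8>) {
        out.push(self.kind);
        out.extend_from_slice(&self.frame_id.to_be_bytes());
        out.extend_from_slice(&self.pts_us.to_be_bytes());
        out.extend_from_slice(&self.seq_num.to_be_bytes());
        out.extend_from_slice(&self.total.to_be_bytes());
    }

    /// Split a received fragment into header and payload; None if truncated.
    pub fn decode(buf: &[u8]) -> Option<(FragmentHeader, &[u8])> {
        if buf.len() < HEADER_LEN {
            return None;
        }
        let header = FragmentHeader {
            kind: buf[0],
            frame_id: u32::from_be_bytes(buf[1..5].try_into().ok()?),
            pts_us: i64::from_be_bytes(buf[5..13].try_into().ok()?),
            seq_num: u16::from_be_bytes(buf[13..15].try_into().ok()?),
            total: u16::from_be_bytes(buf[15..17].try_into().ok()?),
        };
        Some((header, &buf[HEADER_LEN..]))
    }
}
```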

Key design decisions:

  • Keyframes via reliable WebTransport stream: SPS/PPS + IDR data must not be lost. Use session.open_uni().await for reliable delivery.
  • Delta frames via datagram: Loss-tolerant. If a delta frame is lost, the decoder waits for the next keyframe. This avoids accumulated corruption.
  • Frame reassembly in browser: Buffer fragments by frame_id, reassemble when all total fragments arrive, decode complete frame.
  • Timestamp in microseconds: The fragment header carries pts_us: i64 (presentation timestamp in microseconds) so the browser can pass it directly to EncodedVideoChunk.timestamp. This is required by WebCodecs — a sequential frame_id counter is NOT a valid timestamp.
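Reassembly keyed by frame_id can be sketched as below. This is a hypothetical Rust model of the player-side logic (the real player is JavaScript), with illustrative names. It accepts fragments in any order, ignores duplicates, and returns the concatenated payload once all total fragments are present. An eviction hook keeps frames that lost a fragment from accumulating. frame_id wrap-around is not handled.

```rust
use std::collections::HashMap;

struct Pending {
    pts_us: i64,
    parts: Vec<Option<Vec<u8>>>,
    received: usize,
}

pub struct CompleteFrame {
    pub frame_id: u32,
    pub pts_us: i64,
    pub data: Vec<u8>,
}

#[derive(Default)]
pub struct Reassembler {
    pending: HashMap<u32, Pending>,
}

impl Reassembler {
    /// Feed one fragment (header fields + payload); returns the frame once complete.
    pub fn push(&mut self, frame_id: u32, pts_us: i64, seq: u16, total: u16, payload: &[u8]) -> Option<CompleteFrame> {
        if total == 0 || seq >= total {
            return None; // malformed header
        }
        let entry = self.pending.entry(frame_id).or_insert_with(|| Pending {
            pts_us,
            parts: vec![None; total as usize],
            received: 0,
        });
        if entry.parts.len() != total as usize {
            return None; // inconsistent `total` for this frame_id
        }
        let slot = &mut entry.parts[seq as usize];
        if slot.is_none() {
            *slot = Some(payload.to_vec()); // duplicates are ignored
            entry.received += 1;
        }
        if entry.received < entry.parts.len() {
            return None;
        }
        let done = self.pending.remove(&frame_id)?;
        // Concatenate fragments in seq order.
        let data = done.parts.into_iter().flatten().flatten().collect();
        Some(CompleteFrame { frame_id, pts_us: done.pts_us, data })
    }

    /// Drop partial frames older than `oldest_kept` (e.g. after a newer frame completes).
    pub fn evict_older_than(&mut self, oldest_kept: u32) {
        self.pending.retain(|&id, _| id >= oldest_kept);
    }
}
```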

6.3 Codec Configuration Exchange

The encoder MUST be configured with the h264_metadata bitstream filter (repeat_sps=1 repeat_pps=1) to guarantee SPS/PPS are injected into every IDR frame. Note: repeat_headers=1 is a libx264-only option and does NOT exist for h264_vaapi. The browser configures the decoder in Annex B mode (no description at configure() time), and SPS/PPS arrive in-band with each keyframe.

On session establishment, the server sends a codec configuration message over the reliable QUIC stream to inform the browser of the codec and dimensions:

{
  "type": "codec_config",
  "codec": "avc1.42E01F",
  "width": 1920,
  "height": 1080,
  "framerate": 60
}

Browser uses this to configure VideoDecoder — without description, which activates Annex B mode:

decoder.configure({
  codec: config.codec,
  codedWidth: config.width,
  codedHeight: config.height,
  // NO description — Annex B mode. SPS/PPS arrive in-band with each keyframe.
});
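The codec string in the codec_config message need not be hard-coded. avc1.42E01F encodes profile_idc 0x42 with constraint byte 0xE0 (Constrained Baseline) and level_idc 0x1F (Level 3.1). Level 3.1 nominally tops out at 1280x720 at 30fps, below the 1920x1080 at 60fps advertised in the same message. A sketch that derives the string from the SPS the encoder actually emits follows; the function name is hypothetical, and the SPS would come from the first keyframe's NAL units.

```rust
/// Build a WebCodecs "avc1.PPCCLL" string from an H.264 SPS NAL unit given
/// without its start code. Byte 0 is the NAL header (type 7 = SPS); bytes
/// 1..4 are profile_idc, the constraint_set flags byte, and level_idc.
pub fn avc1_codec_string(sps_nal: &[u8]) -> Option<String> {
    if sps_nal.len() < 4 || sps_nal[0] & 0x1F != 7 {
        return None; // too short, or not an SPS
    }
    Some(format!("avc1.{:02X}{:02X}{:02X}", sps_nal[1], sps_nal[2], sps_nal[3]))
}
```

For example, a High-profile Level 4.2 SPS (enough for 1080p60) yields avc1.64002A.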

Why no AVCC description? Per the WebCodecs AVC registration spec, providing description forces the decoder into AVC (length-prefixed) mode for ALL frames. Since our encoder outputs Annex B (start-code-prefixed), we must omit description and rely on the in-band parameter sets guaranteed by the h264_metadata BSF (repeat_sps=1 repeat_pps=1).

Timestamp handling: as described in section 6.2, every fragment carries frame_id (u32) for reassembly ordering and the same pts_us (i64, microseconds), so the browser can take pts_us from any fragment and pass it directly as EncodedVideoChunk.timestamp. A sequential frame_id counter is NOT a valid timestamp.


7. Browser-Side Design

7.1 Web UI (static/index.html + player.js)

Single-page application with minimal dependencies:

┌──────────────────────────────────────┐
│  wl-webrtc                           │
│  ┌──────────────────────────────┐    │
│  │                              │    │
│  │     <canvas> (video)         │    │
│  │     WebCodecs → drawImage    │    │
│  │                              │    │
│  └──────────────────────────────┘    │
│  Status: Connected | Latency: 23ms   │
│  Resolution: 1920x1080 @ 60fps       │
│  [Fullscreen] [Disconnect]           │
└──────────────────────────────────────┘

7.2 WebCodecs Decoder Pipeline

CRITICAL: Annex B mode only. Per the W3C AVC WebCodecs Registration, if description is provided at configure() time, ALL subsequent EncodedVideoChunk data must be in AVC format (4-byte length-prefixed). If description is absent, the bitstream is assumed to be in Annex B format (start-code-prefixed). Since our encoder outputs Annex B, we must NOT provide description.

As in section 6.3, the encoder MUST run the h264_metadata bitstream filter (repeat_sps=1 repeat_pps=1) so that SPS/PPS precede every IDR frame (repeat_headers=1 is libx264-only and does not exist for h264_vaapi). This lets the decoder initialize from keyframe data alone.

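// Simplified player.js flow. Runs as <script type="module"> (top-level await).
// reassembleFrame() and parseKeyframeStream() are placeholders for the
// fragment handling described in section 6.2.
const ctx = document.querySelector("canvas").getContext("2d");
const transport = new WebTransport("https://server:PORT/wt");
await transport.ready;

const decoder = new VideoDecoder({
  output: (frame) => {
    ctx.drawImage(frame, 0, 0);
    frame.close(); // release the decoder's buffer promptly
  },
  error: (e) => console.error(e),
});

// Configure WITHOUT description → Annex B mode.
// SPS/PPS are delivered in-band with each keyframe (via h264_metadata BSF repeat_sps=1 repeat_pps=1 on encoder).
decoder.configure({
  codec: "avc1.42E01F",
  codedWidth: 1920,
  codedHeight: 1080,
  // NO description field: Annex B mode
});

let sawKeyframe = false;
function decodeFrame(frame) {
  // After configure(), the first chunk must be a keyframe or decode() throws.
  if (frame.isKeyframe) sawKeyframe = true;
  else if (!sawKeyframe) return;
  decoder.decode(new EncodedVideoChunk({
    type: frame.isKeyframe ? "key" : "delta",
    timestamp: Number(frame.ptsUs),
    data: frame.data, // Annex B: valid because no description was provided
  }));
}

// Keyframes: server-opened unidirectional streams (reliable),
// assumed here to carry one keyframe per stream.
async function readKeyframes() {
  const streams = transport.incomingUnidirectionalStreams.getReader();
  while (true) {
    const { value: stream, done } = await streams.read();
    if (done) break;
    const bytes = new Uint8Array(await new Response(stream).arrayBuffer());
    decodeFrame(parseKeyframeStream(bytes));
  }
}

// Delta frames: datagrams, possibly fragmented.
async function readDatagrams() {
  const reader = transport.datagrams.readable.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    const frame = reassembleFrame(value);
    if (frame.complete) decodeFrame(frame);
  }
}

await Promise.all([readKeyframes(), readDatagrams()]);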

7.3 No Annex B → AVCC Conversion Needed

Because we configure the decoder in Annex B mode (no description), no format conversion is needed on the browser side. The server sends raw Annex B NAL units with start codes (00 00 00 01), and the decoder accepts them directly.

The encoder MUST be configured with the h264_metadata bitstream filter (repeat_sps=1 repeat_pps=1) to guarantee SPS/PPS are included in every IDR frame. Note: repeat_headers=1 (and -flags2 +repeat_headers) are libx264-only options — they do NOT work with h264_vaapi. The BSF approach is encoder-agnostic and works with all FFmpeg hardware encoders. This ensures the decoder can re-initialize after any keyframe, even if it missed earlier configuration data.
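To check that every keyframe really is self-contained, an access unit can be split on start codes and its NAL types inspected (5 = IDR slice, 7 = SPS, 8 = PPS). A std-only sketch of this kind of splitting, of the sort nalu.rs is described as doing, follows; the function names are hypothetical. Trailing zero bytes are trimmed because an H.264 NAL unit never ends in 0x00, which also absorbs the extra leading zero of 4-byte start codes.

```rust
/// Split an Annex B byte stream on 00 00 01 / 00 00 00 01 start codes and
/// return the NAL units without their start codes.
pub fn split_annex_b(data: &[u8]) -> Vec<&[u8]> {
    let mut starts = Vec::new(); // index just past each start code
    let mut i = 0;
    while i + 3 <= data.len() {
        if data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1 {
            starts.push(i + 3);
            i += 3;
        } else {
            i += 1;
        }
    }
    let mut nals = Vec::new();
    for (k, &s) in starts.iter().enumerate() {
        let mut end = if k + 1 < starts.len() { starts[k + 1] - 3 } else { data.len() };
        // Trailing zeros belong to the byte stream, not the NAL unit.
        while end > s && data[end - 1] == 0 {
            end -= 1;
        }
        nals.push(&data[s..end]);
    }
    nals
}

/// nal_unit_type is the low 5 bits of the first NAL byte.
pub fn nal_type(nal: &[u8]) -> Option<u8> {
    nal.first().map(|b| b & 0x1F)
}

/// True if the access unit carries SPS (7), PPS (8) and an IDR slice (5),
/// so a decoder configured without `description` can start from it.
pub fn is_self_contained_idr(access_unit: &[u8]) -> bool {
    let types: Vec<u8> = split_annex_b(access_unit)
        .iter()
        .filter_map(|n| nal_type(n))
        .collect();
    [7u8, 8, 5].iter().all(|t| types.contains(t))
}
```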


8. Error Handling & Recovery

8.1 Display Hot-Plug

  1. wl_registry.global_remove → set output_went_away flag
  2. on_copy_fail() detects flag → transition to OutputWentAway
  3. Preserve: encoder context, WebTransport sessions
  4. Discard: Wayland protocol objects (invalidated)
  5. Wait for same-name output ("DP-1") to reappear
  6. Create new CaptureSource, reuse old encoder, continue streaming

8.2 Network Disconnection

  • QUIC handles keepalive and retransmission internally
  • Client page refresh → new WebTransport session → server auto-starts sending current frame stream
  • Server is stateless per session — no recovery needed, just reconnect

8.3 Dynamic Format Change

Capture format changes (resolution, rotation):

  1. Rebuild: frames_rgb, video_filter, enc_video, frames_yuv
  2. Preserve: hw_device_ctx, transport_state
  3. Send new codec configuration to browser via reliable stream
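  4. Browser calls VideoDecoder.configure() again with the new codec string and dimensions (still without description); the new SPS/PPS arrive in-band with the next keyframe, which must be the first chunk decoded after reconfiguring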

8.4 Frame Loss Handling

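  • Lost delta frame → later deltas reference a frame the decoder never received, so (per section 6.2) the player skips delta frames until the next keyframe instead of decoding with propagating artifacts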
  • Lost keyframe → decoder cannot continue → request keyframe from server via reliable stream
  • Server receives keyframe request → sets next input frame to AV_PICTURE_TYPE_I
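The skip-until-keyframe policy from section 6.2 can be modeled as a small gate that the player consults for each reassembled frame. This hypothetical sketch assumes frame_id increments by one per encoded frame and that frames reach the gate in frame_id order. SkipAndRequestKeyframe marks where the player would send the keyframe request over the reliable stream.

```rust
/// What the player should do with a reassembled frame.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Decode,
    /// Skip it and (once) ask the server for a keyframe.
    SkipAndRequestKeyframe,
    Skip,
}

/// After a gap in frame_id, deltas are skipped until the next keyframe.
#[derive(Default)]
pub struct DecodeGate {
    last_frame_id: Option<u32>,
    waiting_for_key: bool,
}

impl DecodeGate {
    pub fn on_frame(&mut self, frame_id: u32, is_keyframe: bool) -> Action {
        let gap = match self.last_frame_id {
            Some(prev) => frame_id != prev.wrapping_add(1),
            None => !is_keyframe, // joined mid-stream before any keyframe
        };
        self.last_frame_id = Some(frame_id);
        if is_keyframe {
            self.waiting_for_key = false; // a keyframe resets the reference chain
            return Action::Decode;
        }
        if self.waiting_for_key {
            return Action::Skip; // request already sent
        }
        if gap {
            self.waiting_for_key = true;
            return Action::SkipAndRequestKeyframe;
        }
        Action::Decode
    }
}
```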

8.5 Graceful Shutdown

Shutdown is triggered by SIGINT/SIGTERM via signal-hook + mio integration:

  1. Main loop sets running = false flag → stops queuing new captures
  2. Wait for in-flight frame to complete (drain InFlightSurface)
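  3. Drain the encoder: call avcodec_send_frame() with a NULL frame, then avcodec_receive_packet() until it returns AVERROR_EOF (avcodec_flush_buffers() would discard buffered packets rather than drain them)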
  4. Send final frames through channel
  5. Drop frame_tx sender → signals EOF to transport
  6. Transport server drains pending frames, sends GOAWAY to clients
  7. Runtime::shutdown_background() on the tokio runtime terminates remaining async tasks
  8. Drop Wayland protocol objects (compositor handles cleanup)
  9. FFmpeg contexts freed via Drop implementations

Key concern: Do NOT use blocking send_blocking() on the main thread — use try_send() so the main loop never stalls during shutdown. If the channel is full, the frame is dropped (acceptable during shutdown).
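Steps 4-6 rely on channel-close semantics. Dropping the last sender lets the receiver drain whatever is queued and then observe end-of-stream. A std-only sketch follows, with sync_channel standing in for async_channel::bounded (whose recv() likewise errors only once the channel is both empty and closed) and u32 standing in for EncodedFrame. The function name is hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send the final frames, drop the sender, and return what the transport
/// side delivered before it saw end-of-stream.
pub fn shutdown_drain(final_frames: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(16);
    let transport = thread::spawn(move || {
        let mut delivered = Vec::new();
        // recv() keeps yielding queued frames after the sender is dropped
        // and only returns Err once the queue is empty.
        while let Ok(frame) = rx.recv() {
            delivered.push(frame);
        }
        delivered // EOF: time to close sessions
    });
    for f in final_frames {
        // Step 4: never block the main loop; a full channel drops the frame.
        let _ = tx.try_send(f);
    }
    drop(tx); // Step 5: signal EOF to the transport side
    transport.join().expect("transport thread panicked")
}
```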

NOTE: wayland-client 0.31 uses Connection::connect_to_env() and GlobalList instead of the old 0.29 API (Display::connect_to_env() / GlobalManager::new()). See plan Task 11 for correct API usage.

8.6 First Keyframe Delivery

When a new WebTransport session is established, the client needs a keyframe before it can decode any delta frames. Two strategies:

  1. Force IDR on connect: Set AV_PICTURE_TYPE_I on the next encoded frame when a new session is detected
  2. Buffer last keyframe: Store the most recent keyframe in TransportServer, resend to new clients

Phase 1 uses strategy 1 (force IDR) for simplicity. The transport server sets a needs_keyframe: bool flag on new sessions, which the encode loop checks.
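The needs_keyframe flag can be a shared atomic that the transport sets and the encode loop consumes. The sketch below uses a hypothetical KeyframeRequest wrapper. When take() returns true, the encode loop would set the next input frame's pict_type to AV_PICTURE_TYPE_I; that FFmpeg call is not shown. swap(false) reads and clears the flag in one atomic step, so any number of requests between two frames forces exactly one IDR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared between the transport tasks (setters) and the encode loop (consumer).
#[derive(Clone, Default)]
pub struct KeyframeRequest(Arc<AtomicBool>);

impl KeyframeRequest {
    /// Transport side: a session connected or asked for a keyframe.
    pub fn request(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Encode loop, once per frame: true means "force an IDR now".
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::AcqRel)
    }
}
```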


9. Dependencies

[dependencies]
# Wayland screen capture
wayland-client = "0.31"
wayland-protocols = { version = "0.32", features = ["client", "unstable", "staging"] }
wayland-protocols-wlr = { version = "0.3", features = ["client"] }
drm-fourcc = "2"

# GPU encoding
ffmpeg-next = "8"

# WebTransport (HTTP/3 + WebTransport protocol, built on quinn + rustls)
wtransport = { version = "0.7", features = ["self-signed"] }

# Web UI
axum = { version = "0.8", features = ["ws"] }
tower-http = { version = "0.6", features = ["cors"] }
rust-embed = { version = "8", features = ["mime-guess"] }

# Async runtime
tokio = { version = "1", features = ["full"] }

# Sync/async bridge (sync try_send() on mio thread, async recv() on tokio)
async-channel = "2"

# Event loop
mio = { version = "1", features = ["os-poll", "os-ext"] }  # Poll + unix SourceFd for the Wayland fd

# Utilities
clap = { version = "4", features = ["derive"] }
tracing = "0.1"
tracing-subscriber = "0.3"
anyhow = "1"
bytes = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
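signal-hook = { version = "0.3", features = ["iterator"] }
signal-hook-mio = { version = "0.2", features = ["support-v1_0"] }  # registers signal-hook with mio 1.x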
base64 = "0.22"
mime_guess = "2"

Encoder configuration note: The VAAPI H.264 encoder MUST be configured with the h264_metadata bitstream filter (repeat_sps=1 repeat_pps=1) to guarantee SPS/PPS parameter sets are emitted in-band with every IDR frame. This is required for WebCodecs Annex B decode mode on the browser side. Important: repeat_headers=1 and -flags2 +repeat_headers are libx264-only options — they do NOT work with h264_vaapi. The BSF approach is encoder-agnostic and works with all FFmpeg hardware encoders.


10. Implementation Phases

Phase 1 — MVP: Screen → Browser Streaming

| # | Module | Description | Estimated Effort |
|---|---|---|---|
| 1 | main.rs | CLI args, startup sequence | Small |
| 2 | cap_*.rs | Implement capture backends (wlr-screencopy + ext-image-copy) | Medium |
| 3 | avhw.rs | Implement FFmpeg HW device/frame context management | Medium |
| 4 | filter.rs | Implement GPU video filter graph | Small |
| 5 | transform.rs | Implement coordinate transforms for Wayland outputs | Small |
| 6 | fps_limit.rs | Implement VRR-aware frame rate limiter | Small |
| 7 | state.rs | State machine adapted for transport | Medium |
| 8 | transport.rs | QUIC server + frame distribution | Large (new code) |
| 9 | nalu.rs | Annex B framing protocol | Small (new code) |
| 10 | signaling.rs | axum server + static files | Small (new code) |
| 11 | static/* | Browser Web UI + WebCodecs player | Medium (new code) |

Deliverable: Run wl-webrtc, open https://localhost:PORT in Chrome, see live screen at <50ms latency.

Phase 2 — Remote Input + Stability

| # | Feature | Description |
|---|---|---|
| 12 | Remote input | Browser mouse/keyboard → wlr-virtual-pointer/virtual-keyboard |
| 13 | Hot-plug recovery | Display disconnect/reconnect |
| 14 | Dynamic format | Resolution/rotation change handling |
| 15 | Multi-client | Multiple simultaneous browser viewers |

Phase 3 — Optimization + Compatibility

| # | Feature | Description |
|---|---|---|
| 16 | Adaptive bitrate | Network-aware VAAPI bit_rate adjustment |
| 17 | Audio pipeline | Synchronous audio capture + encoding + transport |
| 18 | WebRTC fallback | webrtc-rs path for Firefox compatibility |
| 19 | Performance dashboard | Real-time stats in Web UI |

11. Open Questions

  1. ffmpeg-next vs direct VAAPI bindings: ffmpeg-next adds FFI overhead but provides mature encoding pipeline. Direct vaapi-dmabuf bindings would be more Rust-native but much more implementation work. Decision: ffmpeg-next for Phase 1, evaluate direct bindings in Phase 3. NOTE: ffmpeg-next safe API does NOT wrap hardware contexts (AVBufferRef, AVHWFramesContext). Use raw ffmpeg_next::ffi directly for all HW context operations — see wl-screenrec/src/avhw.rs for the reference pattern.

  2. Frame fragmentation strategy: Current design fragments large frames across QUIC datagrams at byte boundaries (not NAL-aligned). The framing protocol reassembles by frame_id, so a lost fragment invalidates the entire frame. Alternative: send all frames via reliable QUIC streams and accept slightly higher latency. Decision: Start with datagrams for delta frames, measure latency, evaluate.

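  3. Self-signed certificate UX: Browser will show SSL warning. Options: (a) accept for LAN, (b) guide user to trust CA, (c) pin the certificate via the WebTransport constructor's serverCertificateHashes option (requires a short-lived, at most 14-day, ECDSA certificate). Decision: Accept for Phase 1, add CA trust guide in Phase 2.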

  4. HEVC vs H.264 default: H.264 has universal browser support. HEVC has better compression but spotty browser support. Decision: H.264 default, HEVC as option flag.

  5. WebCodecs bitstream format: Decision: Annex B mode (no description at configure time). SPS/PPS are guaranteed in-band via the h264_metadata BSF (repeat_sps=1 repeat_pps=1). Important: The repeat_headers=1 encoder option is libx264-only — it does NOT work with h264_vaapi. The BSF approach is encoder-agnostic and works with all FFmpeg hardware encoders. Per the W3C AVC WebCodecs Registration, providing description forces AVC (length-prefixed) mode for ALL subsequent frames. Since our encoder outputs Annex B, we must omit description.