Notebook 05: What we still omit, and why

v0.2 closed two of the gaps we listed in early drafts of this chapter — the DH ratchet and the skipped-key cache. The remaining gaps below are still genuinely out of scope for this educational toy. This chapter inventories what is not yet built, what threat each one addresses, and what closing it would cost.

1. ✅ DH ratchet (post-compromise security), added in v0.2

Originally listed here as omitted; implemented in v0.2. Every time the conversation flips direction, the next sender generates a fresh X25519 keypair, attaches the public half to the message header, and mixes DH(new_eph, peer_dh_pub) into the root key. Notebook 03 replays the same compromise scenario and shows the session recovering thanks to this rotation.

Implementation: pqmsg/session.py (_maybe_rotate_send_dh, _dh_ratchet_recv, _kdf_root). Spec: tests/test_dh_ratchet.py::test_post_compromise_recovers_after_dh_ping_pong.
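
As a rough illustration of the root-key step described above, the sketch below mixes a DH output into a root key using HKDF-SHA256 built from the standard library. It is an assumption about the general shape of such a KDF, not a copy of pqmsg's _kdf_root: the function names, the info label, and the choice of HKDF are hypothetical, and the DH output is passed in as opaque bytes rather than computed with X25519.

```python
import hashlib
import hmac


def hkdf_sha256(salt: bytes, ikm: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def kdf_root(root_key: bytes, dh_output: bytes) -> tuple[bytes, bytes]:
    """Hypothetical root KDF: returns (new_root_key, new_chain_key).

    The old root key acts as the HKDF salt and the fresh DH output as the
    input keying material, so the result depends on both secrets.
    """
    out = hkdf_sha256(salt=root_key, ikm=dh_output, info=b"toy-root-kdf", length=64)
    return out[:32], out[32:]


# Usage: one step per direction flip (dh_output would come from X25519).
root_key = b"\x00" * 32
root_key, chain_key = kdf_root(root_key, dh_output=b"\x11" * 32)
```

Because each step folds in a DH secret that an earlier attacker never saw, a leaked root key stops being useful once both sides have ratcheted past it.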

2. Authentication (signed prekeys, identity binding)

What we have: TOFU — Trust On First Use. When Alice imports bob.pub, she trusts that the bytes really are Bob’s. There is no signature, no fingerprint comparison, no published key directory.

What real Signal has:

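  1. Long-term identity key: a Curve25519 key pair used for Ed25519-compatible (XEdDSA) signatures. Long-lived; it changes only when the user reinstalls or re-registers.

  2. Signed prekey: a medium-term X25519 key, signed by the identity key. Rotated periodically (for example weekly).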

  3. One-time prekeys — a batch of single-use X25519 keys uploaded to the server. The very first message consumes one, giving the initial handshake an extra layer of forward secrecy on the receiver side too.

  4. Safety numbers — a fingerprint of both identity keys that users can compare out-of-band to detect MITM by the server.

Cost to add:

  • A signature scheme (Ed25519 or, for fully-PQ authentication, ML-DSA / SLH-DSA — see the companion book ch. 10).

  • A prekey bundle format and a directory service (or a publish-and-fetch dance over the file queue).

  • A safety-number rendering function.

Without these, an attacker who can inject contact files (any malware on the user’s laptop, any tampering with the file queue) can swap in their own keys at the moment of import and MITM every message thereafter. TOFU is only as strong as the channel that delivers the first key.
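
The safety-number idea from the list above can be sketched with the standard library alone. This is a minimal illustration, not Signal's actual fingerprint algorithm (which uses an iterated hash and also binds user identifiers): the function name, the domain-separation label, and the six-groups-of-five-digits rendering are all assumptions made for this example.

```python
import hashlib


def safety_number(id_key_a: bytes, id_key_b: bytes) -> str:
    """Render a short, order-independent fingerprint of two identity keys."""
    # Sort so that Alice and Bob compute the same string.
    first, second = sorted((id_key_a, id_key_b))
    digest = hashlib.sha256(b"toy-safety-number" + first + second).digest()
    groups = []
    for i in range(6):  # 6 groups x 5 bytes = 30 of the 32 digest bytes
        value = int.from_bytes(digest[5 * i : 5 * i + 5], "big") % 100_000
        groups.append(f"{value:05d}")
    return " ".join(groups)


# Usage: both parties display the string and compare it out-of-band.
alice_view = safety_number(b"\x01" * 32, b"\x02" * 32)
bob_view = safety_number(b"\x02" * 32, b"\x01" * 32)
```

If a man-in-the-middle substituted either key at import time, the two displayed strings would differ, which is the check TOFU alone cannot provide.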

3. ✅ Out-of-order delivery (skipped-key cache), added in v0.2

Originally listed here as omitted; implemented in v0.2. When a receive index jumps ahead of the expected one, the receiver pre-derives and caches the missing message keys, bounded by MAX_SKIP = 100. A late or reordered message within the window decrypts cleanly; a flood beyond the window is rejected to prevent DoS.

Implementation: _try_skipped, _skip_chain in pqmsg/session.py. Spec: test_out_of_order_within_chain_decrypts and test_skipped_key_cache_bounded.
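
The caching behaviour described above can be sketched as follows. This is a hedged illustration rather than the code in pqmsg/session.py: the class and method names, the HMAC-based chain KDF, and the decision to bound the total cache size (not only the size of a single jump) are assumptions.

```python
import hashlib
import hmac

MAX_SKIP = 100  # same bound as in the text


def kdf_chain(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric ratchet step: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class ReceiveChain:
    """Hypothetical receive chain with a bounded skipped-key cache."""

    def __init__(self, chain_key: bytes) -> None:
        self.chain_key = chain_key
        self.next_index = 0
        self.skipped: dict[int, bytes] = {}

    def message_key_for(self, index: int) -> bytes:
        if index < self.next_index:
            # Late message: its key must still be cached, and is used once.
            try:
                return self.skipped.pop(index)
            except KeyError:
                raise ValueError("key already used or never cached") from None
        gap = index - self.next_index
        if len(self.skipped) + gap > MAX_SKIP:
            # Refuse before deriving anything, so a flood costs nothing.
            raise ValueError("too many skipped messages")
        while self.next_index < index:
            # Pre-derive and cache the keys for the messages we jumped over.
            self.chain_key, skipped_key = kdf_chain(self.chain_key)
            self.skipped[self.next_index] = skipped_key
            self.next_index += 1
        self.chain_key, message_key = kdf_chain(self.chain_key)
        self.next_index += 1
        return message_key
```

Popping a cached key on use means a replayed ciphertext finds no key, and checking the bound before deriving keeps an attacker from forcing unbounded work with a single forged index.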

4. Metadata privacy

What we have: sender, recipient, sent_at, and (for the first message) the size of the prekey bundle are all in plaintext on disk. Anyone who reads the inbox directory sees the social graph and timing.

What real systems do (partial list — none solves it fully):

  • Signal sealed sender: the recipient’s server learns that a message arrived for Bob, but not from whom. Sender identity is encrypted under Bob’s identity key.

  • Tor / onion routing: hides network-level metadata at the cost of latency.

  • Padding and cover traffic: hide message lengths and timing. Full cover traffic is rarely deployed at scale because the bandwidth cost is high.

Cost to add for this toy: sealed sender is roughly 30 lines (encrypt the sender field under the recipient’s long-term key). Network-level metadata defense is out of scope for any application-layer protocol.
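
Of the defenses listed above, length padding is the one that fits in a few lines of pure Python. The sketch below pads a plaintext up to the next multiple of a bucket size, using a 0x80 marker followed by zero bytes. The function names and the 256-byte bucket are assumptions for illustration; the padding would have to be applied before AEAD encryption and removed after decryption.

```python
def pad_to_bucket(plaintext: bytes, bucket: int = 256) -> bytes:
    """Pad to the next multiple of `bucket` with a 0x80 marker and zeros."""
    # Always add at least one byte so the marker is unambiguous.
    padded_len = (len(plaintext) // bucket + 1) * bucket
    return plaintext + b"\x80" + b"\x00" * (padded_len - len(plaintext) - 1)


def unpad(padded: bytes) -> bytes:
    """Strip the padding added by pad_to_bucket, rejecting malformed input."""
    marker = padded.rfind(b"\x80")
    if marker == -1 or any(padded[marker + 1 :]):
        raise ValueError("invalid padding")
    return padded[:marker]


# Usage: short and medium messages become indistinguishable by length.
short = pad_to_bucket(b"hi")
longer = pad_to_bucket(b"a much longer message than the first one")
```

An observer then learns only which bucket a message falls into, at the cost of up to one bucket of extra bytes per message.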

Honest assessment: metadata leakage is arguably the hardest open problem in messaging.

5. Multi-device, group chats, async delivery

All three are major design problems:

  • Multi-device: if Alice has a phone and a laptop, both must be able to send and receive in the same conversation. Signal-class systems handle this by treating each device as its own session endpoint, keeping a per-user device list, and running pairwise sessions between every pair of devices. Sending then becomes “encrypt the message N times”, once per device.

  • Group chats: at small N, pairwise sessions work. At large N, you need Sender Keys (a per-group symmetric chain, the approach WhatsApp uses for groups) or MLS (RFC 9420, tree-based key agreement designed to scale to thousands of members). MLS has been adopted by, for example, Discord for its end-to-end encrypted voice and video calls.

  • Async delivery: our file queue is naïvely async; the recipient just polls. A real system needs server-side queueing, push notifications, and prekey replenishment so that senders do not exhaust Alice’s one-time prekeys while she is offline for a month.
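
To make the pairwise fan-out cost from the list above concrete, here is a small back-of-the-envelope calculation. The function name and the simplifying assumption that every member has the same number of devices are ours, not part of any protocol.

```python
def pairwise_encryptions(members: int, devices_per_member: int = 1) -> int:
    """Ciphertexts one sending device produces per message with pairwise fan-out.

    The sender encrypts once for every other device in the conversation,
    including its own user's other devices.
    """
    return members * devices_per_member - 1


# A 5-person chat, one device each: 4 ciphertexts per message.
small_group = pairwise_encryptions(5)
# A 1000-person group, two devices each: 1999 ciphertexts per message.
large_group = pairwise_encryptions(1000, devices_per_member=2)
```

With Sender Keys the message itself is encrypted once per send, and the pairwise cost is paid only when a sender's chain key is first distributed or rotated, which is why the approach scales further.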

Each of these is a chapter of its own. The two-party, online-only, single-device toy in notebooks 02–04 is the simplest possible starting point.

6. Implementation hardening

Inherited from the companion book’s notebook 09 “Gaps vs. production”, and just as relevant here:

  • Constant-time: our ML-KEM implementation (pqc_edu) uses Python ints; many operations are timing-variable. A real implementation uses constant-time C / Rust.

  • Memory zeroization: we leak intermediate keys to Python’s GC. A real implementation calls mlock and explicitly wipes buffers.

  • Side channels: power, EM, cache. Out of scope for any pure-Python project.

  • KAT validation: pqc_edu is not validated against NIST Known Answer Tests. Production code must be.

  • Random number quality: we trust os.urandom. Real systems audit the entropy chain end-to-end.
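
Two of the items above have partial, best-effort counterparts in the standard library, sketched below. The helper names are hypothetical. hmac.compare_digest is a real constant-time comparison, but the wipe helper only clears the one buffer it is given: Python may already have copied the secret elsewhere, so this is not a substitute for mlock and explicit zeroization in C or Rust.

```python
import hmac


def tags_match(expected: bytes, received: bytes) -> bool:
    """Compare two MAC tags without leaking where they first differ."""
    return hmac.compare_digest(expected, received)


def wipe(buf: bytearray) -> None:
    """Best-effort zeroization of a mutable buffer, in place."""
    for i in range(len(buf)):
        buf[i] = 0


# Usage: keep secrets in a bytearray so they can be overwritten after use.
key = bytearray(b"not-a-real-key")
ok = tags_match(b"tag-bytes", b"tag-bytes")
wipe(key)
```

Keeping key material in bytearray rather than bytes is what makes the overwrite possible at all, since bytes objects are immutable.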

A summary table

| Property | This toy | Real Signal-class system |
| --- | --- | --- |
| Confidentiality (eavesdropper) | | |
| AEAD (tamper detection) | | |
| Forward secrecy (past msgs) | | |
| Post-compromise security | ✅ (DH ratchet, v0.2) | ✅ (DH ratchet) |
| Identity authentication | ❌ TOFU only | ✅ Ed25519 + safety #s |
| Out-of-order tolerance | ✅ (cache, v0.2) | ✅ skipped-key cache |
| Sealed sender | | |
| Multi-device | | |
| Group chats | | ✅ Sender Keys / MLS |
| Async + offline delivery | partial (file poll) | ✅ server + push |
| Constant-time crypto | | |
| KAT-validated KEM | | |

Six checks vs four. v0.2 added the DH ratchet and skipped-key cache rows. The four still missing — authentication beyond TOFU, sealed sender, multi-device, group chats — are operational/UX problems on top of the cryptographic core, not in it. Shipping a real product means closing every remaining row.