Moxel 4-GPU Benchmarks and Setting Up the LLM Phishing A/B Test
Proper Moxel benchmarks with prefetching enabled across 4x RTX 3090s. Also the experimental setup for comparing LLM-generated vs human-crafted phishing templates in PhishRig.
Moxel: 4-GPU benchmark results
Got clean numbers this week. Setup: 4x RTX 3090 (24GB each, 96GB total VRAM), PCIe Gen 4, Mixtral 8x7B loaded across all four cards. Three scenarios tested: batch inference, single-stream generation, and latency under load.
Single-stream generation (batch size 1):
| Config | Tokens/sec | Notes |
|---|---|---|
| 2x 3090, no prefetch (week 6 baseline) | ~11 | PCIe bottleneck visible |
| 4x 3090, no prefetch | ~9 | More cards, more hops |
| 4x 3090, with prefetch | ~23 | Prefetch hides most transfer latency |
| A100 80GB (reference, single card) | ~45 | NVLink native |
The prefetch improvement is the meaningful number: from 9 to 23 tokens/second by predicting which weights will be needed next and moving them while the current layer computes. The transformer architecture helps enormously here; the execution graph is deterministic, so prefetch accuracy is close to 100%.
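The overlap argument can be sketched with a toy latency model: without prefetch, each layer pays its weight transfer serially; with prefetch, the transfer of layer i+1 hides behind the compute of layer i. All numbers below are illustrative placeholders, not Moxel measurements.

```python
# Toy model of per-token latency with and without weight prefetch.
# Assumes a deterministic layer order, so layer i+1's weights can be
# staged while layer i computes. Timings are made up for illustration.

def token_latency(compute_ms, transfer_ms, prefetch):
    """Per-token latency across all layers, in milliseconds."""
    if not prefetch:
        # Serial: wait for each layer's weights, then compute.
        return sum(c + t for c, t in zip(compute_ms, transfer_ms))
    # Overlapped: the first transfer is exposed; after that each step
    # costs max(compute of layer i, transfer of layer i+1).
    total = transfer_ms[0]
    for i, c in enumerate(compute_ms):
        nxt = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0.0
        total += max(c, nxt)
    return total

layers = 32
compute = [1.2] * layers   # ms of compute per layer (illustrative)
transfer = [1.8] * layers  # ms to stage weights over PCIe (illustrative)

for pf in (False, True):
    ms = token_latency(compute, transfer, prefetch=pf)
    print(f"prefetch={pf}: {ms:.1f} ms/token -> {1000 / ms:.0f} tok/s")
```

The ratio between the two printed throughputs tracks how much transfer time sits on the critical path, which is why the gain shows up most when transfers dominate.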
The gap to A100 (~23 vs ~45) is mostly bandwidth. We’re hitting about 51% of A100 performance for single-stream generation, using consumer hardware that costs a fraction of enterprise GPU pricing. For batch processing the picture is better.
Batch inference (batch size 16):
| Config | Throughput (tokens/sec/request) | Total throughput |
|---|---|---|
| 4x 3090, with prefetch | ~31 | ~496 total |
| A100 80GB (reference) | ~38 | ~608 total |
At batch size 16, the gap closes to about 82% of A100 performance. This makes sense: larger batches shift the workload from bandwidth-bound to compute-bound, which plays to the 3090's strengths.
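The back-of-envelope version of that argument: weights move once per decode step regardless of batch size, while compute scales with the batch, so throughput grows until compute overtakes the transfer cost. The constants here are rough illustrations, not measured values.

```python
# Rough model of why batching amortises weight traffic. Assumes
# transfer and compute overlap, so a step costs whichever dominates.
# Both constants are illustrative, not Moxel measurements.

def step_time_ms(batch, weight_transfer_ms, compute_per_token_ms):
    # Weights stream once per step; compute scales with batch size.
    return max(weight_transfer_ms, batch * compute_per_token_ms)

transfer, per_tok = 40.0, 3.0  # illustrative ms values
for b in (1, 4, 16):
    t = step_time_ms(b, transfer, per_tok)
    print(f"batch {b:2d}: {t:.0f} ms/step, {1000 * b / t:.0f} tok/s total")
```

Total throughput scales almost linearly while the step is bandwidth-bound, then flattens once batch * compute exceeds the transfer time, which is the regime where consumer cards catch up.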
Practical implication: Moxel is not a replacement for proper NVLink-connected hardware when you need low-latency single-stream generation. But for batch processing workloads, and for research where you need large models to fit at all, 4x consumer GPUs at ~96GB pooled gives you something usable. The economics are compelling: 4x RTX 3090 at current market prices is significantly cheaper than a single A100 80GB, and you’re getting 82% of the batch throughput.
Still work to do: the placement policy is naive (round-robin across GPUs). Profiling suggests smarter placement based on access patterns should squeeze out another 10-15%.
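To make the placement point concrete: round-robin puts every pair of adjacent layers on different GPUs, so each layer boundary is a cross-device hop, while contiguous block placement confines hops to block boundaries. This is a simplified illustration of the trade-off, not Moxel's actual policy.

```python
# Compare cross-GPU hop counts for two layer-placement policies.
# Illustrative sketch only -- real placement also has to balance
# memory per card and access patterns.

def round_robin(n_layers, n_gpus):
    return [i % n_gpus for i in range(n_layers)]

def contiguous(n_layers, n_gpus):
    per = -(-n_layers // n_gpus)  # ceiling division: layers per GPU
    return [min(i // per, n_gpus - 1) for i in range(n_layers)]

def cross_gpu_hops(placement):
    # Each adjacent-layer pair on different GPUs costs a transfer.
    return sum(a != b for a, b in zip(placement, placement[1:]))

layers, gpus = 32, 4
print("round-robin hops:", cross_gpu_hops(round_robin(layers, gpus)))  # 31
print("contiguous hops:", cross_gpu_hops(contiguous(layers, gpus)))    # 3
```

Hop count is not the whole story (weights stream regardless of placement), but activations crossing PCIe at every layer boundary is exactly the kind of overhead a profile-guided policy could remove.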
LLM phishing A/B test setup
I mentioned in week 2 that I wanted to test LLM-generated phishing templates against human-crafted ones. PhishRig is now at a point where I can actually run this.
Experimental setup:
Control group: human-crafted phishing email templates. These are the standard simulation templates we’ve been using: credential reset notifications, IT security alerts, HR benefit updates. Written by a red teamer, reviewed and refined over multiple engagements.
Test group: templates generated using Claude Sonnet via a structured prompt. The prompt includes: target role/department, spoofed sender identity, scenario type, tone guidelines, and a constraint to avoid common phishing indicators (excessive urgency, generic salutations, poor grammar). No human editing of the output.
Metrics:
- Open rate (tracked via pixel)
- Click rate (link in email)
- Credential entry rate (landing page)
- Time to click (how quickly after opening)
What I’m controlling for: delivery infrastructure, subject line randomisation, target segmentation by department. The only variable is whether the email body was human-written or LLM-generated.
Results aren’t in yet; the simulation runs over two weeks to reach statistical significance. But early data (five days in) shows the LLM-generated templates performing comparably to human-crafted ones on open and click rate. Credential entry is where the real test is.
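For the significance check, the natural test on rates like these is a two-proportion z-test on counts per arm. The counts below are made up for illustration; they are not results from the live simulation.

```python
# Two-proportion z-test on click counts for the control (A) and LLM (B)
# arms. Standard pooled-proportion formulation; counts are illustrative.
from math import erf, sqrt

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))     # pooled std error
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts only -- not data from the running simulation.
z, p = two_proportion_z(clicks_a=42, n_a=250, clicks_b=51, n_b=250)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The same test applies per metric (open, click, credential entry); credential entry will have the smallest counts, which is why the full two-week window matters most there.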
I’ll write this up properly when I have complete data. If LLM templates are genuinely competitive, that has real implications, not just for red teaming economics (cheaper and faster at scale) but for what defenders need to train users against.
Quick ProvStamp update
The core sign/verify API is in progress; signing works for JPEG and PNG. Currently working through the certificate chain validation logic: the C2PA trust list verification is more involved than I expected. Specifically, path validation for intermediate CAs has edge cases in how the C2PA spec defines trust anchors versus what standard X.509 path validation does.
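The chain-walking skeleton of path validation looks roughly like this: follow issuer links from the leaf until you hit a configured trust anchor. This is a deliberately simplified sketch with hypothetical names; real validation also checks signatures, validity periods, extensions, and (for C2PA) the trust list rules that are causing the trouble above.

```python
# Simplified issuer-chain walk from a leaf certificate to a trust
# anchor. Structural sketch only: no signature or expiry checks, and
# all names are hypothetical.

def build_path(leaf, certs_by_subject, trust_anchors, max_depth=8):
    """Return subject names from leaf to a trust anchor, or None."""
    path, cert = [leaf], leaf
    for _ in range(max_depth):
        if cert["issuer"] in trust_anchors:
            return [c["subject"] for c in path] + [cert["issuer"]]
        parent = certs_by_subject.get(cert["issuer"])
        if parent is None:
            return None  # dangling issuer: no path to an anchor
        path.append(parent)
        cert = parent
    return None  # depth limit exceeded

certs = {
    "Intermediate CA": {"subject": "Intermediate CA", "issuer": "Root CA"},
}
leaf = {"subject": "signer", "issuer": "Intermediate CA"}
print(build_path(leaf, certs, trust_anchors={"Root CA"}))
# -> ['signer', 'Intermediate CA', 'Root CA']
```

The C2PA wrinkle is essentially about which names are allowed in `trust_anchors` and whether an intermediate can terminate a path, which is where it diverges from plain X.509 path validation.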
Next week: LLM phishing A/B test results. And a ThreatWatch feed quality update, adding the freshness scoring I built in week 8 to the public dashboard view.