Manual annotation does not scale. When two streams are correlated, the degradation observed on one provides the ground truth for the other. The approach relies on mid-level fusion in a multimodal context.
Manual annotation is slow, expensive and hard to reproduce. On continuous signals, it becomes unmanageable. The question is not how to annotate faster, but how to avoid annotating at all.
Ground-truth from correlation
Two streams observe the same phenomenon from two angles. When one signal degrades, that degradation leaves a measurable trace on the other stream. The state of one signal can therefore be read from the behavior of the other.
The label comes from this correlation, not from an operator. If stream A loses quality in a way that is recognizable on stream B, B serves as the reference to label A. The process can be automated and produces thousands of consistent labels, free of annotator bias.
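As a rough illustration of labeling one stream from the other, here is a minimal sketch. It assumes two synchronized 1-D streams cut into non-overlapping windows, and it assumes that a degradation on A shows up on B as a shift in B's window mean. The function names, the window size and the threshold are hypothetical choices for the example, not part of the method described here.

```python
from statistics import median

def window_means(signal, size):
    """Split a signal into non-overlapping windows and return each window's mean."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(0, len(signal) - size + 1, size)
    ]

def labels_from_reference(stream_b, size, threshold):
    """Label each window of stream A using only the behavior of stream B.

    A window is labeled 1 (degraded) when B's window mean deviates from
    B's median window mean by more than `threshold`, else 0 (clean).
    The deviation test is an assumed stand-in for whatever trace the
    degradation actually leaves on B.
    """
    means = window_means(stream_b, size)
    baseline = median(means)
    return [1 if abs(m - baseline) > threshold else 0 for m in means]

# Toy synchronized streams: B shifts during samples 8..11, where A is degraded.
stream_a = [1.0] * 8 + [0.2] * 4 + [1.0] * 4
stream_b = [5.0] * 8 + [9.0] * 4 + [5.0] * 4

labels = labels_from_reference(stream_b, size=4, threshold=1.0)

# Pair each window of A with the label derived from B; no operator involved.
windows_a = [stream_a[i:i + 4] for i in range(0, len(stream_a), 4)]
labeled_a = list(zip(windows_a, labels))
```

In this toy case only the third window of A receives the degraded label. On real signals, the indicator computed on B and its threshold would have to be chosen from the known relationship between the two streams.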
Mid-level fusion
We fuse neither the raw pixels nor the final decisions alone. We fuse at mid-level, on representations that are already structured yet still rich. This level preserves the cross-stream information while staying compact enough for training.
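The idea of fusing at mid-level can be sketched as follows. Each stream is first reduced to a small feature vector, and the two vectors are then combined. The features used here (mean, variance, range) and fusion by concatenation are assumptions made for illustration. The text does not specify which representations or which fusion operator are used, and 1-D windows stand in for whatever the raw streams actually are.

```python
def mid_level_features(window):
    """Summarize a raw window as a small feature vector: mean, variance, range."""
    n = len(window)
    mean = sum(window) / n
    variance = sum((x - mean) ** 2 for x in window) / n
    return [mean, variance, max(window) - min(window)]

def fuse(window_a, window_b):
    """Mid-level fusion by concatenating the per-stream feature vectors.

    Neither the raw samples nor a final per-stream decision is combined,
    only the intermediate representations.
    """
    return mid_level_features(window_a) + mid_level_features(window_b)

# Hypothetical aligned windows from the two streams.
window_a = [1.0, 3.0, 1.0, 3.0]
window_b = [4.0, 4.0, 4.0, 4.0]

fused = fuse(window_a, window_b)  # six values: three per stream
```

The fused vector stays compact (six values for eight raw samples here) while keeping both streams side by side, so a downstream model can still learn relationships between them.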
The context becomes multimodal without human labeling. Each stream enriches the other and supervises it. The principle to keep: when two signals measure the same reality, one can annotate the other, and manual annotation becomes unnecessary.