Results

Evaluation

Benchmarked on a self-captured 0.1 lux IMX327 RAW validation set and on ReCRVD (external generalization), against the noisy input, VBM3D and FastDVDNet. Because the methods produce output in different domains, every result is compared after an identical RAW→PNG visualization (linear gain + gamma).
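A minimal sketch of what a "linear gain + gamma" visualization could look like, assuming NumPy and a RAW frame already normalized to floating point. The function name `raw_to_display` and the default `black_level`, `white_level`, `gain` and `gamma` values are hypothetical placeholders, not the settings used for the numbers on this page, and demosaicing and white balance are left out.

```python
import numpy as np

def raw_to_display(raw, black_level=0.0, white_level=1.0, gain=1.0, gamma=2.2):
    """Map a RAW-domain frame to 8-bit display values (linear gain, then gamma).

    All default parameter values are illustrative assumptions.
    """
    x = np.asarray(raw, dtype=np.float64)
    # Normalize sensor values to [0, 1] using the black and white levels.
    x = (x - black_level) / (white_level - black_level)
    # Apply the linear gain and clip to the displayable range
    # (clipping first also keeps the power below well defined).
    x = np.clip(x * gain, 0.0, 1.0)
    # Gamma-encode for display.
    x = x ** (1.0 / gamma)
    return np.round(x * 255.0).astype(np.uint8)
```

The point of sharing one such function across all methods is that any difference in the PNG metrics then comes from the denoisers, not from the rendering.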

How we measure

- PSNR ↑: pixel fidelity
- SSIM ↑: structural similarity
- LPIPS ↓: perceptual distance
- tOF ↓: optical-flow temporal consistency
- tLPIPS ↓: inter-frame flicker

Evaluation is split into a RAW-domain pass (how much noise the model actually removes in RAW) and a PNG-domain pass (perceived quality after visualization).
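For orientation, here is a hedged sketch of two of these measurements in NumPy. `psnr` follows the standard definition. `temporal_gap` illustrates the usual idea behind tLPIPS-style metrics: compare how much consecutive output frames differ with how much the corresponding reference frames differ. The exact formulas used for this page are not stated, so that form is an assumption. `mean_abs_diff` is only a stand-in for a perceptual distance such as LPIPS, and all function names are hypothetical.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def mean_abs_diff(a, b):
    """Stand-in frame distance; a real tLPIPS would use LPIPS here."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))

def temporal_gap(out_frames, ref_frames, pair_distance=mean_abs_diff):
    """Mean |d(out[t-1], out[t]) - d(ref[t-1], ref[t])| over the sequence.

    Assumed form of a tLPIPS-style metric; lower means the output changes
    from frame to frame about as much as the reference does.
    """
    gaps = []
    for t in range(1, len(out_frames)):
        d_out = pair_distance(out_frames[t - 1], out_frames[t])
        d_ref = pair_distance(ref_frames[t - 1], ref_frames[t])
        gaps.append(abs(d_out - d_ref))
    return float(np.mean(gaps))
```

A flickering output changes more between frames than the reference does, which raises the gap even when each individual frame has a good PSNR.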


RAW-domain evaluation

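We compare the noisy input against RViDeNet-ECBAM directly in RAW. On both datasets, PSNR and SSIM improve substantially: PSNR rises by about 12.0 dB on the self-captured set (45.026 → 57.056) and by about 17.5 dB on ReCRVD (21.805 → 39.331).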

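| Dataset | Method | Scenes | PSNR (RAW) | SSIM (RAW) |
|---|---|---|---|---|
| Self-captured 0.1 lux | Noisy | 6 | 45.026 | 0.9537 |
| Self-captured 0.1 lux | RViDeNet | 6 | 57.056 | 0.9958 |
| ReCRVD | Noisy | 10 | 21.805 | 0.6927 |
| ReCRVD | RViDeNet | 10 | 39.331 | 0.9782 |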

PNG-domain evaluation

Under the shared RAW→PNG visualization, the best method depends on the dataset. On the self-captured set RViDeNet-ECBAM leads on PSNR and SSIM, and its tOF (0.293) is within 0.01 of the lowest value in the table (VBM3D σ=50, 0.285). On ReCRVD, VBM3D (σ=20) gives the highest PSNR while RViDeNet-ECBAM gives the best LPIPS and tOF. Rows are sorted by PSNR within each dataset.

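| Dataset | Method | Variant | PSNR | SSIM | LPIPS | tOF | tLPIPS |
|---|---|---|---|---|---|---|---|
| Self-captured | RViDeNet | | 31.573* | 0.648 | 0.555 | 0.293 | 0.090 |
| Self-captured | VBM3D | σ=30 | 30.968 | 0.629 | 0.771 | 0.326 | 0.079 |
| Self-captured | VBM3D | σ=50 | 30.840 | 0.621 | 0.810 | 0.285* | 0.022 |
| Self-captured | FastDVDNet | | 30.224 | 0.578 | 0.753 | 1.165 | 0.212 |
| Self-captured | VBM3D | σ=40 | 30.192 | 0.624 | 0.793 | 0.288 | 0.031 |
| Self-captured | VBM3D | σ=20 | 29.110 | 0.535 | 0.403* | 1.108 | 0.210 |
| Self-captured | Noisy | | 22.018 | 0.215 | 0.425 | 2.428 | 0.133 |
| ReCRVD | VBM3D | σ=20 | 36.919* | 0.915 | 0.229 | 1.263 | 0.156 |
| ReCRVD | VBM3D | σ=30 | 36.349 | 0.929 | 0.231 | 1.411 | 0.130 |
| ReCRVD | FastDVDNet | | 35.948 | 0.916 | 0.206 | 1.737 | 0.148 |
| ReCRVD | VBM3D | σ=40 | 35.678 | 0.922 | 0.253 | 1.645 | 0.126 |
| ReCRVD | VBM3D | σ=50 | 34.966 | 0.915 | 0.270 | 1.821 | 0.123 |
| ReCRVD | RViDeNet | | 34.546 | 0.918 | 0.176* | 1.165* | 0.149 |
| ReCRVD | Noisy | | 28.916 | 0.537 | 0.613 | 3.476 | 0.219 |

Best PSNR, lowest LPIPS and lowest tOF among the denoising methods in each dataset are marked with *. tOF and tLPIPS measure temporal stability (lower is better).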

On the self-captured set, RViDeNet's LPIPS (0.555) is higher than VBM3D σ=20 (0.403) — even above the noisy input (0.425). Strong denoising over-smooths high-frequency texture, and perceptual metrics treat residual noise as a kind of texture while penalising the missing detail of a smoothed result. We address this with alpha blending below.


Scene-by-scene (PNG)

Each cell shows the change (Δ) from Noisy to RViDeNet. For PSNR and SSIM a positive Δ is an improvement; for LPIPS, tOF and tLPIPS a negative Δ is an improvement. PSNR Δ is in dB.

Self-captured 0.1 lux set

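| Scene | PSNR Δ | SSIM Δ | LPIPS Δ | tOF Δ | tLPIPS Δ |
|---|---|---|---|---|---|
| whiteball | +9.76 | +0.444 | +0.154 | −2.296 | −0.040 |
| wood | +8.73 | +0.388 | +0.169 | −2.024 | −0.042 |
| snake | +9.28 | +0.420 | +0.091 | −1.936 | −0.054 |
| elephant | +9.11 | +0.407 | +0.153 | −2.159 | −0.037 |
| giraffe | +10.63 | +0.490 | +0.053 | −2.073 | −0.054 |
| pink | +9.86 | +0.448 | +0.161 | −2.302 | −0.032 |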

Every scene gains 8.7 to 10.6 dB in PSNR and 0.39 to 0.49 in SSIM, and tOF drops by roughly 1.9 to 2.3, a large temporal-consistency gain. LPIPS rises by 0.05 to 0.17 on every scene, reflecting texture loss from strong denoising.
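As a quick cross-check, the per-scene changes above can be averaged and compared with the dataset-level gap between the Noisy and RViDeNet rows of the PNG-domain table. The sketch below uses only numbers from this page; whether the dataset-level figures were averaged per scene or per frame is not stated, so exact agreement is not expected.

```python
# Per-scene deltas (Noisy -> RViDeNet) on the self-captured set.
# Order of values: PSNR, SSIM, LPIPS, tOF, tLPIPS.
METRICS = ("PSNR", "SSIM", "LPIPS", "tOF", "tLPIPS")
SCENE_DELTAS = {
    "whiteball": (9.76, 0.444, 0.154, -2.296, -0.040),
    "wood":      (8.73, 0.388, 0.169, -2.024, -0.042),
    "snake":     (9.28, 0.420, 0.091, -1.936, -0.054),
    "elephant":  (9.11, 0.407, 0.153, -2.159, -0.037),
    "giraffe":   (10.63, 0.490, 0.053, -2.073, -0.054),
    "pink":      (9.86, 0.448, 0.161, -2.302, -0.032),
}

# Dataset-level rows from the PNG-domain table.
NOISY = (22.018, 0.215, 0.425, 2.428, 0.133)
RVIDENET = (31.573, 0.648, 0.555, 0.293, 0.090)

def mean_scene_delta(deltas):
    """Average each metric's delta over all scenes."""
    n = len(deltas)
    return tuple(sum(row[i] for row in deltas.values()) / n
                 for i in range(len(METRICS)))

mean_delta = mean_scene_delta(SCENE_DELTAS)
table_delta = tuple(r - n for r, n in zip(RVIDENET, NOISY))

for name, m, t in zip(METRICS, mean_delta, table_delta):
    print(f"{name:7s} scene mean {m:+.3f}   table gap {t:+.3f}")
```

The two columns agree to within rounding for all five metrics, so the scene table and the dataset table describe the same result.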

ReCRVD (external)

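| Scene | PSNR Δ | SSIM Δ | LPIPS Δ | tOF Δ | tLPIPS Δ |
|---|---|---|---|---|---|
| Beauty | +9.16 | +0.661 | −0.656 | −3.041 | −0.048 |
| Lips | +4.61 | +0.332 | −0.190 | −1.023 | −0.100 |
| SunBath | +9.28 | +0.495 | −0.608 | −7.679 | −0.035 |
| boxing | −0.83 | +0.110 | −0.122 | −0.127 | −0.048 |
| breakdance | +9.47 | +0.595 | −0.820 | −1.941 | −0.107 |
| camel | +4.94 | +0.324 | −0.398 | −0.890 | −0.086 |
| dogs-jump | +9.10 | +0.531 | −0.684 | −2.518 | −0.080 |
| parkour | +7.59 | +0.375 | −0.447 | −3.633 | −0.100 |
| rollerblade | +1.03 | +0.223 | −0.283 | −1.437 | −0.059 |
| vietnam | +1.96 | +0.159 | −0.163 | −0.827 | −0.039 |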

Every scene except boxing improves on PSNR, and SSIM, LPIPS, tOF and tLPIPS improve on all ten scenes. boxing already starts at 34.4 dB in the noisy input, so PSNR has little room to rise: it drops slightly (−0.83 dB) while every other metric improves. SunBath and parkour show the largest temporal gains (tOF −7.68 and −3.63).


What it looks like

The noisy input is dominated by low-light noise that buries object boundaries. FastDVDNet softens the noise but stays dark and blurry. VBM3D is stable on static backgrounds but turns moving objects translucent where it cannot compensate for motion. RViDeNet-ECBAM gives the most stable visual quality across full frames and crops, recovering number plates, doll contours and lip boundaries, though some background texture is flattened by smoothing.

[Image: full-frame comparison of Noisy, FastDVDNet, VBM3D and RViDeNet-ECBAM]
Figure 2. Full frames, same input across methods (Noisy / FastDVDNet / VBM3D / Ours). Rows (a)–(c) are self-captured 0.1 lux scenes; (d) is the ReCRVD Lips scene.

[Image: zoomed-crop comparison of Noisy, FastDVDNet, VBM3D and RViDeNet-ECBAM]
Figure 3. Zoomed crops of the same frames. Our output keeps fine structure (the number plate “가 3456”, doll and figure contours, and lip highlights) sharper than the baselines while suppressing noise.

[Video: Noisy vs. RViDeNet-ECBAM, scene_giraffe]
Figure 4. scene_giraffe. Left: noisy input; right: RViDeNet-ECBAM. The denoised stream recovers structure and is temporally stable. More side-by-side clips in the gallery.

Alpha blending

To soften the over-smoothing seen in PNG LPIPS, we blend the RViDeNet output with the original noisy input:

$$I_{\text{final}} = \alpha \cdot I_{\text{denoised}} + (1 - \alpha) \cdot I_{\text{noisy}}$$

Larger α favours the denoised result; smaller α keeps more of the noisy input's texture (and noise). At α=1.0 the denoising is strongest; α=0.7–0.9 sit between noise removal and texture preservation, letting us pick a point on that trade-off. This was an addition beyond the initial plan, driven by the observed texture loss.
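The blend is a per-pixel linear interpolation, which can be sketched in NumPy as follows. The function name `alpha_blend` is hypothetical, and the sketch assumes both frames are in the same domain, with the same shape and value range; the page does not say in which domain the blending is applied.

```python
import numpy as np

def alpha_blend(denoised, noisy, alpha):
    """Return alpha * denoised + (1 - alpha) * noisy, element-wise.

    Assumes both inputs share shape, domain and value range.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    denoised = np.asarray(denoised, dtype=np.float64)
    noisy = np.asarray(noisy, dtype=np.float64)
    if denoised.shape != noisy.shape:
        raise ValueError("frames must have the same shape")
    return alpha * denoised + (1.0 - alpha) * noisy
```

Rearranging the formula gives I_final − I_denoised = (1 − α) · (I_noisy − I_denoised), so at α = 0.8 the output keeps 20% of whatever the denoiser removed, texture and noise alike.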


Does denoising help detection?

We run a YOLOv11x detector on the self-captured set. Without detection GT annotations, this is a proxy study (not mAP): average detections per frame, average mean confidence, and the detected-frame ratio (fraction of frames with at least one detection).

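| Method | Avg Det / Frame | Avg Mean Conf | Detected Frame Ratio |
|---|---|---|---|
| RViDeNet | 1.64 | 0.346 | 0.726 |
| VBM3D σ=40 | 1.23 | 0.307 | 0.647 |
| VBM3D σ=30 | 1.21 | 0.311 | 0.647 |
| VBM3D σ=50 | 1.01 | 0.242 | 0.497 |
| FastDVDNet | 0.79 | 0.211 | 0.498 |
| VBM3D σ=20 | 0.63 | 0.194 | 0.444 |
| Noisy | 0.016 | 0.0066 | 0.016 |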
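The three proxies can be computed from per-frame detector confidences roughly as sketched below. This is an assumed formulation, not the project's script: `detection_proxies` is a hypothetical name, the detector call is omitted, and frames with no detections are assumed to contribute a mean confidence of 0. That zero-filling assumption matches the Noisy row, where the average mean confidence (0.0066) is far below any plausible single-detection confidence.

```python
def detection_proxies(frame_confidences):
    """Summarize detections without ground truth.

    frame_confidences: one list per frame holding the confidence of each
    detection in that frame (an empty list means no detection).
    """
    n_frames = len(frame_confidences)
    if n_frames == 0:
        raise ValueError("need at least one frame")
    total_detections = sum(len(c) for c in frame_confidences)
    # Assumption: a frame without detections counts as mean confidence 0.
    per_frame_mean = [sum(c) / len(c) if c else 0.0 for c in frame_confidences]
    detected_frames = sum(1 for c in frame_confidences if c)
    return {
        "avg_det_per_frame": total_detections / n_frames,
        "avg_mean_conf": sum(per_frame_mean) / n_frames,
        "detected_frame_ratio": detected_frames / n_frames,
    }

# Toy example: four frames, two of them with detections.
example = detection_proxies([[0.5, 0.7], [], [0.4], []])
```

Because none of these quantities uses ground-truth boxes, they reward any confident detection, correct or not, which is why the study is described as a proxy rather than mAP.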

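On the noisy input, detection is essentially impossible (≤0.016 on all three proxies). RViDeNet leads every method on all three. Among the baselines, the detection ranking does not follow PSNR: VBM3D σ=40 has a lower PSNR than VBM3D σ=50 and FastDVDNet (30.192 vs. 30.840 and 30.224) yet yields more detections per frame (1.23 vs. 1.01 and 0.79), showing that pixel-level fidelity and real detectability do not always coincide.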

Qualitatively, on a frame where the noisy input yields zero detections, the RViDeNet output yields five (including the giraffe). Denoising extends beyond image quality into downstream computer-vision usability.

Detections on noisy vs. denoised video

The same YOLOv11x detector run on the noisy input and on our output, with boxes drawn on each, playing side by side. Watch both at once: the noisy stream barely registers a detection, while the denoised stream picks up objects consistently.

[Video pair: giraffe, YOLOv11x. Noisy + YOLO | Ours + YOLO]
[Video pair: whiteball, YOLOv11x. Noisy + YOLO | Ours + YOLO]