DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture

cs.CV Young-Seo Chang, Yatong An, Jae-Sang Hyun · Mar 22, 2026

Plain-language introduction

DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding, which maps depth to sinusoidal 3-channel images, with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving a 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.
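To make the encode, quantize, decode pipeline concrete, here is a minimal NumPy sketch of a sinusoidal three-channel depth encoding with global 4-bit quantization. It is not the authors' implementation. The channel layout (sine, cosine, normalized depth) follows the commonly described MWD formulation and is an assumption here, as are the function names `mwd_encode` and `mwd_decode` and the parameter `n_periods`. The value 8 is borrowed from the paper's quoted $P=8$, but this review does not establish whether $P$ counts periods across the depth range or is a period length in depth units. The learned codec stage is omitted entirely.

```python
import numpy as np

def mwd_encode(depth, z_min, z_max, n_periods=8, bits=4):
    """Map depth to a quantized 3-channel image (assumed MWD-style layout)."""
    levels = 2 ** bits - 1
    z = np.clip((depth - z_min) / (z_max - z_min), 0.0, 1.0)  # normalized depth
    phase = 2.0 * np.pi * n_periods * z
    img = np.stack([0.5 + 0.5 * np.sin(phase),   # fine channel (sine fringe)
                    0.5 + 0.5 * np.cos(phase),   # fine channel (cosine fringe)
                    z], axis=-1)                 # coarse channel for unwrapping
    return np.round(img * levels).astype(np.uint8)  # global uniform quantization

def mwd_decode(code, z_min, z_max, n_periods=8, bits=4):
    """Recover depth from the quantized 3-channel image."""
    levels = 2 ** bits - 1
    img = code.astype(np.float64) / levels
    s = 2.0 * img[..., 0] - 1.0
    c = 2.0 * img[..., 1] - 1.0
    # Fractional position inside the current fringe period, in [0, 1].
    frac = np.mod(np.arctan2(s, c), 2.0 * np.pi) / (2.0 * np.pi)
    # The coarse channel only selects which period the sample falls in.
    k = np.round(img[..., 2] * n_periods - frac)
    z = (k + frac) / n_periods
    return z_min + z * (z_max - z_min)

# Synthetic ramp from 0.5 to 10.0 (illustrative values, not from the paper).
depth = np.linspace(0.5, 10.0, 4096).reshape(64, 64)
code = mwd_encode(depth, 0.5, 10.0)
rec = mwd_decode(code, 0.5, 10.0)
max_rel_err = np.max(np.abs(rec - depth)) / (10.0 - 0.5)
```

In this sketch the sine and cosine channels carry the precision and the coarse channel only has to be accurate to within half a period, which is why a 4-bit representation can still reconstruct depth to a small fraction of the range before any codec distortion is added.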

Critical review

Verdict

Bottom line

The paper presents a sound hybrid approach that validates the necessity of physics-based pre-transforms for depth compression, but it systematically conflates high reconstruction accuracy with "lossless-grade" fidelity. While the internal ablations are convincing, cross-method comparisons are confounded by differing entropy coders (learned vs. JPEG), and the lack of open-source code substantially limits reproducibility. The work is a solid incremental advance over MWD-based baselines but stops short of rigorous comparison against contemporary end-to-end depth codecs on standardized metrics.

“preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG”

paper · Abstract

“Results marked with † are reported from the original publication, where JPEG(Q=90) is used as the entropy codec. Our method employs end-to-end learned entropy coding.”

paper · Table 1 caption

“The code and trained models that support the findings of this study are available from the corresponding author upon reasonable request.”

paper · Data Availability

What holds up

The Only-TCM ablation (Table 11) provides compelling evidence that applying a standard learned image codec directly to raw depth maps fails badly (22.38 dB PSNR vs. 44.31 dB with MWD), consistent with depth statistics differing fundamentally from those of natural images. The 4-bit quantization ablation (Table 8) indicates that the 4-bit representation gives the best rate-distortion trade-off among the tested bit depths, with the 3-bit and 2-bit variants showing sharp quality degradation. The Transformer-CNN vs. CNN-only comparison (Table 9) supports the architectural choice with measurable gains (up to 0.75 dB).

“Only-TCM ... 1.443 ... 22.38 ... Ours (4-bit) ... 0.363 ... 44.31”

paper · Table 11

“4-bit ... 0.363 ... 44.31 ... 2-bit ... 0.427 ... 20.42”

paper · Table 8

“Ours (4-bit) ... 49.89 ... 0.307 ... CNN-only ... 49.22 ... 0.259”

paper · Table 9
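One possible reading of the sharp drop below 4 bits can be sketched numerically. The snippet assumes a coarse channel in [0, 1] that is uniformly quantized and used to select among 8 fringe periods. Both the layout and the helper name `coarse_error_in_periods` are assumptions for illustration, not details confirmed by the paper. Under those assumptions, unwrapping is only safe while the worst-case rounding error of the coarse channel stays below half a period.

```python
def coarse_error_in_periods(bits, n_periods=8):
    """Worst-case rounding error of a [0, 1] coarse channel, in fringe periods."""
    step = 1.0 / (2 ** bits - 1)        # spacing between quantization levels
    return n_periods * step / 2.0       # half a step, rescaled to periods

# Compare the bit depths mentioned in the ablation against the 0.5 threshold.
margins = {b: coarse_error_in_periods(b) for b in (8, 4, 3, 2)}
```

With these assumed settings the 4-bit case stays under the half-period threshold while the 3-bit and 2-bit cases exceed it, which would be consistent with the reported degradation. The sketch ignores fine-channel quantization and the distortion added by the learned codec, so it is a plausibility check rather than an explanation taken from the paper.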

Main concerns

The paper's use of "lossless-grade accuracy" and "commensurate with lossless PNG" is misleading: PSNR values of 44–50 dB indicate lossy reconstruction with nonzero error, whereas a lossless codec reproduces every depth value exactly. The Accuracy metric (1 - NRMSE) can obscure absolute error magnitudes when depth ranges are large. The comparison claiming a 60.3% bitrate reduction over prior work confounds the MWD encoding improvement with the switch from JPEG to a learned entropy coder (Table 1). No direct comparison is made against Sebai et al. or other recent end-to-end depth codecs on identical datasets and metrics, making SOTA claims difficult to verify. The fringe period $P=8$ is fixed without ablation or theoretical justification for its optimality across diverse depth ranges.

“preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG”

paper · Abstract

“DepthTCM reduces bitrate by 60.3% compared to previous multiwavelength-based methods”

paper · Section 1

“Direct metric comparison is omitted due to different datasets and evaluation protocols”

paper · Table 12

“we use a fixed value of $P=8$ in all experiments for consistency”

paper · Section 3.2
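The point about Accuracy hiding absolute error can be illustrated with a short calculation. The sketch assumes NRMSE is RMSE divided by the depth range, which is one common normalization; the paper's exact definition is not reproduced in this review. The 10 m and 80 m ranges are illustrative choices, not values from the paper.

```python
import math

def rmse_from_accuracy(accuracy, depth_range):
    """Absolute RMSE implied by Accuracy = 1 - RMSE / depth_range (assumed)."""
    return (1.0 - accuracy) * depth_range

def psnr_from_accuracy(accuracy):
    """PSNR in dB if the peak value equals the normalization range (assumed)."""
    return -20.0 * math.log10(1.0 - accuracy)

rmse_10m = rmse_from_accuracy(0.9938, 10.0)   # indoor-scale range, in meters
rmse_80m = rmse_from_accuracy(0.9938, 80.0)   # outdoor-scale range, in meters
psnr_equiv = psnr_from_accuracy(0.9938)       # finite, so not lossless
```

Under these assumptions, 99.38% accuracy corresponds to an RMSE of about 6.2 cm over a 10 m range and about 49.6 cm over an 80 m range, and to a finite PSNR of roughly 44 dB. A truly lossless reconstruction would have zero RMSE and unbounded PSNR.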

Evidence and comparison

Internal evidence strongly supports the core hypothesis that MWD encoding enables efficient learned compression. However, external comparisons are problematic: the N-DEPTH baseline uses JPEG while DepthTCM uses a learned codec, making it impossible to isolate the contribution of the physics-aware encoding from the entropy model upgrade. Table 12 acknowledges that direct comparison with Sebai et al. is omitted due to protocol differences, yet the abstract positions the work as advancing the state-of-the-art without this validation. The KITTI results (Table 3) show the method handles sparse depth reasonably, but the PSNR of 23.77 dB at 0.580 bpp suggests significant degradation in outdoor scenarios compared to indoor benchmarks.

“N-DEPTH† ... JPEG 90 ... 0.774 ... Ours (4-bit) ... Learned ... 0.307”

paper · Table 1

“Direct metric comparison is omitted due to different datasets and evaluation protocols”

paper · Table 12

“Ours (4-bit) ... 0.580 ... 23.77 ... 5.465”

paper · Table 3
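The headline reduction can be checked against the two Table 1 rows quoted above. The helper name `relative_reduction` is hypothetical; the bitrates are the quoted values.

```python
def relative_reduction(baseline_bpp, new_bpp):
    """Fractional bitrate saving of new_bpp relative to baseline_bpp."""
    return (baseline_bpp - new_bpp) / baseline_bpp

# N-DEPTH with JPEG (Q=90) vs. DepthTCM (4-bit) with learned entropy coding.
reduction = relative_reduction(0.774, 0.307)  # about 0.603
```

The result matches the 60.3% figure, which suggests the claim is the ratio of these two rows. Because the rows differ in both the depth encoding and the entropy coder, the number measures their combined effect and cannot be attributed to either one alone.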

Reproducibility

Reproducibility is severely hindered by the "available upon reasonable request" code policy rather than open-source release. While training hyperparameters are specified (batch size 4, 100 epochs, learning rate $1\times 10^{-4}$, $\lambda$ varying for rate control), critical architectural details rely on external citation of Liu et al.'s TCM without explicit reproduction of the specific configuration. The ScanNet v2 subset uses "approximately 25,000" frames without specifying the exact scene IDs or frame indices, making exact dataset replication impossible. Inference timing results (Table 7) are reported on an RTX 4080 SUPER without specifying precision (FP32/FP16) or CUDA/cuDNN versions.

“available from the corresponding author upon reasonable request”

paper · Data Availability

“sampling approximately 25,000 RGB-D frames from the training split”

paper · Section 4.1

“training runs for 100 epochs on an NVIDIA GeForce RTX 4080 SUPER with batch size 4”

paper · Section 4.1
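As an illustration of what would make the training subset replicable, the sketch below draws a seeded sample and records a checksum of the resulting frame list. It is a hypothetical example of a release artifact, not the authors' sampling procedure. The frame identifiers, the function name `sample_manifest`, and the seed are all made up for illustration.

```python
import hashlib
import random

def sample_manifest(frame_ids, k, seed=0):
    """Draw k frame IDs deterministically and return them with a checksum."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on the input order.
    chosen = sorted(rng.sample(sorted(frame_ids), k))
    digest = hashlib.sha256("\n".join(chosen).encode("utf-8")).hexdigest()
    return chosen, digest

# Hypothetical identifiers in a "scene/frame" style, for illustration only.
all_frames = [f"scene{s:04d}_00/{i:06d}" for s in range(3) for i in range(0, 100, 10)]
subset_a, digest_a = sample_manifest(all_frames, k=5, seed=42)
subset_b, digest_b = sample_manifest(list(reversed(all_frames)), k=5, seed=42)
```

Publishing a list like `subset_a` together with its digest would let others confirm they are training on the same frames, even without access to the original sampling script.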

Abstract

We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first converted losslessly to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern profilometry systems; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel image using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
