DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture
DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding—mapping depth to sinusoidal 3-channel images—with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.
The paper presents a sound hybrid approach that validates the necessity of physics-based pre-transforms for depth compression, but it systematically conflates high reconstruction accuracy with "lossless-grade" fidelity. While the internal ablations are convincing, cross-method comparisons are confounded by differing entropy coders (learned vs. JPEG), and the lack of open-source code substantially limits reproducibility. The work is a solid incremental advance over MWD-based baselines but stops short of rigorous comparison against contemporary end-to-end depth codecs on standardized metrics.
The Only-TCM ablation (Table 11) provides compelling evidence that applying standard learned image codecs directly to raw depth maps fails catastrophically (22.38 dB PSNR vs. 44.31 dB with MWD), confirming that depth statistics differ fundamentally from natural images. The 4-bit quantization ablation (Table 8) rigorously demonstrates that 4-bit representation strikes an optimal rate-distortion balance, with 3-bit and 2-bit variants showing sharp quality degradation. The Transformer-CNN vs. CNN-only comparison (Table 9) supports the architectural choice with measurable gains (up to 0.75 dB).
The paper's use of "lossless-grade accuracy" and "commensurate with lossless PNG" is misleading: PSNR values of 44–50 dB indicate high-fidelity but still lossy reconstruction, whereas a lossless codec reproduces every depth value exactly. The Accuracy metric (1 - NRMSE) can obscure absolute error magnitudes when depth ranges are large. The comparison claiming 60.3% bitrate reduction over prior work confounds the MWD encoding improvement with the switch from JPEG to a learned entropy coder (Table 1). No direct comparison is made against Sebai et al. or other recent end-to-end depth codecs on identical datasets and metrics, making SOTA claims difficult to verify. The fringe period $P=8$ is fixed without ablation or theoretical justification for its optimality across diverse depth ranges.
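The concern about the Accuracy metric can be made concrete with a small numerical sketch. It assumes NRMSE is the RMSE divided by the range (max minus min) of the reference depth map; the review does not restate the paper's exact normalisation, so that choice, the function names, and the synthetic data are illustrative only.

```python
import numpy as np


def rmse(ref, rec):
    """Root-mean-square error between reference and reconstructed depth."""
    diff = np.asarray(ref, dtype=np.float64) - np.asarray(rec, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))


def accuracy(ref, rec):
    """Accuracy = 1 - NRMSE, with NRMSE normalised by the reference range.

    Assumption: range normalisation; the paper may normalise differently.
    """
    ref = np.asarray(ref, dtype=np.float64)
    span = float(ref.max() - ref.min())
    return 1.0 - rmse(ref, rec) / span


def psnr(ref, rec, peak):
    """PSNR in dB; infinite only when the reconstruction is bit-exact."""
    err = rmse(ref, rec)
    return float("inf") if err == 0.0 else 20.0 * np.log10(peak / err)


# Synthetic short-range scene (values in millimetres, 0 to 1000 mm).
rng = np.random.default_rng(0)
ref_near = np.linspace(0.0, 1000.0, 10_000)
rec_near = ref_near + rng.normal(0.0, 6.0, size=ref_near.shape)

# The same scene and the same error pattern stretched to an 80x larger range.
ref_far = 80.0 * ref_near
rec_far = 80.0 * rec_near

acc_near, acc_far = accuracy(ref_near, rec_near), accuracy(ref_far, rec_far)
rmse_near, rmse_far = rmse(ref_near, rec_near), rmse(ref_far, rec_far)
print(f"accuracy: near={acc_near:.4f} far={acc_far:.4f}")
print(f"rmse (mm): near={rmse_near:.2f} far={rmse_far:.2f}")
```

Under this normalisation both scenes report the same Accuracy, while the absolute RMSE of the long-range scene is 80 times larger, which is why absolute error figures are needed alongside the percentage.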
Internal evidence strongly supports the core hypothesis that MWD encoding enables efficient learned compression. However, external comparisons are problematic: the N-DEPTH baseline uses JPEG while DepthTCM uses a learned codec, making it impossible to isolate the contribution of the physics-aware encoding from the entropy model upgrade. Table 12 acknowledges that direct comparison with Sebai et al. is omitted due to protocol differences, yet the abstract positions the work as advancing the state-of-the-art without this validation. The KITTI results (Table 3) show the method handles sparse depth reasonably, but the PSNR of 23.77 dB at 0.580 bpp suggests significant degradation in outdoor scenarios compared to indoor benchmarks.
Reproducibility is severely hindered by the "available upon reasonable request" code policy rather than open-source release. While training hyperparameters are specified (batch size 4, 100 epochs, learning rate $1\times 10^{-4}$, $\lambda$ varying for rate control), critical architectural details rely on external citation of Liu et al.'s TCM without explicit reproduction of the specific configuration. The ScanNet v2 subset uses "approximately 25,000" frames without specifying the exact scene IDs or frame indices, making exact dataset replication impossible. Inference timing results (Table 7) are reported on an RTX 4080 SUPER without specifying precision (FP32/FP16) or CUDA/cuDNN versions.
We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first converted losslessly to a conventional 3-channel image representation using a method inspired by sinusoidal fringe-pattern profilometry systems; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel image using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
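The first two stages described in the abstract (MWD encoding followed by global 4-bit quantization) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the commonly cited MWD form with a sine channel, a cosine channel, and a normalised coarse-depth channel; it treats `period` as a fringe period in depth units and `z_max` as the depth range; and it quantizes floating-point channels uniformly. All function and parameter names are hypothetical, and the learned Transformer-CNN codec is omitted.

```python
import numpy as np


def mwd_encode(depth, z_max, period):
    """Map depth in [0, z_max] to three channels in [0, 1].

    Assumed MWD form: sine fringe, cosine fringe, normalised coarse depth.
    """
    z = np.asarray(depth, dtype=np.float64)
    phase = 2.0 * np.pi * z / period
    i_sin = 0.5 + 0.5 * np.sin(phase)
    i_cos = 0.5 + 0.5 * np.cos(phase)
    i_coarse = z / z_max
    return np.stack([i_sin, i_cos, i_coarse], axis=-1)


def quantize(channels, bits):
    """Uniformly quantize values in [0, 1] to 2**bits levels per channel."""
    levels = (1 << bits) - 1
    return np.round(channels * levels) / levels


def mwd_decode(channels, z_max, period):
    """Recover depth: wrapped phase gives fine depth, coarse channel unwraps it."""
    i_sin, i_cos, i_coarse = np.moveaxis(np.asarray(channels), -1, 0)
    # Wrapped phase in [0, 2*pi) from the two fringe channels.
    phase = np.mod(np.arctan2(i_sin - 0.5, i_cos - 0.5), 2.0 * np.pi)
    fine = phase / (2.0 * np.pi) * period
    # Integer fringe order chosen so the result agrees with the coarse channel.
    order = np.round((i_coarse * z_max - fine) / period)
    return fine + order * period


# Round trip on a synthetic depth ramp (hypothetical range and period).
depth = np.linspace(0.0, 64.0, 1000).reshape(20, 50)
encoded = mwd_encode(depth, z_max=64.0, period=8.0)
decoded_4bit = mwd_decode(quantize(encoded, bits=4), z_max=64.0, period=8.0)
print("max abs error after 4-bit quantization:", np.max(np.abs(decoded_4bit - depth)))
```

In this sketch the coarse channel only has to be accurate to within half a fringe period, because the sine and cosine channels carry the fine detail; that split is what lets the representation tolerate coarse quantization, and it also shows why the choice of period relative to the depth range matters.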