Learning to Optimize Joint Source and RIS-assisted Channel Encoding for Multi-User Semantic Communication Systems

cs.NI cs.LG Haidong Wang, Songhan Zhao, Bo Gu, Shimin Gong, Hongyang Du, Ping Wang · Mar 22, 2026
AI Review
Plain-language introduction

The paper addresses the scalability bottleneck in multi-user semantic communications by proposing JSRE (Joint Source and RIS-assisted channel Encoding), a framework that unifies all users under a single semantic encoder-decoder by embedding channel state information (CSI) into the encoding process. The core innovation leverages RIS phase shifts to create channel orthogonality while using CSI-conditioned semantic features to avoid per-user model training, coupled with a Truncated Deep Reinforcement Learning (T-DRL) algorithm that accelerates convergence via model caching and a surrogate similarity estimator. This matters because existing approaches like DeepMA require linearly growing model storage with user count, rendering them impractical for dense deployments.

Critical review
Verdict
Bottom line

The JSRE framework presents a compelling architectural advance by unifying multi-user semantic communication via CSI embedding rather than separate encoders per user, backed by a thoughtful T-DRL optimization approach that decouples user scheduling from model retraining through caching and LoRA fine-tuning. However, the reliance on an unverified exponential approximation for the similarity estimator's warm-start and the absence of theoretical conditions for when CSI provides sufficient semantic orthogonality leave key claims empirically supported but analytically unsubstantiated. While the numerical results demonstrate clear energy efficiency and scalability gains over DeepMA and NOMASC, reproducibility is hindered by the lack of released code and incomplete hyperparameter specifications.

What holds up

The unified semantic encoder-decoder design is a clear architectural improvement over per-user architectures. By embedding user-specific CSI $h_{r,k,c}$ into the encoding process, as formalized in equation (5), "$s_{r,k,c} = E_\theta(w_{r,k,c}, h_{r,k,c}, \beta_{r,k})$", the framework avoids the linear growth in model storage with the number of users and keeps FLOPs nearly constant, as shown in Fig. 9(a). The T-DRL ablation study in Table II shows that the model caching mechanism reduces training duration from 20 hours 18 minutes to 6 hours 11 minutes while maintaining comparable energy efficiency (5.52 suts/J with caching vs. 5.50 suts/J without), confirming that "avoiding redundant retraining... significantly accelerates convergence."

“$s_{r,k,c} = E_\theta(w_{r,k,c}, h_{r,k,c}, \beta_{r,k})$”
paper · Section IV, Equation (5)
“T-DRL w/o caching ... 20h 18m ... 5.50 ... T-DRL ... 6h 11m ... 5.52”
paper · Table II
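To put the Table II figures in proportion, the short calculation below converts the two quoted training durations into a speedup factor and a relative reduction. It uses only the four numbers quoted from Table II; the variable and function names are illustrative and do not come from the paper.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'Xh Ym' duration into whole minutes."""
    return hours * 60 + minutes


# Durations and energy efficiency as quoted from Table II of the paper.
without_cache = to_minutes(20, 18)  # T-DRL w/o caching
with_cache = to_minutes(6, 11)      # T-DRL (with caching)
ee_without, ee_with = 5.50, 5.52    # energy efficiency in suts/J

speedup = without_cache / with_cache          # how many times faster
reduction = 1 - with_cache / without_cache    # fraction of time saved
ee_change = (ee_with - ee_without) / ee_without

print(f"speedup:   {speedup:.2f}x")
print(f"reduction: {reduction:.1%}")
print(f"EE change: {ee_change:+.2%}")
```

Caching therefore cuts training time by roughly a factor of 3.3 (about 70%), while the energy efficiency differs by well under one percent.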
Main concerns

The theoretical justification for CSI providing sufficient orthogonality to distinguish users remains heuristic. While the paper asserts that "user-specific CSI ... can help differentiate users even when they transmit individual semantic feature symbols simultaneously," it provides no conditions under which this channel-semantic coupling guarantees separability, relying instead on empirical SSIM metrics (Fig. 6). Furthermore, the warm-start for the similarity estimator relies on the exponential approximation "$\xi_{r,k,c} = 1 - \exp\{-(k_1\gamma_{r,k,c} + b_1)(k_2\beta_{r,k} + b_2)\}$" from equation (21), which assumes a specific parametric relationship without validating that the constants $k_1, b_1, k_2, b_2$ remain stable across diverse image content or varying RIS configurations, potentially introducing systematic bias into the DRL reward estimation.

“user-specific CSI ... can help differentiate users even when they transmit individual semantic feature symbols simultaneously via a shared encoder-decoder model”
paper · Section IV, first paragraph
“$\xi_{r,k,c} = 1 - \exp\{-(k_1\gamma_{r,k,c} + b_1)(k_2\beta_{r,k} + b_2)\}$”
paper · Section V-B, Equation (21)
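The sensitivity concern about equation (21) can be made concrete with a minimal sketch. The function below evaluates the quoted exponential form directly. Every numeric value is a hypothetical placeholder, since the review does not report the fitted constants, and the reading of $\gamma_{r,k,c}$ as a channel-quality term and $\beta_{r,k}$ as the compression ratio is an assumption made for illustration.

```python
import math


def similarity_warm_start(gamma, beta, k1, b1, k2, b2):
    """Evaluate xi = 1 - exp(-(k1*gamma + b1) * (k2*beta + b2)), as in Eq. (21)."""
    return 1.0 - math.exp(-(k1 * gamma + b1) * (k2 * beta + b2))


# Hypothetical constants and operating point (NOT taken from the paper).
k1, b1, k2, b2 = 0.05, 0.5, 2.0, 0.1
gamma, beta = 25.0, 0.25  # assumed channel-quality value and compression ratio

baseline = similarity_warm_start(gamma, beta, k1, b1, k2, b2)

# Perturb k1 by +/-20% to mimic constants drifting across image content
# or RIS configurations, and report the shift in the estimated similarity.
for scale in (0.8, 1.2):
    shifted = similarity_warm_start(gamma, beta, k1 * scale, b1, k2, b2)
    print(f"k1 x {scale}: xi = {shifted:.4f} (delta {shifted - baseline:+.4f})")
```

If a sweep like this, run with the paper's actual fitted constants, moved the estimate by more than the gap between competing actions, the bias in the DRL reward would matter in practice.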
Evidence and comparison

The numerical evaluation shows that JSRE and DeepMA achieve higher coding efficiency (Fig. 5(a)), with the advantage growing as the user count increases, which supports the scalability claims. However, all subsequent evaluations fix the channel SNR at 25 dB, where performance saturates (Fig. 5(b)), in order to isolate source-coding effects; this may obscure channel-induced fairness variations in practical dynamic environments. The fairness analysis in Fig. 9(b) employs an aggregate metric (total efficiency divided by user count) that could mask transient user starvation, and the SSIM evaluation on Kodak24, while standard, is only a limited test of generalization from the ImageNet training corpus.

“JSRE and DeepMA achieve higher coding efficiency, and this advantage becomes more pronounced as the number of users grows”
paper · Section VI-A, Fig. 5(a)
“we fix the channel SNR at 25 dB in all subsequent evaluations”
paper · Section VI-A
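The point about the aggregate fairness metric can be illustrated with a toy example. The sketch below compares two hypothetical per-user efficiency allocations that share the same per-user average (total divided by user count) but differ sharply in how evenly users are served. The numbers are invented for illustration, and Jain's fairness index and the minimum are used only as examples of distribution-aware metrics; the paper is not stated to report either.

```python
def per_user_average(efficiencies):
    """Aggregate metric: total efficiency divided by user count."""
    return sum(efficiencies) / len(efficiencies)


def jain_index(efficiencies):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    total = sum(efficiencies)
    squares = sum(e * e for e in efficiencies)
    return total * total / (len(efficiencies) * squares)


# Hypothetical per-user efficiencies (NOT from the paper).
balanced = [1.0, 1.0, 1.0, 1.0]  # every user served equally
starved = [2.0, 2.0, 0.0, 0.0]   # two users receive nothing

for name, alloc in (("balanced", balanced), ("starved", starved)):
    print(
        f"{name}: average={per_user_average(alloc):.2f}, "
        f"min={min(alloc):.2f}, jain={jain_index(alloc):.2f}"
    )
```

Both allocations have the same per-user average, yet the minimum and Jain's index separate them immediately, which is why a per-user distribution or a worst-case statistic would strengthen the fairness analysis.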
Reproducibility

Reproducing the results would require significant effort due to the absence of publicly available code and incomplete experimental documentation. While the paper specifies using an ImageNet subset of "310,000 images" and lists physical layer parameters in Table I, critical implementation details—including the LoRA rank for model caching, PPO hyperparameters (learning rate, clipping coefficient $\epsilon$ from equation (17)), and the specific architecture dimensions of the Transformer-based actor network—are omitted. The hardware specification cites an "NVIDIA RTX 2080 Ti GPU" but does not clarify whether the reported training times in Table II involve single-GPU training or distributed implementations, nor does it specify the software framework or version.

“trained on a subset of ImageNet [39] consisting of 310,000 images”
paper · Section VI, first paragraph
“$J(\theta_a) = \mathbb{E}_t[\min(\rho_t \hat{A}, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A})]$”
paper · Section V-A, Equation (17)
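For readers unfamiliar with the objective in equation (17), the following is a minimal pure-Python sketch of the clipped surrogate, averaged over a batch of samples. It shows why the omitted clipping coefficient $\epsilon$ matters for reproduction. The value $\epsilon = 0.2$ and the sample ratios and advantages are hypothetical, since the paper does not report them according to this review.

```python
def clipped_surrogate(ratios, advantages, eps):
    """Batch mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A), as in Eq. (17)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
        terms.append(min(rho * adv, clipped_rho * adv))
    return sum(terms) / len(terms)


# Hypothetical probability ratios and advantage estimates (NOT from the paper).
ratios = [1.5, 0.5, 1.05]
advantages = [1.0, -1.0, 2.0]

# A different eps changes which samples are clipped, and hence the objective.
for eps in (0.1, 0.2, 0.3):
    print(f"eps={eps}: J = {clipped_surrogate(ratios, advantages, eps):.4f}")
```

With a positive advantage, the ratio's contribution is capped at $1+\epsilon$; with a negative advantage, the minimum selects the more pessimistic clipped term. Different choices of $\epsilon$ therefore yield different policy updates, so the training curves cannot be matched without it.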
Abstract

In this paper, we explore a joint source and reconfigurable intelligent surface (RIS)-assisted channel encoding (JSRE) framework for multi-user semantic communications, where a deep neural network (DNN) extracts semantic features for all users and the RIS provides channel orthogonality, enabling a unified semantic encoding-decoding design. We aim to maximize the overall energy efficiency of semantic communications across all users by jointly optimizing the user scheduling, the RIS's phase shifts, and the semantic compression ratio. Although this joint optimization problem can be addressed using conventional deep reinforcement learning (DRL) methods, evaluating semantic similarity typically relies on extensive real environment interactions, which can incur heavy computational overhead during training. To address this challenge, we propose a truncated DRL (T-DRL) framework, where a DNN-based semantic similarity estimator is developed to rapidly estimate the similarity score. Moreover, the user scheduling strategy is tightly coupled with the semantic model configuration. To exploit this relationship, we further propose a semantic model caching mechanism that stores and reuses fine-tuned semantic models corresponding to different scheduling decisions. A Transformer-based actor network is employed within the DRL framework to dynamically generate action space conditioned on the current caching state. This avoids redundant retraining and further accelerates the convergence of the learning process. Numerical results demonstrate that the proposed JSRE framework significantly improves the system energy efficiency compared with the baseline methods. By training fewer semantic models, the proposed T-DRL framework significantly enhances the learning efficiency.
