Feature Incremental Clustering with Generalization Bounds

math.ST cs.LG stat.TH Jing Zhang, Chenping Hou · Mar 23, 2026

What it does

Why it matters

This paper proposes four k-means-based algorithms (FIC-FT, FIC-DR, FIC-DA, FIC-MR) tailored to different data-access constraints, from full historical access to model-only reuse. The core theoretical contribution establishes generalization...

Main concern

Community signal

0 up · 0 down

AI Review AI reviewed

Plain-language introduction

Feature incremental clustering addresses dynamic scenarios where data arrives in expanding feature spaces—such as activity recognition systems that acquire new sensors over time. This paper proposes four k-means-based algorithms (FIC-FT, FIC-DR, FIC-DA, FIC-MR) tailored to different data-access constraints, from full historical access to model-only reuse. The core theoretical contribution establishes generalization error bounds for all four settings, revealing that model reuse (FIC-MR) can achieve a fast $\tilde{\mathcal{O}}(1/n_2)$ convergence rate when the pre-trained model aligns well with the current distribution.

Critical review

Verdict

Bottom line

This paper presents the first comprehensive theoretical treatment of feature incremental clustering, a practically relevant setting where feature spaces expand over time. By deriving algorithm-specific generalization bounds, the authors successfully identify the critical factors governing each method: clustering capability of old features for FIC-FT, reconstruction distribution discrepancy for FIC-DR, sample size limitations for FIC-DA, and pre-trained model quality for FIC-MR. The theoretical framework is notably anchored in a novel Rademacher complexity analysis for k-means (Theorem 1) that supports the subsequent FIC bounds.

“Consider training the $k$-means clustering model... with probability at least $1-\delta$, we have $\mathcal{L}_{\mathbb{Q}}(\mathbf{U})\leq\hat{\mathcal{L}}_{D}(\mathbf{U})+\frac{\alpha+(2\sqrt{2}+3\beta)\sqrt{\log(1/\delta)}}{\sqrt{n}}+\frac{6\gamma\log(1/\delta)}{n}$”

Zhang & Hou · Theorem 1, Section IV

What holds up

The problem formulation is novel and well-motivated by real-world applications such as sensor-based activity recognition. The theoretical analysis for FIC-MR is particularly strong: when the pre-trained centers $\mathbf{U}^0$ align with the current distribution (i.e., $\Gamma(\mathbf{U}^0) \to 0$), the generalization bound tightens to a fast rate of $\tilde{\mathcal{O}}(1/n_2)$ instead of the standard $\tilde{\mathcal{O}}(\sqrt{k/n_2})$. This provides a rigorous justification for model reuse in incremental learning. The experimental evaluation across nine diverse datasets consistently validates the theoretical predictions, showing FIC-MR outperforms alternatives especially when current-stage data is scarce.

“Notably, when the cluster centers $\mathbf{U}^{0}$ align well with the current distribution (i.e., $\Gamma(\mathbf{U}^{0})\to 0$), the bound in ([16]) tightens to a fast rate of $\tilde{\mathcal{O}}(\frac{1}{n_{2}})$.”

Zhang & Hou · Remark 5, Section IV

“FIC-MR consistently outperforms all compared algorithms in most scenarios. For instance, on the Indoor dataset with RA=10%, it achieves over a 5% ACC improvement over the best baseline.”

Zhang & Hou · Section V-B

Main concerns

The reconstruction mechanism in FIC-DR relies on a heuristic k-means-based completion objective (Eq. 5) that lacks theoretical guarantees for accurate feature recovery; consequently, the $\mathcal{Y}$-discrepancy term in Theorem 3 may be uncontrollably large in practice. The generalization bounds contain unspecified problem-dependent constants (e.g., $V$, $\gamma$) that limit their quantitative utility for hyperparameter selection. Additionally, the analysis assumes bounded features $\|\mathbf{x}\| \leq \gamma$ and relies on correlations between old and new features for reconstruction, assumptions that may fail in high-dimensional or adversarial settings. The paper also omits comparison to modern deep clustering baselines, restricting evaluation to k-means variants.

“$\min_{\mathbf{U}\in\mathcal{H}^{k},\mathbf{X}_{\text{all}}}\frac{1}{n_{1}+n_{2}}\sum_{i=1}^{n_{1}+n_{2}}\min_{s=1,\ldots,k}\lVert\mathbf{x}_{i}-\mathbf{u}_{s}\rVert^{2} \quad \mathrm{s.t.} \quad \mathbf{X}_{1}^{(1)}=\mathbf{X}_{1}^{(1)},\ \ \mathbf{X}_{2}=\mathbf{X}_{2}$”

Zhang & Hou · Section III-B, Eq. (5)

“$\operatorname{disc}_{\mathcal{Y}}(\tilde{D}_{p_{\bm{\omega}}},D_{c})=\sup_{\mathbf{U}\in\mathcal{H}^{k}}\lvert\hat{\mathcal{L}}_{\tilde{D}_{p_{\bm{\omega}}}}(\mathbf{U})-\hat{\mathcal{L}}_{D_{c}}(\mathbf{U})\rvert$”

Zhang & Hou · Theorem 3, Section IV

“If $\lVert\mathbf{x}\rVert\leq\gamma$ for $\forall\mathbf{x}\in\mathcal{X}$”

Zhang & Hou · Theorem 1, Section IV

Evidence and comparison

The empirical results generally support the theoretical hierarchy of methods, with FIC-MR > FIC-DA > FIC-DR across most datasets, validating that model reuse is preferable to distribution-sensitive reconstruction when historical data is unavailable. However, the evaluation relies on external validity metrics (ACC, NMI, F-score) that require ground-truth labels, which creates a methodological tension since the proposed algorithms optimize an unsupervised squared-norm criterion and the generalization bounds are derived for expected clustering risk. The comparison to baselines (KM-P1 and KM-C1) is fair for isolating the value of incremental features, but the absence of streaming clustering or deep incremental clustering baselines limits the assessment of practical competitiveness.

“FIC-DR exhibits unstable and inferior performance compared to FIC-MR. This degradation likely stems from noise and distribution mismatch during feature reconstruction, empirically validating our theoretical stability analysis of FIC-DR.”

Zhang & Hou · Section V-B

Reproducibility

The experimental protocol is described in sufficient detail to permit replication: nine public datasets are employed with a standardized split (50% previous, 20% test, remainder current), and the single hyperparameter $\theta$ in FIC-MR is searched over a disclosed grid $\{10^0, 10^1, 10^2, 10^3\}$. However, no code or implementation has been made publicly available, and the FIC-DR reconstruction procedure lacks algorithmic specifics (e.g., convergence criteria, initialization of missing features). Furthermore, the experimental results report means and standard deviations over only ten runs, which may be insufficient for establishing statistical significance given the variance observed in Tables II and III.

“For all compared algorithms, we only need to tune a single parameter $\theta$ in FIC-MR, searching over the grid $\{10^{0},10^{1},10^{2},10^{3}\}$. Experimental results are reported as the mean (standard deviation) over 10 independent runs.”

Zhang & Hou · Section V-A2

Abstract

In many learning systems, such as activity recognition systems, as new data collection methods continue to emerge in various dynamic environmental applications, the attributes of instances accumulate incrementally, with data being stored in gradually expanding feature spaces. How to design theoretically guaranteed algorithms to effectively cluster this special type of data stream, commonly referred to as activity recognition, remains unexplored. Compared to traditional scenarios, we will face at least two fundamental questions in this feature incremental scenario. (i) How to design preliminary and effective algorithms to address the feature incremental clustering problem? (ii) How to analyze the generalization bounds for the proposed algorithms and under what conditions do these algorithms provide a strong generalization guarantee? To address these problems, by tailoring the most common clustering algorithm, i.e., $k$-means, as an example, we propose four types of Feature Incremental Clustering (FIC) algorithms corresponding to different situations of data access: Feature Tailoring (FT), Data Reconstruction (DR), Data Adaptation (DA), and Model Reuse (MR), abbreviated as FIC-FT, FIC-DR, FIC-DA, and FIC-MR. Subsequently, we offer a detailed analysis of the generalization error bounds for these four algorithms and highlight the critical factors influencing these bounds, such as the amounts of training data, the complexity of the hypothesis space, the quality of pre-trained models, and the discrepancy of the reconstruction feature distribution. The numerical experiments show the effectiveness of the proposed algorithms, particularly in their application to activity recognition clustering tasks.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

Challenges are public to read, but only signed-in members can post them. Your challenge text is stored with your account for moderation, but usernames are not shown in the public thread.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.