SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

cs.CL Migyeong Kang, Jihyun Kim, Hyolim Jeon, Sunwoo Hwang, Jihyun An, Yonghoon Kim, Haewoon Kwak, Jisun An, Jinyoung Han · Mar 23, 2026

Plain-language introduction

Psychiatric symptom identification from social media requires expensive expert annotation and suffers from inconsistent labeling across platforms. SynSym addresses this by using GPT-4o to generate synthetic training data across four stages: symptom concept expansion, dual-style (clinical/colloquial) expression generation, clinically-grounded multi-symptom composition, and LLM-based quality filtering. The framework produces 18,254 samples covering 14 DSM-5 symptoms, enabling models to match real-data performance and generalize across diverse social media platforms.
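The four stages described above can be pictured as a simple data flow. The following is a minimal sketch, not the authors' implementation: the `llm` callable, the prompt strings, the `keep` filter, and the record layout are all hypothetical placeholders, and the real prompt templates are the ones the paper gives in its Appendix A.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a prompt goes in, a list of generated strings comes out.
LLM = Callable[[str], List[str]]
Sample = Dict[str, object]


def run_pipeline(
    symptoms: Sequence[str],
    cooccurring_pairs: Sequence[Tuple[str, str]],
    llm: LLM,
    keep: Callable[[Sample], bool],
) -> List[Sample]:
    """Illustrative four-stage flow; every prompt string here is a placeholder."""
    # Stage 1: expand each symptom into sub-concepts.
    sub_concepts = {s: llm(f"expand:{s}") for s in symptoms}

    # Stage 2: generate expressions in two styles for every sub-concept.
    samples: List[Sample] = []
    for symptom, concepts in sub_concepts.items():
        for concept in concepts:
            for style in ("clinical", "colloquial"):
                for text in llm(f"express:{style}:{concept}"):
                    samples.append({"text": text, "labels": [symptom]})

    # Stage 3: compose multi-symptom samples from pairs assumed to co-occur.
    for a, b in cooccurring_pairs:
        for text in llm(f"compose:{a}+{b}"):
            samples.append({"text": text, "labels": [a, b]})

    # Stage 4: quality filtering (in the paper this is LLM-based; here a callback).
    return [s for s in samples if keep(s)]


def fake_llm(prompt: str) -> List[str]:
    # Stand-in for a real model call so the sketch runs offline.
    return [prompt.upper()]


all_samples = run_pipeline(
    ["insomnia", "fatigue"], [("insomnia", "fatigue")], fake_llm, lambda s: True
)
multi_only = run_pipeline(
    ["insomnia", "fatigue"],
    [("insomnia", "fatigue")],
    fake_llm,
    lambda s: len(s["labels"]) == 2,
)
```

Passing the model and the filter in as callables keeps the stage ordering visible without committing to any particular API client.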

Critical review

Verdict

Bottom line

SynSym is a methodologically sound contribution to clinical NLP that convincingly demonstrates synthetic data can substitute for expensive expert annotations in multi-label psychiatric symptom detection. The cross-dataset evaluation and ablation studies are rigorous. Two limitations remain: reliance on the proprietary GPT-4o hinders reproducibility, and the deliberate exclusion of figurative language limits applicability to the indirect symptom expressions common on platforms like Twitter.

“Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data”

paper · Abstract

“This work is significant in that it represents the first attempt to apply synthetic data to the task of symptom prediction”

paper · Section 6

What holds up

The dual-style generation strategy (clinical and colloquial) and the incorporation of clinical co-occurrence patterns are well-motivated design choices; enforcing a clinical style in particular counters the tendency of LLMs to avoid sensitive terminology. Table 4 shows that models trained solely on SynSym data achieve Macro-F1 comparable to MentalBERT trained on real data (e.g., 0.778 vs 0.811 on PsySym), with further gains when combined with real data. Expert validation by two psychiatrists yielded high scores (Expert 1: 4.61/5 for sub-concepts and 4.99/5 for expressions), with inter-rater agreement above 94% in both cases, supporting clinical validity.

“Expanded Sub-Concepts: Expert 1: 4.61, Expert 2: 4.57, Agreement: 94.86%; Synthetic Expressions: Expert 1: 4.99, Expert 2: 5.00, Agreement: 99.66%”

paper · Table 3

“When prompted to generate direct statements, LLMs often avoid clinically explicit or sensitive terms... To address this, we enforce the generation of clinical expressions”

paper · Section 3.2

Main concerns

The framework deliberately excludes metaphorical and figurative expressions, which constitute significant portions of datasets like D2S, in order to preserve label reliability. This limits deployment on platforms where users express symptoms indirectly. The paper claims novelty as the 'first attempt to apply synthetic data to the task of symptom prediction,' yet prior work (Ghanadian et al., 2024; Vedanta and Rao, 2024) already used LLMs to generate synthetic mental health data; the distinction rests on multi-label granularity, which should be emphasized more clearly. Validation relied on only two psychiatrists reviewing 300 expressions, raising questions about scalability for larger corpora. Additionally, the evaluation relies on benchmark datasets with known reliability issues: Milintsevich et al. found remarkably low agreement (κ=0.09) between PRIMATE's crowd-sourced labels and professional re-annotations.

“models often struggle to generalize to new platforms or unseen styles of symptom expressions”

paper · Section 1

“PRIMATE Labels against 'answerable': F1=.15... Our MHP reannotated 170 posts from the PRIMATE dataset... We observe a high number of false positives in the PRIMATE labels”

Milintsevich et al. · Table 3

“D2S is characterised by more figurative or abstract expressions... which were excluded from our synthetic data generation due to concerns that such expressions may blur class boundaries”

paper · Section 4.2.3
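To make the κ=0.09 figure in this section easier to interpret, here is a small sketch of Cohen's kappa next to raw percent agreement. It uses the standard textbook definition on made-up toy labels; how Milintsevich et al. or the SynSym authors computed their agreement figures is not stated here, so the correspondence is an assumption.

```python
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    a, b = list(a), list(b)
    p_observed = percent_agreement(a, b)
    # Chance agreement: sum over labels of the product of each annotator's
    # marginal probability of using that label.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if p_expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels (not from any dataset): 1 = symptom present, 0 = absent.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
raw = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

On these toy labels the annotators agree on 75% of items, yet kappa is only 0.5 because half of that agreement is expected by chance. A value near 0.09 therefore indicates agreement barely better than chance.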

Evidence and comparison

Comparisons to BERT, DeBERTa, MentalBERT, and GPT-4o prompting baselines are fair and consistently reported with confidence intervals across 5-fold cross-validation. The cross-dataset generalization experiment (Table 5) is particularly compelling: SynSym-trained models outperform multi-source training and achieve strong zero-shot transfer across PsySym, PRIMATE, and D2S, supporting claims of style invariance. The ablation study (Table 6) confirms that removing clinical co-occurrence knowledge (CK), dual-style generation (DU), or symptom expansion (SE) degrades performance. However, the work could benefit from comparison with other LLM-based augmentation techniques beyond back-translation.

“SynSym (ours): PsySym Rec. 0.732, F1 0.778; PRIMATE Rec. 0.712, F1 0.557; D2S Rec. 0.525, F1 0.518”

paper · Table 5

“w/o EV + CK + DU + SE: PsySym F1 0.812, PRIMATE F1 0.625, D2S F1 0.585”

paper · Table 6
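Since the comparisons above are reported in Macro-F1 over multiple symptom labels, a short sketch of that metric may help. It assumes the usual definition (per-label F1, then an unweighted mean) and scores a label with no positives and no predictions as 0.0; the paper's exact averaging and edge-case handling are not given here, and the example arrays are invented.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 for binary multi-label indicator rows."""
    n_labels = len(y_true[0])
    per_label: List[float] = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no true or predicted positives scores 0.0.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / n_labels


# Three posts, two symptom labels (toy data).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(y_true, y_pred)
```

Because every label contributes equally, a rare symptom that the model misses pulls the score down as much as a common one, which is why Macro-F1 suits imbalanced symptom distributions.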

Reproducibility

The authors commit to releasing code and synthetic datasets, and report detailed hyperparameters (AdamW lr 5e-5/3e-5, batch size 32/64, max length 512) and prompt templates in Appendix A. However, full reproduction is hindered by dependence on GPT-4o, which is proprietary and subject to versioning drift. The paper uses temperature 0.0 for the expansion stage to make outputs as stable as possible, but 0.8 for generation, so regenerated samples will differ between runs. No API version date is specified, and generating 18,254 samples requires substantial API credits. The evaluation also relies on benchmark datasets with inconsistent annotation schemes, requiring complex remapping (Appendix C.1) that introduces additional variability.

“AdamW optimizer with default weight decay of 0.01 and learning rate of 5e-5... temperature parameter to 0.0... increased the temperature to 0.8”

paper · Appendix C.2

“To facilitate cross-dataset evaluation, we remapped DSM-5–based symptom labels to PHQ-9 categories... three exceptions were handled with specific rationale”

paper · Appendix C.1

Abstract

Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
