Feed - arxlens

TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints

cs.LG cs.CY Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty · Mar 23, 2026

Federated learning enables privacy-preserving medical AI but struggles with unreliable uncertainty estimates when clinical data is heterogeneous and imbalanced across sites. TrustFed addresses this by introducing representation-aware conformal prediction, which assigns test samples to calibration clients based on feature-space similarity and aggregates local thresholds via a soft-nearest strategy to provide finite-sample coverage guarantees without centralizing raw data. Validated on over 430,000 images across six distinct imaging modalities, the work advances federated learning from privacy-preserving training toward clinically trustworthy deployment with statistically calibrated uncertainty.

Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
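The calibration idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a standard split-conformal score (one minus the predicted probability of the true class), one threshold per client, and a softmax over negative squared distances to hypothetical client centroids as the "soft-nearest" weighting. The function names, the `temperature` parameter, and the centroid representation are illustrative assumptions, and the sketch makes no claim about how the paper's coverage guarantee is obtained.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic
    if k > n:
        # Too few calibration points for this alpha; scores lie in [0, 1],
        # so a threshold of 1.0 keeps every class in the prediction set.
        return 1.0
    return sorted(scores)[k - 1]

def soft_nearest_weights(feature, centroids, temperature=1.0):
    """Assumed weighting: softmax over negative squared distances to centroids."""
    d2 = [sum((f - c) ** 2 for f, c in zip(feature, cen)) for cen in centroids]
    d_min = min(d2)  # subtract the minimum for numerical stability
    raw = [math.exp(-(d - d_min) / temperature) for d in d2]
    total = sum(raw)
    return [r / total for r in raw]

def prediction_set(probs, feature, centroids, thresholds, temperature=1.0):
    """Classes whose score 1 - p falls under the weighted client threshold."""
    weights = soft_nearest_weights(feature, centroids, temperature)
    q = sum(w * t for w, t in zip(weights, thresholds))
    return [k for k, p in enumerate(probs) if 1.0 - p <= q]

# Toy usage: two clients with made-up calibration scores and centroids.
thresholds = [
    conformal_threshold([0.1, 0.2, 0.3], alpha=0.5),
    conformal_threshold([0.5, 0.7, 0.9], alpha=0.5),
]
centroids = [[0.0, 0.0], [10.0, 10.0]]
print(prediction_set([0.5, 0.45, 0.05], [5.0, 5.0], centroids, thresholds))
```

Only per-client thresholds and centroids are shared in this sketch, which matches the abstract's constraint of not centralizing raw data. A smaller `temperature` pushes the weighting toward a hard nearest-client assignment.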

Triangulating Temporal Dynamics in Multilingual Swiss Online News

cs.CL cs.CY Bros Victor, Dufraisse Evan, Popescu Adrian et al. · Mar 23, 2026

This paper analyzes temporal dynamics in Swiss digital news across French, German, and Italian language regions using a triangulated methodology that combines quantitative NLP with qualitative interpretation. The authors process 1.7 million articles to study how different event types—Brexit, Swiss Wolf, Christmas, and the British Royal Family—are covered across linguistic boundaries, introducing domestication profiles and proximity salience ratios to quantify cultural proximity effects.

Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country's three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.

Engineering Distributed Governance for Regional Prosperity: A Socio-Technical Framework for Mitigating Under-Vibrancy via Human Data Engines

cs.CY cs.LG Amil Khanzada, Takuji Takemoto · Mar 23, 2026

This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework tackling 'under-vibrancy'—a condition of low visitor density suppressing economic activity—in declining regions like Fukui, Japan. Contrasting with overtourism literature, it integrates Google Business Profile search intent, Japan Meteorological Agency micro-climate data, edge-AI cameras, and 97,719 survey responses to forecast tourism flows and quantify economic leakage. The work promises algorithmic governance via 'dual-nudge' interventions to redirect visitors and coordinate merchant behavior, backed by claims of $R^2=0.810$ explanatory power.

Most research in urban informatics and tourism focuses on mitigating overtourism in dense global cities. However, for regions experiencing demographic decline and structural stagnation, the primary risk is "under-vibrancy", a condition where low visitor density suppresses economic activity and diminishes satisfaction. This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework previously validated in biological crisis management, and adapts it for regional economic flow optimization. Using high-granularity data from Japan's least-visited prefecture (Fukui), we utilize an AI-driven decision support system (DSS) to analyze two datasets: a raw Fukui spending database (90,350 records) and a regional standardized sentiment database (97,719 responses). The system achieves in-sample explanatory power of 81% (R^2 = 0.810) and out-of-sample predictive performance of 68% (R^2 = 0.683). We quantify an annual opportunity gap of 865,917 unrealized visits, equivalent to approximately 11.96 billion yen (USD 76.2 million) in lost revenue. We propose a dual-nudge governance architecture leveraging the DHDE to redistribute cross-prefectural flows and reduce economic leakage.
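For readers comparing the two reported R^2 figures, the sketch below shows the usual coefficient of determination, computed once on the data a model was fitted to (in-sample) and once on held-out data (out-of-sample). The toy numbers and the `r_squared` helper are illustrative assumptions and are not taken from the paper's datasets or code. The last lines divide the abstract's own totals, which gives roughly 13,800 yen per unrealized visit and an implied rate of about 157 yen per US dollar.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Toy data (made up): predictions on the fitting set and on a held-out set.
train_true, train_pred = [10.0, 12.0, 15.0, 19.0], [10.5, 11.5, 15.5, 18.5]
test_true, test_pred = [11.0, 14.0, 20.0], [12.5, 13.0, 17.5]

print("in-sample R^2:", round(r_squared(train_true, train_pred), 3))
print("out-of-sample R^2:", round(r_squared(test_true, test_pred), 3))

# Figures quoted in the abstract, used only for a consistency check.
lost_revenue_yen = 11.96e9
unrealized_visits = 865_917
lost_revenue_usd = 76.2e6

print("yen per unrealized visit:", round(lost_revenue_yen / unrealized_visits))
print("implied yen per USD:", round(lost_revenue_yen / lost_revenue_usd, 1))
```

A gap between the two values, as in the reported 0.810 versus 0.683, is the usual sign that part of the in-sample fit does not carry over to unseen data.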

Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power

cs.CL cs.CY Bros Victor, Barbini Matilde, Gerard Patrick et al. · Mar 23, 2026

This paper investigates how interrogative stances function as markers of voice and power in French-language digital news. Analyzing over 1.2 million articles from 24 outlets (2023–2024) through a mixed-methods pipeline combining LLM pseudo-labeling and qualitative annotation, the authors operationalize pragmatic concepts like answerhood and dialogicity at scale. The study reveals that questions are sparse but structurally significant, predominantly serving framing functions rather than information-seeking, and centering elite actors over diffuse publics.

Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the "Politics of Questions" in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist's narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.

Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

cs.CL cs.AI cs.CY K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain et al. · Mar 22, 2026

This paper tackles the problem of measuring dialectal bias in LLMs for Bengali, a low-resource language with nine major regional variants. The authors propose a two-phase framework combining RAG-based translation to create dialectal benchmarks with an RLAIF-inspired evaluation protocol that uses CoT-first reasoning and multi-judge validation. They expose the catastrophic failure of traditional metrics like BLEU and WER for agglutinative dialectal Bengali, showing that LLM-as-judge better predicts human quality assessments.

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate standard Bengali questions into dialectal variants and gold-label them, using a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which, as correlation with human ratings confirms, outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.

WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making

cs.CY cs.AI Zongjie Li, Chaozheng Wang, Yuchong Xie et al. · Mar 22, 2026

WARBENCH is a benchmark for evaluating LLMs in military decision-making, addressing critical gaps in current frameworks by testing International Humanitarian Law (IHL) compliance, edge deployment constraints, fog-of-war robustness, and explicit reasoning. Using 136 high-fidelity scenarios derived from real post-WWII conflicts, the authors expose severe structural flaws: state-of-the-art models collapse under complex terrain and asymmetric force distributions, while edge-optimized models exhibit legal violation rates approaching 70%.

Large language models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blind spots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitations, lack robustness testing for fog of war, and inadequately evaluate explicit reasoning. To address these vulnerabilities, we present WARBENCH, a comprehensive evaluation framework establishing a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex terrain and high force asymmetry. Second, while state-of-the-art closed-source models maintain functional compliance, edge-optimized small models expose extreme operational risks with legal violation rates approaching 70 percent. Furthermore, models experience catastrophic performance degradation under 4-bit quantization and systematic information loss. Conversely, explicit reasoning mechanisms serve as highly effective structural safeguards against inadvertent violations. Ultimately, these findings demonstrate that current models remain fundamentally unready for autonomous deployment in high-stakes tactical environments.
