Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing

cs.CV Yaelle Zribi (ENC), Florian Cafiero (ENC, LRE), Vincent Lépinay, Chahan Vidal-Gorène (CJM, LIPN) · Mar 23, 2026



Plain-language introduction

Stand-up comedy depends as much on timing and embodied presence as on verbal content, yet computational humor has largely focused on text alone. This paper introduces TIC-TALK, a multimodal corpus of 90 professionally filmed Netflix specials (2015–2024) with temporally aligned annotations for language, gesture, and audience response. The processing pipeline combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose keypoint extraction, all aligned hierarchically without resampling. The authors validate the resource through corpus-level findings including a negative correlation between kinetic energy and laughter rate ($r = -0.75$), consistent with a stillness-before-punchline pattern, and through a short-horizon laughter prediction benchmark.

Critical review

Verdict

Bottom line

The paper presents a technically sound and well-documented pipeline for multimodal performance analysis that successfully operationalizes timing, gesture, and audience response as measurable signals. The hierarchical temporal alignment strategy preserves native signal granularity ($\Delta t_{laugh} = 0.8$ s, $\Delta t_{pose} = 1$ s, $\Delta t_{topic} = 60$ s) and enables meaningful cross-modal analyses. However, the central claim about kinetic energy predicting laughter is correlational and potentially confounded by filming conventions and the artefactual T6 topic, while the modest performance of the prediction benchmark (multimodal AUROC 0.647 vs. history-only 0.643) suggests limited practical utility for anticipatory models. The exclusive reliance on professionally edited Netflix specials also embeds platform-specific conventions that constrain generalizability to live or amateur performance contexts.

“kinetic energy negatively predicts audience laughter rate ($r = -0.75$, $N = 24$), consistent with a stillness-before-punchline pattern”

Zribi et al., TIC-TALK · Abstract

“filming conventions are embedded in the signal”

Zribi et al., TIC-TALK · Section 4 (Discussion)

What holds up

The hierarchical temporal alignment without resampling is architecturally robust, correctly handling varying granularities through strict temporal containment rather than interpolation. The three kinematic signals derived from raw 17-joint coordinates (arm spread $A_t$, kinetic energy $E_t$, and trunk lean $\theta_t$) are conceptually well-motivated proxies for performance dynamics that avoid premature discretization. Documentation of model training is exemplary, including specific BERTopic hyperparameters (UMAP with $\mathtt{n\_neighbors} = 15$, HDBSCAN with $\mathtt{min\_cluster\_size} = 15$), YOLOv8-cls training on 594 frames achieving $F_1 = 0.91$, and the three-step outlier reduction procedure for topic assignment with centroid-based reassignment thresholded at cosine similarity $\geq 0.30$.
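To make the three kinematic signals concrete, here is a minimal Python sketch of how quantities like $A_t$, $E_t$, and $\theta_t$ could be computed from one frame of 17 two-dimensional keypoints. The review does not reproduce the paper's formulas, so every definition below is an assumption chosen for illustration: the COCO-style joint indices, the normalization of arm spread by shoulder width, the unit mass per joint, and the sign convention for lean.

```python
import math

# Assumed COCO-style indices into the 17-joint list (not confirmed by the source).
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP = 11, 12


def _mid(p, q):
    """Midpoint of two (x, y) points."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def arm_spread(frame):
    """Wrist-to-wrist distance divided by shoulder width (assumed definition)."""
    shoulder_width = math.dist(frame[L_SHOULDER], frame[R_SHOULDER])
    if shoulder_width == 0:
        return 0.0
    return math.dist(frame[L_WRIST], frame[R_WRIST]) / shoulder_width


def kinetic_energy(prev_frame, frame, dt=1.0):
    """Sum over joints of squared speed, unit mass per joint (assumed definition).

    dt defaults to 1.0 s to match the 1 fps pose sampling.
    """
    return sum((math.dist(p, q) / dt) ** 2 for p, q in zip(prev_frame, frame))


def trunk_lean(frame):
    """Signed angle in degrees between the hip-to-shoulder axis and image vertical."""
    sx, sy = _mid(frame[L_SHOULDER], frame[R_SHOULDER])
    hx, hy = _mid(frame[L_HIP], frame[R_HIP])
    # Image y grows downward, so the upward component is hy - sy.
    return math.degrees(math.atan2(sx - hx, hy - sy))
```

Because each function takes raw coordinates, alternative definitions can be swapped in without re-running pose extraction, which is the practical benefit of retaining keypoints rather than discretized gesture labels.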

“Signals are merged by hierarchical temporal containment without resampling”

Zribi et al., TIC-TALK · Figure 1 caption
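The containment idea quoted above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each laughter window and pose frame is represented by a single timestamp (its start), topic segments are treated as sorted, non-overlapping half-open intervals, and the dictionary keys `laughter` and `pose` are hypothetical names.

```python
from bisect import bisect_right


def assign_by_containment(segment_starts, segment_ends, timestamps):
    """Map each timestamp to the index of the segment [start, end) containing it.

    Returns None for timestamps that fall in no segment. Segments are assumed
    to be sorted by start time and non-overlapping.
    """
    out = []
    for t in timestamps:
        i = bisect_right(segment_starts, t) - 1
        out.append(i if i >= 0 and t < segment_ends[i] else None)
    return out


def nest(topic_bounds, laugh_times, pose_times):
    """Attach fine-grained timestamps to their parent topic, keeping native rates."""
    starts = [s for s, _ in topic_bounds]
    ends = [e for _, e in topic_bounds]
    topics = [
        {"start": s, "end": e, "laughter": [], "pose": []}
        for s, e in topic_bounds
    ]
    for key, times in (("laughter", laugh_times), ("pose", pose_times)):
        for t, idx in zip(times, assign_by_containment(starts, ends, times)):
            if idx is not None:
                topics[idx][key].append(t)
    return topics
```

No stream is interpolated onto another stream's clock: a 60 s topic simply owns whichever 0.8 s laughter windows and 1 s pose frames start inside it, so the counts per topic can differ between modalities.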

“Topic segments also store 384-dim sentence-BERT embeddings (all-MiniLM-L6-v2)”

Zribi et al., TIC-TALK · Abstract

Main concerns

The headline claim that kinetic energy negatively predicts laughter ($r = -0.75$, $N = 24$ topics) overstates the explanatory power of the result: the correlation is not causal, and the authors acknowledge that "filming conventions, performer mobility, and the artefactual T6 are plausible confounders." Indeed, topic T6 is explicitly labeled a "structural artefact corresponding to subtitle encoding markers and on-stage entry sequences," yet it remains in the analysis with an outlier kinetic energy value ($\bar{E}_t = 2.24$) that likely inflates the correlation magnitude. The short-horizon prediction task, while demonstrating methodological feasibility, achieves only marginal gains from multimodal fusion (an AUROC improvement of 0.004 over history-only, 0.647 vs. 0.643), with the authors noting that temporal autocorrelation dominates: "a hot room stays hot, independently of what is being said or shown." Furthermore, the 60-second topic granularity appears poorly matched to comedic timing dynamics, as evidenced by the near-absence of belly laughs at this resolution, suggesting the segmentation may be too coarse to capture actual punchline delivery.

“T6 is a structural artefact; see Section 4”

Zribi et al., TIC-TALK · Table 1 caption
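The concern about T6 amounts to a request for a leave-one-out sensitivity check. The Python sketch below shows the mechanics on synthetic toy values that are not taken from the paper: four points with no linear trend plus one high-energy, low-laughter outlier. Whether the published $r = -0.75$ behaves similarly can only be established by running the same check on the released per-topic values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def leave_one_out_r(xs, ys):
    """Pearson r recomputed with each point removed in turn."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Synthetic illustration only; these are NOT the paper's topic-level values.
energy = [1.0, 1.0, 2.0, 2.0, 10.0]
laughter = [1.0, 2.0, 1.0, 2.0, 0.0]

full_r = pearson_r(energy, laughter)                     # strongly negative
without_outlier = leave_one_out_r(energy, laughter)[-1]  # last point dropped
```

In this toy case the strong negative correlation is produced entirely by the last point, and removing it leaves no linear relationship. Reporting the correlation with and without T6 would let readers judge how much of the effect survives.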

“temporal autocorrelation of audience laughter: a hot room stays hot, independently of what is being said or shown”

Zribi et al., TIC-TALK · Section 3.4
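The "hot room stays hot" effect is easy to reproduce with a history-only scorer. The following Python sketch is a hypothetical baseline, not the paper's benchmark: it scores each anchor by the share of laughter-positive windows among the last `k` windows, labels the anchor positive if any of the next `h` windows contains laughter, and evaluates with a pairwise AUROC. The binary laughter sequence, `k`, and `h` are all illustrative choices.

```python
def history_scores(laugh, k):
    """Score anchor t by the fraction of laughter in windows t-k+1 .. t."""
    return [sum(laugh[max(0, t - k + 1):t + 1]) / k for t in range(len(laugh))]


def future_labels(laugh, h):
    """Label anchor t as 1 if any of windows t+1 .. t+h contains laughter."""
    return [int(any(laugh[t + 1:t + 1 + h])) for t in range(len(laugh))]


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic bursty laughter track: quiet and loud stretches of five windows each.
laugh = [0] * 5 + [1] * 5 + [0] * 5 + [1] * 5
score = auroc(future_labels(laugh, h=1), history_scores(laugh, k=1))
```

On this bursty toy sequence the scorer reaches an AUROC of 0.8 without seeing any text or video, which is why a multimodal model needs to be compared against laughter history alone rather than against chance.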

Evidence and comparison

The evidence supports descriptive claims about topic-level laughter stratification: personal/bodily themes generate higher laughter rates than geopolitical content, which validates the corpus against established humor theory (replicating Yang et al., 2015; Annamoradnejad and Zoghi, 2024). Comparisons to related work are generally fair: the authors correctly distinguish their multimodal focus from Barriere et al. (2025), noting that StandUp4AI "adds multilingual laughter labels, but focuses on audio," and from Pope et al. (2026), highlighting the contrast between edited specials and unedited live performances. However, the framing of prediction results emphasizes the "best multimodal system" (AUROC 0.647) without making sufficiently clear that vision and text contribute only marginal gains (a precision improvement of just 0.025) over laughter history alone, which potentially overstates the utility of the full multimodal approach for anticipatory prediction.

“The StandUp4AI dataset (Barriere et al., 2025) adds multilingual laughter labels, but focuses on audio”

Zribi et al., TIC-TALK · Introduction

“Test set: 45,894 anchors from 14 held-out shows. AUPRC of a random classifier equals the positive rate.”

Zribi et al., TIC-TALK · Table 2 caption
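The caption's remark that a random classifier's AUPRC equals the positive rate can be checked with a step-wise average precision. In the Python sketch below, a constant-score classifier stands in for the uninformative case (a classifier with random scores matches it in expectation). The function name and the toy labels are illustrative, not taken from the paper.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    One operating point per distinct score, visited from highest to lowest;
    each step adds (recall gain) * (precision at that threshold).
    """
    total_pos = sum(labels)
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    for thr in sorted(set(scores), reverse=True):
        group = [y for y, s in zip(labels, scores) if s == thr]
        tp += sum(group)
        seen += len(group)
        recall = tp / total_pos
        ap += (recall - prev_recall) * (tp / seen)
        prev_recall = recall
    return ap


# Toy labels with a 30% positive rate (illustrative, not the paper's test set).
labels = [1] * 3 + [0] * 7
uninformative = average_precision(labels, [0.5] * len(labels))  # all scores tied
perfect = average_precision(labels, [float(y) for y in labels])
```

With all scores tied there is a single threshold at which every anchor is predicted positive, so precision equals the share of positives. AUPRC values therefore need to be read against the test set's positive rate rather than against 0.5, unlike AUROC.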

Reproducibility

Reproducibility is compromised by the proprietary nature of the source material: while code is available in an anonymous repository, the underlying Netflix content cannot be redistributed, and the authors release only derived annotations, with "No audio, image or video is distributed." This limitation is partially mitigated by detailed documentation including Whisper-AT inference at 0.8 s stride and YOLOv8s-pose processing specifications. However, the 1 fps pose extraction applied only to full-body frames (22% of total frames) introduces a dependency on shot classifier accuracy that is difficult to verify without access to raw video. The hierarchical JSON data structure is clearly specified, enabling reuse of annotations by researchers with access to the same Netflix specials, though the inability to inspect raw keypoint detection against source frames limits independent validation of the kinematic findings.

“No audio, image or video is distributed”

Zribi et al., TIC-TALK · Section 3.1 (Delivered outputs)

“22% are full-body frames yielding raw keypoint sequences for pose analysis”

Zribi et al., TIC-TALK · Section 3.2 (Summary statistics)

Abstract

Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015–2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
