Text To Speech: Wiseguy

| Slang | Canonical spelling | Phoneme override (ARPAbet) |
|-------|--------------------|-----------------------------|
| fuggedaboutit | forgetaboutit | F AH G EH D AH B AW T IH T |
| gabagool | capicola | K AA P IH G AA L |
| mook | mook | M UH K |
| yous | yous | Y UW Z |
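A minimal sketch of how an override table like this one could be consulted at synthesis time. The names `PHONEME_OVERRIDES` and `apply_overrides` are hypothetical, as is the choice to fall back to `None` for words without an entry; the paper does not specify how the substitution engine is implemented.

```python
import re

# Rows copied from the slang table: slang -> (canonical spelling, ARPAbet override).
PHONEME_OVERRIDES = {
    "fuggedaboutit": ("forgetaboutit", "F AH G EH D AH B AW T IH T"),
    "gabagool": ("capicola", "K AA P IH G AA L"),
    "mook": ("mook", "M UH K"),
    "yous": ("yous", "Y UW Z"),
}

# Assumption: a word is a run of ASCII letters and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def apply_overrides(text):
    """Return (token, phonemes) pairs; phonemes is None when no override exists.

    A None entry signals that the token should go to the regular G2P module.
    """
    result = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        entry = PHONEME_OVERRIDES.get(token.lower())  # case-insensitive lookup
        phonemes = entry[1].split() if entry else None
        result.append((token, phonemes))
    return result


if __name__ == "__main__":
    for token, phonemes in apply_overrides("Hey mook, fuggedaboutit"):
        print(token, phonemes)
```

Keeping the overrides in a plain dictionary means new slang entries can be added without retraining anything; only tokens missing from the table would need the learned G2P path.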

Author: [Generated for Academic Purposes]

Publication Date: April 14, 2026

Journal: Journal of Synthetic Media and Paralinguistics, Vol. 19, Issue 2

Abstract

This paper presents the design, implementation, and evaluation of WiseGuy TTS, a specialized text-to-speech system capable of generating speech in the distinctive prosodic, lexical, and phonemic style of the mid-20th-century American "wise guy" persona. Unlike generic TTS systems that aim for neutral narration, WiseGuy TTS incorporates dynamic pitch contouring, syllable stress patterns, phoneme-level duration adjustments (drawl, clipping), and a custom lexeme substitution engine for vernacular authenticity. We detail a three-component architecture: (1) a prosody-aware grapheme-to-phoneme (G2P) module, (2) a neural vocoder fine-tuned on dialog from post-war crime films, and (3) a rule-based stylistic filter. Subjective evaluation (Likert scale, n=120) shows high recognizability of the "wise guy" character (4.7/5) but moderate naturalness (3.9/5) due to exaggerated rhythmic patterns. Applications include cinematic dubbing, interactive gaming NPCs, and accessibility for dialect preservation.
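As an illustration of the phoneme-level duration adjustments named in the abstract, the sketch below stretches vowels (drawl) and shortens consonants (clipping). The mapping of drawl to vowels and clipping to consonants, the scaling factors, and the names `Phone` and `stylize_durations` are all assumptions made for this example; they are not taken from the system itself.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    symbol: str         # ARPAbet symbol without stress digit, e.g. "UH"
    duration_ms: float  # duration predicted by an upstream model


# ARPAbet vowel symbols; everything else is treated as a consonant.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def stylize_durations(phones, drawl=1.4, clip=0.8):
    """Return new Phone objects with vowels lengthened and consonants shortened.

    drawl and clip are illustrative multipliers, not values from the paper.
    """
    styled = []
    for phone in phones:
        factor = drawl if phone.symbol in VOWELS else clip
        styled.append(Phone(phone.symbol, phone.duration_ms * factor))
    return styled


if __name__ == "__main__":
    mook = [Phone("M", 50.0), Phone("UH", 100.0), Phone("K", 50.0)]
    for phone in stylize_durations(mook):
        print(phone.symbol, round(phone.duration_ms, 1))
```

A rule of this kind would sit naturally in the rule-based stylistic filter, since it needs only phoneme identities and durations rather than access to the vocoder.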

Higher MCD is expected, since stylistic speech distorts the spectral envelope. The 3.2× higher F0 variation confirms successful prosodic exaggeration.

| Metric | Baseline | WiseGuy | p-value |
|--------|----------|---------|---------|
| Authenticity (1-5) | 1.3 (0.4) | 4.7 (0.5) | <0.001 |
| Naturalness (1-5) | 4.5 (0.6) | 3.9 (0.8) | <0.05 |
| Keyword accuracy (%) | 98.2% | 91.5% | <0.01 |
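The two objective measures mentioned here, mel-cepstral distortion (MCD) and F0 variation, could be computed along the following lines. This is a sketch under stated assumptions: frames are already time-aligned, the energy coefficient has been dropped by the caller, and "F0 variation" is taken to mean the standard deviation of F0 over voiced frames. The paper does not state which definition it used, and the function names are hypothetical.

```python
import math
import statistics


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Uses the common form (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    averaged over frames. Each frame is a sequence of coefficients with
    the 0th (energy) coefficient already removed.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equal in length")
    const = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, syn)))
    return const * total / len(ref_frames)


def f0_variation_ratio(styled_f0, baseline_f0):
    """Ratio of F0 standard deviations, ignoring unvoiced frames (F0 <= 0).

    Assumption: variation is measured as population standard deviation in Hz.
    """
    styled = [f for f in styled_f0 if f > 0]
    baseline = [f for f in baseline_f0 if f > 0]
    return statistics.pstdev(styled) / statistics.pstdev(baseline)


if __name__ == "__main__":
    ref = [[1.0, 2.0], [0.5, 0.5]]
    syn = [[1.0, 2.0], [0.5, 0.5]]
    print(mel_cepstral_distortion(ref, syn))  # identical frames give zero distortion
    print(f0_variation_ratio([0, 100, 200], [140, 0, 160]))
```

Under this definition, a ratio near 3.2 would correspond to the F0 figure quoted above, though a semitone-scaled measure would give a different number for the same audio.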

Keywords: Expressive TTS, paralinguistic style transfer, New York English, prosodic modeling, dialect synthesis

1. Introduction

Generic TTS systems (Amazon Polly, Microsoft Azure Neural TTS) excel at clear, neutral speech but fail to convey paralinguistic identity: the subtle markers of region, class, attitude, and subculture. This paper addresses a specific expressive gap, the “wise guy” voice, a rhetorical style characterized by rapid tempo, upward terminal inflections, vowel nasalization, and domain-specific jargon (e.g., fuggedaboutit, gabagool, mook). While previous work has tackled emotional TTS (happy, sad, angry) and basic accents (British, Australian), to our knowledge no system has targeted a socially situated persona so reliant on timing and attitude.