
WebMAUS: Automatic Segmentation and Labelling of Speech Signals over the Web

The web application WebMAUS automatically aligns speech recordings with their corresponding text. The user uploads two input files: a media file containing a recorded speech signal and a file containing a textual encoding of the words spoken in the recording. If the latter is plain text, its contents are text-normalized and tokenized into a chain of words. The application then produces a phonological pronunciation encoding of the content in SAMPA (Speech Assessment Methods Phonetic Alphabet), which essentially reflects the standard citation pronunciation of the content. From this phonological form, a machine-learned expert system creates a statistically weighted graph of all possible realisations (pronunciation variants) within the selected language. Finally, this graph is aligned to the speech signal using standard techniques from automatic speech recognition. The result is an orthographic and a phonetic alignment (segmentation and labelling, S&L) of the recorded speech, which is rendered into the desired target format (BPF, Emu, TextGrid) and returned to the user via the web browser.
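
To give a rough feel for the first step only (turning plain text into a chain of words), here is a minimal Python sketch. It is not the normalization that WebMAUS actually performs, which is not specified here; the function name `tokenize` and the rule of keeping apostrophes and hyphens inside words are assumptions made for illustration.

```python
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: split plain text into a chain of words.

    Hypothetical helper, not the WebMAUS implementation.
    """
    # Bring composed/decomposed characters (e.g. umlauts) into one form.
    text = unicodedata.normalize("NFC", text)
    # A word is a run of letters/digits, optionally joined by
    # single apostrophes or hyphens; all other punctuation is dropped.
    return re.findall(r"[^\W_]+(?:['-][^\W_]+)*", text)


if __name__ == "__main__":
    print(tokenize("Hello, world! It's a well-known test."))
```

A real system would additionally expand numbers, abbreviations and similar items before the pronunciation lookup; this sketch leaves them untouched.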

The web application is available on the web and can be found here. In the web interface, the user has three options for automatically segmenting and labelling their speech data. In WebMAUS Basic, a text file and a signal file are uploaded; the grapheme-to-phoneme conversion is done automatically and its result is passed to MAUS, which produces the S&L. In the WebMAUS General interface, the user uploads a signal file together with a BAS Partitur Format (BPF) file, which already contains the canonical pronunciation and possibly other information about the content of the signal. Here the user can additionally choose between several options that change the outcome of the segmentation. The WebMAUS Multiple service allows the user to upload a set of up to 300 file pairs, each either a signal/text pair or a signal/BPF pair; the two files of a pair must share the same base filename (e.g. signal17.wav and signal17.txt). This service is intended for batch processing of larger sets of files and is the most used of the three.
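
Before uploading a batch to WebMAUS Multiple, it can help to check locally that every signal file has a partner with the same base filename and that the set stays within the 300-pair limit. The following Python sketch illustrates such a check. The function name `pair_files` and the extension sets are assumptions for illustration (`.wav` for signals, `.txt` for text, `.par` for BPF); the service itself may accept other file types.

```python
from pathlib import Path

MAX_PAIRS = 300                  # limit stated for WebMAUS Multiple
SIGNAL_EXTS = {".wav"}           # assumption: signal file extensions
ANNOT_EXTS = {".txt", ".par"}    # assumption: text and BPF extensions


def pair_files(filenames):
    """Pair signal and text/BPF files that share a base filename.

    Returns (pairs, unmatched): a sorted list of (signal, annotation)
    tuples and a sorted list of base names lacking a partner.
    Hypothetical helper, not part of WebMAUS.
    """
    signals, annots = {}, {}
    for name in filenames:
        path = Path(name)
        ext = path.suffix.lower()
        if ext in SIGNAL_EXTS:
            signals[path.stem] = name
        elif ext in ANNOT_EXTS:
            annots[path.stem] = name

    common = sorted(signals.keys() & annots.keys())
    unmatched = sorted(signals.keys() ^ annots.keys())
    if len(common) > MAX_PAIRS:
        raise ValueError(f"{len(common)} pairs exceed the limit of {MAX_PAIRS}")
    return [(signals[s], annots[s]) for s in common], unmatched


if __name__ == "__main__":
    files = ["signal17.wav", "signal17.txt", "signal18.wav"]
    pairs, unmatched = pair_files(files)
    print(pairs)      # complete pairs ready for upload
    print(unmatched)  # base names that still lack a partner file
```

If a base name has both a `.txt` and a `.par` file, this sketch simply keeps whichever it sees last; a stricter check could reject such ambiguous sets.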

CLARIN Centre
BAS München
Project leader
Thomas Kisler
Acknowledgements

Christoph Draxler, BAS München, BAS CLARIN-D Project Management
Florian Schiel, BAS München, Munich Automatic Segmentation
Uwe Reichel, BAS München, Balloon (Grapheme-to-Phoneme Conversion)
Thomas Kisler, BAS München, WebMAUS Developer
Dieter van Uytvanck, CLARIN, Technical Director

The work was carried out within the CLARIN-D project (BMBF-funded).