Umhuqa Phansi: IsiZulu Corpus Generator

Project description

This study embraces the advancing scholarship in the Humanities known as the Digital Humanities. “Digital Humanities methods involve the use of the novel computational methods, such as computer software and carefully processed and machine-readable data to solve research problems in the humanities and social sciences or challenge existing theoretical assumptions” (Khumalo, 2020: 39). Further, “corpora have been widely used in linguistics research and beyond. They are at the core of many human language technologies like spellcheckers, machine learning, translators and lexicons” (Khumalo, 2020: 41). In this work, a computer program with an audio playback function is written in the Python 3 programming language using the Tkinter library. Through this audio playback program, the study argues for monolingual isiZulu Graphical User Interface (GUI) platforms in computer programming.

Matplotlib is a plotting library for the Python programming language. The graphs in this study depict a three-dimensional space with an x axis, a y axis and a z axis, as seen in section 3 under experimentation and results. A phonological sound unit can carry substantial phonetic and acoustic information. This information can include the frequency of the sound, its pitch, its intensity in decibels, the nature of its formants and the total time taken to articulate the sound unit. All this information can be represented schematically in a graph. The sound units here are selected from the compiled spoken corpus, loaded into the Praat program for annotation, description and analysis, and then plotted with Matplotlib for visual illustration.
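To illustrate the kind of three-dimensional graph described above, the following is a minimal sketch in Python and not the study's own script. It assumes NumPy is available alongside Matplotlib, and it uses a synthetic tone as a stand-in for a recorded vowel, so the values plotted are not measurements from the corpus. The helper name stft_db and all parameter values (sampling rate, frame size, hop size, the 120 Hz fundamental) are hypothetical choices made for the illustration. Time is placed on the x axis, frequency on the y axis and relative level in decibels on the z axis.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window
import matplotlib.pyplot as plt


def stft_db(signal, rate, frame=512, hop=128):
    """Return (times, freqs, level_db) from a short-time Fourier transform.

    Hypothetical helper for this sketch; level_db is relative to an
    arbitrary reference, not calibrated sound pressure level.
    """
    window = np.hanning(frame)
    starts = np.arange(0, len(signal) - frame + 1, hop)
    frames = np.stack([signal[s:s + frame] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    level_db = 20 * np.log10(magnitude + 1e-12)  # small offset avoids log(0)
    times = (starts + frame / 2) / rate          # centre of each frame
    freqs = np.fft.rfftfreq(frame, d=1 / rate)
    return times, freqs, level_db


# Synthetic stand-in for a vowel: a 120 Hz tone with four weaker harmonics.
rate = 16000
t = np.arange(int(0.3 * rate)) / rate
signal = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))

times, freqs, level_db = stft_db(signal, rate)

# Draw the time-frequency-level surface on 3D axes.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
T, F = np.meshgrid(times, freqs, indexing="ij")
ax.plot_surface(T, F, level_db, cmap="viridis")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Frequency (Hz)")
ax.set_zlabel("Relative level (dB)")
```

Calling fig.savefig with a file name, or plt.show() with an interactive backend, would render the surface. For real data, the synthetic signal would be replaced by samples from a corpus recording.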

Moreover, this study will give a practical application of isiZulu spoken corpora in the analysis and learning of isiZulu. It will do so by retrieving a sample of articulated isiZulu sound units, in the form of spoken isiZulu vowels, and displaying these vowels in a sound spectrogram for study, teaching and learning purposes.
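As a rough illustration of displaying a sound sample as a spectrogram, the sketch below reads a WAV file with Python's standard wave module and draws it with Matplotlib's specgram method. This is a swapped-in technique for illustration only: the study itself displays spectrograms in Praat. The function names write_demo_wav and read_mono_wav are hypothetical, the file is a synthetic tone written to a temporary directory so that the sketch runs without a real recording, and 16-bit mono audio is assumed.

```python
import os
import tempfile
import wave

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window
import matplotlib.pyplot as plt


def write_demo_wav(path, rate=16000, seconds=0.3, pitch_hz=120):
    """Write a synthetic 16-bit mono tone (a stand-in, not corpus data)."""
    t = np.arange(int(seconds * rate)) / rate
    tone = sum(np.sin(2 * np.pi * pitch_hz * k * t) / k for k in range(1, 6))
    pcm = (tone / np.max(np.abs(tone)) * 32000).astype("<i2")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm.tobytes())


def read_mono_wav(path):
    """Read a 16-bit mono WAV file into floats in the range [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch assumes 16-bit mono audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    return rate, np.frombuffer(raw, dtype="<i2") / 32768.0


path = os.path.join(tempfile.mkdtemp(), "demo_vowel.wav")
write_demo_wav(path)
rate, samples = read_mono_wav(path)
duration = len(samples) / rate  # total articulation time in seconds

# Time on the x axis, frequency on the y axis, colour for intensity.
fig, ax = plt.subplots()
spectrum, freqs, times, image = ax.specgram(
    samples, NFFT=512, Fs=rate, noverlap=384
)
ax.set_xlabel("Time (s)")
ax.set_ylabel("Frequency (Hz)")
ax.set_title("Spectrogram of a synthetic tone")
```

To use a real recording, the call to write_demo_wav would be dropped and path pointed at a vowel sample exported from the corpus.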

One of the justifications for a corpus-based approach to language study, teaching and learning is that “corpora contain textual records of real communication, and it makes a lot of sense to use them to support as many aspects of the development of a learner’s communicative competence as possible” (Braun, 2006: 02). Further, “in other learning environments, especially in the school context, the application of corpora has so far remained an exception […] even the mere awareness of corpora among teachers is low” (Braun, 2006: 02). For developing languages like isiZulu, the application of corpora in teaching and learning is virtually non-existent. This study intends to inspire work on corpora for developing languages and their pedagogies.

The selected corpus is a pedagogically appropriate corpus (Braun, 2006). This corpus is annotated for the purpose of observing the articulation of isiZulu vowels. The articulations have been compiled for the analysis and teaching of isiZulu vowels. The sound samples of isiZulu vowels are then displayed in the sound spectrogram of the Praat program. There is a nationally collected spoken corpus of isiZulu, referred to as the IsiZulu Oral Corpus (IOC) (Khumalo, 2020: 37). The IOC is intended as a digital resource for language research and language teaching (Khumalo, 2020: 37). The corpus sample selected herein, however, is not necessarily drawn from the IOC. It is compiled for the sole purpose of studying the isiZulu vowels.

For isiZulu and other developing languages, the compilation of corpora would have to adhere to standardisation processes. “One of the requirements for developing any language is the planning of its corpus. This refers to the standardisation and also intellectualisation of a language” (Ndimande-Hlongwa, 2010: 208). We also witness the application of standardisation in developed languages like English in the form of text-to-speech voice technology, speech recognition software and other tools such as auto-correct. Ndimande-Hlongwa (2010: 208) states that isiZulu is a standard language and therefore it adheres to the “…notions of what is ‘correct’ and ‘incorrect’ […].” Thus, one argues for compliance with and adherence to the processes of standardisation in the compilation of corpora for developing languages.

Project team

Mthuli Buthelezi


University of KwaZulu-Natal


isiZulu, spoken corpus, vowels, sound spectrogram