Multilingual Corpus of Second Language Speech


About MuSSeL 

The Multilingual Corpus of Second Language Speech (MuSSeL) is being developed by researchers at the University of Utah’s Second Language Teaching & Research Center. It provides researchers and teachers with an unprecedentedly large and varied set of transcribed and tagged L2 speech samples as well as access to the original MP3 recordings.

When complete, the corpus will include samples from three learning contexts (child classroom, adult classroom, and adult post-immersion) across six languages: Chinese, French, German, Portuguese, Russian, and Spanish. For each speech sample, a user can listen to the audio file and access the transcription in three file formats (CHAT, TEXT, and PDF). The transcripts are tagged according to the CHAT protocols established by CHILDES (MacWhinney, 2000) and can be used to run various analyses in the CLAN program. All samples come from two testing situations: the ACTFL Assessment of Performance toward Proficiency in Languages (AAPPL) online tool for child samples and the ACTFL Oral Proficiency Interview by Computer (OPIc) for adult samples. The corpus is searchable using various filters, e.g., language, age group, gender, learning context, topic, and proficiency level. For more detailed information about the construction of the corpus, transcription conventions, and more, please navigate to the FAQ page.

MuSSeL is a developing corpus, and new samples will continue to be added to it, so check back regularly.

We are grateful for seed funding from the VP for Research and the College of Humanities at the University of Utah, and for funding to support corpus development from the Language Flagship and a Title VI Language Resource Center grant.

 


Frequently Asked Questions

Corpus Files

MuSSeL file names include the student ID and a file number separated by an underscore (“_”). The student ID consists of 11 characters and specifies the student’s language, unique ID, age group, and overall assigned rating on the ACTFL test. The file number counts the speech files each student produced; each student usually has multiple speech files. For example, the file name c0002cadr01_1 means that the speaker is a Chinese adult learner with an ILR rating of 1.

[Image: MuSSeL file name table]

Other character combinations or codes in file names:

Character 1: Language

c for Chinese, f for French, p for Portuguese, r for Russian, and s for Spanish.

Characters 7 & 8: Context or Learner age group

  • “ad” marks adult students.
  • If the learner is a child, a 2-digit code indicates child’s grade level (i.e., 03, 04, 05, 06, 07, 08, 09, 10).

Characters 10 & 11: Rating

Child Ratings

  • “ba” means a child rating below N1: “b” stands for below, and “a” stands for Form A of the AAPPL test. N1 is the lowest rating possible when a student takes Form A, and the AAPPL rating “below N1” means that raters were unable to rate the speaker’s performance on the ACTFL rating scale.
  • “bb” means a child rating below N4: the first “b” stands for below, and the second “b” stands for Form B of the AAPPL test. N4 is the lowest rating possible when a student takes Form B, and the AAPPL rating “below N4” means that raters were unable to assign a rating because the performance was below the lowest rating possible.
  • “aa” marks an Advanced child rating. The character “a” is repeated to keep file names the same length.
  • Other child rating codes are n1, n2, n3, n4, i1, i2, i3, i4, i5.

Adult Ratings

  • “00”, “01”, “02”, and “03” represent the adult ratings 0, 1, 2, and 3, respectively (ILR rating scale for the OPIc test).
  • “0p”, “1p”, and “2p” correspond to the adult ratings 0+, 1+, and 2+, respectively (ILR rating scale for the OPIc test).
  • “aa” specifies the adult rating AL-AH. This rating is given when the rater cannot choose a specific sub-level (Low, Mid, High) under the Advanced level.
  • “ss” marks the adult rating S, or Superior.
  • Other adult rating codes are nl, nm, nh, il, im, ih, al, am, ah.
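The naming scheme above can be decoded programmatically. The following is a minimal Python sketch, not an official MuSSeL tool: the function name `parse_file_name`, the `MusselFile` class, and the wording of the decoded labels are illustrative assumptions. Only the character positions documented above (1, 7 & 8, 10 & 11) are interpreted; the remaining characters are left alone because their meaning is not described in the text. The sketch expects a bare file name without an extension.

```python
from dataclasses import dataclass

# Codes taken from the lists above; the label wording is illustrative.
LANGUAGES = {"c": "Chinese", "f": "French", "p": "Portuguese",
             "r": "Russian", "s": "Spanish"}
CHILD_RATINGS = {"ba": "below N1 (Form A)", "bb": "below N4 (Form B)",
                 "aa": "Advanced"}
ADULT_RATINGS = {"00": "ILR 0", "01": "ILR 1", "02": "ILR 2", "03": "ILR 3",
                 "0p": "ILR 0+", "1p": "ILR 1+", "2p": "ILR 2+",
                 "aa": "Advanced (AL-AH)", "ss": "Superior"}


@dataclass
class MusselFile:  # hypothetical container, not part of MuSSeL
    language: str
    age_group: str
    rating: str
    file_number: int


def parse_file_name(name: str) -> MusselFile:
    """Decode the documented positions of a MuSSeL file name."""
    student_id, _, number = name.partition("_")
    if len(student_id) != 11 or not number.isdigit():
        raise ValueError(f"unexpected file name: {name!r}")
    group = student_id[6:8]    # characters 7 & 8
    code = student_id[9:11]    # characters 10 & 11
    is_adult = group == "ad"
    if not is_adult and not group.isdigit():
        raise ValueError(f"unexpected age group code: {group!r}")
    ratings = ADULT_RATINGS if is_adult else CHILD_RATINGS
    return MusselFile(
        language=LANGUAGES.get(student_id[0], "unknown"),
        age_group="adult" if is_adult else f"child, grade {group}",
        # Codes such as n1, i3, nl, or ah are simply upper-cased.
        rating=ratings.get(code, code.upper()),
        file_number=int(number),
    )


print(parse_file_name("c0002cadr01_1"))
# MusselFile(language='Chinese', age_group='adult', rating='ILR 1', file_number=1)
```

Because “aa” means Advanced for children but AL-AH for adults, the sketch chooses the rating table from the age group before decoding the rating code.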

 

c0003cadr0p_1
c0003cadr0p_3
c0003cadr0p_4
c0003cadr0p_6
c0003cadr0p_7
c0003cadr0p_8

When a file number is missing from the database, it could mean one of the following:

  • The file was empty or fully unintelligible.
  • The file was corrupt and would not open, so we were unable to transcribe it (a rare case).
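Gaps of this kind can be spotted with a few lines of Python. This is a sketch under the assumption that a student’s file numbers start at 1 and increase by one; the helper name `missing_file_numbers` is hypothetical. Note that it cannot detect files missing after the highest number present.

```python
def missing_file_numbers(file_names):
    """Return the file numbers absent between 1 and the highest number present."""
    present = {int(name.rpartition("_")[2]) for name in file_names}
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]


files = ["c0003cadr0p_1", "c0003cadr0p_3", "c0003cadr0p_4",
         "c0003cadr0p_6", "c0003cadr0p_7", "c0003cadr0p_8"]
print(missing_file_numbers(files))  # [2, 5]
```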

 

The MP3 file is the original speech file produced by the speaker on the test. The MP3 files were transcribed according to the 2021 CHAT transcription protocols established by CHILDES (MacWhinney, 2000). CHA files are the transcriptions written in CLAN. Including CHA files gives users access to the many tools for tagging and linguistic analysis available in the CLAN program. The TXT file is a copy of the CHA transcription with two modifications: 1) the file headers are enclosed in angle brackets to separate them from the main text, so that the main text can be analyzed on its own in corpus analysis tools, and 2) the bullets (time stamps) that link the audio files to the CHA files have been removed, since they serve no purpose in TXT files. TXT files may be used by corpus users who are unfamiliar with the CLAN program or prefer other corpus analysis tools. Finally, the PDF format allows users to preview the files in their browsers before downloading them. Including the additional formats improves the overall accessibility of MuSSeL.
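For users working with the TXT files outside CLAN, the bracketed headers can be separated from the main text before analysis. The Python sketch below assumes that each header sits on its own line, is wrapped as a whole in angle brackets, and begins with “@” as CHAT headers do. The exact layout of the MuSSeL TXT files is not described here, so the sample text and the function name `split_txt_transcript` are hypothetical.

```python
def split_txt_transcript(text):
    """Split a TXT transcript into header lines and main-text lines.

    Assumption: a header line looks like "<@Header ...>"; every other
    non-empty line is treated as main text.
    """
    headers, body = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("<@") and stripped.endswith(">"):
            headers.append(stripped[1:-1])  # drop the surrounding brackets
        elif stripped:
            body.append(line)
    return headers, body


# Hypothetical sample, not taken from the corpus.
sample = "<@Begin>\n<@Languages:\teng>\n*STU:\thello world .\n<@End>"
headers, body = split_txt_transcript(sample)
print(headers)
print(body)
```

Checking for “<@” rather than a bare “<” avoids misreading main-text lines that happen to begin with a CHAT angle-bracket scope.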

Some MP3 files included personal information, so they were purposefully removed from the database to protect speakers’ identities. The corresponding transcription files, however, have already been deidentified. The removed MP3 files will be added back to the database after deidentification.

 

Corpus Search Filters

Gender data were not collected from adult speakers in earlier years. More recent adult data usually include gender information.

The adult files in MuSSeL come from the Oral Proficiency Interview by Computer (OPIc) tests. “An OPIc can be rated according to the ACTFL scale, the Interagency Language Roundtable (ILR) scale, or the Common European Framework of Reference for Languages (CEFR) scale” (Language Testing International, n.d.). In MuSSeL, the adult files had either the ACTFL rating or the ILR rating. “An ACTFL OPIc reports a rating between Novice and Superior on the ACTFL scale. An ILR OPIc rating reported is between ILR 0 (No Proficiency) and ILR 3 (Professional Proficiency)” (Language Testing International, n.d.). The following table shows the correspondence between ACTFL and CEFR ratings on OPIc tests.

ACTFL and CEFR Rating Alignment on OPIc (adapted from the ACTFL report, Assigning CEFR Ratings to ACTFL Assessments)

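| ACTFL Proficiency Scale | ACTFL Rating on OPIc | Corresponding CEFR Rating |
|-------------------------|----------------------|---------------------------|
| Superior                | Superior             | C2                        |
| Advanced                | Advanced (AL-AH)     | B2-C1                     |
| Advanced                | Advanced High        | C1                        |
| Advanced                | Advanced Mid         | B2.2                      |
| Advanced                | Advanced Low         | B2.1                      |
| Intermediate            | Intermediate High    | B1.2                      |
| Intermediate            | Intermediate Mid     | B1.1                      |
| Intermediate            | Intermediate Low     | A2                        |
| Novice                  | Novice High          | A1                        |
| Novice                  | Novice Mid           | 0                         |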
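For scripted analyses, the alignment table above can be expressed as a simple lookup. This is an illustrative Python sketch; the names `ACTFL_TO_CEFR` and `cefr_for` are hypothetical, and the values are copied from the table as given.

```python
# ACTFL rating on OPIc -> corresponding CEFR rating, as listed in the table.
ACTFL_TO_CEFR = {
    "Superior": "C2",
    "Advanced (AL-AH)": "B2-C1",
    "Advanced High": "C1",
    "Advanced Mid": "B2.2",
    "Advanced Low": "B2.1",
    "Intermediate High": "B1.2",
    "Intermediate Mid": "B1.1",
    "Intermediate Low": "A2",
    "Novice High": "A1",
    "Novice Mid": "0",
}


def cefr_for(actfl_rating):
    """Return the CEFR rating for an ACTFL OPIc rating, or None if not listed."""
    return ACTFL_TO_CEFR.get(actfl_rating)


print(cefr_for("Advanced Mid"))  # B2.2
print(cefr_for("ILR 2"))         # None
```

Returning None for ratings outside the table, such as ILR ratings, keeps the lookup from implying an alignment that the table does not state.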

 

As for the ILR rating scale, we have not provided the corresponding CEFR or ACTFL ratings since there is no consensus in the literature on the alignment between the ILR-scaled score (0, 0+, 1, 1+, 2, 2+, 3) and the other two scales.

AAPPL Rating Alignment with ACTFL and CEFR Scales (ACTFL Proficiency Guidelines, 2012)

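| ACTFL Proficiency Scale | Corresponding ACTFL Rating | AAPPL Score | Corresponding CEFR Rating |
|-------------------------|----------------------------|-------------|---------------------------|
| Advanced                | Advanced Low-High          | A           | B2-C1                     |
| Intermediate            | Intermediate High          | I5          | B1.2                      |
| Intermediate            | Intermediate Mid           | I4          | B1.1                      |
| Intermediate            | Intermediate Mid           | I3          | B1.1                      |
| Intermediate            | Intermediate Mid           | I2          | B1.1                      |
| Intermediate            | Intermediate Low           | I1          | A2                        |
| Novice                  | Novice High                | N4          | A1                        |
| Novice                  | Novice Mid                 | N3          | 0                         |
| Novice                  | Novice Mid                 | N2          | 0                         |
| Novice                  | Novice Low                 | N1          | 0                         |
|                         |                            | Below N4    |                           |
|                         |                            | Below N1    |                           |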

Topic lists in MuSSeL were created in five steps: 1) transcribers identified the subjects of the speech files and added annotations to the transcription files, 2) all topics were collected from the files and the database spreadsheets and ordered by frequency, 3) topics with lower frequency were merged into broader categories, 4) the list of topics under each broad category was recorded and shared again with the transcribers, and 5) the assigned topics were revised on the database spreadsheet to match the finalized list of topics. The following is a list of topics that emerged from the child sub-corpus. The numbered topics are the broad categories, and the topics in parentheses are the micro topics included in each broad category.

List of Topics in the Child Sub-Corpus

  1. About Yourself (Age, Birthdays, Books, Languages Spoken, Favorite Subjects)
  2. Activities (Favorite Activities, Leisure Activities, Activities on Vacation)
  3. Clothes (Clothes, Winter Clothing)
  4. Colors (Favorite Colors)
  5. Current Affairs (Politics, Social Dilemmas, News)
  6. Family (Family Activities, Family Members)
  7. Food (Eating Preferences, Food)
  8. Holidays
  9. Introductions (Introductions, Hello)
  10. Locations (Current Location, Geography, Hometown, Home, Places, Travel, Favorite Places)
  11. Other People (Celebrities, Famous People, Mascots, Friends)
  12. Routines (Daily Routine, Evening, Morning Routine)
  13. School (Learning at School, School Activities)
  14. Sports (Favorite Sports)
  15. Time, Seasons, Climate (Climate, Weather, Temperature, Favorite Season, Summer)
  16. Unspecified (Unable to Specify a Topic, Unintelligible Content)

 

The following is a list of topics that emerged from the adult sub-corpus. The numbered topics are the broad categories, and the topics in parentheses are the micro topics included in each broad category.

List of Topics in the Adult Sub-Corpus

  1. About Yourself (Introductions, Favorites, Childhood, Future Plans, Pets)
  2. Business & Technology (Technology, Business, Cell Phones, Technology & Communication, Trade)
  3. Current Affairs (Poverty, Healthcare, Politics, News)
  4. Education (Studies, School, Language Learning)
  5. Entertainment (Movies, Music)
  6. Events & Activities (Outdoors, Going Out, Activities, Hobbies, Summer Activities, Celebrations)
  7. Family (Family Members, Housework)
  8. Food (Eating Preferences, Ordering Food)
  9. Jobs (Profession, Occupation)
  10. Locations (Accommodation, Places, Place Description, Neighborhood, Parks)
  11. People (Meeting a Friend, Celebrities)
  12. Questions (Asking Questions about a Variety of Topics)
  13. Routines (Evenings, Weekend Activities)
  14. Sports (Favorite Sports, Exercise Routines, Workout Invitation)
  15. Travel (Transportation, Commute)
  16. Unspecified (Unable to Specify a Topic, Unintelligible Content)

 

 


Resources for Language Teachers and Researchers

Publications & Presentations

Kia, E., & Schnur, E. (2021, Nov.). Developing instructional activities using a multilingual learner corpus [Paper presentation]. ACTFL 2021 Virtual Convention, United States.

Schnur, E., Rubio, F., & Hacking, J. (2019, Sep. 12–14). Introducing language teachers to learner corpora: The development of online tutorials for pedagogic uses of the MuSSeL corpus [Presentation Abstract]. 5th Learner Corpus Research Conference. Warsaw, Poland. https://lcr2019.ils.uw.edu.pl/files/2019/08/Book-of-Abstracts_final-Aug3.pdf

Hacking, J., Schnur, E., & Rubio, F. (2018, Sep. 24–26). MuSSeL: Designing and building a corpus of multilingual second language speech [Paper presentation]. SlaviCorp2018 Conference. Prague, Czech Republic. https://slavicorp.ff.cuni.cz/programme/presentations/

Schnur, E., Hacking, J., & Rubio, F. (2018, Sep. 20–22). MuSSeL: Designing and building a corpus of multilingual second language speech [Paper presentation]. American Association of Corpus Linguistics. Atlanta, GA, United States.

 

Tutorials

Disclaimer:  The following tutorials introduce the pilot version of MuSSeL and the old search filters, which do not match the current status of the corpus.

Speaker: Dr. Erin Schnur

Date: June 2019

Schnur, E. (2019, June 5). MuSSeL corpus tutorial: Introducing AntConc software and basic corpus searches [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_c5x9e2ur

Speaker: Dr. Erin Schnur

Date: Feb. 2019

Schnur, E. (2019, February 5). Tutorials for language teachers: Using the multilingual spoken second language (MuSSeL) corpus [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_k3o5digo

Speaker: Dr. Erin Schnur

Date: Nov. 2018

This five-minute tutorial introduces the multilingual spoken second language (MuSSeL) corpus, explains the pilot corpus search filters, and uses examples to show how to explore MuSSeL with AntConc (free corpus analysis software by Laurence Anthony).

Schnur, E. (2018, November 9). The multilingual corpus of second language speech (MuSSeL) [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_y8lostzz

 

Webinars and Workshops

The corpus lab at the Second Language Teaching and Research Center (L2TReC) offers webinars and workshops for language teachers on the use of corpus linguistics in the classroom. Please stay tuned to our social media accounts for future events.

Speaker: Dr. Elnaz Kia

Date: May 6, 2021

This one-hour webinar was initially offered live on May 6, 2021. To access the webinar recording and supplementary materials, please complete the webinar registration.

This webinar aims to introduce language teachers to learner corpora as a source to create authentic pedagogical materials. The webinar includes three sections:

  1. An introduction to corpus linguistics: This section discusses the definition and importance of corpora and corpus linguistics.
  2. Pedagogical applications of learner corpora: In this section, the presenter defines learner corpora and introduces several freely available multilingual spoken and written learner corpora.
  3. Developing data-driven activities: This section involves various examples of activities and research studies based on learner corpora.

Kia, E. (2021, May 6). Creating data-driven pedagogical materials using learner corpora: A guide for language teachers [Webinar]. University of Utah. https://l2trec.utah.edu/news/2021-spring-creating-data-driven-pedagogical-materials-using-learner-corpora.php

Speaker: Dr. Elnaz Kia

Date: March 5, 2021

This two-hour webinar was originally offered live on March 5, 2021. To access the webinar recording and supplementary materials, please complete the webinar registration.

After watching this webinar, you will be able to:

  1. Learn about valuable pedagogical applications of corpora.
  2. Check your intuitions about actual language use in different registers (e.g., academic language, informal conversation).
  3. Use basic corpus linguistic tools to inform your instruction and develop authentic teaching materials.
  4. Add freely available online foreign language corpora to your teaching toolkit.

Citation (APA):

Kia, E. (2021, March 5). Corpus linguistics for language teachers: An introduction [Webinar]. University of Utah. https://l2trec.utah.edu/news/2021-spring-corpus-linguistics-for-language-teachers-webinar.php

 

 

Corpus-Based Teaching & Learning Materials

We are developing a series of L2 teaching and learning materials using MuSSeL, with help from undergraduate and graduate students in the Department of Linguistics and the Department of World Languages and Cultures at the University of Utah. Materials will take the form of mini-lessons and activities focusing on linguistic features that are particularly challenging for speakers of Chinese, English, French, Portuguese, and Spanish in Dual Language Immersion Programs. We are hoping to publish the materials on our webpage by Spring 2022.

 

Other Corpus Resources

Corpora in English

Corpora in Other Languages

Corpus Analysis Tools

 

Last Updated: 11/10/21