Multilingual Corpus of Second Language Speech

About MuSSeL 

The Multilingual Corpus of Second Language Speech (MuSSeL) is being developed by researchers at the University of Utah’s Second Language Teaching & Research Center. It provides researchers and teachers with an unprecedentedly large and varied set of transcribed and tagged L2 speech samples as well as access to the original MP3 recordings.

When complete, the corpus will include samples from three learning contexts (child classroom, adult classroom, and adult post-immersion) across six languages: Chinese, French, German, Portuguese, Russian, and Spanish. For each speech sample, a user can listen to the audio file and access the transcription in three file formats (CHAT, TEXT, and PDF). The transcripts are tagged according to the CHAT protocols established by CHILDES (MacWhinney, 2000) and can be used to run various analyses in the CLAN program. All samples come from two testing situations: the ACTFL Assessment of Performance toward Proficiency in Languages (AAPPL) online tool in the case of child samples and the ACTFL Oral Proficiency Interview by Computer (OPIc) for adult samples. The corpus is searchable using various filters, e.g., language, age group, gender, learning context, topic, and proficiency level. For more detailed information about the construction of the corpus, transcription conventions, and more, please navigate to the FAQ page.

MuSSeL is a developing corpus, and new samples will continue to be added to it, so check back regularly.

We are grateful for seed funding from the VP for Research and the College of Humanities at the University of Utah, and for funding to support corpus development from the Language Flagship and a Title VI Language Resource Center grant.


Frequently Asked Questions

Corpus Files

MuSSeL file names include the student ID and a file number separated by an underscore (“_”). The student ID consists of 11 characters and specifies the student’s language, unique ID, age group, and the overall assigned rating on the ACTFL test. The file number counts the speech files produced by each student, and each student usually has multiple speech files. For example, file c0002cadr01_1 means that the speaker is a Chinese adult learner with an ILR rating of 1 and that the file number is 1.

mussel file name table

Other character combinations or codes in file names:

Character 1: Language

c for Chinese, f for French, p for Portuguese, r for Russian, and s for Spanish.

Characters 7 & 8: Context or Learner age group

  • “ad” marks adult students.
  • If the learner is a child, a 2-digit code indicates the child’s grade level (i.e., 03, 04, 05, 06, 07, 08, 09, 10).

Characters 10 & 11: Rating

Child Ratings

  • “ba” means a child rating below N1. “b” stands for below, and “a” stands for Form A of the AAPPL test. N1 is the lowest rating possible when a student takes Form A of the AAPPL test, and the AAPPL rating “below N1” means that raters were unable to rate the speaker’s performance based on the ACTFL rating scale.
  • “bb” means a child rating below N4. The first “b” stands for below, and the second “b” stands for Form B of the AAPPL test. N4 is the lowest rating possible when a student takes Form B of the AAPPL test, and the AAPPL rating “below N4” means that raters were unable to assign a rating because the performance was below the lowest rating possible.
  • “aa” marks an Advanced child rating. The character “a” is repeated to keep the file name length consistent.
  • Other child rating codes are n1, n2, n3, n4, i1, i2, i3, i4, i5.

Adult Ratings

  • “00”, “01”, “02”, and “03” represent the adult ratings 0, 1, 2, and 3 (ILR rating scale for the OPIc test), respectively.
  • “0p”, “1p”, and “2p” correspond to the adult ratings 0+, 1+, and 2+ (ILR rating scale for the OPIc test), respectively.
  • “aa” specifies the adult rating AL-AH. This rating is given when the rater cannot choose a specific sub-level (Low, Mid, High) under the Advanced level.
  • “ss” marks the adult rating S, or Superior.
  • Other adult rating codes are nl, nm, nh, il, im, ih, al, am, ah.
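
The naming scheme above can be decoded mechanically. The following Python sketch is one possible reading of it, not an official MuSSeL tool. The function name `parse_mussel_name` and the `MusselName` type are hypothetical. Characters 2 to 6 and character 9 are not described in the text above, so the sketch returns them uninterpreted. Rating codes that the FAQ lists without a gloss are simply reported in upper case. It assumes the file extension has already been stripped.

```python
import re
from typing import NamedTuple

LANGUAGES = {"c": "Chinese", "f": "French", "p": "Portuguese",
             "r": "Russian", "s": "Spanish"}

# Codes whose meaning the FAQ spells out; every other listed code is
# reported in upper case (e.g. "n2" -> "N2", "im" -> "IM").
CHILD_RATINGS = {"ba": "below N1", "bb": "below N4", "aa": "Advanced"}
CHILD_RATINGS.update({c: c.upper() for c in
                      ("n1", "n2", "n3", "n4", "i1", "i2", "i3", "i4", "i5")})
ADULT_RATINGS = {"00": "ILR 0", "01": "ILR 1", "02": "ILR 2", "03": "ILR 3",
                 "0p": "ILR 0+", "1p": "ILR 1+", "2p": "ILR 2+",
                 "aa": "AL-AH", "ss": "Superior"}
ADULT_RATINGS.update({c: c.upper() for c in
                      ("nl", "nm", "nh", "il", "im", "ih", "al", "am", "ah")})

# 11-character student ID, an underscore, then the file number.
NAME_RE = re.compile(r"([cfprs])(.{5})(ad|\d{2})(.)(.{2})_(\d+)")


class MusselName(NamedTuple):
    language: str
    chars_2_to_6: str  # not interpreted here (assumption: includes the unique ID)
    age_group: str     # "adult" or "grade N"
    char_9: str        # not interpreted here
    rating: str
    file_number: int


def parse_mussel_name(name: str) -> MusselName:
    """Split a name such as 'c0002cadr01_1' into its documented parts."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized MuSSeL file name: {name!r}")
    lang, middle, context, char_9, rating_code, number = match.groups()
    if context == "ad":
        age_group, ratings = "adult", ADULT_RATINGS
    else:
        age_group, ratings = f"grade {int(context)}", CHILD_RATINGS
    if rating_code not in ratings:
        raise ValueError(f"unknown rating code {rating_code!r} in {name!r}")
    return MusselName(LANGUAGES[lang], middle, age_group, char_9,
                      ratings[rating_code], int(number))


info = parse_mussel_name("c0002cadr01_1")  # the example from the FAQ
print(info.language, info.age_group, info.rating, info.file_number)
# prints: Chinese adult ILR 1 1
```

Keeping separate rating tables for children and adults matters because “aa” means Advanced for a child but AL-AH for an adult.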



When a file number is missing from the database, it could mean one of the following:

  • The file was empty or fully unintelligible.
  • The file was corrupt and would not open, so we were unable to transcribe it (a rare case).


The MP3 file is the original speech file produced on the test by the speaker. The MP3 files were transcribed according to the 2021 CHAT transcription protocols established by CHILDES (MacWhinney, 2000). CHA files are the transcriptions written in CLAN. Including CHA files gives users access to the many tools for tagging and linguistic analysis available in the CLAN program. The TXT file is a copy of the transcription in CHA format with two modifications: 1) the file headers are enclosed in angle brackets, which separates them from the main text and allows the main text to be analyzed in corpus analysis tools, and 2) the bullets, or time stamps, that link the audio files to the CHA files have been removed since they serve no purpose in TXT files. TXT files may be used by corpus users who are unfamiliar with the CLAN program or prefer to use other corpus analysis tools. Finally, the PDF file format allows users to preview the files in their browsers before downloading them. Including the additional formats improves the overall accessibility of MuSSeL.
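
The two CHA-to-TXT modifications described above can be illustrated with a short Python sketch. This is not the script used to build MuSSeL. The function name `cha_to_txt` is hypothetical, and the sketch rests on two assumptions about the CHAT format: header lines begin with “@”, and each time-stamp bullet is enclosed in a pair of 0x15 control characters. Header continuation lines (which start with a tab) are not handled.

```python
import re

# Assumption: a CHAT media bullet looks like "\x15<start>_<end>\x15".
BULLET_RE = re.compile("\x15[^\x15]*\x15")


def cha_to_txt(cha_text: str) -> str:
    """Wrap header lines in angle brackets and drop time-stamp bullets."""
    out_lines = []
    for line in cha_text.splitlines():
        # Remove any bullets, then the whitespace they leave at the line end.
        line = BULLET_RE.sub("", line).rstrip()
        if line.startswith("@"):
            # Bracketed headers are easy to exclude in corpus tools.
            line = f"<{line}>"
        out_lines.append(line)
    return "\n".join(out_lines)


# A made-up four-line transcript, for illustration only.
sample = "@Begin\n@Languages:\tspa\n*STU:\thola . \x15100_900\x15\n@End"
print(cha_to_txt(sample))
```

With the headers bracketed, a tool such as AntConc can be told to ignore everything between “<” and “>”, leaving only the speaker tiers for analysis.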

Some MP3 files included personal information, so they were purposefully removed from the database to protect the speakers’ identities. The corresponding transcription files, however, have already been deidentified. The removed MP3 files will be added back to the database after deidentification.


Corpus Search Filters

Gender data were not collected from adult speakers for a number of years. The most recent adult data usually include gender information.

The adult files in MuSSeL come from the Oral Proficiency Interview by Computer (OPIc) tests. “An OPIc can be rated according to the ACTFL scale, the Interagency Language Roundtable (ILR) scale, or the Common European Framework of Reference for Languages (CEFR) scale” (Language Testing International, n.d.). In MuSSeL, the adult files had either an ACTFL rating or an ILR rating. “An ACTFL OPIc reports a rating between Novice and Superior on the ACTFL scale. An ILR OPIc rating reported is between ILR 0 (No Proficiency) and ILR 3 (Professional Proficiency)” (Language Testing International, n.d.). The following table shows the correspondence between ACTFL and CEFR ratings on OPIc tests.

ACTFL and CEFR Rating Alignment on OPIc (adapted from the ACTFL report, Assigning CEFR Ratings to ACTFL Assessments)

ACTFL Proficiency Scale | ACTFL Rating on OPIc | Corresponding CEFR Rating

Advanced (AL-AH)
Advanced High
Advanced Mid
Advanced Low

Intermediate High
Intermediate Mid
Intermediate Low

Novice High
Novice Mid

As for the ILR rating scale, we have not provided the corresponding CEFR or ACTFL ratings since there is no consensus in the literature on the alignment between the ILR-scaled score (0, 0+, 1, 1+, 2, 2+, 3) and the other two scales.

AAPPL Rating Alignment with ACTFL and CEFR Scales (ACTFL Proficiency Guidelines, 2012)

ACTFL Proficiency Scale | Corresponding ACTFL Rating | Corresponding CEFR Rating

Advanced Low-High
Intermediate High
Intermediate Mid
Intermediate Low
Novice High
Novice Mid
Novice Low
Below N4
Below N1


The lists of topics in MuSSeL were created in five steps: 1) transcribers identified the subjects of the speech files and added annotations to the transcription files, 2) all topics were collected from the files and the database spreadsheets and ordered by frequency, 3) topics with lower frequency were merged into broader categories, 4) the list of topics under each broad category was recorded and shared again with the transcribers, and 5) the assigned topics were revised on the database spreadsheet to match the finalized list of topics. The following are the lists of topics that emerged from the child and adult sub-corpora.
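
Steps 2 and 3 above amount to a frequency count followed by a merge. The Python sketch below illustrates that idea only. The function name `merge_rare_topics`, the `min_count` threshold, the `broader` mapping, and the sample annotations are all hypothetical. The text above does not state a frequency cutoff, and the broader categories were chosen by the corpus team rather than by a script.

```python
from collections import Counter


def merge_rare_topics(annotations, broader, min_count):
    """Count topic annotations and fold rare topics into broader ones.

    annotations: iterable of topic labels, one per speech file
    broader:     hypothetical mapping from a narrow topic to a broad category
    min_count:   hypothetical cutoff; topics seen fewer times are merged
    """
    counts = Counter(annotations)  # step 2: collect and count the topics
    merged = Counter()
    for topic, n in counts.items():
        # Step 3: a rare topic moves to its broad category, if one is given.
        target = broader.get(topic, topic) if n < min_count else topic
        merged[target] += n
    return merged.most_common()  # ordered from most to least frequent


# Made-up annotations, for illustration only.
annotations = ["Food"] * 3 + ["Soccer", "Tennis"] + ["Family"] * 2
broader = {"Soccer": "Sports", "Tennis": "Sports"}
print(merge_rare_topics(annotations, broader, min_count=2))
```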

List of Topics in the Child Sub-Corpus

  1. About Yourself
  2. Activities
  3. Clothes
  4. Colors
  5. Current Affairs
  6. Family
  7. Food
  8. Holidays
  9. Introductions
  10. Locations
  11. Other People
  12. Routines
  13. School
  14. Sports
  15. Time, Seasons, Climate
  16. Unspecified (Unable to Specify a Topic, Unintelligible Content)


List of Topics in the Adult Sub-Corpus

  1. About Yourself
  2. Business & Technology
  3. Current Affairs
  4. Education
  5. Entertainment
  6. Events & Activities
  7. Family
  8. Food
  9. Jobs
  10. Locations
  11. People
  12. Questions
  13. Routines
  14. Sports
  15. Travel
  16. Unspecified (Unable to Specify a Topic, Unintelligible Content)


Resources for Language Teachers and Researchers

Publications & Presentations

Kia, E. & Rubio, F. (2022, Sep. 22–24). Lexical bundles and L2 Spanish writing development: A case of dual language immersion. Paper to be presented at the sixth International Conference for Learner Corpus Research (LCR 2022), The University of Padua, Padua, Italy.

Kia, E. & Rubio, F. (2022, Sep. 9–11). Applications of the corpus of multilingual second language speech (MuSSeL) for research and pedagogy. Paper to be presented at the fifteenth American Association for Corpus Linguistics (AACL 2022) conference, Northern Arizona University, Flagstaff, AZ, United States.

Kia, E., & Schnur, E. (2021, Nov. 19–21). Developing instructional activities using a multilingual learner corpus [Paper presentation]. ACTFL 2021 Virtual Convention, United States.

Schnur, E., Rubio, F. & Hacking, J. (2019, Sep. 12–14). Introducing language teachers to learner corpora: The development of online tutorials for pedagogic uses of the MuSSeL corpus [Paper Presentation]. Fifth International Conference for Learner Corpus Research (LCR 2019), Warsaw, Poland.

Hacking, J., Schnur, E. & Rubio, F. (2018, Sep. 24–26). MuSSeL: Designing and building a corpus of multilingual second language speech [Paper presentation]. SlaviCorp 2018 Conference, Prague, Czech Republic.

Schnur, E., Hacking, J. & Rubio, F. (2018, Sep. 20–22). MuSSeL: Designing and building a corpus of multilingual second language speech [Paper presentation]. American Association of Corpus Linguistics (AACL 2018) Conference, Atlanta, GA, United States.



Disclaimer: The following tutorials introduce the pilot version of MuSSeL and its old search filters, which do not match the current state of the corpus.

Speaker: Dr. Erin Schnur

Date: June 2019

Schnur, E. (2019, June 5). MuSSeL corpus tutorial: Introducing AntConc software and basic corpus searches [Video Tutorial]. University of Utah.

Speaker: Dr. Erin Schnur

Date: Feb. 2019

Schnur, E. (2019, February 5). Tutorials for language teachers: Using the multilingual spoken second language (MuSSeL) corpus [Video Tutorial]. University of Utah.

Speaker: Dr. Erin Schnur

Date: Nov. 2018

This five-minute tutorial introduces the multilingual spoken second language (MuSSeL) corpus, explains the pilot corpus search filters, and uses examples to show how to explore MuSSeL with AntConc (freeware for corpus analysis by Laurence Anthony).

Schnur, E. (2018, November 9). The multilingual corpus of second language speech (MuSSeL) [Video Tutorial]. University of Utah.


Webinars and Workshops

The corpus lab at the Second Language Teaching and Research Center (L2TReC) offers webinars and workshops for language teachers on the use of corpus linguistics in the classroom. Please stay tuned to our social media accounts for future events.

Presenter: Dr. Elnaz Kia

Date: May 6, 2021

This one-hour webinar was initially offered live on May 6, 2021. To access the webinar recording and supplementary materials, please complete the webinar registration.

This webinar aims to introduce language teachers to learner corpora as a source to create authentic pedagogical materials. The webinar includes three sections:

  1. An introduction to corpus linguistics: This section discusses the definition and importance of corpora and corpus linguistics.
  2. Pedagogical applications of learner corpora: In this section, the presenter defines learner corpora and introduces several freely available multilingual spoken and written learner corpora.
  3. Developing data-driven activities: This section involves various examples of activities and research studies based on learner corpora.

Kia, E. (2021, May 6). Creating data-driven pedagogical materials using learner corpora: A guide for language teachers [Webinar]. University of Utah.

Presenter: Dr. Elnaz Kia

Date: March 5, 2021

This two-hour webinar was originally offered live on March 5, 2021. To access the webinar recording and supplementary materials, please complete the webinar registration.

After watching this webinar, you will be able to:

  1. Learn about valuable pedagogical applications of corpora.
  2. Check your intuitions about actual language use in different registers (e.g., academic language, informal conversation).
  3. Use basic corpus linguistic tools to inform your instruction and develop authentic teaching materials.
  4. Add freely available online foreign language corpora to your teaching toolkit.

Citation (APA):

Kia, E. (2021, March 5). Corpus linguistics for language teachers: An introduction [Webinar]. University of Utah.



Corpus-Based Teaching & Learning Materials

We are in the process of developing a series of L2 teaching and learning materials using MuSSeL, with help from undergraduate and graduate students in the Department of Linguistics and the Department of World Languages and Cultures at the U. Materials will be in the forms of mini-lessons and activities focusing on linguistic features that are particularly challenging for speakers of Chinese, French, Portuguese, Russian, and Spanish in Dual Language Immersion Programs. We are hoping to publish the materials on our webpage by Fall 2022.


Other Corpus Resources

Corpora in English

Corpora in Other Languages

Corpus Analysis Tools



Last Updated: 12/8/22