Skip to content

Multilingual Corpus of Second Language Speech

Multi LIngual Corpus Logo (MuSSeL)

 

 

About MuSSeL

The Multilingual Corpus of Second Language Speech (MuSSeL) is being developed by researchers at the University of Utah’s Second Language Teaching & Research Center. It provides researchers and teachers with an unprecedentedly large and varied set of transcribed and tagged L2 speech samples as well as access to the original MP3 recordings.

Learn More About MuSSeL

Interested in the MuSSeL corpus?

Before applying for full access, we invite you to explore our sample corpus. This preview offers a glimpse into the variety of texts and the organizational structure of the corpus.

We encourage you to examine the sample corpus to ensure its suitability for your research objectives. Should it meet your requirements, detailed instructions are available to guide you through the process of gaining access to the full MuSSeL corpus.

Explore Sample Corpus


How to Access MuSSeL


STEP 1

Choose Access Level

Allows users to open individual speech samples (MP3 and TXT) in a web browser. Users won't be able to download the corpus for linguistic analyses. Temporary Online Access will automatically terminate three months after the registration date.

Required Materials:

  • Project Description for Temporary Access
    • Provide a brief description of your project, specifically focusing on how your project intends to explore second language development with the use of MuSSeL (100-300 words).

Allows users to download the whole corpus for offline use. Users are required to show proof of IRB approval or exemption to receive full offline access. Online Access to the corpus will expire on the expected project completion date unless you request an extension.

Required Materials:

  • Project Description for Full Access
    • Provide a concise yet detailed description of your research purpose, specifically focusing on how your project intends to explore second language development with the use of MuSSeL. Emphasize how the anticipated outcomes will contribute to the field of second language learning and/or teaching (200-500 words).
  • Expected Project Completion Date
    • Please be aware that your online access to the corpus files is set to expire on the date you have chosen. Should you require continued access beyond this date, you must submit a request for an extension.
  • Proof of IRB Approval or Exemption for Full Access
    • Please submit a PDF copy of the Institutional Review Board (IRB) approval or exemption document. For guidance on the application process for IRB approval, click here.

 

STEP 2

Fill Out Database
Registration

Link to Registration Form

NOTICE: You must prepare the required materials for the desired access level listed in Step 1 before filling out the form.

STEP 3

Sign a Material Transfer Agreement

Only for Non-UofU Affiliates

Frequently Asked Questions

MuSSeL file names include the student ID and a file number separated by an “_.” The student ID consists of 11 characters and specifies students’ language, unique ID, age group, and the overall assigned rating on the ACTFL test. The file number counts each student’s produced speech files. Each student usually has multiple speech files. For example, file c0002cadr01_1 means that the speaker is a Chinese adult learner with an ILR rating of 1.

mussel file name table

Other character combinations or codes in file names:

Character 1: Language

c for Chinese, f for French, p for Portuguese, r for Russian, and s for Spanish.

Characters 7 & 8: Context or Learner age group

  • ad” marks adult students.
  • If the learner is a child, a 2-digit code indicates child’s grade level (i.e., 03, 04, 05, 06, 07, 08, 09, 10).

Characters 10 & 11: Rating

Child Ratings

  • ba” means child rating below N1. “b” stands for below, and “a” stands for Form “A” in the AAPPL test. N1 is the lowest rating possible when a student takes Form A of the AAPPL test and the AAPPL rating, “below N1”, means that raters were unable to rate the speaker’s performance based on the ACTFL rating scale.
  • "bb" means child rating below N4. The first “b” stands for below, and the second “b” stands for Form “B” in the AAPPL test. N4 is the lowest rating possible when a student takes Form B of the AAPPL test and the AAPPL rating, “below N4” means that the student took Form B of the AAPPL test and raters were unable to assign a rating to the speaker since the performance was below the lowest rating possible.
  • "aa" marks an Advanced child rating. The character “a” was repeated to achieve consistency in file name length.
  • Other child rating codes are n1, n2, n3, n4, i1, i2, i3, i4, i5.

Adult Ratings

  • 00”, “01”, “02”, “03” represent adult ratings, 0, 1, 2, and 3 (ILR rating scale for the OPIc Test), respectively.
  • 0p”, “1p”, “2p” correspond to adult ratings, 0+, 1+, and 2+ (ILR rating scale for the OPIc Test), respectively.
  • aa” specifies the adult rating, AL-AH. This rating is given when the rater cannot choose a specific sub-level (Low, Medium, High) under the Advanced level.
  • "ss" marks adult rating, S or superior.
  • Other adult rating codes are nl, nm, nh, il, im, ih, al, am, ah.

 

c0003cadr0p_1
c0003cadr0p_3
c0003cadr0p_4
c0003cadr0p_6
c0003cadr0p_7
c0003cadr0p_8

When a file number is missing from the database, it could mean one of the following:

  • The file was EMPTY or fully unintelligible.
  • The file was corrupt, and wouldn’t open, so we were unable to transcribe it. (Rare case)

 

The MP3 file is the original speech file produced on the test by the speaker. The MP3 files were transcribed according to the 2021 CHAT transcription protocols established by CHILDES (MacWhinney, 2000). CHA files are the transcriptions written in CLAN. Including CHA files allows the users to enjoy the multitude of tools for tagging and linguistics analyses available on the CLAN program. The TXT file is a copy of the transcriptions in CHA format with a few modifications: 1) angle brackets have surrounded the TXT file headers to separate the headers from the main text and allow the analysis of the main text in corpus analysis tools, 2) the bullets or time stamps that link the audio files to the CHA files have been removed since they have no use in TXT files. TXT files may be used by corpus users who are unfamiliar with the CLAN program or prefer to use other corpus analysis tools. Finally, the PDF file format allows the users to preview the files in their browsers before downloading them. Including the additional formats improves overall accessibility for MuSSeL.

Gender data was not collected from adult speakers in the past few years. Most recent adult data usually include gender information.

The adult files in MuSSeL come from the Oral Proficiency Interview by Computer (OPIc) tests. “An OPIc can be rated according to the ACTFL scale, the Interagency Language Roundtable (ILR) scale, or the Common European Framework of Reference for Languages (CEFR) scale (Language Testing International, n.d.). In MuSSeL, the adult files either had the ACTFL rating or the ILR rating. “An ACTFL OPIc reports a rating between Novice and Superior on the ACTFL scale. An ILR OPIc rating reported is between ILR 0 (No Proficiency) and ILR 3 (Professional Proficiency)” (Language Testing International, n.d.). The following table demonstrates the correspondence between ACTFL and CEFR ratings on OPIc tests.

ACTFL and CEFR Rating Alignment on OPIc(adapted from the ACTFL report, Assigning CEFR Ratings to ACTFL Assessments)

ACTFL Proficiency Scale

ACTFL Rating on OPIc

Corresponding CEFR Rating

Superior

Superior

C2

Advanced

Advanced (AL-AH)

B2-C1

Advanced High

C1

Advanced Mid

B2.2

Advanced Low

B2.1

Intermediate

Intermediate High

B1.2

Intermediate Mid

B1.1

Intermediate Low

A2

Novice

Novice High

A1

Novice Mid

0

 

As for the ILR rating scale, we have not provided the corresponding CEFR or ACTFL ratings since there is no consensus in the literature on the alignment between the ILR-scaled score (0, 0+, 1, 1+, 2, 2+, 3) and the other two scales.

AAPPL Rating Alignment with ACTFL and CEFR Scales (ACTFL Proficiency Guidelines, 2012)

ACTFL Proficiency Scale

Corresponding ACTFL Rating

AAPPL Score

Corresponding CEFR Rating

Advanced

Advanced Low-High

A

B2-C1

Intermediate

Intermediate High

I5

B1.2

Intermediate Mid

I4

B1.1

I3

I2

Intermediate Low

I1

A2

Novice

Novice High

N4

A1

Novice Mid

N3

0

N2

Novice Low

N1

0

 

Below N4

 

Below N1

 

List of topics in MuSSeL were created in five steps: 1) transcribers identified the subjects of the speech files and added annotations to the transcription files, 2) all topics were collected from the files and the database spreadsheets and ordered based on frequency, 3) topics with lower frequency were merged into broader categories, 4) the list of topics under each broad category was recorded and reshared with the transcribers, 5) the assigned topics were revised on the database spreadsheet to match the finalized list of topics. The following are lists of topics that emerged from the child and adult sub-corpora.

List of Topics in the Child Sub-Corpus

  1. About Yourself
  2. Activities
  3. Clothes
  4. Colors
  5. Current Affairs
  6. Family
  7. Food
  8. Holidays
  9. Introductions
  10. Locations
  11. Other People
  12. Routines
  13. School
  14. Sports
  15. Time, Seasons, Climate
  16. Unspecified (Unable to Specify a Topic, Unintelligible Content)

 

List of Topics in the Adult Sub-Corpus

  1. About Yourself
  2. Business & Technology
  3. Current affairs
  4. Education
  5. Entertainment
  6. Events & Activities
  7. Family
  8. Food
  9. Jobs
  10. Locations
  11. People
  12. Questions
  13. Routines
  14. Sports
  15. Travel
  16. Unspecified (Unable to Specify a Topic, Unintelligible content)                 

 


Tutorials

Disclaimer:  The following tutorials introduce the pilot version of MuSSeL and the old search filters, which do not match the current status of the corpus.

MuSSeL Corpus Tutorial: Introducing AntConc Software and Basic Corpus Searches

Speaker: Dr. Erin Schnur

Date: June 2019

Citation (APA 7th Ed.): Schnur, E. (2019, June 5). MuSSeL corpus tutorial: Introducing AntConc software and basic corpus searches [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_c5x9e2

Tutorials for Language Teachers: Using the Multilingual Spoken Second Language (MuSSeL) Corpus

Speaker: Dr. Erin Schnur

Date: Feb. 2019

Citation (APA 7th Ed.): Schnur, E. (2019, February 5). Tutorials for language teachers: Using the multilingual spoken second language (MuSSeL) corpus [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_k3o5di

The Multilingual Corpus of Second Language Speech (MuSSeL)

Speaker: Dr. Erin Schnur

Date: Nov. 2018

This five-minute tutorial introduces the multilingual spoken second language (MuSSeL) corpus, explains the pilot corpus search filters, and describes how to use AntConc (a corpus analysis freeware by Laurence Anthony) to explore MuSSeL by providing examples.

Citation (APA 7th Ed.): Schnur, E. (2018, November 9). The multilingual corpus of second language speech (MuSSeL) [Video Tutorial]. University of Utah. https://mediaspace.utah.edu/media/t/1_y8lostzz

 

 

Last Updated: 3/28/24