3 years ago · 0b5dcfdef7
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # Whisper
 
 [[Blog]](https://openai.com/blog/whisper)
-[[Paper]](https://cdn.openai.com/papers/whisper.pdf)
+[[Paper]](https://arxiv.org/abs/2212.04356)
 [[Model card]](model-card.md)
 [[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
 
@@ -66,7 +66,7 @@ There are five model sizes, four with English-only versions, offering speed and
 
 For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
 
-Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by languages of Fleurs dataset, using the `large` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in [the paper](https://cdn.openai.com/papers/whisper.pdf).
+Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by languages of Fleurs dataset, using the `large-v2` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in [the paper](https://arxiv.org/abs/2212.04356).
 
 ![WER breakdown by language](language-breakdown.svg)
 
--- a/language-breakdown.svg
+++ b/language-breakdown.svg
--- a/model-card.md
+++ b/model-card.md
@@ -2,7 +2,7 @@
 
 This is the official codebase for running the automatic speech recognition (ASR) models (Whisper models) trained and released by OpenAI.
 
-Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://cdn.openai.com/papers/whisper.pdf).
+Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://arxiv.org/abs/2212.04356).
 
 
 ## Model Details
@@ -17,10 +17,12 @@ The Whisper models are trained for speech recognition and translation tasks, cap
 | medium |   769 M    |         ✓          |         ✓          |
 | large  |   1550 M   |                    |         ✓          |
 
+In December 2022, we [released an improved large model named `large-v2`](https://github.com/openai/whisper/discussions/661).
+
 
 ### Release date
 
-September 2022
+September 2022 (original series) and December 2022 (`large-v2`)
 
 ### Model type
 
@@ -28,7 +30,7 @@ Sequence-to-sequence ASR (automatic speech recognition) and speech translation m
 
 ### Paper & samples
 
-[Paper](https://cdn.openai.com/papers/whisper.pdf) / [Blog](https://openai.com/blog/whisper)
+[Paper](https://arxiv.org/abs/2212.04356) / [Blog](https://openai.com/blog/whisper)
 
 
 ## Model Use
@@ -46,7 +48,7 @@ In particular, we caution against using Whisper models to transcribe recordings
 
 The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages. 
 
-As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
+As discussed in [the accompanying paper](https://arxiv.org/abs/2212.04356), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
 
 
 ## Performance and Limitations
@@ -55,9 +57,9 @@ Our studies show that, over many existing ASR systems, the models exhibit improv
 
 However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
 
-Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in [the paper accompanying this release](https://cdn.openai.com/papers/whisper.pdf). 
+Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in [the paper accompanying this release](https://arxiv.org/abs/2212.04356).
 
-In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis on these limitations are provided in [the paper](https://cdn.openai.com/papers/whisper.pdf). It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.
+In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis on these limitations are provided in [the paper](https://arxiv.org/abs/2212.04356). It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.
 
 
 ## Broader Implications