I just read the 100 most popular words from the British National Corpus out loud, one at a time, with pauses in between:
the of and to a in it is was that i for on you he be with by at have are this not but had his they from as she which or we an there her were do been all their has would will what if one can so no who said more about them some could him into its then up two time my out like did only me your now other may just these new also people any know first see well very should than how get most over back way our much think between years go er
This took 114 seconds, or 1.14 seconds per word, so probably the first 1000 words would take me about 20 minutes, and the first 10000 words about three or four hours.
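The extrapolation is just proportionality; a minimal Python sketch of the same arithmetic, using the 1.14-second-per-word rate measured above:

```python
def recording_time(n_words, seconds_per_word=1.14):
    """Estimated time to read n_words aloud, assuming a constant rate."""
    return n_words * seconds_per_word

print(recording_time(1000) / 60)     # → 19.0 (minutes)
print(recording_time(10000) / 3600)  # ≈ 3.2 (hours)
```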
These 100 words comprise 50.56% of the words in the BNC (“the” alone being about 6.87%, and “er” about 0.1%), or at any rate of the part of it represented in my frequency table; so with recordings of just these 100 (as LPC-10 or whatever) you could synthesize only about half of the words in typical text. Here’s an overview of how that number changes with vocabulary size:
Word | Rank | Cumulative % | Self % | Time to record this many | To miss one word in |
---|---|---|---|---|---
much | 95 | 50.05% | 0.10% | 2 minutes | 2 |
er | 100 | 50.56% | 0.10% | 2 minutes | 2 |
married | 1261 | 75.005% | 0.0083% | 24 minutes | 4 |
colonel | 4407 | 87.5004% | 0.0020% | 1½ hours | 8 |
cdna‽ | 10445 | 93.7500% | 0.0006% | 3½ hours | 16 |
capitals | 10446 | 93.7506% | 0.0006% | 3½ hours | 16 |
stabilise | 20104 | 96.87494% | 0.00019% | 6½ hours | 32 |
pasha | 33275 | 98.43748% | 0.00007% | 11 hours | 64 |
lessens | 43350 | 99% | 0.00004% | 14 hours | 100 |
pizzicato | 60386 | 99.5% | 0.00002% | 19 hours | 200 |
superfields | 76770 | 99.75% | 0.000011% | 24 hours | 400 |
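A table like the one above can be derived mechanically from any frequency list. Here is a minimal Python sketch; the four-word “corpus” at the bottom is made up for illustration, and a real run would use the BNC frequency table:

```python
def coverage_table(freqs, seconds_per_word=1.14):
    """freqs: (word, count) pairs sorted by descending count.
    For each rank, yield cumulative coverage, self frequency,
    recording time so far, and the one-miss-in-N rate."""
    total = sum(count for _, count in freqs)
    running = 0
    for rank, (word, count) in enumerate(freqs, start=1):
        running += count
        coverage = running / total
        yield {
            'word': word,
            'rank': rank,
            'cumulative': coverage,
            'self': count / total,
            'record_seconds': rank * seconds_per_word,
            # One word in N is missed when coverage is 1 - 1/N.
            'miss_one_in': 1 / (1 - coverage) if coverage < 1 else float('inf'),
        }

# Toy four-word "corpus", purely illustrative.
toy = [('the', 60), ('of', 25), ('and', 10), ('to', 5)]
for row in coverage_table(toy):
    print(row['word'], f"{row['cumulative']:.0%}")
```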
After the first 8000 words or so, the ordering starts to be somewhat dubious: with so few occurrences of each word in a corpus this small, many counts tie, and the ties are broken by reverse alphabetical order.
From this point of view, brute-force recording the whole English lexicon seems like a surprisingly approachable, if boring, project. In a few days of work you could compile enough recordings for your computer to read most text understandably, if not naturally.
I analyzed (an earlier version of) this note using these frequencies. It contained two known words less common than #65536: “BNC” and “superfields”; four words less common than #32768: “synthesize”, “pasha”, “lessens”, and “pizzicato”; three words less common than #16384: “pauses”, “stabilise”, and “approachable”, plus some HTML tags; and eight words less common than #8192: “corpus”, “comprise”, “cumulative”, “cdna”, “capitals”, “brute”, “lexicon”, and “compile”.
Here’s the full analysis:
unknown
100 114 1 14 1000 20 10000 100 50 56 6 87 0 1 LPC 10 95 50 05 0 10 2 2 100
50 56 0 10 2 2 1261 75 005 0 0083 24 4 4407 87 5004 0 0020 1 8 10445 93 7500
0 0006 3 16 10446 93 7506 0 0006 3 16 20104 96 87494 0 00019 6 32 33275 98
43748 0 00007 11 64 43350 99 0 00004 14 100 60386 99 5 0 00002 19 200 76770
99 75 0 000011 24 400
1
the the the the the the the the the the the
2
of and and of and of of of of of of
4
a in to a in it in it in in to To in a In a
8
I is was that i for on you he that is you you that you that
16
from at with be with by at have are this not but had his they from as she
which This at had as this from this
32
one or we an there her were do been all their has would will what if one can
so no who said more about them some could him into its then up two or so
would about about or about about or so if them or could about an one if
could
64
just most out time between time my out like did only me your now other may
just these new also people any know first see well very should than how get
most over back way our much think between years go er first take me first
three These being er being any my only Here how Time many much er like work
128
British National took four part number point days
256
read words word probably words minutes words hours words words rate table
half words table Word word minutes minutes minutes hours hours hours hours
hours hours hours hours table whole English seems view few enough
512
popular per whatever s changes record miss force project
1024
typical text married
2048
seconds seconds represented frequency th th th th Self th th recording
surprisingly
4096
loud recordings overview Rank br br colonel boring recordings
8192
Corpus comprise Cumulative cdna capitals Brute lexicon compile
16384
pauses tr tr td td td td td td tr td td td td td td tr td td td td td td tr
td td td td td td tr td td td td td td tr td td td td td td tr td stabilise
td td td td td tr td td td td td td tr td td td td td td tr td td td td td
td tr td td td td td td approachable
32768
synthesize pasha lessens pizzicato
65536
BNC superfields
So, with 16384 words, which could be recorded in a day or two, plus numbers and initialisms, only “pauses” and “approachable” would have been missed in a text-to-speech rendering of a note like this one, leaving out the words deliberately chosen to be uncommon. Even with only a 512-word vocabulary, the note would be comprehensible.
Other notes in Dercuano are not so fortunate. The 7500-word note on energy-autonomous computing, for example, includes 1790 words not in the BNC at all; about half of these are real English words such as Github, Kobo, joule, touchscreens, ebook, and 80Mbps, plus many Spanish words and initialisms. The most common 512 words would have covered a bit over 3000 of its 7500 words.
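The bucketed analysis above can be reproduced mechanically: each word of a text goes into the bucket named by the power of two its frequency rank falls under, and words absent from the frequency table go into “unknown”. A rough Python sketch, where `rank_of` is assumed to map a lowercased word to its BNC rank (faked here with a tiny made-up table):

```python
import math
import re
from collections import defaultdict

def bucket_words(text, rank_of):
    """Group each word of text into power-of-two rank buckets,
    like the analysis above; words with no rank go in 'unknown'."""
    buckets = defaultdict(list)
    for word in re.findall(r"[a-zA-Z]+", text):
        rank = rank_of(word.lower())
        if rank is None:
            buckets['unknown'].append(word)
        else:
            # Bucket 2**k holds ranks in (2**(k-1), 2**k]; rank 1 is bucket 1.
            buckets[2 ** math.ceil(math.log2(rank))].append(word)
    return buckets

# Hypothetical ranks, purely for illustration.
toy_ranks = {'the': 1, 'of': 2, 'cat': 3000}
b = bucket_words("The cat sat", toy_ranks.get)
# 'The' lands in bucket 1; 'cat' (rank 3000) in bucket 4096;
# 'sat' is unknown in this toy table.
```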