Written Texts
Spoken Language
Spoken Spanish
Text Archive of the Ancient Sicilian: Medieval Sicilian texts from the 14th-16th centuries.
CEMET: The Corpus of Early Modern English Texts, compiled by Hendrik De Smet, Department of Linguistics, University of Leuven, Belgium.
CLMETEV: The Corpus of Late Modern English Texts (extended version), compiled by Hendrik De Smet, Department of Linguistics, University of Leuven, Belgium.
COCA: The Corpus of Contemporary American English, compiled by Mark Davies. 425 million words, 1990–present.
COHA: Davies, Mark. (2010–) The Corpus of Historical American English. 400 million words, 1810–2009.
LEON: Leuven English Old to New, compiled by Peter Petré, Department of Linguistics, University of Leuven, Belgium.
Penn Parsed Corpora of Historical English (PPCME2 and PPCEME).
TIME: Davies, Mark. (2007–) TIME Magazine Corpus. Compiled by Mark Davies. 100 million words, 1920s–2000s.
YCOE: The York-Toronto-Helsinki Parsed Corpus of Old English Prose.
The [Open] American National Corpus: The Open American National Corpus (OANC) is a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use.
The British National Corpus (BNC): The BNC is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
Time Magazine Corpus: Full text of the articles in Time Magazine (US) from 1923 to the present, providing more than 100 million words. Free login required for some features.
American English Dialect Recordings: The Center for Applied Linguistics Collection: 118 hours of recordings documenting North American English dialects, dating from 1900-1999. A few recordings of Canadian speakers are included.
IDEA: International Dialects of English Archive: Recordings are principally in English, are of native speakers, and include both English-language dialects and English spoken in the accents of other languages.
Edinburgh University Speech Timing Archive and Corpus of English (EUSTACE): Comprises 4,608 sentences spoken by six speakers of British English. The sentences are designed to examine a number of durational effects in speech and are controlled for length and phonetic content.
Corpus of Canadian English (Strathy): 50 million words from more than 1,100 spoken, fiction, magazine, newspaper, and academic texts.
Hip Hop Word Count (Rap Research Lab): Access by request. V0.2.2 contains syntactic, semantic, and rhyme data for 50 artists. A dataset covering 20,000 artists is in progress.
Helsinki Corpus of English Texts: A multi-genre diachronic corpus that includes text samples, organized by period, from Old, Middle, and Early Modern English.
Swedish
Danish
Norwegian
Icelandic
Faroese
General Internet-Corpus of Russian (GICR): A megacorpus of tagged texts from the Russian Internet, including news sites, VKontakte, LiveJournal, and Mail.ru blogs.
Czech Texts
Spoken Czech
Balanced Corpus of Contemporary Written Japanese (BCCWJ): The BCCWJ was created to capture the breadth of contemporary written Japanese, and it draws on extensive samples of modern Japanese texts to make the corpus as balanced as possible. The data comprise 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Random samples of each genre were taken.