The Oxford Corpus

This project is funded by The Leverhulme Trust and we are working in partnership with Children's Dictionaries at Oxford University Press.

What is a corpus?

A corpus is a 'word bank', a record of natural language samples. The Oxford Corpus has been developed by Oxford University Press to help them develop children's dictionaries and curriculum materials. It is in two parts. The reading part contains millions of words of text written for children (for example, children's reading books and popular fiction). The other part is particularly interesting as it contains millions of words written by children themselves. Hundreds of thousands of children have entered the 500 Words competition, hosted by BBC Radio 2 over the last couple of years. All of the children's stories have been entered into the corpus. This provides a unique window into the fantastic, creative minds of children. Hop over to the BBC's 500 Words pages to learn more about the competition and to read some fabulous stories.

What can we learn from the Oxford Corpus?

Lots! Oxford University Press analyse the children's stories each year, identifying popular words and themes, geographical variations and much, much more. You can hear about their findings by visiting their website. At ReadOxford, we are using the corpus to help us understand a number of things. We can look at children’s grammar, punctuation and spelling to see how those skills vary across the population and change with age. We can understand more about how children use language creatively (inventing new words, using existing words in novel ways) to express the contents of their imaginations, and see how this is influenced by the children’s own experiences. And as the stories come from children all over the country, we can build a national picture of children’s writing.

One focus at present is on how children's experience with words in stories helps them to learn new words. We know that reading practice is important: words that are seen more often are easier to read than words that are rare or unusual. In our project, we are asking how the language environment in which words appear influences how easy it is for children to learn them. With so many millions of words to consider, we can't do this by hand. Instead, we are developing powerful computer programs to analyse the text. We then combine these findings with experiments carried out with children in school. We hope to develop a much richer picture of how children's reading experience drives the development of reading skill.
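
To give a flavour of what analysing text by computer can look like, here is a small Python sketch. It is not the software used in the project. The function names `tokenize` and `word_statistics`, the three sample stories, and the idea of using "how many different stories a word appears in" as a very rough stand-in for a word's language environment are all assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_statistics(stories):
    """Count, for every word, how often it occurs and in how many stories.

    Returns two Counters: total occurrences per word, and the number of
    distinct stories each word appears in.
    """
    frequency = Counter()
    document_count = Counter()
    for story in stories:
        tokens = tokenize(story)
        frequency.update(tokens)            # every occurrence counts
        document_count.update(set(tokens))  # each story counts once per word
    return frequency, document_count


# Made-up example stories (hypothetical data, not taken from the corpus).
stories = [
    "The dragon flew over the castle.",
    "A dragon slept in the cave.",
    "The princess rode her bike to the castle.",
]

frequency, document_count = word_statistics(stories)
for word in ["the", "dragon", "princess"]:
    print(word, frequency[word], document_count[word])
```

In this toy example, a word such as "dragon" is both seen more often and met in more stories than "princess". With millions of words, the same kind of counting lets us compare words that are common with words that are rare, and words that turn up in many different stories with words that appear in only a few.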

Together with our partners at O.U.P., we have produced three videos to explain how we are using the Corpus in our research and how O.U.P. are using the database to develop their range of dictionaries. You can view the videos here, here and here.