Corpus Linguistics Research Paper Starter

Corpus linguistics is the empirical study of language as it occurs naturally, rather than as it is prescribed by theoretical rules and structures. Corpus linguistics uses corpora, or empirical collections of written and/or spoken text, to discern naturally occurring patterns and features of language use. Corpus-based research is particularly useful in the study of language acquisition, as corpora derived from the speech of children or students at various points in their development disclose essential details of the language learning process. Corpus linguistics as practiced today, with the aid of automation and the availability of large, comprehensive corpora, is a booming field that researchers predict will continue to dominate research on language in the decades to come. Language pedagogy has been and will continue to be profoundly affected by developments in corpus linguistics, as empirical observations of language use are critical to formulating theories of language learning and teaching.

Keywords: Collocation; Concordance; Corpus; Data-driven learning; Discovery learning; Learner corpus; Lexicography


Introduction to Corpus Linguistics

Corpus linguistics refers to the empirical study of language as it occurs naturally in various contexts and under specific conditions. Corpus linguistics uses corpora, or empirical collections of written and/or spoken text, to discern naturally occurring patterns and features of language use. Corpus-based research is particularly useful in the study of language acquisition, as corpora derived from the speech of children or students at various points in their development disclose essential details of the language learning process.

A corpus is a large collection of text representative of a language or of a subset or genre of a language. Corpora are assembled by teams of researchers who select, categorize, and annotate text. These data are then sorted, parsed, and analyzed with the aid of computer programs, typically concordance programs and statistical packages. Concordances are lists of the occurrences of particular words or phrases in the corpus, each shown with its surrounding context. Through concordance analysis, researchers can determine in which contexts a word, concept, or phrase is most prevalent, can compare the frequency and use of synonyms or similar ideas, and, with the help of statistical software, can characterize patterns of use.
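The idea of a concordance can be illustrated with a minimal sketch. The code below is not taken from any particular concordance program; the function names `tokenize` and `concordance`, the three-word context window, and the sample sentences are all assumptions made for illustration. It lists every occurrence of a keyword together with the words to its left and right.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical three-sentence "corpus", invented for this example.
sample = ("The monks compiled an index of the text. "
          "A modern corpus is a large collection of text. "
          "Researchers annotate the text by hand.")

tokens = tokenize(sample)
lines = concordance(tokens, "text", window=3)

# Print the hits with the keyword aligned in a column.
for left, kw, right in lines:
    print(f"{left:>25} [{kw}] {right}")
```

Aligning the keyword in a single column lets a reader scan the neighboring words for recurring patterns. A real corpus would also call for more careful tokenization and for the annotation described above.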

Text in a corpus may be divided into any number of registers, or categories. Possible registers include texts written by various groups, texts of a specific genre, texts derived from speech, and so on (Biber et al., 1998). Through the use of registers, researchers can find and describe language patterns under various conditions and constraints. For example, differences in language use among news reports, novels, and poems can be explored, and the vocabularies of native speakers and of second language learners can be compared.
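A comparison across registers can be sketched in a few lines. Because registers differ in size, raw counts are usually converted to a rate; the sketch below uses occurrences per 1,000 words. The register names, the sample sentences, and the helper `per_thousand` are hypothetical and serve only to illustrate the calculation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(word, tokens):
    """Occurrences of `word` per 1,000 tokens, so registers of different sizes compare fairly."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Two tiny, invented registers; a real study would use far larger samples.
registers = {
    "news": "The minister said the budget would rise. Officials said little else.",
    "fiction": "She said nothing. The rain fell, and the house was quiet all night.",
}

rates = {name: per_thousand("said", tokenize(text))
         for name, text in registers.items()}

for name, rate in rates.items():
    print(f"{name:8s} {rate:6.1f} per 1,000 words")
```

With samples this small the rates mean little in themselves; the point is only that normalizing by register size is what makes the figures comparable.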

Corpus-based analyses have been used to develop dictionaries, to parse out and describe features of language, to derive new theories of grammar, and to develop teaching materials that address actual language use, not only linguistic theory.

Corpus Linguistics in Context

Though the term 'corpus linguistics' was coined only in the second half of the twentieth century, all language studies before modern, Chomskyan linguistics were corpus based. As far back as the Middle Ages, monks created large tables and indices of phrases and passages from sacred texts to be used for further analysis (McEnery & Wilson, 2001). Lexicography, the study of the meaning and use of words, also took root during this period (Biber et al., 1998). Lexicography relied on measurements of the frequency of words and of the relations between words in various texts; that is, it relied on an early form of linguistic corpus research.

During the eighteenth century, empirical language studies were used in understanding language acquisition and in creating language reference and learning materials. For example, in 1775, a corpus was used to provide samples of language use for dictionary words, and in the nineteenth century, a large compendium of texts was used to create the Oxford English Dictionary (Biber et al., 1998).

From about 1876 through 1926, diary studies were the predominant method of gathering corpus data aimed at understanding language acquisition. Parents participating in these studies kept detailed accounts of their children's utterances, which were later analyzed for patterns of normative behavior. The resulting diary corpora are still used at present as "sources of normative data" (McEnery & Wilson, 2001, p. 3).

In the early twentieth century, the empirical study of language took on a more formal shape with the birth of field linguistics and of the structuralist movement (McEnery & Wilson, 2001). Researchers in these traditions collected records of spoken language and later analyzed this corpus material in a 'bottom-up', procedural manner. The study designs most commonly employed by field and structural linguists were large sample studies and longitudinal studies. Large sample studies, prevalent from around 1927 through 1957, drew on many students and language samples to determine and describe average language knowledge and usage. In longitudinal studies, popular since the early 1960s, researchers collect corpus data from the same participants over a period of time and use these data to describe changes in language acquisition and learning behaviors (McEnery & Wilson, 2001).


Corpus-based language studies were interrupted in the late 1950s by the research of Noam Chomsky (1928 - ), a linguist who ushered in a new wave of rationalist linguistics and disputed the validity of using corpora to represent language (Chomsky, 1957). Chomsky argued that all empirical collections of language samples, that is, all corpora, are skewed and incomplete. They are skewed in that they favor particular uses of language at the expense of others; for example, impolite, false, and obvious statements seldom find their way into corpus collections (Biber et al., 1998). Further, corpora are incomplete because the number of sentences in a language is infinite; no finite collection of text could ever fully represent all possible configurations of words (McEnery & Wilson, 2001).

Corpus analysis thus lost its popularity during the 1950s and 1960s, but resurfaced in the 1970s with the advent of powerful computing capabilities. The arguments leveled by Chomsky against corpus linguistics were addressed during this period, and by the early 1980s, large-scale corpus-building projects had been undertaken by many universities and academic partnerships.

Corpus language research, after a dramatic struggle with rationalist theories, overcame Chomsky's challenges and transformed the newly forming field of linguistics. Supporters of corpus linguistics argued that natural language corpora provide key insights into language acquisition processes that cannot simply be theorized. They recognized that corpora do not provide complete accounts of language use, but found corpus linguistics invaluable in research on language acquisition and on language pedagogy. Further, corpus research began to provide empirical evidence against purely structuralist, rationalist grammars. These grammars conceived of language use as a 'fill-in-the-slot' process in which appropriate words are fitted into preconceived, theoretically 'correct' sentence structures. Research found that, on the contrary, language users rely on schemata and on learned collocations, or commonly used word combinations, when engaging in authentic natural speech (Sinclair, 1991).
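Collocations of this kind can be surfaced from a corpus by counting which words occur side by side more often than chance would predict. The sketch below scores adjacent word pairs with pointwise mutual information (PMI), a measure chosen here for illustration and not one the sources cited above are said to use. The function name `bigram_pmi`, the `min_count` threshold, and the sample sentences are likewise assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented sample text; sentence boundaries are ignored for simplicity.
sample = ("Strong tea is served at noon. He likes strong tea. "
          "A strong wind blew. Tea was popular.")

scores = bigram_pmi(tokenize(sample))
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than the individual word frequencies alone would suggest. The `min_count` threshold matters because PMI tends to overrate pairs that have been observed only once.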

The successful resurfacing of corpus language research was enabled in the early 1970s by the introduction of the computer into the laboratory. Automated processing allowed for previously unimagined storage and analysis capabilities. Researchers were now able to analyze the frequency with which words appeared across registers, the associations between words and common phrases, and the multiple meanings behind individual words (Biber et al., 1998).

Modern Corpus Research

Researchers undertaking linguistic corpus research are able to investigate any feature of language, such as grammar, semantics, and pragmatics (McEnery & Wilson, 2001). However, the comprehensive study of any of these fields requires large, representative data sets from which empirical laws can be derived (Myles, 2005). Attempts at compiling large, comprehensive corpora have therefore been underway since the inception of corpus linguistics. The compilation of corpora is most often conducted alongside other research aims. Some of these include:

• Understanding descriptive grammars,

• Discourse analysis,

• Pragmatics, and

• Language acquisition.

Descriptive grammar is a new approach to studying grammar based on corpus research. Traditional, rationalist grammars are prescriptive; that is, they dictate the ways in which words should be used. Descriptive grammars examine corpora of text derived from naturally occurring speech and detect the grammatical rules actually in use. The corpus approach to grammar was pioneered by Charles C. Fries (1887-1967), who compiled the first large corpus of spoken English by transcribing and annotating large numbers of taped phone conversations (Fries, 1952). Research in descriptive grammar has expanded and has been enriched by the availability of storage, sorting, and analysis technologies. Corpus text is processed using register analysis, which examines the frequency, organization, and form of words and phrases as compared across many registers (Conrad, 2000). Information about language use across large numbers of registers adds nuance to descriptive grammar studies, which acknowledge that the grammar of each form of language is unique; for example, the ways in which individuals speak on subway trains are not the same as...
