December 24, 2009

Computational Linguistics

Ever have difficulty deciding whether material should be classed in 006.35 Natural language processing or in 410.285 Computational linguistics? (It would seem so, since many works have been classed in both numbers.) Since we have also found it difficult to distinguish clearly between the two numbers, we decided to take advantage of a recent major gathering of computational linguists at ACL-08: HLT (ACL = Association for Computational Linguistics; HLT = Human Language Technology) to get their feedback on the treatment of computational linguistics and natural language processing in the DDC.

According to LCSH, the intended distinction between computational linguistics and natural language processing is that Computational linguistics (LCC: P98-98.5; DDC: 410.285; 467 WorldCat records) is for “works on the application of computers in processing and analyzing language,” whereas Natural language processing (Computer science) (LCC: QA76.9.N38; DDC: 006.35; 365 WorldCat records) is for “works on the computer processing of natural language for the purpose of enabling humans to interact with computers in natural language.” Dewey currently adopts this same distinction. The distinction, however, does not reflect current thought.

Computational linguists at ACL-08 tended to agree that “natural language processing” (NLP) and “computational linguistics” (CL) mean pretty much the same thing (or, if different, that the meaning of natural language processing is encompassed within the meaning of computational linguistics). That makes our decision to merge natural language processing and computational linguistics relatively easy.

Deciding where the merged subject should go is much harder. On the one hand, there was agreement that the relative contribution of computer science to computational linguistics is greater than the contribution of linguistics. Similarly, there was agreement that a background in computer science is more essential for computational linguistics than a background in linguistics. Further, computer scientists are much more likely than linguists to embrace computational linguistics as part of their field. From these statements, classing the merged natural language processing / computational linguistics in 006 might seem a no-brainer. On the other hand, some of the observations shared suggest that the situation may not be so cut-and-dried: Computational linguistics really belongs in linguistics, but linguists don’t realize it yet. Computer scientists sometimes change the field they apply their skills to (that is, a junior computational linguist might not continue to work in computational linguistics). As a supervisor, you get better results teaching computer science to a linguist than teaching linguistics to a computer scientist.

There are at least two distinctions made in computational linguistics that should inform our decision. The first is a distinction between symbolic and statistical approaches to computational linguistics, the former emphasizing linguistics-based representations of natural language, the latter emphasizing quantitative representations of natural language. Many symbolic approaches could be classed comfortably within linguistics; statistical approaches could be classed there considerably less often.

A second distinction is made in computational linguistics between tasks and applications: Computational linguistics tasks (e.g., part-of-speech tagging, parsing, word sense disambiguation, text segmentation) rely, wholly or in part, on specific properties of language in their processing and analysis and may be combined to form applications of extrinsic value; computational linguistics applications (e.g., question answering, information retrieval, automatic abstracting, machine translation) are composed of components addressing multiple linguistic properties and are of extrinsic value. Again, one end of our spectrum (in this case, tasks) is much more like linguistics than the other (in this case, applications, unless the application is itself in linguistics, e.g., translation), but all applications carry out some number of tasks.

It appears to us that the best solution would be to drop the distinction between natural language processing and computational linguistics by relocating comprehensive and interdisciplinary works on computational linguistics from 410.285 to 006.35. We would continue to use 410.285 in its broad meaning as computer applications in linguistics; for example, the SIL (initially known as the Summer Institute of Linguistics) software catalog, which supports the work of field linguists, would be classed in 410.28553. This catalog includes, inter alia, fonts, a concordance generator, a tool for drawing syntax trees, interlinear text editors, a Spanish verb conjugator, and a program for learning the International Phonetic Alphabet.
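
For readers who find a decision rule easier to follow in code, here is a minimal sketch of how the proposed treatment might be read. The two Dewey numbers come from this post; the `Work` record, its `supports_linguistic_work` flag, and the `suggest_number` function are hypothetical names invented for illustration, and the sketch does not model subdivisions such as 410.28553.

```python
from dataclasses import dataclass

# Numbers taken from the post; all other names below are hypothetical.
NLP_AND_CL = "006.35"                  # merged natural language processing / computational linguistics
COMPUTERS_IN_LINGUISTICS = "410.285"   # computer applications in linguistics (broad meaning)


@dataclass
class Work:
    title: str
    # Hypothetical flag set by the cataloger's judgment: True when the work is
    # a computer tool or resource that supports linguists' own work
    # (for example, a software catalog for field linguists).
    supports_linguistic_work: bool


def suggest_number(work: Work) -> str:
    """Suggest a base number under a simplified reading of the proposal."""
    if work.supports_linguistic_work:
        return COMPUTERS_IN_LINGUISTICS
    # Comprehensive and interdisciplinary works on computational linguistics
    # would be relocated here under the proposal.
    return NLP_AND_CL
```

Under this reading, a general survey of computational linguistics would get 006.35, while a software catalog for field linguists would stay under 410.285. Real classification would of course rest on the schedules and the cataloger's judgment, not on a single flag.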

We would love to hear your reactions to this solution. (Or if you have another solution that accounts for the interdisciplinary nature of computational linguistics, we would love to hear that, too.) For best consideration, please either comment on this blog or send email to dewey@loc.gov by August 15.

source: http://ddc.typepad.com/025431/400499_language/
