Text Skimming And The Two Questions

When I went to the LSA Summer Institute of Linguistics in June of 2001, I had a growing interest in speech synthesis. Upon returning to UGA that fall, where I was enrolled as a Ph.D. student in computer science, I was sure that I wanted to concentrate my graduate research on the topic. My major advisor called me into his office at the start of the semester to tell me that if I wanted to pursue speech synthesis as my research field, then I should consider applying to different universities because no one at the University of Georgia had done research in that area. He quickly added that he thought I should pursue my work with computer speech and asked me which universities were doing the kind of work that excited me. The first university on my list was the University of Cambridge, in the U.K. He urged me to apply since he had studied linguistics at Cambridge. He also suggested that I apply for the Gates Cambridge Scholarship, which had just been established the previous year.

After contacting the person who was to become my supervisor in linguistics at Cambridge, I applied to the M.Phil. program in theoretical linguistics (similar to a research-based Master’s degree in the U.S.), and I also applied for the Gates Cambridge Scholarship. A few months later, I learned that both applications had been successful and that I would begin in the autumn of 2002.

The M.Phil. in theoretical linguistics was a 9-month intensive research degree involving taught courses, research papers, and a thesis. I read more during that year than I had in the previous several years combined. That was easily the most difficult year I had ever faced because none of the many books that I needed was available on tape.

The flatbed scanner that I was using took a full 60 seconds to scan each page. When I got to Cambridge, I special-ordered a library flatbed scanner to reduce the scanning time from 60 seconds per page to 20, but it still took several hours just to scan a book. Then I still had to listen to the books read by the computer. My dad started scanning my books in the U.S. and emailing me the scans as PDFs to eliminate the time I had to spend scanning. Even with the accommodations I had available, there was a real concern that I would not be able to read enough to finish.

I did finish my research and my thesis and received the M.Phil., but I learned that the lack of a natural-sounding voice was not the bottleneck that was causing me to have such a profound time disadvantage. I asked my course mates how they were able to read so many books so quickly, and they told me that they did not read every book from cover to cover. They skimmed the books to find what they needed. This was something that I was not able to do because of my eyesight.

With my M.Phil. research I made a small contribution to the quality of speech synthesis by writing an algorithm to mimic a particular speech pattern that had not been mimicked before. However, I had a bigger problem to solve. It was obvious to me that I needed to be working on the problem of how to teach a computer how to skim a book for people who were unable to skim for themselves.

The Gates Cambridge Trust extended my scholarship to stay at Cambridge for a Ph.D. in computer science. I joined the Natural Language and Information Processing group of the Computer Laboratory and started studying the structure of text and the way we read it. I hoped that learning what makes text understandable to us would give me ideas for how to teach a machine to understand what a text document is about.

There are many types of research projects that share the task of gathering relevant information from a large number of documents. Whether you are writing a term paper for school or a graduate thesis, or conducting discovery to prepare for a legal case, having command of the literature is crucial. When conducting a literature search, there are two questions that must be answered quickly for each document under consideration.

1. Is this document relevant to my search?

Every time you pick up a book or an article, the first thing you need to know is whether the document can help you. If the document is not useful, then you do not need to read it for your project.

2. If this document is relevant, where is the relevant information?

Reading an entire document, or even most of a document, to find one important passage is not a good use of time, especially if you have a lot of documents in your search. If you are searching long documents, like books or reports, then reading irrelevant information is costly. A keyword search has limitations when trying to answer either question, namely that it is up to the user to construct the right search queries. If the information you need uses slightly different vocabulary from the words in your search, then you may not find it even if it is there.
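The vocabulary problem with keyword search can be shown in a few lines. This is only a minimal sketch, assuming a naive matcher that looks for exact words; the function name `keyword_hits` and the sample passage are made up for illustration and do not describe any particular search tool.

```python
import re

def keyword_hits(text, query):
    """Return the query words that appear verbatim in the text (case-insensitive)."""
    # Split the text into lowercase words so matching ignores case and punctuation.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [w for w in query.lower().split() if w in words]

# Hypothetical passage: relevant to a search about doctors and medicine,
# but written with different vocabulary.
passage = "The physician prescribed a new medication for the patient."

# The query uses near-synonyms, so exact matching finds nothing.
print(keyword_hits(passage, "doctor medicine"))

# The query happens to use the passage's own words, so it matches.
print(keyword_hits(passage, "physician medication"))
```

The first query returns no hits even though the passage is about exactly that subject, which is the situation described above: the information is there, but the search words do not match it.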

I could not answer either of these questions without listening to each document from beginning to end. That was my chief time disadvantage. Both questions can be answered quickly by a skilled reader who is able to skim, but neither question can be answered quickly by listening to the entire text, because you cannot rule out the relevance of a document, or any part of a document, until you are confident that nothing mentioned in the text is useful.

Although I had never been able to skim a book before, my new quest was to simulate the skimming process of a skilled reader on a computer with an interactive interface that would be accessible to people with print disabilities. This new technology I was seeking to create became the subject of my graduate research. I already had the perfect measurement standard since I was about to start a Ph.D. at Cambridge with even more to read; my success depended on my gaining reading knowledge of hundreds of books and academic papers, and this would require the technology I was developing in my research.
