A Prototype Of Skimcast: Automatic Text Skimming

By 2005, I was conducting experiments with human readers to try to mimic the intelligence that they used when reading. Through these experiments, I was developing and testing the linguistic framework that would enable a machine to understand the discourse of the text that it read. By the summer, I had developed some initial algorithms for simulating the skimming process and had constructed some working demonstrations.

Over the 2005 Christmas break, I wrote a prototype of an intelligent reading machine using the linguistic principles that I was formalizing. The implementation was not needed for the Ph.D. and was actually frowned upon, since it was generally thought that unnecessary programming would distract from research and dissertation writing. (I found that writing such code over Christmas vacation avoided concerns that I might be misusing my work time.) In my case, however, I needed the system that I was building in order to finish the literature review required for my dissertation.

The initial version of my automatic skimming software went through several rewrites, and I tested many different interfaces for navigating text and representing topics. The underlying technology eventually produced the Skimcast website, browser extension, and mobile apps that are available today. Although I am still working on new and different ways to make the process of reading and skimming text easier, Skimcast is still built around the core linguistics of the prototype.

As discussed in a previous post, I was interested in how to discover the contextual meaning of a word. When we read a document, we understand the document’s content by first understanding its vocabulary. Every text that is about something uses specialized vocabulary to express the topics that the text is about. For example, a news article about the economy may contain vocabulary terms such as “gross domestic product”, “stock market”, and “economic growth”. If we were to see margin notes identifying these vocabulary terms in bold, then we may be able to infer that the text is about the economy without reading the text. By “specialized vocabulary” and “vocabulary term”, I am referring to phrases of text that are characteristic of the subject area of the text, as opposed to other words that may occur in the text but carry less relevant information, like “afternoon” or “announcement”.

One of my discoveries was that vocabulary terms behave like vocabulary terms no matter the subject. That is, vocabulary terms are used in text with certain linguistic patterns that are different from the usage patterns of other words. Therefore, Skimcast is able to recognize vocabulary terms without the use of an external dictionary of known terms in any particular subject. This is significant for allowing a machine to understand text because of the time, cost, and data required for human experts to maintain dictionaries of machine-readable key terms.

Another advantage of Skimcast is the ability to judge the linguistic prominence of a vocabulary term by analyzing the term’s linguistic patterns rather than relying only on the number of times the term is repeated. Just because a word is common and is repeated many times in a text does not mean that it represents one of the most important themes in the text. Often, the most important words are used strategically by the author and are not the most common words in the text. Skimcast can use its linguistic prominence measure to figure out which vocabulary terms represent broad themes and which terms merely provide supporting detail.

With the two innovations above, I was able to write a first version of Skimcast that could read any text and produce a list of vocabulary terms, ranked by linguistic prominence. Each term had links to the sentences in the text where the term was used (like a back-of-book index). The terms that were judged by Skimcast to represent broader themes were labelled with links to passages of text that were about these themes (like a table of contents).
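The output described above (terms ranked by prominence, each linked to the sentences that use it) can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions, not Skimcast's actual implementation: the names `TermEntry`, `split_sentences`, and `build_index` are hypothetical, the prominence scores are simply supplied by the caller rather than computed from linguistic patterns, and sentences are split with a naive regular expression.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TermEntry:
    """One row of the index: a term, its score, and where it occurs."""
    term: str
    prominence: float  # assumed to be supplied by the caller, not computed here
    sentence_ids: List[int] = field(default_factory=list)


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(text: str, scored_terms: Dict[str, float]) -> Tuple[List[str], List[TermEntry]]:
    """Link each term to the sentences that contain it, then rank by score."""
    sentences = split_sentences(text)
    entries = []
    for term, score in scored_terms.items():
        # Case-insensitive substring match stands in for real term detection.
        ids = [i for i, s in enumerate(sentences) if term.lower() in s.lower()]
        entries.append(TermEntry(term, score, ids))
    # Highest prominence first, like the ranked term list described above.
    entries.sort(key=lambda e: e.prominence, reverse=True)
    return sentences, entries


# Usage with made-up text and made-up scores.
text = (
    "The stock market fell on Monday. "
    "Analysts blamed slow economic growth. "
    "The stock market recovered by Friday."
)
sentences, index = build_index(text, {"economic growth": 0.6, "stock market": 0.9})
for entry in index:
    print(entry.term, entry.prominence, entry.sentence_ids)
```

Each entry's `sentence_ids` plays the role of the back-of-book index links. A table-of-contents view would additionally need passage boundaries for the broader themes, which this sketch does not attempt.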

As mentioned earlier, when we skim a document, we construct a conceptual structure of the content of the text. The conceptual structure of a text allows us to see what the text is about. We can see which concepts are close to each other, perhaps occurring in the same sentence or paragraph. By using the same process, we can search the text for particular concepts. I found that with Skimcast’s list of linguistically prominent themes, along with links from the themes to the relevant sentences in the text, I was able to perform the functions of skimming a document mentioned above.

In order to skim a document with Skimcast, the document had to be available in an electronic format (such as PDF). I had access to the entire library of online publications by the Association for Computational Linguistics (ACL). At the time, this library consisted of approximately 10,000 journal articles, conference papers, and workshop papers in PDF format. The ACL Anthology was an ideal library for testing my prototype of Skimcast, and not only because it was large and available online: since the area of interest for my Ph.D. was computational linguistics, a significant percentage of the papers that I used in my research came from this library.

Producing a literature review with Skimcast

In addition to conducting my research, I was facing the task of writing a dissertation. It was not enough to write about the work that I was doing; an important requirement of the dissertation is a lengthy literature review that gives detailed summaries of journal articles, conference papers, and books that are relevant, or may be relevant, to the given field. This was the part that was the hardest for me because it called for reading knowledge of hundreds of sources that I could not read on my own.

The papers that are cited in the bibliography are papers that were actually used in the dissertation. Many more papers than that must be understood well enough to be rejected as not relevant enough (see the two questions that must be answered quickly when doing research).

Using my new algorithms and linguistic principles, I was able to complete a prototype of an intelligent reading machine that would allow me to skim a text well enough to understand details about the text without reading it. I ran the entire ACL Anthology through the Skimcast engine and produced a “skimmable library”. I could open any document in the library and get an outline of important themes in the text. The most important sentences were highlighted, allowing me to read only the highlighted sentences to obtain an overview of the document. If I needed more detail, then I could search the text for key themes.

A typical journal article, conference paper, or workshop paper in the ACL Anthology would describe experiments testing a particular hypothesis and present the results of the experiments in a comparison to competing experiments. If I opened such a paper in Skimcast, I could determine, just by clicking a few themes and reading a few sentences, the hypothesis, the experiment design and methodology, the data set used in the experiment, the algorithm being tested, and the final results. This allowed me to find relevant papers, or eliminate papers not relevant to my work, in a short amount of time.

After finding a relevant paper, I could use Skimcast to find other relevant papers by comparing the themes of all of the papers in the library. This kind of semantically enhanced search could use contextual information from the content instead of relying on keyword searches. A natural barrier to doing a context search over a library of texts is that it requires the library to be preprocessed. Although this is not yet available in the current version of Skimcast, it is a priority.
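To make the idea of comparing themes across a preprocessed library concrete, here is a minimal sketch. It does not show how Skimcast performs the comparison, which is not described here. As a stand-in, it assumes each paper has already been reduced to a set of theme strings and ranks the other papers by Jaccard similarity (shared themes divided by all themes). The function names, paper identifiers, and themes are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of themes the two sets have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query_id: str, library: Dict[str, Set[str]], top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank every other paper in the library by theme overlap with the query."""
    query = library[query_id]
    scored = [
        (other_id, jaccard(query, themes))
        for other_id, themes in library.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# A made-up, already preprocessed library: paper id -> set of themes.
library = {
    "paper_a": {"word sense disambiguation", "context", "corpus"},
    "paper_b": {"word sense disambiguation", "corpus", "evaluation"},
    "paper_c": {"machine translation", "alignment"},
}
print(most_similar("paper_a", library))
```

The sketch also illustrates why preprocessing is the barrier: the `library` mapping has to exist for every paper before any single query can be answered.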

Today, I use Skimcast for all of my reading, whether I am doing research, reading emails, or even reading menus in dark restaurants. In future posts, I will describe some of the ways that I use Skimcast and uses that were shown to me by students.
