Thursday, October 31, 2013

11-7 Travis Brown et al. The substantial words are in the ground and sea: computationally linking text and geography

33 comments:

  1. 1 - This project is incredible, but I wonder if textual studies are really up to the challenge of fully engaging with this as a new branch of literary analysis. I haven't worked with many digital humanists, but while this project seems like an incredible way to perform a sort of 'literary ethnography,' I would love to see some examples of how scholars are engaging with this technology to generate new analyses of texts, or simply to highlight their own hypotheses/interpretations of a work.
    2 - Does mapping literary works render something that's supposed to be subjective into an objective piece? I understand that travel literature would mention concrete places, but I'm thinking of how spatial studies have intersected with literature in the construction of "the city." A good example of this would be Calvino's novel "Invisible Cities," where numerous cities are discussed, but towards the conclusion of the novel the reader is supposed to consider these descriptions as different permutations of the same space. Would rendering these spaces geographically cause something to be lost in translation?

  2. Brown et al.

    1. On page 331, the authors say, "in our work we explicitly do not use information about population, since population figures given in gazetteers are for recent years and are likely to be irrelevant or even misleading for analyzing texts from the nineteenth century." How might this be compensated for? What studies are being done, or prototypes developed, to expand the population potential of this technology and apply it to historical gazetteers?

    2. In their discussion of words associated with specific places, such as "bonaparte" and Corsica, the authors claim, "Others seem simply anomalous, such as the situation of 'louvre' somewhere in Austria." This can't really be the case, can it? Either these "anomalous" results are due to errors, or there is a previously under-emphasized connection between the term in question and the location.

    3. As an English Lit person, I couldn't help but think about the exciting possibilities for this type of mapping of fictional worlds. Has any of this been done so far? It could be especially interesting and fruitful if applied to fantasy or sci-fi works. I also look forward to the advance of this technology to the point that we can scale through time and see the transformation of topographies, especially in cities.

  3. 1. In these language analysis steps, the authors employ models that were trained on newswire text from the 1980s and 1990s using machine-learning techniques. The authors also admit that these models don't work well on more temporally distant English text, such as nineteenth-century documents. So why not train the models on nineteenth-century text? Wouldn't this approach achieve better results?


    2. One of the major inputs for TextGrounder is the gazetteer, and the authors' research mainly focuses on nineteenth-century documents. So do the gazetteers provide information for that time period? If not, would proper gazetteers, ones that provide relevant information for the nineteenth century, be a factor in improving TextGrounder?


    3. When scaling up to larger corpora, would the nature of the corpora affect the accuracy of TextGrounder? Though TextGrounder employs unsupervised methods, can it handle radically different datasets? For example, one dataset might be nineteenth-century documents, while another might be twenty-first-century science fiction.

  4. 1. Is this program only applicable to the English language? What about finding and annotating places that are identified in their own native tongue in texts, i.e. España instead of Spain? They are still relevant subject matter, but will they be missed or will the software be able to identify these, too?

    2. On pg. 330, the authors write about how the system and software act as a guide, synthesizing material you might not even have imagined or put together. That the underlying data relationships can lead down new paths of inquiry is very interesting, but I wonder how many are true and valid relationships and what percentage is just bogus material and errors? Is that just to be determined by a human analyst or whoever is using the results?

    3. What about the possibilities of mapping ever-changing territories or countries? Is it up to the creator of the query and annotations or interpreter of information to understand the history behind toponyms when creating a specific field, or would the system be able to synthesize that information?



  5. 1. The authors discuss this very unique mapping program, TextGrounder, which uses algorithms to match text to correct locations on the earth’s surface. They explain that before the locations can be matched to the text, the text must first be annotated, and that programs can be trained to annotate text. The authors concede that “human annotators will always outperform machines for accuracy (although not speed) on tasks that involve finding and labeling, such as toponym resolution” (331). While it’s clear that programs can make deeper connections within the text or reveal more complex/interesting patterns than a human might, is accuracy really worth sacrificing? Have human annotations been compared to those made by computer programs? How do they match up?

    2. The authors explain that TextGrounder uses OpenNLP in order to process the text. As stated in the article, OpenNLP is used "to identify sentence boundaries, tokenize the text, label each word with a part of speech, and finally identify the named entities, which may be multiword expressions" (330); a sketch of what such a pipeline produces appears at the end of this comment. What is this program's accuracy rate? The problem with Google Earth is that it misidentifies locations that share a name with the one in the text. Similarly, some words that OpenNLP finds might be homographs, or the same word might be used as several different parts of speech. How is OpenNLP affected by this, and how does it correctly analyze the text?

    3. One interesting aspect of this study was when the authors claimed that their “methods do not even require corrected text”. Do they not require corrected text because the sample corpus was older texts that had already been edited? If so, how would TextGrounder work on unedited text? If it really doesn’t require the text to be edited, how does the program deal with misspellings or typos in the text?
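    Question 2 above quotes the OpenNLP pipeline steps (sentence detection, tokenization, part-of-speech tagging, named-entity recognition). As a minimal sketch of what such a pipeline produces, here is an illustration using spaCy as a stand-in for OpenNLP; the model name and the sample sentence are assumptions for illustration, not anything from the article.

```python
# Sketch of an NLP pipeline like the one described above: sentence splitting,
# tokenization, POS tagging, and named-entity recognition (spaCy, not OpenNLP).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model
doc = nlp("Whitman left Brooklyn and travelled through New Orleans.")

for sent in doc.sents:              # sentence boundaries
    for token in sent:              # tokens with parts of speech
        print(token.text, token.pos_)

for ent in doc.ents:                # named entities, possibly multiword
    print(ent.text, ent.label_)     # e.g. "New Orleans", GPE
```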

  6. 1. The authors raise an important point about the need for grounding language in historical and geographical context to more accurately predict location. While I can see how developing a system to take these variables into account could raise the accuracy of toponym identification, does developing such a system account for the fact that language is often fluid and dynamic rather than linear and gradual? For example, slang such as "chill" or "cool" in the modern sense has many meanings that can't be isolated through geography or historical context alone. Similar words have different meanings for different populations and users.

    2. During topic modeling I am curious how certain concepts get attributed to geographic locations. For example, a paper that equally references water, earth, wind, and fire is most likely about the classical elements but these philosophical themes exist in Greek, Hindu, Buddhist, and Chinese thinking and are not necessarily rigid. In this case would it be hard to attribute a geographic quality to a theme that has many interpretations and settings?

    3. The idea of using large data sets to do topical searches was really fascinating to me, but I wish the authors had gone into slightly more detail about how sets are compiled and arranged to be searchable by the computer. Is there anything preventing a user with access to hardware from using a text dump of Wikipedia to do topical searches, or does the text have to be formatted in a way that lets it be taken advantage of?

  7. 1. The project discussed in the paper is really amazing. However, I'm wondering whether the people who conducted the project have thought about language issues. There are multiple languages across the surface of the whole earth; how could they handle geographic information described in various languages?

    2. How could TextGrounder meet the real needs of archivists or other information professionals? I mean, after it points out a bunch of locations and times in a given document, what can we do with them?

    3. I visited the web site of this project and viewed its source code on GitHub just now. It seems that not many people other than the project's researchers talk about or keep an eye on it. How could we evaluate the value of the project for the public?

  8. 1. I am wondering if we can talk in class more about how TextGrounder was able to display words onto the Google Earth map. I can understand the use of the gazetteer to find pre-entered locations, but how are the non-location names added to the map? If a noun is frequently used in the text near a known location, will that word be mapped near it? (A rough sketch of that idea appears at the end of this comment.) There are also known Italian cities that aren't mapped where they should be. Does the gazetteer not give the location for these, or are the locations being approximated based on their relation to the surrounding words?

    2. I am wondering if a geoparsing system would be able to incorporate additional metadata about a book to help it in its analysis of geographical names. For example, if the system knew that a story was from biblical times, then it would know that the names of cities would mean the Middle East and not Texas.

    3. Would it be possible for this type of geoparsing to be used by agencies like the NSA to analyze large bodies of communications from terrorist groups, to help identify their possible locations?
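    On question 1: one plausible reading is that non-location words get attached to whichever resolved toponym they most often occur near. Below is a toy sketch of that idea, with an invented sentence and a hypothetical `resolved` lookup table standing in for a gazetteer; this is not the authors' actual model.

```python
from collections import Counter, defaultdict

# Hypothetical: toponyms already resolved to coordinates by a gazetteer.
resolved = {"venice": (45.44, 12.34), "rome": (41.89, 12.49)}

tokens = "the gondola glided through venice past the old basilica".split()
window = 3  # words on either side of a toponym

word_counts = defaultdict(Counter)
for i, tok in enumerate(tokens):
    if tok in resolved:
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for ctx in context:
            word_counts[tok][ctx] += 1

# Attach each toponym's most frequent context words near its coordinates.
for place, counts in word_counts.items():
    print(place, resolved[place], counts.most_common(3))
```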

  9. The goal of TextGrounder opens up an intriguing range of possibilities for humanities scholars, especially in working with ancient texts. Assuming various texts, words, and even expressions can be tied to geographical locations, would a possibility for TextGrounder be to identify "lost" or renamed locations?

    On page 335, examples of words associated with particular regions are presented, and there is a reference to anomalous word associations. While the example of "louvre" being attributed to Austria is indeed very anomalous, how would ambiguous associations, not as far off as the louvre example seems to be but odd nonetheless, be dealt with once generated and looked into?

    Since the entire process lacks a gold standard of annotations, as noted on page 336, the visual output of the process provides an easy way to view errors. Are the associations that are tagged to geographical locations and viewed as correct therefore the gold standard moving forward?

    Replies
    1. 1 - I was wondering the same thing - it seems like the evolution of cities/empires could pose an interesting challenge to TextGrounder, particularly if we think about locations that scholars may be in disagreement on. I wonder if TextGrounder has the ability to examine the date a text was created and then compare it to world geography at the time in order to come up with a more accurate location, or if it could eventually be expanded to add geographical context (e.g. "At this point in time, The Byzantine Empire controlled most of this area.."), which would make it an excellent teaching tool.

  10. 1. This article talks a lot about accurately recognizing the name of a place in books. What I'm more interested in is how a name written in a book is correctly determined to be a physical location and not just a name. I understand that there are a number of clues indicating that a name refers to a place, but given how much language has changed since words were first written down, and how it continues to evolve, this seems like an impossible process to automate accurately.

    2. Would it be possible to use something like Google Goggles to translate the visual into words and then use software like the toponym recognition software to find accurate, detailed information in a picture? I'm thinking about being able to describe the scene and what is in it, as well as potentially find the location where the image was taken.

    3. I'm sure that there are more useful and academically relevant uses for this but what first came to mind was using the geolocational data in a publicly accessible database wherein anyone could enter a location and get results back in paragraph form of all the mentions of that place in the processed data. Other obvious uses would be to search locations in a specific book or to reverse the search and provide a navigable globe a la Google Earth with the locations from the most popular texts displayed.

    Replies
    1. 2 - I am also interested in whether we can use image information to recognize toponyms, just as we extract text to locate locations. Since there are ways of using a text's metadata to identify locations, maybe we can figure out a way to use a picture's metadata, which may carry information beyond the pixels, to estimate the location.
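    A minimal sketch of the picture-metadata idea, assuming a JPEG that carries EXIF GPS tags; the file name is hypothetical, and many photos will have no GPS data at all.

```python
# Read GPS coordinates from a photo's EXIF metadata with Pillow.
from PIL import Image
from PIL.ExifTags import GPSTAGS

exif = Image.open("photo.jpg")._getexif() or {}   # hypothetical JPEG file
gps_raw = exif.get(34853, {})                     # 34853 is the GPSInfo tag
gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

def to_degrees(dms, ref):
    # EXIF stores (degrees, minutes, seconds); values may be rational pairs
    # in older Pillow versions or IFDRational objects in newer ones.
    d, m, s = [v[0] / v[1] if isinstance(v, tuple) else float(v) for v in dms]
    sign = -1 if ref in ("S", "W") else 1
    return sign * (d + m / 60 + s / 3600)

if gps:
    lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    print(lat, lon)
```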

  11. 1. As TextGrounder's mission is to link geography and text, I can't help but wonder about the role of language. Even when using TextGrounder as an English language tool, to what extent do tags and topic labels in other languages impact the way TextGrounder collects and provides information?
    2. The authors suggest that TextGrounder will be of great benefit to research in the humanities (338) and to literary scholars (336). With that, I'm not sure I understand the demonstrated need for such a resource. Are there currently ways in which this kind of geobrowsing research is performed that are ineffective?
    3. I'm interested to know what the biggest challenges have been in developing TextGrounder. Moreover, how have the researchers managed the large amount of data that is being used to conduct the research for this project?

    Replies
    1. 2 - I think I could see the use in, for example, finding out which artworks were created in geographic and temporal proximity to each other, or procuring a list of novels set in a particular area, or locating letters written by residents of a given neighborhood during a disease outbreak. Information along these lines can be procured by other means, but it relies on e.g. previous archival research and provenance records, and is probably incomplete. It would be really cool to be able to retrieve aggregated information by just clicking on a city, though I'm not sure how easy that would be to implement in practice.

  12. 1) Though Google Books was fairly incidental to this reading, it was amusing (and sadly unsurprising) to learn that metadata isn’t the only place where Google Books is a trainwreck. Given the sophisticated technologies at their disposal, why did Google seemingly take a “good enough is good enough” approach to its automated data analysis for Google Books?

    2) It’s impressive the degree to which computers can “learn” now rather than being bound to less-adaptable rule-based algorithms. True artificial intelligence is a ways off still, but it seems like natural language processing has come far closer than other technologies. Not coming from a technical background, I’m curious—to what degree are there hardware differences between computers that adhere to rules and computers that “learn,” and how much of the difference is instead based on software?

    3) The continued incidence of errors in natural language processing brings me back to a question I posed a few weeks ago—is it possible to teach a computer program to recognize when it’s wrong or having difficulties, in order to flag those specific instances for human review? The ideal is of course a fully automated system with perfect comprehension and accuracy, but in the meantime, could we teach the technology to ask for help when it needs it?
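    A minimal sketch of the "ask for help" idea in question 3: any prediction whose confidence falls below a threshold is routed to a human review queue. The threshold, the candidate scores, and the function itself are hypothetical placeholders, not anything from the article.

```python
# Hypothetical: route low-confidence toponym resolutions to a human reviewer.
REVIEW_THRESHOLD = 0.80

def resolve_or_flag(toponym, candidates, review_queue):
    """candidates: list of (location, model_probability) pairs."""
    best_location, confidence = max(candidates, key=lambda c: c[1])
    if confidence < REVIEW_THRESHOLD:
        review_queue.append((toponym, candidates))   # ask a human
        return None
    return best_location

queue = []
print(resolve_or_flag("Paris", [("Paris, France", 0.55), ("Paris, TX", 0.45)], queue))
print(queue)   # the ambiguous case lands here for review
```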

  13. 1. On pages 329 and 330, the authors discuss their desire to apply methods of grounding to tasks such as word-sense disambiguation and textual entailment, giving the example of differentiating between the meanings of the word 'wireless' given a certain historical time. I think this is an interesting concept and would be curious to know how far down the road this sort of textual analysis is.

    2. The authors mention the highly accurate output of TextGrounder in analyzing contemporary English text, but go on to say that, given more temporally distant English text, the results may not be so accurate. This is due, apparently, to a scarcity of annotated historical corpora. Is an annotated historical corpus something worth investing in, or are researchers more concerned with contemporary text? Is it worth it to create such a corpus?

    3. Is the gazetteer used as a foundation for all of these models? As the readings seem to prove, it's ineffective when used alone, but it seems that it is still a significant aspect of a successful model, and is useful for more straightforward toponym identification.

  14. 1. Is this TextGrounder technology only capable of displaying modern texts and their corresponding places, or can it also identify where something took place when borders were different or changing throughout history?

    2. How does TextGrounder account for countries or cities whose names have been changed? Does it also account for names that are written in other alphabetic scripts? Ex.) Ho Chi Minh City was formerly Saigon, and Mumbai was formerly known as Bombay.

    3. In looking at page 335, Figure 2, "Word Distributions from the RTM model in Google Earth," I am wondering whether it is possible to filter on one word and see which places have the highest use of that word or concept. Also, are TextGrounder's technology and data affected by Google Maps updates? Will it ever be possible to create a real-time map that shows which words are "trending" in certain places?

  15. 1. Since all web-facing textual resources will be parsed for geographical content, I wonder if image (maybe only photo) resources will complete this process earlier than textual ones. I believe that textual resources are always much more complicated than we expect. However, almost every picture we take every day is tagged with a location and date. The geographic information in pictures is much more accurate than that in text.

    2. The authors mention the problem that places with the same or similar names are easily mislabeled. However, it is not only similar place names that will confuse the computation. On one hand, since there are so many places and millions of place names in the world, it might be hard for a computer to recognize all of them, especially when one place has several different names. On the other hand, people like to use metaphors when they describe a place. For example, my hometown used to be called "East Chicago," but it has no relationship with Chicago. How can a computer know that "East Chicago" is not Chicago at all? (A toy example of this ambiguity appears at the end of this comment.)

    3. It is amazing that machine-learning methods, such as Latent Semantic Analysis, can extract latent semantic information from the deep connections between text and meaning. However, if no human can easily match the information after it has been found by these methods, will humans check it? What will we do if the latent semantic information is still confusing to us?
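    On question 2: a toy gazetteer lookup makes the ambiguity concrete. The entries and matching rules below are invented for illustration; a naive substring match happily ties "East Chicago" to Chicago, while an exact match keeps the two places apart.

```python
# Toy gazetteer: one surface name can point to several distinct places.
gazetteer = {
    "chicago": [("Chicago, Illinois, US", 41.88, -87.63)],
    "east chicago": [("East Chicago, Indiana, US", 41.64, -87.45)],
}

def naive_lookup(name):
    # A substring match wrongly links "East Chicago" to Chicago as well...
    return [p for key, places in gazetteer.items() if key in name.lower()
            for p in places]

def exact_lookup(name):
    # ...whereas an exact match returns only the intended place.
    return gazetteer.get(name.lower(), [])

print(naive_lookup("East Chicago"))   # returns both places
print(exact_lookup("East Chicago"))   # returns only East Chicago, Indiana
```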

  16. 1. Elliott and Gillies provide a compelling outline of the dangers and difficulties of relying on companies like Google for these kinds of research tools. But it seems that the authors did not give the reasons that explain these dangers and difficulties. Is it privacy and security that bring the danger?
    2. The toponym resolution system might encode an assumption that repeated occurrences of a toponym in a document refer to a single location, or that the places referred to in a document are likely to be near each other. This assumption is also referred to in the paper by Travis Brown et al., which uses the example of Dallas and Paris.
    3. It seems that all the solutions are based on English. How could they deal with multiple languages? Sometimes an author may use the local language for a geographic name instead of English. Also, there might be different transliterations of one location, like 'Beijing' and 'Peking'.

  17. 1. We digitized much of our practice of geography long before now, with GIS systems proliferating and mapping the world at ever-clearer resolution. Being able to easily reference place-names from a corpus of literature onto a map would be of great interest to humanities scholars and others, but the task of toponym identification is difficult and far from complete. Is this likely to continue to receive support?

    2. Alternate mapping methods mentioned such as Moretti’s seem to provide additional links in data that may not otherwise be apparent. To what degree are these methods compatible or available to compare results for a scholar studying a particular corpus?

    3. The non-designated topic-ranking methods sound very interesting, but still constrained with difficulties in matching places to purposes, particularly given flaws such as weighting results in North America more heavily due to the location of many scholars. Is truly automating a large-scale study of toponyms possible, or must it be restricted to relatively related literature such as the PCL travel corpus?

  18. 1. Does this research only train models using English? What about toponyms that borrow names from different languages? I think this would be a problem in such a diverse country.

    2. The authors use fiction and other texts from the nineteenth century. Do they consider the problem that some places have changed names or had territorial changes since the nineteenth century?

  19. 1. 'The Perseus maps use the same Google Maps interface as Google Books, but the toponym resolution is generally of a markedly higher quality. It is still far from perfect, however.' Can we find a perfect solution? It seems too complicated. And I think Google Maps is really good.

    2. 'The project is dedicated to improving computational analyses of natural language texts and developing tools to assist in cross-cultural information exchange.' It seems that these two parts are the most difficult sections of toponym resolution. I wonder whether we could use a logic language to do toponym resolution. How could the project assist in cross-cultural information exchange?

    3. In the conclusion, the authors mention that the project depends on extremely large datasets and excellent hardware. So what other disadvantages does this method have? What can we do to improve it in the future?

  20. 1. The authors talk about the difficulty of selecting the correct location for a given toponym and how the vocabulary of the document can give clear clues as to the distribution of its geographical references. If 'washington' is associated with the USA in the text, it becomes easy to parse the location (a toy version of this idea appears at the end of this comment). To what extent can the vocabulary of a document be used to correctly identify a location for a toponym?
    2. The TextGrounder system performs a lightweight form of grounding, linking computational representations of words to properties of the real world. What does the term 'lightweight' mean in this context? Does it imply that the grounding happens superficially and not with full text processing?
    3. What is the downside of creating such hardware for natural language processing? How can hardware be 100 percent efficient? What are the trade-offs in this project?
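    On question 1: one simple way document vocabulary can pick a location is to score each candidate by how many of its typical context words appear in the text. The candidates and word lists below are invented for illustration and are not the authors' model.

```python
# Hypothetical context-word lists for two candidate readings of "washington".
candidates = {
    "Washington, D.C.": {"congress", "potomac", "capitol", "president"},
    "Washington (state)": {"seattle", "puget", "rainier", "pacific"},
}

document = "the president addressed congress near the potomac in washington".split()

def disambiguate(doc_tokens, candidates):
    vocab = set(doc_tokens)
    # Score each candidate by the overlap between its context words and the text.
    scores = {place: len(words & vocab) for place, words in candidates.items()}
    return max(scores, key=scores.get), scores

print(disambiguate(document, candidates))
# ('Washington, D.C.', {'Washington, D.C.': 3, 'Washington (state)': 0})
```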

  21. 1) How are the problems facing curatorially linked text and geography vs. traditional archived materials similar? It seems to me that, in both cases, the work done "presents the perspective of experts" but limits the amount of work that can ever be completed so-

    2) Is a potential loss in quality of data worth a much higher quantity of data?

    3) How does this different model for digital geography pertain to other digital projects in the humanities?

  22. 1. Are nicknames included in this location recognition? For example, "Big Apple"? How would it recognize that the text is referring to NYC instead of a large apple, the fruit? (A toy alias lookup appears at the end of this comment.) We would also know that the text is from after the 1920s, since the phrase was first introduced then, and more likely after 1970, when it was first used as the slogan for tourism in NYC.

    2. All of the terms in Figure 1 are single word entries - what happens when you start looking for word combinations? Would that be more helpful (narrowing connections) or make things more complicated?

    3. "Possible problem with model - treat all entries as equal" p.335 So would you have to weigh your model depending on the corpus being used? So PCL Travel corpus might be weighted for Texas region, but a Biblical corpus would be the weight for Middle East region?

  23. 1. In this article, the authors introduce a computational way to link text and geographic information to enhance the accuracy of toponyms in location search engines. However, I am wondering about the point of doing this. There is no doubt that different locations may share the same toponym, but usually the number is so small that they could be presented in a short list of candidates for users to pick from conveniently. So is it worth the effort to improve this function?

    2. On p. 331, they claim that "One common approach, for example, is to give priority to the more populous of the candidate locations for a toponym while simultaneously trying to minimize the physical distance between different possible locations for multiple toponyms in a single document." What is their basis for using this approach? And how do they balance distance against population? (A crude sketch of such a heuristic appears at the end of this comment.)

    3. I am also curious about the structure of the recorded data. Would it be better to use a hierarchy like "Austin-Texas-United States," or to separate them into individual locations such as "Austin," "Texas," and "United States"?
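    On question 2: a crude version of the population-plus-proximity heuristic scores each candidate by a weighted combination of its population and its distance from other toponyms mentioned in the same document. The weights, populations, and coordinates below are illustrative assumptions, not the authors' actual numbers.

```python
import math

# Candidates for the toponym "Paris": (name, population, latitude, longitude).
candidates = [
    ("Paris, France", 2_100_000, 48.86, 2.35),
    ("Paris, Texas",     25_000, 33.66, -95.56),
]
# Another toponym already resolved in the same document, e.g. "Dallas".
anchor = (32.78, -96.80)

def distance_km(a, b):
    # Great-circle (haversine) distance between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def score(pop, lat, lon, w_pop=1.0, w_dist=2.0):
    # Favour populous places but penalize distance from co-mentioned toponyms.
    return w_pop * math.log10(pop) - w_dist * math.log10(1 + distance_km((lat, lon), anchor))

best = max(candidates, key=lambda c: score(c[1], c[2], c[3]))
print(best[0])  # with these weights, "Paris, Texas" wins because Dallas is nearby
```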

  24. 1. “There are machine-learning methods for extracting latent semantic information—deeper connections between text and meaning—that no human could easily match.” (pg 331) This stuff is totally foreign to me. Is this a linguistics exercise? Is this what people do in linguistics? What are these deeper connections between text and meaning that a machine can perceive, but a human cannot?

    2. I am admittedly not very familiar with contemporary scholarly studies of literature, but I fail to understand how having a machine map out all the places mentioned in a book is going to be particularly useful. I mean, I believe it could be, but I think an example of how someone uses this information would help clarify this. Or even an idea of what the “more sophisticated geographical information” that these “academics and specialist communities” will be annotating texts with would be nice.

    3. “We are particularly interested in acquiring computational models of word meaning that are grounded, in the sense that they link natural language expressions to measurable properties of the real world.” (pg 329) Somehow I feel like when these authors say “word meaning” they mean something different than when I say “word meaning.” I think I generally mean it in a non-computational, non-grounded sense (presumably). How can I reconcile these different understandings of “word meaning”? (This is all just to say: what, exactly, are the authors talking about here?)

  25. 1. I noticed that in all 3 of these articles the use of gazetteers is stressed as a key component of the study. Are there no other types of reference available that can be considered useful or accurate enough for this kind of work? Especially since (at least in the case of the United States gazetteer) it's all based on census data, which is only updated every 10 years. A lot can change in that amount of time.

    2. While I really love the idea proposed on page 337 about an interface that can connect a reader to a wealth of information spatially or topically related to a novel, I also have to ask whether that means they plan on refining the toponym results to stack in some way, i.e., would it be possible to link toponyms like "Topeka", "prairie" and "Free-Stater" into something like "Topeka AND prairie AND Free-Stater" to further refine the results?

    3. The authors mention that training the topic model on the entirety of the PCL Travel corpus produced "recognizably better output than training it on half or three-quarters of the corpus" and then state that "using twice as much data should give a similar degree of improvement." I found this kind of puzzling, because it sounds like they're saying plugging 20 million words into this model, 200%, will only give a degree of improvement on par with using the full 100% instead of only 50-75%. How does that make sense, and wouldn't they have some way of gauging the improved accuracy?

  26. 1) Geographic names or places change depending on the time period, the language used, or political affiliation of the person or institution citing the location. How does the system account for this? Is it dependent on the gazetteer used?
    2) On page 336 the authors state that TextGrounder continues to learn as training data is added, as opposed to other machine-learning techniques for natural language processing where the benefit levels off fairly quickly. Why is TextGrounder different? I don’t know much about this area of research, but will future machine-learning techniques incorporate the model that TextGrounder uses or is TextGrounder somehow unique?
    3) On page 330 the authors state that language analysis models were trained on newswire from the 1980s and '90s. Would it be useful to have different language analysis models for different genres, biographies vs. fantasy, or different time periods? Is it possible to combine different time periods, genres, or languages into one language analysis model without muddling the results?

  27. 1. In this article the authors talk about different types of toponym resolution systems that assign place names to geographical coordinates. They point out that even the most state-of-the-art systems developed at universities, like the Perseus Project at Tufts University, will occasionally produce errors where a place is labeled as being at one set of coordinates when in fact it is at another. Do you think that scholars should use these systems to help them conduct analysis when an error like this could lead to a false relationship between two places, which could precipitate a false conclusion?
    2. In this article the authors discuss the use of the TextGrounder program, which links texts to specific locations and points in time. However, one of the documents they discuss as a test collection for this program is an interview with Walt Whitman about his various travels. How would TextGrounder link a document like this, a history written in the past, to a specific time? Would it link it to the time it was written, to preserve the linguistic information it contains, or to the time the book was written about, or perhaps both?
    3. In this article the authors discuss Latent Dirichlet Allocation, or LDA, as an example of a method of analysis that could be translated for use in toponym resolution. To show how this method works, they give a table of eight topics found in the PCL Travel corpus and the top fifteen words related to each topic in descending order. The results showed that this data was highly subjective, as many of the words were labeled as related to a topic only because those relations occurred in the documents. Would this lead to a need to calibrate the method for each set of documents it is used on, as it would otherwise be a form of bias? Would there be a way to create a dataset to train this algorithm on that would be largely objective?
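    A minimal sketch of the LDA step described in question 3, using scikit-learn on a few invented snippets rather than the PCL Travel corpus; the number of topics and of top words shown are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for documents from a travel corpus.
docs = [
    "the ship sailed from the harbor past the lighthouse",
    "camels crossed the desert toward the oasis and the dunes",
    "the harbor ship anchored near the port before the voyage",
    "sand and dunes stretched across the desert horizon",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[::-1][:5]]  # top 5 words per topic
    print(f"topic {i}: {', '.join(top_words)}")
```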

  28. 1. The notion of literary mapping as described by Franco Moretti was difficult to understand. Why did the authors refer to his text, and how does it apply to toponym resolution?

  29. 1. The authors mention that there are other ways of connecting language to the real world, such as by extracting public opinion trends from Twitter posts, predicting movie revenues from movie reviews, or creating three-dimensional images from textual descriptions. How do we deal with biases from people (I mean, different perspectives from different people) when doing the kind of connecting work they discuss?

    2. Can the TextGrounder system be extended to support multilingual functionality? How would the data model need to change to achieve that?

  30. 1. I'd be incredibly interested in reading a study in which TextGrounder was used to explore temporal rather than geospatial relationships. The article mentions that this can be done (the tracing of "wireless" from its former meanings to current usage); I just think it'd be really cool to read.

    2. What kind of applications might the MALLET toolkit (pp. 332-333) have in examinations of the ASR transcripts of Holocaust survivor interviews we read about for the Information Retrieval lecture? It seems that, were that data to be refined a bit, something along the lines of MALLET could generate groupings for potential topics.

    3. I find it really interesting that there doesn't seem to exist this "leveling off" phenomenon when adding training data to TextGrounder. I wonder if this is because of the (relatively) small size of the corpus being examined--the PCL Travel texts--or if this might be something substantially different.
