Thursday, October 31, 2013

11-7 Mike Speriosu and Jason Baldridge. Text-Driven Toponym Resolution using Indirect Supervision

36 comments:

  1. 1 - Are there risks involved in this sort of contextual mapping, or is this mostly applied to news articles and works of fiction? Mapping can be a particularly dangerous game, in particular when dealing with indigenous land claims and mining rights, so I'm curious if this will eventually be moved into a legal context, which could get dicey for some communities.

    2 - What are the future applications for this work in spaces like oral traditions or testimony? Will we eventually develop speech recognition that's sophisticated enough to merge with this work and create a toponymy of oral histories? It would be interesting to see this work mixed with, say, Americo Paredes' archive of corrido traditions, currently housed at the Briscoe.

    ReplyDelete
  2. 1. In the conclusion of this article, the authors note that the text-driven resolvers in conjunction with the minimum toponym resolvers work well in relatively restricted contexts. Does the efficacy of the text-driven resolvers noticeably diminish on larger and more wide-ranging collections? What is being done to address this (possible) issue and test the scalability of the program?

    2. What are the possibilities of this in an intelligence gathering capacity? Could this be used (I am assuming it probably already is) to map and understand the intentions of people, countries, etc. being monitored by intelligence agencies? Could it be used to help interpret code names of geographic locations?

    3. Seeing that many of the relationships rely on metadata about locations (p. 1467), and given last week's discussion on metadata, is it really wise to rely on something as insubstantial and flighty as metadata to form spatial relationships?

    ReplyDelete
  3. 1. The authors mention that GeoNames gives the locations of regional items like states, provinces, and countries as single points. Thus many words associated with the USA are connected to a point in Kansas. I wonder why those points connect to Kansas, and what determines the links.

    2. Since SPIDER runs 10 iterations, why is its runtime only quadratic in the size of the documents? Furthermore, would breaking up the document affect how SPIDER performs its iterations to calculate the weight information?

    3. The authors say that the population resolver, which selects the location with the greatest population for each toponym, is generally quite effective. Even setting aside the problem of missing population information, I still don’t see why it is effective. What is the correlation between population and toponyms?

    ReplyDelete
    Replies
    1. As for your third question, I think they might use the different populations to identify certain places. For example, if there are two "Paris" toponyms in an article, they can use the different populations to differentiate them. This is my understanding of it.

      Delete
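The population baseline discussed above is simple enough to sketch in a few lines. This is a hypothetical illustration with made-up gazetteer entries, not the paper's actual implementation or its GeoNames data:

```python
# Minimal sketch of a population-baseline toponym resolver.
# The gazetteer entries below are hypothetical illustration data.
GAZETTEER = {
    "Paris": [
        {"name": "Paris, France", "population": 2_100_000},
        {"name": "Paris, Texas", "population": 25_000},
    ],
}

def resolve_by_population(toponym):
    """Pick the candidate location with the greatest population."""
    candidates = GAZETTEER.get(toponym, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])["name"]
```

The intuition behind the baseline is that, absent other evidence, a toponym more often refers to its most populous bearer, which is why it breaks down on historical corpora where past populations differed.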
  4. 1. This article seems to place more emphasis on human text analysis than the article by Brown et al. In this article, the authors state regarding their test corpora, “We need corpora with toponyms identified and resolved by human annotators for evaluation” (1468). I assume the authors are using the human-annotated work to compare with the results of their new toponym resolution method. However, is their goal to move away from the need for human annotation altogether? Brown et al. used text that had not been annotated – why didn’t the authors of this article?

    2. When discussing document size, the authors claim that the larger the text, the more time it takes to process. They specify that for SPIDER, TRIPDL, and TRAWL, breaking up documents is necessary. For the purposes of their study, the authors divided their test corpus “into small subdocuments of at most 20 sentences” (1471). How long does it take to divide the text up, and is it worth it? Furthermore, will future researchers need to divide up text in order to use these new systems, and will they consider the time spent dividing up a corpus justified? Will there be a way to process larger bodies of work in a shorter amount of time as this system develops?


    3. The authors included an Error Analysis section in their article. One of the biggest error-causing toponyms reported was Washington, as there are two widely known locations bearing that name in the U.S. (1473). How will future versions of this system deal with these errors? Will humans always be required to make the final judgment call, so to speak, regarding what location a text is actually talking about? If so, doesn’t that defeat the goal of these systems?

    ReplyDelete
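The subdocument division the comment above asks about is mechanically trivial, which suggests the splitting step itself costs almost nothing compared to resolution. A minimal sketch, assuming sentences have already been segmented (the paper does not publish its chunking code):

```python
def split_into_subdocuments(sentences, max_sents=20):
    """Break one document's sentence list into consecutive chunks
    of at most max_sents sentences each."""
    return [sentences[i:i + max_sents]
            for i in range(0, len(sentences), max_sents)]
```

Under this scheme the chunks are purely positional, so any context linking toponyms across a chunk boundary is lost, which is the trade-off several comments here raise.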
  5. 1. Thinking about last week’s reading, I’m really curious to know if anyone has developed a metadata strategy for including geographic reference data inside their texts (much like hyperlinking). I often wonder if children of this generation will learn to apply good metadata to the documents they produce, or whether these things will show up in the next iteration of The Elements of Style. What types of smart processing could we invent to aid this process?

    2. In the article the authors mention that the system oftentimes brings up false positives for certain geographic locations such as Washington. With so much variation in the text, how does a systems designer deal with these problems? Do the systems contain an additional rule for commonly mistaken locations, or is a failure an indication of bad system design?

    3. I’m curious if the markup that was contained in the data that researchers harvested could have helped to further differentiate the locations and origins of topics. It would be interesting to use this data in conjunction with analytic data to see where viewership of a page is taking place in order to derive some additional context about who the piece was written for and where they are likely referencing in their article.

    ReplyDelete
    Replies
    1. 1 - I don't know if such technologies already exist, but what I'd envision for that would be a word processor extension that recognizes possible toponyms and gives the user a list of locations to choose from, from most to least likely. This could easily become as annoying as Clippy if improperly deployed, but it's a possibility, and the location suggestions on Google Maps are proof that it wouldn't be impossible to implement.

      Delete
  6. 1. First of all, I'm thinking of the paper we read in the previous week. For those articles on the Web, is it possible to automatically recognize and tag the texts with geographic information as metadata? If so, there's one more dimension for search engines and those who retrieve information on the Web.

    2. Toward the end of the paper, the authors mention, "This strategy works particularly well when predicting toponyms on a corpus with relatively restricted geographic extents." In other words, they admit the use of such a strategy is limited. If so, do the authors have any ideas for addressing the issue?

    3. In the paper, the authors state, "This strategy works particularly well when predicting toponyms on a corpus with relatively restricted geographic extents." This is also what I want to ask about. It's normal for two locations to share the same name, even within one country. How does the resolver mentioned in the paper handle this case?

    ReplyDelete
  7. 1 Considering future applications of this work, what about rural places on Earth that we have not yet explored? What about places that no books have recorded? And how do we decide the breadth and depth of the selection of books?

    2 The authors state that many researchers identify toponyms by using metadata. How do they decide the pinpoint locations on a map from that metadata? With what degree of accuracy do they build their maps?

    ReplyDelete
  8. 1. I was surprised to see Gaza as the second most inaccurately judged toponym (Table 4). Until I looked in Wikipedia, I was not aware of cities on other continents that were named Gaza. I am still surprised by the inaccuracy, though, as I don’t think it would be as ambiguous or as large an area as Washington or California. Why do you think Gaza was a very inaccurate toponym?

    2. The article mentions that population-based toponym decisions do not work well for historical corpora. Would it be possible to build into a gazetteer a way to calculate the approximate population of a certain city at the particular time in history the body of text is from?

    ReplyDelete
  9. WISTR has a range of 10km on the toponyms it identifies. With the size of metropolitan areas and documents which may mention locations at a range larger than 10km, why was 10km chosen as the limit of the range? Or does the system have an overlapping area which covers the widespread area of a city beyond the 10km limit on toponyms?

    In the error analysis section, Washington was listed as having the most errors attributed to it. In the case of the Israeli Ambassador, a possible solution involving contextual training is brought up. How might contextual training actually take place? Would it just involve assigning specific words related directly to a topic leading to a toponym or some other method?

    The California-related error is really interesting despite the general cues of location in the section pointing to the East Coast. Is this another matter of contextual misunderstanding? Or is it a different underlying problem that led to the error?

    ReplyDelete
    Replies
    1. A lot of the difficulties encountered in this research seem to be similar to those encountered in previous readings, e.g. automated metadata extraction. It seems to be very difficult to teach computers how to read and understand lexical information. I wonder how much of this is due to computers being inherently algorithmic/mathematical, versus the nearly infinite complexity of human language. It does seem, though, that it should at least be possible to teach a computer to "read" enough scholarly documents that it automatically forms conceptual associations with words (like national capitals with ambassadors).

      Delete
  10. 1. "We choose a metric that instead measures the distance between the correct and predicted location for each toponym and compute the mean and median of all such error distances." Wouldn't this method of measurement allow the predicted location and the actual location to be very far apart just as easily as very close together? Both answers are wrong, and I'm not sure I understand why the distance makes a difference as to the severity of the error.

    2. "This strategy works particularly well when predicting toponyms on a corpus with relatively restricted geographic extents." This seems like an obvious conclusion and one that perhaps highlights a weakness of the system. I understand the concept of spatial minimality, but wouldn't the goal of a system like this be to be usable in many different contexts?

    3. "...we simply divide each book in the CWAR corpus into small sub-documents of at most 20 sentences." When using the context of the surrounding text to accurately identify toponyms, does this breaking up of documents into smaller documents affect the system's ability to correctly match locations and toponyms in any way?

    ReplyDelete
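The distance-based metric quoted above can be made concrete with the haversine formula for great-circle distance, then taking the mean and median over all (gold, predicted) pairs. This is an illustrative sketch of such a metric, not the authors' evaluation code:

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0

def great_circle_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points (haversine)."""
    (lat1, lon1), (lat2, lon2) = p1, p2
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def error_stats(pairs):
    """Mean and median error distance over (gold, predicted) location pairs."""
    dists = [great_circle_km(gold, pred) for gold, pred in pairs]
    return mean(dists), median(dists)
```

One motivation for a distance metric over plain accuracy is that it distinguishes near misses (e.g. a suburb of the correct city) from resolutions on the wrong continent, which an exact-match score would penalize equally.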
  11. 1. I'm curious about the relationship between toponyms and metadata. I'm not sure I'm totally clear on my understanding of toponyms--are they a type of metadata? And furthermore, how exactly do toponyms and metadata coexist--are they mutually dependent, or is one element more dependent on the other?
    2. In what ways, if at all, do the authors see the practical use of this information technology outside of the academic realm?
    3. Is the complexity of having the various described toponym resolvers a result of this research being new? Or, as it develops further, do researchers anticipate that the toponym resolver technology will become more streamlined?

    ReplyDelete
    Replies
    1. 1 - Toponyms = geographical/place names, so they could theoretically be a type of metadata, but I don't think the two are directly dependent on/intertwined with one another. There are metadata content standards designed specifically for geography and geographic information, but again, I don't think it's so much that geographical terms depend upon metadata or vice versa, it's just a system of developing consistent content so that geographic information can be located more efficiently.

      Delete
  12. 1) I understand the reluctance to use polygon shapefiles because of their additional technical requirements (GIS programs being unfortunately expensive), but I do think there’s an advantage to using them—namely, their usefulness for establishing area overlaps and layering of geographic data. Does the technology exist to create geographic polygons that could be used without additional software, to facilitate greater accuracy in spatial analysis?

    2) I’m mostly familiar with information retrieval and analysis that is language-based, so it was interesting to learn that existing technologies for toponym resolution have previously not incorporated textual analysis. There are advantages to probability and spatial analysis when dealing with geographic information, but the authors’ conclusion that textual information should be integrated is sound and well-supported. How were these conclusions received at the conference where this paper was presented?

    ReplyDelete
  13. 1. Is the ultimate goal to streamline all of these approaches (TRIPDL, WISTR, TRAWL) in order to create a most effective system, or to continue using them in conjunction with one another?

    2. To clarify, the Toponym Resolution Corpora was a collection of human-annotated material created in order to establish a basis for toponym identification within a machine-readable context? Am I understanding this step of the process correctly?

    3. It seems to me, from the two readings about toponym resolution we've done, that the encouraged approach is to design a system that utilizes surrounding text in disambiguating geographic locations. I think this makes a lot of sense, but is there evidence of when this approach did not work? This reading gives the example of utilizing placement of the word 'lobster' near the toponym 'Portland' in order to determine its location as Portland, Maine, but it seems to me that 'lobster' could be irrelevant to place name.

    ReplyDelete
  14. 1. On page 1467 it says, “Our primary focus is toponym resolution, so we evaluate on toponyms identified by human annotators.” How accurate are human beings at identifying these toponyms? How much experience does someone need in this field to be a reliable toponym evaluator?

    2. It seems as if much of this technology has trouble accurately identifying toponyms due to a lack of context. I'm wondering whether it will ever be possible for this technology to function without some kind of human assistance, as identifying context in textual references is very important.

    3. Why were certain cities chosen for this study and not others? I looked up a list of the most common city names in the world and there are 328 places which share the name of San Jose, 320 that share San Antonio, and 296 that share Santa Rosa. Is it possible for this toponym technology to deal with such a very large set of data if these three cities were used?

    ReplyDelete
  15. 1. In another article, the authors eliminated using information about population because this data always changes with time. However, in this article, some resolvers are based on heuristics using spatial relationships between multiple toponyms in a document, or on metadata such as population. In my opinion, population is not reliable data when analyzing toponyms.

    2. The GEOWIKI dataset contains over one million English articles from a February 11, 2012 dump of Wikipedia. The authors divided the corpus into training (80%), development (10%), and test (10%) sets at random and performed processing. I wonder why the authors divided the corpus 80%/10%/10%. Is there any assumption or theory to support that?

    3. Toponym resolvers use a gazetteer to obtain candidate locations for each toponym. The GEONAMES gazetteer they used includes each location’s administrative level and its position in the geopolitical hierarchy of countries and states. However, the study results are given both for 19th-century texts pertaining to the American Civil War and for 20th-century newswire articles. How can the gazetteer deal with the problem if information such as a location’s administrative level has changed over time?

    ReplyDelete
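The 80/10/10 division asked about above is a common machine-learning convention rather than a theoretical requirement: most data goes to training, with two small held-out sets for tuning and final evaluation. A generic sketch of such a random split (not the authors' code; the fractions and seed are adjustable assumptions):

```python
import random

def split_corpus(docs, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a corpus and split it into train/dev/test portions."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # seeded for reproducibility
    n_train = int(len(docs) * train_frac)
    n_dev = int(len(docs) * dev_frac)
    return (docs[:n_train],
            docs[n_train:n_train + n_dev],
            docs[n_train + n_dev:])
```

Keeping development and test sets separate matters because the development set is consulted repeatedly while tuning, so only the untouched test set gives an unbiased final score.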
  16. 1. The population resolver selects the location with the greatest population for each toponym. Why does it select the location based on population? Is there a statistical relation between population and probability?

    2. Connecting to the papers we read last week, I think this is a good supplement to the automatically generated tag system. The method introduced here could greatly help improve the quality of tags generated by computer, if it could be integrated into the existing system.

    3. The authors introduced several ways to select locations for toponyms. But they do not mention that we could use relevant words to help with selection. If the Eiffel Tower appears in the text with Paris, it is more likely that the toponym refers to the capital city of France. What is the limitation of this method?

    ReplyDelete
  17. 1. The use of geographically-linked Wikipedia articles to assist in toponym identification from related text is fascinating, given the crowdsourced curation of Wikipedia. There are very few other such sources of information that are as heavily annotated by the public, but would other geo-tagged data (such as Twitter profiles) be similarly useful?

    2. The Civil War documents being researched had to be broken into much smaller sub-documents of 20 sentences at most (Speriosu and Baldridge, 1471) in order to control processing time. The results from these smaller increments seem to still show increased accuracy, but as processing speeds improve, will these smaller chunks be necessary?

    3. The conclusion states that performance is strongest when dealing with a corpus covering a relatively limited geographical area, such as the American Civil War. This seems somewhat obvious in that while European locations might be mentioned in a Civil War text, it is overwhelmingly possible to weight for the correct area. Reuters articles tend to focus on a single geographical area as well, so is it possible to extrapolate this method to a wider corpus or something like a travelogue which might mention broadly distant locations?

    ReplyDelete
  18. 1. ‘It has been estimated that at least half of the world’s stored knowledge, both printed and digital, has geographic relevance.’ How could the authors make that estimation? Is there any evidence supporting it? If it is not true, then maybe this research is meaningless.

    2. ‘Ours is the first we are aware of to use training data from a different domain to build a document geolocator that uses all words (not only toponyms) to estimate a document’s location.’ Why do they use training data from a different domain? What are the advantages and disadvantages?

    3. ‘For SPIDER, runtime is quadratic in the size of documents, so breaking up documents vastly reduces runtime.’ Are there any limitations when breaking up documents? If we do so, will it have any negative effects?

    ReplyDelete
  19. 1. We have seen the phrase "gold label" in articles this week, what does that mean exactly?

    2. Were these American modern-day newswire texts, i.e., American stories as opposed to international stories? Would we get very different results for text from England during the Victorian period? How much does the training corpus influence the judgment of automated assessment for the test corpus?

    ReplyDelete
  20. 1) How important IS the human element to identifying toponyms? This article seems to value human annotators more than the Brown article did. Is there a way to crowdsource in order to edit?

    2) Along with this human vs. machine theme in correctly identifying toponyms, our articles have discussed the many issues with current technologies that do this work, but does introducing the human element improve the process? Are human beings more accurate when identifying toponyms?

    3) Can modern authors and creators do anything to facilitate the process of identifying toponyms within their own work?

    ReplyDelete
  21. This comment has been removed by the author.

    ReplyDelete
  22. 1. The gazetteer used for the project is GeoNames, which is used to identify candidate locations for each toponym. Is it sufficient to restrict to only one source and depend on it entirely? What other gazetteers are available for this purpose?
    2. The baseline resolvers mentioned are said to produce errors. What can be done to fine-tune them to be more accurate? Why was this not taken into consideration in the research?
    3. The text-driven resolvers are said to perform well when predicting toponyms on a corpus with relatively restricted geographic extents. What is the boundary of this extent? Do they produce errors when crossing that particular restriction?

    ReplyDelete
  23. 1. In its introduction part, the authors said that "It has been estimated that at least half of the world’s stored knowledge, both printed and digital, has geographic relevance". What is geographic relevance?

    2. On p. 1469, they cited Leidner's words summarizing two general properties of toponyms: one sense per discourse and spatial minimality. They explained one sense per discourse as "multiple tokens of a toponym in the same text generally do not refer to different locations in the same text". So, my question is: do we need to count repeated toponyms once or multiple times in one publication? Does it have any influence?

    3. The authors drew some conclusions, such as "WISTR is very effective" and "on the CWar database, population performs relatively poorly", based on their analysis of documents about the American Civil War. Which aspects should we take into consideration if we hope to enhance its generality in the future?

    ReplyDelete
    Replies
    1. For number 2, I think their theory is that a toponym would only need to be counted once in each text. So if "Paris" in a text is determined by the machine to mean Paris, France, then every time Paris is repeated in that same text, you can assume it to be Paris, France and not Paris, Texas. It would probably be unlikely for an author to talk about two different Parises in the same text without clarifying which one they meant, because that would confuse the reader anyway. Because they make the assumption that only one location is being referred to, the machine is probably able to use the entire contents of the text to determine which location it refers to, not only its immediately surrounding words.

      Delete
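The one-sense-per-discourse reading in the reply above amounts to collapsing all tokens of a toponym in a document to a single resolution, for instance by majority vote over per-token guesses. A hypothetical sketch of that post-processing step (not Leidner's or the authors' implementation):

```python
from collections import Counter, defaultdict

def one_sense_per_discourse(resolutions):
    """Force every occurrence of a toponym in one document to its
    majority-vote resolution (one-sense-per-discourse heuristic).
    `resolutions` is a list of (toponym, resolved_location) per token."""
    votes = defaultdict(Counter)
    for toponym, location in resolutions:
        votes[toponym][location] += 1
    majority = {t: counts.most_common(1)[0][0] for t, counts in votes.items()}
    return [(t, majority[t]) for t, _ in resolutions]
```

This lets strong contextual evidence near one mention (e.g. "Eiffel Tower") propagate to every other mention of the same name in the document.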
  24. 1. Are there other sites aside from Wikipedia whose data could be used for toponym resolution? What about location-driven social media sites like 4square, or even Facebook or Instagram?

    2. On page 1473, the authors note that errors for the toponyms Australia and Russia were very small. Is this due both to a) a lack of cities named Australia and Russia and b) the fact that data about both of these places is (particularly for Australia) so distinct? That is, most of the context for Australia involves a lot of things unique to the continent.

    3. With these programs... is it possible that the more words and pieces of information you 'feed' them, the more refined their results become?

    ReplyDelete
  25. 1) Is it possible to use other metadata associated with the text to help determine these geographic locations more accurately? For instance, most books have a bibliographic record created by libraries that indicate general geographic areas covered in the text, if important. If a book is about a trip to Paris, France but the person happens to be from Texas, the geographic information for France should be in the record helping to disambiguate the place name. While it may not be worth it to create metadata for a project if it does not exist, using created metadata for books that are being analyzed could be useful and is freely available.
    2) On page 1473 the authors state that the errors for Australia and Russia are fairly small because of the differences in how they are represented across different gazetteers. What is the difference and how did it make an impact?
    3) The authors state on page 1474 that the lack of disambiguating context led to high error rates for American towns sharing the same name, like Jackson and Lexington. Is this because the segments of analyzed data were so small, only 20 sentences? When the authors were talking about document size on page 1471, I was confused as to whether the subdocuments for each book related to each other or influenced the results in other subdocuments of the same book.

    ReplyDelete
  26. 1. In this article the authors test several methods of toponym resolution. The ones they focus on are the text-driven resolvers. They use three of these: TRIPDL, WISTR, and TRAWL, which brings together TRIPDL, WISTR, and standard toponym resolution cues. However, in the results WISTR outperforms TRAWL in several areas. What do you think could be the reason that TRAWL is not superior to WISTR if it includes WISTR in its calculations?
    2. In this article the authors create a dataset by taking collections of REUTERS news articles and books about the Civil War and 19th-century America. However, because one of the methods they were testing takes longer the longer the document is, they had to break the books up into segments 20 sentences long. They do not state, though, how they went about breaking up the documents. Were the documents broken up according to subject, or randomly divided into subdocuments? What problems could either of these methods pose, and which do you think is best?
    3. In this article they put forth several methods for resolving toponyms. In the Brown et al. article they also put forth a couple different methods for resolving toponyms. What are the similarities and differences between the different methods that they put forth and which do you think is best for toponym resolution?

    ReplyDelete
  27. 1. The article refers to human annotators. What types of annotations do they add to the texts and at what point do they occur? How are the annotations used in training the systems?

    ReplyDelete
  28. 1. The authors mention that we need corpora with toponyms identified and resolved by human annotators for evaluation, before spending so much space on the algorithms for the different resolvers. This brought me to a thought: no matter how advanced or innovative the technology becomes, it will eventually go back to manual work done by human beings to be evaluated. Does that mean the development of technology is in any case constrained by human work? Or is that a kind of credibility crisis for technology that humans have made for themselves?

    2. When talking about the toponym resolvers, the authors point out three different kinds: heuristic resolvers, text-driven resolvers, and resolvers combining heuristic and text-driven approaches. These three overlap a bit, but among all of them, which is best for identifying the location references in Google Books?

    3. Could any of these resolvers be extended to recognize toponyms in other languages? It sometimes happens that authors use English throughout an article, but when they mention specific locations in other countries, they use the native name, like Köln, Germany.

    ReplyDelete
  29. 1. To explain why they are endeavoring to develop their disambiguation project, the authors explain, “geographic information pervades many more aspects of humanity than previously thought …Thus, there is value in connecting linguistic references to places (e.g. placenames) to formal references to places (coordinates) (Hill, 2006).” Does this strike anyone else as kind of funny? Because geographic information “pervades” “aspects of humanity”, we should spend the time and money to develop the technology to connect place names to coordinates? They couldn’t come up with anything more compelling than that?

    2. In talking about how their SPIDER resolver works, the authors say that “When there is no such weight information, such as when the toponym does not co-occur with other toponyms anywhere in the corpus, we select a candidate at random.” Is random really the only option they have? Why?

    ReplyDelete
    Replies
    1. 1. I think it depends on what you find compelling. I spent a year of undergrad learning and translating Old English, which in retrospect was a big waste of time as it has no practical or "real" relevance to anything I (or most anyone) will ever end up doing professionally. For my professor, and for a number of other classicists, that study is of the utmost importance. I'm reminded of Miranda Priestly's harangue about Andy's particularly cerulean sweater in The Devil Wears Prada--while it may seem of no consequence to Andy, an entire industry and many livelihoods hinged on something so "trivial." :)

      2. I think the authors' exposition here makes it sound more lackadaisical than it was really. Rather than "we select a candidate at random," I took this to mean "the program didn't know what the hell to do with it and picked a place."

      Delete
  30. 1. Could further iterations/implementations of these methods be coded to recognize geopolitical boundaries between nation-states? For instance, the errors encountered in rendering Australia and Russia due to gazetteer misinformation might possibly be resolved using a range of latitudinal/longitudinal coordinates rather than a single point. (I think this was addressed early in the paper--it was avoided because of the difficulties in coding. However, difficulty and impossibility are two different ballgames.)

    2. Does using (modern) Wikipedia as a contextual source cause issues in the analysis of older documents? Might the system be improved by using more "sophisticated" or academically vetted (whatever that may mean) sources of contextual information?

    ReplyDelete