Thursday, October 17, 2013

10-24 Oard et al. Building an information retrieval test collection for spontaneous conversational speech

32 comments:

  1. 1. When describing the interviews used in their test collection, the authors state that the subjects of the interviews “were asked to complete a detailed questionnaire a week or so before their interview” (3). However, while I understand that this study used search-guided assessment, does the fact that the interviewees knew what questions they were going to be asked take away from the spontaneity of the speech? Isn’t spontaneous conversation less planned?

    2. The authors devised five categories to determine relevance, listed on page 4 of the article. All of these categories have to do with providing evidence, context, or a source. These are certainly important relevancy indicators, but should there be more? Were there only five categories in order to limit the scope of the study and make the information more manageable?

    3. The authors discuss the future of ASR and IR systems, noting that more development and much more study are needed (7-8). Is the ultimate goal of systems such as the one discussed in the article to be able to search recorded speech the same way we search Google? If so, is the cost (initially $2,000 per interview for this study) worth it? Would costs go down with more development?

  2. 1. I'm just really interested in the potential for non-written forms of communication and information in the future, and this article made me even more interested. Writing, it seems, will not have its privileged position as the most "authoritative" form of communication and record keeping for much longer. I'm increasingly fascinated by the possibilities this revolution holds for both entrenching new types of authority based on non-written records and creating new opportunities for democracy based on the very same technology.

    2. This article notes that each interview in the study of the Survivors of the Shoah Visual History Foundation cost $2,000 (including a "trained interviewer" and "professional videographer"). Further processing (relevance assessment, digitization) was done by graduate students; it is unclear whether these were volunteers or paid research assistants. Indexing and summaries cost an additional $2,000. The article admits that it was therefore "unaffordable to scale this process up to the entire collection, thus motivating research on techniques that would minimize the required human indexing effort while still providing adequate support for subsequent access to these materials." What, if any, further research has come out of this study? I want to read a further study on the financial feasibility of widespread application of these methods and technologies.

    3. The relevance categories applied to the resulting records were based on "the notion of evidence" and numbered five: direct evidence; indirect evidence; context; basis for comparison; pointer to a source of information. How might these categories change for other projects, involving other records, interviewees, users, contexts, or communities? What categories might come out of a group of archivists that values "potential emotional impact" or "appropriateness to an audience," two notions rejected by the formulators of this project and study?

    Replies
    1. 3. This is an interesting question, especially given that some non-written materials might have, e.g., "potential emotional impact" as their *primary* informational value. (I am thinking of various art forms here, but there are likely other examples.) How would it be possible to design an information retrieval system to account for emotional or aesthetic value, especially when this value can be so subjective?

  3. 1) Seeing as Automated Speech Recognition still has trouble with words it isn’t familiar with, like personal names or obscure toponyms, it is obviously important to continue refining the technology. Having worked with OCR in the past (a technology that is similar in concept in many ways), I wonder if it might also be a good idea to figure out how to teach these technologies to recognize when they *don’t* “understand” something, and to automatically flag that audio segment for manual review by a human (something like the sketch after this comment). The technology would still cut down drastically on manual transcription, but a misunderstanding-recognition mechanism would prevent problems like those mentioned in the article, in which some search queries failed to retrieve highly relevant information because the ASR program merely took a guess rather than flagging the confusing segment for review.

    2) I was interested in the problems raised for ASR algorithms by conversational human speech, especially by non-native English speakers—for instance, unorthodox grammatical structures or use of non-English loanwords. Given the intuitive and non-mathematical nature of human language processing and expression, is it feasible for ASR, or any algorithmic program, to ever account for the complexity of human speech?

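    A minimal illustration of the flagging idea from question 1 above. The word-confidence values, window size, and threshold are invented for illustration; real ASR systems expose confidence differently, and this is not the system described in the paper.

```python
# Sketch: flag low-confidence stretches of ASR output for manual review.
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    text: str
    confidence: float  # 0.0 = pure guess, 1.0 = certain

def flag_segments(words, window=3, threshold=0.75):
    """Return (start, end) index ranges whose mean confidence falls below
    the threshold, so only those spans need human review or transcription."""
    flagged = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        mean_conf = sum(w.confidence for w in chunk) / len(chunk)
        if mean_conf < threshold:
            flagged.append((start, start + len(chunk)))
    return flagged

transcript = [Word("my", 0.95), Word("family", 0.92), Word("lived", 0.90),
              Word("in", 0.97), Word("wolbrom", 0.31), Word("before", 0.88)]
print(flag_segments(transcript))  # [(3, 6)] -> send words 3-5 to a human
```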
  4. 1. On pg. 2 of this article, the authors state something that we have consistently come back to during class discussions, namely that 'relevance is inherently subjective'. After presenting this, they go on to claim that such subjectivity is bearable thanks to measures such as MAP, which, from my understanding, supposedly helps restore balance. But is this system foolproof? What do we risk by trusting such devices to maintain equilibrium? Is it possible to completely remove subjectivity from the process to ensure that nothing is left out, or are the countermeasures enough?

    2. Throughout the article, part of the process of creating a test collection is based on the indexing of the interviews. Is ASR applicable only to searching interviews with extensive indexing, which sounded quite expensive, or can it be used with a bare minimum of information? In addition, to search material effectively, does one have to be experienced in writing queries?

    3. It seems, from the information available in the article, that ASR really only works with the English language. Towards the end, there is mention of a similar system and test collection being created in Czech, but to what extent does the language limitation limit the handlers' understanding of the test collection, especially for something involving so many different accents, languages, and cultures? On collections that are more international in nature, is there extensive collaboration between countries with a vested interest? Would this facilitate the creation of such test collections?


  5. 1. What is a "lattice of phonemes" (p.2)?

    2. What is the second basic approach to searching spoken word collections called? The first approach is called "word spotting", but the second one is not so pithy. Perhaps we should call it "sequence spotting"?

    3. It is hard for me to imagine that our generation will see ASR come into the mainstream in the U.S. unless there is significant financial value. The issue is not with the technology so much as it is with the inertia of the emphasis on the written word. Would Google or Bing trash their own market for text search as Apple sacrificed its desktop and laptop markets to pursue phone and tablet technology?

    Replies
    1. A lattice of phonemes would be, I assume, a graph of candidate sound fragments rather than a single string: the recognizer keeps several competing phoneme hypotheses for each stretch of audio instead of committing to one. A phoneme is the basic building block of an utterance, so to conduct a speech recognition-type search, you'd need to be able to translate between written text and sounds as they are heard. Phonemic breakdown (and the International Phonetic Alphabet) help make that happen. A toy sketch of such a structure follows.
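    The phonemes, timings, and scores below are invented for illustration and are not the paper's actual lattice format; the point is just that competing hypotheses coexist until search time.

```python
# Toy phoneme lattice: each edge covers a stretch of audio and carries one
# candidate phoneme plus a score, so alternatives survive until search time.
from collections import namedtuple

Edge = namedtuple("Edge", "start_ms end_ms phoneme score")

lattice = [
    Edge(0,   120, "k",  0.80), Edge(0,   120, "g",  0.20),
    Edge(120, 300, "ae", 0.65), Edge(120, 300, "eh", 0.35),
    Edge(300, 420, "t",  0.90), Edge(300, 420, "d",  0.10),
]

def best_sequence(lattice):
    """Keep the highest-scoring phoneme for each time span (a crude 1-best path)."""
    by_span = {}
    for e in lattice:
        span = (e.start_ms, e.end_ms)
        if span not in by_span or e.score > by_span[span].score:
            by_span[span] = e
    return [by_span[s].phoneme for s in sorted(by_span)]

print(best_sequence(lattice))  # ['k', 'ae', 't'] -- roughly "cat"
```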

  6. 1. There are four factors in evaluating ASR: (1) the size of the recognition vocabulary, (2) the quality of the speech signal, (3) whether there is a single speaker or multiple speakers, and (4) the hardware. The first three are mentioned in this paper, but the hardware part is missing. So what is the situation with hardware in the ASR field?

    2. In the section on creating relevance judgments (4.3), the authors say ‘five categories were derived from our understanding of historical methods and information seeking processes’. Is this too subjective? Are these five categories enough? Are these categories effective? I wonder whether there are other possible relevance criteria.

    3. Table 2 shows the results of the Maryland experiments. I wonder why they divided the experiments into two parts: single assessor and adjudicated. And as the table shows, ‘title-only and full queries yielded similar results’. Why are the results similar? I also wonder why ‘the segment summaries and thesaurus terms are the most useful sources of index terms’.

  7. 1. On page 2, the article mentions weaknesses present in the ASR technique. One of those weaknesses is emotional state. With that being said, why did the writers of this paper choose to use a collection of interviews from witnesses to the Holocaust? I'm assuming many of those interviews would be emotional, and wouldn't that affect the ASR technique?

    2. Under section 4.3, "Creating Relevance Judgements", the writers mention that their five categories were refined by discussions with "our assessors". Who are these people and how much experience do they have in this field?

    3. With so many variables in capturing recognizable and coherent sound using the ASR technique, I don't really see ASR ever overtaking text-based recognition. From reading this article, it seems that the money and labor that have to go into this enterprise are too great. Speech is too large a format; it can't all be recorded. Text, while large, is more permanent and is vastly easier to process.

  8. 1. I had a really hard time following the steps of the study. While I’m sure I don’t need to understand it all, I am confused by a few parts. The article mentions that the segments had subject indexing done on them, that graduate students judged their relevance, and that the oral histories were manually transcribed. I am wondering why this manual indexing was necessary if the goal of the project was to have the ASR system decide relevancy, or was this to compare results? How were the graduate students able to determine the relevance of the segments without knowing what research question someone could be asking?

    2. How does using ASR compare to using a transcription of an oral history and then running a text search over the transcription? Is one more cost- or time-efficient for an archive providing its resources to researchers? Does one provide more accurate results?

    3. The article mentions studies of ASR in other languages and possible multilingual experiments in the future. I am wondering if there are certain languages or families of languages that make ASR more or less difficult? I imagine languages with more varied use of accents or inflections would make ASR more difficult.

  9. 1. The interview indexing is quite expensive, as the authors say, because of the human labor. So could the indexing process be crowdsourced to save on the budget?

    2. Can the authors' approach to creating their test collection be replicated for other kinds of test collections on other topics? Is the test collection creation process inherently determined by the collection, or is it replicable?

    3. When collecting the interviews, the interviewees were asked to complete questionnaires in advance. The authors say the purpose is to help the interviewers prepare and for later use in the search system. So what exactly is included in the questionnaires? Would the questions affect how the interviewees behave during the interviews? Would it be better to collect only minimal information for interview preparation, and leave all topic-relevant questions until after the interviews as a follow-up?

  10. 1. As yet, we’ve barely managed to teach computers to read. Teaching them to listen has proven an even more daunting task for many of the reasons the authors cite. However, this seems like a technology with great promise, in permitting simpler interactions with automated phone menus at the least, but also assisting the visually impaired or otherwise aiding interactions with computers. Is this an area likely to receive ongoing funding and support?

    2. Given the rapid evolution of language and the immense variation between speakers, how possible will it be to train these systems to hear such variations as similar constructs of the same language? This study's initial results showed low accuracy, but it seems very preliminary.

    3. Will technologies such as this influence our speech instead, cultivating a ‘computer accent’ that is more easily understood by machine interfaces? To some degree we find ourselves doing this already to be clearly understood even in interpersonal speech.

  11. I mentioned the cost associated with natural language processing when discussing the Croft et al. article, and now within this Oard et al. article they list a price of $2,000 for indexing each interview in the study. Is there an accepted cost threshold for processing natural language? I would assume that with natural language, multiple assessors and reviews need to take place in order to ensure accurate indexing, transcription, and ultimately relevance judgments. These things all cost time and money for those involved and will likely go down in price, but ultimately seem like they will continue to cost more than traditional IR methods. So, if this is the case, how widespread might these processes be in the future?

    The assessor interface described on page 4 seems to cover most, if not all, of the assessors' potential needs as they go about making their assessments. Is it possible that, with all the bells and whistles, along with the non-historical background of some of the judges, this interface contributed to the judgments taking a whole month to complete?

    Oard et al. note that the relevance assessors also did some research and kept notes as they completed their relevance judgments. Would it have been beneficial for them to then go back to their initial judgments, since their experience and insights might have changed, and review their own work before a separate review took place?

  12. 1. While the topics of security and privacy are mainly outside the scope of this article, I couldn't help but think about the implications of constant recording of speech in the everyday world, especially considering the nature of the topic touched upon by the researchers later in the paper. The question of why the spoken word is underrepresented in the IR field is interesting, but the authors seem to think that it is merely a result of storage capacity and technology limitations, which I'm not sure I agree with. How do others feel about this?

    2. In many of the test cases it seemed that the ASR often had trouble picking out specific names and organizations from recorded speech. Would it be possible to teach names of people and organizations to ASR by feeding it data from sources like the Yellow Pages or Yelp (roughly along the lines of the sketch after this comment)?

    3. I am curious whether any research is being done to use technology such as reCAPTCHA (a verification system that harnesses human computing power to complete large tasks) to transcribe recorded speech in a more intelligent way.

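    A rough sketch of the idea in question 2 above. The vocabularies and sentence are invented; a real recognizer would also need a pronunciation and language-model probability for each added name, so only the vocabulary step is shown.

```python
# Sketch: estimate how an external list of names shrinks the
# out-of-vocabulary (OOV) rate of a recognizer's word list.
base_vocab = {"my", "family", "lived", "in", "the", "ghetto", "before", "war"}
external_names = {"mauthausen", "sobibor", "lodz"}  # e.g. from a gazetteer

reference = "my family lived in lodz before the war".split()

def oov_rate(words, vocab):
    """Fraction of spoken words the recognizer has no entry for."""
    return sum(1 for w in words if w not in vocab) / len(words)

print(oov_rate(reference, base_vocab))                   # 0.125 ("lodz" is unknown)
print(oov_rate(reference, base_vocab | external_names))  # 0.0
```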
  13. 1 - In reviewing some of the parameters of this study, I'm curious about how more nuanced understandings of language and identity could affect a larger scale project. This weekend I attended a conference where a representative from the Shoah Foundation spoke about the video archive, but there was also a larger discussion of how many legal or state-sanctioned terms, such as "genocide" (which cannot be considered politically motivated, according to the UN) or even "citizenship" (what about stateless individuals or undocumented persons?), can limit the inclusion of voices. I'm interested in how some of these terms (an individual who was identified during the Holocaust as Jewish, for example, but perhaps doesn't personally identify as such) could affect or alter the way informal/conversational speech is used or searched for "relevance."

    2 - Does this project lead us into a metadata minefield? Conversational speech is highly contextual and contains verbal tics and patterns that may be useful to some and irrelevant to others - how can we create useful codified vocabularies to enable access to materials such as oral histories, without re-inventing the wheel each time a new topic or collection is developed?

    3 - I'm intrigued as to how this work could interact with other work being done in sound, such as Dr. Clement's HiPSTAS project (http://blogs.ischool.utexas.edu/hipstas/) where sound is being visualized to look for differences and commonalities in speech. How might visualized speech patterns enhance the work that's being done here, for example, being able to look at sound patterns to find specific phrases or words that are repeated, but used in different contexts?

  14. 1 In this research, the authors use a search-guided approach to improve on pooled assessment. Would this be too leading and introduce problems into the research?

    2 As Table 2 shows, there are big differences between the ASR and manual approaches, and the manual approach is far closer to the combined approach. So is there still a long way to go in improving ASR?

    3 As Table 2 in this paper shows, title-only and full (title, description and narrative) queries yielded similar results. Moreover, adjudication and review did not markedly alter the relative effectiveness of the reported collections. Does this offer any lessons for current IR systems?

  15. 1. So the segments referred to throughout the experiment are the "topically coherent segments" that indexers divided the interviews into, correct (Section 4.1)? And these segments made up the test collection, providing the basis for study?

    2. The participants ranked, or assessed, the relevance of a segment based on what? A query they had pre-formulated? How does the relevance then affect the IR system?

    3. The eventual goal, and the idea this paper is trying to elucidate and support, is to be able to develop a system which accounts for the intricacies of language in searching oral histories? Have I understood that correctly?

  16. 1. This article makes me think about what Google has been doing with their Google Voice service. Google Voice manages your calls and voicemails and allows users to record calls and have Google transcribe them, as well as transcribe voicemails. It even allows users to make corrections to the voicemail transcription if the automated service doesn't get it right. I imagine that Google Voice is just a way to offer users some free services in exchange for their help in building a better way for Google to recognize conversational speech.

    2. "...we are building a community of users from whom we can continue to learn about the true information needs that motivate those who seek access to this collection." I think that this is essential to any type of automatic speech recognition. There are many variances in speech and having some context based on the type of users or community the speech is being transcribed from can help improve ASR.

    3. As I was reading this article, I tried to think of a way that searching transcribed audio would directly benefit me in my everyday life. Something that I would consider very useful is the capability to search the transcribed audio from YouTube (YT) videos. As it is right now, there is an overwhelming amount of content on YT and, more often than not, the author-created metadata used for searching is not a very efficient way to find videos that you're interested in. If each video were transcribed and that transcription made searchable, perhaps even broken down by video and the corresponding time stamp of the result, all of the information on YT would be so much more accessible and useful (see the sketch after this comment).

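    A tiny sketch of the idea in question 3 above. The video IDs, captions, and timestamps are invented, and this is not an actual YouTube API; it only shows how transcript segments could be indexed by word and queried back as (video, timestamp) hits.

```python
# Index transcript segments by word so a query returns (video, timestamp) hits.
from collections import defaultdict

segments = [
    ("video_A", 15.0, "today we are baking sourdough bread"),
    ("video_A", 95.5, "let the dough rest for an hour"),
    ("video_B", 12.0, "sourdough starter needs daily feeding"),
]

index = defaultdict(list)  # word -> list of (video_id, start_seconds)
for video_id, start, text in segments:
    for word in set(text.lower().split()):
        index[word].append((video_id, start))

def search(query):
    """Return (video, timestamp) pairs whose segment contains every query word."""
    hits = [set(index[w]) for w in query.lower().split()]
    return sorted(set.intersection(*hits)) if hits else []

print(search("sourdough"))  # [('video_A', 15.0), ('video_B', 12.0)]
```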
  17. 1. I’m confused about the disadvantages of one of the approaches, word spotting. The authors say word spotting in large collections is practical only when the query is known in advance. However, I wonder in what situations we cannot know the query in advance. I thought that in some cases the information need might not be expressible as a single query, so the word spotting approach could not be used. (A toy contrast between per-query scanning and pre-built indexing follows this comment.)

    2. Since differences in speaking rate, accents, background noise, emotional state, and many other factors can severely affect recognition accuracy, I wonder whether the ASR technique could be applied to the field of privacy protection. Besides, ASR might also struggle to meet information needs; it can only match the query we provide.

    3. I think pooled assessment is a very good way to deal with large collections and large numbers. The authors say that a limitation of pooled assessment for their purposes is that it depends on contributions from a moderately large number of different systems. However, I’m more worried about another issue: when we get top-ranked documents from many systems, how would we re-rank all of these documents?

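    A toy contrast for question 1 above, using invented text strings as a stand-in for audio: word spotting re-scans the whole collection for each new query, while a transcript indexed ahead of time answers any later query with a lookup, which is why spotting is attractive mainly when the query is known in advance.

```python
# Invented transcripts standing in for audio recordings.
recordings = {
    "interview_01": "we hid in the forest near the village",
    "interview_02": "my brother worked in the labor camp",
}

def word_spot(query, recordings):
    """Per-query scan of every recording: cost grows with collection size."""
    return [rid for rid, text in recordings.items() if query in text.split()]

# Build an index once, up front; later queries become a dictionary lookup.
index = {}
for rid, text in recordings.items():
    for word in text.split():
        index.setdefault(word, set()).add(rid)

print(word_spot("forest", recordings))     # ['interview_01']
print(sorted(index.get("forest", set())))  # ['interview_01']
```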
  18. 1. On page 4, the authors mention that "Five categories were derived from our understanding of historical methods and information seeking processes" and then list the categories. Although the categories listed in the paper sound reasonable, I still lack confidence in the criteria. Are there any theories to support these categories?

    2. After reading the paper, I'm interested in the prospects of information retrieval for conversational speech. It is widely understood that the rise of new technologies prompts a boom in new applications. With the appearance of information retrieval for conversational speech, what kinds of new applications might emerge to affect our daily lives? Or is it possible that such a technology would change the way we archive and manage voice files?

    3. On page 2, the authors introduce some background about the research and point out that emotion can affect the results of information retrieval for conversational speech. In this case they chose first-person narratives of the Holocaust as the input for the research. However, would this choice undermine the conclusions of the paper?

  19. 1. Explain further what ASR is. How are the systems trained? How much human effort is involved in this training process?

    2. This experiment involved collecting interviews, briefing assessors and judging relevance, training ASR's and testing, among other things. This was obviously a very complicated and expensive process. How long did the entire process take?

    3. Did the users who requested information from this test collection have access to the recordings or just to the notes and summaries? The text seemed to imply that only occasionally were the recordings accessible to the user.

  20. 1. The data that can be accessed via spontaneous conversational speech would be a great boon for academics in virtually all fields, but will non-written forms of communication ever carry the weight and accessibility of writing?

    2. The ability to index large amounts of non-written data with any sort of objectivity or appropriate degree of indexing seems so far off, yet this paper was written way back in 2004. What steps have been taken since this study? How far along are we?

    3. How might such technology have to change if we were working with a language other than English? For example, if we were working with a language in which tone and pitch convey much more information, how would the software need to change?

  21. 1. I had a hard time following this article. Maybe it was because I had the flu this weekend, but I struggled to follow the process of creating the test collection. I do realize that it is difficult and time consuming (to say the least), but I kept thinking that using only interviews of Jewish survivors of concentration camps was going to give them a limited type of speech sample. At the end of the article they admit this and say they will continue to develop additional test collections with other speech types.

    2. In our Understanding and Serving Users class we worked on the Glifos project. It has audio recordings of interviews with Texas oil industry workers from the 1920s-30s. With this software, the transcription is displayed in sync with the audio. The interviews were pretty casual, without any emotional overtones. The material has been digitized, transcribed, and the audio matched to the transcription. If memory serves me correctly, I believe it has been indexed. This might be a great resource for a test collection.

    3. They talk about not being able to scale up any of the manual work – could crowdsourcing be a model employed here?

  22. 1. If this project had been helmed by some kind of consulting firm that wanted interviewees to be familiar with what could be discussed regarding their place of employment/quality of life, I could see it... but I'm at a bit of a loss as to why the researchers would give the interviewees a questionnaire to peruse and complete. By doing that, aren't you inherently influencing the responses you get to the questions, and making sure there aren't any anomalies for the software to struggle with (even though your point is to build better software)?

    2. In essence, aren't you asking these people to edit their life narratives to fit a structure needed to test this program? If that's the case, why not make a test program to see how well the program works and refine the rough edges (or at least take them into account) with a slightly less important topic of discussion instead? It may not be the most informative first collection, but it would allow the interviewees from the Holocaust free rein to elaborate and narrate as they best see fit.

    3. The assessors for this research were students once again. While I can't deny their close proximity and abundance, shouldn't an effort be made, when working with certain materials, to query the general public for interested participants? In the case of amateur genealogists or fans of history, assessing how well a feature gives them search results relevant to their topics is likely something they'd be excited about if it worked well, and they would eagerly participate in making it function better.

  23. 1. On page 4, the author lists five relevance categories to obtain judgments. However, some of them are difficult for me to understand without examples. Like, what do "provides indirect/circumstantial evidence" and "provides pointer to a source of information" mean?

    2. The authors claim that "differences in speaking rate, accents, background noise, emotional state, and many other factors can severely affect recognition accuracy". However, does this IR test collection take responsibility for filtering or correcting these aspects? Or do they just give these documents to users and expect them to judge their accuracy?

    3. In the process of building the test collection, the interviews were conducted and recorded by the research team. However, how can we make people aware of the significance of spontaneous conversational speech and keep them actively using this system in the future?

  24. 1. Two approaches to searching spoken word collections are discussed in this article. Word spotting is assumed to be impractical since the time needed increases linearly with the size of the collection, but the increasing power of machines will improve the efficiency of word spotting. The accuracy of large-vocabulary ASR will be compromised by differences in speaking rate, accents, and background noise. So which approach is most used currently, or is there another approach?

    2. Regarding relevance judgments, this article and the one by Saracevic both recommend multi-valued scales, yet a conflict exists: Saracevic rejected binary relevance judgments, while on the other hand these authors regard binary relevance judgments as the basis for the most widely reported retrieval effectiveness measures.

    3. In section 4, the authors note that it is unaffordable to scale the whole process up to the entire collection. They list approximately $2,000 per interview for interviewing, digitizing, and data entry, and another $2,000 for the manual indexing process. It is mentioned that machine learning techniques led directly to the creation of the IR test collection. But what percentage would this save, and how could machines replace the manual part?

  25. 1. The recent technique of large-vocabulary ASR for more efficient query processing searches the lattice in advance for word sequences that match a language model trained using enormous amounts of representative text. Is the model trained on a particular accent? If so, does it cater to only a part of the world audience?

    2. Pooled assessments and search guided assessments are used to build IR test collections in the project. What kind of samples are used for these assessments? In pooled assessments top ranked documents are judged. What is the basis for this ranking? How is it done?

    3. In speech recognition, the pre-processing stage prior to decoding has some stages for noise removal etc. Is there a chance of losing important data in this stage? What can be done to avoid it?

  26. 1. In this article the authors constructed the topics for their test collection by taking requests from a variety of different agencies and organizations that wanted information from the collection they were using. They expanded the scope of several of the more specific requests based on their own understanding of the work and of information requests in general. Wouldn’t this type of behavior create some sort of bias, as the authors are making general assumptions about the information needs of specific users? Do you agree with what they did or not?

    2. In judging the relevance of the collection they were creating, the authors chose five different relevance categories that were each judged on a five-point scale. The categories were created from their own understanding of information seeking and from feedback given by their assessors in a two-week pilot study. Do you agree with the categories that the authors defined in this article? Do you think they should have used information from assessors in creating these categories, or should they have contacted potential users and gotten their ideas?

    3. In this study the authors used two methods to create relevance judgments. One method had a single assessor judge 14 topics, and the other had two different assessors judge a different 14 topics. In the second case, when one of the assessors gave a high score, the two met and decided together what score to give the topic, whereas all other discrepancies in score were merely averaged together and rounded up. What reason can you think of for the two different methods of adjudication being used for different scores? Also, why did the authors round up when they averaged scores, and what type of bias would that introduce?

  27. 1. In what ways is the content of ASR and spoken word searches commonly used? How does this compare to other information resources?

    2. I'm not sure I fully understand the scope of ASR systems. What are the different component parts, and how exactly do they interact? What benefit is created by this system as opposed to manual assessing/indexing?

    3. How has this project developed since 2004? Are there any new technologies or systems for spoken word and oral history that are currently being used and not discussed in this article?

  28. 1. The five categories for assessment were derived from the authors’ understanding of historical methods and information seeking processes. What was the basis of their understanding, and how does it relate directly to the five categories? Did one have more influence than the other, and what would this mean for their test?

    2. I don’t think I clearly understood what the authors meant by having name authority control within an interview but not across interviews. Were people interviewed multiple times, and were they then unable to connect the two different interviews as being by the same person? Or were there multiple people with the same name whom they could not later distinguish from one another?

    3. I didn’t follow the end of the article where the authors were assessing the assessment. While it was highly technical, toward the end of the segment the authors randomly selected ten highly ranked unjudged segments. They then state that two segments, to their “untrained ear”, appear to be relevant. Is this a throwaway line, and should it be included in the article? If these segments were not part of the assessment, should they then try to assess them?

  29. 1. What is a lattice of phonemes?

    2. What is mean uninterpolated average precision? (A definitional note follows this comment.)

    3. In the conclusion the authors bring up topic segmentation and question whether it is useful for “supporting access to spontaneous conversational speech.” I wonder what would be more useful? They mention that they’ve switched their manual indexing from “segment-based annotation to time-based annotation,” but I didn’t understand what that was solving, exactly.

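    For reference on question 2 above, a standard textbook definition (not quoted from the article): uninterpolated average precision for a single topic, and its mean over a set of topics.

```latex
% AP for one topic; MAP is the mean of AP over the topic set T.
\[
  \mathrm{AP} \;=\; \frac{1}{R}\sum_{k=1}^{N} P(k)\,\mathrm{rel}(k),
  \qquad
  \mathrm{MAP} \;=\; \frac{1}{|T|}\sum_{t \in T}\mathrm{AP}_t
\]
% R      : number of relevant segments for the topic
% N      : number of segments retrieved
% P(k)   : precision over the top k results
% rel(k) : 1 if the segment at rank k is relevant, else 0
% "Uninterpolated" means precision is taken exactly at each relevant rank,
% with no interpolation or smoothing across recall levels.
```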
  30. 1. The idea of having relevance judges is both fascinating and a little... not unsettling, but something similar. Because relevance is so subjective to a particular user at a particular time in his or her search, it seems "off" to have researchers assessing relevance in such a controlled setting.

    2. I think it's interesting to think about how this type of assessment could be used in the future. When you think about huge corpora, like YouTube, for instance, you have unimaginably large data sets that are searchable by only the limited data given by the uploader. Could the methods/assessments used in this study one day be blown up to that large a scale?
