1. Early in the article, the authors stipulate that the difference between a document and a database record is essentially its organization and structure. A document is largely composed of unstructured text while a database record has a clear organization (like a bank statement) (2). However, how much unstructured text changes something from a database record into a document? Is it always a clear cut distinction?
2. The authors also discuss methods of evaluating a search engine’s usefulness, including the “clickthrough” method. This basically entails tracking a document to see how many clicks it gets – the more the better (6). Is this really an effective method? How do the search engineers know whether it was a human or robot that clicked the link? If I go click on a document 30 consecutive times, will the engineers know that it was a single user randomly clicking? Furthermore, even if the document was clicked, that click alone doesn’t prove the document’s relevance or usefulness. How is this accounted for?
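A minimal sketch of how clickthrough counting might work, assuming a hypothetical log of (session, query, clicked document) records; collapsing repeat clicks from the same session is one naive guard against a single user clicking thirty times, not necessarily what any real engine does:

from collections import defaultdict

# Hypothetical click log: (session_id, query, clicked_doc) records.
clicks = [
    ("s1", "bank statements", "doc42"),
    ("s1", "bank statements", "doc42"),   # same session clicking again
    ("s2", "bank statements", "doc42"),
    ("s3", "bank statements", "doc17"),
]

# Count at most one click per (session, query, document), so thirty
# consecutive clicks from one user weigh the same as a single click.
unique_clicks = set(clicks)
clickthrough = defaultdict(int)
for _session, query, doc in unique_clicks:
    clickthrough[(query, doc)] += 1

print(clickthrough[("bank statements", "doc42")])   # 2, not 3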
3. The authors claim that “the users of a search engine are the ultimate judges of quality” (6). While I can see the logic of this statement, isn’t it also problematic? How can search engineers get accurate and comprehensive feedback from users? It seems unlikely that every user could be completely satisfied with a search engine – how do search engineers deal with this?
1) It is interesting that, at least as of 2009, most people who end up working in information retrieval have never actually formally studied it (p. 10). This seems like a major oversight, especially given that information retrieval is such a user-focused field and would thus need a very distinct skill set from general computer science. How was this issue overlooked for so long?
2) The authors touch briefly on the problems of retrieving visual information, and the advances that are being made in image retrieval (p. 3). Just the last few years have seen Google Image Search roll out a feature in which the user can click and drag an existing image into the searchbox and retrieve optically similar results, which makes it much easier to, for instance, source an image that the user already has in their possession but for which they do not know the origins. The image search function works well for this purpose, but it is still extremely difficult to retrieve an image that the user remembers but does not have in their possession. How can information retrieval specialists facilitate the retrieval of this kind of visual information?
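One simple way to make "optically similar" concrete is a perceptual hash: shrink both images to a tiny grayscale thumbnail and compare the resulting bit patterns. A minimal sketch, assuming the Pillow library and hypothetical file names; real image search almost certainly uses far richer features:

from PIL import Image   # assumes the Pillow library is installed

def average_hash(path, size=8):
    # Shrink to 8x8 grayscale and mark each pixel as above/below the mean brightness.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming_distance(h1, h2):
    # Fewer differing bits suggests the two images look more alike.
    return sum(a != b for a, b in zip(h1, h2))

# Hypothetical usage: a small distance suggests the dragged-in query image
# and an indexed image are near-duplicates.
# distance = hamming_distance(average_hash("query.jpg"), average_hash("indexed.jpg"))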
3) One major problem with nailing down an algorithm for relevance is the fact that individual users have such radically different information needs, so information retrieval systems need to suss out why the user is searching as well as what they are searching for. Is there a way to shift some of this burden back onto the user, for instance by including a small number of questions for the user to fill out about why they are searching? Is the average user’s expectation of immediate results part of the problem?
1. This article immediately brought to mind the claims made by Seadle and Greifeneder in the article we read a couple of weeks back about the relevance and importance of linguistics to robust information studies. It seems that a key issue in information retrieval is language, the use of language, and the ever-evolving state of usage. Someone I know who recently completed a dissertation in French linguistics actually got a job with a major computer company working on this very thing.
2. In another class, I'm writing a paper about metadata in graffiti archives, from grassroots to gallery, and this article also touches on this, saying, "The current technology [2009] for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example." I hadn't even thought of this, but I definitely need to incorporate current image-matching technology into my discussion of graffiti archiving, insofar as such technology would aid archivists in identifying artists, tags, locations, and themes.
3. This one also - especially in the discussion of the concept of "relevance" - made me think about the implications of "improved" information retrieval on the often-discussed value of "serendipitous" searching and browsing, which, from what I've learned so far, has typically been much more associated with brick-and-mortar libraries (and mourned by libraries and users). How might information retrieval specialists enable serendipitous searching in the digital environment? Things like StumbleUpon and Google's "I'm Feeling Lucky!" spring to mind first, but I'd like to know more about other, related work being done by information studies scholars, developers, and programmers in this area.
1. On pg. 3, the authors describe the different types of searches that can be done, such as peer-to-peer, desktop, enterprise, etc. Are there times when it is necessary to combine different types of searches to find what you are looking for? Could you conduct one search, then another, and easily cross-reference your findings or can you conduct many different searches all at once? Would the latter result in information that is not necessarily relevant?
2. Before reading this excerpt, I had not realized that being a 'search engineer' was a real and viable profession. Is there any evidence of gaps between search engineers' work and usability? We have been speaking about end results and users in class, and I wonder whether search engineers build a search engine and, if it works as desired, consider themselves done with it without thinking beyond that. Or is it just that the everyday user doesn't really know what they want or is not adept at discerning relevance?
3. It is interesting how this portion of the book deals mainly with search engines that predominantly search text. Then I think of the article we read on ASR. I think about the option of using Siri on my iPhone, and how even though it is a developed technology, I still always prefer to type something out like a text message or a search in Yelp, because Siri very rarely actually matches exactly what I am saying. You have to speak very precisely and very slowly for it to properly work. I wonder if speech recognition is simply a technology continuously in development, and if it will ever really be possible for a speech recognition search engine to work as the user desires due to various contributing factors?
1. Why isn't destruction listed among the different elements of the information retrieval field (p.1)? If information that had been the most relevant for the user and/or topic is gone, are we ok just with the replacement, or do we at least want to be told that it has disappeared and how the new information compares to what has been "destroyed"?
2. Why do retrieval models treat linguistic properties as a secondary factor to relevance (p.5)? Does this say anything about the information that we are trying to retrieve or the people who are trying to retrieve it?
3. "A search engine is the practical application of information retrieval techniques to large-scale text collections." How much does Google Image rely on text to do its search?
1 One core issue in information retrieval, as the author states in the paper, is relevance. To address this issue, researchers propose retrieval models, and word frequency is used to evaluate relevance. How does this work? Is it absolute that the more frequently a word appears in a paper, the more important that word is? Does the frequency really relate to the topic in some way?
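Raw frequency alone is usually not the whole story; a common refinement is tf-idf, which discounts words that appear in every document. A minimal sketch over a toy collection (the documents here are invented for illustration):

import math
from collections import Counter

docs = [
    "the apple tree produced red apple fruit",
    "the apple company released a new phone",
    "the tree in the garden is an old oak",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    tf = Counter(doc_tokens)[term] / len(doc_tokens)   # how often the term appears in this document
    df = sum(term in d for d in all_docs)              # how many documents contain the term
    idf = math.log(len(all_docs) / df)                 # rarer terms get more weight
    return tf * idf

# "the" is frequent but occurs in every document, so its idf is zero;
# "apple" is both frequent in the first document and rarer across the
# collection, so it scores higher.
print(tf_idf("the", tokenized[0], tokenized))     # 0.0
print(tf_idf("apple", tokenized[0], tokenized))   # > 0

So in this simple model, frequency within a document only matters relative to how common the word is everywhere else, which is one partial answer to whether frequency relates to the topic.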
2 If the user gives a query, how does the search engine analyze it? Based on entities, verbs, or a combination? And how does the search engine analyze the user's intention in order to provide appropriate feedback for them?
3 The author also mentions multimedia documents in the information retrieval field, and the current technology for searching these is based on descriptions of their content rather than the contents themselves. How should we improve this? Should we combine IR with HCI to let the engines learn things as we humans do?
1. At the beginning of this article, the definition of IR is ‘a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information’. I think this is too general; IR under this definition contains almost everything about information. For example, management information systems also concern the structure, analysis, organization, storage, and searching of information. So what’s the difference between IR and MIS? Or could we say MIS is just a part of IR? I don’t know.
2. Table 1.1 shows us the content, applications, and tasks of IR. I wonder what the relationship is between IR and search engines. Based on table 1.1, what are the differences among various search engines? For example, I want to know the intrinsic distinctions between Yahoo and Google. I think Google is much more popular than Yahoo. So why? Are the reasons related to what table 1.1 shows us?
3. It seems that the biggest issue in IR is relevance. I think it is also the most difficult part of IR. For example, when people input ‘apple’ in the search box, how could we know whether they are searching for the fruit ‘apple’ or the company ‘Apple’? So how could we rank the search results? It is too subjective.
1. On page 4, it says that "there are many factors that go into a person's decision as to whether or not a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents." How does this really work? It's not like a Google employee is sitting next to every one of its users all the time. How do search engineers know whether or not something was relevant and useful for a person? Is it because they actually clicked the page? Maybe, it's the duration spent browsing that page? How reliable really are these retrieval models?
2. After reading this, I am more curious as to how predictive search works online. Are certain phrases brought up more because of ranking algorithms? Are the phrases that come up just the popular searches of that particular day, time, and location? Past browsing habits?
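A toy model of one plausible predictive-search signal: rank logged queries that share the typed prefix by how often they were issued. The query log here is invented, and real systems also fold in recency, location, and personal history:

from collections import Counter

query_log = [
    "weather boston", "weather today", "weather boston",
    "web search engines", "weather radar", "weather boston",
]
popularity = Counter(query_log)

def suggest(prefix, k=3):
    # Return the k most frequent logged queries that start with the prefix.
    matches = [(q, n) for q, n in popularity.items() if q.startswith(prefix)]
    return [q for q, _ in sorted(matches, key=lambda pair: -pair[1])[:k]]

print(suggest("wea"))   # ['weather boston', 'weather today', 'weather radar']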
3. I am wondering, since there are so many different search engine companies like Google, Bing, Yahoo, etc., do these companies ever use each other's search data to make themselves more efficient? Or do they each hoard their own data for themselves so they can become the best search engine with the highest degree of easy access and relevant data?
1. Concerning information retrieval, the author of this book mainly discusses cases where the search inputs are text. However, things are changing quickly in recent years. For example, Google is trying to extend its search inputs from text to images, voice, and photos. Are there any differences among them?
2. In the paper, the author points out that "the 'big issues' in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction." I'm very surprised by this part, since it's universally known that coverage matters for search engines as well as for information retrieval. How could the authors leave this important factor out of their assertions?
3. In the section before the last one, the author discusses the role of the search engineer and agrees that some types of people whose work concerns information retrieval should be considered search engineers. However, some roles are hard to classify. For instance, when I was at Baidu, there were a number of people responsible for assessing the search engine's algorithm. Their jobs were simple and repetitive: grading every result for a set of given queries from 1 to 5 and then sending the results to the professionals in the company. What about them? Could they be considered search engineers?
1. On page 7 the author writes that open source search engines have different design goals than those of commercial search engines. It was unclear from the reading how these search engines are different. What is different in their design? Are the engines seeking to display different information?
2. As described in the reading, this textbook was written to bring the concepts of information retrieval to the computer scientists that build search engines. The author writes that search engineers are computer scientists, but often do not receive training in information retrieval. I find it surprising that this area of study would not be included for those creating search engines as its primary function is information retrieval. This perhaps relates back to our reading from a few weeks ago on the differences between information science, computer science and information systems. What can search engineers learn from studying information retrieval?
3. This article mentions the Yahoo directory at dir.yahoo.com. I remember using this directory when I was younger, but I was surprised to see that it still exists. I am wondering what the use of this directory still is. I can see that it allows a person to somewhat broadly search for sites if a search is not producing great content. The directory greatly limits the results for each sub-category though. How are the sites in the sub-categories chosen? What are the criteria for deciding if a site should be in a category as there are many possible sites that are left out of them?
1. In the model of information retrieval, what’s the difference between user relevance and user’s information needs? What are the factors that affect the user relevance?
2. My understanding of the distinction between information retrieval and a search engine is just that information retrieval is more of a theory and a search engine is more of an application. The author also says that a search engine is the practical application of information retrieval techniques to large-scale text collections. But why does a search engine care about different or additional issues beyond those of information retrieval, for example scalability and adaptability?
3. The author classifies open source search engines as a different class of system, due to design goals that differ from those of commercial search engines. I wonder what the differences are. Don’t they aim to solve the same set of issues (performance, incorporating new data, scalability, adaptability, etc.)? Don’t they only differ in their mode of collaboration?
1. In many years of internet usage, I have had to learn multiple search engine skills, some of which are mentioned in this paper, such as site search. Search results have gotten steadily more sophisticated, but very often a specialized skillset is still required. How can we continue to improve IR to recognize more “natural language”-style queries that users might enter?
2. Sometimes IR systems seem to get a little bit too smart, in fact, in trying to correct for user error. When searching for a term that is very close to, but not actually, another more common word, it can be quite difficult to get Google to stop turning up results for what it thinks is correct. This pretty much destroys relevance for such searches. Could we create a more modular system in which specific correction functions could be disabled?
3. The exploration of text queries is interesting, as many of us have had to develop a specialized “query structure” vocabulary to try to tell the search engine what we want and how important each term is within a query, and to eliminate less useful results. While teaching search engines to recognize regular language queries would be nice, should we consider search engine manipulation to be a language topic that should become part of regular education? For instance, many individuals do not know about -term as a means of eliminating unwanted parts of a query.
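For concreteness, here is a toy illustration of what the -term operator does: split the query into required and excluded terms and drop any document containing an excluded one. This is just the idea, not how any particular engine parses queries:

def parse_query(query):
    # Toy syntax: a leading "-" marks a term the user wants excluded.
    required, excluded = [], []
    for token in query.lower().split():
        (excluded if token.startswith("-") else required).append(token.lstrip("-"))
    return required, excluded

def matches(doc_text, query):
    words = set(doc_text.lower().split())
    required, excluded = parse_query(query)
    return all(t in words for t in required) and not any(t in words for t in excluded)

print(matches("jaguar the big cat of south america", "jaguar -car"))   # True
print(matches("jaguar car dealership and service", "jaguar -car"))     # False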
1. The author mentions that “all” documents have some amount of structure, including title, author, date, etc. However, with the advent of communication on blogs and other informal communication such as conversation on social media, how do information retrieval techniques continue to stay relevant and accurate?
2. With the amount of data being added to the internet every day, does the problem of relevance grow? While the author talked a lot about miscommunication in language and coverage, I’m wondering if we will reach a point when the sheer amount of data makes it too difficult for a computer to do a comprehensive search of all text. Or are we only concerned with precision when recalling an answer, and thus able to disregard all other relevant but redundant information?
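The precision/recall distinction raised here can be made concrete with the standard definitions, shown on an invented toy result set: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that were retrieved:

retrieved = {"d1", "d2", "d3", "d4", "d5"}    # what the engine returned
relevant  = {"d2", "d5", "d7", "d9"}          # toy relevance judgments

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2 / 5 = 0.4
recall    = len(hits) / len(relevant)    # 2 / 4 = 0.5

print(precision, recall)

Giving up on a comprehensive search of all text is essentially accepting lower recall in exchange for keeping precision high.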
3. I was intrigued when the author started to touch on the topic of spamdexing and would like to go into more discussion about some of the ways that search engines work to prevent irrelevant ad pages from popping up on search requests.
In dealing with media content, how disruptive are current tagging methods employed by users? For instance, many individuals seemingly tag photos with terms that have little to no bearing on the actual image or video; is this one reason for heading toward methods that employ image comparison for search purposes?
We’ve discussed cost previously in regard to dealing with information in different spheres, and on page 5 Croft et al. mention the advent of natural language processing. How intensive are the costs associated with natural language processing or speech recognition methods?
Search engineers are often individuals with a background primarily in computer science, which may be why a focus on particular algorithms and systems has played a primary role in IR development. In spaces such as the iSchool, where these individuals interact with social scientists and people from varying disciplines, is there likely to be a large shift toward user focus in IR development? Or will the dynamic nature of individual users stop that shift from occurring and leave the status quo to continue?
1 - It occurs to me through these readings that there's an implicit understanding that most information is textual - what happens to information retrieval when we incorporate audiovisual materials, or even physical objects, into the criteria? Is this something that's totally out of the realm of IR for various reasons, or is it because society as a whole considers information to be mostly textual?
2 - While reading this article, I was thinking of a recent project designed by the UN about global perceptions of women. The ad campaign involved typing in items such as "women should" and letting autocomplete handle the rest by filling in sexist tropes (link: http://www.upworthy.com/would-you-expect-these-results-to-appear-when-you-google-women-2?c=bm1) However, it's my understanding that these results don't imply that millions of people are typing "women should not vote" into the search engine...or is it? Again, we're back to the question of information/the internet being "democratic" - are these results really demonstrating a 'popular vote' on societal perceptions of women?
3 - How has information retrieval developed as a position/field of study within IS? It's mentioned that most individuals working in IR didn't formally engage in the study of IR, so where does this knowledge/understanding come from? Is this more of a tacit knowledge gained through problem-solving skills, especially in the realm of computer science or programming? It seems like a nuanced understanding of the logical structuring of programming and technology would enable one to understand IR more than other fields would.
1. The authors talk about the idea of unstructured text within documents. This, presumably, refers to any text which has no easily discernible attributes such as title, author, date, which are pertinent to IR. Is that correct?
2. I don't quite understand the concept of a peer-to-peer search. Aside from music sharing, what would be an example of this? Is this type of searching something that occurs frequently as I know the others (i.e. desktop, enterprise, etc.) do?
3. I'm not sure when this book was written, but the authors refer to linguistic vs. statistical analysis in IR. According to them, statistical analysis, with ranking algorithms being more concerned with word occurrences, is more common. Does anyone know if this has changed at all in recent years? Is statistical analysis perhaps employed because it's easier to gauge?
1. I don't mean to be pedantic, but the definition of Information Retrieval provided in this article has the phrase "... retrieval of information" in it. Does this really need to be stated? Couldn't the authors have just left the definition at structure, analysis, organization, storage, and searching?
2. I know that jobs exist to rate web pages based on search terms to see if those pages seem like relevant results to the terms. Is there another way besides using a human to evaluate relevance?
3. I know that a lot of websites have a search feature that allows the user to search through the information on that particular site. Assuming they use Google as their site search engine, is that search any different than searching on Google.com and specifying that Google only search that particular address?
1. In the definition of “Information Retrieval” from Gerard Salton, information retrieval is a field concerned with structure, analysis, organization, storage, searching, and retrieval of information. I'm confused with “searching” and “retrieval”; what is the difference between them?
2. One of the big issues in information retrieval, the vocabulary mismatch problem, is very interesting, but the author did not give a solution to it. Since language can be used to express the same concepts in many different ways with very different words, could a controlled vocabulary solve the mismatch problem at all? What are the disadvantages of controlled vocabulary?
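A minimal sketch of how a controlled vocabulary might be applied as query normalization; the term mapping is invented for illustration:

# Hypothetical controlled vocabulary mapping variant terms to a preferred term.
controlled_vocab = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "movie": "motion picture",
    "film": "motion picture",
}

def normalize_query(query):
    # Map each query term to its preferred form, so "car" and "auto"
    # retrieve the same documents despite the vocabulary mismatch.
    return [controlled_vocab.get(t, t) for t in query.lower().split()]

print(normalize_query("auto repair"))   # ['automobile', 'repair']
print(normalize_query("car repair"))    # ['automobile', 'repair']

One obvious disadvantage is visible even here: someone has to build and maintain the mapping, and it can erase distinctions the user actually intended (a query about physical film stock also becomes "motion picture").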
3. What’s the role we play in the field of information retrieval? The author says there are very few courses taught in computer science departments that give students an appreciation of the variety of issues in the field of search engineering. I have learned some information retrieval in the i-school; however, I wonder how to use those concepts and tools to build a real search engine.
1. I was surprised by the definition of filtering on page 3. I thought filtering was adding additional criteria to a search to narrow results. The definition given sounds more like Google Alerts.
2. Interesting that over probably the last 10 years(?) we went from search engines designed to find webpages to now webpages being designed to be found by search engines. These algorithms have changed the way we design, write, and code on a single page and across the site.
3. Relevance, and how search engines determine relevance, is still hazy to me. Getting additional information from the user so that the search can be refined helps with relevance, but there still seems to be a large hole that we jump over. There must be a bridge, but that bridge is poorly defined and explained.
1. The authors place test collections under the category of Evaluation. What is the purpose of test collections and why are they assembled? Are they only used for evaluation purposes or do they have other uses?
2. Croft et al. briefly mention clustering in reference to data mining. What does clustering refer to and how is it useful?
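Clustering in this context usually means grouping documents whose vocabularies overlap, without any predefined categories. A toy sketch using Jaccard overlap of word sets and a greedy single pass; the documents and threshold are invented, and real systems use more principled algorithms such as k-means:

def jaccard(a, b):
    # Similarity of two documents as the overlap of their word sets.
    return len(a & b) / len(a | b)

docs = {
    "d1": set("apple iphone release price".split()),
    "d2": set("apple fruit orchard harvest".split()),
    "d3": set("iphone release apple price review".split()),
    "d4": set("orchard fruit harvest season".split()),
}

# Greedy single-pass clustering: join the first cluster whose representative
# the document sufficiently resembles, otherwise start a new cluster.
clusters = []   # list of (representative_word_set, [doc_ids])
THRESHOLD = 0.25
for doc_id, words in docs.items():
    for rep, members in clusters:
        if jaccard(words, rep) >= THRESHOLD:
            members.append(doc_id)
            break
    else:
        clusters.append((words, [doc_id]))

print([members for _, members in clusters])   # [['d1', 'd3'], ['d2', 'd4']]

Grouping results this way is useful for browsing an ambiguous query and for mining the topics present in a collection.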
3. Open-source search engines are mentioned on page 7 along with the names of three popular systems -- Lucene, Lemur and Galago. How do open-source search engines differ from the four types mentioned near the beginning of the chapter: Vertical, Enterprise, Desktop and Peer-to-peer?
1. The authors remark that, "The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example." This statement caught my eye. How would you search for pictures or music WITH pictures or music rather than words? How would such a search engine function? Would the technology be similar to advanced facial recognition techniques, maybe?
2. If word frequency is so important for search engines, are they essentially making their own "clouds" of words in order to create ranking algorithms? How does that work?
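"Clouds" is not far off: what engines actually build is an inverted index, essentially a per-term record of which documents contain it and how often. A minimal sketch over invented documents:

from collections import Counter, defaultdict

docs = {
    "d1": "to be or not to be",
    "d2": "to search is to retrieve",
}

# Per-document term counts (roughly the data behind a word cloud).
term_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Inverted index: term -> {doc_id: count}, the structure ranking
# algorithms actually consult at query time.
inverted = defaultdict(dict)
for doc_id, counts in term_counts.items():
    for term, n in counts.items():
        inverted[term][doc_id] = n

print(inverted["to"])       # {'d1': 2, 'd2': 2}
print(inverted["search"])   # {'d2': 1}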
3. What is the difference between open source search engines and commercial search engines? The paper says they have different design goals but I don't think it specifies what they are.
1. On page 5, the author suggests that we distinguish between topical relevance and user relevance, and the example given also seems reasonable, since different users might have different experiences and retrieval needs. But how do we take user relevance into consideration when designing and developing a retrieval system?
2. As a user-centered service, an information retrieval engine collects users' behaviors or feedback to evaluate itself. However, how can we reduce the subjective impact of users who have bad retrieval habits?
3. At the end of this article, the author describes the role of the search engineer as designing and implementing new search engines. So what is the information professional's role in the IR area?
1. On page 5, the authors state that the evaluation of a search is a core issue for individuals involved in information retrieval. With that in mind, can this be cited as a driving reason why many search engine sites (Google, Bing/Microsoft, Yahoo!) push to be the one offering you something of a 'complete web package'? That is, your email linked to your social media linked to your search engine linked to the browser you're signed into... all in the name of returning search results more relevant to you as an individual? I realize a lot of it is also, yes, to get you to buy more things, and that it pre-emptively filters the internet for you, which has been cited as not necessarily a good thing, BUT is there a reason for it beyond capital?
2. The authors stress the importance of text for searching for documents/images/etc. online. That said, how does Internet culture in some ways hinder or resist being searchable? Particularly on sites like Twitter and Tumblr, many users utilize tagging as something of a secondary method of communication with other users. They will leave each other messages or notes within tags for specific users they are 'conversing' with... but since none of their language adheres to any kind of search hierarchy, it makes whatever the original post was about unsearchable by the terms we'd assign to it if we were able to see it.
3. In the realm of peer-to-peer, desktop, and web searching, where does searching cloud storage fall? The devices uploading to that cloud aren't all on the same intranet, but the items within the cloud aren't available for searching on the web at large. So where would this put searching one's cloud?
1. It is mentioned that current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves. But techniques for direct comparison are emerging, and we all know some applications already exist, like Shazam. What method do they use to make the comparison?
2. Evaluation of retrieval models and search engines has focused on using large volumes of log data, such as clickthrough data. But as we can see, clickthrough has some flaws. Sometimes users can get their answer without any clicking, or they may click the wrong link. How can we deal with this situation?
3. The author talked about open source search engines in this article, but did not give a clear definition of them. Is the reason simply that they are developed on open source platforms? What is the difference between open source search engines and Google?
1. One of the biggest issues in information retrieval is relevance. Topical relevance and user relevance are discussed. To what degree does topical relevance give out accurate results? How do we determine that? And are there situations where user relevance need not be taken into account? Like for example, if the user searches for a topic in the broader sense?
2. One more issue discussed is users' information needs. The information needs of users keep changing constantly and are not the same for each of them. Query suggestion, query expansion, and relevance feedback are some techniques to offer better results to users. What kind of test group should be chosen for this? How large should it be? How can you make sure the particular group addresses the needs of all the users?
3.How are search engines made more scalable? With more data and people, do they tend to give out irrelevant results? So, should the algorithms be dynamic? How efficient are the present algorithms?
1. On page 3, the author discusses different types of searches and methods for data retrieval. I'm curious as to the context and scope of this list--how have these methods evolved over time, and what are the foreseeable changes?
2. Since relevancy is rather subjective in nature, to what extent do evaluation tools truly catch all that they seek to capture? Is there a way to quantify and discuss gaps in information, information that is perceived to be missing/lacking, or information that is missing/lacking but of which that factor is unknown?
3. I appreciated the author's discussion of the search engineer. Sometimes, in the details of talking about systems, processes, and algorithms, the human element can become lost. Taking the time to discuss their role, background, training, and skills is helpful in understanding the full effect of the information retrieval system and construct.
1. In this article the authors discuss several different types of search engines. One type of search engine that they describe is the open source search engine. What do you think are the benefits of using an open source search engine versus the benefits of using a proprietary search engine that has been created for a similar task? What are the drawbacks?
2. In this article the authors discuss the various different applications of information retrieval, from web searching to enterprise searching to desktop searching. They state that each of these different applications has different requirements due to the amount of information they will be searching as well as the type of information they will be searching. However, later they discuss that scalability is an issue in search engines and that search engines should be able to work on large and small collections. Do you think that you should use one engine to search both the web and your home computer or organizational network, or do you think these different applications should use different engines?
3. In this article the authors briefly discuss how information retrieval is being used to deal with different media types such as pictures, video, and audio. They state that currently most technology that is used to search these types of media is based on searching text descriptions of this media. In the Oard et al. article that we read this week the authors attempt to create an information retrieval test collection for conversational speech. Do you think that Oard et al. are creating a test collection that uses text descriptions like this article states?
1. On the first page of the article, the authors refer to a definition of information retrieval given by Gerard Salton. The authors describe it as “appropriate and accurate.” I find it to be vague. Could we work on a better definition?
2. What is ad hoc search? It comes up on page 3, and I feel like the authors indirectly define it, and yet I don’t really know what they’re talking about…
3. On page 5, the authors say it is an “interesting” feature of retrieval models that they focus on statistical properties rather than linguistic ones. Perhaps it’s just because I’m part of the google generation or something, but this doesn’t seem at all counterintuitive to me. The idea of taking into account the parts of speech, on the other hand, is more intriguing. How is information about adjectives and nouns incorporated into the “more advanced” retrieval models that they mention? Would that be more like semantic web?
1. There will be a huge impact on the results of a search depending on how relevancy ranking algorithms are constructed. If people could find out how all their searches are constructed, in terms of relevancy, would this influence the retrieval systems and databases they use? I know most of these are generally considered proprietary, but should they be available for public access since they affect the information-seeking habits of everyone who uses them?
2. The reading mentions that linguistic features are of secondary importance to an information retrieval model. Is this because of the complexity and intricacy of individuals' speech patterns? Depending on the community, or the linguistic inconsistencies and lexicon of an individual, the same question could be constructed in several ways.
3. As I read about clickthrough data I thought about things like Wikipedia getting one of the highest rankings for any search. If retrieval is based, in some part, on what users "clicked" before, is this reinforcing the initial retrieval algorithm? This is based on the assumption that many users click on links in the first couple of pages of their search results.
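On the third point, a toy illustration of the feedback loop being asked about: if clicks feed back into the score and users mostly click the top result, whatever ranked first early on tends to stay there. The scores and boost below are invented, not any engine's actual formula:

# Base text-match scores plus a boost from accumulated clicks (all invented).
base_score = {"wikipedia_page": 0.70, "niche_page": 0.72}
clicks = {"wikipedia_page": 0, "niche_page": 0}
CLICK_BOOST = 0.01

def ranking():
    return sorted(base_score, key=lambda d: -(base_score[d] + CLICK_BOOST * clicks[d]))

# Simulate users who always click the top result: the click boost keeps
# rewarding whichever page is already ranked first.
for _ in range(10):
    clicks[ranking()[0]] += 1

print(ranking())   # the early leader stays on top as its boost grows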
1. In section 1.2, The Big Issues, the authors mention that these factors must be taken into account when designing algorithms for comparing text and ranking documents. So I am wondering: how do the algorithm designers decide the boost value for each item?
2. The authors also mention the concepts of topical relevance and user relevance. So I am thinking about how to customize the search results in a search engine. Why don't all search engines have personal accounts to record what information each user has already consumed?
3. I just came across an interesting article about search engines, and I am wondering how this happened. Is it really because the people in charge of designing the algorithms are sexist? What is the mechanism behind the search box's auto-complete functionality? http://mashable.com/2013/10/18/google-autocomplete-sexism/
1. I'd never given much thought to my spam filter (which I guess means it's doing a good job--thanks, Gmail!), so it was interesting to stumble upon the section explaining that spam filters are simply sophisticated search engines applied to your mail as it's received. As someone who, until as recently as starting this program, considered only web search or face-to-face reference transactions when thinking of information retrieval, seeing its many other applications was quite interesting.
2. It was also interesting to see how web search engines have dealt with ambiguous or incredibly short search terms (the example used in the book: "cats"). I remember when Google first began listing popular search terms in a drop-down box beneath the search bar, and when they introduced auto-complete. There was a lot of feedback online, with many people wishing the system would just "let me finish my search!" without realizing that, to Google, their searches sucked. I'd considered it a convenience feature for the user, never thinking it was actually a tool Google uses to better complete its task.
3. I found it interesting that "[meta /]" tags (which blogspot refuses to let me use with appropriate "< >" notation, irksomely) in the headers of web pages weren't mentioned in either the explanation of attributes searched by search engines or in explanations of how websites are searched in general. Why is that (covered in a later chapter, perhaps)?
1. Early in the article, the authors stipulate that the difference between a document and a database record is essentially its organization and structure. A document is largely composed of unstructured text while a database record has a clear organization (like a bank statement) (2). However, how much unstructured text changes something from a database record into a document? Is it always a clear cut distinction?
ReplyDelete2. The authors also discuss methods of evaluating a search engine’s usefulness, including the “clickthrough” method. This basically entails tracking a document to see how many clicks it gets – the more the better (6). Is this really an effective method? How do the search engineers know whether it was a human or robot that clicked the link? If I go click on a document 30 consecutive times, will the engineers know that it was a single user randomly clicking? Furthermore, even if the document was clicked, that click alone doesn’t prove the document’s relevance or usefulness. How is this accounted for?
3. The authors claim that “the users of a search engine are the ultimate judges of quality” (6). While I can see the logic of this statement, isn’t it also problematic? How can search engineers get accurate and comprehensive feedback from users? It seems unlikely that every user could be completely satisfied with a search engine – how do search engineers deal with this?
1) It is interesting that, at least as of 2009, most people who end up working in information retrieval have never actually formally studied it (p. 10). This seems like a major oversight, especially given that information retrieval is such a user-focused field and would thus need a very distinct skill set from general computer science. How was this issue overlooked for so long?
ReplyDelete2) The authors touch briefly on the problems of retrieving visual information, and the advances that are being made in image retrieval (p. 3). Just the last few years have seen Google Image Search roll out a feature in which the user can click and drag an existing image into the searchbox and retrieve optically similar results, which makes it much easier to, for instance, source an image that the user already has in their possession but for which they do not know the origins. The image search function works well for this purpose, but it is still extremely difficult to retrieve an image that the user remembers but does not have in their possession. How can information retrieval specialists facilitate the retrieval of this kind of visual information?
3) One major problem with nailing down an algorithm for relevance is the fact that individual users have such radically different information needs, so information retrieval systems need to suss out why the user is searching as well as what they are searching for. Is there a way to shift some of this burden back onto the user, for instance by including a small number of questions for the user to fill out about why they are searching? Is the average user’s expectation of immediate results part of the problem?
1. This article immediately brought to mind the claims made by Seadle and Greifeneder in the article we read a couple of weeks back, about the relevance and importance of linguistics to a robust information studies. It seems that a key issue in information retrieval is language, the use of language, and the ever-evolving state of usage. Someone I know who recently completed a dissertation in French linguistics actually got a job with a major computer company working on this very thing.
ReplyDelete2. In another class, I'm writing a paper about metadata in graffiti archives, from grassroots to gallery, and this article also touches on this, saying, "The current technology [2009] for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example." I hadn't even thought of this, but I definitely need to incorporate the current image-matching technology into my discussion of graffiti archiving, insofar as such technology would aid archivists in identifying artists, tags, locations, and themes.
3. This one also - especially in the discussion of the concept of "relevance" - made me think about the implications of "improved" information retrieval on the often-discussed value of "serendipitous" searching and browsing, which has, from what I've learned so far, typically been much more associated with brick-and-mortar libraries (and mourned by libraries and users). How might information retrieval specialists enable serendipitous searching in the digital environment? Things like StumbleUpon and Google's "I'm Feeling Lucky!" spring to mind first, but I'd like to know more about other, related work being done by information studies scholars, developers, and programmers in this area.
1. On pg. 3, the authors describe the different types of searches that can be done, such as peer-to-peer, desktop, enterprise, etc. Are there times when it is necessary to combine different types of searches to find what you are looking for? Could you conduct one search, then another, and easily cross-reference your findings or can you conduct many different searches all at once? Would the latter result in information that is not necessarily relevant?
ReplyDelete2. Before reading this excerpt, I had not realized that being a 'search engineer' was a real and viable profession. Is there any evidence of gaps between search engineers and usability? We have been speaking about end results and users in class, and I wonder if search engineers build a search engine, and then if it works as desired, are done with it and they do not think beyond that. Is it just that the everyday user doesn't really know what they want or is not adept at discerning relevance?
3. It is interesting how this portion of the book deals mainly with search engines that predominantly search text. Then I think of the article we read on ASR. I think about the option of using Siri on my iPhone, and how even though it is a developed technology, I still always prefer to type something out like a text message or a search in Yelp, because Siri very rarely actually matches exactly what I am saying. You have to speak very precisely and very slowly for it to properly work. I wonder if speech recognition is simply a technology continuously in development, and if it will ever really be possible for a speech recognition search engine to work as the user desires due to various contributing factors?
1. Why isn't destruction listed among the different elements of the information retrieval field (p.1)? If information that had been the most relevant for the user and/or topic is gone, are we ok just with the replacement, or do we at least want to be told that it has disappeared and how the new information compares to what has been "destroyed"?
ReplyDelete2. Why do retrieval models treat linguistic properties as a secondary factor to relevance (p.5)? Does this say anything about the information that we are trying to retrieve or the people who are trying to retrieve it?
3. "A search engine is the practical application of information retrieval techniques to large-scale text collections." How much does Google Image rely on text to do its search?
1 One core issue about information retrieval, as the author states in the paper, is relevance. To address this issue, researchers propose retrieval models. And word frequency is used to evaluate the relevance. How does this work? Is that absolute that the more frequently the word showed in an paper, the more important of the word it is? Does the frequency in some way really relate to the topic?
ReplyDelete2 If the user gives a query, how the search engine analysis this query? Based on entity, verb, or the combination? And how the search engine analyzes the intention of the user to find the feedback for them?
3 The author also mentions multimedia documents in information retrieval field. And the current technology for searching these is based on descriptions of their content rather than the contents themselves. How should we improve this? To combine with HCI to let the engines learn things as we human do?
1. At the beginning of this article, the definition of IR is ‘a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information’. I think this is too general and IR in this definition contains almost everything about information. For example, management information system also concerns the structure, analysis, organization, storage, and searching of information. So what’s the difference between IR and MIS? Or could we say MIS is just a part of IR? I don’t know.
ReplyDelete2. In table 1.1, it shows us the content, applications and tasks of IR. I wonder what the relationship is between IR and search engines. Based on the table 1.1, what is the difference among various search engines? For example, I want to know the intrinsic distinctions between Yahoo and Google. I think Google is much more popular than Yahoo. So why? Are the reasons related to what the table 1.1 shows us?
3. It seems that the biggest issue in IR is relevance. I think it is also the most difficult part in IR. For example, when people input ‘apple’ in the search box, how could we know he/she is searching the fruit ‘apple’ or the company ‘apple’? So how could we rank the search results? It is too subjective.
1. On page 4, it says that "there are many factors that go into a person's decision as to whether or not a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents." How does this really work? It's not like a Google employee is sitting next to every one of its users all the time. How do search engineers know whether or not something was relevant and useful for a person? Is it because they actually clicked the page? Maybe, it's the duration spent browsing that page? How reliable really are these retrieval models?
ReplyDelete2. After reading this, I am more curious as to how predictive search works online. Are certain phrases brought up more because of ranking algorithms? Are the phrases that come up just the popular searches of that particular day, time, and location? Past browsing habits?
3. I am wondering since there are so many different search engine companies like Google, Bing, Yahoo, etc, do these companies ever use each other's search data to make themselves more efficient? Or do they all each horde their own data for themselves so they can become the best search engine with the highest degree of easy access and relevant data?
1. Concerning the information retrieval, the author of this book mainly mentioned the cases that search inputs are text. However things are changing quickly in recent years. For example, Google is trying to develop its search inputs from text to image, voice and photos. Then, are there any differences among them?
ReplyDelete2. In the paper, the author pointed out "the 'big issues' in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction." I'm very surprised with this part since it's universally known that the coverage matters for search engines as well information retrieval. How could the authors leave this important factor behind in their assertions?
3. In the section before the last one, the author of the paper discussed the role of a search engineer and agree some types of people whose work are concerning information retrieval should be considered as search engineers. However, someone is hard to define. For instance, when I was in Baidu, there are a number of people who were responsible for assessing the algorithm of the search engine. Their jobs are easy and repetitive, grading every result of some given queries from 1 to 5 and then sending the results to the professionals in the company. Then, how about them? Could they be considered as search engineers?
1. On page 7 the author writes that open source search engines have different design goals than that of commercial search engines. It was unclear from the reading how these search engines are different. What is different in their design? Are the engines seeking to display different information?
ReplyDelete2. As described in the reading, this textbook was written to bring the concepts of information retrieval to the computer scientists that build search engines. The author writes that search engineers are computer scientists, but often do not receive training in information retrieval. I find it surprising that this area of study would not be included for those creating search engines as its primary function is information retrieval. This perhaps relates back to our reading from a few weeks ago on the differences between information science, computer science and information systems. What can search engineers learn from studying information retrieval?
3. This article mentions the Yahoo directory at dir.yahoo.com. I remember using this directory when I was younger, but I was surprised to see that it still exists. I am wondering what the use of this directory still is. I can see that it allows a person to somewhat broadly search for sites if a search is not producing great content. The directory greatly limits the results for each sub-category though. How are the sites in the sub-categories chosen? What are the criteria for deciding if a site should be in a category as there are many possible sites that are left out of them?
1. In the model of information retrieval, what’s the difference between user relevance and user’s information needs? What are the factors that affect the user relevance?
ReplyDelete2. My understanding of the distinction between information retrieval and search engine is just that information retrieval is more of a theory and search engine is more of a application. The author also says that a search engine is the practical application of information retrieval techniques to large-scale text collections. But why search engine cares about different issues or more issues than information retrieval, for example scalability and adaptability?
3. The author classifies open source search engines as a different class of system, due to different design goals from the commercial search engines. I wonder what the differences are. Don’t they aim to solve the same set of issues (performance, incorporating new data, scalability, adaptability, and etc)? Don’t they only differ in the collaboration mode?
1. In many years of internet usage, I have had to learn multiple search engine skills, some of which are mentioned in this paper, such as site search. Search results have gotten steadily more sophisticated, but very often a specialized skillset is still required. How can we continue to improve IR to recognize more “natural language”-style queries that users might enter?
ReplyDelete2. Sometimes IR systems seem to get a little bit too smart, in fact, in trying to correct for user error. When searching for a term that is very close to, but not actually a similar word, it can be quite difficult to get Google to stop turning up results for what it thinks is correct. This pretty much destroys relevance for such searches. Could we create a more modular system in which specific correction functions could be disabled?
3. The exploration of text queries is interesting as many of us have had to develop a specialized “query structure” vocabulary to try and tell the search engine what we want and how important each term is within a query, and attempting to eliminate less useful results. While teaching search engines to recognize regular language queries would be nice, should we consider search engine manipulation to be a language topic that should become part of regular education? For instance, many individuals do not know about -term as a means of eliminating unwanted parts of a query.
1. The author mentions that “all” documents have some amount of structure, including title, author, date etc. However, with the advent of communication on blogs and other informal communication such as conversation on social media how does information retrieval techniques continue to stay relevant and accurate.
ReplyDelete2. With the amount of data being added to the internet everyday does the problem of relevance grow? While the author talked a lot about miscommunication in language and coverage I’m wondering if we will reach a point when the shear amount of data makes it too difficult for a computer to do a comprehensive search of all text. Or, are we only concerned with precision when recalling an answer, thus being able to disregard all other relevant but redundant information?
3. I was intrigued when the author started to touch on the topic of spamdexing and would like to go into more discussion about some of the ways that search engines work to prevent irrelevant ad pages from popping up on search requests.
In dealing with media content, how disruptive are current tagging methods employed by users? For instance many individuals seemingly tag photos with certain terms that have little to no bearing on the actual image or video is this one reason for heading toward methods which employ image comparison for search purposes?
ReplyDeleteWe’ve discussed cost previously in regards to dealing with information in different spheres and on page 5 Croft et al., mention the advent of natural language processing. How intensive are the costs associated with natural language processing or speech recognition methods?
Search engineers are often individuals with a background primarily in computer science which may be the reason why a focus on the particular algorithms and systems has played a primary role in IR development. In spaces such as the iSchool where these individuals interact with social scientists and people from varying disciplines, is there likely to be a large shift toward user focus in IR development? Or will the dynamic nature of individual users stop that shift from occurring and the status quo continuing?
1 - It occurs to me through these readings that there's an implicit understanding that most information is textual - what happens to information retrieval when we incorporate audiovisual materials, or even physical objects, into the criteria? Is this something that's totally out of the realm of IR for various reasons, or is it because society as a whole considers information to be mostly textual?
ReplyDelete2 - While reading this article, I was thinking of a recent project designed by the UN about global perceptions of women. The ad campaign involved typing in items such as "women should" and letting autocomplete handle the rest by filling in sexist tropes (link: http://www.upworthy.com/would-you-expect-these-results-to-appear-when-you-google-women-2?c=bm1) However, it's my understanding that these results don't imply that millions of people are typing "women should not vote" into the search engine...or is it? Again, we're back to the question of information/the internet being "democratic" - are these results really demonstrating a 'popular vote' on societal perceptions of women?
3 - How has information retrieval developed as a position/field of study within IS? It's mentioned that most individuals working in IR didn't formally engage in the study of IR, so where does this knowledge/understanding come from? Is this more of a tacit knowledge gained through problem-solving skills, especially in the realm of computer science or programming? It seems like a nuanced understanding of the logical structuring of programming and technology would enable one to understand IR more than other fields would.
1. The authors talk about the idea of unstructured text within documents. This, presumably, refers to any text which has no easily discernible attributes such as title, author, date, which are pertinent to IR. Is that correct?
ReplyDelete2. I don't quite understand the concept of a peer-to-peer search. Aside from music sharing, what would be an example of this? Is this type of searching something that occurs frequently as I know the others (i.e. desktop, enterprise, etc.) do?
3. I'm not sure when this book was written, but the authors refers to linguistic vs statistical analysis in IR. According to them, statistical analysis, with the ranking algorithms being more concerned with word occurrences, is more common. Does anyone know if this has changed at all in recent years? Is statistical analysis perhaps employed because it's easier to gauge?
1. I don't mean to be pedantic but the definition if Information Retrieval provided in this article has the phrase "... retrieval of information." in it. Does this really need to be stated? Couldn't the authors have just left the definition at structure, analysis, organization, storage, and searching?
ReplyDelete2. I know that jobs exist to rate web pages based on search terms to see if those pages seem like relevant results to the terms. Is there another way besides using a human to evaluate relevance?
3. I know that a lot of websites have a search feature that allows the user to search through the information on that particular site. Assuming they use Google as their site search engine, is that search any different than searching on Google.com and specifying that Google only search that particular address?
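(Aside on the site-search question above: as far as I understand, a site-embedded Google search often amounts to an ordinary Google query with a domain restriction attached via the site: operator. The URL construction below is my guess at how such a widget could work, not anything specified in the chapter.)

```python
from urllib.parse import urlencode

def site_search_url(query, site):
    # Restrict an ordinary Google query to a single domain with the site: operator.
    return "https://www.google.com/search?" + urlencode({"q": f"site:{site} {query}"})

print(site_search_url("information retrieval", "example.edu"))
```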
1. In the definition of “Information Retrieval” from Gerard Salton, information retrieval is a field concerned with structure, analysis, organization, storage, searching, and retrieval of information. I'm confused by “searching” and “retrieval”; what is the difference between them?
2. One of the big issues in information retrieval, the vocabulary mismatch problem, is very interesting, but the authors do not offer a solution to it. Since language can be used to express the same concepts in many different ways with very different words, could a controlled vocabulary solve the mismatch problem at all? What are the disadvantages of controlled vocabularies? (A rough sketch of how a controlled vocabulary might be applied follows these questions.)
3. What role do we play in the field of information retrieval? The authors say that very few courses taught in computer science departments give students an appreciation of the variety of issues in the field of search engineering. I have learned some information retrieval concepts at the iSchool; however, I wonder how to use those concepts and tools to build a real search engine.
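(Sketch promised above, on the vocabulary mismatch question: a controlled vocabulary can be applied at query time by mapping the user's words onto preferred terms before matching. The mapping below is invented; the obvious downsides are that someone has to build and maintain it, and it can erase distinctions the user actually intended.)

```python
# Hypothetical controlled vocabulary: variant words map to a preferred term.
controlled_vocab = {
    "car": "automobile",
    "auto": "automobile",
    "movie": "motion picture",
    "film": "motion picture",
}

def normalize_query(query):
    # Replace each query word with its preferred term; unknown words pass through.
    return [controlled_vocab.get(w, w) for w in query.lower().split()]

print(normalize_query("classic car film"))  # ['classic', 'automobile', 'motion picture']
```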
1. I was surprised by the definition of filtering on page 3. I thought filtering was adding additional criteria to a search to narrow results. Their definition sounds more like Google Alerts.
2. Interesting that over probably the last 10 years(?) we went from search engines designed to find webpages to webpages now being designed to be found by search engines. These algorithms have changed the way we design, write, and code on a single page and across the site.
3. Relevance, and how search engines determine relevance, is still hazy to me. Getting additional information from the user so that the search can be refined helps with relevance, but there still seems to be a large hole that we jump over. There must be a bridge, but that bridge is poorly defined and explained.
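(One concrete version of that "bridge" is relevance feedback: for example, a Rocchio-style update that nudges the query's term weights toward documents the user marked relevant and away from ones they rejected. The vectors and weights below are toy values, not anything from the chapter.)

```python
# Term-weight vectors over a tiny vocabulary: [fish, tank, history]
query       = [1.0, 0.0, 0.0]
relevant    = [[0.8, 0.6, 0.0], [0.9, 0.4, 0.0]]   # documents the user liked
nonrelevant = [[0.1, 0.0, 0.9]]                     # documents the user rejected

alpha, beta, gamma = 1.0, 0.75, 0.15  # illustrative Rocchio-style weights

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
new_query = [alpha * q + beta * r - gamma * n
             for q, r, n in zip(query, rel_c, nonrel_c)]
print(new_query)  # the query now leans toward terms shared by the relevant documents
```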
1. The authors place test collections under the category of Evaluation. What is the purpose of test collections and why are they assembled? Are they only used for evaluation purposes or do they have other uses?
2. Croft et al. briefly mention clustering in reference to data mining. What does clustering refer to and how is it useful? (A small sketch follows these questions.)
3. Open-source search engines are mentioned on page 7 along with the names of three popular systems -- Lucene, Lemur and Galago. How do open-source search engines differ from the four types mentioned near the beginning of the chapter: Vertical, Enterprise, Desktop and Peer-to-peer?
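(Sketch promised above, on clustering: clustering groups documents by similarity without any predefined categories, which is useful for things like organizing search results or spotting near-duplicate pages. A common recipe is k-means over term vectors; scikit-learn is my choice of library here, not the chapter's.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "tropical fish and aquarium care",
    "keeping fish in a home tank",
    "open source search engine internals",
    "building an inverted index for search",
]

# Represent each document as a TF-IDF term vector, then group similar vectors.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(labels, docs)))  # the fish documents land in one cluster, the search-engine ones in the other
```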
1. The authors remark that, "The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example." This statement caught my eye. How would you search for pictures or music WITH pictures or music rather than words? How would such a search engine function? Would the technology be similar to advanced facial recognition techniques, maybe? (See the sketch after these questions.)
2. If word frequency is so important for search engines, are they essentially making their own "clouds" of words in order to create ranking algorithms? How does that work?
3. What is the difference between open source search engines and commercial search engines? The paper says they have different design goals but I don't think it specifies what they are.
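(Sketch promised above, on searching pictures with pictures: one simple family of techniques computes a compact fingerprint from the pixels themselves and compares fingerprints instead of text. Below is a toy "average hash"; the library, hash size, and threshold are my choices, not the chapter's.)

```python
from PIL import Image

def average_hash(path, size=8):
    # Shrink to an 8x8 grayscale image, then record which pixels are brighter than average.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return [p > mean for p in pixels]

def distance(h1, h2):
    # Fewer differing bits means the two images look more alike.
    return sum(a != b for a, b in zip(h1, h2))

# Hypothetical usage with invented file names:
# if distance(average_hash("query.jpg"), average_hash("candidate.jpg")) < 10:
#     print("probably the same or a very similar picture")
```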
1. On page 5, the authors suggest that we distinguish between topical relevance and user relevance, and their example seems reasonable, since different users might have different experiences and retrieval needs. But how do we take user relevance into consideration when designing and developing a retrieval system?
2. As a user-centered service, an information retrieval engine collects users' behavior and feedback in order to evaluate itself. However, how can we reduce the subjective impact of users who have poor searching habits?
3. At the end of this article, the authors describe the search engineer's role as designing and implementing new search engines. So what is the information professional's role in the IR field?
1. On page 5, the authors state that the evaluation of a search is a core issue for individuals involved in information retrieval. With that in mind, can this be attributed as a driving reason why many search engine sites (Google, Bing/Microsoft, Yahoo!) push to offer you something of a 'complete web package'? That is, your email linked to your social media linked to your search engine linked to the browser you're signed into... all in the name of returning search results more relevant to you as an individual? I realize a lot of it is also, yes, to get you to buy more things, and it is pre-emptively filtering the internet for you, which has been cited as not necessarily a good thing, BUT is there a reason for that beyond capital?
2. The authors stress the importance of text for searching for documents/images/etc. online. That said, how does Internet culture in some ways hinder or resist being searchable? Particularly on sites like Twitter and Tumblr, many users utilize tagging as something of a secondary method of communication with other users. They will leave each other messages or notes within tags for specific users they are 'conversing' with... but since none of their language adheres to any kind of search hierarchy, it makes whatever the original post was about unsearchable by the terms we'd assign to it if we were able to see it.
3. In the realm of peer-to-peer, desktop, and web searching, where does searching cloud storage fall? The devices uploading to that cloud aren't all on the same intranet, but the items within the cloud aren't available for searching on the web at large. So where would searching one's cloud fit?
1. It is mentioned that current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves. But techniques for direct comparison are emerging, and we all know some applications already exist, like Shazam. What methods do they use to make the comparison?
2. Evaluation of retrieval models and search engines has focused on using large volumes of log data, such as clickthrough data. But as we can see, clickthrough has some flaws. Sometimes users can get their answer without clicking anything, or they may click the wrong link. How can we deal with this situation? (A rough sketch of how click logs get aggregated follows these questions.)
3. The authors talk about open source search engines in this article, but they never give a clear definition of them. Are they so named because they are built on open source development platforms? What is the difference between open source search engines and Google?
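(Sketch promised above, on click logs: one common mitigation is to aggregate clicks over many sessions rather than trusting any single click, e.g., by computing a click-through rate for each query/result pair. The log records below are made up.)

```python
from collections import defaultdict

# Hypothetical log entries: (query, result shown, whether it was clicked)
log = [
    ("tropical fish", "aquarium.example/care", True),
    ("tropical fish", "aquarium.example/care", False),
    ("tropical fish", "fishing.example/trips", False),
    ("tropical fish", "aquarium.example/care", True),
]

shown, clicked = defaultdict(int), defaultdict(int)
for query, url, was_clicked in log:
    shown[(query, url)] += 1
    clicked[(query, url)] += was_clicked

for key in shown:
    # Averaging over many impressions smooths out accidental and mistaken clicks.
    print(key, clicked[key] / shown[key])
```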
1. One of the biggest issues in information retrieval is relevance. Topical relevance and user relevance are discussed. To what degree does topical relevance give accurate results? How do we determine that? And are there situations where user relevance need not be taken into account - for example, if the user searches for a topic in the broader sense?
2. Another issue discussed is the user’s information needs. Users' information needs change constantly and are not the same for each of them. Query suggestion, query expansion, and relevance feedback are some techniques to offer better results to users. What kind of a test group should be chosen for this? How large should it be? How can you make sure the particular group addresses the needs of all the users?
3. How are search engines made more scalable? With more data and people, do they tend to give out irrelevant results? So, should the algorithms be dynamic? How efficient are the present algorithms?
1. On page 3, the author discusses different types of searches and methods for data retrieval. I'm curious as to the context and scope of this list--how have these methods evolved over time, and what are the foreseeable changes?
2. Since relevancy is rather subjective in nature, to what extent do evaluation tools truly catch all that they seek to capture? Is there a way to quantify and discuss gaps in information, information that is perceived to be missing/lacking, or information that is missing/lacking but of which that factor is unknown?
3. I appreciated the author's discussion of the search engineer. Sometimes, in the details of talking about systems, processes, and algorithms, the human element can become lost. Taking the time to discuss their role, background, training, and skills is helpful in understanding the full effect of the information retrieval system and construct.
1. In this article the authors discuss several different types of search engines. One type of search engine that they describe is the open source search engine. What do you think are the benefits in using an open source search engine versus the benefits of using a proprietary search engine that has been created for a similar task? What are the drawbacks?
2. In this article the authors discuss the various applications of information retrieval, from web searching to enterprise searching to desktop searching. They state that each of these applications has different requirements due to the amount of information it will be searching as well as the type of information it will be searching. However, later they note that scalability is an issue in search engines and that search engines should be able to work on large and small collections. Do you think that you should use one engine to search both the web and your home computer or organizational network, or do you think these different applications should use different engines?
3. In this article the authors briefly discuss how information retrieval is being used to deal with different media types such as pictures, video, and audio. They state that currently most technology that is used to search these types of media is based on searching text descriptions of the media. In the Oard et al. article that we read this week, the authors attempt to create an information retrieval test collection for conversational speech. Do you think that Oard et al. are creating a test collection that uses text descriptions like this article states?
1. On the first page of the article, the authors refer to a definition of information retrieval given by Gerard Salton. The authors describe it as “appropriate and accurate.” I find it to be vague. Could we work on a better definition?
2. What is ad hoc search? It comes up on page 3, and I feel like the authors indirectly define it, and yet I don’t really know what they’re talking about…
3. On page 5, the authors say it is an “interesting” feature of retrieval models that they focus on statistical properties rather than linguistic ones. Perhaps it’s just because I’m part of the google generation or something, but this doesn’t seem at all counterintuitive to me. The idea of taking into account the parts of speech, on the other hand, is more intriguing. How is information about adjectives and nouns incorporated into the “more advanced” retrieval models that they mention? Would that be more like semantic web?
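(On how part-of-speech information might be folded in: one simple move is to tag the text and give extra weight to nouns when indexing or matching. A rough sketch using NLTK (my choice of library; the tagger data has to be downloaded once before it will run).)

```python
import nltk
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def noun_terms(text):
    # Keep only tokens tagged as nouns; an indexer might boost these terms.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word.lower() for word, tag in tagged if tag.startswith("NN")]

print(noun_terms("The quick brown fox jumps over the lazy dog"))
# expected: something like ['fox', 'dog']
```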
1. There will be a huge impact on the results of a search depending on how relevancy ranking algorithms are constructed. If people could find out how all their searches are constructed, in terms of relevancy, would this influence the retrieval systems and databases they use? I know most of these are generally considered proprietary but should they be available for public access since they affect the information seeking habits of everyone who uses them?
2. The reading mentions that linguistic features are of secondary importance to an information retrieval model. Is this because of the complexity and intricacies of individuals' speech patterns? Depending on the community, or on an individual's lexicon and linguistic inconsistencies, the same question could be constructed in several ways.
3. As I read about clickthrough data I thought about things like Wikipedia getting one of the highest rankings for almost any search. If retrieval is based, in some part, on what users “clicked” before, is this reinforcing the initial retrieval algorithm? This is based on the assumption that many users click on links in the first couple of pages of their search results.
1. In section 1.2, "The Big Issues," the authors mention that certain factors must be taken into account when designing algorithms for comparing text and ranking documents. So I am wondering: how do the algorithm designers decide the boost value for each factor?
2. The authors also mention the concepts of topical relevance and user relevance. So I am wondering how search results could be customized in a search engine. Why don't all search engines have personal accounts to record what information a user has already consumed?
3. I just came across an interesting article about search engines (linked below), and I am wondering how this happened. Is it really because the people in charge of designing the algorithms are sexist? What is the mechanism behind the search box's auto-complete functionality? (A rough sketch of how suggestions might be generated follows the link.)
http://mashable.com/2013/10/18/google-autocomplete-sexism/
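(Sketch promised above: autocomplete suggestions are generally drawn from aggregated past queries ranked by popularity, with some filtering, rather than from any individual engineer's views, which is also why they can surface ugly patterns in what large numbers of people have typed. The query log below is made up.)

```python
from collections import Counter

# Hypothetical aggregated query log with counts.
query_log = Counter({
    "weather today": 90,
    "women should have equal pay": 50,
    "women should vote": 40,
})

def autocomplete(prefix, k=3):
    # Suggest the k most frequent past queries that begin with the typed prefix.
    matches = Counter({q: c for q, c in query_log.items() if q.startswith(prefix)})
    return [q for q, _ in matches.most_common(k)]

print(autocomplete("women should"))
```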
1. I'd never given much thought to my spam filter (which I guess means it's doing a good job--thanks, Gmail!), so it was interesting to stumble upon the section explaining that spam filters are simply sophisticated search engines applied to your mail as it's received. As someone who, until as recently as starting this program, considered only web search or face-to-face reference transactions when thinking of information retrieval, seeing its many other applications was quite interesting.
2. It was also interesting to see how web search engines have dealt with ambiguous or incredibly short search terms (the example used in the book: "cats"). I remember when Google first began listing popular search terms in a drop-down box beneath the search bar, and when they introduced auto-complete. There was a lot of feedback online, with many people wishing the system would just "let me finish my search!" without realizing that, to Google, their searches sucked. I'd considered it a convenience feature for the user, never thinking it was actually a tool Google uses to better complete its task.
3. I found it interesting that "[meta /]" tags (which blogspot refuses to let me use with appropriate "< >" notation, irksomely) in the headers of web pages weren't mentioned in either the explanation of attributes searched by search engines or in explanations of how websites are searched in general. Why is that (covered in a later chapter, perhaps)?