Research: A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)
Applicability: “A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)” offers unique insight into the temporal dynamics of term distributions, which may hold implications for the design of search systems. The growing importance of real-time search brings with it several information retrieval challenges; this paper frames one such challenge: rapid changes to term distributions, particularly for queries.
Abstract: The real-time nature of Twitter means that term distributions in tweets and in search queries change rapidly: the most frequent terms in one hour may look very different from those in the next. Informally, we call this phenomenon “churn”. Our interest in analyzing churn stems from the perspective of real-time search. Nearly all ranking functions, machine-learned or otherwise, depend on term statistics such as term frequency, document frequency, as well as query frequencies. In the real-time context, how do we compute these statistics, considering that the underlying distributions change rapidly? In this paper, we present an analysis of tweet and query churn on Twitter, as a first step to answering this question. Analyses reveal interesting insights on the temporal dynamics of term distributions on Twitter and hold implications for the design of search systems.
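The churn phenomenon the abstract describes can be made concrete with a small sketch: count term frequencies in one time window, then measure how many of the top-k terms drop out of the top k in the next window. The tokenization, window contents, and the `churn` function below are illustrative assumptions for this summary, not the paper's actual methodology.

```python
from collections import Counter

def top_terms(tokenized_docs, k=10):
    """Return the k most frequent terms across a batch of tokenized documents."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc)
    return [term for term, _ in counts.most_common(k)]

def churn(batch_a, batch_b, k=10):
    """Fraction of the top-k terms in batch_a that fall out of the top k in batch_b.

    1.0 means complete turnover between the two time windows; 0.0 means none.
    """
    a, b = set(top_terms(batch_a, k)), set(top_terms(batch_b, k))
    return 1.0 - len(a & b) / k

# Toy example: two "hours" of tweets with partially overlapping vocabularies.
hour1 = [["breaking", "news", "storm"], ["storm", "update"], ["news", "storm"]]
hour2 = [["election", "results"], ["election", "news"], ["results", "live"]]
print(churn(hour1, hour2, k=3))  # ≈ 0.67: two of the three top terms turned over
```

A production system would apply the same idea with sliding windows over the live tweet and query streams, which is what makes static collection statistics unreliable in the real-time setting.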
Analysis: Summarized analysis from this paper includes observations on:
Authors: Prepared by Jimmy Lin and Gilad Mishne of Twitter, Inc., “A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)” was submitted to and accepted by the 6th International AAAI Conference on Weblogs and Social Media (ICWSM 2012).
This entry was posted on Tuesday, June 5th, 2012 at 2:39 pm. It is filed under chronology, discover and tagged with research, social media.
Daily we read, see, and hear more and more about the tension corporate legal departments face as they decide how to source technology and talent for their eDiscovery efforts. Balancing cost, time, and complexity is a continual challenge, and the right balance today may be out of balance tomorrow. This week our cartoon and clip provide one look at the impact of technology on outsourcing (cartoon) and share considerations for right-sourcing eDiscovery (clip).
Daily we read, see, and hear more and more about how technology is changing the game of document classification and revolutionizing document review. While there may be evidence that new data governance and discovery technologies can absolutely change current approaches to document classification and review, it is important to remember that technology is only as good as its ability to be delivered, managed, and supported by vendors and integrators.
While some may dispute the existence of unstructured data, definitions for the term “unstructured data” do exist. This week our cartoon and clip provide a quick look at how people convey meaning in different ways (cartoon) and a short list of definitions for the term “unstructured data” (clip).
The focus on the technology and talent elements of an information governance vendor’s capability is certainly warranted, as these elements ultimately provide the cutting edge for the knife of task execution. However, just as there is much more to the utility of a knife than its edge (especially if you want to use it more than once), there are additional areas worthy of consideration in vendor selection if one is weighing the long-term strategic utility and viability of a vendor.
Since the advent of Technology Assisted Review (aka TAR, predictive coding or computer-assisted review), one of the open questions is whether you have to run a separate TAR process for each item in a document request. As litigation professionals know, it is rare to have only one numbered request in a Rule 34 pleading. Rather, you can expect to see scores of requests (typically as many as the local rules allow).
Attorneys and judges often rely exclusively upon “precision” and “recall” thresholds for acceptance of dichotomous classification models in what is commonly referred to in the legal industry as “predictive coding.” Because these measures fail to provide a complete understanding of the proposed model’s characteristics and efficacy, this paper will argue that interested parties should go beyond the precision and recall metrics and include other, more effective performance measurements such as the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).
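To see why precision and recall alone give an incomplete picture, both can be computed alongside AUC from the same classifier output: precision and recall depend on a single chosen cutoff, while AUC summarizes ranking quality across all cutoffs. The sketch below uses the rank-based (Mann-Whitney) formulation of AUC; the labels, scores, and 0.5 threshold are invented for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels at one fixed decision threshold."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def auc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability that a random relevant document outscores a random
    irrelevant one, with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical relevance labels and classifier scores for six documents.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(precision_recall(labels, preds))  # (0.75, 1.0) at this one threshold
print(auc(labels, scores))              # ≈ 0.89 across all thresholds
```

Note that changing the 0.5 cutoff would change precision and recall but not AUC, which is precisely the threshold-independence the paper's argument turns on.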
Courts have so far provided mixed guidance on this issue, leaving litigants guessing as to whether their choice of blending keyword and predictive coding search methodologies – if challenged by an adversary – would receive judicial imprimatur. Nevertheless, a new ruling from the Rio Tinto v. Vale litigation confirms that parties may combine these search methodologies to achieve reasonable and proportional productions of highly relevant information.
Technology-Assisted Review (hereinafter TAR) is broadly defined as the use of computer tools to determine the relevance of selected documents to any issues in a given controversy. The most utilized form of TAR, known as predictive coding, allows a human reviewer to utilize a select sample of documents to “train” a computer to recognize patterns of relevance in the universe of documents under review.
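The train-on-a-seed-set workflow described above can be sketched with a minimal text classifier: a human codes a small sample as relevant or not, the model learns term patterns from it, and the rest of the review universe is scored. The multinomial Naive Bayes approach, the class name, and the toy documents below are assumptions chosen for brevity; commercial predictive-coding engines use their own proprietary methods.

```python
import math
from collections import Counter

class SeedSetClassifier:
    """Toy multinomial Naive Bayes illustrating 'training' on a human-coded
    seed set and scoring the remaining documents. Not any vendor's engine."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)                      # coded 1 = relevant, 0 = not
        self.term_counts = {label: Counter() for label in self.priors}
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc.lower().split())
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def score(self, doc):
        """Log-odds that doc is relevant; positive favors relevance."""
        def log_lik(label):
            counts = self.term_counts[label]
            total = sum(counts.values())
            lp = math.log(self.priors[label] / sum(self.priors.values()))
            for term in doc.lower().split():
                # Laplace smoothing so unseen terms do not zero the likelihood.
                lp += math.log((counts[term] + 1) / (total + len(self.vocab)))
            return lp
        return log_lik(1) - log_lik(0)

# Hypothetical seed set coded by a human reviewer.
seed_docs = ["merger agreement draft", "merger due diligence memo",
             "lunch menu friday", "fantasy football picks"]
seed_labels = [1, 1, 0, 0]
model = SeedSetClassifier().fit(seed_docs, seed_labels)
print(model.score("revised merger agreement") > 0)  # scored relevant
```

In an actual TAR workflow this scoring pass would be followed by further rounds of human review and retraining until the model's performance stabilizes.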
In this episode of Digital Detectives, Sharon Nelson and John Simek interview Judge Andrew Peck, an expert in issues relating to electronic discovery. Together they discuss the current state of technology-assisted review, how FRCP amendments will affect the way lawyers do discovery, and best practices when using TAR.
ComplexDiscovery | Creative Commons Attribution 4.0 International