Research: A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)
Applicability: “A Study of ‘Churn’ in Tweets and Real-Time Search Queries (Extended Version)” offers unique insight into the temporal dynamics of term distributions, which may hold implications for the design of search systems. The growing importance of real-time search brings with it several information retrieval challenges; this paper frames one such challenge: rapid changes to term distributions, particularly for queries. A brief illustrative sketch of measuring such churn follows the summary below.
Abstract: The real-time nature of Twitter means that term distributions in tweets and in search queries change rapidly: the most frequent terms in one hour may look very different from those in the next. Informally, we call this phenomenon “churn”. Our interest in analyzing churn stems from the perspective of real-time search. Nearly all ranking functions, machine-learned or otherwise, depend on term statistics such as term frequency, document frequency, as well as query frequencies. In the real-time context, how do we compute these statistics, considering that the underlying distributions change rapidly? In this paper, we present an analysis of tweet and query churn on Twitter, as a first step to answering this question. Analyses reveal interesting insights on the temporal dynamics of term distributions on Twitter and hold implications for the design of search systems.
Analysis: Summarized analysis from this paper includes observations on the temporal dynamics of term distributions in tweets and in real-time search queries on Twitter.
Authors: Prepared by Jimmy Lin and Gilad Mishne of Twitter, Inc., “A Study of ‘Churn’ in Tweets and Real-Time Search Queries (Extended Version)” was submitted to and accepted by the 6th International AAAI Conference on Weblogs and Social Media (ICWSM 2012).
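As a brief illustration of the churn discussed in the abstract, the following sketch (not taken from the paper; the tweet data, whitespace tokenization, and top-k threshold are assumptions for illustration only) buckets timestamped tweets by hour, builds per-hour term-frequency distributions, and reports how much the set of most frequent terms overlaps from one hour to the next.

```python
from collections import Counter, defaultdict
from datetime import datetime

def hourly_term_counts(tweets):
    """Group tweets into hourly buckets and count term frequencies per hour.

    `tweets` is assumed to be an iterable of (timestamp_string, text) pairs.
    """
    buckets = defaultdict(Counter)
    for ts, text in tweets:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H")
        buckets[hour].update(text.lower().split())
    return buckets

def top_k_overlap(counts_a, counts_b, k=10):
    """Fraction of one hour's top-k terms that also appear in the next hour's top-k.

    A low overlap between consecutive hours is one simple signal of "churn".
    """
    top_a = {term for term, _ in counts_a.most_common(k)}
    top_b = {term for term, _ in counts_b.most_common(k)}
    return len(top_a & top_b) / k

# Hypothetical usage: compare consecutive hours of a tiny, made-up tweet sample.
tweets = [
    ("2012-06-05 13:05:00", "real time search is hard"),
    ("2012-06-05 13:40:00", "term statistics change fast"),
    ("2012-06-05 14:10:00", "breaking news dominates this hour"),
    ("2012-06-05 14:45:00", "breaking news again and again"),
]
buckets = hourly_term_counts(tweets)
hours = sorted(buckets)
for prev, curr in zip(hours, hours[1:]):
    print(prev, "->", curr, "top-term overlap:", top_k_overlap(buckets[prev], buckets[curr], k=5))
```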
As lawyers, we hear a lot about the technological advances in e-discovery and information governance. How do you describe the current state of e-discovery from an opportunity and growth perspective, and how does this market opportunity impact the pulse rate of mergers, acquisitions, and investments? For lawyers purchasing e-discovery packages, there are several types of vendors and pricing models, and they need to be asking the right questions. What does the data governance solution need to do, how much does it cost, what are the time constraints, and how complex is the system?
Since its 2007 introduction, kCura’s Relativity product has become one of the world’s leading attorney review platforms. One of the elements of Relativity’s strong growth and marketplace acceptance has been kCura’s focus on and support of partnerships. A by-product of review platform research, the following simple and sortable table aggregates kCura Premium Hosting Partners and Consulting Partners.
Taken from a combination of public market sizing estimations as shared in leading electronic discovery reports, publications and posts over time, the following eDiscovery Market Size Mashup shares general worldwide market sizing considerations for both the software and service areas of the electronic discovery market for the years between 2013 and 2018.
In the wake of Judge Peck’s recent Rio Tinto opinion on technology assisted review, the ediscovery blogosphere has been repeatedly quoting its bold pronouncements that judicial acceptance of TAR “is now black letter law” and that “it is inappropriate to hold TAR to a higher standard than keywords or manual review.” And rightly so — these statements appear intended to put outdated predictive coding debates to rest once and for all. Yet a good deal of the focus is going to the question Judge Peck raises but does not fully resolve: whether disclosure of TAR seed sets may be required.
One advantage of using computer assisted review, for example predictive coding, is that the computer does, in fact, examine all of the available evidence in a document. Unlike human reviewers, the computer sees all parts of the elephant and, as a result, consistently judges documents based on the full complement of information in them. Each reviewer judgment used to train the system may be based on a sample of features, but the computer system aggregates all of these partial judgments and chooses the category that is most consistent with this aggregation of cues, rather than with any individual sample. As a result, the computer can be more consistent than the human reviewer who trains it. Under appropriate circumstances, this consistency further enhances the accuracy and reliability of computer assisted review.
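To make the aggregation-of-cues idea concrete, here is a minimal sketch (not any vendor's actual algorithm; the labels, documents, and smoothed scoring scheme are assumptions for illustration) of a simple classifier that pools term evidence from all reviewer-labeled training documents and scores a new document against each category as a whole, rather than against any single reviewer's sample of features.

```python
from collections import Counter, defaultdict
import math

def train(labeled_docs):
    """Aggregate term counts per category from reviewer-labeled documents.

    `labeled_docs` is assumed to be a list of (label, text) pairs,
    e.g. ("responsive", "merger agreement draft attached ...").
    """
    term_counts = defaultdict(Counter)   # label -> aggregated term frequencies
    doc_counts = Counter()               # label -> number of training documents
    for label, text in labeled_docs:
        term_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    return term_counts, doc_counts

def classify(text, term_counts, doc_counts):
    """Score each category by summing smoothed log-likelihoods over ALL terms.

    Every term in the document contributes to the score, so the decision
    reflects the full aggregation of cues across all training judgments.
    """
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in term_counts.items():
        total_terms = sum(counts.values())
        vocab = len(counts) or 1
        s = math.log(doc_counts[label] / total_docs)  # category prior
        for term in text.lower().split():
            s += math.log((counts[term] + 1) / (total_terms + vocab))  # Laplace smoothing
        scores[label] = s
    return max(scores, key=scores.get), scores

# Hypothetical usage with tiny, made-up training data.
training = [
    ("responsive", "merger agreement draft attached for review"),
    ("responsive", "board approval of the merger terms"),
    ("not_responsive", "lunch order for the team meeting"),
]
term_counts, doc_counts = train(training)
label, scores = classify("please review the attached merger terms", term_counts, doc_counts)
print(label, scores)
```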
Because so much useful information is unavailable to text analytics engines, they are unsuited for enterprise-scale document classification processes that involve placing documents in discrete document types so that subsequent classification-dependent initiatives can be undertaken, e.g., retention, remediation, migration, and digitization.
Many who consider Magistrate Judge Peck’s recent opinion and order in Rio Tinto PLC v. Vale S.A., which he titled “Predictive Coding a.k.a. Computer Assisted Review a.k.a. Technology Assisted Review (TAR) – Da Silva Moore Revisited,” will focus on his declaration “that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.” We’ll revisit that statement in a moment, but first note that it is also black letter law that important discovery decisions get revisited. See, e.g., The Pension Committee of the University of Montreal Pension Plan v. Banc of America Securities, subtitled “Zubulake Revisited: Six Years Later.”