Research: A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)
Applicability: “A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)” offers insight into the temporal dynamics of term distributions, which may hold implications for the design of search systems. The growing importance of real-time search brings with it several information retrieval challenges; this paper frames one such challenge, that of rapid changes to term distributions, particularly for queries.
Abstract: The real-time nature of Twitter means that term distributions in tweets and in search queries change rapidly: the most frequent terms in one hour may look very different from those in the next. Informally, we call this phenomenon “churn”. Our interest in analyzing churn stems from the perspective of real-time search. Nearly all ranking functions, machine-learned or otherwise, depend on term statistics such as term frequency, document frequency, as well as query frequencies. In the real-time context, how do we compute these statistics, considering that the underlying distributions change rapidly? In this paper, we present an analysis of tweet and query churn on Twitter, as a first step to answering this question. Analyses reveal interesting insights on the temporal dynamics of term distributions on Twitter and hold implications for the design of search systems.
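To make the idea of churn concrete, here is a minimal sketch in Python. It is not the metric used in the paper: it assumes a simplified definition in which churn is the fraction of the current hour's top-k terms that were absent from the previous hour's top-k. The function names, the whitespace tokenization, and the sample texts are all hypothetical.

```python
from collections import Counter


def top_terms(texts, k):
    """Return the k most frequent terms, breaking ties alphabetically."""
    counts = Counter(term for text in texts for term in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def churn(previous_texts, current_texts, k=2):
    """Fraction of the current top-k terms missing from the previous top-k.

    Assumed, simplified definition for illustration only.
    """
    previous_top = set(top_terms(previous_texts, k))
    current_top = top_terms(current_texts, k)
    if not current_top:
        return 0.0
    new_terms = [term for term in current_top if term not in previous_top]
    return len(new_terms) / len(current_top)


# Hypothetical tweets (or queries) from two consecutive hours.
hour_1 = ["election results tonight", "election debate", "results election"]
hour_2 = ["earthquake reported", "earthquake damage",
          "election earthquake", "election update"]

print(churn(hour_1, hour_2, k=2))
```

In this toy data, "earthquake" displaces "results" from the top two terms while "election" remains, so half of the top-2 list has turned over. A ranking function that relies on the previous hour's term statistics would be working from a partly stale picture.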
Analysis: Summarized analysis from this paper includes observations on:
Authors: Written by Jimmy Lin and Gilad Mishne of Twitter, Inc., “A Study of “Churn” in Tweets and Real-Time Search Queries (Extended Version)” was submitted to and accepted by the 6th International AAAI Conference on Weblogs and Social Media (ICWSM 2012).
This entry was posted on Tuesday, June 5th, 2012 at 2:39 pm. It is filed under chronology, discover and tagged with research, social media.
Since its 2007 introduction, kCura’s Relativity product has become one of the world’s leading attorney review platforms. One of the elements of Relativity’s strong growth and marketplace acceptance has been kCura’s focus on and support of partnerships. Provided as a by-product of review platform research and presented in the form of a simple and sortable table is an aggregation of kCura Premium Hosting Partners and Consulting Partners.
Taken from a combination of public market sizing estimations as shared in leading electronic discovery reports, publications and posts over time, the following eDiscovery Market Size Mashup shares general worldwide market sizing considerations for both the software and service areas of the electronic discovery market for the years between 2013 and 2018.
The consensus view is that after the purchase Microsoft will essentially disband Equivio and absorb its technology, its software designs, and some of its experts. Then, as Craig Ball predicts, they will wander the halls of Redmond like the great cynic Diogenes. No one seems to think that Microsoft will continue Equivio’s business.
In my previous post, I found that relevance and uncertainty selection needed similar numbers of document relevance assessments to achieve a given level of recall. I summarized this by saying the two methods had similar cost. The number of documents assessed, however, is only a very approximate measure of the cost of a review process, and richer cost models might lead to a different conclusion.
One distinction that is sometimes made is between the cost of training a document, and the cost of reviewing it. It is often assumed that training is performed by a subject-matter expert, whereas review is done by more junior reviewers. The subject-matter expert costs more than the junior reviewers—let’s say, five times as much. Therefore, assessing a document for relevance during training will cost more than doing so during review.
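The distinction between training cost and review cost can be sketched as a small cost function. This is only an illustration under stated assumptions: the function name, the document counts, and the default rates are hypothetical, with the expert rate set to five times the reviewer rate to match the example ratio in the text.

```python
def review_cost(training_docs, review_docs, expert_rate=5.0, reviewer_rate=1.0):
    """Total cost when training documents are assessed by an expert
    and the remaining documents are assessed by junior reviewers.

    Rates are relative costs per document (assumed values).
    """
    if training_docs < 0 or review_docs < 0:
        raise ValueError("document counts must be non-negative")
    return training_docs * expert_rate + review_docs * reviewer_rate


# Hypothetical process: 1,000 training documents, 4,000 reviewed documents.
flat_cost = review_cost(1000, 4000, expert_rate=1.0)  # every assessment costs the same
weighted_cost = review_cost(1000, 4000)               # expert costs five times as much

print(flat_cost, weighted_cost)
```

Under the flat model the two stages are interchangeable, but once the expert rate is higher, a method that needs fewer training assessments can come out cheaper even if the total number of documents assessed is the same.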
A critical metric in Technology Assisted Review (TAR) is recall, which is the percentage of relevant documents actually found from the collection. One of the most compelling reasons for using TAR is the promise that a review team can achieve a desired level of recall (say 75% of the relevant documents) after reviewing only a small portion of the total document population (say 5%). The savings come from not having to review the remaining 95% of the documents.
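As a rough illustration of the arithmetic, the sketch below computes recall and the share of the collection left unreviewed. The function names and all of the counts are hypothetical; they are chosen only so that the result matches the 75% recall and 5% review figures used as examples in the text.

```python
def recall(relevant_found, relevant_total):
    """Share of all relevant documents in the collection that were found."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return relevant_found / relevant_total


def fraction_not_reviewed(reviewed, collection_size):
    """Share of the collection that never had to be reviewed."""
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return 1 - reviewed / collection_size


# Hypothetical collection: 100,000 documents, 4,000 of them relevant.
# The team reviews 5,000 documents and finds 3,000 relevant ones.
print(recall(3000, 4000))
print(fraction_not_reviewed(5000, 100000))
```

Note that the total number of relevant documents is not known in practice and has to be estimated, typically by sampling, so a recall figure for a real review is an estimate rather than an exact value.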
On Oct. 7, 2014, the Wall Street Journal reported that Microsoft had signed a letter of intent to buy what they called an Israel-based text analysis startup company named Equivio. The mainstream business press has virtually no understanding of the e-discovery industry, nor anything having to do with litigation support. They also seem to have no real grasp of what kind of software Equivio and others like it in the industry have created.
By William Webber. My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in […]
ComplexDiscovery | Creative Commons Attribution 4.0 International