Portrait of Laura Zubulake by Anita Kunz

When Laura Zubulake first brought her employment discrimination lawsuit to attorney James Batson in 2001, neither of them thought the case would make history. Neither did U.S. District Judge Shira Scheindlin, who presided over the case in the Southern District of New York. In fact, Scheindlin has said many times that Zubulake’s lawsuit seemed like a "garden-variety employment discrimination case." Zubulake didn’t get a promotion she thought she had earned at the global financial services firm UBS Warburg, filed a complaint with human resources and suddenly found herself at odds with [...]
Based on a website review of this year’s Inc. 5000, the following list provides a quick, non-exhaustive reference to some of the eDiscovery enablers included in the 2014 list. The sortable list includes each provider’s name, 2014 Inc. 5000 ranking (#), three-year revenue growth (%), 2013 revenue ($) and industry categorization.
Continuous Active Learning for Technology Assisted Review (How it Works and Why it Matters for E-Discovery)
Grossman and Cormack concluded that CAL demonstrated superior performance over SPL and SAL, while avoiding certain other problems associated with these traditional TAR 1.0 protocols. Specifically, in each of the eight case studies, CAL reached higher levels of recall (finding relevant documents) more quickly and with less effort than the TAR 1.0 protocols.
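The CAL protocol described above can be sketched as a simple loop: train on everything coded so far, rank the unreviewed pool, review the top-ranked batch, and repeat. The following Python sketch is an illustrative toy, not the Cormack-Grossman evaluation toolkit; the function `cal_review`, the keyword scorer standing in for a real classifier, the seed query, and the sample documents are all invented for illustration.

```python
def cal_review(documents, is_relevant, score_fn, batch_size=2, budget=4):
    """Toy Continuous Active Learning (CAL) loop.

    Each round re-ranks the unreviewed pool with a scorer informed by
    all coding decisions so far, then reviews the top-ranked batch,
    stopping when the review budget is exhausted.
    """
    reviewed, found = [], []
    unreviewed = list(documents)
    while unreviewed and len(reviewed) < budget:
        # Re-rank the remaining pool using everything coded so far.
        unreviewed.sort(key=lambda d: score_fn(d, found), reverse=True)
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        for doc in batch:
            reviewed.append(doc)
            if is_relevant(doc):   # the human reviewer's coding decision
                found.append(doc)  # relevant docs inform the next round
    return reviewed, found

def score(doc, found):
    """Stand-in for a classifier: keyword seed, then term overlap."""
    if not found:
        return int("merger" in doc)  # seed query before any coding exists
    vocab = {w for f in found for w in f.split()}
    return sum(w in vocab for w in doc.split())

docs = ["merger due diligence", "lunch menu", "merger valuation memo",
        "holiday party", "due diligence checklist", "parking notice"]
relevant = lambda d: "merger" in d or "diligence" in d  # ground truth

reviewed, found = cal_review(docs, relevant, score)
# All three relevant documents are found after reviewing only four of six.
```

The point of the loop, and of CAL generally, is that each relevant document found immediately improves the ranking for the next batch, which is why it tends to reach high recall with less review effort than one-shot training protocols.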
Since its 2007 introduction, kCura’s Relativity product has become one of the world’s leading attorney review platforms. One element of Relativity’s strong growth and marketplace acceptance has been kCura’s focus on and support of partnerships. As a by-product of review platform research, this post presents a simple, sortable table aggregating kCura Premium Hosting Partners and Consulting Partners.
A random-only search method for selecting predictive coding training documents is ineffective. The same applies to any other training method if it is applied to the exclusion of all others. Any experienced searcher knows this.
The results presented here do not support the commonly advanced position that seed sets, or entire training sets, must be randomly selected [19, 28] [contra 11]. Our primary implementation of SPL, in which all training documents were randomly selected, yielded dramatically inferior results to our primary implementations of CAL and SAL, in which none of the training documents were randomly selected.
Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part Two
Cormack and Grossman set up an ingenious experiment to test the effectiveness of three machine learning protocols. It is ingenious for several reasons, not the least of which is that they created what they call an “evaluation toolkit” to perform the experiment. They have even made this same toolkit, this same software, freely available for use by any other qualified researchers. They invite other scientists to run the experiment for themselves. They invite open testing of their experiment. They invite vendors to do so too, but so far there have been no takers.
I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system. The catalyst for this current interest is “Evaluation of Machine Learning Protocols for Technology Assisted Review in Electronic Discovery”, recently presented at SIGIR by Gord Cormack and Maura Grossman.
The Grossman-Cormack article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” has kicked off some useful discussions. Here are our comments on two blog posts about the article, one by Ralph Losey, the other by John Tredennick and Mark Noel.
The easiest way to select training documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense.
Data transfer risk may be minimized by automation and standards or increased by the requirement of human intervention. As automation and standards are still slowly maturing in the realm of electronic discovery technology, it seems important that legal professionals understand and properly consider the impact of potential data transfer risk as they plan, source, and conduct their electronic discovery activities.
Since the last century, text analysis has been the primary tool used to classify documents, but its durability as the tool of choice doesn’t mean that it remains the best choice.
International Legal Technology Association: “Released in May 2014, Legal Technology Future Horizons (LTFH) is a report that provides insights and practical ideas to inform the development of future business and IT strategies for law firms, law departments and legal technology vendors. The research, analysis and interpretation of the findings were undertaken by Fast Future Research and led by Rohit Talwar.”
To increase the understanding of those considering engaging with eDiscovery vendors, this post provides a simple, subjective analysis of the operational center of gravity for approximately 130 providers.
The hype cycle around Predictive Coding/Technology Assisted Review (PC/TAR) has focused on court acceptance and actual review cost savings. The last couple of weeks have seen a bit of blogging kerfuffle over the conclusions, methods and implications of the new study by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery”. Pioneering analytics guru Herbert L. Roitblat of OrcaTec has published two blogs (first and second links) critical of the study and its conclusions. As much as I love a spirited debate and have my own history of ‘speaking truth’ in the public forum, I can’t help wondering if this tussle over Continuous Active Learning (CAL) vs. Simple Active Learning (SAL) has lost view of the forest while looking for the tallest tree in it.
Proportionality. Parties are expected to use reasonable, good faith and proportional efforts to preserve, identify and produce relevant information. This includes identifying appropriate limits to discovery, including limits on custodians, identification of relevant subject matter, time periods for discovery and other parameters to limit and guide preservation and discovery issues.
The State Bar of California has issued perhaps the country’s most straightforward and candid directive to litigators to learn the ins and outs of electronic discovery (e-discovery). In a proposed formal opinion, it states, “Not every litigated case ultimately involves e-discovery; however, in today’s technological world, almost every litigation matter potentially does.”
Calculating MTV Ratio and True Recall

Many tools designed to search or classify documents as part of the enterprise content management and electronic discovery functions in organizations depend on having accurate textual representations of the documents being analyzed or indexed. They have text-tunnel vision: they cannot “see” non-textual objects. If the only documents of interest were text-based, that could be an excusable shortcoming, depending on what tasks were being performed. However, there are some collections where as many as one-half of the documents of interest contain no textual representation.
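The excerpt does not spell out the MTV or True Recall formulas, so the following Python sketch is a hedged illustration of the underlying problem: a text-only tool can report perfect recall over the documents it can "see" while missing every non-textual relevant document. The function name `text_blind_recall`, the field names, and the sample data are illustrative assumptions, not the article's definitions.

```python
def text_blind_recall(docs, retrieved):
    """Contrast recall over text-bearing documents with recall over all
    documents, to show how text-tunnel vision can inflate reported recall.

    docs: list of dicts with 'id', 'relevant' (bool), 'has_text' (bool).
    retrieved: set of ids returned by a text-only search tool.
    Returns (recall over text-bearing relevant docs,
             recall over ALL relevant docs).
    """
    rel = [d for d in docs if d["relevant"]]
    rel_text = [d for d in rel if d["has_text"]]
    hits = lambda pool: sum(d["id"] in retrieved for d in pool)
    text_recall = hits(rel_text) / len(rel_text) if rel_text else 0.0
    true_recall = hits(rel) / len(rel) if rel else 0.0
    return text_recall, true_recall

# Half of the relevant documents are image-only (no extractable text).
docs = [
    {"id": 1, "relevant": True,  "has_text": True},
    {"id": 2, "relevant": True,  "has_text": True},
    {"id": 3, "relevant": True,  "has_text": False},  # e.g., scanned image
    {"id": 4, "relevant": True,  "has_text": False},  # e.g., CAD drawing
    {"id": 5, "relevant": False, "has_text": True},
]
retrieved = {1, 2, 5}  # a text-only tool can only ever return 1, 2, 5

text_recall, true_recall = text_blind_recall(docs, retrieved)
# text_recall is 1.0, but true_recall is only 0.5.
```

In a collection where half the documents of interest have no textual representation, as the excerpt describes, the gap between these two numbers is exactly the blind spot the article is warning about.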
In this post, I want to focus more on the science in the Cormack and Grossman article. It seems that several flaws in their methodology render their conclusions not just externally invalid (they don’t apply to systems that they did not study) but internally invalid as well (they don’t apply to the systems they did study).
When EDRM, the organization that created the Electronic Discovery Reference Model, launched its Information Governance Reference Model (IGRM), I wondered how long it would take for this day to come. The wait is over. Information governance (IG) has taken its place on the EDRM. In this post I will take a look at the changes and consider whether they have gone too far, have got it just right, or maybe have a little room for more tweaks.
Provided for your review is a short but important comment from a recent LinkedIn Group discussion on the evaluation of machine learning protocols for technology-assisted review (TAR). The comment raised a problem with TAR: the fact that it is limited to text.
I don’t usually comment on competitors’ claims, but I thought that I needed to address some potentially serious misunderstandings that could come out of Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” Although they find in this paper that an active learning process is superior to random sampling, it would be a mistake to think their conclusions would apply to all random sampling predictive coding regimens.