Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part Two
Cormack and Grossman set up an ingenious experiment to test the effectiveness of three machine learning protocols. It is ingenious for several reasons, not the least of which is that they created what they call an “evaluation toolkit” to perform the experiment. They have even made this same toolkit, this same software, freely available for use by any other qualified researchers. They invite other scientists to run the experiment for themselves. They invite open testing of their experiment. They invite vendors to do so too, but so far there have been no takers.
I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system. The catalyst for this current interest is “Evaluation of Machine Learning Protocols for Technology Assisted Review in Electronic Discovery”, recently presented at SIGIR by Gord Cormack and Maura Grossman.
The Grossman-Cormack article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” has kicked off some useful discussions. Here are our comments on two blog posts about the article, one by Ralph Losey, the other by John Tredennick and Mark Noel.
The easiest way to select training documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense.
Data transfer risk may be minimized by automation and standards or increased by the requirement of human intervention. As automation and standards are still slowly maturing in the realm of electronic discovery technology, it seems important that legal professionals understand and properly consider the impact of potential data transfer risk as they plan, source, and conduct their electronic discovery activities.
Since the last century, text analysis has been the primary tool used to classify documents, but its durability as the tool of choice doesn’t mean that it remains the best choice.
International Legal Technology Association: “Released in May 2014, Legal Technology Future Horizons (LTFH) is a report that provides insights and practical ideas to inform the development of future business and IT strategies for law firms, law departments and legal technology vendors. The research, analysis and interpretation of the findings were undertaken by Fast Future Research and led by Rohit Talwar.”
With a desire to increase the understanding of those considering engaging with eDiscovery vendors, provided is a simple subjective analysis of the operational center of gravity for approximately 130 providers.
The hype cycle around Predictive Coding/Technology Assisted Review (PC/TAR) has focused on court acceptance and actual review cost savings. The last couple of weeks have seen a bit of a blogging kerfuffle over the conclusions, methods and implications of the new study by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery”. Pioneering analytics guru Herbert L. Roitblat of OrcaTec has published two blog posts critical of the study and its conclusions. As much as I love a spirited debate and have my own history of ‘speaking truth’ in the public forum, I can’t help wondering if this tussle over Continuous Active Learning (CAL) vs. Simple Active Learning (SAL) has lost view of the forest while looking for the tallest tree in it.
Proportionality. Parties are expected to use reasonable, good faith and proportional efforts to preserve, identify and produce relevant information. This includes identifying appropriate limits to discovery, including limits on custodians, identification of relevant subject matter, time periods for discovery and other parameters to limit and guide preservation and discovery issues.
The State Bar of California has issued perhaps the country’s most straightforward and candid directive to litigators to learn the ins and outs of electronic discovery (e-discovery). In a proposed formal opinion, it states, “Not every litigated case ultimately involves e-discovery; however, in today’s technological world, almost every litigation matter potentially does.”
Calculating MTV Ratio and True Recall

Many tools designed to search or classify documents as part of the enterprise content management and electronic discovery functions in organizations depend on having accurate textual representations of the documents being analyzed or indexed. They have text-tunnel vision – they cannot “see” non-textual objects. If the only documents of interest were text-based, that could be an excusable shortcoming, depending on what tasks were being performed. However, there are some collections where as many as one-half of the documents of interest contain no textual representation.
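The excerpt above names an “MTV ratio” and “true recall” without defining them. As a minimal sketch, assuming the MTV ratio is the fraction of a collection with no usable text and that relevant material is spread evenly across textual and non-textual documents, the arithmetic might look like this (both definitions are this sketch's assumptions, not formulas from the article):

```python
def mtv_ratio(n_no_text: int, n_total: int) -> float:
    """Assumed definition: fraction of the collection that a
    text-only tool cannot 'see' (its text-tunnel-vision blind spot)."""
    return n_no_text / n_total

def true_recall(observed_recall: float, mtv: float) -> float:
    """Assumed definition: recall measured on the textual documents,
    discounted by the share of the collection the tool could never
    retrieve (assuming relevance is evenly distributed)."""
    return observed_recall * (1 - mtv)

# If half the collection has no text (the worst case the excerpt
# mentions), a tool reporting 90% recall on what it can see has
# actually recovered far less of the whole:
overall = true_recall(0.9, mtv_ratio(50, 100))
```

Under these assumptions, a reported 90% recall on a half-textless collection corresponds to an overall recall of only 45%.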
In this post, I want to focus more on the science in the Cormack and Grossman article. It seems that several flaws in their methodology render their conclusions not just externally invalid (they don’t apply to systems that they did not study) but internally invalid as well (they don’t even hold for the systems they did study).
When EDRM, the organization that created the Electronic Discovery Reference Model, launched its Information Governance Reference Model (IGRM), I wondered how long it would take for this day to come. The wait is over. Information governance (IG) has taken its place on the EDRM. In this post I will take a look at the changes and consider whether they have gone too far, have got it just right, or maybe have a little room for more tweaks.
Provided for your review is a short but important comment from a recent LinkedIn Group discussion on the evaluation of machine learning protocols for technology-assisted review (TAR). The comment identified a problem with TAR: it is limited to text.
I don’t usually comment on competitors’ claims, but I thought that I needed to address some potentially serious misunderstandings that could come out of Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” Although they find in this paper that an active learning process is superior to random sampling, it would be a mistake to think their conclusions would apply to all random sampling predictive coding regimens.
Based on a compilation of research from analyst firms and industry expert reports in the information governance arena, the following short list of enablers highlights companies and firms that may be useful in the consideration of information governance products and services.
Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review
The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P < 0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents.
Provided as a non-comprehensive overview of key and publicly announced eDiscovery related mergers, acquisitions and investments to date in 2014, the following listing highlights key industry activities through the lens of announcement date, acquired company, acquiring or investing company and acquisition amount (if known).
Using a novel evaluation toolkit that simulates a human reviewer in the loop, we compare the effectiveness of three machine-learning protocols for technology-assisted review as used in document review for discovery in legal proceedings.
Good vendors share what they know and see. Great vendors share what they may not see, so you can make informed decisions as to risk and exposure.
The proliferation of data and how it is being managed — or in most cases mismanaged — is causing more organizations to question whether they have information assets or liabilities. Two of the major drivers pushing organizations to finally get their data under control are costs and risks. “People are starting to get interested in reducing their overall data in many cases for regulatory issues,” said Dera Nevin, managing director and an electronic discovery lawyer at re:Discovery Law PC. Nevin, who was speaking to the International Legal Technology Association last week at an event hosted by Norton Rose Fulbright Canada LLP, [...]