Predictive Coding: A Working List of Technologies and a One-Question Provider Implementation Survey


Predictive Coding Technologies:  A Deeper Look

As defined in The Grossman-Cormack Glossary of Technology-Assisted Review(1), Predictive Coding is an industry-specific term generally used to describe a technology-assisted review process involving the use of a machine learning algorithm to distinguish relevant from non-relevant documents, based on a subject matter expert’s coding of a training set of documents.  This definition provides a baseline description that identifies one particular function that a general set of commonly accepted machine learning algorithms may be used for in technology-assisted review.

With the growing awareness and use of the technology-assisted review feature of predictive coding in the legal arena today, it is increasingly important for electronic discovery professionals to have a general understanding of the algorithmic approaches that may be implemented in electronic discovery platforms to facilitate predictive coding of electronically stored information.  This general understanding is important because each potential algorithmic approach has efficiency advantages and disadvantages that may impact the efficacy of predictive coding.

To help in developing this general understanding of potential predictive coding algorithms and to provide an opportunity for electronic discovery providers to share the approaches they use in their platforms to accomplish predictive coding, the following working list of predictive coding technologies and corresponding one-question provider implementation survey are provided for your consideration and use.

Note: The running results of a previously presented general survey on eDiscovery provider use of predictive coding are available for review(2) (click here for survey results).  The initial 120-second survey(3) (click here for initial survey form) contained six high-level questions related to the technology development, offering integration, machine learning approach and sampling approach of providers in relation to predictive coding.  The following working list and one-question provider implementation survey are designed to build on the machine learning question from the initial general survey by providing additional and important layers of detail.

A Working List of Predictive Coding Technologies

Courtesy of industry search expert Herb Roitblat, provided below is a working list of identified machine learning approaches that have been applied or have the potential to be applied to the discipline of eDiscovery to facilitate predictive coding.  This working list is designed to provide a reference point for identified predictive coding technologies and may over time include additions, adjustments and/or amendments based on feedback from experts and organizations applying and implementing these technologies in their specific eDiscovery platforms.

Listed in Alphabetical Order

  • Active Learning: An iterative process that presents for reviewer judgment those documents that are most likely to be misclassified.  In conjunction with Support Vector Machines, it presents those documents that are closest to the current position of the separating line.  The line is moved if any of the presented documents has been misclassified.
  • Language Modeling:  A mathematical approach that seeks to summarize the meaning of words by looking at how they are used in the set of documents.  Language modeling in predictive coding builds a model for word occurrence in the responsive and in the non-responsive documents and classifies documents according to the model that best accounts for the words in a document being considered.
  • Latent Semantic Analysis:  A mathematical approach that seeks to summarize the meaning of words by looking at the documents that share those words.  LSA builds up a mathematical model of how words are related to documents and lets users take advantage of these computed relations to categorize documents.
  • Linguistic Analysis:  Linguists examine responsive and non-responsive documents to derive classification rules that maximize the correct classification of documents.
  • Naïve Bayesian Classifier:  A system that examines the probability that each word in a new document came from the word distribution derived from trained responsive documents or from trained non-responsive documents.  The system is naïve in the sense that it assumes that all words are independent of one another.
  • Nearest Neighbor Classifier:  A classification system that categorizes documents by finding an already classified example that is very similar (near) to the document being considered. It gives the new document the same category as the most similar trained example.
  • Probabilistic Latent Semantic Analysis:  A second mathematical approach that seeks to summarize the meaning of words by looking at the documents that share those words.  PLSA builds up a mathematical model of how words are related to documents and lets users take advantage of these computed relations to categorize documents.
  • Relevance Feedback:  A computational model that adjusts the criteria for implicitly identifying responsive documents following feedback by a knowledgeable user as to which documents are relevant and which are not.
  • Support Vector Machine:  A mathematical approach that seeks to find a line that separates responsive from non-responsive documents so that, ideally, all of the responsive documents are on one side of the line and all of the non-responsive ones are on the other side.
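
To make two of the entries above more concrete, the following is a minimal sketch of a Naïve Bayesian Classifier, with an Active Learning style selection step layered on top of it.  It is an illustration only and does not represent how any particular provider implements predictive coding.  The training documents, the function names (train, log_odds, classify, most_uncertain), the whitespace tokenizer and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

LABELS = ("responsive", "non-responsive")

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()

def train(docs):
    """Count words per category from (text, label) training pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts[LABELS[0]]) | set(word_counts[LABELS[1]])
    return word_counts, doc_counts, vocab

def log_odds(model, text):
    """Log-probability of 'responsive' minus that of 'non-responsive'.

    Words are treated as independent of one another (the "naive" part).
    Assumes each category has at least one training document.
    """
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing (an assumption) avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores["responsive"] - scores["non-responsive"]

def classify(model, text):
    return "responsive" if log_odds(model, text) > 0 else "non-responsive"

def most_uncertain(model, texts, k=1):
    """Active Learning step: pick the documents whose score is closest to
    the decision boundary, i.e. those most likely to be misclassified."""
    return sorted(texts, key=lambda t: abs(log_odds(model, t)))[:k]

# Hypothetical training set coded by a subject matter expert.
TRAINING = [
    ("merger pricing agreement with competitor", "responsive"),
    ("confidential pricing discussion before merger", "responsive"),
    ("lunch menu for friday team outing", "non-responsive"),
    ("office parking and lunch schedule", "non-responsive"),
]

model = train(TRAINING)
print(classify(model, "pricing agreement for the merger"))
print(classify(model, "friday lunch schedule"))
print(most_uncertain(model, ["merger pricing", "lunch menu", "unrelated words only"]))
```

In this sketch, the document selected by most_uncertain would be sent to a reviewer, coded, added to the training set, and the model retrained.  That loop is one simple reading of the iterative process described under Active Learning, here paired with a Naïve Bayesian Classifier rather than a Support Vector Machine.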

Click here to provide specific additions, corrections and/or updates.

A One-Question Provider Implementation Survey

Provided below is a simple one-question survey designed to help electronic discovery professionals identify the specific machine learning approaches used by eDiscovery providers in delivering the technology-assisted review feature of predictive coding.  This one-question survey is a detailed follow-up to the provider-centric 120-Second Survey on predictive coding initiated earlier this year.

Representatives of leading eDiscovery providers(4) are encouraged to complete the short one-question survey on behalf of their organizations.  Results of the survey (excluding responder contact information) will be aggregated and published on the ComplexDiscovery website for use by the eDiscovery community.  (Click here for an example of responders and results from the previously initiated general survey on predictive coding.)

Provider Predictive Coding Background Information

Company / Firm (required)

Responder First and Last Name (required)

Responder Title / Role with Company (required)

Responder Email (required)

Name of Predictive Coding Platform (required)

Provider Predictive Coding Technology Implementation

Which predictive coding technologies are used in your eDiscovery platform?

Select All Technologies That Apply

Active Learning

Language Modeling

Latent Semantic Analysis

Linguistic Analysis

Naïve Bayesian Classifier

Nearest Neighbor Classifier

Probabilistic Latent Semantic Analysis

Relevance Feedback

Support Vector Machine

Other (Share In Comment Section)

Additional Clarifications and Comments

Please share any clarifications or comments that may help in understanding the answers provided.


(1) The Grossman-Cormack Glossary of Technology-Assisted Review (2013 Fed. Cts. L. Rev. 7) by Maura Grossman and Gordon Cormack. EDRM.

(2) Predictive Discovery?  Initial Results of 120-Second Provider Predictive Coding Survey (February 2013), ComplexDiscovery.

(3) Predict Coding and Providers:  A 120-Second Survey (February 2013), ComplexDiscovery.

(4) Got Technology-Assisted Review? A Short List of Providers and Terms (January 2013), ComplexDiscovery.

Current Responders and Results available at:

