Information Governance and eDiscovery: How Does Your Vendor Deal With Non-Textual Files?

Here Is What We See.  Here Is What We Can’t See.


When you evaluate information governance and electronic discovery solutions, do you ask your vendor/service provider these basic questions?

1)  Does your system/process identify both textual and non-textual ESI files?

2)  How does your system/process index and classify non-textual ESI files? (Example: image-only PDFs.)

3)  How does your system/process identify text within non-textual ESI files? (Example: graphics containing words, published to an image-only PDF.)

If your vendor/service provider cannot adequately answer these three simple questions, you may want to weigh the risk and exposure that come with leaving non-textual ESI out of your information governance and eDiscovery efforts.
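To make the first question concrete, here is a minimal sketch of the kind of check a processing step might run. It is an illustration under stated assumptions, not any vendor's actual method: the `EsiFile` record, the `triage` function, and the 20-character threshold are all hypothetical, and the sketch assumes some earlier step has already attempted text extraction on each file.

```python
from dataclasses import dataclass

# Assumed threshold: fewer visible characters than this and the file is
# treated as having no usable text. The value is illustrative only.
MIN_CHARS = 20


@dataclass
class EsiFile:
    """Hypothetical record: a file name plus whatever text extraction recovered."""
    name: str
    extracted_text: str  # empty string when extraction found nothing


def triage(files, min_chars=MIN_CHARS):
    """Split files into those with usable text and those needing review.

    Files in "needs_review" may be image-only documents that need OCR,
    or files that carry no text at all. This sketch cannot tell which.
    """
    report = {"textual": [], "needs_review": []}
    for f in files:
        # Count non-whitespace characters so blank text layers do not pass.
        visible = sum(1 for ch in f.extracted_text if not ch.isspace())
        bucket = "textual" if visible >= min_chars else "needs_review"
        report[bucket].append(f.name)
    return report


files = [
    EsiFile("contract.docx", "This agreement is made between the parties named below."),
    EsiFile("scan_0042.pdf", ""),   # e.g. an image-only PDF
    EsiFile("chart.png", "  \n"),   # whitespace only
]
report = triage(files)
print(report)
```

The point of the sketch is the second bucket: a system that reports only the files it could read gives no visibility into the files it could not.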

Good vendors share what they know they see.  Great vendors share what they may not see so you can make informed decisions as to risk and exposure.

Worthy questions. Worthy considerations. Worthy of answers (from your vendor/service provider).


  • ComplexD

    How many predictive coding vendors in legal technology only deal with text-based files?

  • Craig Ball

    Sticking my neck out, I’d say that *no* predictive coding vendor deals with non-textual files. An image of text without a searchable text layer is a text file that hasn’t been processed to extract its textual content. That’s not just a quibble, because files that are truly non-textual (emoticons, photos, records of gestures and a host of other meaningful data forms that are truly devoid of text) are ignored by TAR. We would have to pair such files with, e.g., a descriptive narrative to make them a component of a lexical analysis like predictive coding. Predictive coding is not semantic, but it is (currently) exclusively lexical in its reliance upon n-grams.

    I suggest that we stop referring to incompletely processed text files as non-textual. They are merely not YET textual compared with the onslaught of truly non-textual data we choose to ignore at our peril.