Process Document Images results in no extracted text or invalid text

Netwrix Data Classification

Documents containing images, or images themselves, produce either no extracted text or invalid text.

A lack of text is typically related to the applied configuration, whereas invalid text is typically related to the image(s) being processed.

No Text
The first step is to ensure that the product’s OCR capabilities are enabled:

  1. Navigate to the “Config” section of the Administration Interface
  2. Expand “Text Processing”
  3. Select “Content Type Extraction Methods”
  4. Edit each of the image types that you wish to OCR, selecting the “Tesseract” option

To OCR images contained within documents (such as PDFs or Office documents), please also enable the “Process Document Images” mode, found within: “Config” → “Core” → “Collector”.

Tesseract requires the Visual C++ Redistributable for Visual Studio 2015 to be installed; this is available from the following link.

Invalid Text

Sometimes OCR processing will result in garbled or invalid text. Typically this is because the document is either rotated or at too low a resolution for processing (the recommended DPI is 300 for OCR processing). If this is not the case, please raise a support request, attaching the image to the request, for us to investigate further.
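
As a rough pre-check before raising a request, the resolution recorded in an image can be compared with the 300 DPI recommendation. The following is a minimal Python sketch, not part of Netwrix Data Classification: it assumes a PNG file that carries a pHYs (physical pixel dimensions) chunk, and the names png_dpi and meets_recommended_dpi are illustrative only. Many images carry no resolution metadata at all, in which case the sketch returns None and the resolution has to be judged another way.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
RECOMMENDED_DPI = 300  # recommended OCR resolution stated in this article


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if not recorded."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:
                return None  # unit 0 means aspect ratio only, no physical size
            # Unit 1 is pixels per metre; one inch is 0.0254 metres.
            return (round(x_ppu * 0.0254), round(y_ppu * 0.0254))
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, when present, precedes the image data
        pos += 12 + length
    return None


def meets_recommended_dpi(data: bytes) -> bool:
    """True only if a DPI is recorded and both axes reach the recommendation."""
    dpi = png_dpi(data)
    return dpi is not None and min(dpi) >= RECOMMENDED_DPI
```

To use it, read the file as bytes and pass them in, for example meets_recommended_dpi(open("scan.png", "rb").read()), where scan.png is a placeholder file name. The recorded DPI is only metadata: an image that was upscaled or saved with an incorrect value can still OCR poorly, so a passing result does not rule out a resolution problem.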
