Documents are being crawled successfully but no text has been extracted

Netwrix Data Classification
https://kb.netwrix.com/3535

There are several potential reasons for text extraction to fail:

  • Invalid File (corruption, password protection, etc.)
  • Configuration (invalid file mappings)
  • Installation (missing packages)

Each content type has its own extraction configuration, with multiple potential extraction methods at its disposal. It is therefore important to perform the following steps for each content type (extension) that is experiencing issues.
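
Because troubleshooting is done per extension, it can help to see which extensions account for most of the failing documents. The following is a minimal sketch only: it assumes you already have a plain list of failing file paths (the `failed` list is made-up sample data and `count_by_extension` is a hypothetical helper, not part of Netwrix Data Classification).

```python
from collections import Counter
from pathlib import PureWindowsPath


def count_by_extension(paths):
    """Return a Counter mapping lowercase extension -> number of files."""
    counts = Counter()
    for p in paths:
        # PureWindowsPath accepts both "\\" and "/" separators and does not
        # touch the file system. suffix is "" when a file has no extension.
        ext = PureWindowsPath(p).suffix.lower() or "(none)"
        counts[ext] += 1
    return counts


# Hypothetical sample data: paths of documents with no extracted text.
failed = [
    "reports/q1.DOC",
    "reports/q2.doc",
    "scans/invoice.pdf",
    "notes/README",
]

for ext, n in count_by_extension(failed).most_common():
    print(ext, n)
```

The extension with the highest count is a reasonable place to start when working through the sections that follow.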

Verifying the Current Configuration

First, identify the extraction method currently in use:

  1. Navigate to the “Config” section of the Administration Interface
  2. Expand “Text Processing”
  3. Select “Content Type Extraction Methods”
  4. Find the appropriate Content Type from the list of available methods, then proceed to the appropriate debugging section

Content types that are not listed will be processed as “Unknown”. It is also possible to map extensions to specific content types, which allows custom extensions to be handled as known Content Types. Please see the Processing Custom Files KB article for more details.

Tika/Built-in/Aspose/iText

In the majority of cases these extraction methods will work correctly with little intervention, with three exceptions:

  1. Server Language – Tika is currently unsupported on servers running a non-English locale. In these cases the extraction method should be changed from Tika to an alternate method, or the server’s language altered to “English”. Please note – Tika has been deprecated from 5.4.8 onwards.
  2. Legacy File Types – Some older file types (such as Office OLE files) will fail to extract text due to their internal format. In these cases we recommend installing the appropriate iFilter pack and enabling the “iFilter Fallback” option, which gives the best chance of extracting text from the affected documents.
  3. Protected Files – Extracting text from encrypted or password-protected files is not currently supported.
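
To tell whether a failing document is a legacy OLE file, one option is to look at the first bytes of the file rather than trusting its extension. The sketch below is an independent illustration, not something the product does or exposes: `sniff_container` and `sniff_file` are hypothetical helper names, and the check relies only on well-known file signatures (OLE2 compound file, ZIP, PDF).

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .doc/.xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header (.docx/.xlsx/.pptx)
PDF_MAGIC = b"%PDF-"                           # PDF header


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes: 'ole', 'zip', 'pdf' or 'unknown'."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file on disk and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A result of `"ole"` for a file with a modern extension such as `.docx` is worth a closer look: password-protected Office Open XML documents are typically stored inside an OLE container rather than a plain ZIP, so this can point to a protected file rather than a legacy one.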

iFilter

iFilters are a widely used generic way of extracting text on a Windows machine. We recommend troubleshooting iFilter processing with the following steps:

  1. Verify that the iFilter pack is installed (Office iFilter pack, Adobe PDF iFilter or a custom internal iFilter)
  2. Review the “Text Extraction Failures” summary on the main Dashboard, where failures are grouped by error code. Typical failures relate to permissions or to password-protected files, which cannot be extracted. Selecting “Details” shows a list of the affected files for review.
  3. Verify one (or more) of your test documents: are you able to open them successfully? File corruption/encryption will result in no text being extracted.
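
When many test documents need checking, opening each one by hand is slow. For modern Office formats (.docx, .xlsx, .pptx), which are ZIP packages, a rough integrity check can be scripted. This is a sketch under stated assumptions: `check_ooxml` is a hypothetical helper, it covers only ZIP-based Office files, and passing it does not guarantee that an extraction method will succeed.

```python
import zipfile


def check_ooxml(path: str) -> str:
    """Return a short status string for a ZIP-based Office document."""
    # Legacy, encrypted or badly damaged files are usually not valid ZIPs.
    if not zipfile.is_zipfile(path):
        return "not-zip"
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first member with a bad CRC, or None.
            bad = zf.testzip()
            if bad is not None:
                return f"corrupt member: {bad}"
            # Every Office Open XML package contains this part.
            if "[Content_Types].xml" not in zf.namelist():
                return "zip-but-not-ooxml"
    except zipfile.BadZipFile:
        return "corrupt"
    return "ok"
```

Files reported as anything other than `"ok"` are candidates for the corruption or encryption cases described above and are best confirmed by opening them in their native application.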

Additional logging can be enabled when testing specific documents to identify particular failure reasons by setting the error mode to “Errors, Warnings & Info” and enabling the “Collector” trace mode.

Tesseract

Tesseract is the default OCR extraction method – please refer to the following KB article: https://kb.netwrix.com/3517
