OCR integration

New in version 2.3.

To index the text contained in images, install Tesseract. Tika auto-detects Tesseract, or you can explicitly set the path to the tesseract binary. Then add an image (png, jpg, …) to your FSCrawler root directory. After the next index update, the text is indexed and placed in “_source.content”.

OCR settings

Here is a list of OCR settings (under the fs.ocr prefix):

Name                  Default value   Documentation
fs.ocr.enabled        true            Disable/Enable OCR
fs.ocr.language       "eng"           OCR Language
fs.ocr.path           null            OCR Path
fs.ocr.data_path      null            OCR Data Path
fs.ocr.output_type    txt             OCR Output Type
fs.ocr.pdf_strategy   ocr_and_text    OCR PDF Strategy
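
As a sketch, the defaults from the table above can be written out explicitly in ~/.fscrawler/test/_settings.yaml. The job name and url are the placeholders used throughout this page. path and data_path are left commented out because they default to null, and the paths shown for them are only illustrative:

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    language: "eng"
    # path: "/path/to/tesseract/bin/"                      # placeholder, default is null
    # data_path: "/path/to/tesseract/share/tessdata/"      # placeholder, default is null
    output_type: "txt"
    pdf_strategy: "ocr_and_text"
```

Since every value here matches its default, this file should behave the same as one with no ocr section at all. It is mainly useful as a starting point for changing individual settings.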

Disable/Enable OCR

New in version 2.7.

You can disable OCR entirely by setting the fs.ocr.enabled property to false in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false

By default, OCR is activated if tesseract can be found on your system.

OCR Language

If you are using the default Docker image (see Using docker) or have installed any of the Tesseract languages, you can use them when parsing your documents by setting the fs.ocr.language property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng"

Note

You can define multiple languages by using the + sign as a separator:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng+fas+fra"

OCR Path

If your Tesseract application is not available in the default system PATH, you can define the path to use by setting the fs.ocr.path property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"

When you set it, it’s highly recommended to set the OCR Data Path.

OCR Data Path

Set the path to the ‘tessdata’ folder, which contains language files and config files, if Tesseract cannot be detected automatically. You can define the path to use by setting the fs.ocr.data_path property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"
    data_path: "/path/to/tesseract/share/tessdata/"
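
Putting the last three settings together, here is a sketch for a Tesseract installation outside the system PATH that also uses several languages. The paths are the same placeholders as above, and the example assumes the matching language files are present in that tessdata folder:

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # Placeholder locations; adjust to where Tesseract is installed.
    path: "/path/to/tesseract/bin/"
    data_path: "/path/to/tesseract/share/tessdata/"
    # Multiple languages joined with +, as described in OCR Language.
    language: "eng+fra"
```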

OCR Output Type

New in version 2.5.

Set the output type of the OCR process. The fs.ocr.output_type property can be set to txt or hocr in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    output_type: "hocr"

Note

When omitted, the txt value is used.

OCR PDF Strategy

By default, FSCrawler also tries to extract images from your PDF documents and run OCR on them. This can be a CPU-intensive operation. If you want to run OCR only on images and not on PDF documents, you can set fs.ocr.pdf_strategy to "no_ocr" or to "auto":

name: "test"
fs:
  ocr:
    pdf_strategy: "auto"

Supported strategies are:

  • auto: No OCR is performed on a PDF document if more than 10 characters are extracted from it. See PDFParser OCR Options.
  • no_ocr: No OCR is performed on PDF documents. OCR might still be performed on images, though, unless OCR is disabled. See Disable/Enable OCR.
  • ocr_only: Only OCR is performed.
  • ocr_and_text: OCR and text extraction is performed.

Note

When omitted, the ocr_and_text value is used. If you have performance issues, it’s worth using the auto option instead, as only documents with almost no text will go through the OCR process.
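
For comparison, here is a sketch of the no_ocr strategy described above, which skips OCR on PDF documents while leaving OCR on image files enabled. The job name and url are the same placeholders used earlier on this page:

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # OCR stays enabled, so image files can still go through Tesseract.
    enabled: true
    # PDF documents are handled by text extraction only.
    pdf_strategy: "no_ocr"
```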