OCR integration

New in version 2.3.

To index text contained in images, install Tesseract. Tika detects Tesseract automatically, or you can explicitly set the path to the tesseract binary. Then add an image (png, jpg, …) to your FSCrawler root directory. After the next index update, the extracted text is indexed and stored in “_source.content”.

OCR settings

Here is a list of OCR settings (under the fs.ocr prefix):

| Name | Default value | Documentation |
|------|---------------|---------------|
| fs.ocr.enabled | true | Disable/Enable OCR |
| fs.ocr.language | "eng" | OCR Language |
| fs.ocr.path | null | OCR Path |
| fs.ocr.data_path | null | OCR Data Path |
| fs.ocr.output_type | txt | OCR Output Type |
| fs.ocr.pdf_strategy | ocr_and_text | OCR PDF Strategy |
| fs.ocr.page_seg_mode | 1 | OCR Page Seg Mode |
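
As an illustration, the defaults listed above correspond roughly to the following explicit configuration. This is only a sketch: the job name "test" and the url are the placeholder values used throughout this page, fs.ocr.path and fs.ocr.data_path are left out because they default to null, and writing page_seg_mode as a plain integer is an assumption:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # Every value below is the documented default, spelled out explicitly
    enabled: true
    language: "eng"
    output_type: "txt"
    pdf_strategy: "ocr_and_text"
    page_seg_mode: 1

Since these are the defaults, omitting the whole ocr block should give the same behavior. Listing them can still be useful as a starting point before changing individual values.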

Disable/Enable OCR

New in version 2.7.

You can disable OCR completely by setting the fs.ocr.enabled property to false in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false

By default, OCR is enabled if Tesseract can be found on your system.

OCR Language

If you are using the default Docker image (see Using docker) or if you have installed any of the Tesseract Languages, you can use them when parsing your documents by setting the fs.ocr.language property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng"

Note

You can define multiple languages by using the + sign as a separator:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng+fas+fra"

OCR Path

If your Tesseract application is not available in the default system PATH, you can define the path to use by setting the fs.ocr.path property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"

When you set fs.ocr.path, it is highly recommended to also set the OCR Data Path.

OCR Data Path

Set the path to the ‘tessdata’ folder, which contains the language files and config files, if Tesseract cannot be detected automatically. You can define the path to use by setting the fs.ocr.data_path property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"
    data_path: "/path/to/tesseract/share/tessdata/"

OCR Output Type

New in version 2.5.

Set the output type of the OCR process. The fs.ocr.output_type property can be set to txt or hocr in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    output_type: "hocr"

Note

When omitted, the txt value is used.

OCR PDF Strategy

By default, FSCrawler also tries to extract images from your PDF documents and run OCR on them. This can be a CPU-intensive operation. If you want to run OCR only on images and not on PDF documents, you can set fs.ocr.pdf_strategy to "no_ocr" or to "auto":

name: "test"
fs:
  ocr:
    pdf_strategy: "auto"

Supported strategies are:

  • auto: No OCR is performed on a PDF document if more than 10 characters can be extracted from it. See PDFParser OCR Options.

  • no_ocr: No OCR is performed on PDF documents. OCR may still be performed on images unless OCR is disabled. See Disable/Enable OCR.

  • ocr_only: Only OCR is performed.

  • ocr_and_text: Both OCR and text extraction are performed.

Note

When omitted, the ocr_and_text value is used. If you have performance issues, it’s worth using the auto option instead, as only documents with almost no text will go through the OCR process.
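
The strategies listed above include "no_ocr", for which no sample is shown. A minimal sketch of an images-only setup could look like the following, assuming the same placeholder job name and url used elsewhere on this page. Here fs.ocr.enabled is written out only to stress that OCR itself stays on; it is already the default:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # OCR stays enabled, so standalone images are still processed
    enabled: true
    # PDF documents only go through text extraction, not OCR
    pdf_strategy: "no_ocr"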

OCR Page Seg Mode

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N, the value of fs.ocr.page_seg_mode, are:

  • 0 = Orientation and script detection (OSD) only.

  • 1 = Automatic page segmentation with OSD.

  • 2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)

  • 3 = Fully automatic page segmentation, but no OSD.

  • 4 = Assume a single column of text of variable sizes.

  • 5 = Assume a single uniform block of vertically aligned text.

  • 6 = Assume a single uniform block of text.

  • 7 = Treat the image as a single text line.

  • 8 = Treat the image as a single word.

  • 9 = Treat the image as a single word in a circle.

  • 10 = Treat the image as a single character.

  • 11 = Sparse text. Find as much text as possible in no particular order.

  • 12 = Sparse text with OSD.

  • 13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
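
For example, if each of your images contains a single uniform block of text, mode 6 from the list above might be a reasonable choice. The following is a sketch only: it assumes the value is written as a plain integer under fs.ocr, like the other settings on this page, and it reuses the placeholder job name and url:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # 6 = Assume a single uniform block of text
    page_seg_mode: 6

Which mode gives the best results depends on your documents, so it is worth comparing the indexed content for a few sample files before changing the default.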

OCR Preserve Interword Spacing

Spaces between the words will be deleted.