OCR integration
New in version 2.3.
To index the text contained in images, install Tesseract. Tika auto-detects Tesseract, or you can explicitly set the path to the Tesseract binary. Then add an image (png, jpg, …) to your FSCrawler root directory. After the next index update, the text will be indexed and placed in “_source.content”.
OCR settings
Here is a list of OCR settings (under the fs.ocr prefix):
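Name | Default value | Documentation
---|---|---
fs.ocr.enabled | true | Disable/Enable OCR
fs.ocr.language | "eng" | OCR Language
fs.ocr.path | null | OCR Path
fs.ocr.data_path | null | OCR Data Path
fs.ocr.output_type | txt | OCR Output Type
fs.ocr.pdf_strategy | ocr_and_text | OCR PDF Strategy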
Disable/Enable OCR
New in version 2.7.
You can disable OCR completely by setting the fs.ocr.enabled property to false in your ~/.fscrawler/test/_settings.yaml file:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false
By default, OCR is enabled if Tesseract can be found on your system.
OCR Language
If you are using the default Docker image (see Using docker) or have installed any of the Tesseract Languages, you can use them when parsing your documents by setting the fs.ocr.language property in your ~/.fscrawler/test/_settings.yaml file:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng"
Note
You can define multiple languages by using the + sign as a separator:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    language: "eng+fas+fra"
OCR Path
If your Tesseract application is not available on the default system PATH, you can define the path to use by setting the fs.ocr.path property in your ~/.fscrawler/test/_settings.yaml file:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"
When you set it, it’s highly recommended to also set the OCR Data Path.
OCR Data Path
If Tesseract cannot be detected automatically, set the path to the ‘tessdata’ folder, which contains the language files and config files. You can define the path to use by setting the fs.ocr.data_path property in your ~/.fscrawler/test/_settings.yaml file:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    path: "/path/to/tesseract/bin/"
    data_path: "/path/to/tesseract/share/tessdata/"
OCR Output Type
New in version 2.5.
Set the output type of the OCR process. The fs.ocr.output_type property can be set to txt or hocr in your ~/.fscrawler/test/_settings.yaml file:
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    output_type: "hocr"
Note
When omitted, the txt value is used.
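As an illustration, the settings described so far can be combined in a single ocr block. This is only a sketch: the paths are the same placeholders used above, and the values shown are examples rather than recommendations.

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # OCR is on by default when Tesseract is found; shown here for clarity
    enabled: true
    # Use the English and French language files (they must be installed)
    language: "eng+fra"
    # Placeholder paths; adjust them to your Tesseract installation
    path: "/path/to/tesseract/bin/"
    data_path: "/path/to/tesseract/share/tessdata/"
    # Same value that is used when output_type is omitted
    output_type: "txt"
```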
OCR PDF Strategy
By default, FSCrawler also tries to extract images from your PDF documents and run OCR on them. This can be a CPU-intensive operation. If you want to run OCR only on images and not on PDF documents, you can set fs.ocr.pdf_strategy to "no_ocr", or to "auto" to run OCR only on PDF documents that yield almost no text:
name: "test"
fs:
  ocr:
    pdf_strategy: "auto"
Supported strategies are:
auto: No OCR is performed on PDF documents if more than 10 characters are extracted. See PDFParser OCR Options.
no_ocr: No OCR is performed on PDF documents. OCR might still be performed on images unless OCR is disabled. See Disable/Enable OCR.
ocr_only: Only OCR is performed.
ocr_and_text: OCR and text extraction are both performed.
Note
When omitted, the ocr_and_text value is used. If you have performance issues, it’s worth using the auto option instead, as only documents with almost no text will go through the OCR process.
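For example, to run OCR on image files only and skip it for PDF documents, the no_ocr strategy listed above could be set as follows. This is a minimal sketch; the url value is a placeholder.

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # PDF documents go through text extraction only;
    # images are still OCRed as long as fs.ocr.enabled is not false
    pdf_strategy: "no_ocr"
```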
OCR Page Seg Mode
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD.
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
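A minimal sketch of selecting one of these modes, assuming the setting is exposed as fs.ocr.page_seg_mode. That property name is not stated in this section, so check it against your FSCrawler version. Mode 6 is picked only for illustration.

```yaml
name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    # Assumed property name; 6 = assume a single uniform block of text
    page_seg_mode: 6
```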
OCR Preserve Interword Spacing
Spaces between the words will be deleted.