Welcome to FSCrawler’s documentation!
Warning
This documentation is for the version of FSCrawler currently under development. Were you looking for the documentation of the latest stable version?
Welcome to the FS Crawler for Elasticsearch.
This crawler helps index binary documents such as PDF, OpenOffice, and MS Office files.
Main features:
Local file system (or mounted drive) crawling: indexes new files, updates existing ones, and removes old ones.
Remote file system crawling over SSH/FTP.
REST interface to let you “upload” your binary documents to Elasticsearch.
Note
FS Crawler 2.10-SNAPSHOT uses Tika 2.9.2 and is tested against:
Elasticsearch 6.8.23 (deprecated)
- Directory layout
- CLI options
- JVM Settings
- Configuring the logger
- Status files
- Example job file specification
- The most simple crawler
- Local FS settings
- Root directory
- Update rate
- Includes and excludes
- Filter content
- Indexing JSON docs
- Indexing XML docs
- Add as Inner Object
- Index folders
- Dealing with multiple types and multiple dirs
- Dealing with multiple types within the same dir
- Using filename as elasticsearch _id
- Adding file attributes
- Enabling raw metadata
- Disabling file size field
- Ignore deleted files
- Ignore content
- Continue on Error
- Language detection
- Storing binary source document
- Extracted characters
- Ignore Above
- File checksum
- Follow Symlinks
- Tika Config Path
- SSH settings
- FTP settings
- Elasticsearch settings
- REST service
License
Important
This software is licensed under the Apache 2 license, quoted below.
Copyright 2011-2024 David Pilato
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Incompatible 3rd party library licenses
To support JPEG 2000 (JPX/JP2) images, you need to manually add the jai-imageio-jpeg2000:1.4.0 library to the external directory:
cd external
wget https://repo1.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/1.4.0/jai-imageio-jpeg2000-1.4.0.jar
See the PDFBox documentation for more details about the license.
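After running the commands above, it can be handy to confirm that the jar actually landed in the external directory. The following is a minimal sketch, not part of FSCrawler itself: the function name `has_jpeg2000_jar` is hypothetical, and the only assumption is that the jar keeps the file name used in the download URL.

```shell
# Hypothetical helper (not shipped with FSCrawler): succeeds when the
# JPEG 2000 jar is present in the given directory (default: "external").
has_jpeg2000_jar() {
  dir="${1:-external}"
  # File name taken from the download URL above.
  [ -f "$dir/jai-imageio-jpeg2000-1.4.0.jar" ]
}
```

For example, `has_jpeg2000_jar external || echo "jai-imageio-jpeg2000 jar is missing"` prints a message only when the file is absent.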
Special thanks
Thanks to JetBrains for the IntelliJ IDEA License!