Using docker
Pull the Docker image from Docker Hub:
docker pull dadoonet/fscrawler
Note
This image is very big (500+ MB) as it contains Tesseract and
all the trained language data.
If you don’t want to use OCR at all, you can pull the smaller
dadoonet/fscrawler:noocr image (around 230 MB) instead:
docker pull dadoonet/fscrawler:noocr
Let’s say your documents are located in the ~/tmp directory and you want to store your FSCrawler jobs in ~/.fscrawler.
You can run FSCrawler with:
docker run -it --rm \
-v ~/.fscrawler:/root/.fscrawler \
-v ~/tmp:/tmp/es:ro \
dadoonet/fscrawler
Note
The configuration file is expected to be stored on your machine in ~/.fscrawler/fscrawler/_settings.yaml.
Remember to change the URL of your Elasticsearch instance, as the container won’t be able to reach it
at the default 127.0.0.1. You will need to use the actual IP address of the host,
or set the Elasticsearch URL with the FSCRAWLER_ELASTICSEARCH_URLS environment variable.
See docker-options for more information.
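As a rough illustration of the note above, the following sketch wraps the docker run command in a small shell function that passes the Elasticsearch URL through FSCRAWLER_ELASTICSEARCH_URLS. The function name run_fscrawler and the address 192.168.1.10 are made up for this example, and the check that rejects loopback addresses is an added safeguard, not something FSCrawler does itself.

```shell
# run_fscrawler <elasticsearch_url>
# Hypothetical helper: starts the container with the Elasticsearch URL set
# through the FSCRAWLER_ELASTICSEARCH_URLS environment variable.
run_fscrawler() {
  es_url=$1

  # Inside the container, a loopback address points at the container itself,
  # so refuse it early (an extra safeguard added for this sketch).
  case "$es_url" in
    *://127.0.0.1*|*://localhost*)
      echo "use the host IP address instead of $es_url" >&2
      return 1
      ;;
  esac

  docker run -it --rm \
    -v "$HOME/.fscrawler:/root/.fscrawler" \
    -v "$HOME/tmp:/tmp/es:ro" \
    -e "FSCRAWLER_ELASTICSEARCH_URLS=$es_url" \
    dadoonet/fscrawler
}

# Example (not run here), assuming the host is reachable at 192.168.1.10:
# run_fscrawler http://192.168.1.10:9200
```

Replace the example address with the actual IP address of the machine running Elasticsearch.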
If you need to add a third-party library (JAR) or your custom Tika JAR, you can put it in an external directory and
mount it as well:
docker run -it --rm \
-v ~/.fscrawler:/root/.fscrawler \
-v ~/tmp:/tmp/es:ro \
-v "$PWD/external:/usr/share/fscrawler/external" \
dadoonet/fscrawler
If you want to use the REST service, don’t forget to also expose the port:
docker run -it --rm \
-v ~/.fscrawler:/root/.fscrawler \
-v ~/tmp:/tmp/es:ro \
-p 8080:8080 \
dadoonet/fscrawler
If you want to change the log level for FSCrawler, you can run:
docker run -it --rm \
-v ~/.fscrawler:/root/.fscrawler \
-v ~/tmp:/tmp/es:ro \
-v ~/logs:/root/logs \
-e FS_JAVA_OPTS="-DLOG_LEVEL=debug -DDOC_LEVEL=debug" \
dadoonet/fscrawler
You can then read the logs from the ~/logs directory:
tail -f ~/logs/documents.log
You can pass all the CLI options to the docker container as well:
docker run -it --rm \
-v ~/.fscrawler:/root/.fscrawler \
-v ~/tmp:/tmp/es:ro \
dadoonet/fscrawler job_name --restart --loop 1
See CLI options for more information.
Using docker compose
In this section, the following directory layout is assumed:
.
├── .env
├── docs
│   └── <your PDF, DOC, ... files>
└── docker-compose.yml
The .env file looks like this:
# Password for the 'elastic' user (at least 6 characters)
ES_LOCAL_PASSWORD=changeme
# Version of Elastic products
ES_LOCAL_VERSION=9.4.1
# Set the ES container name
ES_LOCAL_CONTAINER_NAME=es-fscrawler
# Set to 'basic' or 'trial' to automatically start the 30-day trial
ES_LOCAL_LICENSE=basic
#ES_LOCAL_LICENSE=trial
# Port to expose Elasticsearch HTTP API to the host
ES_LOCAL_PORT=9200
ES_LOCAL_DISK_SPACE_REQUIRED=1gb
ES_LOCAL_JAVA_OPTS="-XX:UseSVE=0 -Xms128m -Xmx2g"
# Project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME=fscrawler
# FSCrawler Settings
FSCRAWLER_VERSION=2.10-SNAPSHOT
FSCRAWLER_PORT=8080
# Optionally, you can change the log level settings
FS_JAVA_OPTS="-DLOG_LEVEL=debug -DDOC_LEVEL=debug"
The docker-compose.yml file looks like this:
---
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:${ES_LOCAL_VERSION}
    container_name: ${ES_LOCAL_CONTAINER_NAME}
    volumes:
      - dev-elasticsearch:/usr/share/elasticsearch/data
    ports:
      - 127.0.0.1:${ES_LOCAL_PORT}:9200
    environment:
      - discovery.type=single-node
      - ELASTIC_PASSWORD=${ES_LOCAL_PASSWORD}
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.license.self_generated.type=${ES_LOCAL_LICENSE}
      - xpack.ml.use_auto_machine_memory_percent=true
      - ES_JAVA_OPTS=${ES_LOCAL_JAVA_OPTS}
      - cluster.routing.allocation.disk.watermark.low=${ES_LOCAL_DISK_SPACE_REQUIRED}
      - cluster.routing.allocation.disk.watermark.high=${ES_LOCAL_DISK_SPACE_REQUIRED}
      - cluster.routing.allocation.disk.watermark.flood_stage=${ES_LOCAL_DISK_SPACE_REQUIRED}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl --output /dev/null --silent --head --fail -u elastic:${ES_LOCAL_PASSWORD} http://elasticsearch:9200",
        ]
      interval: 10s
      timeout: 10s
      retries: 30
  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:${FSCRAWLER_VERSION}
    container_name: fscrawler
    restart: always
    environment:
      - FS_JAVA_OPTS=${FS_JAVA_OPTS}
      - FSCRAWLER_ELASTICSEARCH_URLS=http://${ES_LOCAL_CONTAINER_NAME}:9200
      - FSCRAWLER_ELASTICSEARCH_USERNAME=elastic
      - FSCRAWLER_ELASTICSEARCH_PASSWORD=${ES_LOCAL_PASSWORD}
      - FSCRAWLER_REST_URL=http://fscrawler:${FSCRAWLER_PORT}
    volumes:
      - ${PWD}/docs:/tmp/es:ro
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: --rest
volumes:
  dev-elasticsearch:
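Before starting the stack, it can help to confirm that the directory layout and the .env file match what the compose file expects. The sketch below is one possible pre-flight check. The function name preflight is made up for this example. It only verifies that the three entries from the layout above exist and that ES_LOCAL_PASSWORD has at least 6 characters, as the comment in the .env file requires.

```shell
# preflight <project_dir>
# Hypothetical helper: returns 0 when the expected files are present and the
# password in .env is long enough, and a non-zero status otherwise.
preflight() {
  dir=$1

  for f in .env docker-compose.yml; do
    [ -f "$dir/$f" ] || { echo "missing file: $dir/$f" >&2; return 1; }
  done
  [ -d "$dir/docs" ] || { echo "missing directory: $dir/docs" >&2; return 1; }

  # Read the password without sourcing the file; the last assignment wins.
  password=$(sed -n 's/^ES_LOCAL_PASSWORD=//p' "$dir/.env" | tail -n 1)
  [ "${#password}" -ge 6 ] || {
    echo "ES_LOCAL_PASSWORD must be at least 6 characters" >&2
    return 1
  }
}

# Example (not run here): only start the stack when the check passes.
# preflight . && docker-compose up
```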
Copy your PDF/DOC files into the docs directory and run the full stack, including FSCrawler, with:
docker-compose up
When the job has finished indexing, you can check your documents in Elasticsearch with:
curl -u elastic:changeme "http://localhost:9200/fscrawler/_search"
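The index may not exist yet right after the stack starts, so the request above can fail at first. A minimal sketch of a retry helper follows. The function name retry and its arguments are made up for this example. Note that it succeeds as soon as the command succeeds once, which shows that the index answers and not that indexing has finished.

```shell
# retry <attempts> <delay_seconds> <command...>
# Hypothetical helper: runs the command until it succeeds or the attempts run out.
retry() {
  attempts=$1
  delay=$2
  shift 2

  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause before the next attempt, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not run here): poll the search endpoint up to 30 times, 10 seconds apart.
# With --fail, curl exits with an error while Elasticsearch returns an HTTP error.
# retry 30 10 curl --silent --fail -u elastic:changeme \
#   "http://localhost:9200/fscrawler/_search"
```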
Note
You will find this example in the contrib/docker-compose-example-elasticsearch project directory.
Running as a Service on Windows
Create a fscrawlerRunner.bat file as follows:
set JAVA_HOME=c:\Program Files\Java\jdk15.0.1
set FS_JAVA_OPTS=-Xmx2g -Xms2g
/Elastic/fscrawler/bin/fscrawler.bat --config_dir /Elastic/fscrawler data >> /Elastic/logs/fscrawler.log 2>&1
Then use fscrawlerRunner.bat to create your Windows service.
Local installation
If you prefer to run FSCrawler from a ZIP distribution on your machine instead of Docker,
you can download FSCrawler 2.10 from Sonatype.
The filename ends with .zip.
Warning
This is a SNAPSHOT version. You can also download a stable version from Maven Central.
Note
There’s an issue with the download links for SNAPSHOT versions.
Hint
Due to a bug with the underlying service we rely on to provide SNAPSHOT hosting, we’ve had to temporarily remove browse access for SNAPSHOT releases. You should still be able to publish and consume SNAPSHOT releases as usual, but you cannot browse them via the UI.
So you must now download the maven-metadata.xml file and check the <snapshotVersion> tag
to find the latest SNAPSHOT version of the ZIP file.
<snapshotVersion>
  <extension>zip</extension>
  <value>2.10-20250801.161301-75</value>
  <updated>20250801161301</updated>
</snapshotVersion>
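If you would rather not read the file by hand, the lookup can be scripted. The sketch below assumes that maven-metadata.xml has already been saved locally and that each tag sits on its own line, as in the excerpt above. The function name latest_zip_snapshot is made up for this example.

```shell
# latest_zip_snapshot <path/to/maven-metadata.xml>
# Hypothetical helper: prints the <value> of every <snapshotVersion> block
# whose <extension> is zip.
latest_zip_snapshot() {
  awk '
    /<snapshotVersion>/ { in_block = 1; is_zip = 0; value = "" }
    in_block && /<extension>zip<\/extension>/ { is_zip = 1 }
    in_block && /<value>/ {
      value = $0
      sub(/.*<value>/, "", value)    # drop everything up to the opening tag
      sub(/<\/value>.*/, "", value)  # drop the closing tag and what follows
    }
    /<\/snapshotVersion>/ {
      if (in_block && is_zip && value != "") print value
      in_block = 0
    }
  ' "$1"
}

# Example (not run here):
# latest_zip_snapshot maven-metadata.xml
```

If the file lists several zip entries, the function prints one line per entry.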
Note the <value> tag, which contains the version you need to download. Use that value in the following URL:
After extracting the ZIP, you get a directory with bin/ (run scripts), config/ (logging), lib/ (core and
dependencies), external/ (optional JARs), and logs/. See Directory layout for the full directory layout.
Optional libraries (external)
You may need to add JARs to the external directory for some formats. For example, to support JPEG2000 (JPX/JP2)
images in PDFs, add the jai-imageio-jpeg2000 library: download it from Maven Central and put
jai-imageio-jpeg2000-1.4.0.jar in the external directory.
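Once the JAR has been downloaded, copying it into place can be scripted. This is only a sketch: the function name install_external_jar is made up for this example, and the second argument is assumed to be the directory you extracted the ZIP into.

```shell
# install_external_jar <path/to/library.jar> <fscrawler_dir>
# Hypothetical helper: copies a JAR into <fscrawler_dir>/external.
install_external_jar() {
  jar=$1
  fscrawler_dir=$2

  case "$jar" in
    *.jar) ;;
    *) echo "not a JAR file: $jar" >&2; return 1 ;;
  esac
  [ -f "$jar" ] || { echo "file not found: $jar" >&2; return 1; }

  # Create the external directory if it is missing, then copy the JAR.
  mkdir -p "$fscrawler_dir/external" &&
    cp "$jar" "$fscrawler_dir/external/"
}

# Example (not run here), assuming FSCrawler was extracted to ~/fscrawler:
# install_external_jar jai-imageio-jpeg2000-1.4.0.jar "$HOME/fscrawler"
```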