Download FSCrawler¶
Depending on your Elasticsearch cluster version, you can download FSCrawler 2.9 using the following links:
- fscrawler-es7-2.9 for Elasticsearch V7.
- fscrawler-es6-2.9 for Elasticsearch V6.
Tip
This is a stable version. You can choose another version than 2.9 from Maven Central:
- fscrawler-es7-* for Elasticsearch V7.
- fscrawler-es6-* for Elasticsearch V6.
You can also download a SNAPSHOT version from Sonatype:
- fscrawler-es7-* for Elasticsearch V7.
- fscrawler-es6-* for Elasticsearch V6.
The distribution contains:
$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│ ├── fscrawler
│ └── fscrawler.bat
├── config
│ └── log4j2.xml
└── lib
├── ... All needed jars
Using docker¶
Pull the Docker image:
docker pull dadoonet/fscrawler
Note
This image is very big (1.2+gb) as it contains Tesseract and
all the trained language data.
If you don’t want to use OCR at all, you can use a smaller image (around 530mb) by pulling instead
dadoonet/fscrawler:noocr
docker pull dadoonet/fscrawler:noocr
Let say your documents are located in ~/tmp
dir and you want to store your fscrawler jobs in ~/.fscrawler
.
You can run FSCrawler with:
docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp:/tmp/es:ro dadoonet/fscrawler fscrawler job_name
On the first run, if the job does not exist yet in ~/.fscrawler
, FSCrawler will ask you if you want to create it:
10:16:53,880 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [67.3mb/876.5mb=7.69%], RAM [2.1gb/3.8gb=55.43%], Swap [1023.9mb/1023.9mb=100.0%].
10:16:53,899 WARN [f.p.e.c.f.c.FsCrawlerCli] job [job_name] does not exist
10:16:53,900 INFO [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
10:16:56,745 INFO [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [/root/.fscrawler/job_name/_settings.yaml]. Please review and edit before relaunch
Note
The configuration file is actually stored on your machine in ~/.fscrawler/job_name/_settings.yaml
.
Remember to change the URL of your elasticsearch instance as the container won’t be able to see it
running under the default 127.0.0.1
. You will need to use the actual IP address of the host.
Using docker compose¶
In this section, the following directory layout is assumed:
.
├── config
│ └── job_name
│ └── _settings.yaml
├── data
│ └── <your files>
├── logs
│ └── <fscrawler logs>
└── docker-compose.yml
For example, to connect to a docker container named elasticsearch
, modify your _settings.yaml
.
name: "job_name"
elasticsearch:
nodes:
- url: "http://elasticsearch:9200"
And, prepare the following docker-compose.yml
.
version: '3'
services:
# Elasticsearch Cluster
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
container_name: elasticsearch
environment:
- bootstrap.memory_lock=true
- discovery.type=single-node
restart: always
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- data:/usr/share/elasticsearch/data
ports:
- 9200:9200
networks:
- fscrawler_net
# FSCrawler
fscrawler:
image: dadoonet/fscrawler:$FSCRAWLER_VERSION
container_name: fscrawler
restart: always
volumes:
- ${PWD}/config:/root/.fscrawler
- ${PWD}/logs:/usr/share/fscrawler/logs
- ../../test-documents/src/main/resources/documents/:/tmp/es:ro
depends_on:
- elasticsearch
command: fscrawler --rest idx
networks:
- fscrawler_net
volumes:
data:
driver: local
networks:
fscrawler_net:
driver: bridge
Then, you can run Elasticsearch.
docker-compose up -d elasticsearch
docker-compose logs -f elasticsearch
Wait for elasticsearch to be started:
After starting Elasticsearch, you can run FSCrawler.
docker-compose up fscrawler
Running as a Service on Windows¶
Create a fscrawlerRunner.bat
as:
set JAVA_HOME=c:\Program Files\Java\jdk15.0.1
set FS_JAVA_OPTS=-Xmx2g -Xms2g
/Elastic/fscrawler/bin/fscrawler.bat --config_dir /Elastic/fscrawler data >> /Elastic/logs/fscrawler.log 2>&1
Then use fscrawlerRunner.bat
to create your windows service.