Version 2.5
A bug caused a lot of data to go over the wire each time FSCrawler ran. To fix this issue, we changed the default mapping and set store: true on the field file.filename. If this field is not stored and remove_deleted is true (the default), FSCrawler will fail while crawling your documents. You need to create the new mapping accordingly and reindex your existing data, either by deleting the old index and running FSCrawler again or by using the reindex API as follows:

# Backup old index data
POST _reindex
{
  "source": {
    "index": "job_name"
  },
  "dest": {
    "index": "job_name_backup"
  }
}
# Remove job_name index
DELETE job_name

Restart FSCrawler with the following command. It will just create the right mapping again.
$ bin/fscrawler job_name --loop 0
Then restore old data:
POST _reindex
{
  "source": {
    "index": "job_name_backup"
  },
  "dest": {
    "index": "job_name"
  }
}
# Remove backup index
DELETE job_name_backup

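The order of the steps above matters: the backup must exist before the old index is deleted, and FSCrawler must recreate the mapping before the data is restored. As a minimal sketch, the sequence can be written down as an ordered plan. The function names `reindex_body` and `migration_steps` and the `SHELL` marker are hypothetical, introduced here only for illustration; the snippet builds the requests and prints them, and does not talk to Elasticsearch.

```python
import json


def reindex_body(source, dest):
    """Build the body of a _reindex request copying `source` into `dest`."""
    return {"source": {"index": source}, "dest": {"index": dest}}


def migration_steps(job_name):
    """Return the upgrade steps in order as (kind, target, body) tuples.

    `kind` is an HTTP method for Elasticsearch calls, or the hypothetical
    marker "SHELL" for the command that has to be run by hand in between.
    """
    backup = f"{job_name}_backup"  # same naming as job_name_backup above
    return [
        ("POST", "/_reindex", reindex_body(job_name, backup)),   # back up old data
        ("DELETE", f"/{job_name}", None),                        # remove old index
        ("SHELL", f"bin/fscrawler {job_name} --loop 0", None),   # recreate mapping
        ("POST", "/_reindex", reindex_body(backup, job_name)),   # restore old data
        ("DELETE", f"/{backup}", None),                          # remove backup
    ]


# Print the plan for the job used in the examples above.
for kind, target, body in migration_steps("job_name"):
    print(kind, target, json.dumps(body) if body is not None else "")
```

Keeping the plan as plain data makes it easy to review before anything is sent to a cluster.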
The default mapping changed in FSCrawler for the meta.raw.* fields. It might be better to reindex your data.

The excludes parameter is also used for directory names. But this new implementation also brings a breaking change if you were using excludes previously. In the previous implementation, the regular expression was only applied to the filename. It’s now applied to the full virtual path name.

For example, if you have a /tmp dir as follows:

/tmp
└── folder
    ├── foo.txt
    └── bar.txt

Previously, excluding foo.txt excluded the virtual file /folder/foo.txt. If you still want to exclude any file named foo.txt, whatever its directory, you now need to specify */foo.txt:

{
  "name" : "test",
  "fs": {
    "excludes": [
      "*/foo.txt"
    ]
  }
}

For more information, read Includes and excludes.
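To illustrate why a bare foo.txt no longer excludes /folder/foo.txt, here is a minimal sketch that matches patterns against the full virtual path. It uses Python's `fnmatch` module as a stand-in for wildcard matching; this is an assumption made for illustration and not FSCrawler's actual matcher, and `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_excluded(virtual_path, excludes):
    """Return True if any exclude pattern matches the whole virtual path."""
    return any(fnmatchcase(virtual_path, pattern) for pattern in excludes)


# Virtual paths for the /tmp example above (relative to the crawled root).
paths = ["/folder/foo.txt", "/folder/bar.txt"]

for pattern in ["foo.txt", "*/foo.txt"]:
    matched = [p for p in paths if is_excluded(p, [pattern])]
    print(pattern, "excludes", matched)
```

With full-path matching, the pattern has to account for the directory part, which is what the leading `*/` does.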
For new indices, FSCrawler now uses _doc as the default type name for clusters running Elasticsearch 6.x or later.