The ExtractText plugin uses external tools to extract text from images, PDFs, and other document types. SpamAssassin rules are then applied to the extracted text.
Log in to Plesk and go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings
Check the ExtractText plugin, then save the page to enable the plugin. After enabling the plugin, install the external tools for the document types that you want to scan, using the instructions below.
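Before adding any plugin lines, it can help to see which of the external tools are already present. The following is a minimal sketch, not part of Warden: the function name check_extractors is hypothetical, and it assumes the tools live in /usr/bin, which is the path used by the "Extracttext external" lines in this article.

```shell
#!/bin/sh
# Report which extractor binaries from this article exist in a directory.
# check_extractors is a hypothetical helper name, not a Warden command.
# Usage: check_extractors [bin_dir]   (bin_dir defaults to /usr/bin)
check_extractors() {
    bin_dir="${1:-/usr/bin}"
    for tool in tesseract pdftotext antiword unrtf odt2txt; do
        if [ -x "$bin_dir/$tool" ]; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_extractors
```

Any tool reported as missing needs to be installed before its lines are added to the plugin settings.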
If you would like to extract text from images:
// RHEL / CentOS / CloudLinux / AlmaLinux
yum install tesseract
// Debian / Ubuntu
apt-get install tesseract-ocr
Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText
Under Extracttext use
add the following line:
tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
Under Extracttext external
add the following line:
tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
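To confirm that Tesseract produces usable text before relying on it for mail scanning, the same command can be run by hand on a test image. This is a sketch under assumptions: ocr_preview is a hypothetical helper name, the image path is whatever file you supply, and the `{OMP_THREAD_LIMIT=1}` prefix in the plugin line is assumed to set that environment variable for the command.

```shell
# Run Tesseract with the same options as the "Extracttext external" line.
# ocr_preview is a hypothetical helper name.
# Usage: ocr_preview /path/to/image.png
ocr_preview() {
    image="$1"
    if [ ! -f "$image" ]; then
        echo "no such file: $image" >&2
        return 1
    fi
    if [ ! -x /usr/bin/tesseract ]; then
        echo "tesseract is not installed at /usr/bin/tesseract" >&2
        return 1
    fi
    # OMP_THREAD_LIMIT=1 limits Tesseract to one thread;
    # the trailing "-" writes the recognized text to standard output.
    OMP_THREAD_LIMIT=1 /usr/bin/tesseract -c page_separator= "$image" -
}
```

If the output is empty or garbled for a clearly legible image, the plugin is unlikely to extract anything useful from similar attachments.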
If you would like to extract text from PDFs:
// RHEL / CentOS / CloudLinux / AlmaLinux
yum install poppler-utils
// Debian / Ubuntu
apt-get install poppler-utils
Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText
Under Extracttext use
add the following line:
pdftotext .pdf application/pdf
Under Extracttext external
add the following line:
pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
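pdftotext only reads text that is embedded in the PDF; it does not perform OCR, so a scanned, image-only PDF yields little or no output. The sketch below counts the words extracted with the same flags as the line above. The helper name pdf_word_count is hypothetical, and the PDF path is whatever file you supply.

```shell
# Count the words pdftotext extracts from a PDF, using the article's flags.
# pdf_word_count is a hypothetical helper name.
# Usage: pdf_word_count /path/to/file.pdf
pdf_word_count() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    if [ ! -x /usr/bin/pdftotext ]; then
        echo "pdftotext is not installed at /usr/bin/pdftotext" >&2
        return 1
    fi
    # "-" sends the extracted text to standard output, where wc counts words.
    /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 "$pdf" - | wc -w
}
```

A count of 0 suggests that the PDF has no text layer.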
If you would like to extract text from Word documents:
// RHEL / CentOS / CloudLinux / AlmaLinux
yum install antiword
// Debian / Ubuntu
apt-get install antiword
Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText
Under Extracttext use
add the following line:
antiword .doc application/(?:vnd\.?)?ms-?word.*
Under Extracttext external
add the following line:
antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
If you would like to extract text from RTF documents:
// Debian / Ubuntu (only)
apt-get install unrtf
Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText
Under Extracttext use
add the following line:
unrtf .doc .rtf application/rtf text/rtf
Under Extracttext external
add the following line:
unrtf /usr/bin/unrtf --nopict {}
If you would like to extract text from OpenDocument documents:
// RHEL / AlmaLinux 9 (only)
yum install odt2txt
// Debian / Ubuntu
apt-get install odt2txt
Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText
Under Extracttext use
add the following lines:
odt2txt .odt .ott application/.*?opendocument.*text
odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
Under Extracttext external
add the following line:
odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}