ExtractText

Use external tools to extract text from images, PDFs, and other document types. SpamAssassin rules will then be applied to the extracted text.

Enable the ExtractText plugin

Log in to Plesk and go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings. Check the ExtractText plugin, then save the page to enable it. After enabling the plugin, install the external tools for each document type you want to scan, following the instructions below.

Images

To extract text from images, install Tesseract:

// RHEL / CentOS / CloudLinux / AlmaLinux
yum install tesseract

// Debian / Ubuntu
apt-get install tesseract-ocr

Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText

Under Extracttext use, add the following line:

tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)

Under Extracttext external, add the following line:

tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
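
As a rough sketch of how such a line appears to map onto a shell command: going by the upstream SpamAssassin ExtractText documentation, {} stands for the temporary file that holds the attachment, and a {KEY=VALUE} token sets an environment variable for the tool. The helper below, expand_external, is hypothetical and only imitates that assumed substitution so the resulting command can be run by hand; /tmp/sample.png is a placeholder path.

```shell
#!/bin/sh
# expand_external is a hypothetical helper, not part of Warden or SpamAssassin.
# It imitates the assumed substitution rules:
#   "{}"          -> path of the file to scan
#   "{KEY=VALUE}" -> KEY=VALUE environment assignment
# Limitation: the path must not contain "|", "&" or backslashes,
# because it is pasted into a sed replacement.
expand_external() {
    template=$1
    file=$2
    printf '%s\n' "$template" |
        sed -e "s|{}|$file|g" \
            -e 's/{\([A-Za-z_][A-Za-z0-9_]*=[^}]*\)}/\1/g'
}

# Print the command line for the tesseract entry above.
expand_external '{OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -' /tmp/sample.png
```

Running the printed command in a shell against a real test image shows the text that would be handed to the SpamAssassin rules.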

PDF Documents

To extract text from PDFs, install poppler-utils, which provides pdftotext:

// RHEL / CentOS / CloudLinux / AlmaLinux
yum install poppler-utils

// Debian / Ubuntu
apt-get install poppler-utils

Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText

Under Extracttext use, add the following line:

pdftotext .pdf application/pdf

Under Extracttext external, add the following line:

pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
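
Before relying on the plugin, it can help to run the same pdftotext options by hand. The following is a minimal sketch, assuming pdftotext is on the PATH; try_pdftotext is a hypothetical helper name and /tmp/sample.pdf is a placeholder for a real test file.

```shell
#!/bin/sh
# try_pdftotext is a hypothetical helper for manual testing only.
# It returns 2 when pdftotext is not installed or the file cannot be read.
# Otherwise it runs the same options as the configuration line above,
# writes the extracted text to standard output, and returns
# pdftotext's own exit status.
try_pdftotext() {
    pdf=$1
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found in PATH" >&2
        return 2
    fi
    if [ ! -r "$pdf" ]; then
        echo "cannot read $pdf" >&2
        return 2
    fi
    pdftotext -nopgbrk -layout -enc UTF-8 "$pdf" -
}

# Example call; /tmp/sample.pdf is a placeholder path.
try_pdftotext /tmp/sample.pdf || echo "no text extracted" >&2
```

If the command prints readable text for a known PDF, the same options should work when the plugin calls /usr/bin/pdftotext.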

Word Documents

To extract text from Word (.doc) documents, install antiword:

// RHEL / CentOS / CloudLinux / AlmaLinux
yum install antiword

// Debian / Ubuntu
apt-get install antiword

Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText

Under Extracttext use, add the following line:

antiword .doc application/(?:vnd\.?)?ms-?word.*

Under Extracttext external, add the following line:

antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
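
To see which content types the antiword pattern above is likely to catch, the sketch below tests a few sample values from a shell. It makes two assumptions: the pattern is rewritten for grep -E, which has no (?:...) groups, and it is anchored at both ends, which is how the upstream plugin is understood to apply it. matches_antiword is a hypothetical helper name.

```shell
#!/bin/sh
# matches_antiword is a hypothetical helper. It tests a Content-Type against
# an ERE rewrite of the pattern above: "(?:...)" becomes "(...)" because
# grep -E has no non-capturing groups. Anchoring with ^ and $ is an
# assumption about how the plugin applies the pattern.
matches_antiword() {
    printf '%s\n' "$1" | grep -Eq '^application/(vnd\.?)?ms-?word.*$'
}

# Report match / no match for a few sample content types.
for type in \
    application/msword \
    application/vnd.ms-word.document.macroEnabled.12 \
    application/vnd.openxmlformats-officedocument.wordprocessingml.document \
    application/pdf
do
    if matches_antiword "$type"; then
        echo "match:    $type"
    else
        echo "no match: $type"
    fi
done
```

Under these assumptions the two ms-word types match, while the .docx content type and application/pdf do not, so this pattern does not send .docx attachments to antiword.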

RTF Documents

To extract text from RTF documents, install unrtf:

// Debian / Ubuntu (only)
apt-get install unrtf

Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText

Under Extracttext use, add the following line:

unrtf .doc .rtf application/rtf text/rtf

Under Extracttext external, add the following line:

unrtf /usr/bin/unrtf --nopict {}

OpenDocument Documents

To extract text from OpenDocument documents, install odt2txt:

// RHEL / AlmaLinux 9 (only)
yum install odt2txt

// Debian / Ubuntu
apt-get install odt2txt

Go to Warden Anti-spam and Virus Protection -> Settings -> Plugin Settings -> ExtractText

Under Extracttext use, add the following lines:

odt2txt .odt .ott application/.*?opendocument.*text
odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter

Under Extracttext external, add the following line:

odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
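
Every Extracttext external line on this page points at a fixed path under /usr/bin, so a quick check that those binaries exist can catch a missing package or a different install location. This is a minimal sketch; missing_tools is a hypothetical helper, and the list should be trimmed to the tools actually configured.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper: it prints every given path
# that is not an executable regular file.
missing_tools() {
    for path in "$@"; do
        { [ -f "$path" ] && [ -x "$path" ]; } || printf '%s\n' "$path"
    done
}

# Paths taken from the Extracttext external lines on this page.
missing_tools /usr/bin/tesseract /usr/bin/pdftotext /usr/bin/antiword \
    /usr/bin/unrtf /usr/bin/odt2txt
```

No output means every listed tool is present at the expected path. Any path that is printed needs either the package installed or the configured path adjusted.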