Automation Action: Convert Document To Text
Convert PDF, Word, Open Document, Excel, RichText, Markdown or HTML documents or attachments to plain text, or extract PDF form data.
This action enables you to parse and extract text data from Word, PDF, Open Document, Excel, RichText, Markdown and HTML attachments or local document files. Documents are converted to plain text which is then assigned to a variable. This action can also extract PDF form data.
Select a Document To Convert - this can be any local file or a %variable% replacement. You can specify multiple documents if required, separated by commas (any file paths that contain commas must be enclosed in quotes).
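If you build the document list in a script before passing it to the action through a %variable%, the quoting rule above behaves much like a single CSV row. The following is a minimal Python sketch of how such a list could be split. The function name and the sample paths are hypothetical, and the action's own parser may differ in details such as whitespace handling.

```python
import csv
import io

def split_document_list(value: str) -> list[str]:
    """Split a comma-separated list of paths, honouring double quotes."""
    # skipinitialspace lets a quoted path follow ", " as well as ","
    reader = csv.reader(io.StringIO(value), skipinitialspace=True)
    # A single line of input yields a single row of fields.
    row = next(reader, [])
    return [path.strip() for path in row if path.strip()]

# Hypothetical paths: the second one contains a comma, so it is quoted.
paths = split_document_list(r'C:\docs\report.pdf, "C:\docs\notes, final.docx"')
for path in paths:
    print(path)
```

The quoted path is returned as one entry with its comma intact, while the unquoted path is split off at the separator.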
Enable the Include Incoming Attachments option to convert attached documents matching the Matching Mask. Enter *.* to convert all supported attachments.
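To get a feel for which attachments a mask would select, the matching can be approximated outside the product. This Python sketch assumes the Matching Mask uses ordinary file wildcards (* and ?) and ignores case. Both are assumptions, and the helper name and file names are made up for illustration.

```python
from fnmatch import fnmatchcase

def matching_attachments(names, mask="*.*"):
    """Return the attachment names that match a wildcard mask (case-insensitive)."""
    mask = mask.lower()
    return [name for name in names if fnmatchcase(name.lower(), mask)]

# Hypothetical attachment names on an incoming message.
attachments = ["Invoice.PDF", "photo.jpg", "terms.docx"]

pdf_only = matching_attachments(attachments, "*.pdf")
everything = matching_attachments(attachments, "*.*")
print(pdf_only)
print(everything)
```

Note that a match only means the attachment is considered. Whether it is actually converted still depends on it being a supported document type.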
Select the variable to receive the plain text from the Assign To list.
The document(s) will be converted to plain text. Excel files will be converted to CSV text. Markdown documents will be converted to HTML first, and the HTML is then converted to plain text.
If multiple files are converted within the same action then the extracted text from each file will be appended to the returned text.
PDF Extract Form Data
Enable the PDF Extract Form Data option to extract only form data from PDF files. When enabled, only the form data is extracted, in the following format:
{form field name}: {value}
{form field name}: {value}
...
Enable the Return PDF Form Data As Json option to return the form data as JSON text.
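If the extracted form data is passed to a script, the line-based format can be turned into a lookup table with a few lines of code. This Python sketch is illustrative only: the function name and the sample field names are hypothetical, and it assumes one field per line, split at the first colon. Field names that themselves contain a colon, or values that span several lines, would need different handling.

```python
def parse_form_data(text: str) -> dict[str, str]:
    """Parse '{form field name}: {value}' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values such as URLs stay intact.
        name, separator, value = line.partition(":")
        if not separator:
            continue  # not a 'name: value' line
        fields[name.strip()] = value.strip()
    return fields

# Made-up sample in the documented format.
sample = "FirstName: Ada\nLastName: Lovelace\nWebsite: https://example.com"
form = parse_form_data(sample)
print(form["Website"])
```

When the JSON option is enabled, a standard JSON parser is likely the simpler route. The exact shape of that JSON is not described here, so inspect a test result before relying on it.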
PDF Text Extract Mode
When converting PDF documents to text you have a number of options:
- Keep Positioning Method 1: Some positioning will be retained.
- Keep Positioning Method 2: Same as above but using a different extraction method. This may provide a more accurate plain text representation of the PDF document in some cases.
- Keep Reading Order Method 1: The text will be extracted in reading order, with no positioning indentation.
- Keep Reading Order Method 2: Same as above but using a different extraction method.
- Extract To CSV: Extracts each text element to CSV text with columns: Page, Bounds (left,top,right,bottom), Text, Font, Size, Weight, RGB. The CSV will contain a row for each text element.
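The Extract To CSV output can be processed by any CSV reader. The Python sketch below shows one way to load it into records keyed by the column names listed above. It makes several assumptions that should be checked against a real test result: that the Bounds value is a single quoted field holding four comma-separated numbers, that a header row may or may not be present, and that the RGB value can be kept as plain text. The function name and the sample row are made up.

```python
import csv
import io

COLUMNS = ["Page", "Bounds", "Text", "Font", "Size", "Weight", "RGB"]

def read_text_elements(csv_text: str) -> list[dict]:
    """Load Extract To CSV output into one dictionary per text element."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip a header row if one is present (assumption: it starts with 'Page').
    if rows and rows[0][:1] == ["Page"]:
        rows = rows[1:]
    elements = []
    for row in rows:
        if len(row) != len(COLUMNS):
            continue  # ignore blank or unexpected rows
        item = dict(zip(COLUMNS, row))
        # Assumed layout: "left,top,right,bottom" inside one quoted field.
        left, top, right, bottom = (float(v) for v in item["Bounds"].split(","))
        item["Bounds"] = (left, top, right, bottom)
        item["Page"] = int(item["Page"])
        elements.append(item)
    return elements

# Made-up sample row for illustration only.
sample = '1,"72,90,300,104",Invoice 1001,Arial,12,Bold,000000\n'
elements = read_text_elements(sample)
print(elements[0]["Text"], elements[0]["Bounds"])
```

With the bounds available as numbers, text elements can be filtered by page region, for example to pick out everything in a header area.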
Enable the Remove Repeated Blank Lines option if you need repeated blank lines removed from the text. This option is useful when the PDF document contains varying amounts of blank space that your extraction rules do not need.
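The effect of this option can be pictured as collapsing each run of blank lines into a single blank line. The Python sketch below illustrates that idea. It is an approximation, not the product's implementation: it assumes whitespace-only lines count as blank and that one blank line is kept per run.

```python
def remove_repeated_blank_lines(text: str) -> str:
    """Collapse consecutive blank lines into a single blank line."""
    kept = []
    previous_blank = False
    for line in text.splitlines():
        blank = not line.strip()  # whitespace-only lines count as blank
        if blank and previous_blank:
            continue  # drop the second and later blank lines in a run
        kept.append(line)
        previous_blank = blank
    return "\n".join(kept)

cleaned = remove_repeated_blank_lines("Total\n\n\n\n42.00\n\n\nThank you")
print(cleaned)
```

Normalizing blank space in this way means that extraction rules which rely on line position relative to a label are less likely to break when the gap between sections varies from one PDF to the next.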
You can then use the text in other actions, or use Extract Field actions to parse and extract data from the text.
To test the text extraction, select or enter a document and click the Test button. The results will be displayed. Click the Copy button to copy the extracted text to the clipboard. You can then paste this into the Extract Field Helper Message if you need to extract data from the text.