ai_workflows.document_utilities module

Utilities for reading and processing documents for AI workflows.

class ai_workflows.document_utilities.DocumentInterface(llm_interface: LLMInterface | None = None, pdf_image_dpi: int = 150, pdf_image_max_bytes: int = 4194304, max_parallel_requests: int = 5)

Bases: object

Utility class for reading and processing documents for AI workflows.

__init__(llm_interface: LLMInterface | None = None, pdf_image_dpi: int = 150, pdf_image_max_bytes: int = 4194304, max_parallel_requests: int = 5)

Initialize the document interface for reading and processing documents.

Parameters:

llm_interface (Optional[LLMInterface]) – LLM interface for interacting with LLMs in AI workflows. Defaults to None, in which case no LLM is used to convert supported document types to Markdown.
pdf_image_dpi (int) – DPI to use for rendering PDF images. Default is 150, which is generally plenty for LLM applications.
pdf_image_max_bytes (int) – Maximum size in bytes for an image to be processed. Default is 4194304 (4 MiB).
max_parallel_requests (int) – Maximum number of parallel requests to make when accessing LLM. Default is 5.

convert_to_json(filepath: str, json_context: str, json_job: str, json_output_spec: str, markdown_first: bool | None = None, json_output_schema: str | None = '') → list[dict]

Convert a document to JSON.

Parameters:

filepath (str) – Path to the file.
json_context (str) – Context for the LLM prompt used in JSON conversion (e.g., “The file contains a survey instrument administered by trained enumerators to households in Zimbabwe.”). (Required for JSON output.)
json_job (str) – Description of the job to do for the LLM prompt used in JSON conversion (e.g., “Your job is to extract each question or form field included in the text or page given.”). (Required for JSON output.)
json_output_spec (str) – JSON output specification for the LLM prompt (e.g., “Respond in correctly-formatted JSON with a single key named questions that is a list of dicts, one for each question or form field, each with the keys listed below…”). (Required for JSON output.)
markdown_first (Optional[bool]) – Whether to convert to Markdown first and then to JSON using an LLM. Set this to True if page-by-page conversion is not working well for elements that span pages; the Markdown-first approach converts page-by-page to Markdown and then converts the result to JSON as a second step. The default is None, which uses the Markdown path for small PDF files and the page-by-page path for larger ones.
json_output_schema (str) – JSON schema for output validation. Defaults to “”, which auto-generates a validation schema based on the json_output_spec. If explicitly set to None, will skip JSON validation.

Returns:

List of dicts, one for each batch (e.g., for each page). Use the merge_dicts() function to combine these into a single dict.

Return type:

list[dict]
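
As a rough usage sketch of the two calls described above, the helper below converts a file and then merges the per-batch results. The helper name extract_to_single_dict is hypothetical, and doc_interface is assumed to be a DocumentInterface instance that was already constructed with a working LLM interface. The three prompt strings are passed in by the caller rather than written out here.

```python
from typing import Any


def extract_to_single_dict(
    doc_interface: Any,
    filepath: str,
    json_context: str,
    json_job: str,
    json_output_spec: str,
) -> dict:
    """Hypothetical helper: convert a document to JSON, then merge the batches.

    `doc_interface` is assumed to expose convert_to_json() and merge_dicts()
    with the signatures documented in this module.
    """
    # One dict per batch (e.g., per page); markdown_first=None keeps the
    # documented default of choosing the path based on file size.
    batches = doc_interface.convert_to_json(
        filepath=filepath,
        json_context=json_context,
        json_job=json_job,
        json_output_spec=json_output_spec,
        markdown_first=None,
    )
    # Combine the per-batch dicts into a single dict.
    return doc_interface.merge_dicts(batches, strategy="retain")
```

Leaving json_output_schema at its default keeps the auto-generated validation schema described above.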

convert_to_markdown(filepath: str, use_text: bool = False) → str

Convert a document to markdown.

Parameters:

filepath (str) – Path to the file.
use_text (bool) – Whether to use extracted text to help the LLM with extracting text from the images. Default is False.

Returns:

Markdown output.

Return type:

str

static convert_to_pdf(filepath: str, output_dir: str) → str

Convert a document to PDF using LibreOffice.

Parameters:

filepath (str) – Path to the document file.
output_dir (str) – Path to the output directory.

Returns:

Path to the converted PDF file. Raises an exception on failure.

Return type:

str

markdown_to_json(markdown: str, json_context: str, json_job: str, json_output_spec: str, json_output_schema: str | None = '', max_chunk_size: int = 0, min_chunk_size: int = 2000) → list[dict]

Convert Markdown text to JSON using an LLM. If needed, will automatically split text into chunks and process each separately. Returns a list of dicts with JSON results, one for each chunk.

Parameters:

markdown (str) – Markdown text to convert to JSON.
json_context (str) – Context for the LLM prompt (e.g., “The file contains a survey instrument administered by trained enumerators to households in Zimbabwe.”).
json_job (str) – Job to do for the LLM prompt (e.g., “Your job is to extract each question or form field included in the text.”).
json_output_spec (str) – Output format for the LLM prompt (e.g., “Respond in correctly-formatted JSON with a single key named questions that is a list of dicts, one for each question or form field, each with the keys listed below…”).
json_output_schema (str) – JSON schema for output validation. Defaults to “”, which auto-generates a validation schema based on the json_output_spec. If explicitly set to None, will skip JSON validation.
max_chunk_size (int) – Maximum number of tokens allowed per chunk of Markdown processed. Default is 0, which will use a default based on the LLM’s maximum output tokens.
min_chunk_size (int) – Minimum number of desired tokens in a chunk of Markdown processed. Default is 2000. Set to -1 to disable.

Returns:

List of dicts with JSON results, one for each chunk. Use the merge_dicts() function to combine these into a single dict.

Return type:

list[dict]

static markdown_to_text(markdown: str) → str

Convert Markdown text to plain text by removing formatting.

Parameters:

markdown (str) – Input Markdown text to be converted.

Returns:

Plain text with Markdown formatting removed.

Return type:

str
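
To illustrate what removing formatting can mean in practice, here is a minimal regex-based sketch. It is not the library's implementation: the function name strip_markdown and the set of handled constructs are assumptions, and a simple pattern approach like this will mishandle edge cases such as underscores inside identifiers or nested emphasis.

```python
import re


def strip_markdown(markdown: str) -> str:
    """Rough sketch: strip common Markdown formatting with regular expressions."""
    text = markdown
    # Images and links: keep the alt text or link text, drop the target.
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # ATX headings: drop the leading '#' markers.
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet markers at the start of a line (done before emphasis handling
    # so that a leading '*' bullet is not mistaken for an emphasis marker).
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Bold, then italic, then inline code.
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    text = re.sub(r"(\*|_)(.+?)\1", r"\2", text)
    text = re.sub(r"`([^`]*)`", r"\1", text)
    return text.strip()
```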

static merge_dicts(dict_list: list[dict], strategy: str = 'retain') → dict

Merge a list of dictionaries into a single dictionary.

Parameters:

dict_list (list[dict]) – List of dictionaries to merge.
strategy (str) – Strategy for handling non-list duplicate items. ‘retain’ (default): retain the original value. ‘overwrite’: overwrite with the last value. ‘collect’: collect values into a list.

Returns:

A single merged dictionary.

Return type:

dict
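
The three strategies can be sketched as follows. This is an illustration of the documented behavior, not the library's code: the name merge_dicts_sketch is hypothetical, and concatenating list values that share a key is an assumption inferred from the phrase "non-list duplicate items".

```python
from typing import Any


def merge_dicts_sketch(dict_list: list[dict], strategy: str = "retain") -> dict:
    """Illustrative merge: lists are concatenated, other duplicates follow `strategy`."""
    if strategy not in {"retain", "overwrite", "collect"}:
        raise ValueError(f"Unknown strategy: {strategy}")
    merged: dict[str, Any] = {}
    collected: set = set()  # keys whose scalar values were gathered into a list
    for d in dict_list:
        for key, value in d.items():
            if key not in merged:
                # Copy lists so the input dicts are not mutated later.
                merged[key] = list(value) if isinstance(value, list) else value
            elif isinstance(merged[key], list) and isinstance(value, list):
                merged[key].extend(value)  # assumption: list values are concatenated
            elif strategy == "overwrite":
                merged[key] = value
            elif strategy == "collect":
                if key in collected:
                    merged[key].append(value)
                else:
                    merged[key] = [merged[key], value]
                    collected.add(key)
            # 'retain': keep the first value seen, so nothing to do here.
    return merged
```

With per-page results such as those returned by convert_to_json(), the list under a shared key grows page by page, while a scalar such as a title is kept, replaced, or gathered depending on the strategy.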

split_markdown(text: str, max_tokens: int, min_tokens: int = 2000) → List[str]

Split Markdown text into chunks based on token count and document structure.

This function provides a convenient interface to the MarkdownSplitter class, creating chunks that respect both markdown structure and token limits.

Parameters:

text (str) – The Markdown text to split
max_tokens (int) – Maximum number of tokens allowed per chunk
min_tokens (int) – Minimum number of desired tokens in a chunk. Default is 2000. Set to -1 to disable.

Returns:

List of text chunks, each within the token limit

Return type:

List[str]

class ai_workflows.document_utilities.ExcelDocumentConverter

Bases: object

Utility class to convert Excel files to Markdown tables (if they don’t have any images or charts).

class ExcelContent(filepath: str)

Bases: object

Class for representing Excel file content.

class TableRange(start_row: int, end_row: int, start_col: int, end_col: int, has_header: bool = True, is_pivot_table: bool = False)

Bases: object

Represents a contiguous table range in a worksheet.

__init__(start_row: int, end_row: int, start_col: int, end_col: int, has_header: bool = True, is_pivot_table: bool = False) → None

__init__(filepath: str)

Initialize the Excel content object.

Parameters:

filepath (str) – Path to the Excel file.

static find_tables(sheet: Worksheet) → List[TableRange]

Identify contiguous table ranges in a worksheet.

Parameters:

sheet (Worksheet) – Worksheet object to analyze.

Returns:

List of TableRange objects representing the identified table ranges.

Return type:

List[TableRange]
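
One simple way to think about contiguous table ranges is sketched below on a plain list-of-lists grid instead of a real Worksheet. Everything here is an assumption for illustration: the function name find_row_blocks, the use of None for empty cells, the rule that a fully empty row separates tables, and the 1-based indices. The TableRange dataclass only mirrors the constructor fields documented above; the library's actual detection logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TableRange:
    """Stand-in with the same fields as the documented TableRange."""
    start_row: int
    end_row: int
    start_col: int
    end_col: int
    has_header: bool = True
    is_pivot_table: bool = False


def find_row_blocks(grid: Sequence[Sequence[Optional[object]]]) -> List[TableRange]:
    """Group consecutive non-empty rows into ranges, split by fully empty rows."""
    tables: List[TableRange] = []
    block_start: Optional[int] = None

    def close_block(first: int, last: int) -> None:
        # Column bounds are the leftmost and rightmost non-empty cells in the block.
        cols = [c for r in range(first, last + 1)
                for c, v in enumerate(grid[r]) if v is not None]
        tables.append(TableRange(first + 1, last + 1, min(cols) + 1, max(cols) + 1))

    for r, row in enumerate(grid):
        empty = all(v is None for v in row)
        if not empty and block_start is None:
            block_start = r
        elif empty and block_start is not None:
            close_block(block_start, r - 1)
            block_start = None
    if block_start is not None:
        close_block(block_start, len(grid) - 1)
    return tables
```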

has_unsupported_content() → Tuple[bool, str]

Check if workbook contains content that we don’t support. Only checks for images and charts, allowing all other formatting to be quietly lost in conversion.

Returns:

Tuple indicating if PDF conversion is needed and the reason why.

Return type:

Tuple[bool, str]

static convert_excel_to_markdown(excel_path: str, include_hidden_sheets: bool = False, lose_unsupported_content: bool = False) → Tuple[bool, str]

Convert Excel file to Markdown if possible, otherwise indicate PDF conversion needed.

Parameters:

excel_path (str) – Path to the Excel file.
include_hidden_sheets (bool) – Whether to include hidden sheets in the conversion. Default is False.
lose_unsupported_content (bool) – Whether to quietly lose unsupported content in the conversion (if False, will return failure when file contains images and/or charts). Default is False.

Returns:

Tuple indicating if conversion was successful and the Markdown text.

Return type:

Tuple[bool, str]
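
For a sense of what the Markdown output for one table might look like, the sketch below renders a list of rows as a pipe table. The function name rows_to_markdown_table and its handling of empty cells, pipes, and header-less tables are assumptions made for illustration; the converter's real output format is not specified here.

```python
from typing import Optional, Sequence


def rows_to_markdown_table(
    rows: Sequence[Sequence[Optional[object]]], has_header: bool = True
) -> str:
    """Render rows as a Markdown pipe table (illustrative only)."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row: Sequence[Optional[object]]) -> str:
        # Empty cells become blank; pipes are escaped so they do not split cells.
        cells = ["" if v is None else str(v).replace("|", "\\|").replace("\n", " ")
                 for v in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    # Pipe tables need a header row, so use a blank one when there is none.
    header = rows[0] if has_header else [""] * width
    body = rows[1:] if has_header else rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)
```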

class ai_workflows.document_utilities.MarkdownSplitter(count_tokens: Callable[[str], int], max_tokens: int, min_tokens: int = 2000)

Bases: object

Split Markdown text into chunks while preserving document structure.

__init__(count_tokens: Callable[[str], int], max_tokens: int, min_tokens: int = 2000)

Initialize the Markdown splitter with token counting function and maximum tokens.

This class splits Markdown text into chunks while preserving the document structure and ensuring each chunk stays within a specified token limit.

Parameters:

count_tokens (Callable[[str], int]) – Function that takes a string and returns its token count
max_tokens (int) – Maximum number of tokens allowed per chunk
min_tokens (int) – Minimum number of tokens desired per chunk. Defaults to 2000. Set to -1 to disable.

split_text(text: str) → List[str]

Split Markdown text recursively according to heading hierarchy and structure.

This function splits text using a hierarchical approach, starting with highest level headers and progressively moving to finer-grained splits until all chunks are within the token limit.

Parameters:

text (str) – The Markdown text to split

Returns:

List of text chunks, each within the token limit

Return type:

List[str]
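
The hierarchical approach described above can be sketched roughly as follows. This is a simplified illustration, not the MarkdownSplitter implementation: the function name split_by_headings is hypothetical, the min_tokens merging step is left out, and the paragraph fallback used once headings run out may still return chunks over the limit.

```python
import re
from typing import Callable, List


def split_by_headings(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
    level: int = 1,
) -> List[str]:
    """Recursively split on '#', then '##', and so on, until chunks fit."""
    if count_tokens(text) <= max_tokens:
        return [text]
    if level > 6:
        # No heading levels left: fall back to blank-line paragraph breaks.
        return [p for p in text.split("\n\n") if p.strip()]
    # Zero-width split just before each heading of exactly this level.
    boundary = re.compile(rf"^(?={'#' * level} )", re.MULTILINE)
    parts = [p for p in boundary.split(text) if p.strip()]
    chunks: List[str] = []
    for part in parts:
        # Parts that are still too large are split at the next heading level.
        chunks.extend(split_by_headings(part, count_tokens, max_tokens, level + 1))
    return chunks
```

Any callable that maps a string to a token count can be passed as count_tokens; a whitespace word count such as `lambda s: len(s.split())` is a crude stand-in for a real tokenizer.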

class ai_workflows.document_utilities.PDFDocumentConverter(llm_interface: LLMInterface | None = None, pdf_image_dpi: int = 150, pdf_image_max_bytes: int = 4194304, max_parallel_requests: int = 5)

Bases: object

Utility class for converting PDF files to Markdown.

__init__(llm_interface: LLMInterface | None = None, pdf_image_dpi: int = 150, pdf_image_max_bytes: int = 4194304, max_parallel_requests: int = 5)

Initialize for converting PDF files.

Parameters:

llm_interface (Optional[LLMInterface]) – LLM interface for interacting with LLMs in AI workflows. Defaults to None, in which case no LLM is used to convert PDF files to Markdown.
pdf_image_dpi (int) – DPI to use for rendering PDF images. Default is 150, which is generally plenty for LLM applications.
pdf_image_max_bytes (int) – Maximum size in bytes for an image to be processed. Default is 4194304 (4 MiB).
max_parallel_requests (int) – Maximum number of parallel requests to make when accessing LLM. Default is 5.

static pdf_to_images(pdf_path: str, dpi: int = 150) → list[Image]

Convert a PDF to a list of PIL Images.

This function opens a PDF file, renders each page as an image at the specified DPI, and returns a list of these images.

Parameters:

pdf_path (str) – Path to the PDF file.
dpi (int) – DPI to use for rendering the PDF. Default is 150, which is generally plenty for LLM applications.

Returns:

List of PIL Images representing the pages within the PDF.

Return type:

list[Image.Image]
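
To make the DPI setting concrete, the sketch below computes the pixel dimensions of a rendered page. It relies on the fact that the default PDF user-space unit is 1/72 inch; the function name rendered_size is hypothetical, and the rounding used by the actual renderer may differ by a pixel.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Approximate pixel size of a page of the given size in points at `dpi`."""
    zoom = dpi / POINTS_PER_INCH
    return round(width_pt * zoom), round(height_pt * zoom)
```

For example, a US Letter page (612 x 792 points) comes out at 1275 x 1650 pixels at the default 150 DPI.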

static pdf_to_images_and_text(pdf_path: str, dpi: int = 150) → list[tuple[Image, str]]

Convert a PDF to a list of PIL Images, each with extracted text.

Parameters:

pdf_path (str) – Path to the PDF file.
dpi (int) – DPI to use for rendering the PDF. Default is 150, which is generally plenty for LLM applications.

Returns:

List of tuples, one for each page, each with an image and the text extracted from that page.

Return type:

list[tuple[Image.Image, str]]

pdf_to_json(pdf_path: str, json_context: str, json_job: str, json_output_spec: str, json_output_schema: str | None = '', use_text: bool = False) → list[dict]

Process a PDF file page-by-page to extract elements and output JSON text.

This function reads a PDF file, converts pages to images, processes each image with an LLM, and collects the parsed results into a list with one entry per page.

Parameters:

pdf_path (str) – Path to the PDF file.
json_context (str) – Context for the LLM prompt (e.g., “The file contains a survey instrument.”).
json_job (str) – Job to do for the LLM prompt (e.g., “Your job is to extract each question or form field included on the page.”). In this case, the job will be to process each page, one at a time.
json_output_spec (str) – Output format for the LLM prompt (e.g., “Respond in correctly-formatted JSON with a single key named questions that is a list of dicts, one for each question or form field, each with the keys listed below…”).
json_output_schema (str) – JSON schema for output validation. Defaults to “”, which auto-generates a validation schema based on the json_output_spec. If explicitly set to None, will skip JSON validation.
use_text (bool) – Whether to use extracted text to help the LLM with extracting text from the images. Default is False.

Returns:

List of parsed results from all pages, one per page, in order.

Return type:

list[dict]
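
A common pattern for processing pages in parallel while still returning results in page order is sketched below with a thread pool. Whether the library uses threads is not stated here, so treat this as an assumption; process_pages_in_order and process_page are hypothetical names, with process_page standing in for a single per-page LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_pages_in_order(
    pages: Sequence[T],
    process_page: Callable[[T], R],
    max_parallel_requests: int = 5,
) -> List[R]:
    """Run process_page on every page with bounded concurrency."""
    # max_workers caps how many calls are in flight at once; Executor.map
    # yields results in input order, regardless of which call finishes first.
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        return list(pool.map(process_page, pages))
```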

pdf_to_markdown(pdf_path: str, use_text: bool = False) → str

Process a PDF file to extract elements and output Markdown text.

This function reads a PDF file, converts it to images, processes each image with an LLM, and assembles the returned elements into a single Markdown output. If no LLM is available, the function falls back to PyMuPDF4LLM for Markdown conversion.

Parameters:

pdf_path (str) – Path to the PDF file.
use_text (bool) – Whether to use extracted text to help the LLM with extracting text from the images. Default is False.

Returns:

Markdown text.

Return type:

str

class ai_workflows.document_utilities.UnstructuredDocumentConverter(heading_style: str = 'atx')

Bases: object

Convert various document types to markdown using Unstructured.

class DocumentElement(type: str, content: str, metadata: Dict | None = None, level: int | None = None)

Bases: object

Represents a processed document element.

__init__(type: str, content: str, metadata: Dict | None = None, level: int | None = None) → None

__init__(heading_style: str = 'atx')

Initialize the Unstructured document converter.

Parameters:

heading_style (str) – ‘atx’ for # style or ‘setext’ for underline style.
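
The difference between the two heading styles can be shown with a small sketch. The function name render_heading is hypothetical, and falling back to ATX for levels 3 and deeper is an assumption; it reflects the fact that setext headings exist only for levels 1 and 2.

```python
def render_heading(text: str, level: int, heading_style: str = "atx") -> str:
    """Render a heading in ATX ('# Title') or setext (underlined) style."""
    if heading_style == "setext" and level in (1, 2):
        # Level 1 is underlined with '=', level 2 with '-'.
        underline = ("=" if level == 1 else "-") * max(len(text), 3)
        return f"{text}\n{underline}"
    # ATX style, also used here when setext cannot express the level.
    return f"{'#' * level} {text}"
```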

static content_with_links(element: DocumentElement) → str

Convert content to Markdown with links as needed.

Parameters:

element (DocumentElement) – DocumentElement object.

Returns:

String with content, including hyperlinks.

Return type:

str

convert_to_markdown(file_path: str | Path) → str

Convert document to Markdown format.

Parameters:

file_path (Union[str, Path]) – Path to input file.

Returns:

Markdown formatted string.

Return type:

str