File Loader Action Block

The File Loader action block lets you load and process various file formats from URLs, extracting text content for use in workflows. It’s ideal for document processing, content analysis, and data extraction tasks.

📤

Use the File Loader block in Indite’s Flow Builder to extract text from files for seamless workflow integration!

Features

Multiple File Formats: Supports PDF, DOC, DOCX, TXT, JSON, and MD files.
URL-based Loading: Loads files directly from web URLs.
Text Extraction: Extracts readable text from documents.
Size Limits: Handles files up to 10MB.
Response Mapping: Maps extracted content to workflow variables.
Real-time Processing: Processes files on demand during workflow execution.

Supported File Types

Document Formats

PDF: Portable Document Format files.
DOC: Microsoft Word 97-2003 documents.
DOCX: Microsoft Word 2007+ documents.
TXT: Plain text files.
MD: Markdown files.
JSON: JavaScript Object Notation files.

File Size Limits

Maximum Size: 10MB per file.
Recommended: Under 5MB for optimal performance.
Large Files: Split or compress documents that exceed the limit before loading them.
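
The limits above can be checked before a URL is handed to the block. The following is a minimal sketch, not part of the File Loader itself: the helper name classifyFileSize is hypothetical, and it assumes "10MB" means 10 * 1024 * 1024 bytes, which the platform may count differently.

```javascript
// Assumption: 1 MB is treated as 1024 * 1024 bytes.
const MAX_BYTES = 10 * 1024 * 1024;        // documented maximum
const RECOMMENDED_BYTES = 5 * 1024 * 1024; // documented recommendation

// Hypothetical helper: classify a file size against the documented limits.
function classifyFileSize(bytes) {
  if (!Number.isFinite(bytes) || bytes < 0) return "invalid";
  if (bytes > MAX_BYTES) return "too-large";
  if (bytes > RECOMMENDED_BYTES) return "allowed-but-slow";
  return "ok";
}
```

A size of 6 * 1024 * 1024 bytes, for example, is classified as "allowed-but-slow": it loads, but sits above the recommended size.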

Configuration

Basic Setup

File URL: Specify a direct URL to the file to load.

https://example.com/documents/report.pdf
https://storage.googleapis.com/bucket/presentation.docx
https://github.com/user/repo/raw/main/README.md

🌐

Configure the File Loader block with a direct URL to your file for seamless processing.

URL Requirements:

Must be publicly accessible or properly authenticated.
Should return an appropriate Content-Type header for the file.
Must link directly to the file (not a preview page).
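
Some of these requirements can be pre-checked before the block runs. The sketch below is a heuristic under stated assumptions: the function name checkFileUrl is hypothetical, and it only inspects the URL text (scheme and file extension). It cannot confirm accessibility, authentication, or response headers, and a valid direct link without a file extension would be rejected.

```javascript
// Extensions taken from the supported formats listed in this document.
const SUPPORTED_EXTENSIONS = ["pdf", "doc", "docx", "txt", "json", "md"];

// Hypothetical helper: a text-only sanity check of a file URL.
function checkFileUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl); // throws on anything that is not an absolute URL
  } catch {
    return { ok: false, reason: "not a valid absolute URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "only http and https URLs are expected" };
  }
  // Look at the last path segment only, ignoring query string and fragment.
  const lastSegment = url.pathname.split("/").pop();
  const dot = lastSegment.lastIndexOf(".");
  const extension = dot === -1 ? "" : lastSegment.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    return { ok: false, reason: "path does not end in a supported extension" };
  }
  return { ok: true, extension };
}
```

With this check, https://example.com/documents/report.pdf passes, while a preview-style link such as https://example.com/view?id=1 is flagged because its path has no supported extension.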

Use Cases

Document Analysis

File URL: https://company.com/annual-report.pdf
Purpose: Extract text for financial analysis.
Next Steps: Process with AI models for insights.

Content Migration

File URL: https://old-system.com/legacy-document.doc
Purpose: Convert legacy documents to a new format.
Next Steps: Save to a modern document management system.

Research Processing

File URL: https://research-portal.org/paper.pdf
Purpose: Extract abstract and key findings.
Next Steps: Summarize and categorize research.

Configuration Documentation

File URL: https://api.service.com/docs/config.json
Purpose: Load configuration settings.
Next Steps: Parse and apply settings to the system.

Text Extraction Process

PDF Processing

Extracts text from all pages.
Preserves basic formatting structure.
Handles text-based PDFs only; scanned (image-only) PDFs contain no extractable text.
Maintains paragraph breaks and sections.

Word Document Processing

Extracts text from DOC/DOCX files.
Preserves document structure.
Includes headers, paragraphs, and lists.
Filters out formatting metadata.

Plain Text Processing

Loads content directly.
Preserves original formatting.
Handles various text encodings.
Maintains line breaks and spacing.

JSON Processing

Parses JSON structure.
Extracts text values from objects.
Maintains hierarchical information.
Converts to readable text format.
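
To illustrate what "hierarchical information in a readable text format" could look like, here is a small sketch that flattens parsed JSON into one "path: value" line per leaf. This is an assumption about a plausible output shape, not the block's documented format; jsonToText and the sample data are hypothetical.

```javascript
// Hypothetical helper: flatten a parsed JSON value into "path: value" lines.
function jsonToText(value, path = "") {
  // Leaves (strings, numbers, booleans, null) become a single line.
  if (value === null || typeof value !== "object") {
    return [`${path || "(root)"}: ${String(value)}`];
  }
  // Arrays use [index] suffixes, objects use dotted keys.
  const entries = Array.isArray(value)
    ? value.map((item, i) => [`${path}[${i}]`, item])
    : Object.entries(value).map(([key, item]) => [path ? `${path}.${key}` : key, item]);
  return entries.flatMap(([childPath, item]) => jsonToText(item, childPath));
}

const sample = JSON.parse('{"app": {"name": "demo", "ports": [80, 443]}, "debug": false}');
const text = jsonToText(sample).join("\n");
// text contains four lines:
// app.name: demo
// app.ports[0]: 80
// app.ports[1]: 443
// debug: false
```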

Markdown Processing

Parses Markdown syntax.
Converts to plain text.
Preserves heading structure.
Maintains list formatting.

Response Structure

The File Loader returns structured data:

{
  "content": "Extracted text content from the document...",
  "metadata": {
    "filename": "document.pdf",
    "fileSize": 1024576,
    "fileType": "application/pdf",
    "pageCount": 15,
    "extractedAt": "2024-01-15T10:30:00Z"
  },
  "success": true,
  "processingTime": 2.5
}

📈

Access structured data from the File Loader block to use in your workflows.
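
A downstream script step might read this response defensively. The sketch below assumes the field names shown in the sample response above; the function readLoaderResponse and the summary string format are hypothetical.

```javascript
// Hypothetical helper: turn a File Loader response into text plus a short summary.
function readLoaderResponse(response) {
  if (!response || response.success !== true) {
    return { ok: false, text: "", summary: "File Loader reported a failure" };
  }
  const { filename, fileSize, pageCount } = response.metadata || {};
  // fileSize is assumed to be in bytes, as the sample value suggests.
  const sizeKb = typeof fileSize === "number" ? (fileSize / 1024).toFixed(1) : "?";
  // pageCount may be absent for formats without pages (an assumption).
  const pages = typeof pageCount === "number" ? `, ${pageCount} pages` : "";
  return {
    ok: true,
    text: response.content || "",
    summary: `${filename || "unknown file"} (${sizeKb} KB${pages})`,
  };
}
```

Checking success first keeps later steps from operating on an empty or missing content field.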

Response Mapping

Map file content to workflow variables:

Basic Content Extraction

content → {{documentText}}
metadata.filename → {{fileName}}
metadata.fileSize → {{fileSize}}
metadata.pageCount → {{pageCount}}

🔗

Map extracted content to variables for downstream processing in your workflows.

Advanced Processing

// Split content into sections
content.split('\n\n') → {{documentSections}}
 
// Extract specific patterns
content.match(/\b\d{4}-\d{2}-\d{2}\b/g) → {{extractedDates}}
 
// Content analysis
{
  "wordCount": (content.match(/\S+/g) || []).length,
  "characterCount": content.length,
  "hasNumbers": /\d/.test(content)
} → {{contentAnalysis}}

🧠

Perform advanced content processing to extract and analyze data in your workflows.
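
The mapping expressions above use the arrow notation of the Flow Builder. As plain JavaScript, the same three operations could be sketched as a single function. The name analyzeContent is hypothetical, and the returned keys simply mirror the variable names used above.

```javascript
// Hypothetical helper combining the three mappings into one plain function.
function analyzeContent(content) {
  const words = content.match(/\S+/g) || []; // whitespace-separated tokens
  return {
    // Sections are separated by one or more blank lines; empty sections are dropped.
    documentSections: content.split(/\n{2,}/).map((s) => s.trim()).filter(Boolean),
    // ISO-style dates (YYYY-MM-DD); match() returns null when nothing is found.
    extractedDates: content.match(/\b\d{4}-\d{2}-\d{2}\b/g) || [],
    contentAnalysis: {
      wordCount: words.length,
      characterCount: content.length,
      hasNumbers: /\d/.test(content),
    },
  };
}
```

Counting tokens with /\S+/g avoids the inflated counts that splitting on a single space produces when the text contains line breaks or repeated spaces.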

Integration Patterns

Document Processing Pipeline

1. File Loader → Extract text from document
2. Text Preprocessing → Clean and normalize text
3. AI Analysis → Analyze content with language models
4. Data Extraction → Extract structured information
5. Storage → Save processed results
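
The pipeline above can be pictured as a chain of functions, each receiving the state produced by the previous step. This is an illustrative sketch only: runPipeline and the stage functions are hypothetical stand-ins, the "AI Analysis" stage is replaced by a simple word count, and storage is left to the caller.

```javascript
// Hypothetical stages; each takes the running state and returns an updated copy.
const defaultStages = [
  // 2. Text Preprocessing: collapse whitespace.
  (state) => ({ ...state, text: state.content.replace(/\s+/g, " ").trim() }),
  // 3. AI Analysis: placeholder that only counts words; a real flow would call a model here.
  (state) => ({ ...state, wordCount: (state.text.match(/\S+/g) || []).length }),
  // 4. Data Extraction: pull ISO-style dates as an example of structured output.
  (state) => ({ ...state, dates: state.text.match(/\b\d{4}-\d{2}-\d{2}\b/g) || [] }),
];

// 1. File Loader output goes in; the final state comes out, ready for 5. Storage.
function runPipeline(loaderResponse, stages = defaultStages) {
  if (!loaderResponse || loaderResponse.success !== true) {
    throw new Error("File Loader step did not succeed");
  }
  const initial = { content: loaderResponse.content || "" };
  return stages.reduce((state, stage) => stage(state), initial);
}

const result = runPipeline({ success: true, content: "Signed  on 2024-01-15.\nRenewed 2025-01-15." });
```

Keeping each stage a pure function makes it easy to swap one out, for example replacing the placeholder analysis with a call to a language model block.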

Content Analysis Workflow

1. File Loader → Load document content
2. Sentiment Analysis → Analyze document sentiment
3. Topic Extraction → Identify key topics
4. Summary Generation → Create document summary
5. Categorization → Classify document type

Research Pipeline

1. File Loader → Load research papers
2. Abstract Extraction → Extract key sections
3. Citation Analysis → Identify references
4. Knowledge Extraction → Extract key findings
5. Database Update → Store research data

Error Handling

Common issues and solutions:

URL Access Errors

"File not accessible"

Verify URL is correct and publicly accessible.
Check if authentication is required.
Ensure file hasn’t been moved or deleted.

File Format Errors

"Unsupported file type"

Verify file extension is supported (PDF, DOC, DOCX, TXT, JSON, MD).
Check actual file format matches extension.
Ensure file isn’t corrupted.

Size Limit Errors

"File too large"

Check file size is under 10MB.
Consider compressing the file.
Split large documents into smaller parts.

Processing Errors

"Text extraction failed"

Verify file isn’t password protected.
Check if PDF is text-based (not scanned).
Ensure file format is valid and not corrupted.

🔍

Troubleshoot File Loader issues by verifying URLs, file formats, and sizes.

Best Practices

URL Management

Direct Links: Use direct file URLs, not preview pages.
Stable URLs: Ensure URLs remain accessible over time.
Authentication: Handle authenticated URLs appropriately.
Error Handling: Implement retry logic for temporary failures.
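
Retry logic for temporary failures is commonly written with exponential backoff. The sketch below is one generic way to do it, not a File Loader feature: withRetry, backoffDelays, and the default timings are all assumptions, and the task argument stands for whatever function triggers the load.

```javascript
// Hypothetical helper: delays (in ms) to wait between attempts, doubling up to a cap.
function backoffDelays(maxAttempts, baseMs = 500, capMs = 8000) {
  const delays = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** (attempt - 1)));
  }
  return delays;
}

// Hypothetical helper: run an async task, retrying on any thrown error.
// Assumes maxAttempts >= 1. A real flow would likely retry only on
// timeouts or 5xx responses rather than on every error.
async function withRetry(task, maxAttempts = 3,
                         sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))) {
  const delays = backoffDelays(maxAttempts);
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt; no wait after the final one.
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

With the defaults, backoffDelays(4) yields waits of 500, 1000, and 2000 ms between the four attempts.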

File Optimization

Size Management: Keep files under recommended limits.
Format Selection: Choose appropriate formats for content.
Quality: Ensure files are high-quality and not corrupted.
Structure: Use well-structured documents for better extraction.

Processing Efficiency

Caching: Cache processed content for repeated use.
Parallel Processing: Process multiple files concurrently.
Batch Operations: Group similar file processing tasks.
Monitoring: Track processing times and success rates.
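
Caching processed content can be as simple as remembering results by URL. The following sketch assumes a caller-supplied loadFn that returns the extracted content for a URL; createContentCache and its maxEntries default are hypothetical, and the cache never expires entries by time, so it suits files that do not change.

```javascript
// Hypothetical helper: wrap a loader function with a small in-memory cache.
function createContentCache(loadFn, maxEntries = 50) {
  const cache = new Map(); // Map preserves insertion order, oldest first
  return function getContent(url) {
    if (cache.has(url)) return cache.get(url);
    const result = loadFn(url); // may also be a Promise; it is cached as-is
    cache.set(url, result);
    if (cache.size > maxEntries) {
      // Evict the oldest inserted entry to bound memory use.
      cache.delete(cache.keys().next().value);
    }
    return result;
  };
}
```

Calling the returned function twice with the same URL invokes loadFn only once.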

Security Considerations

URL Validation: Validate URLs before processing.
Content Scanning: Scan extracted content for sensitive data.
Access Control: Ensure proper access controls on source files.
Data Handling: Follow data privacy guidelines for processed content.

🔒

Follow best practices for URL management, file optimization, and security in the File Loader block.
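
As one example of content scanning, extracted text can be checked for email addresses before it is passed on. This is a deliberately simple sketch: redactEmails and its pattern are assumptions, the pattern is not a full address validator, and real sensitive-data scanning usually covers many more categories.

```javascript
// Simplified pattern for typical email addresses; not a full RFC 5322 validator.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical helper: replace email addresses and report how many were found.
function redactEmails(content) {
  let count = 0;
  const redacted = content.replace(EMAIL_PATTERN, () => {
    count += 1;
    return "[redacted-email]";
  });
  return { redacted, count };
}
```

The count makes it possible to flag a document for review instead of silently altering it.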

Advanced Features

Content Preprocessing

// Text cleaning and normalization
function preprocessText(content) {
  return content
    .replace(/[^\w\s.,!?]/g, '')    // Remove special characters (also strips non-ASCII letters)
    .replace(/\s+/g, ' ')           // Then collapse whitespace, including gaps left by removals
    .trim();                        // Remove leading/trailing spaces
}

Metadata Extraction

// Extract document metadata
{
  "title": extractTitle(content),
  "author": extractAuthor(content),
  "createdDate": extractDate(content),
  "language": detectLanguage(content),
  "topics": extractTopics(content)
}
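
The helper functions named in the snippet above are not defined in this document. As a rough sketch of what two of them might look like, the versions below use simple text heuristics. Both are assumptions: a title is taken to be the first non-empty line, and a date is taken to be the first YYYY-MM-DD string, neither of which holds for every document.

```javascript
// Hypothetical heuristic: treat the first non-empty line as the title.
function extractTitle(content) {
  const firstLine = content.split(/\r?\n/).find((line) => line.trim() !== "");
  return firstLine ? firstLine.trim() : null;
}

// Hypothetical heuristic: return the first ISO-style date (YYYY-MM-DD) in the text.
function extractDate(content) {
  const match = content.match(/\b\d{4}-\d{2}-\d{2}\b/);
  return match ? match[0] : null;
}

// Combine the heuristics into a partial metadata object.
function extractBasicMetadata(content) {
  return { title: extractTitle(content), createdDate: extractDate(content) };
}
```

Author, language, and topic detection generally need more than pattern matching and are better handled by a dedicated library or an AI step.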

Content Analysis

// Analyze document structure
{
  "sections": identifySections(content),
  "headings": extractHeadings(content),
  "tables": identifyTables(content),
  "images": countImages(metadata),
  "references": extractReferences(content)
}

Performance Optimization

Loading Strategies

Concurrent Loading: Load multiple files simultaneously.
Progressive Processing: Process files as they load.
Caching: Cache frequently accessed files.
Compression: Use compressed formats when possible.

Memory Management

Streaming: Process large files in chunks.
Cleanup: Release memory after processing.
Limits: Set appropriate memory limits.
Monitoring: Track memory usage patterns.
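
Once text has been extracted, working in chunks keeps each downstream step small. The sketch below splits extracted text on paragraph boundaries into pieces of at most maxChars characters. It is an assumption-laden illustration: chunkText is hypothetical, it operates on the already extracted text rather than streaming the download itself, and the 2000-character default is arbitrary.

```javascript
// Hypothetical helper: split text into chunks of at most maxChars characters,
// preferring paragraph boundaries (blank lines).
function chunkText(content, maxChars = 2000) {
  if (!Number.isInteger(maxChars) || maxChars <= 0) {
    throw new RangeError("maxChars must be a positive integer");
  }
  const chunks = [];
  let current = "";
  for (const paragraph of content.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
      continue;
    }
    if (current) chunks.push(current);
    // A single paragraph longer than the limit is cut into fixed-size slices.
    let rest = paragraph;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on paragraphs first means a chunk rarely ends mid-sentence, which tends to matter when chunks are sent to a language model.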

Troubleshooting

Common Issues

"Connection timeout"

Check network connectivity.
Verify server response times.
Consider increasing timeout values.
Implement retry mechanisms.

"Invalid file format"

Verify file extension matches content.
Check if file is corrupted.
Ensure file is a supported format.
Test with different files.

"Empty content extracted"

Check if file contains extractable text.
Verify file isn’t password protected.
Ensure file format is properly structured.
Test with known good files.

Debugging Steps

Verify URL: Test URL accessibility in a browser.
Check File: Download and verify file integrity.
Test Extraction: Try with simpler test files.
Monitor Logs: Review processing logs for errors.
Network Check: Verify connectivity and firewall rules.

Node Display

The File Loader node displays:

Configuration Status: Shows "Configure..." if URL is not set.
File Info: Displays "File Loader" with truncated URL.
URL Display: Shows first 20 characters of URL with "..." if longer.
Status: Indicates processing status and file information.
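
The truncation rule described above can be expressed in a few lines. This is a sketch of the stated behavior (first 20 characters, then "..."), not the Flow Builder's actual rendering code; truncateUrl and nodeLabel are hypothetical names, and the exact label layout is an assumption.

```javascript
// Hypothetical helper: keep the first `limit` characters and append "..." if longer.
function truncateUrl(url, limit = 20) {
  return url.length > limit ? `${url.slice(0, limit)}...` : url;
}

// Hypothetical helper: text shown on the node, depending on whether a URL is set.
function nodeLabel(url) {
  if (!url) return "Configure...";
  return `File Loader: ${truncateUrl(url)}`;
}
```

For https://example.com/documents/report.pdf, truncateUrl returns "https://example.com/...".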

Example Workflows

Document Summarization

1. File Loader → Load research paper
2. Text Preprocessing → Clean extracted text
3. AI Summarization → Generate summary
4. Quality Check → Validate summary quality
5. Output → Deliver summary to user

Content Migration

1. File Loader → Load legacy documents
2. Format Conversion → Convert to standard format
3. Content Validation → Verify content integrity
4. Metadata Extraction → Extract document properties
5. Storage → Save to new system

Compliance Checking

1. File Loader → Load policy documents
2. Content Analysis → Analyze for compliance keywords
3. Gap Analysis → Identify missing requirements
4. Report Generation → Create compliance report
5. Notification → Alert compliance team
