File Loader Action Block
The File Loader action block lets you load and process various file formats from URLs, extracting text content for use in workflows. It's ideal for document processing, content analysis, and data extraction tasks.
Use the File Loader block in Indite's Flow Builder to extract text from files for seamless workflow integration!
Features
- Multiple File Formats: Supports PDF, DOC, DOCX, TXT, JSON, and MD files.
- URL-based Loading: Loads files directly from web URLs.
- Text Extraction: Extracts readable text from documents.
- Size Limits: Handles files up to 10MB.
- Response Mapping: Maps extracted content to workflow variables.
- Real-time Processing: Processes files on-demand during workflow execution.
Supported File Types
Document Formats
- PDF: Portable Document Format files.
- DOC: Microsoft Word 97-2003 documents.
- DOCX: Microsoft Word 2007+ documents.
- TXT: Plain text files.
- MD: Markdown files.
- JSON: JavaScript Object Notation files.
File Size Limits
- Maximum Size: 10MB per file.
- Recommended: Under 5MB for optimal performance.
- Large Files: Split or compress large documents.
Configuration
Basic Setup
File URL: Specify a direct URL to the file to load.
https://example.com/documents/report.pdf
https://storage.googleapis.com/bucket/presentation.docx
https://github.com/user/repo/raw/main/README.md
Configure the File Loader block with a direct URL to your file for seamless processing.
URL Requirements:
- Must be publicly accessible or properly authenticated.
- Should return appropriate content-type headers.
- Must link directly to the file (not a preview page).
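A quick pre-flight check in your own code can catch most URL problems before the block runs. The helpers below are an illustrative sketch (the function names and the HEAD probe are assumptions, not part of the File Loader API; `fetch` requires a modern runtime such as Node 18+):

```javascript
// Supported extensions per the File Loader docs
const SUPPORTED = ['pdf', 'doc', 'docx', 'txt', 'json', 'md'];

// Check that the URL points directly at a supported file,
// not at a preview or viewer page.
function isDirectFileUrl(url) {
  const path = new URL(url).pathname;            // throws on malformed URLs
  const ext = path.split('.').pop().toLowerCase();
  return SUPPORTED.includes(ext);
}

// Optionally probe the URL with a HEAD request to confirm it is
// reachable and returns a sensible content-type header.
async function probeUrl(url) {
  const res = await fetch(url, { method: 'HEAD' });
  return {
    ok: res.ok,
    contentType: res.headers.get('content-type'),
  };
}
```

The extension check rejects preview pages such as `https://example.com/view?id=42`, which have no file extension in their path.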
Use Cases
Document Analysis
File URL: https://company.com/annual-report.pdf
Purpose: Extract text for financial analysis.
Next Steps: Process with AI models for insights.
Content Migration
File URL: https://old-system.com/legacy-document.doc
Purpose: Convert legacy documents to a new format.
Next Steps: Save to a modern document management system.
Research Processing
File URL: https://research-portal.org/paper.pdf
Purpose: Extract abstract and key findings.
Next Steps: Summarize and categorize research.
Configuration Documentation
File URL: https://api.service.com/docs/config.json
Purpose: Load configuration settings.
Next Steps: Parse and apply settings to the system.
Text Extraction Process
PDF Processing
- Extracts text from all pages.
- Preserves basic formatting structure.
- Handles text-based PDFs (not scanned images).
- Maintains paragraph breaks and sections.
Word Document Processing
- Extracts text from DOC/DOCX files.
- Preserves document structure.
- Includes headers, paragraphs, and lists.
- Filters out formatting metadata.
Plain Text Processing
- Loads content directly.
- Preserves original formatting.
- Handles various text encodings.
- Maintains line breaks and spacing.
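Encoding handling can be pictured as a decode-with-fallback step. The helper below is illustrative only; the block performs its own encoding detection internally:

```javascript
// Decode raw bytes into text, trying strict UTF-8 first and
// falling back to Latin-1 (windows-1252) if the bytes are not
// valid UTF-8.
function decodeText(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder('latin1').decode(bytes);
  }
}
```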
JSON Processing
- Parses JSON structure.
- Extracts text values from objects.
- Maintains hierarchical information.
- Converts to readable text format.
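The JSON steps above amount to a recursive walk that collects values while keeping key paths for hierarchy. A simplified sketch, not the block's actual algorithm:

```javascript
// Flatten a parsed JSON value into readable "path: value" lines,
// preserving the hierarchy through dotted key paths.
function jsonToText(value, path = '') {
  if (value === null || typeof value !== 'object') {
    return [`${path}: ${String(value)}`];
  }
  const lines = [];
  for (const [key, child] of Object.entries(value)) {
    const childPath = path ? `${path}.${key}` : key;
    lines.push(...jsonToText(child, childPath));
  }
  return lines;
}

const text = jsonToText({ app: { name: 'demo', port: 8080 } }).join('\n');
// text is "app.name: demo\napp.port: 8080"
```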
Markdown Processing
- Parses Markdown syntax.
- Converts to plain text.
- Preserves heading structure.
- Maintains list formatting.
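A rough sketch of Markdown-to-plain-text conversion using regular expressions (illustrative only; real Markdown parsing handles many more cases):

```javascript
// Convert Markdown to plain text: strip heading markers and
// emphasis, keep link text, leave line breaks and list lines intact.
function markdownToText(md) {
  return md
    .replace(/^#{1,6}\s+/gm, '')                // heading markers
    .replace(/\*\*([^*]+)\*\*/g, '$1')          // bold
    .replace(/\*([^*]+)\*/g, '$1')              // italic
    .replace(/`([^`]+)`/g, '$1')                // inline code
    .replace(/\[([^\]]+)\]\([^)]+\)/g, '$1');   // links: keep link text
}
```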
Response Structure
The File Loader returns structured data:
{
"content": "Extracted text content from the document...",
"metadata": {
"filename": "document.pdf",
"fileSize": 1024576,
"fileType": "application/pdf",
"pageCount": 15,
"extractedAt": "2024-01-15T10:30:00Z"
},
"success": true,
"processingTime": 2.5
}
Access structured data from the File Loader block to use in your workflows.
Response Mapping
Map file content to workflow variables:
Basic Content Extraction
content → {{documentText}}
metadata.filename → {{fileName}}
metadata.fileSize → {{fileSize}}
metadata.pageCount → {{pageCount}}
Map extracted content to variables for downstream processing in your workflows.
Advanced Processing
// Split content into sections
content.split('\n\n') → {{documentSections}}
// Extract specific patterns
content.match(/\b\d{4}-\d{2}-\d{2}\b/g) → {{extractedDates}}
// Content analysis
{
"wordCount": content.trim().split(/\s+/).length,
"characterCount": content.length,
"hasNumbers": /\d/.test(content)
} → {{contentAnalysis}}
Perform advanced content processing to extract and analyze data in your workflows.
Integration Patterns
Document Processing Pipeline
1. File Loader → Extract text from document
2. Text Preprocessing → Clean and normalize text
3. AI Analysis → Analyze content with language models
4. Data Extraction → Extract structured information
5. Storage → Save processed results
Content Analysis Workflow
1. File Loader → Load document content
2. Sentiment Analysis → Analyze document sentiment
3. Topic Extraction → Identify key topics
4. Summary Generation → Create document summary
5. Categorization → Classify document type
Research Pipeline
1. File Loader → Load research papers
2. Abstract Extraction → Extract key sections
3. Citation Analysis → Identify references
4. Knowledge Extraction → Extract key findings
5. Database Update → Store research data
Error Handling
Common issues and solutions:
URL Access Errors
"File not accessible"
- Verify URL is correct and publicly accessible.
- Check if authentication is required.
- Ensure file hasn't been moved or deleted.
File Format Errors
"Unsupported file type"
- Verify file extension is supported (PDF, DOC, DOCX, TXT, JSON, MD).
- Check actual file format matches extension.
- Ensure file isn't corrupted.
Size Limit Errors
"File too large"
- Check file size is under 10MB.
- Consider compressing the file.
- Split large documents into smaller parts.
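You can check a remote file's size before loading it by reading the Content-Length header from a HEAD request. A sketch (helper names are illustrative; `fetch` requires a modern runtime such as Node 18+):

```javascript
const MAX_BYTES = 10 * 1024 * 1024;          // 10MB hard limit
const RECOMMENDED_BYTES = 5 * 1024 * 1024;   // recommended ceiling

// Classify a file size against the documented limits.
function checkSize(bytes) {
  if (bytes > MAX_BYTES) return 'too-large';
  if (bytes > RECOMMENDED_BYTES) return 'ok-but-slow';
  return 'ok';
}

// Content-Length from a HEAD request gives the size without
// downloading the file. Some servers omit the header.
async function remoteSize(url) {
  const res = await fetch(url, { method: 'HEAD' });
  const len = res.headers.get('content-length');
  return len ? Number(len) : null;
}
```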
Processing Errors
"Text extraction failed"
- Verify file isn't password protected.
- Check if PDF is text-based (not scanned).
- Ensure file format is valid and not corrupted.
Troubleshoot File Loader issues by verifying URLs, file formats, and sizes.
Best Practices
URL Management
- Direct Links: Use direct file URLs, not preview pages.
- Stable URLs: Ensure URLs remain accessible over time.
- Authentication: Handle authenticated URLs appropriately.
- Error Handling: Implement retry logic for temporary failures.
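Retry logic for temporary failures can be sketched as exponential backoff around the load call. An illustrative helper, not part of the File Loader API:

```javascript
// Retry an async operation with exponential backoff:
// waits 500ms, 1000ms, 2000ms, ... between attempts.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;   // all attempts failed
}
```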
File Optimization
- Size Management: Keep files under recommended limits.
- Format Selection: Choose appropriate formats for content.
- Quality: Ensure files are high-quality and not corrupted.
- Structure: Use well-structured documents for better extraction.
Processing Efficiency
- Caching: Cache processed content for repeated use.
- Parallel Processing: Process multiple files concurrently.
- Batch Operations: Group similar file processing tasks.
- Monitoring: Track processing times and success rates.
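Caching and parallel processing can be combined by memoizing the load promise per URL and fanning out with Promise.all. A sketch in which `loader` is a hypothetical stand-in for the actual load-and-extract step:

```javascript
// Cache keyed by URL; storing the promise (not the result)
// also deduplicates concurrent requests for the same file.
const cache = new Map();

function loadCached(url, loader) {
  if (!cache.has(url)) {
    cache.set(url, loader(url));
  }
  return cache.get(url);
}

// Load several files concurrently, reusing cached results.
function loadAll(urls, loader) {
  return Promise.all(urls.map(url => loadCached(url, loader)));
}
```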
Security Considerations
- URL Validation: Validate URLs before processing.
- Content Scanning: Scan extracted content for sensitive data.
- Access Control: Ensure proper access controls on source files.
- Data Handling: Follow data privacy guidelines for processed content.
Follow best practices for URL management, file optimization, and security in the File Loader block.
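URL validation can be sketched as a scheme check plus a host allow list. The allowed hosts below are hypothetical examples:

```javascript
// Hosts your workflows are permitted to load files from
// (example values only).
const ALLOWED_HOSTS = ['example.com', 'storage.googleapis.com'];

function isSafeUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return false;   // not a valid URL at all
  }
  return url.protocol === 'https:' && ALLOWED_HOSTS.includes(url.hostname);
}
```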
Advanced Features
Content Preprocessing
// Text cleaning and normalization
function preprocessText(content) {
return content
.replace(/\s+/g, ' ') // Normalize whitespace
.replace(/[^\w\s.,!?]/g, '') // Remove special characters
.trim(); // Remove leading/trailing spaces
}
Metadata Extraction
// Extract document metadata
{
"title": extractTitle(content),
"author": extractAuthor(content),
"createdDate": extractDate(content),
"language": detectLanguage(content),
"topics": extractTopics(content)
}
Content Analysis
// Analyze document structure
{
"sections": identifySections(content),
"headings": extractHeadings(content),
"tables": identifyTables(content),
"images": countImages(metadata),
"references": extractReferences(content)
}
Performance Optimization
Loading Strategies
- Concurrent Loading: Load multiple files simultaneously.
- Progressive Processing: Process files as they load.
- Caching: Cache frequently accessed files.
- Compression: Use compressed formats when possible.
Memory Management
- Streaming: Process large files in chunks.
- Cleanup: Release memory after processing.
- Limits: Set appropriate memory limits.
- Monitoring: Track memory usage patterns.
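Chunked (streaming) processing can be sketched over an async iterable of chunks, so memory only ever holds one chunk at a time. An illustrative example:

```javascript
// Process a large file chunk by chunk instead of buffering it
// whole. The chunk source is abstracted as an async iterable, so
// the same loop works for network streams or local reads.
async function countBytes(chunks) {
  let total = 0;
  for await (const chunk of chunks) {
    total += chunk.length;   // process, then let the chunk be freed
  }
  return total;
}
```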
Troubleshooting
Common Issues
"Connection timeout"
- Check network connectivity.
- Verify server response times.
- Consider increasing timeout values.
- Implement retry mechanisms.
"Invalid file format"
- Verify file extension matches content.
- Check if file is corrupted.
- Ensure file is a supported format.
- Test with different files.
"Empty content extracted"
- Check if file contains extractable text.
- Verify file isn't password protected.
- Ensure file format is properly structured.
- Test with known good files.
Debugging Steps
- Verify URL: Test URL accessibility in a browser.
- Check File: Download and verify file integrity.
- Test Extraction: Try with simpler test files.
- Monitor Logs: Review processing logs for errors.
- Network Check: Verify connectivity and firewall rules.
Node Display
The File Loader node displays:
- Configuration Status: Shows "Configure..." if URL is not set.
- File Info: Displays "File Loader" with truncated URL.
- URL Display: Shows first 20 characters of URL with "..." if longer.
- Status: Indicates processing status and file information.
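The truncation rule described above can be sketched as (illustrative only):

```javascript
// Truncate a URL for display: first 20 characters followed by
// "..." when the URL is longer than that.
function displayUrl(url, max = 20) {
  return url.length > max ? url.slice(0, max) + '...' : url;
}
```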
Example Workflows
Document Summarization
1. File Loader → Load research paper
2. Text Preprocessing → Clean extracted text
3. AI Summarization → Generate summary
4. Quality Check → Validate summary quality
5. Output → Deliver summary to user
Content Migration
1. File Loader → Load legacy documents
2. Format Conversion → Convert to standard format
3. Content Validation → Verify content integrity
4. Metadata Extraction → Extract document properties
5. Storage → Save to new system
Compliance Checking
1. File Loader → Load policy documents
2. Content Analysis → Analyze for compliance keywords
3. Gap Analysis → Identify missing requirements
4. Report Generation → Create compliance report
5. Notification → Alert compliance team