# Test Fixtures for PDF Import Tests

This directory contains PDF files used for integration testing of the `pdf2txt_multicolumn_safe()` function.

## Generated Fixtures

These PDFs are generated by `create_test_fixtures.R` and provide controlled test cases:

| File | Description | Pages | Purpose |
|------|-------------|-------|---------|
| `sample-single-column.pdf` | Single column document | 1 | Test basic single-column extraction |
| `sample-two-columns.pdf` | Two column layout | 1 | Test two-column detection and extraction |
| `sample-three-columns.pdf` | Three column layout (landscape) | 1 | Test multi-column handling |
| `empty-page.pdf` | Blank PDF | 1 | Test empty page handling |
| `multi-page.pdf` | Multi-page document | 5 | Test pagination |
| `hyphenated-text.pdf` | Text with hyphenation | 1 | Test hyphenation removal |
| `special-chars.pdf` | Special characters and symbols | 1 | Test unicode/special char handling |

## Creating/Updating Fixtures

To regenerate the test fixtures:

```r
source("create_test_fixtures.R")
main()
```

Or from the command line:

```bash
Rscript create_test_fixtures.R
```

## Optional Real-World Fixtures

You can add additional real-world PDF files for more comprehensive testing. Suggested additions:

| File | Description | Notes |
|------|-------------|-------|
| `scientific-paper.pdf` | Actual research paper | From arXiv or open-access journal |
| `complex-layout.pdf` | Document with mixed layout | Tables, figures, etc. |
| `unicode-content.pdf` | Non-Latin characters | Test UTF-8 handling |
| `large-doc.pdf` | Large multi-page document | Performance testing (50+ pages) |
| `malformed.pdf` | Corrupted/invalid PDF | Error handling |

## Finding Test PDFs

**Open sources for test PDFs:**

1. **arXiv.org** - Scientific papers (open access)
   - Example: https://arxiv.org/pdf/2301.00000.pdf
   - Most are 2-column academic format
2. **PubMed Central** - Biomedical papers
   - https://www.ncbi.nlm.nih.gov/pmc/
   - Click "PDF" on any open-access article
3. **Public domain documents**
   - Project Gutenberg (converted to PDF)
   - Government documents
   - Wikipedia exports
4. **Sample PDFs**
   - Adobe Sample PDFs: https://acrobat.adobe.com/link/track?uri=urn:aaid:scds:US:f74b9666-c6df-4571-9e9e-e40e19bfcf1c
   - PDF Test Suite: Various online repositories

## Size Guidelines

- **Small fixtures** (< 100KB): Good for fast CI/CD tests
- **Medium fixtures** (100KB-1MB): For thorough integration tests
- **Large fixtures** (> 1MB): Skip on CRAN, use for performance tests

## Git Considerations

### .gitignore Settings

Add to `.gitignore` if fixtures are large:

```gitignore
# Ignore large PDF fixtures, but keep the small generated ones
tests/testthat/fixtures/*.pdf
!tests/testthat/fixtures/sample-*.pdf
!tests/testthat/fixtures/empty-*.pdf
!tests/testthat/fixtures/multi-page.pdf
!tests/testthat/fixtures/hyphenated-text.pdf
!tests/testthat/fixtures/special-chars.pdf
```

This keeps the small generated fixtures in git but excludes large real-world PDFs.

### .Rbuildignore Settings

Add to `.Rbuildignore` if you want to exclude fixtures from the built package:

```
^tests/testthat/fixtures/large-.*\.pdf$
^tests/testthat/fixtures/.*-real\.pdf$
```

## Test Behavior When Fixtures Missing

The integration tests use `skip_if_no_fixture()` to gracefully skip tests when PDFs are not available. This means:

- ✅ Tests run when fixtures exist
- ⏭️ Tests skip (not fail) when fixtures are missing
- 📦 Package can be tested without fixtures

## Adding New Fixtures

1. **Add the PDF file** to this directory
2. **Update this README** with file description
3. **Add corresponding test** in `test-pdf_import_integration.R`:

   ```r
   it("extracts from your-new-fixture.pdf", {
     skip_if_not_installed("pdftools")
     skip_if_no_fixture("your-new-fixture.pdf")

     pdf_file <- fixture_path("your-new-fixture.pdf")
     result <- pdf2txt_multicolumn_safe(pdf_file)

     expect_type(result, "character")
     expect_gt(nchar(result), 0)
     # Add specific expectations based on known content
   })
   ```

## Fixture Metadata

For reference, here are the expected characteristics of generated fixtures:

```r
# Sample characteristics (approximate)
fixtures_info <- list(
  "sample-single-column.pdf" = list(
    pages = 1,
    columns = 1,
    approx_chars = 500,
    contains = c("SAMPLE DOCUMENT", "Introduction", "Methods")
  ),
  "sample-two-columns.pdf" = list(
    pages = 1,
    columns = 2,
    approx_chars = 400,
    contains = c("TWO-COLUMN", "ABSTRACT", "Introduction")
  ),
  "sample-three-columns.pdf" = list(
    pages = 1,
    columns = 3,
    approx_chars = 200,
    contains = c("THREE-COLUMN", "Column 1", "Column 2", "Column 3")
  )
)
```

## Maintenance

- Review fixtures quarterly to ensure they remain relevant
- Update this README when adding/removing fixtures
- Regenerate fixtures if testing requirements change
- Keep total fixture size under 5MB if possible

## License & Copyright

**Generated fixtures**: These are minimal synthetic documents created solely for testing purposes. No copyright restrictions.

**Real-world fixtures**: If adding real PDFs, ensure they are:

- Public domain, OR
- Open access with appropriate license, OR
- Used under fair use for testing purposes only

**Do not commit** copyrighted material without proper authorization.

---

Last updated: 2025-01-06
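
## Appendix: Sketch of the Fixture Helpers

This README refers to `skip_if_no_fixture()` and `fixture_path()` without showing them. The block below is a minimal sketch of how such helpers could look, not the project's actual definitions. The `dir` argument and its default of `"fixtures"` are assumptions made for illustration; the default relies on testthat running tests with `tests/testthat/` as the working directory.

```r
# Hypothetical sketch: the real helpers in this package may differ.
library(testthat)

# Build the path to a fixture file.
# `dir` is an assumed parameter; "fixtures" resolves correctly when the
# working directory is tests/testthat/, as it is during a testthat run.
fixture_path <- function(name, dir = "fixtures") {
  file.path(dir, name)
}

# Skip the current test (rather than fail it) when the fixture is absent.
# Returns the fixture path invisibly when the file exists.
skip_if_no_fixture <- function(name, dir = "fixtures") {
  path <- fixture_path(name, dir = dir)
  if (!file.exists(path)) {
    skip(paste("Fixture not available:", name))
  }
  invisible(path)
}
```

Signalling the skip through `testthat::skip()` is what makes a missing PDF show up as a skipped test instead of a failure, which matches the behavior described under "Test Behavior When Fixtures Missing".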