# Test Fixtures for PDF Import Tests

This directory contains PDF files used for integration testing of the `pdf2txt_multicolumn_safe()` function.

## Generated Fixtures

These PDFs are generated by `create_test_fixtures.R` and provide controlled test cases:

| File | Description | Pages | Purpose |
|------|-------------|------|---------|
| `sample-single-column.pdf` | Single column document | 1 page | Test basic single-column extraction |
| `sample-two-columns.pdf` | Two column layout | 1 page | Test two-column detection and extraction |
| `sample-three-columns.pdf` | Three column layout (landscape) | 1 page | Test multi-column handling |
| `empty-page.pdf` | Blank PDF | 1 page | Test empty page handling |
| `multi-page.pdf` | Multi-page document | 5 pages | Test pagination |
| `hyphenated-text.pdf` | Text with hyphenation | 1 page | Test hyphenation removal |
| `special-chars.pdf` | Special characters and symbols | 1 page | Test unicode/special char handling |

## Creating/Updating Fixtures

To regenerate the test fixtures:

```r
source("create_test_fixtures.R")
main()
```

Or from the command line:

```bash
Rscript create_test_fixtures.R
```
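
For orientation, here is a minimal sketch of how a synthetic one-page fixture could be produced with base R graphics. It is not taken from `create_test_fixtures.R`; the helper name `make_single_column_pdf()` and the layout values are assumptions for illustration only.

```r
# Hypothetical helper (not from create_test_fixtures.R): write a one-page,
# single-column PDF using the base R pdf() graphics device.
make_single_column_pdf <- function(path, lines) {
  grDevices::pdf(path, width = 8.5, height = 11)  # US Letter, portrait
  on.exit(grDevices::dev.off(), add = TRUE)       # always close the device
  graphics::par(mar = c(1, 1, 1, 1))
  graphics::plot.new()
  # Place each line top-down at evenly spaced vertical positions
  y <- seq(0.95, by = -0.04, length.out = length(lines))
  graphics::text(x = 0, y = y, labels = lines, adj = c(0, 1))
  invisible(path)
}

out <- make_single_column_pdf(
  file.path(tempdir(), "demo-single-column.pdf"),
  c("SAMPLE DOCUMENT", "Introduction", "Methods")
)
```

Writing to `tempdir()` keeps the sketch from overwriting the committed fixtures.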

## Optional Real-World Fixtures

You can add additional real-world PDF files for more comprehensive testing. Suggested additions:

| File | Description | Notes |
|------|-------------|-------|
| `scientific-paper.pdf` | Actual research paper | From arXiv or open-access journal |
| `complex-layout.pdf` | Document with mixed layout | Tables, figures, etc. |
| `unicode-content.pdf` | Non-Latin characters | Test UTF-8 handling |
| `large-doc.pdf` | Large multi-page document | Performance testing (50+ pages) |
| `malformed.pdf` | Corrupted/invalid PDF | Error handling |
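
A `malformed.pdf` fixture does not have to come from a real document. The sketch below writes a file with a PDF header but no valid body; `make_malformed_pdf()` is a hypothetical helper, and how `pdf2txt_multicolumn_safe()` should react to such a file (error, warning, or empty result) is not specified here.

```r
# Hypothetical helper: create a deliberately invalid PDF for error-handling tests.
make_malformed_pdf <- function(path) {
  # A valid-looking header followed by content that is not a PDF body
  writeLines(c("%PDF-1.4", "this is not a valid PDF body"), path)
  invisible(path)
}

bad <- make_malformed_pdf(file.path(tempdir(), "malformed.pdf"))
```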

## Finding Test PDFs

**Open sources for test PDFs:**

1. **arXiv.org** - Scientific papers (open access)
   - URL pattern: https://arxiv.org/pdf/2301.00000.pdf (placeholder ID; substitute a real paper's ID)
   - Most are 2-column academic format

2. **PubMed Central** - Biomedical papers
   - https://www.ncbi.nlm.nih.gov/pmc/
   - Click "PDF" on any open-access article

3. **Public domain documents**
   - Project Gutenberg (converted to PDF)
   - Government documents
   - Wikipedia exports

4. **Sample PDFs**
   - Adobe Sample PDFs: https://acrobat.adobe.com/link/track?uri=urn:aaid:scds:US:f74b9666-c6df-4571-9e9e-e40e19bfcf1c
   - PDF Test Suite: Various online repositories

## Size Guidelines

- **Small fixtures** (< 100KB): Good for fast CI/CD tests
- **Medium fixtures** (100KB-1MB): For thorough integration tests
- **Large fixtures** (> 1MB): Skip on CRAN, use for performance tests
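
As a rough sketch, the tiers above can be expressed as a small classifier. The function name `classify_fixture_size()` is hypothetical, and the thresholds assume 1KB = 1024 bytes, which the guidelines do not state.

```r
# Hypothetical helper: map a file size in bytes to the tiers listed above.
# Assumes 1KB = 1024 bytes; boundaries follow "< 100KB", "100KB-1MB", "> 1MB".
classify_fixture_size <- function(bytes) {
  kb <- 1024
  ifelse(bytes < 100 * kb, "small",
         ifelse(bytes <= 1024 * kb, "medium", "large"))
}

tiers <- classify_fixture_size(c(50 * 1024, 100 * 1024, 1024^2, 1024^2 + 1))
```

In a test, the result of `classify_fixture_size(file.size(path))` could decide whether to call `testthat::skip_on_cran()` before running a slow extraction.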

## Git Considerations

### .gitignore Settings

Add to `.gitignore` if fixtures are large:

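```gitignore
# Ignore large PDF fixtures
tests/testthat/fixtures/*.pdf
!tests/testthat/fixtures/sample-*.pdf
!tests/testthat/fixtures/empty-*.pdf
!tests/testthat/fixtures/multi-page.pdf
!tests/testthat/fixtures/hyphenated-text.pdf
!tests/testthat/fixtures/special-chars.pdf
```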

This keeps small generated fixtures in git but excludes large real-world PDFs.

### .Rbuildignore Settings

Add to `.Rbuildignore` if you want to exclude fixtures from the built package:

```
^tests/testthat/fixtures/large-.*\.pdf$
^tests/testthat/fixtures/.*-real\.pdf$
```
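
`.Rbuildignore` entries are Perl-style regular expressions matched case-insensitively against paths relative to the package root. The sketch below checks the two patterns above against a few paths; `is_excluded()` and the file name `paper-real.pdf` are hypothetical and used only for illustration.

```r
# The two patterns from the .Rbuildignore example above
patterns <- c(
  "^tests/testthat/fixtures/large-.*\\.pdf$",
  "^tests/testthat/fixtures/.*-real\\.pdf$"
)

# Example paths relative to the package root ("paper-real.pdf" is made up)
paths <- c(
  "tests/testthat/fixtures/large-doc.pdf",
  "tests/testthat/fixtures/paper-real.pdf",
  "tests/testthat/fixtures/sample-two-columns.pdf"
)

# Hypothetical helper: TRUE if any pattern matches the path
is_excluded <- function(path, patterns) {
  any(vapply(patterns, grepl, logical(1),
             x = path, perl = TRUE, ignore.case = TRUE))
}

excluded <- vapply(paths, is_excluded, logical(1), patterns = patterns)
```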

## Test Behavior When Fixtures Missing

The integration tests use `skip_if_no_fixture()` to gracefully skip tests when PDFs are not available. This means:

- ✅ Tests run when fixtures exist
- ⏭️ Tests skip (not fail) when fixtures are missing
- 📦 Package can be tested without fixtures
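
The helper definitions are not reproduced in this README. A minimal sketch of how `fixture_path()` and `skip_if_no_fixture()` could be written is shown below, assuming the fixtures live in `tests/testthat/fixtures/`; the actual implementations in the test suite may differ.

```r
library(testthat)

# Sketch only: resolve a fixture name relative to the testthat directory.
# Assumes fixtures are stored in tests/testthat/fixtures/.
fixture_path <- function(name) {
  test_path("fixtures", name)
}

# Sketch only: skip the current test (rather than fail) when the file is absent.
skip_if_no_fixture <- function(name) {
  if (!file.exists(fixture_path(name))) {
    skip(paste("Fixture not found:", name))
  }
  invisible(TRUE)
}
```

Because `skip()` signals a skip condition instead of an error, a missing fixture is reported as skipped in the test summary.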

## Adding New Fixtures

1. **Add the PDF file** to this directory
2. **Update this README** with file description
3. **Add corresponding test** in `test-pdf_import_integration.R`:

```r
it("extracts from your-new-fixture.pdf", {
  skip_if_not_installed("pdftools")
  skip_if_no_fixture("your-new-fixture.pdf")
  
  pdf_file <- fixture_path("your-new-fixture.pdf")
  result <- pdf2txt_multicolumn_safe(pdf_file)
  
  expect_type(result, "character")
  expect_gt(sum(nchar(result)), 0)  # sum() also covers a multi-element result
  # Add specific expectations based on known content
})
```

## Fixture Metadata

For reference, here are the expected characteristics of generated fixtures:

```r
# Sample characteristics (approximate)
fixtures_info <- list(
  "sample-single-column.pdf" = list(
    pages = 1,
    columns = 1,
    approx_chars = 500,
    contains = c("SAMPLE DOCUMENT", "Introduction", "Methods")
  ),
  "sample-two-columns.pdf" = list(
    pages = 1,
    columns = 2,
    approx_chars = 400,
    contains = c("TWO-COLUMN", "ABSTRACT", "Introduction")
  ),
  "sample-three-columns.pdf" = list(
    pages = 1,
    columns = 3,
    approx_chars = 200,
    contains = c("THREE-COLUMN", "Column 1", "Column 2", "Column 3")
  )
)
```
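
One way such metadata could be used is to check that the expected strings appear in the extracted text. The sketch below uses a hard-coded stand-in string instead of real extractor output, and `missing_strings()` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: return the expected strings that are absent from the text.
missing_strings <- function(text, expected) {
  combined <- paste(text, collapse = "\n")  # tolerate multi-element input
  found <- vapply(expected, grepl, logical(1), x = combined, fixed = TRUE)
  expected[!found]
}

# Metadata entry copied from the three-column fixture above
info <- list(
  pages = 1, columns = 3, approx_chars = 200,
  contains = c("THREE-COLUMN", "Column 1", "Column 2", "Column 3")
)

# Stand-in for extractor output (deliberately missing "Column 2")
extracted <- "THREE-COLUMN LAYOUT\nColumn 1 text\nColumn 3 text"
missing <- missing_strings(extracted, info$contains)
```

In a real test, `extracted` would come from `pdf2txt_multicolumn_safe()` and the expectation would be that `missing` has length zero.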

## Maintenance

- Review fixtures quarterly to ensure they remain relevant
- Update this README when adding/removing fixtures
- Regenerate fixtures if testing requirements change
- Keep total fixture size under 5MB if possible
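
The 5MB budget can be checked with a few lines of base R. This is a sketch; `total_fixture_mb()` is a hypothetical helper and assumes 1MB = 1024^2 bytes.

```r
# Hypothetical helper: total size of all PDFs in a directory, in MB (1024^2 bytes).
total_fixture_mb <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  sum(file.size(files)) / 1024^2
}

# Demonstration on a throwaway directory rather than the real fixtures
demo_dir <- file.path(tempdir(), "fixture-size-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(raw(2048), file.path(demo_dir, "a.pdf"))
writeBin(raw(4096), file.path(demo_dir, "notes.txt"))  # ignored: not a PDF
demo_total <- total_fixture_mb(demo_dir)
```

Pointing the function at this directory and comparing the result with 5 gives a quick check during the quarterly review.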

## License & Copyright

**Generated fixtures**: These are minimal synthetic documents created solely for testing purposes. No copyright restrictions.

**Real-world fixtures**: If adding real PDFs, ensure they are:
- Public domain, OR
- Open access with appropriate license, OR
- Used under fair use for testing purposes only

**Do not commit** copyrighted material without proper authorization.

---

Last updated: 2025-01-06