# Neville Letters vs Plays (1590-1615)
## Function-word bigram bootstrap analysis

This run compares the Neville letters with early modern plays using function-word bigrams. Restricting features to function words limits topical bias, and length-matched windows with bootstrapped sampling limit text-length effects.

## Data sources
- Neville letters: `Neville_Letters_Corpus_v3.xml`
- Plays: `early_modern_plays.db` (plays with CREATION_YEAR between 1590 and 1615)
- Henry VIII sections: division IDs 54 (Shakespeare) and 109 (Fletcher)

## Features
- Tokens are lemmas filtered to the top 200 most frequent words from plays (1580–1620), used as a proxy for function words.
- Features are consecutive function-word bigrams.
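
The feature step can be sketched roughly as follows. This is a minimal illustration, not the code in the run's script: the function names are hypothetical, and it assumes that "consecutive" means adjacent in the stream that remains after non-function-word lemmas are dropped.

```python
from collections import Counter


def top_function_words(play_lemmas, n=200):
    """Return the n most frequent lemmas as a stand-in function-word list."""
    return {lemma for lemma, _ in Counter(play_lemmas).most_common(n)}


def function_word_bigrams(lemmas, function_words):
    """Count adjacent pairs in the stream left after dropping other lemmas.

    Assumption: two function words separated by a content word still form
    a bigram, because the pairing happens after filtering.
    """
    stream = [lemma for lemma in lemmas if lemma in function_words]
    return Counter(zip(stream, stream[1:]))


lemmas = ["the", "king", "of", "the", "realm", "and", "the", "queen"]
counts = function_word_bigrams(lemmas, {"the", "of", "and"})
print(counts[("of", "the")])  # 1
```

If the script instead pairs only function words that are adjacent in the unfiltered text, the filter would have to be applied to each pair rather than to the stream.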

## Sampling and scoring
- Window size: 10,000 function-word tokens.
- Neville baseline: 50 random windows from the Neville letters; their vectors are averaged to a centroid.
- Each play: 50 random windows (or 1 window if the play is shorter than 10,000 function-word tokens).
- Similarity: cosine similarity between each play window and the Neville centroid.
- Score reported per play: mean similarity across windows, plus standard deviation.
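
The sampling and scoring steps above might look something like the sketch below. It uses only the standard library and sparse count dictionaries. All function names are hypothetical, and two details are assumptions rather than facts about the run: windows are taken as contiguous slices with random start positions, and the reported standard deviation is the population form.

```python
import math
import random
import statistics
from collections import Counter


def bigram_counts(window):
    """Sparse bigram count vector for one window of function-word tokens."""
    return Counter(zip(window, window[1:]))


def sample_windows(stream, size, n, rng):
    """Draw n random contiguous windows, or one whole-text window if short."""
    if len(stream) < size:
        return [stream]
    last_start = len(stream) - size
    starts = [rng.randint(0, last_start) for _ in range(n)]
    return [stream[s:s + size] for s in starts]


def centroid(vectors):
    """Average a list of sparse count vectors feature by feature."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {key: count / len(vectors) for key, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_play(play_stream, neville_centroid, size=10_000, n=50, seed=0):
    """Return (mean similarity, standard deviation, number of windows)."""
    rng = random.Random(seed)
    windows = sample_windows(play_stream, size, n, rng)
    sims = [cosine(bigram_counts(w), neville_centroid) for w in windows]
    return statistics.fmean(sims), statistics.pstdev(sims), len(windows)
```

The Neville centroid would be built the same way, for example `centroid([bigram_counts(w) for w in sample_windows(neville_stream, 10_000, 50, rng)])`, where `neville_stream` is a placeholder for the filtered letter tokens.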

## Inclusion thresholds
- Plays and sections must have at least 5,000 function-word tokens to be included.
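
Taken together with the 10,000-token window size, this threshold means a text with 5,000 to 9,999 function-word tokens is scored from a single window. A small sketch of that rule, with a hypothetical function name and the values from this README as defaults:

```python
def windows_for(n_tokens, window=10_000, min_tokens=5_000, n_windows=50):
    """Number of windows scored for a text with n_tokens function-word tokens."""
    if n_tokens < min_tokens:
        return 0  # below the inclusion threshold: excluded from the run
    if n_tokens < window:
        return 1  # included, but too short for a full-size window
    return n_windows


for size in (4_999, 5_000, 9_999, 10_000):
    print(size, windows_for(size))
```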

## Outputs
- `Neville_Bigram_Functionwords_Bootstrap_1590_1615.csv`: results ranked by mean similarity.
- `analyze_neville_bigrams_functionwords_bootstrap.py`: script to reproduce the run.

## Notes
- Entries scored from a single window report a standard deviation of 0.0000. That value reflects the single sample, not low variability, so interpret these scores cautiously: they are not bootstrapped across multiple windows.
