DNA evidence is the pre-eminent tool in the modern forensic scientist’s toolbox. It is widely accepted by the public, scientific and legal communities, and it has been instrumental in determining both the innocence and guilt of individuals involved in the legal process. Despite this widespread acceptance, there is unease among some members of all these communities regarding the statistical measures used to evaluate DNA evidence. In particular, some people regard the random match probabilities associated with DNA evidence as just too small or basically unsupportable. In this vignette we discuss the basics of STR profiles, which serves as a reference for the package’s other vignettes:
In db_vignette (“Empirical testing of DNA match probabilities”), we discuss what it means for a pair of DNA profiles to match or partially match, and we present how the DNAtools package allows a rational examination of the statistical properties of a DNA database.
In noa_vignette (“On the exact distribution of the numbers of alleles in DNA mixtures”), we show how to calculate the distribution of the number of distinct alleles present in a DNA mixture with an arbitrary number of contributors.
Forensic genetics has its own terminology, which we briefly explain here. Human DNA consists of 23 pairs of chromosomes, and those chromosomes are composed of a sequence of nucleotides which are labelled A, G, C and T after the bases adenine, guanine, cytosine and thymine that are used to form them. Modern DNA typing uses short tandem repeats (STRs). These are regions of DNA which are highly variable, but are patterned in that they consist of repeats of a short sequence of DNA bases. The locations at which this information is collected are called loci, and the (length) variations in the patterns observed at each locus are called alleles. We have two alleles at each locus because humans are a diploid species, meaning they have two copies of each chromosome. One allele comes from our mother, and the other from our father.
A pair of alleles at a locus is called a genotype, and therefore a DNA profile is actually a multi-locus genotype. Modern forensic laboratories genotype DNA evidence using commercial kits, called multiplexes, which consist of 9–17 loci. The multiplex currently used in the United Kingdom (and until recently in New Zealand and Denmark) is called AmpFlSTR SGM Plus, or SGM Plus for short, and consists of 10 loci plus one gender-specific locus, Amelogenin. Forensic laboratories in the United States that load profiles into the FBI’s Combined DNA Index System (CODIS) collect a core set of thirteen loci, although they are not constrained to use one multiplex.
| Alleles: | 15, 18 | 14, 17 | 6, 9.3 | 17, 23 | 12, 15 | 15, 15 | 19, 23 | 11, 12 | 28, 28 | 13, 14 |
The table above shows a DNA profile from the SGM Plus multiplex. There are two numbers at each locus, representing the two alleles that make up the genotype at that locus. Each number is the number of times the pattern, or motif, that describes the alleles at the locus is repeated. For example, this person’s genotype at the locus TH01 is 6,9.3. This means that on one chromosome the motif for TH01, TCAT, was repeated 6 times, and on the other chromosome it was repeated 9 times and then followed by a partial repeat. The .3 represents the fact that three of the four bases have been repeated.
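The allele notation can be made concrete with a short calculation. The following is a minimal Python sketch, not part of DNAtools: the function name `parse_allele` and its `motif_length` parameter are hypothetical, and the sketch assumes a four-base motif such as the TCAT motif described for TH01.

```python
from typing import Tuple


def parse_allele(designation: str, motif_length: int = 4) -> Tuple[int, int, int]:
    """Split an STR allele designation into its parts.

    Returns (complete repeats, extra bases, total bases), where total bases
    counts only the bases implied by the designation itself.
    """
    # "9.3" -> whole = "9", partial = "3"; "6" -> whole = "6", partial = ""
    whole, _, partial = designation.partition(".")
    complete_repeats = int(whole)
    extra_bases = int(partial) if partial else 0
    # A partial repeat must be shorter than one full motif.
    if not 0 <= extra_bases < motif_length:
        raise ValueError(f"invalid partial repeat in {designation!r}")
    total_bases = complete_repeats * motif_length + extra_bases
    return complete_repeats, extra_bases, total_bases


# The TH01 genotype from the table, written as two allele designations.
for allele in ("6", "9.3"):
    repeats, extra, bases = parse_allele(allele)
    print(f"allele {allele}: {repeats} full repeats + {extra} bases = {bases} bases")
```

Under these assumptions, allele 6 corresponds to 24 bases of repeated motif and allele 9.3 to 39 bases (nine full repeats plus three bases).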
The aim of the DNAtools package is to provide statisticians and forensic scientists with access to the specific procedures described in the other vignettes. For example, for the database matching exercise (db_vignette), early implementations by Weir (2004) and then Curran, Walsh, and Buckleton (2007) required custom-written code for each new database and, in the case of Curran, Walsh, and Buckleton (2007), the generation of at least half a dozen precursor files and a significant amount of memory. Tvedebrink (2010) and Tvedebrink et al. (2012) reduced the computational effort of Weir (2004) and Curran, Walsh, and Buckleton (2007) by deriving recursion formulas for the expectation and variance of the computed summary statistics. DNAtools aims to make all of these procedures easier to use in R.
The listed vignettes describe the main features of the package, which allow statisticians and forensic scientists to easily examine the properties of a forensic DNA database. In particular, our package makes it simple to carry out a database comparison exercise, in which every DNA profile in the database is compared to every other profile, and to compare the resulting numbers of observed pairs of matching and partially matching profiles to their expectation under a set of population genetic assumptions. Similarly, the distribution of the number of distinct alleles in high-order DNA mixtures is easily computed.
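To illustrate what the observed side of a database comparison exercise involves, here is a minimal Python sketch. It is not the DNAtools implementation, which is written for R. The function names and the three-profile database are made up for illustration, and the sketch assumes that two genotypes match at a locus when both alleles agree and partially match when they share at least one allele but do not fully agree.

```python
from collections import Counter
from itertools import combinations


def compare_locus(g1, g2):
    """Classify two genotypes at one locus as 'match', 'partial' or 'mismatch'."""
    if sorted(g1) == sorted(g2):
        return "match"        # both alleles agree
    if set(g1) & set(g2):
        return "partial"      # at least one shared allele, but not a full match
    return "mismatch"         # no alleles in common


def compare_database(profiles):
    """Count profile pairs by (number of matching loci, number of partially matching loci)."""
    counts = Counter()
    # Every profile is compared to every other profile exactly once.
    for p1, p2 in combinations(profiles, 2):
        n_match = n_partial = 0
        for locus in p1:
            result = compare_locus(p1[locus], p2[locus])
            if result == "match":
                n_match += 1
            elif result == "partial":
                n_partial += 1
        counts[(n_match, n_partial)] += 1
    return counts


# Toy database (invented data): three profiles typed at two loci.
toy_db = [
    {"TH01": ("6", "9.3"), "vWA": ("14", "17")},
    {"TH01": ("6", "9.3"), "vWA": ("14", "18")},
    {"TH01": ("7", "8"), "vWA": ("16", "16")},
]

for (n_match, n_partial), n_pairs in sorted(compare_database(toy_db).items()):
    print(f"{n_match} matching, {n_partial} partially matching loci: {n_pairs} pair(s)")
```

For a database of n profiles this loop visits n(n-1)/2 pairs, which is why the computational effort of the comparison grows quickly with database size. The resulting table of counts is the observed quantity that would then be compared with its expectation under the chosen population genetic assumptions.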
Curran, JM, SJ Walsh, and J Buckleton. 2007. “Empirical Testing of Estimated DNA Frequencies.” Forensic Science International: Genetics, 1(3–4), 267–272.
Tvedebrink, T. 2010. “Statistical Aspects of Forensic Genetics – Models for Qualitative and Quantitative STR Data.” Ph.D. thesis, Department of Mathematical Sciences, Aalborg University.
Tvedebrink, T, PS Eriksen, JM Curran, HS Mogensen, and N Morling. 2012. “Analysis of Matches and Partial-Matches in a Danish DNA Reference Profile Data Set.” Forensic Science International: Genetics, 6(3), 387–392.
Weir, BS. 2004. “Matching and Partially-Matching DNA Profiles.” Journal of Forensic Sciences, 49(5), 1009–1014.