Anthropological Stuff in “R”

       First, “R” is your friend, so:

http://cran.r-project.org/

 


Getting Started with R

R can be used as a “calculator” if you want to just do some fast calculation.  For example, type the following at the Rgui prompt “>” (in the following, what you type is always shown in bold courier, while R’s responses are in regular courier font):

> 3+4/5  [hit return]
[1] 3.8

If you haven’t dealt with order of operations since grade school, remember that if you really wanted (3+4)/5 you need to do the following:

> (3+4)/5
[1] 1.4

But unlike your basic calculator, “R” knows about a lot of different functions.  Let’s say I want to find the logarithm of 34, but I can’t remember whether “R” uses base 10 or natural logs as the default.  I can check this by typing “?log” at the prompt:

> ?log

This will pop up a window with the following (note that you can mark and paste the examples into the console (Rgui) to see what they do):

log                   package:base                   R Documentation
 
Logarithms and Exponentials
Description:
     'log' computes natural logarithms, 'log10' computes common (i.e.,
     base 10) logarithms, and 'log2' computes binary (i.e., base 2)
     logarithms. The general form 'logb(x, base)' computes logarithms
     with base 'base' ('log10' and 'log2' are only special cases).
     'log1p(x)' computes log(1+x) accurately also for |x| << 1 (and
     less accurately when x is approximately -1).
     'exp' computes the exponential function.
     'expm1(x)' computes exp(x) - 1 accurately also for |x| << 1.
Usage:
     log(x, base = exp(1))
     logb(x, base = exp(1))
     log10(x)
     log2(x)
     exp(x)
     expm1(x)
     log1p(x)
Arguments:
       x: a numeric or complex vector.
    base: positive number.  The base with respect to which logarithms
          are computed.  Defaults to e='exp(1)'.
Details:
     'exp' and 'log' are generic functions: methods can be defined for
     them individually or via the 'Math' group generic.
Value:
     A vector of the same length as 'x' containing the transformed
     values.  'log(0)' gives '-Inf' (when available).
Note:
     'log' and 'logb' are the same thing in R, but 'logb' is preferred
     if 'base' is specified, for S-PLUS compatibility.
References:
     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
      Language_. Wadsworth & Brooks/Cole. (for 'log', 'log10' and
     'exp'.)
     Chambers, J. M. (1998) _Programming with Data. A Guide to the S
     Language_. Springer. (for 'logb'.)
See Also:
     'Trig', 'sqrt', 'Arithmetic'.
Examples:
     log(exp(3))
     log10(1e7)# = 7
     x <- 10^-(1+2*1:9)
     cbind(x, log(1+x), log1p(x), exp(x)-1, expm1(x))

So now, the envelope please:

> log(34)
[1] 3.526361
> log(exp(34))
[1] 34
> log10(10^34)
[1] 34
> log2(2^34)
[1] 34
 

The calls with exponents demonstrate what they taught us back in grade-school algebra: a logarithm is just the exponent to which you must raise the base in order to get the original number.

For the observant out there, you may be asking why “R” puts a “[1]” before all its answers.  This is because “R” treats even a scalar (a single number) as a vector of length one.  Later, when we deal with longer vectors (more than one number), the bracketed numbers will help you keep track of where you are in the vector.

Now you’d be correct to point out that so far we haven’t done anything more than your garden-variety calculator can manage, so let’s try some built-in functions that your calculator probably lacks.  Suppose we want to find the z-scores that cut off bottom tails with particular areas in a standard normal distribution:

> qnorm(.025)
[1] -1.959964
> qnorm(.5)
[1] 0
> qnorm(.975)
[1] 1.959964

Or if it were a non-standard normal (say, a mean of 10 and a standard deviation of 2), then:

> qnorm(.975,10,2)
[1] 13.91993
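(As a quick check on how this relates to the standard normal: a non-standard quantile is just the mean plus the standard deviation times the standard normal quantile, so the following gives the same answer.)

> 10+2*qnorm(.975)
[1] 13.91993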

To go the other way (from a z-score back to a tail area) you use pnorm:

> pnorm(-1.96)
[1] 0.02499790
> pnorm(0)
[1] 0.5
> pnorm(1.96)
[1] 0.9750021
> pnorm(13.91993,10,2)
[1] 0.975
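And, as you would expect, pnorm and qnorm undo one another, which makes for a quick sanity check:

> pnorm(qnorm(.975))
[1] 0.975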
 

Or maybe you’re interested in how “high” the distribution is at a particular “z,” in which case you use:

> dnorm(-1.96)
[1] 0.05844094
> dnorm(0)
[1] 0.3989423
> dnorm(1.96)
[1] 0.05844094
> dnorm(13.91993,10,2)
[1] 0.02922248
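That last value is just the standard normal height rescaled: for a non-standard normal you take the height at the corresponding z-score and divide by the standard deviation, so the following gives the same answer:

> dnorm((13.91993-10)/2)/2
[1] 0.02922248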
 

And if you want more information on any of these functions you can type “?qnorm” or “?pnorm” or “?dnorm”, in which case you’ll also find another function, “rnorm”, that produces random normal deviates.  Let’s try that one (but first you’re going to run a “mystical” function so that you get the same random number in the end that I did):

> set.seed(34)
> rnorm(1)
[1] -0.1388900
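Run those same two lines again and you will get the identical “random” number back, which is the whole point:

> set.seed(34)
> rnorm(1)
[1] -0.1388900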
 

“set.seed” sets the random number generator’s starting point (here, at the integer 34) so that you’d get the same random normal that I did.  The “1” in the parentheses tells “R” that you want only one random standard normal.  But who wants just one random number, so let’s try 34 of them:

> rnorm(34)
 [1]  1.199812897 -0.747722402 -0.575248177 -0.263581513 -0.455492149  0.670620044 -0.849014621  1.066804504 -0.007460534
[10] -0.402880091  0.719107939 -0.180058654  1.046190759  0.401254928  1.356390044  0.019226639 -0.469417841 -1.842661894
[19] -0.279740938 -1.530769603  2.546154061 -1.081930761 -1.424597806  0.421643381  0.781368098 -0.899447775 -0.503869229
[28] -0.239373938  1.167999937 -0.798138650  1.210959214 -1.221456179 -0.003421086  0.728218315

And now you can see where those bracketed numbers come in handy, because you’ve produced your first vector.  But maybe you want to create many, many random normals and don’t want to look at them on the screen.  You have two choices: either wrap another function around “rnorm” or store the vector.  Let’s try a wrapper first:

> mean(rnorm(1000000,34,3.4))
[1] 33.99865
> sd(rnorm(1000000,34,3.4))
[1] 3.4009
 

There are two problems with using a wrapper here.  First, you ran the generator 2,000,000 times when you might have wanted the mean and standard deviation from the same 1,000,000-long vector.  Second, you may want to do more complex things that would get rough if you had to keep wrapping.  So let’s store a vector.  The assignment symbol is = or <- (take your pick; I prefer the latter, at least most of the time).

> x<-rnorm(10000,34,3.4)
> x[1:34]
 [1] 34.35526 34.41967 42.21910 35.11879 38.16902 35.42733 33.01487 38.45826 31.18965 39.87449 32.75161 37.29900 32.06589
[14] 31.24125 39.87919 30.31236 33.50599 30.79897 37.37804 37.01856 32.56543 34.61645 36.84017 33.15384 38.55359 35.92221
[27] 38.22694 33.74577 34.15306 41.91588 38.35132 33.78827 29.41453 35.07958

In this case I put the first 34 numbers to the screen to make sure the vector was there (if you just type “x” you get all 10,000 numbers in their meaningless glory).  Now let’s do a histogram:

> hist(x,nc=50)
 

Nice, huh?  But maybe you’ve heard about this fancy thing called a kernel density plot (move over, SAS!), so:

> plot(density(x))

Smokin’!  But let’s get really fancy and overlay the expected normal density:

> lines(seq(20,45,.01),dnorm(seq(20,45,.01),34,3.4),col='blue',lwd=2)
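If you would rather decorate the histogram itself, the same trick works: redraw it on the density scale (freq=FALSE) and then overlay the normal curve.  This is just a sketch of the same idea, using the x and the normal parameters from above:

> hist(x,nc=50,freq=FALSE)
> lines(seq(20,45,.01),dnorm(seq(20,45,.01),34,3.4),col='blue',lwd=2)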

But enough of this.  We are getting dangerously close to writing functions, so let’s try some.


 

Cadien et al. (1974)

As an early student, I was befuddled by the tables in Cadien et al. (1974).  I could see where the means came from, but not the total variances.  The trick is that the total variance is equal to the sum of the (weighted) within-group and between-group variances.  You can use a little R function to see this.  Copy and paste the following into your “Rgui”:

 
 
cadien1<-function (props=rep(1/4,4),xbar=rep(2,4),xvar=4:7) 
{
# props: group proportions; xbar: group means; xvar: group variances
grand.mean<-as.vector(props%*%xbar)
# total variance = weighted within-group variance + weighted between-group variance
grand.var<-as.vector(props%*%xvar)+as.vector(props%*%(xbar-grand.mean)^2)
return(c(grand.mean,grand.var))
}

Now type cadien1() at the R > prompt and you will get the following:

> cadien1()
[1] 2.0 5.5
> 
 

where the first number is the grand mean and the second is the grand (total) variance.  This example matches the first line of Table 1 from Cadien et al.  Now try the second line from that table, as well as the examples from Table 2 (note that R lets you abbreviate argument names, so “xb” below matches “xbar”, “xv” matches “xvar”, and “pr” matches “props”):

 

> cadien1(xb=c(1,2,2,3),xv=rep(5,4))
[1] 2.0 5.5
> cadien1(xb=2:5,xv=4:7)
[1] 3.50 6.75
> cadien1(pr=c(.1,.1,.3,.5),xb=2:5,xv=4:7)
[1] 4.20 7.16
> 

 

Very simple, very easy.  But maybe you’re not a trusting soul, in which case paste the code below and run it in the same fashion as the above (warning: on a slower computer you might want to cut down the number of simulations by reducing “N”).

 

cadien2<-function (props=rep(1/4,4),xbar=rep(2,4),xvar=4:7,N=100000) 
{
# simulate N normal deviates split across the four groups, then check that
# the simulated grand mean and variance agree with cadien1()
n<-props*N
xsd<-sqrt(xvar)
sto<-c(rnorm(n[1],xbar[1],xsd[1]),rnorm(n[2],xbar[2],xsd[2]),rnorm(n[3],xbar[3],xsd[3]),rnorm(n[4],xbar[4],xsd[4]))
return(c(mean(sto),var(sto)))
}
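Run it as “cadien2()” and compare to “cadien1()”; the simulated values will wobble a little from run to run, but with N=100000 they should land close to the exact answers (a sketch, with the expected ballpark in the comments):

cadien2()                                  # two numbers near 2.0 and 5.5
cadien2(pr=c(.1,.1,.3,.5),xb=2:5,xv=4:7)   # two numbers near 4.20 and 7.16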
 

Cadien JD, Harris EF, Jones WP, and Mandarino LJ  (1974)  Biological lineages, skeletal populations, and microevolution.  Yearbk Phys Anthropol 18:194-201.


 

Petersen (2002)

Petersen (2002) gives parametric and nonparametric bootstrap tests as alternatives to an F-test he previously gave for testing the equality of two variance-covariance matrices.  See the 2002 reference below for a description of these tests as well as the prior literature (e.g., Key and Jantz 1990a, 1990b; Zhivotovsky 1988).

 

Petersen HC (2002)  On statistical methods for comparison of intrasample morphometric variability: Zalavár revisited.  Am J Phys Anthropol 113:79-84.

First copy and paste the following into your “Rgui”:

 
  download.file('http://konig.la.utk.edu/oslo.txt','oslo.txt')
  oslo<-dget('oslo.txt')
  download.file('http://konig.la.utk.edu/berg.txt','berg.txt')
  berg<-dget('berg.txt')
 

At the prompt type “oslo” and “berg” to make sure you got it right.

Now mark and copy the following bold text into your “Rgui”:
  
zhivo<-function (h=berg,w=oslo)
{
SGVDF<-function (DF,p,SGV)
#########################################################################
#       This function takes the DF, p, and SGV and returns the corrected
#       DF and c*SGV (following Zhivotovsky, 1988; per Petersen 2000
#       correction)
########################################################################
{
        gvec<-rep(0,p)
        for(i in 1:p) {
               gvec[i]<-gamma(1/p+(DF-i+1)/2)/gamma((DF-i+1)/2)
                      }
        m1<-2*prod(gvec)
        for(i in 1:p) {
               gvec[i]<-gamma(2/p+(DF-i+1)/2)/gamma((DF-i+1)/2)
                      }
        m2<-4*prod(gvec)
        DFc<-2*m1^2/(m2-m1^2)
        cSGV<-DF/m1*SGV
        return(list(DFc=DFc,cSGV=cSGV))
}
#####################################################################
p<-NCOL(h)                              # number of variables
det.ratio<-det(cov(h))/det(cov(w))      # ratio of the two determinants
SGVh<-det(cov(h))^(1/p)                 # standardized generalized variances
SGVw<-det(cov(w))^(1/p)
nw<-NROW(w)
nh<-NROW(h)
DFh<-nh-1
DFw<-nw-1
SGVh<-SGVDF(DFh,p,SGVh)                 # corrected DF and c*SGV for each sample
SGVw<-SGVDF(DFw,p,SGVw)
Fratio<-SGVh$cSGV/SGVw$cSGV
p.value<-pf(Fratio,SGVh$DFc,SGVw$DFc,lower.tail=F)
return(list(det.ratio=det.ratio,Fratio=Fratio,p=p.value))
}

Now type zhivo() at the R > prompt and you will get the following:

> zhivo()
$det.ratio
[1] 7.506352
 
$Fratio
[1] 1.220779
 
$p
[1] 0.01381720
 
In the above, “det.ratio” is the ratio of the determinants of the two variance-covariance matrices, “Fratio” is the F-test statistic, and “p” is the probability value for the F-test.  In the call to “zhivo” you can specify different datasets, for example “zhivo(samp1,samp2)” or “zhivo(h=samp1,w=samp2)”.
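If you want to see the (uncorrected) standardized generalized variances themselves, you can compute them directly from the same quantities “zhivo” uses internally, namely det(cov())^(1/p) where p is the number of variables:

p<-NCOL(berg)
det(cov(berg))^(1/p)     # uncorrected SGV for berg
det(cov(oslo))^(1/p)     # uncorrected SGV for oslo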
 
Now it is time to do a nonparametric bootstrap, so paste the following:
 
 
bootnon<-function (h=berg,w=oslo,Nboot=9)
{
boot<-rep(0,Nboot+1)
p<-NCOL(h)
nw<-NROW(w)
nh<-NROW(h)
# center each sample on its own means, then pool the centered data
W<-w-rep(1,nw)%o%apply(w,2,mean)
H<-h-rep(1,nh)%o%apply(h,2,mean)
hw<-rbind(H,W)
 
# first element is the observed determinant ratio
boot[1]<-det(cov(hw[1:nh,]))/det(cov(hw[(nh+1):(nh+nw),]))
 
# resample rows with replacement from the pooled data, splitting each
# bootstrap sample back into "h"-sized and "w"-sized pieces
for (i in 2:(Nboot+1))  {
bootit<-sample(1:(nw+nh),replace=T)
hwb<-hw[bootit,1:p]
boot[i]<-det(cov(hwb[1:nh,]))/det(cov(hwb[(nh+1):(nh+nw),]))
      }
return(boot)
}
 
and run as “bootnon()”.  This will spit back a 10-element vector, for example:
 
> bootnon()
 [1] 7.5063520 0.7896946 1.8347205 0.9368910 0.4789018 3.9599567 0.2407633 1.6292649 5.1654673 1.3845933
 
where the first element (= 7.5063520) is the observed determinant ratio and the following nine are bootstrapped values.  Nine bootstraps is pretty pathetic, so you’ll want to try something meatier once you’re sure things are working correctly, like:
 
> bootnon2<-bootnon(Nboot=999)
 
This will store the vector, and then you can use “sum(bootnon2>=bootnon2[1])/1000” to get the p-value, i.e., the proportion of the 1,000 ratios (counting the observed one itself) that are at least as large as the observed ratio.
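Written out as a full run (the set.seed is optional, but makes your bootstrap reproducible):

set.seed(34)
bootnon2<-bootnon(Nboot=999)
sum(bootnon2>=bootnon2[1])/1000      # proportion at least as large as observed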
 
 
Now it is time to do a parametric bootstrap, so paste the following:
 
 
bootwish<- function (h=berg,w=oslo,Nboot=9)
{
 
rwishart2<-function(df,SqrtSigma)
#################################################################
# rwishart2 simulates from a wishart distribution using the
# Cholesky decomposition of a variance-covariance matrix
#################################################################
{
p<-NROW(SqrtSigma)
Z <- matrix(0, p, p)
diag(Z) <- sqrt(rchisq(p, df:(df-p+1)))
pseq <- 1:(p-1)
Z[rep(p*pseq, pseq) + unlist(lapply(pseq, seq))] <- rnorm(p*(p-1)/2)
crossprod(Z %*% SqrtSigma)
}
#################################################################
 
boot<-rep(0,Nboot+1)
p<-NCOL(h)
nw<-NROW(w)
nh<-NROW(h)
DFh<-nh-1
DFw<-nw-1
 
# first element is the observed determinant ratio
boot[1]<-det(cov(h))/det(cov(w))
# pooled variance-covariance matrix and its Cholesky factor
Sp<-(cov(h)*DFh+cov(w)*DFw)/(DFh+DFw)
SqrtSp<-chol(Sp)
 
# each round simulates two Wishart matrices from the pooled covariance
for (i in 2:(Nboot+1))  {
boot[i]<-det(rwishart2(DFh,SqrtSp)/DFh)/det(rwishart2(DFw,SqrtSp)/DFw)
      }
return(boot)
}
 
and run as “bootwish()”.  This will spit back a 10-element vector, which you can then use just as you did the nonparametric bootstrap.
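In other words, something like the following (“bootwish2” is just an example name):

bootwish2<-bootwish(Nboot=999)
sum(bootwish2>=bootwish2[1])/1000    # parametric bootstrap p-value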
 
Just for fun, I compared the nonparametric bootstrap (for 9,999 samples plus the observed) to the F-ratio distribution, as below:
 
[Figure: density comparison, not reproduced here]
 
where the black curve is from the F-ratio and the blue line is from the bootstrap (modified to calculate the F-ratio instead of the determinant ratio used above).  The vertical line is the observed F-ratio.
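You can draw a rough analogue for the determinant-ratio version with the bootstrap vector you already stored: a density curve of the bootstrapped ratios with a vertical line at the observed value.

plot(density(bootnon2))
abline(v=bootnon2[1],lwd=2)     # vertical line at the observed ratio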