Exploratory Data Analysis and Visualizations (STAT GR5702)


Problem 1: Occupational Mobility

Source: Chapter 4, p. 73 no. 5

According to the R help page, the Yamaguchi87 dataset in vcdExtra has become a classic for models comparing two-way mobility tables. The package provides it as a frequency data frame with 75 observations on 4 variables (Son, Father, Country, and Freq). The total sample size is 28,887.

# Open and read the data
library(vcdExtra)  # provides the Yamaguchi87 dataset
library(ggplot2)   # used for the plots
y <- as.data.frame(Yamaguchi87)
a) How do the distributions of occupations of the sons in the three countries compare?

According to the plot of the distribution of the sons' occupations below, the frequency is higher in the US than in the UK or Japan in every occupation category: Upper Nonmanual (UpNM), Lower Nonmanual (LoNM), Upper Manual (UpM), Lower Manual (LoM), and Farming (Farm). Job opportunities for the sons therefore appear to be highest in the US, followed by the UK and then Japan.

The categories are defined as follows: upper nonmanuals are professionals, managers, and officials; lower nonmanuals are proprietors, sales workers, and clerical workers; upper manuals are skilled workers; lower manuals are semiskilled and unskilled nonfarm workers; and farm workers are farmers and farm laborers. Looking at the distribution by occupation in the plot below, all categories except Farm share a very similar pattern: the US has the highest frequency, followed by the UK, then Japan. For Farm, the US still has the highest frequency, but the UK has the lowest among the three countries.

# Distribution plot by occupation
ggplot(y, aes(x = Country, y = Freq, fill = Son)) +
  geom_bar(stat = "identity") +
  ggtitle("Distributions by Occupations of the Sons in US, UK, and Japan") +
  ylab("Frequency") + xlab("Country") +
  facet_grid(. ~ Son) +
  guides(fill = "none")  # guides(fill = FALSE) is deprecated in recent ggplot2
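
Since the plot shows raw frequencies, it can help to look at the same information as a table, both as counts and as shares within each country. The following is only a sketch, assuming the vcdExtra package is installed; the names son_counts and son_shares are illustrative and not part of the original analysis.

```r
data("Yamaguchi87", package = "vcdExtra")

# Sons' occupation counts by country, summed over the fathers' occupations
son_counts <- xtabs(Freq ~ Country + Son, data = Yamaguchi87)

# Row-wise shares: each country's row sums to 1
son_shares <- prop.table(son_counts, margin = 1)

print(son_counts)
print(round(son_shares, 3))
```

The shares make the countries comparable even if their sample sizes differ, which the raw counts do not.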

b) How do the distributions of the sons’ and fathers’ occupations in the UK compare?

To compare the occupation distributions of sons and fathers, I plotted a side-by-side frequency graph by occupation, in ascending order. Nonmanual work is more dominant among the sons (i.e., sons have higher frequencies for UpNM and LoNM), while the fathers have higher frequencies in upper manual, lower manual, and farming jobs.
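
A side-by-side comparison of this kind could be produced along the following lines. This is a minimal sketch rather than the code behind the original figure: it assumes the vcdExtra and ggplot2 packages are installed, the names uk, sons, fathers, uk_long, and p_uk are illustrative, and the bars follow the factor order of the occupations rather than ascending frequency.

```r
library(ggplot2)
data("Yamaguchi87", package = "vcdExtra")

# Keep only the UK rows of the frequency data frame
uk <- subset(as.data.frame(Yamaguchi87), Country == "UK")

# Marginal totals per occupation, once for sons and once for fathers
sons    <- aggregate(Freq ~ Son, data = uk, FUN = sum)
fathers <- aggregate(Freq ~ Father, data = uk, FUN = sum)
names(sons)    <- c("Occupation", "Freq")
names(fathers) <- c("Occupation", "Freq")

# Stack the two margins into one long data frame for plotting
uk_long <- rbind(cbind(sons, Generation = "Son"),
                 cbind(fathers, Generation = "Father"))

# Dodged bars place sons and fathers next to each other within each occupation
p_uk <- ggplot(uk_long, aes(x = Occupation, y = Freq, fill = Generation)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Occupations of Sons and Fathers in the UK") +
  xlab("Occupation") + ylab("Frequency")
print(p_uk)
```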

c) Are you surprised by the results or are they what you would have expected?

I am not very surprised by the results. I expected the US to have the highest frequencies for both fathers' and sons' occupations compared to the UK or Japan; one reason for this is the population factor. I also expected the occupation distribution in the UK to follow a similar pattern across job categories for fathers and sons. Statistically, younger generations tend to have better lives and more opportunities, and most parents want to set out better paths for their children. It is therefore more common to encounter fathers with manual or farming jobs and sons with nonmanual jobs.

Problem 2: Whisky

Source: Chapter 4, p. 73 no. 6, from the Simmons Survey. The Whisky file contains a data frame with 2,218 observations on 21 variables. Each variable is coded 1 if the brand was consumed in the last year, 0 if not.

a) Draw a barchart of the number of respondents per brand. What ordering of the brands do you think is best?

The following graph shows how many respondents reported consuming each brand in the last year, among respondents who report consuming scotch. Chivas Regal was reported as the most consumed brand, and Singleton as the least consumed.
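
As a sketch of how such a barchart might be built, the example below counts respondents per brand from a 0/1 indicator data frame and orders the bars by frequency. The helper brand_counts and the toy data frame are hypothetical stand-ins with made-up values, not the Whisky file or its brands; only base R is used.

```r
# Hypothetical helper: count respondents per brand in a 0/1 indicator data frame
brand_counts <- function(indicators) {
  sort(colSums(indicators), decreasing = TRUE)
}

# Toy stand-in for the survey data (made-up values, NOT the Whisky file)
toy <- data.frame(
  brand_a = c(1, 0, 1, 1, 0),
  brand_b = c(0, 0, 1, 0, 0),
  brand_c = c(1, 1, 1, 1, 0)
)

counts <- brand_counts(toy)

# Horizontal bars keep brand names readable; rev() puts the largest bar on top
barplot(rev(counts), horiz = TRUE, las = 1,
        xlab = "Number of respondents",
        main = "Respondents per brand (toy data)")
```

Ordering by frequency rather than alphabetically makes the most and least consumed brands easy to read off the ends of the chart.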