# My Favorite Statistical Procedure? Frequencies!

At a conference I attended a few years ago, a data scientist on a round table discussion, asked what she considered the most important mathematical function in her work, replied: "the division operator". That clever response provided grist for my later answer to a similar question on my favorite statistical procedure: "frequencies and crosstabs". The commonality is, of course, the simplicity and ubiquity of the functions.

I spend much of my current analytics time in what used to be called exploratory data analysis (EDA) or now just data analysis. DA sits between business intelligence and statistical modeling, using comprehensible computations and visualizations to tell data stories. Among the leading "statistics" are simple counts or frequencies, and their multivariate analogs, crosstabs or contingency tables. Actually, for me, they're all just frequencies, be they uni- or multi-attribute.

Counts and frequencies play a foundational role in statistical analysis. In my early career, I used Poisson regression extensively. "Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables." In later years, my emphasis has been more on time series analysis, where count data such as visits, hits, defections, etc. are central.

The analysis that follows is all about frequencies and was done in R using its splendid data.table and tidyverse capabilities for data analysis. I've also done similar computations in Python/pandas and am confident the work could be done as well with standard SQL. Indeed, I believe most good BI/OLAP tools can handle the demands. I know several Tableau geeks who'd say it's a piece of cake!

Why another frequencies function in R? After all, there are the table and xtabs functions from base, count from plyr, and countless others from lesser-known packages. The answer is simple: frequenciesdyn is built on data.table, a very powerful and flexible data management add-on package that performs group computations (e.g. frequencies) faster than the alternatives. It also fits nicely in tidyverse pipelines.

A data set on crime in Chicago is used for the analyses. The data, representing all reported crime in Chicago since 2001, are updated daily and posted a week in arrears. Attributes revolve around the what, where, and when of crime events. The file at this point consists of over 6.5M records.

The technologies deployed below are JupyterLab running an R 3.4 kernel. The scripts are driven primarily through the R data.table and tidyverse packages. Hopefully, readers will see just how powerful these tools are in collaboration. Notable is that neither data.table nor tidyverse is a part of "core" R; each is an add-on maintained by the energetic R ecosystem.


Set a few R options, load some libraries, change the working directory, and assign input file names.

In [124]:
options(warn=-1)
options(scipen=10)
options(datatable.print.topn=100)

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(pryr))
suppressMessages(library(plyr))
suppressMessages(library(dtplyr))
suppressMessages(library(fst))
suppressMessages(library(stringr))
suppressMessages(library(lubridate))

dir <- "c:/bigdata/raw/chicago"

cname <- "crimeplus.fst"
fname <- "fbicode.txt"
ename <- "chicagocommunities.xlsx"

setwd(dir)

cat("\n\n")




Define several useful functions. frequenciesdyn is the workhorse counting function that drives much of the analysis. It dynamically executes code that invokes R's splendid data.table package. It also demonstrates the collaboration between data.table and the tidyverse ecosystem.

In [125]:
meta <- function(dt,dict=FALSE)
{
print(dim(dt))
print(object_size(dt))
if (dict==TRUE) print(str(dt))
cat("\n\n")
}

cat("\n\n")



In [126]:
frequenciesdyn <- function(DTstr, xstr)
{
return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>%
arrange(desc(count)) %>% mutate(percent=100*count/sum(count)))
}

cat("\n\n")
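As noted earlier, these computations translate readily to Python. Here is a minimal, dependency-free sketch of the idea behind frequenciesdyn (group counts sorted descending, plus a percent-of-total column), assuming records are stored as plain dicts; the function and argument names are illustrative, not from the article's code:

```python
from collections import Counter

def frequencies(rows, *cols):
    # Group counts over the requested columns, returned in descending
    # order with a percent-of-total column -- the core of what
    # frequenciesdyn computes. rows: list of dicts; cols: key names.
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    total = sum(counts.values())
    return [{"key": k, "count": n, "percent": 100 * n / total}
            for k, n in counts.most_common()]
```

A call like `frequencies(records, "communityarea", "year")` would mirror the multi-attribute queries shown below, though without data.table's speed.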



In [127]:
allfreqs <- function(dtn)
{
dt <- eval(parse(text=paste("data.table(",dtn,")",sep="")))
nmes <-  names(dt)
lst <- map(nmes,function(nm) frequenciesdyn(dtn,nm))
names(lst) <- nmes
return(lst)
}

cat("\n\n")
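The allfreqs pattern -- one frequency table per attribute -- can be sketched the same way in Python. This is a self-contained illustration with hypothetical names, not the article's implementation:

```python
from collections import Counter

def all_frequencies(rows, cols):
    # One frequency table per attribute, mirroring allfreqs' map over
    # names(dt): {column: [(value, count, percent), ...]} with each
    # table sorted by descending count.
    out = {}
    for c in cols:
        counts = Counter(r[c] for r in rows)
        total = sum(counts.values())
        out[c] = [(v, n, 100 * n / total) for v, n in counts.most_common()]
    return out
```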




Read attributes of interest from the latest "fst" file export of the Chicago crime data set into an R data.table, chicagocrime. The underlying raw data are initially loaded and enriched in another JupyterLab process. The data.table is keyed on fbicodedesc, the crime code description. Note that data are published a week in arrears.

In [128]:
varsout <- c("date","fbicode","fbicodedesc","communityarea","name","longitude","latitude")
chicagocrime <- read_fst(cname,varsout,as.data.table = TRUE) %>% setkey(fbicodedesc)
meta(chicagocrime,dict=TRUE)

cat("\n\n")

[1] 6584682       7
290 MB
Classes 'data.table' and 'data.frame':	6584682 obs. of  7 variables:
 $ date         : POSIXct, format: "2015-03-19 02:35:00" "2015-03-19 10:20:00" ...
 $ fbicode      : Factor w/ 26 levels "01A","01B","02",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ fbicodedesc  : Factor w/ 26 levels "Aggravated Assault",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ communityarea: num  43 32 35 51 44 26 40 71 14 1 ...
 $ name         : Factor w/ 77 levels "Albany Park",..: 65 66 18 63 15 72 69 5 1 60 ...
 $ longitude    : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ latitude     : num  41.8 41.9 41.8 41.7 41.7 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "fbicodedesc"
NULL



Load the communityarea lookup table consisting of a communityarea code, name, and population.

In [129]:
communities <- read_excel(ename,col_names=FALSE,skip=2) %>%
dplyr::select(X__1:X__3) %>%
tbl_dt %>%
setnames(c("name","communityarea","population"))
meta(communities,dict=TRUE)

cat("\n\n")

[1] 77  3
7.41 kB
Classes 'tbl_dt', 'tbl', 'data.table' and 'data.frame': 77 obs. of  3 variables:
 $ name         : chr  "Rogers Park" "West Ridge" "Uptown" "Lincoln Square" ...
 $ communityarea: num  1 2 3 4 5 6 7 8 9 10 ...
 $ population   : num  54991 71942 56362 39493 31867 ...
 - attr(*, ".internal.selfref")=<externalptr>
NULL



Include an enrichment to the Chicago crime data from the Chicago Police Department. fbicode is in the Chicago crime data set; the (fbicode,fbicodedesc) file was built from this site, ultimately to be joined with the raw Chicago data. Note the various "dimensions" of crime.

In [130]:
fbicodes <- fread(fname)
meta(fbicodes,dict=TRUE)

idx <- c("01A")
homicide <- fbicodes[fbicode %in% idx,fbicodedesc]
nhomicide <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("01A","01B","02","03","04A","04B")
violentcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nviolentcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("05","06","07","09")
propertycrime <- fbicodes[fbicode %in% idx,fbicodedesc]
npropertycrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c('01A','01B','02','03','04A','04B','05','06','07','09')
indexcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nindexcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

cat("\n\n")

[1] 26  2
4.32 kB
Classes 'data.table' and 'data.frame':	26 obs. of  2 variables:
 $ fbicode    : chr  "01A" "01B" "02" "03" ...
 $ fbicodedesc: chr  "Homicide 1st & 2nd Degree" "Involuntary Manslaughter" "Criminal Sexual Assault" "Robbery" ...
- attr(*, ".internal.selfref")=<externalptr>
NULL


In [131]:
indexcrime

cat("\n\n")

1. 'Homicide 1st & 2nd Degree'
2. 'Involuntary Manslaughter'
3. 'Criminal Sexual Assault'
4. 'Robbery'
5. 'Aggravated Assault'
6. 'Aggravated Battery'
7. 'Burglary'
8. 'Larceny'
9. 'Motor Vehicle Theft'
10. 'Arson'
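The category filters above are just set membership and set difference. A hedged Python sketch of that logic, using the index-crime and violent-crime code lists from the cell above and a small illustrative sample universe (not the full 26-code list):

```python
# Index-crime and violent-crime code sets from the In [130] cell;
# some_codes is a small made-up sample, not the full code universe.
index_codes = {"01A", "01B", "02", "03", "04A", "04B", "05", "06", "07", "09"}
violent_codes = {"01A", "01B", "02", "03", "04A", "04B"}
some_codes = {"01A", "03", "05", "06", "08B", "14", "18", "26"}

violent = sorted(some_codes & violent_codes)    # like fbicode %in% idx
non_index = sorted(some_codes - index_codes)    # like setdiff(fbicode, idx)
```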



Compute useful datetime values from the data.table. Several are used later.

In [132]:
dt <-chicagocrime[, max(date)]
dt

dtmax <- date(dt)+days(1)-seconds(1)
dtmax

dtm <- date(dtmax)
dtm

numdays <- yday(dtmax)
numdays

date(dtmax)-(numdays-1)
day(dtmax)

prdte <- paste(month(dtmax,label=TRUE),day(dtmax),sep="-")
prdte

dmonth <- data.table(map_df(frequenciesdyn("chicagocrime[homicide]","year=year(date)")$year,
function(yr) list(month=ymd(paste(yr,"01-01",sep="-"))+numdays-1)))[
order(-month)][, month:=paste(month(month,label=TRUE),day(month),sep="-")]

cat("\n\n")

[1] "2018-04-19 23:59:00 UTC"
[1] "2018-04-19 23:59:59 UTC"
109
19
'Apr-19'



Use frequenciesdyn to compute chicagocrime attribute frequencies. First, find the simple counts of all 6.5M+ records (reported crime from 2001 through 2018-Apr-18) by the 77 Chicago community areas. View the first 10 of the frequencies returned in descending order. Note that "freqs" is simply an R data.table that can participate in subsequent computation. NA represents "missing" in this data, so almost 10% of reported crime is without a communityarea. Computational performance is exceptional.

In [133]:
ptmpre <- proc.time()
freqs <- frequenciesdyn("chicagocrime","communityarea")
freqs %>% head(10)
ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")

communityarea   count    percent
NA              616029   9.355486
25              381532   5.794236
8               200855   3.050337
43              194875   2.959520
23              189074   2.871422
24              176819   2.685308
67              176234   2.676424
28              174499   2.650075
29              170842   2.594537
71              169208   2.569722

   user  system elapsed
   0.44    0.05    0.48



Now examine the multi-attribute frequencies by communityarea and year. Since the counts are listed in descending order, it appears that 2001-2002 hold most of the missing values (NA) for communityarea.

In [134]:
freqs <- frequenciesdyn("chicagocrime","communityarea,year=year(date)")
freqs %>% head(10)

cat("\n\n")

communityarea   year   count    percent
NA              2001   481635   7.3144762
NA              2002   133160   2.0222693
25              2003    30834   0.4682686
25              2004    29487   0.4478121
25              2006    28896   0.4388367
25              2007    28535   0.4333543
25              2005    28376   0.4309396
25              2008    27260   0.4139911
25              2009    26037   0.3954177
25              2010    24296   0.3689776



Consider a similar communityarea, year set of frequencies, but this time include only homicides.
The homicide data appear to be cleaner with respect to NA communityarea than the data as a whole.

In [135]:
freqs <- frequenciesdyn("chicagocrime[homicide]","communityarea,year=year(date)")
freqs %>% head(10)

cat("\n\n")

communityarea   year   count   percent
NA              2001   207     2.2974473
NA              2002   143     1.5871254
25              2016    88     0.9766926
25              2017    81     0.8990011
68              2016    50     0.5549390
25              2015    50     0.5549390
25              2002    49     0.5438402
25              2001    44     0.4883463
25              2003    44     0.4883463
29              2017    42     0.4661487



Add a second filter, year>2015, and focus on violentcrime.

In [136]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]",
"communityarea,year=year(date)")
freqs %>% head(10)

cat("\n\n")

communityarea   year   count   percent
25              2016   2148    3.418423
25              2017   2065    3.286333
29              2016   1290    2.052963
43              2017   1194    1.900185
29              2017   1164    1.852441
43              2016   1072    1.706028
23              2017   1053    1.675791
67              2016   1004    1.597810
23              2016    983    1.564390
71              2016    979    1.558024



Include the "name" of the community area from the "denormalized" chicagocrime data.table.

In [137]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]",
"communityarea,name,year=year(date)")
freqs %>% head(10)

cat("\n\n")

communityarea   name             year   count   percent
25              Austin           2016   2148    3.418423
25              Austin           2017   2065    3.286333
29              North Lawndale   2016   1290    2.052963
43              South Shore      2017   1194    1.900185
29              North Lawndale   2017   1164    1.852441
43              South Shore      2016   1072    1.706028
23              Humboldt Park    2017   1053    1.675791
67              West Englewood   2016   1004    1.597810
23              Humboldt Park    2016    983    1.564390
71              Auburn Gresham   2016    979    1.558024



How would we get the communityarea name if it weren't an attribute in the chicagocrime data.table? Since the frequenciesdyn result is itself a data.table, simply join to the communities data.table. Note that the population attribute from communities is included. setcolorder shuffles the order of the returned attributes.
In [138]:
freqs <- setcolorder(
left_join(
frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]","communityarea,year=year(date)"),
communities,
by=c("communityarea"="communityarea")),c(1,5,2,6,3,4)) %>%
arrange(desc(count))
freqs %>% head(10)

cat("\n\n")

communityarea   name             year   population   count   percent
25              Austin           2016   98514        2148    3.418423
25              Austin           2017   98514        2065    3.286333
29              North Lawndale   2016   35912        1290    2.052963
43              South Shore      2017   52010        1194    1.900185
29              North Lawndale   2017   35912        1164    1.852441
43              South Shore      2016   52010        1072    1.706028
23              Humboldt Park    2017   56323        1053    1.675791
67              West Englewood   2016   35505        1004    1.597810
23              Humboldt Park    2016   56323         983    1.564390
71              Auburn Gresham   2016   48743         979    1.558024



Try the same query for homicide. Austin is at the top of both because it has the largest population of any communityarea.

In [139]:
freqs <- setcolorder(
left_join(
frequenciesdyn("chicagocrime[homicide][year(date)>2015]","communityarea,year=year(date)"),
communities,
by=c("communityarea"="communityarea")),c(1,5,2,6,3,4)) %>%
arrange(desc(count))
freqs %>% head(10)

cat("\n\n")

communityarea   name                     year   population   count   percent
25              Austin                   2016   98514        88      5.587302
25              Austin                   2017   98514        81      5.142857
68              Englewood                2016   30654        50      3.174603
29              North Lawndale           2017   35912        42      2.666667
61              New City                 2016   44377        42      2.666667
67              West Englewood           2016   35505        36      2.285714
49              Roseland                 2017   44619        35      2.222222
29              North Lawndale           2016   35912        33      2.095238
43              South Shore              2017   52010        33      2.095238
69              Greater Grand Crossing   2016   32602        33      2.095238



Change focus to type of crime.

In [140]:
freqs <- frequenciesdyn("chicagocrime","fbicode")
freqs %>% head(10)

cat("\n\n")

fbicode   count     percent
06        1378102   20.928907
08B       1027733   15.607937
14         754434   11.457410
26         673537   10.228846
18         659765   10.019694
05         380814    5.783332
07         308492    4.684995
08A        305381    4.637749
03         249975    3.796311
11         210540    3.197421



Include fbicodedesc and filter communityarea as well.
In [141]:
freqs <- frequenciesdyn("chicagocrime[communityarea>30]","fbicode,fbicodedesc")
freqs %>% head(10)

cat("\n\n")

fbicode   fbicodedesc              count    percent
06        Larceny                  598451   19.037686
08B       Simple Battery           518395   16.490976
14        Vandalism                372068   11.836080
26        Misc Non-Index Offense   331984   10.560944
18        Drug Abuse               281166    8.944341
05        Burglary                 195521    6.219837
08A       Simple Assault           159541    5.075255
07        Motor Vehicle Theft      144421    4.594264
03        Robbery                  126985    4.039597
11        Fraud                     94943    3.020289



Get fbicodedesc from a join to the fbicodes data.table rather than as a denormalized attribute of chicagocrime.

In [142]:
freqs <- setcolorder(left_join(frequenciesdyn("chicagocrime[communityarea>30]","fbicode"),fbicodes,
by=c("fbicode"="fbicode")) %>% arrange(desc(count)),c(1,4,2,3))[]
freqs %>% head(10)

cat("\n\n")

fbicode   fbicodedesc              count    percent
06        Larceny                  598451   19.037686
08B       Simple Battery           518395   16.490976
14        Vandalism                372068   11.836080
26        Misc Non-Index Offense   331984   10.560944
18        Drug Abuse               281166    8.944341
05        Burglary                 195521    6.219837
08A       Simple Assault           159541    5.075255
07        Motor Vehicle Theft      144421    4.594264
03        Robbery                  126985    4.039597
11        Fraud                     94943    3.020289



Now consider "time series" queries -- this one homicides by year. Sort by year. Print all.

In [143]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]","year=year(date)") %>%
arrange(year)
freqs

cat("\n\n")

year   count   percent
2001   667     7.509570
2002   656     7.385724
2003   601     6.766494
2004   453     5.100203
2005   451     5.077685
2006   471     5.302860
2007   447     5.032650
2008   513     5.775726
2009   460     5.179014
2010   438     4.931322
2011   436     4.908804
2012   504     5.674398
2013   421     4.739923
2014   424     4.773700
2015   493     5.550552
2016   778     8.759288
2017   669     7.532087



Next, homicides by month/year.
In [144]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
"year=year(date),month=month(date,label=TRUE)") %>%
arrange(year,month)
freqs %>% head(12)

cat("\n\n")

year   month   count   percent
2001   Jan     42      0.4728665
2001   Feb     27      0.3039856
2001   Mar     37      0.4165728
2001   Apr     59      0.6642648
2001   May     42      0.4728665
2001   Jun     69      0.7768521
2001   Jul     78      0.8781806
2001   Aug     58      0.6530061
2001   Sep     71      0.7993695
2001   Oct     71      0.7993695
2001   Nov     64      0.7205584
2001   Dec     49      0.5516776



What're the worst months for violentcrime? Summer is ominous.

In [145]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
"month=month(date,label=TRUE)")
freqs

cat("\n\n")

month   count   percent
Jul     56025   10.004607
Aug     53944    9.632995
Jun     52146    9.311919
Oct     50740    9.060844
Sep     50433    9.006022
May     50251    8.973521
Apr     44629    7.969578
Nov     43915    7.842076
Mar     41959    7.492786
Dec     41931    7.487786
Jan     41143    7.347069
Feb     32876    5.870798



Homicides by quarter embedded in year. R's powerful datetime functions are a godsend!

In [146]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
"quarter=quarter(date,with_year=TRUE)") %>%
arrange(quarter)
freqs %>% head(10)

cat("\n\n")

quarter   count   percent
2001.1    106     1.193425
2001.2    170     1.913983
2001.3    207     2.330556
2001.4    184     2.071605
2002.1    119     1.339788
2002.2    160     1.801396
2002.3    212     2.386850
2002.4    165     1.857690
2003.1    121     1.362306
2003.2    167     1.880207



An equivalent answer for quarterly homicides.

In [147]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
"quarter=quarter(date),year=year(date)") %>%
arrange(year,quarter)
freqs %>% head(12)

cat("\n\n")

quarter   year   count   percent
1         2001   106     1.193425
2         2001   170     1.913983
3         2001   207     2.330556
4         2001   184     2.071605
1         2002   119     1.339788
2         2002   160     1.801396
3         2002   212     2.386850
4         2002   165     1.857690
1         2003   121     1.362306
2         2003   167     1.880207
3         2003   167     1.880207
4         2003   146     1.643774



How about violentcrime by quarter, independent of year? Almost 29% of violentcrimes occur in the summer.
In [148]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
"quarter=quarter(date)") %>%
arrange(quarter)
freqs

cat("\n\n")

quarter   count    percent
1         115978   20.71065
2         147026   26.25502
3         160402   28.64362
4         136586   24.39071



What hours in the day are most vulnerable to homicide? Include a cumulative sum of counts by hour. Nearly 20% of homicides occur between 11 PM and 2 AM.

In [149]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]","hour=hour(date)") %>%
mutate(cumpct=cumsum(percent))
freqs %>% head(10)

cat("\n\n")

hour   count   percent    cumpct
0      611     6.879081    6.879081
1      592     6.665166   13.544247
23     549     6.181040   19.725287
2      548     6.169782   25.895069
22     506     5.696915   31.591984
21     500     5.629363   37.221347
20     463     5.212790   42.434136
3      431     4.852511   47.286647
18     407     4.582301   51.868948
16     369     4.154470   56.023418



Now a bit of data profiling, looking at NA or missing data by attribute. TRUE indicates missing. latitude/longitude is much more "non-missing" than is communityarea, and may be helpful in resolving communityarea NAs.
In [150]:
frequenciesdyn("chicagocrime","is_na_name=is.na(name)")
cat("\n\n")
freqs <- frequenciesdyn("chicagocrime[is.na(name)]","is_na_name=is.na(name),year=year(date)")
freqs %>% head(12)
cat("\n\n")
frequenciesdyn("chicagocrime","is_na_longitude=is.na(longitude)")
cat("\n\n")
frequenciesdyn("chicagocrime","is_na_name=is.na(name),is_na_longitude=is.na(longitude)")
cat("\n\n")

is_na_name   count     percent
FALSE        5968563   90.643147
TRUE          616119    9.356853

is_na_name   year   count    percent
TRUE         2001   481635   78.172398514
TRUE         2002   133166   21.613681773
TRUE         2008      254    0.041225802
TRUE         2009      224    0.036356613
TRUE         2010      189    0.030675892
TRUE         2011      186    0.030188973
TRUE         2007      164    0.026618234
TRUE         2004       82    0.013309117
TRUE         2006       61    0.009900685
TRUE         2005       57    0.009251460
TRUE         2003       56    0.009089153
TRUE         2012       32    0.005193802

is_na_longitude   count     percent
FALSE             6451942   97.984109
TRUE               132740    2.015891

is_na_name   is_na_longitude   count     percent
FALSE        FALSE             5845188   88.769480
TRUE         FALSE              606754    9.214629
FALSE        TRUE               123375    1.873667
TRUE         TRUE                 9365    0.142224



There's no real limit to the number of attributes that can be evaluated in a frequenciesdyn statement. Consider types of violentcrime by communityarea and month/year. There are over 55,000 records returned from this counts query!

In [151]:
freqs <- frequenciesdyn("chicagocrime[violentcrime]",
"name,fbicodedesc,year=year(date),month=month(date)")
freqs %>% nrow()
freqs %>% head(10)

cat("\n\n")

55963

name   fbicodedesc          year   month   count   percent
NA     Robbery              2001   10      1852    0.3268262
NA     Aggravated Battery   2001    7      1766    0.3116496
NA     Robbery              2001    9      1748    0.3084731
NA     Robbery              2001    8      1699    0.2998260
NA     Robbery              2001    7      1656    0.2922377
NA     Aggravated Battery   2001    6      1642    0.2897671
NA     Robbery              2001   12      1631    0.2878259
NA     Robbery              2001   11      1594    0.2812964
NA     Aggravated Battery   2001    8      1585    0.2797082
NA     Aggravated Battery   2001    5      1561    0.2754729



Finally, look at a slightly more complex analysis that combines the results of three frequencies queries.
I use this to monitor/predict both homicides and violentcrimes for the current year from year-to-date and previous years' figures.

In [152]:
ehomicide <- cbind(
frequenciesdyn("chicagocrime[homicide][yday(date)<numdays]","year=year(date)") %>%
arrange(desc(year)) %>% dplyr::select(-percent) %>%
mutate(numdays=numdays,lastday=dmonth),
frequenciesdyn("chicagocrime[homicide][month(date)<=month(dt)]","year=year(date)") %>%
arrange(desc(year)) %>% dplyr::select(-percent,-year),
frequenciesdyn("chicagocrime[homicide]","year=year(date)") %>%
arrange(desc(year)) %>% dplyr::select(-percent,-year)
) %>%
setnames(c("year","sofar","numdays","lastday","monthend","yearend")) %>%
setcolorder(c("year","numdays","lastday","sofar","monthend","yearend"))

ehomicide[1]$yearend <- NA
ehomicide[1]$monthend <- NA

ehomicide

cat("\n\n")

year   numdays   lastday   sofar   monthend   yearend
2018   109       Apr-19    127     NA         NA
2017   109       Apr-19    175     193        669
2016   109       Apr-18    169     185        778
2015   109       Apr-19    102     119        493
2014   109       Apr-19     78     101        424
2013   109       Apr-19     84      95        421
2012   109       Apr-18    141     158        504
2011   109       Apr-19     94     105        436
2010   109       Apr-19    106     121        438
2009   109       Apr-19     88     110        460
2008   109       Apr-18    107     134        513
2007   109       Apr-19    109     125        447
2006   109       Apr-19     92     115        471
2005   109       Apr-19    106     117        451
2004   109       Apr-18    125     138        453
2003   109       Apr-19    148     178        601
2002   109       Apr-19    147     162        656
2001   109       Apr-19    142     165        667
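The cbind of three frequenciesdyn calls above amounts to tallying each homicide into three buckets per year: so-far, through month-end, and full year. A rough Python sketch of that logic, with made-up (year, day-of-year, month) event tuples rather than the Chicago data:

```python
from collections import defaultdict

# Thresholds mirror numdays and month(dt) from the article; the events
# below are hypothetical, for illustration only.
numdays, cutmonth = 109, 4
events = [(2017, 50, 2), (2017, 200, 7), (2018, 40, 2)]

tally = defaultdict(lambda: [0, 0, 0])   # year -> [sofar, monthend, yearend]
for year, yday, month in events:
    if yday < numdays:
        tally[year][0] += 1              # year-to-date, like yday(date)<numdays
    if month <= cutmonth:
        tally[year][1] += 1              # through month-end, like month(date)<=month(dt)
    tally[year][2] += 1                  # full year
```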



And of course it's easy to save the results of these frequencies queries for subsequent time series and other statistical analyses.

In [153]:
murders <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
"year=year(date),month=month(date)")
violentc <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
"year=year(date),month=month(date)")

cat("\n\n")



In [154]:
murders %>% head(10)
cat("\n\n")

violentc %>% head(10)
cat("\n\n")

year  month  count  percent
2016 8 95 1.0695789
2017 6 86 0.9682504
2016 10 82 0.9232155
2016 11 79 0.8894393
2016 6 78 0.8781806
2001 7 78 0.8781806
2002 8 77 0.8669219
2017 7 76 0.8556631
2001 9 71 0.7993695
2001 10 71 0.7993695


year  month  count  percent
2002 7 4531 0.8091187
2001 7 4509 0.8051901
2002 8 4419 0.7891184
2001 8 4306 0.7689396
2002 6 4236 0.7564394
2001 10 4232 0.7557251
2002 9 4209 0.7516179
2001 9 4184 0.7471535
2001 6 4066 0.7260818
2003 7 3995 0.7134030



Finally, exercise the allfreqs function, which simply invokes frequenciesdyn for each attribute in the data.table.

In [155]:
ptmpre <- proc.time()

freqs <- allfreqs("chicagocrime")
names(freqs)

cat("\n\n")

ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")

1. 'date'
2. 'fbicode'
3. 'fbicodedesc'
4. 'communityarea'
5. 'name'
6. 'longitude'
7. 'latitude'
fbicodedesc  count  percent
Larceny 1378102 20.928907
Simple Battery 1027733 15.607937
Vandalism 754434 11.457410
Misc Non-Index Offense 673537 10.228846
Drug Abuse 659765 10.019694
Burglary 380814 5.783332
Motor Vehicle Theft 308492 4.684995
Simple Assault 305381 4.637749
Robbery 249975 3.796311
Fraud 210540 3.197421


longitude  count  percent
NA 132740 2.01589082
-87.90523 12843 0.19504359
-87.74153 9226 0.14011307
-87.62788 6535 0.09924549
-87.62410 4033 0.06124821
-87.62820 3024 0.04592477
-87.74273 2803 0.04256849
-87.72642 2599 0.03947040
-87.64700 2545 0.03865031
-87.62449 2343 0.03558258
   user  system elapsed
2.42    0.34    2.77
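As a last aside, the missing-data profiling earlier is just a crosstab of boolean flags, and that shape falls out naturally from counting flag tuples. A small Python sketch with stand-in records (not the Chicago data):

```python
from collections import Counter

# Stand-in records; only the missing/non-missing pattern matters here.
rows = [
    {"name": "Austin", "longitude": -87.76},
    {"name": None,     "longitude": -87.62},
    {"name": "Austin", "longitude": None},
    {"name": None,     "longitude": None},
    {"name": "Uptown", "longitude": -87.65},
]
# (is_na_name, is_na_longitude) -> count, like the two-attribute
# frequenciesdyn profiling query.
xtab = Counter((r["name"] is None, r["longitude"] is None) for r in rows)
total = sum(xtab.values())
percents = {k: 100 * n / total for k, n in xtab.items()}
```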



That's it for now. Hopefully, the queries provide a sense of what's possible with these functions in tandem with data.table and tidyverse capabilities. Next month, I'll take a look at simple graphs to showcase the frequencies results.