My Favorite Statistical Procedure? Frequencies!

At a conference I attended a few years ago, a data scientist on a round-table discussion was asked what she considered the most important mathematical function in her work. Her answer: "the division operator". That clever response provided grist for my later answer to a similar question about my favorite statistical procedure: "frequencies and crosstabs". The commonality is, of course, the simplicity and ubiquity of the functions.

I spend much of my current analytics time on what used to be called exploratory data analysis (EDA), or now just data analysis (DA). DA sits between business intelligence and statistical modeling, using comprehensible computations and visualizations to tell data stories. Among the leading "statistics" are simple counts or frequencies and their multivariate analogs, crosstabs or contingency tables. Actually, for me they're all just frequencies, be they uni- or multi-attribute.

Counts and frequencies play a foundational role in statistical analysis. In my early career, I used Poisson regression extensively: "Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables." In later years, my emphasis has shifted to time series analysis, where count data such as visits, hits, and defections are central.
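To make that connection concrete, here's a minimal, purely illustrative sketch of a Poisson regression on synthetic count data using base R's glm; the variable names are hypothetical and nothing here touches the Chicago data:

# illustrative only: Poisson regression on synthetic counts
set.seed(1)
d <- data.frame(exposure = runif(100, 1, 10))
d$count <- rpois(100, lambda = exp(0.2 + 0.3*d$exposure))
summary(glm(count ~ exposure, family = poisson, data = d))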

The analysis that follows is all about frequencies and was done in R using its splendid data.table and tidyverse capabilities for data analysis. I've also done similar computations in Python/pandas and am confident the work could be done as well with standard SQL. Indeed, I believe most good BI/OLAP tools can handle the demands. I know several Tableau geeks who'd say it's a piece of cake!

Why another frequencies function in R? After all, there are the table and xtabs functions from base, count from plyr, and countless others from lesser-known packages. The answer is simple: frequenciesdyn is built on data.table, a very powerful and flexible data management add-on package that performs group computations (e.g., frequencies) faster than the alternatives. It also fits nicely in tidyverse pipelines.
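For readers new to data.table, a quick sketch of the grouping idiom that frequenciesdyn (defined below) wraps, with the base R equivalent for comparison; the toy table DT is mine:

library(data.table)
DT <- data.table(g = c("a","b","a","a","b"))
DT[, .(count = .N), .(g)]   # counts by group -- the core of frequenciesdyn
table(DT$g)                 # base R equivalent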

A data set on crime in Chicago is used for the analyses. The data, representing all reported crime in Chicago since 2001, are updated daily and posted a week in arrears. Attributes revolve around the what, where, and when of crime events. The file at this point consists of over 6.5M records.

The technology deployed below is JupyterLab running an R 3.4 kernel. The scripts are driven primarily by the R data.table and tidyverse packages. Hopefully, readers will see just how powerful these tools are in collaboration. Notably, neither data.table nor tidyverse is part of "core" R; each is an add-on maintained by the energetic R ecosystem.
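For those following along at home, a minimal setup sketch; the CRAN package names match the library() calls below, though your versions will differ from mine:

install.packages(c("data.table","tidyverse","pryr","plyr",
                   "dtplyr","fst","stringr","lubridate","readxl"))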

In [ ]:

Set a few R options, load some libraries, change the working directory, and assign input file names.

In [124]:
options(warn=-1)                    # suppress warning messages
options(scipen=10)                  # discourage scientific notation
options(datatable.print.topn=100)   # print up to 100 rows from head/tail

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(pryr))
suppressMessages(library(plyr))
suppressMessages(library(dtplyr))
suppressMessages(library(fst))
suppressMessages(library(stringr))
suppressMessages(library(lubridate))
suppressMessages(library(readxl))


dir <- "c:/bigdata/raw/chicago"

cname <- "crimeplus.fst"
fname <- "fbicode.txt"
ename <- "chicagocommunities.xlsx"

setwd(dir)

cat("\n\n")

In [ ]:

Define several useful functions. frequenciesdyn is the workhorse counting function that drives much of the analysis. It dynamically executes code that invokes R's splendid data.table package. It also demonstrates the collaboration between data.table and the tidyverse ecosystem.

In [125]:
meta <- function(dt,dict=FALSE)
{
    # print dimensions, memory footprint, and optionally the structure
    # of a data.table
    print(dim(dt))
    print(object_size(dt))
    if (dict==TRUE) print(str(dt))
    cat("\n\n")
}

cat("\n\n")

In [126]:
frequenciesdyn <- function(DTstr, xstr)
{    
        # build a data.table counting query from the strings, e.g.
        # 'chicagocrime[,.(count=.N),.(communityarea)]', evaluate it, then
        # sort descending and add a percent column in a tidyverse pipeline
        return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>% 
               arrange(desc(count)) %>% mutate(percent=100*count/sum(count)))
} 

cat("\n\n")

In [127]:
allfreqs <- function(dtn)
{
        # invoke frequenciesdyn on each attribute of the named data.table,
        # returning a named list of univariate frequency tables
        dt <- eval(parse(text=paste("data.table(",dtn,")",sep="")))
        nmes <- names(dt)
        lst <- map(nmes,function(nm) frequenciesdyn(dtn,nm))
        names(lst) <- nmes
        return(lst)
}
                   
cat("\n\n")

In [ ]:

Read attributes of interest from the latest "fst" file export of the Chicago crime data set into an R data.table, chicagocrime. The underlying raw data are initially loaded and enriched in another JupyterLab process. The data.table is keyed on fbicodedesc, the crime code description. Note that data are published a week in arrears.

In [128]:
varsout <- c("date","fbicode","fbicodedesc","communityarea","name","longitude","latitude")
chicagocrime <- read_fst(cname,varsout,as.data.table = TRUE) %>% setkey(fbicodedesc) 
meta(chicagocrime,dict=TRUE)

cat("\n\n")
[1] 6584682       7
290 MB
Classes 'data.table' and 'data.frame':	6584682 obs. of  7 variables:
 $ date         : POSIXct, format: "2015-03-19 02:35:00" "2015-03-19 10:20:00" ...
 $ fbicode      : Factor w/ 26 levels "01A","01B","02",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ fbicodedesc  : Factor w/ 26 levels "Aggravated Assault",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ communityarea: num  43 32 35 51 44 26 40 71 14 1 ...
 $ name         : Factor w/ 77 levels "Albany Park",..: 65 66 18 63 15 72 69 5 1 60 ...
 $ longitude    : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ latitude     : num  41.8 41.9 41.8 41.7 41.7 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "fbicodedesc"
NULL




Load the communityarea lookup table consisting of a communityarea code, name, and population.

In [129]:
communities <- read_excel(ename,col_names=FALSE,skip=2) %>% 
    dplyr::select(X__1:X__3) %>% tbl_dt %>% setnames(c("name","communityarea","population"))

meta(communities,dict=TRUE)

cat("\n\n")
[1] 77  3
7.41 kB
Classes 'tbl_dt', 'tbl', 'data.table' and 'data.frame':	77 obs. of  3 variables:
 $ name         : chr  "Rogers Park" "West Ridge" "Uptown" "Lincoln Square" ...
 $ communityarea: num  1 2 3 4 5 6 7 8 9 10 ...
 $ population   : num  54991 71942 56362 39493 31867 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL




Include an enrichment to the Chicago crime data from the Chicago Police Department. fbicode is in the Chicago crime data set; the (fbicode, fbicodedesc) lookup file was built from the CPD's published crime-type codes, ultimately to be joined with the raw Chicago data. Note the various "dimensions" of crime defined below.

In [130]:
fbicodes <- fread(fname)
meta(fbicodes,dict=TRUE)

idx <- c("01A")
homicide <- fbicodes[fbicode %in% idx,fbicodedesc]
nhomicide <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("01A","01B","02","03","04A","04B")
violentcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nviolentcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("05","06","07","09")
propertycrime <- fbicodes[fbicode %in% idx,fbicodedesc]
npropertycrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]
         
idx <- c('01A','01B','02','03','04A','04B','05','06','07','09') 
indexcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nindexcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

cat("\n\n")
[1] 26  2
4.32 kB
Classes 'data.table' and 'data.frame':	26 obs. of  2 variables:
 $ fbicode    : chr  "01A" "01B" "02" "03" ...
 $ fbicodedesc: chr  "Homicide 1st & 2nd Degree" "Involuntary Manslaughter" "Criminal Sexual Assault" "Robbery" ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL




In [131]:
indexcrime

cat("\n\n")
  1. 'Homicide 1st & 2nd Degree'
  2. 'Involuntary Manslaughter'
  3. 'Criminal Sexual Assault'
  4. 'Robbery'
  5. 'Aggravated Assault'
  6. 'Aggravated Battery'
  7. 'Burglary'
  8. 'Larceny'
  9. 'Motor Vehicle Theft'
  10. 'Arson'

Compute useful datetime values from the data.table. Several are used later.

In [132]:
dt <- chicagocrime[, max(date)]
dt

dtmax <- date(dt)+days(1)-seconds(1)
dtmax

dtm <- date(dtmax)
dtm

numdays <- yday(dtmax)
numdays

date(dtmax)-(numdays-1)
day(dtmax)

prdte <- paste(month(dtmax,label=TRUE),day(dtmax),sep="-")
prdte

# for each year with homicides, the date falling numdays into that year,
# formatted like prdte; used to label the year-to-date table later
dmonth <- data.table(map_df(frequenciesdyn("chicagocrime[homicide]","year=year(date)")$year,
                  function(yr) list(month=ymd(paste(yr,"01-01",sep="-"))+numdays-1)))[
                  order(-month)][,
                  month:=paste(month(month,label=TRUE),day(month),sep="-")]

cat("\n\n")
[1] "2018-04-19 23:59:00 UTC"
[1] "2018-04-19 23:59:59 UTC"
109
19
'Apr-19'

Use frequenciesdyn to compute chicagocrime attribute frequencies. First, find the simple counts of all 6.5M+ records of reported crime, 2001 through 2018-Apr-19, by the 77 Chicago community areas. View the first 10 of the frequencies returned, in descending order. Note that "freqs" is simply an R data.table that can participate in subsequent computation, as the sketch after the output illustrates. NA represents "missing" in this data, so almost 10% of reported crime is without a communityarea. Computational performance is exceptional.

In [133]:
ptmpre <- proc.time()

freqs <- frequenciesdyn("chicagocrime","communityarea")
freqs %>% head(10)

ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")
communityarea  count  percent
NA 616029 9.355486
25 381532 5.794236
8 200855 3.050337
43 194875 2.959520
23 189074 2.871422
24 176819 2.685308
67 176234 2.676424
28 174499 2.650075
29 170842 2.594537
71 169208 2.569722
   user  system elapsed 
   0.44    0.05    0.48 
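
To illustrate that last point, a hypothetical follow-up or two on freqs using plain data.table syntax:

freqs[is.na(communityarea)]   # isolate the missing-communityarea row
freqs[1:10, sum(percent)]     # share of all crime in the ten largest areas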


Now examine the multi-attribute frequencies by communityarea and year. Since the counts are listed in descending order, it's apparent that 2001-2002 hold most of the missing (NA) communityarea values.

In [134]:
freqs <- frequenciesdyn("chicagocrime","communityarea,year=year(date)") 
freqs %>% head(10)

cat("\n\n")
communityarea  year  count  percent
NA 2001 481635 7.3144762
NA 2002 133160 2.0222693
25 2003 30834 0.4682686
25 2004 29487 0.4478121
25 2006 28896 0.4388367
25 2007 28535 0.4333543
25 2005 28376 0.4309396
25 2008 27260 0.4139911
25 2009 26037 0.3954177
25 2010 24296 0.3689776

Consider a similar communityarea/year set of frequencies, but this time include only homicides. The homicide data appear to be cleaner with respect to NA communityarea than the data as a whole.

In [135]:
freqs <- frequenciesdyn("chicagocrime[homicide]","communityarea,year=year(date)")
freqs %>% head(10)

cat("\n\n")
communityarea  year  count  percent
NA 2001 207 2.2974473
NA 2002 143 1.5871254
25 2016 88 0.9766926
25 2017 81 0.8990011
68 2016 50 0.5549390
25 2015 50 0.5549390
25 2002 49 0.5438402
25 2001 44 0.4883463
25 2003 44 0.4883463
29 2017 42 0.4661487

Add a second filter, year>2015, and focus on violentcrime.

In [136]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]",
                        "communityarea,year=year(date)") 
freqs %>% head(10)

cat("\n\n")
communityarea  year  count  percent
25 2016 2148 3.418423
25 2017 2065 3.286333
29 2016 1290 2.052963
43 2017 1194 1.900185
29 2017 1164 1.852441
43 2016 1072 1.706028
23 2017 1053 1.675791
67 2016 1004 1.597810
23 2016 983 1.564390
71 2016 979 1.558024

Include the "name" of the community area from the "denormalized" chicagocrime data.table.

In [137]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]",
                        "communityarea,name,year=year(date)") 
freqs %>% head(10)

cat("\n\n")
communityarea  name  year  count  percent
25 Austin 2016 2148 3.418423
25 Austin 2017 2065 3.286333
29 North Lawndale 2016 1290 2.052963
43 South Shore 2017 1194 1.900185
29 North Lawndale 2017 1164 1.852441
43 South Shore 2016 1072 1.706028
23 Humboldt Park 2017 1053 1.675791
67 West Englewood 2016 1004 1.597810
23 Humboldt Park 2016 983 1.564390
71 Auburn Gresham 2016 979 1.558024

How would we get the communityarea name if it weren't an attribute in the chicagocrime data.table? Since the frequenciesdyn result is itself a data.table, simply join it to the communities data.table. Note that the population attribute from communities is included. setcolorder shuffles the order of the returned attributes.

In [138]:
freqs <- setcolorder(
left_join(
frequenciesdyn("chicagocrime[violentcrime][year(date)>2015]","communityarea,year=year(date)"),
communities, by=c("communityarea"="communityarea")),c(1,5,2,6,3,4)) %>% arrange(desc(count))

freqs %>% head(10)

cat("\n\n")
communityarea  name  year  population  count  percent
25 Austin 2016 98514 2148 3.418423
25 Austin 2017 98514 2065 3.286333
29 North Lawndale 2016 35912 1290 2.052963
43 South Shore 2017 52010 1194 1.900185
29 North Lawndale 2017 35912 1164 1.852441
43 South Shore 2016 52010 1072 1.706028
23 Humboldt Park 2017 56323 1053 1.675791
67 West Englewood 2016 35505 1004 1.597810
23 Humboldt Park 2016 56323 983 1.564390
71 Auburn Gresham 2016 48743 979 1.558024

Try the same query for homicide. Austin tops both lists because it has the largest population of any communityarea; the per-capita sketch after the output adjusts for that.

In [139]:
freqs <- setcolorder(
left_join(
frequenciesdyn("chicagocrime[homicide][year(date)>2015]","communityarea,year=year(date)"),
communities, by=c("communityarea"="communityarea")),c(1,5,2,6,3,4)) %>% arrange(desc(count))

freqs %>% head(10)

cat("\n\n")
communityarea  name  year  population  count  percent
25 Austin 2016 98514 88 5.587302
25 Austin 2017 98514 81 5.142857
68 Englewood 2016 30654 50 3.174603
29 North Lawndale 2017 35912 42 2.666667
61 New City 2016 44377 42 2.666667
67 West Englewood 2016 35505 36 2.285714
49 Roseland 2017 44619 35 2.222222
29 North Lawndale 2016 35912 33 2.095238
43 South Shore 2017 52010 33 2.095238
69 Greater Grand Crossing 2016 32602 33 2.095238
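
A hedged follow-up sketch: with population now joined in, it's a one-liner to normalize counts to annual rates per 100K residents, which reshuffles the rankings away from the most populous areas. per100K is my name, not an attribute of the data:

rates <- freqs %>% mutate(per100K = 100000*count/population) %>%
         arrange(desc(per100K))
rates %>% head(10)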

Change focus to type of crime.

In [140]:
freqs <- frequenciesdyn("chicagocrime","fbicode")

freqs %>% head(10)

cat("\n\n")
fbicode  count  percent
06 1378102 20.928907
08B 1027733 15.607937
14 754434 11.457410
26 673537 10.228846
18 659765 10.019694
05 380814 5.783332
07 308492 4.684995
08A 305381 4.637749
03 249975 3.796311
11 210540 3.197421

Include fbicodedesc and filter on communityarea as well.

In [141]:
freqs <- frequenciesdyn("chicagocrime[communityarea>30]","fbicode,fbicodedesc")

freqs %>% head(10)

cat("\n\n")
fbicode  fbicodedesc  count  percent
06 Larceny 598451 19.037686
08B Simple Battery 518395 16.490976
14 Vandalism 372068 11.836080
26 Misc Non-Index Offense 331984 10.560944
18 Drug Abuse 281166 8.944341
05 Burglary 195521 6.219837
08A Simple Assault 159541 5.075255
07 Motor Vehicle Theft 144421 4.594264
03 Robbery 126985 4.039597
11 Fraud 94943 3.020289

Get fbicodedesc from a join to the fbicodes data.table rather than as a denormalized attribute of chicagocrime.

In [142]:
freqs <- setcolorder(left_join(frequenciesdyn("chicagocrime[communityarea>30]","fbicode"),fbicodes,
                    by=c("fbicode"="fbicode")) %>%
                    arrange(desc(count)),c(1,4,2,3))[] 

freqs %>% head(10)

cat("\n\n")
fbicode  fbicodedesc  count  percent
06 Larceny 598451 19.037686
08B Simple Battery 518395 16.490976
14 Vandalism 372068 11.836080
26 Misc Non-Index Offense 331984 10.560944
18 Drug Abuse 281166 8.944341
05 Burglary 195521 6.219837
08A Simple Assault 159541 5.075255
07 Motor Vehicle Theft 144421 4.594264
03 Robbery 126985 4.039597
11 Fraud 94943 3.020289

Now consider "time series" queries, starting with homicides by year. Sort by year. Print all.

In [143]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]","year=year(date)") %>%
                        arrange(year)

freqs 

cat("\n\n")
year  count  percent
2001 667 7.509570
2002 656 7.385724
2003 601 6.766494
2004 453 5.100203
2005 451 5.077685
2006 471 5.302860
2007 447 5.032650
2008 513 5.775726
2009 460 5.179014
2010 438 4.931322
2011 436 4.908804
2012 504 5.674398
2013 421 4.739923
2014 424 4.773700
2015 493 5.550552
2016 778 8.759288
2017 669 7.532087

Next, homicides by month/year.

In [144]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
               "year=year(date),month=month(date,label=TRUE)") %>%
                arrange(year,month)

freqs %>% head(12)

cat("\n\n")
year  month  count  percent
2001 Jan 42 0.4728665
2001 Feb 27 0.3039856
2001 Mar 37 0.4165728
2001 Apr 59 0.6642648
2001 May 42 0.4728665
2001 Jun 69 0.7768521
2001 Jul 78 0.8781806
2001 Aug 58 0.6530061
2001 Sep 71 0.7993695
2001 Oct 71 0.7993695
2001 Nov 64 0.7205584
2001 Dec 49 0.5516776

What're the worst months for violentcrime? Summer is ominous.

In [145]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
               "month=month(date,label=TRUE)") 

freqs 

cat("\n\n")
month  count  percent
Jul 56025 10.004607
Aug 53944 9.632995
Jun 52146 9.311919
Oct 50740 9.060844
Sep 50433 9.006022
May 50251 8.973521
Apr 44629 7.969578
Nov 43915 7.842076
Mar 41959 7.492786
Dec 41931 7.487786
Jan 41143 7.347069
Feb 32876 5.870798

Homicides by quarter embedded in year. R's powerful datetime functions are a godsend!

In [146]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
               "quarter=quarter(date,with_year=TRUE)") %>%
               arrange(quarter)

freqs %>% head(10)

cat("\n\n")
quarter  count  percent
2001.1 106 1.193425
2001.2 170 1.913983
2001.3 207 2.330556
2001.4 184 2.071605
2002.1 119 1.339788
2002.2 160 1.801396
2002.3 212 2.386850
2002.4 165 1.857690
2003.1 121 1.362306
2003.2 167 1.880207

An equivalent answer for quarterly homicides.

In [147]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
               "quarter=quarter(date),year=year(date)") %>%
                arrange(year,quarter)

freqs %>% head(12)

cat("\n\n")
quarter  year  count  percent
1 2001 106 1.193425
2 2001 170 1.913983
3 2001 207 2.330556
4 2001 184 2.071605
1 2002 119 1.339788
2 2002 160 1.801396
3 2002 212 2.386850
4 2002 165 1.857690
1 2003 121 1.362306
2 2003 167 1.880207
3 2003 167 1.880207
4 2003 146 1.643774

How about violentcrime by quarter, independent of year? Almost 29% of violentcrimes occur in the summer.

In [148]:
freqs <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
               "quarter=quarter(date)") %>%
                arrange(quarter)
freqs 

cat("\n\n")
quarter  count  percent
1 115978 20.71065
2 147026 26.25502
3 160402 28.64362
4 136586 24.39071

Which hours of the day are most vulnerable to homicide? Include a cumulative sum of counts by hour. Nearly 20% of homicides occur between 11 PM and 2 AM.

In [149]:
freqs <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]","hour=hour(date)") %>%
              mutate(cumpct=cumsum(percent))

freqs %>% head(10)

cat("\n\n")
hour  count  percent  cumpct
0 611 6.879081 6.879081
1 592 6.665166 13.544247
23 549 6.181040 19.725287
2 548 6.169782 25.895069
22 506 5.696915 31.591984
21 500 5.629363 37.221347
20 463 5.212790 42.434136
3 431 4.852511 47.286647
18 407 4.582301 51.868948
16 369 4.154470 56.023418

Now a bit of data profiling, looking at NA or missing data by attribute. TRUE indicates missing. latitude/longitude is missing far less often than communityarea, and may be helpful in resolving communityarea NAs; a repair sketch follows the output.

In [150]:
frequenciesdyn("chicagocrime","is_na_name=is.na(name)") 
cat("\n\n")

freqs <- frequenciesdyn("chicagocrime[is.na(name)]","is_na_name=is.na(name),year=year(date)") 
freqs %>% head(12)
cat("\n\n")

frequenciesdyn("chicagocrime","is_na_longitude=is.na(longitude)") 
cat("\n\n")

frequenciesdyn("chicagocrime","is_na_name=is.na(name),is_na_longitude=is.na(longitude)") 

cat("\n\n")
is_na_name  count  percent
FALSE 5968563 90.643147
TRUE 616119 9.356853

is_na_name  year  count  percent
TRUE 2001 481635 78.172398514
TRUE 2002 133166 21.613681773
TRUE 2008 254 0.041225802
TRUE 2009 224 0.036356613
TRUE 2010 189 0.030675892
TRUE 2011 186 0.030188973
TRUE 2007 164 0.026618234
TRUE 2004 82 0.013309117
TRUE 2006 61 0.009900685
TRUE 2005 57 0.009251460
TRUE 2003 56 0.009089153
TRUE 2012 32 0.005193802

is_na_longitude  count  percent
FALSE 6451942 97.984109
TRUE 132740 2.015891

is_na_name  is_na_longitude  count  percent
FALSE FALSE 5845188 88.769480
TRUE FALSE 606754 9.214629
FALSE TRUE 123375 1.873667
TRUE TRUE 9365 0.142224
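
One possible repair, sketched under the assumption that a nearest-centroid match is good enough (a point-in-polygon join against the official community boundaries would be exact). The idea: compute each communityarea's mean coordinates from located records, then assign every unlocated record the nearest one. centroids, nearestca, and fixable are my names, not part of the data set:

# per-area mean coordinates from records with both fields present
centroids <- chicagocrime[!is.na(communityarea) & !is.na(longitude),
                          .(clon = mean(longitude), clat = mean(latitude)),
                          by = communityarea]

# nearest centroid by squared euclidean distance
nearestca <- function(lon, lat)
    centroids[which.min((clon - lon)^2 + (clat - lat)^2), communityarea]

# records missing communityarea but carrying coordinates
fixable <- chicagocrime[is.na(communityarea) & !is.na(longitude)]
fixable[, communityarea := mapply(nearestca, longitude, latitude)]  # simple but slow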

There's no real limit to the number of attributes that can be evaluated in a frequenciesdyn statement. Consider types of violentcrime by communityarea and month/year. There are over 55,000 records returned from this counts query!

In [151]:
freqs <- frequenciesdyn("chicagocrime[violentcrime]",
                "name,fbicodedesc,year=year(date),month=month(date)")
freqs %>% nrow()
freqs %>% head(10)

cat("\n\n")
55963
name  fbicodedesc  year  month  count  percent
NA Robbery 2001 10 1852 0.3268262
NA Aggravated Battery 2001 7 1766 0.3116496
NA Robbery 2001 9 1748 0.3084731
NA Robbery 2001 8 1699 0.2998260
NA Robbery 2001 7 1656 0.2922377
NA Aggravated Battery 2001 6 1642 0.2897671
NA Robbery 2001 12 1631 0.2878259
NA Robbery 2001 11 1594 0.2812964
NA Aggravated Battery 2001 8 1585 0.2797082
NA Aggravated Battery 2001 5 1561 0.2754729

Next, look at a slightly more complex analysis that combines the results of three frequencies queries. I use this to monitor/predict both homicides and violentcrimes for the current year from year-to-date and previous years' figures.

In [152]:
ehomicide <- cbind(
     frequenciesdyn("chicagocrime[homicide][yday(date)<numdays]","year=year(date)")  %>%
                       arrange(desc(year)) %>% dplyr::select(-percent) %>% mutate(numdays=numdays,lastday=dmonth),
     frequenciesdyn("chicagocrime[homicide][month(date)<=month(dt)]","year=year(date)") %>%
                        arrange(desc(year)) %>% dplyr::select(-percent,-year),
     frequenciesdyn("chicagocrime[homicide]","year=year(date)")  %>%
                       arrange(desc(year)) %>% dplyr::select(-percent,-year)
     ) %>%
     setnames(c("year","sofar","numdays","lastday","monthend","yearend"))    %>%    
     setcolorder(c("year","numdays","lastday","sofar","monthend","yearend"))

ehomicide[1]$yearend <- NA
ehomicide[1]$monthend <- NA

ehomicide

cat("\n\n")
year  numdays  lastday  sofar  monthend  yearend
2018 109 Apr-19 127 NA NA
2017 109 Apr-19 175 193 669
2016 109 Apr-18 169 185 778
2015 109 Apr-19 102 119 493
2014 109 Apr-19 78 101 424
2013 109 Apr-19 84 95 421
2012 109 Apr-18 141 158 504
2011 109 Apr-19 94 105 436
2010 109 Apr-19 106 121 438
2009 109 Apr-19 88 110 460
2008 109 Apr-18 107 134 513
2007 109 Apr-19 109 125 447
2006 109 Apr-19 92 115 471
2005 109 Apr-19 106 117 451
2004 109 Apr-18 125 138 453
2003 109 Apr-19 148 178 601
2002 109 Apr-19 147 162 656
2001 109 Apr-19 142 165 667

And of course it's easy to save the results of these frequencies queries for subsequent time series and other statistical analyses; a sketch of one such handoff follows the output below.

In [153]:
murders <- frequenciesdyn("chicagocrime[homicide][year(date)<2018]",
                          "year=year(date),month=month(date)")
violentc <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]",
                           "year=year(date),month=month(date)")

cat("\n\n")

In [154]:
murders %>% head(10)
cat("\n\n")

violentc %>% head(10)

cat("\n\n")
year  month  count  percent
2016 8 95 1.0695789
2017 6 86 0.9682504
2016 10 82 0.9232155
2016 11 79 0.8894393
2016 6 78 0.8781806
2001 7 78 0.8781806
2002 8 77 0.8669219
2017 7 76 0.8556631
2001 9 71 0.7993695
2001 10 71 0.7993695

year  month  count  percent
2002 7 4531 0.8091187
2001 7 4509 0.8051901
2002 8 4419 0.7891184
2001 8 4306 0.7689396
2002 6 4236 0.7564394
2001 10 4232 0.7557251
2002 9 4209 0.7516179
2001 9 4184 0.7471535
2001 6 4066 0.7260818
2003 7 3995 0.7134030
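
As a sketch of that handoff, assuming every year/month combination is present (it is for these series), recast the monthly homicide counts as a base R ts object and run a seasonal decomposition:

m <- murders %>% arrange(year, month)
murderts <- ts(m$count, start = c(2001, 1), frequency = 12)
plot(stl(murderts, s.window = "periodic"))   # trend/seasonal/remainder plot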

Finally, exercise the allfreqs function, which simply invokes frequenciesdyn for each attribute in the data.table.

In [155]:
ptmpre <- proc.time()

freqs <- allfreqs("chicagocrime")
names(freqs)

freqs[["fbicodedesc"]] %>% head(10)
cat("\n\n")

freqs[["longitude"]] %>% head(10)

ptmpost <- proc.time()
print(ptmpost-ptmpre)

cat("\n\n")
  1. 'date'
  2. 'fbicode'
  3. 'fbicodedesc'
  4. 'communityarea'
  5. 'name'
  6. 'longitude'
  7. 'latitude'
fbicodedesc  count  percent
Larceny 1378102 20.928907
Simple Battery 1027733 15.607937
Vandalism 754434 11.457410
Misc Non-Index Offense 673537 10.228846
Drug Abuse 659765 10.019694
Burglary 380814 5.783332
Motor Vehicle Theft 308492 4.684995
Simple Assault 305381 4.637749
Robbery 249975 3.796311
Fraud 210540 3.197421

longitude  count  percent
NA 132740 2.01589082
-87.90523 12843 0.19504359
-87.74153 9226 0.14011307
-87.62788 6535 0.09924549
-87.62410 4033 0.06124821
-87.62820 3024 0.04592477
-87.74273 2803 0.04256849
-87.72642 2599 0.03947040
-87.64700 2545 0.03865031
-87.62449 2343 0.03558258
   user  system elapsed 
   2.42    0.34    2.77 


That's it for now. Hopefully, the queries provide a sense of what's possible with these functions in tandem with data.table and tidyverse capabilities. Next month, I'll take a look at simple graphs to showcase the frequencies results.