R, Python, Julia -- and Polyglot

A poll released recently showed Python increasing its lead over R as the language of choice for analytics professionals. Setting aside questions of how representative an online-poll sample is of the broader analytics practitioner population, the findings have nonetheless sparked spirited discussion on the future of software for the trade.

My own unscientific sampling of opinion shows Python slightly ahead of R, with users of each quite passionate about their favorite. My take is that, given the mature ecosystems of both, Python and R will continue to develop, grow, and compete for the foreseeable future.

What I find particularly heartening are the significant developments surrounding interoperability of the two platforms -- the ability to invoke R within Python programs as well as, conversely, Python within R. Indeed, I've written on both Python within R and R within Python for Data Science Central in recent months. Kudos to Python commercial vendor Anaconda and R commercial vendor RStudio for actively promoting these "polyglot" features.

Now complicate this analytics software divide even further by introducing Julia, a language designed from the ground up for performant analytics. With MIT bona fides, Julia has progressed significantly since work on it began in 2009. "Julia has been revolutionizing scientific and technical computing since 2009," says MIT's Edelman -- 2009 being the year the creators started working on a new language that combined the best features of Ruby, MATLAB, C, Python, R, and others. I'm now on my third go-round with Julia and am finally beginning to feel it's legit. The essential DataFrames package is the real deal.

A new competitor such as Julia starts out considerably behind, and remains so until it can both attain a noticeable programmer presence and establish an open source ecosystem. Julia is approaching that point now, helped in no small part by star recognition and a polyglot commitment that allows it to co-exist in the Python/R world. I just love the prospect of using R's uber-productive ggplot in Python and Julia. And I must admit I'm quite impressed by the R-to-Julia package XRJulia, developed by venerable S architect/developer John Chambers, and the Julia-to-R library Rif, from R luminary Laurent Gautier -- even though getting them to work is not for the faint of heart.

This Julia-kernel Jupyter notebook demonstrates interoperability from Julia to R and from Julia to Python, showcasing the RCall and Pandas packages. I first read a personal, daily-updated dataset of Russell stock index levels into a Julia DataFrame. I then summarize the data for a subset of portfolios, "feeding" the resultant DataFrame to a series of R ggplot scripts. Finally, I invoke Python Pandas within Julia to read the same data into a Pandas DataFrame, which is subset and converted to a Julia DataFrame for similar R ggplot visualizations. A subsequent blog will examine R-to-Julia and Python-to-Julia functionality.

The software used here is Julia 1.0.0, Python 3.6.5, Microsoft Open R 3.4.4, and JupyterLab 0.32.1.

Load Julia libraries.

In [1]:
using Pkg
using DataFrames       # Julia DataFrames
using PyCall           # call Python from Julia
using PyPlot           # Julia interface to Python's matplotlib
using RCall            # call R from Julia
using RDatasets        # example datasets from R packages
using CSV              # delimited-file reading
using DataFramesMeta   # convenience macros for DataFrames
using Base             # always loaded; listed here only for completeness
using Dates            # date types and parsing
using Query            # LINQ-style queries over Julia data sources
using Pandas           # Julia wrapper for Python Pandas

println("\n")
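
None of this runs without the packages installed. In a fresh Julia environment, a one-time Pkg.add along these lines should suffice -- the list simply mirrors the using statements above (Base and the Dates standard library ship with Julia and need no installation):

using Pkg

# one-time installation of the registered packages loaded above
Pkg.add(["DataFrames", "PyCall", "PyPlot", "RCall", "RDatasets",
         "CSV", "DataFramesMeta", "Query", "Pandas"])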

Assign relevant directory and file names, then change to the working directory.

In [2]:
wdir = "c:/data/russell/2017"
fname = "russellmelt.csv"

cd(wdir)
println(pwd(),"\n")
c:\data\russell\2017

Define helper functions for Julia DataFrame frequencies and metadata.

In [3]:
function frequencies(df,vars)

    # count rows by the grouping variables, then add a percent-of-total column
    freqs = by(df, vars, nrow);
    freqs = DataFrames.rename(freqs, :x1 => :count);
    freqs[:percent] = 100*freqs[:count]/sum(freqs[:count]);
    # order by descending count
    sort!(freqs, [DataFrames.order(:count, rev = true)]);
    
    return(freqs)
    
end

println("\n")

In [4]:
function metaj(df)
    
    # display the DataFrame's type, dimensions, per-column summary, head, and tail
    println(DataFrames.typeof(df),"\n")
    println(DataFrames.size(df),"\n")
    println(DataFrames.describe(df,stats=[:eltype, :nmissing, :first, :last]),"\n")
    println(DataFrames.head(df),"\n")
    println(DataFrames.tail(df),"\n")
    
end

println("\n")

Load R libraries.

In [5]:
reval("options(warn=-1)")

@rlibrary ggplot2
@rlibrary ggthemes
@rlibrary RColorBrewer

println("\n")
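
These R packages must already be installed in the R library that RCall points to. If they are not, a one-time install can also be pushed through reval -- a sketch, with the CRAN mirror URL only an example:

reval("install.packages(c('ggplot2','ggthemes','RColorBrewer'), repos='https://cloud.r-project.org')")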

Read the Russell stock index file into a Julia DataFrame.

In [6]:
russellj = CSV.read(fname,allowmissing=:all)    # allow missing values in every column

metaj(russellj)

println("\n")
DataFrames.DataFrame

(728852, 4)

4×5 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ variable │ eltype                    │ nmissing │ first      │
├─────┼──────────┼───────────────────────────┼──────────┼────────────┤
│ 1   │ name     │ CategoricalString{UInt32} │ 0        │ Top200V    │
│ 2   │ pdate    │ Date                      │ 0        │ 2005-01-03 │
│ 3   │ type     │ CategoricalString{UInt32} │ 0        │ idxwodiv   │
│ 4   │ value    │ Float64                   │ 84       │ 690.669    │

6×4 DataFrames.DataFrame
│ Row │ name    │ pdate      │ type     │ value   │
├─────┼─────────┼────────────┼──────────┼─────────┤
│ 1   │ Top200V │ 2005-01-03 │ idxwodiv │ 690.669 │
│ 2   │ Top200V │ 2005-01-04 │ idxwodiv │ 683.625 │
│ 3   │ Top200V │ 2005-01-05 │ idxwodiv │ 682.217 │
│ 4   │ Top200V │ 2005-01-06 │ idxwodiv │ 685.307 │
│ 5   │ Top200V │ 2005-01-07 │ idxwodiv │ 682.723 │
│ 6   │ Top200V │ 2005-01-10 │ idxwodiv │ 684.465 │

6×4 DataFrames.DataFrame
│ Row │ name │ pdate      │ type       │ value       │
├─────┼──────┼────────────┼────────────┼─────────────┤
│ 1   │ 1000 │ 2018-09-14 │ pctchwodiv │ 0.000588594 │
│ 2   │ 1000 │ 2018-09-17 │ pctchwodiv │ -0.00626227 │
│ 3   │ 1000 │ 2018-09-18 │ pctchwodiv │ 0.00531552  │
│ 4   │ 1000 │ 2018-09-19 │ pctchwodiv │ 0.000643434 │
│ 5   │ 1000 │ 2018-09-20 │ pctchwodiv │ 0.00769342  │
│ 6   │ 1000 │ 2018-09-21 │ pctchwodiv │ -0.00058588 │



Try out the frequencies function.

In [7]:
print(frequencies(russellj,[:name,:type]))

println("\n")
168×4 DataFrames.DataFrame
│ Row │ name           │ type       │ count │ percent  │
├─────┼────────────────┼────────────┼───────┼──────────┤
│ 1   │ GlobalLargeCap │ idxwodiv   │ 5798  │ 0.795498 │
│ 2   │ GlobalLargeCap │ idxwdiv    │ 5798  │ 0.795498 │
│ 3   │ GlobalLargeCap │ pctchwdiv  │ 5798  │ 0.795498 │
│ 4   │ GlobalLargeCap │ pctchwodiv │ 5798  │ 0.795498 │
│ 5   │ Global         │ idxwodiv   │ 5798  │ 0.795498 │
│ 6   │ Global         │ idxwdiv    │ 5798  │ 0.795498 │
│ 7   │ Global         │ pctchwdiv  │ 5798  │ 0.795498 │
│ 8   │ Global         │ pctchwodiv │ 5798  │ 0.795498 │
⋮
│ 160 │ 1000           │ pctchwodiv │ 3455  │ 0.474033 │
│ 161 │ 3000E          │ idxwodiv   │ 3334  │ 0.457432 │
│ 162 │ 3000E          │ idxwdiv    │ 3334  │ 0.457432 │
│ 163 │ 3000E          │ pctchwdiv  │ 3334  │ 0.457432 │
│ 164 │ 3000E          │ pctchwodiv │ 3334  │ 0.457432 │
│ 165 │ Microcap       │ idxwodiv   │ 1650  │ 0.226383 │
│ 166 │ Microcap       │ idxwdiv    │ 1650  │ 0.226383 │
│ 167 │ Microcap       │ pctchwdiv  │ 1650  │ 0.226383 │
│ 168 │ Microcap       │ pctchwodiv │ 1650  │ 0.226383 │

Define a function to compute summary groupings of the Julia Russell DataFrame by portfolio name. The code is a bit clunky right now.

In [8]:
function mkgroup(df,selvars)
    
    seltype = "idxwdiv"

    # Query.jl: keep the selected portfolios and the with-dividend index series,
    # ordered by name (descending) and date; grdollar and pctch are seeded with
    # the index value and recomputed below
    rf = @from i in df begin
        @where (i.name in selvars) && i.type == seltype
        @orderby descending(i.name), i.pdate
        @select {i.name, date=i.pdate, idxwdiv=i.value, grdollar=i.value, pctch=i.value}     
        @collect DataFrames.DataFrame
    end  
        
    rfg = DataFrames.groupby(rf,[:name]);

    # within each portfolio, compute day-over-day percent change (sentinel -999.99
    # for the first day) and growth of $1 relative to the first day
    for i in range(1,length=length(rfg))
        rfg[i][:pctch] = [-999.99;[rfg[i][j,:idxwdiv]/rfg[i][j-1,:idxwdiv]-1 for j in range(2,stop=nrow(rfg[i]))]]
        rfg[i][:grdollar] = [rfg[i][j,:idxwdiv]/rfg[i][1,:idxwdiv] for j in range(1,length=nrow(rfg[i]))]
    end

    # flatten the groups and convert the first-day sentinel to missing
    rfg = DataFrames.DataFrame(rfg)
    allowmissing!(rfg, :pctch)
    rfg[rfg.pctch.==-999.99,:pctch] = missing
        
    return(rfg)

end
    
println("\n")
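
Stripped of the Query and grouping machinery, the per-portfolio arithmetic inside mkgroup reduces to two vector operations. A minimal sketch, using the first few Russell 3000 levels that appear in the next cell's output:

idx = [2742.75, 2707.4, 2692.4, 2702.31]            # with-dividend index levels for one portfolio
grdollar = idx ./ idx[1]                            # growth of $1 relative to the first day
pctch = [missing; idx[2:end] ./ idx[1:end-1] .- 1]  # day-over-day percent change; first day missing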

Invoke the function to calculate daily percent change and growth of $1 variables for the Russell 3000, Russell 3000 Growth, and Russell 3000 Value portfolios, placing results in a Julia DataFrame.

In [9]:
selvars = ["3000", "3000G", "3000V"]

russjg = mkgroup(russellj,selvars);

metaj(russjg)

println("\n")
DataFrames.DataFrame

(10365, 5)

5×5 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ variable │ eltype                    │ nmissing │ first      │
├─────┼──────────┼───────────────────────────┼──────────┼────────────┤
│ 1   │ name     │ CategoricalString{UInt32} │ 0        │ 3000       │
│ 2   │ date     │ Date                      │ 0        │ 2005-01-03 │
│ 3   │ idxwdiv  │ Float64                   │ 0        │ 2742.75    │
│ 4   │ grdollar │ Float64                   │ 0        │ 1.0        │
│ 5   │ pctch    │ Float64                   │ 3        │ missing    │

6×5 DataFrames.DataFrame
│ Row │ name │ date       │ idxwdiv │ grdollar │ pctch       │
├─────┼──────┼────────────┼─────────┼──────────┼─────────────┤
│ 1   │ 3000 │ 2005-01-03 │ 2742.75 │ 1.0      │ missing     │
│ 2   │ 3000 │ 2005-01-04 │ 2707.4  │ 0.987114 │ -0.0128865  │
│ 3   │ 3000 │ 2005-01-05 │ 2692.4  │ 0.981643 │ -0.00554144 │
│ 4   │ 3000 │ 2005-01-06 │ 2702.31 │ 0.985256 │ 0.00367974  │
│ 5   │ 3000 │ 2005-01-07 │ 2696.66 │ 0.983199 │ -0.00208786 │
│ 6   │ 3000 │ 2005-01-10 │ 2708.11 │ 0.987371 │ 0.00424329  │

6×5 DataFrames.DataFrame
│ Row │ name  │ date       │ idxwdiv │ grdollar │ pctch       │
├─────┼───────┼────────────┼─────────┼──────────┼─────────────┤
│ 1   │ 3000V │ 2018-09-14 │ 8472.63 │ 2.73575  │ 0.00191472  │
│ 2   │ 3000V │ 2018-09-17 │ 8460.78 │ 2.73193  │ -0.00139842 │
│ 3   │ 3000V │ 2018-09-18 │ 8492.75 │ 2.74225  │ 0.00377867  │
│ 4   │ 3000V │ 2018-09-19 │ 8512.87 │ 2.74875  │ 0.00236882  │
│ 5   │ 3000V │ 2018-09-20 │ 8577.18 │ 2.76951  │ 0.00755464  │
│ 6   │ 3000V │ 2018-09-21 │ 8587.24 │ 2.77276  │ 0.00117248  │



Run an R ggplot histogram of the pctch attribute from the grouped Julia DataFrame. Note the "concessions" to Julia syntax.

In [10]:
gp = ggplot(russjg, aes(x=:pctch,col=:name)) + 
geom_histogram(binwidth=.005) + 
facet_wrap(R"~name") +
theme(var"axis.text.x" = element_text(angle = 45, hjust = 1)) +
theme(var"legend.position" = "none", var"plot.background" = element_rect(fill = "#DEEBF7"), 
var"panel.background" = element_rect(fill = "#DEEBF7")) +
labs(title="Russell Indexes",subtitle="2005 to Present\n", y="Frequency", x="\nDay-to-Day Change")   

 
println(gp)
RObject{VecSxp}
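
An alternative to these keyword-argument concessions is to hand the whole expression to R: RCall's R"..." string macro evaluates raw R code, with $ interpolating Julia objects such as the russjg DataFrame. A minimal sketch, not part of the notebook:

R"""
library(ggplot2)
ggplot($russjg, aes(x=pctch, col=name)) +
  geom_histogram(binwidth=.005) +
  facet_wrap(~name)
"""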

Same for grdollar line charts.

In [11]:
gp = ggplot(russjg, aes(x=:date,y=:grdollar,col=:name)) + 
geom_line() +
facet_wrap(R"~name") +
theme(var"axis.text.x" = element_text(angle = 45, hjust = 1)) +
theme(var"legend.position" = "none", var"plot.background" = element_rect(fill = "#DEEBF7"), 
var"panel.background" = element_rect(fill = "#DEEBF7")) +
scale_color_manual(values=["#9ECAE1","#2171B5","#08306B"]) +
labs(title="Russell Indexes",subtitle="2005 to Present\n", y="Growth of \$1", x="\nDate")   

println(gp)
RObject{VecSxp}

Now use Julia's Pandas package to read the same initial data into a Python Pandas DataFrame. Subset the data in Pandas before moving it to a Julia DataFrame -- filtering before the transfer is a critical performance consideration. (A pure-Julia equivalent of the subset appears after the output below.)

In [12]:
russellp = Pandas.read_csv(fname);      # read with Python Pandas via Pandas.jl
println(typeof(russellp),"\n")
println(size(russellp),"\n\n\n")

# subset to the three Russell 3000 portfolios in Pandas, then convert to a Julia DataFrame
russpf = DataFrames.DataFrame(
    reset_index(loc(russellp)[(russellp.name == "3000") | 
    (russellp.name == "3000V") | (russellp.name == "3000G"),:]));

metaj(russpf)
Pandas.DataFrame

(728852, 4)



DataFrames.DataFrame

(41460, 5)

5×5 DataFrames.DataFrame
│ Row │ variable │ eltype  │ nmissing │ first      │ last        │
├─────┼──────────┼─────────┼──────────┼────────────┼─────────────┤
│ 1   │ index    │ Int64   │          │ 363268     │ 440471      │
│ 2   │ name     │ String  │          │ 3000V      │ 3000        │
│ 3   │ pdate    │ String  │          │ 2005-01-03 │ 2018-09-21  │
│ 4   │ type     │ String  │          │ idxwodiv   │ pctchwodiv  │
│ 5   │ value    │ Float64 │          │ 2476.17    │ -0.00088594 │

6×5 DataFrames.DataFrame
│ Row │ index  │ name  │ pdate      │ type     │ value   │
├─────┼────────┼───────┼────────────┼──────────┼─────────┤
│ 1   │ 363268 │ 3000V │ 2005-01-03 │ idxwodiv │ 2476.17 │
│ 2   │ 363269 │ 3000V │ 2005-01-04 │ idxwodiv │ 2447.58 │
│ 3   │ 363270 │ 3000V │ 2005-01-05 │ idxwodiv │ 2432.51 │
│ 4   │ 363271 │ 3000V │ 2005-01-06 │ idxwodiv │ 2443.69 │
│ 5   │ 363272 │ 3000V │ 2005-01-07 │ idxwodiv │ 2434.34 │
│ 6   │ 363273 │ 3000V │ 2005-01-10 │ idxwodiv │ 2442.0  │

6×5 DataFrames.DataFrame
│ Row │ index  │ name │ pdate      │ type       │ value       │
├─────┼────────┼──────┼────────────┼────────────┼─────────────┤
│ 1   │ 440466 │ 3000 │ 2018-09-14 │ pctchwodiv │ 0.0008706   │
│ 2   │ 440467 │ 3000 │ 2018-09-17 │ pctchwodiv │ -0.00658808 │
│ 3   │ 440468 │ 3000 │ 2018-09-18 │ pctchwodiv │ 0.00524307  │
│ 4   │ 440469 │ 3000 │ 2018-09-19 │ pctchwodiv │ 0.000239965 │
│ 5   │ 440470 │ 3000 │ 2018-09-20 │ pctchwodiv │ 0.00787632  │
│ 6   │ 440471 │ 3000 │ 2018-09-21 │ pctchwodiv │ -0.00088594 │
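
As an aside, the same three-portfolio subset could be pulled straight from the Julia DataFrame read in In [6] above, using the same Query.jl filter that mkgroup relies on -- a sketch for comparison rather than part of the notebook flow:

sel = ["3000", "3000G", "3000V"]

russjf = @from i in russellj begin
    @where i.name in sel
    @select i
    @collect DataFrames.DataFrame
end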

Invoke the grouping function as above, then convert the date strings returned from Pandas into Julia Dates.

In [13]:
selvars = ["3000", "3000G", "3000V"]

russpfg = mkgroup(russpf,selvars);
russpfg[:date] = [Dates.Date(d) for d in russpfg[:date]]

metaj(russpfg)

println("\n")
DataFrames.DataFrame

(10365, 5)

5×5 DataFrames.DataFrame
│ Row │ variable │ eltype  │ nmissing │ first      │ last         │
├─────┼──────────┼─────────┼──────────┼────────────┼──────────────┤
│ 1   │ name     │ String  │          │ 3000V      │ 3000         │
│ 2   │ date     │ Date    │          │ 2005-01-03 │ 2018-09-21   │
│ 3   │ idxwdiv  │ Float64 │          │ 3097.0     │ 9072.02      │
│ 4   │ grdollar │ Float64 │          │ 1.0        │ 3.30764      │
│ 5   │ pctch    │ Float64 │ 3        │ missing    │ -0.000867343 │

6×5 DataFrames.DataFrame
│ Row │ name  │ date       │ idxwdiv │ grdollar │ pctch       │
├─────┼───────┼────────────┼─────────┼──────────┼─────────────┤
│ 1   │ 3000V │ 2005-01-03 │ 3097.0  │ 1.0      │ missing     │
│ 2   │ 3000V │ 2005-01-04 │ 3061.79 │ 0.988633 │ -0.0113673  │
│ 3   │ 3000V │ 2005-01-05 │ 3043.2  │ 0.982629 │ -0.00607275 │
│ 4   │ 3000V │ 2005-01-06 │ 3058.34 │ 0.987519 │ 0.00497645  │
│ 5   │ 3000V │ 2005-01-07 │ 3046.67 │ 0.983749 │ -0.00381786 │
│ 6   │ 3000V │ 2005-01-10 │ 3056.33 │ 0.986868 │ 0.00317097  │

6×5 DataFrames.DataFrame
│ Row │ name │ date       │ idxwdiv │ grdollar │ pctch        │
├─────┼──────┼────────────┼─────────┼──────────┼──────────────┤
│ 1   │ 3000 │ 2018-09-14 │ 9018.27 │ 3.28804  │ 0.00100482   │
│ 2   │ 3000 │ 2018-09-17 │ 8959.01 │ 3.26644  │ -0.00657073  │
│ 3   │ 3000 │ 2018-09-18 │ 9006.34 │ 3.28369  │ 0.00528214   │
│ 4   │ 3000 │ 2018-09-19 │ 9008.58 │ 3.28451  │ 0.000248677  │
│ 5   │ 3000 │ 2018-09-20 │ 9079.89 │ 3.31051  │ 0.0079164    │
│ 6   │ 3000 │ 2018-09-21 │ 9072.02 │ 3.30764  │ -0.000867343 │
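
The string-to-Date conversion above works because Dates.Date parses ISO yyyy-mm-dd strings directly; any other layout would need an explicit DateFormat. A quick sketch (the second format is purely hypothetical):

Dates.Date("2018-09-21")                           # ISO strings parse without a format
Dates.Date("09/21/2018", dateformat"mm/dd/yyyy")   # other layouts need a DateFormat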



Finally, invoke R ggplot2 against the grouped DataFrame as before.

In [14]:
gp = ggplot(russpfg, aes(x=:pctch,col=:name)) + 
geom_histogram(binwidth=.005) + 
facet_wrap(R"~name") +
theme(var"axis.text.x" = element_text(angle = 45, hjust = 1)) +
theme(var"legend.position" = "none", var"plot.background" = element_rect(fill = "#DEEBF7"), 
var"panel.background" = element_rect(fill = "#DEEBF7")) +
labs(title="Russell Indexes",subtitle="2005 to Present\n", y="Frequency", x="\nDay-to-Day Change")   

println(gp,"\n")
RObject{VecSxp}


In [15]:
gp = ggplot(russpfg, aes(x=:date,y=:grdollar,col=:name)) + 
geom_line() +
facet_wrap(R"~name") +
theme(var"axis.text.x" = element_text(angle = 45, hjust = 1)) +
theme(var"legend.position" = "none", var"plot.background" = element_rect(fill = "#DEEBF7"), 
var"panel.background" = element_rect(fill = "#DEEBF7")) +
scale_color_manual(values=["#9ECAE1","#2171B5","#08306B"]) +
labs(title="Russell Indexes",subtitle="2005 to Present\n", y="Growth of \$1", x="\nDate")   

println(gp,"\n")
RObject{VecSxp}


Voila, demonstrations of polyglot Julia with both R and Python. Look for this capability to become commonplace in analytics programming.