Tessera Stack for Big Data Computing

The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end using simple R commands.
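To give a flavor of the front end, here is a minimal divide-and-recombine sketch using datadr. The iris data and the per-group mean are illustrative choices, not taken from the talk, and on a real Hadoop back end the data would live in HDFS via RHIPE rather than in local memory.

library(datadr)

# Divide: split the data into subsets, one per Species.
bySpecies <- divide(iris, by = "Species")

# Apply a transformation to each subset: mean sepal length.
meanLength <- addTransform(bySpecies, function(x) mean(x$Sepal.Length))

# Recombine: bind the per-subset results back into one data frame.
recombine(meanLength, combine = combRbind)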

Exchangeable Graph Models, Graph Limits, and Graphons

Recently, nonparametric approaches for modeling relational data based on exchangeable graph models have been attracting increasing interest. Relational data are typically encoded in the form of arrays, and the development of exchangeable graph models relies on a generalization of de Finetti’s theorem to exchangeable arrays due to Aldous and Hoover. The key object underlying such models is referred to as a graphon, a notion introduced by Lovász and Szegedy as the limit of a sequence of graphs. This talk will attempt to survey some recent literature on the theory of exchangeable random graphs and the estimation of graphons, drawing connections to applications in network analysis such as link prediction, community detection, and network comparison. The goal of this talk is to initiate discussions and collaborations on this relatively new topic in the Purdue statistics community.
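For concreteness, a graphon is a symmetric measurable function W from [0,1]^2 to [0,1]; an exchangeable random graph is generated by drawing a latent uniform variable for each node and connecting each pair of nodes independently with probability W(u_i, u_j). The R sketch below simulates from this model; the particular choice W(u, v) = u*v is arbitrary and only for illustration.

# Sample an n-node exchangeable random graph from a graphon W.
sample_graphon <- function(n, W) {
  u <- runif(n)                                # latent uniform node positions
  A <- matrix(0L, n, n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      A[i, j] <- rbinom(1, 1, W(u[i], u[j]))   # edge with probability W(u_i, u_j)
      A[j, i] <- A[i, j]                       # keep the adjacency matrix symmetric
    }
  }
  A
}

# Example: a simple product graphon W(u, v) = u * v
A <- sample_graphon(50, function(u, v) u * v)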

Genomic Control for Association Studies: A Literature Review

A major spin-off of the Human Genome Project (HGP) is the effort to understand diseases and complex traits in humans using genetic information. Perhaps the most common approach is to utilize case-control studies. Unfortunately, case-control studies may rely upon unrealistic assumptions and may produce spurious associations. On a statistical level, the test statistics may be inflated. Genomic Control, introduced in 1999 by Devlin and Roeder, is one time-tested method of corrective action. I will review this paper in detail and discuss other statistical approaches developed in the last 15 years to advance association studies.
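To illustrate the idea behind Genomic Control: the inflation factor lambda is estimated from the observed 1-df association statistics as the median statistic divided by the median of the chi-squared(1) distribution (about 0.455), and each statistic is divided by lambda before computing p-values. A minimal sketch in R, with simulated statistics standing in for real case-control genotype tests:

# Simulated 1-df chi-squared association statistics with mild inflation.
set.seed(1)
chisq_stats <- rchisq(10000, df = 1) * 1.1

# Genomic Control inflation factor: median observed statistic over the
# median of the chi-squared(1) distribution (qchisq(0.5, 1) is about 0.455).
lambda <- median(chisq_stats) / qchisq(0.5, df = 1)

# Deflate the statistics and recompute p-values.
adjusted <- chisq_stats / lambda
p_adjusted <- pchisq(adjusted, df = 1, lower.tail = FALSE)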

Hadoop2 Basics and the RHIPE Package

In this talk, I will first introduce the basics of Hadoop and the differences between Hadoop1 and Hadoop2 (YARN), and then focus on Hadoop2, covering its terminology and the detailed workflow of a MapReduce job. The second part of the talk is about the RHIPE package in R, i.e., the R and Hadoop Integrated Programming Environment. I will start with a word-count example, and then focus on the unique properties of RHIPE and what kinds of jobs can be accomplished with it.
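A heavily simplified sketch of what a RHIPE word-count job might look like is given below. It assumes a working Hadoop cluster with RHIPE installed; the HDFS paths are placeholders, and argument details may differ across RHIPE versions.

library(Rhipe)
rhinit()

# Map: split each input line into words and emit a (word, 1) pair per word.
map <- expression({
  lapply(map.values, function(line) {
    words <- unlist(strsplit(line, " +"))
    for (w in words) rhcollect(w, 1)
  })
})

# Reduce: sum the counts collected for each word.
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

# Run the MapReduce job (placeholder HDFS input/output paths).
res <- rhwatch(map = map, reduce = reduce,
               input = rhfmt("/tmp/wordcount/input", type = "text"),
               output = "/tmp/wordcount/output")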

Super Data Analysis with R Shiny

Shiny by RStudio levels up your data analysis and consulting work by providing a simple package for transforming your analysis into interactive web-based apps. With little or no experience in web design or JavaScript, any R user can develop and deploy beautiful interactive apps on the web, for use in RStudio, or in their R Markdown reports. This talk will highlight the key components of a Shiny app, including the layout of the user interface, how the user interface interacts with the R server, and how reactive variables work. This talk will conclude with two examples of how I have used Shiny:
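To make those components concrete, here is a minimal Shiny app; the histogram example is purely illustrative and is not one of the two examples from the talk. The reactive expression re-runs whenever the slider input changes, and the plot re-renders automatically.

library(shiny)

# User interface: layout with an input control and a plot output.
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n", "Number of observations:", min = 10, max = 1000, value = 100)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: builds the outputs from the inputs via reactive expressions.
server <- function(input, output) {
  samples <- reactive(rnorm(input$n))
  output$hist <- renderPlot(hist(samples(), main = "Random normal sample"))
}

shinyApp(ui, server)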