Shiny Web Application – Workshop

“The impact of data scientists’ work depends on how well others can understand their insights to take further actions” (blog)

In this workshop, I will introduce the concept of declarative reactive web frameworks, which support interactive, user-friendly data visualization and data analytics. The focus is on Shiny, an R package for building interactive web applications for data visualization.

You will learn some Shiny basics: how to build your reactive app and deploy it to the server.
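To give a feel for what "reactive" means before the workshop, here is a minimal sketch of the idea in plain Python. It is not Shiny's API: the class `ReactiveValue` and its methods are hypothetical names invented for this illustration. The point is only that outputs are declared once and then re-run automatically whenever the input they depend on changes.

```python
class ReactiveValue:
    """A value that notifies its observers whenever it changes.

    Hypothetical helper for illustration only; not part of Shiny.
    """

    def __init__(self, value):
        self._value = value
        self._observers = []

    def get(self):
        return self._value

    def set(self, value):
        # Store the new value, then re-run every dependent output.
        self._value = value
        for callback in self._observers:
            callback(value)

    def observe(self, callback):
        # Register an output and run it once with the current value.
        self._observers.append(callback)
        callback(self._value)


# "Input": the number of histogram bins a user might pick with a slider.
bins = ReactiveValue(10)

# "Output": declared once, re-rendered on every change of the input.
rendered = []
bins.observe(lambda n: rendered.append(f"histogram with {n} bins"))

bins.set(25)  # simulates the user moving the slider
print(rendered)
```

In a Shiny app the same roles are played by user inputs and render functions, and the framework keeps track of the dependencies for you.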

Workshop materials:

  1. R installation instructions – slides
  2. CSV file – download
  3. Workshop Slides – slides
  4. Shiny Workshop Files: zip file
  5. Video of the workshop: youtube

Credits: Some ideas are based on the great tutorial by Dean Attali.


Seminar on ITMS and LVS: Quantitative Methods and Text Mining

The main objective of this workshop is to introduce researchers to user-friendly analytical tools. ITMS and LVS are two web-based tools for visualization and quantitative analysis. In contrast to existing software programs (e.g., SAS, SPSS, and Tableau), these two applications are built in R and require no installation or programming skills.

This hands-on workshop will provide an overview of the statistical and text-mining techniques available in these tools. You will learn how to import CSV, text, and PDF files, create plots, and run statistical analyses, including conditional trees and random forests. You will also learn about natural language pre-processing techniques, such as stopword removal and stemming. Finally, you will be able to perform topic modeling and cluster analysis.
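As a rough illustration of the pre-processing steps mentioned above, here is a minimal sketch in plain Python. It is not how ITMS implements them: the stopword list is a tiny made-up sample, and `naive_stem` is a hypothetical, deliberately crude suffix stripper, far simpler than a real stemmer such as Porter or Snowball.

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop very frequent function words that carry little content."""
    return [t for t in tokens if t not in stopwords]


def naive_stem(token):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


print(preprocess("The researchers are analyzing the topics of documents"))
```

Note that a crude stemmer produces non-words such as "analyz"; that is expected, since the goal is only to map related forms to the same string.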

Part 1: Quantitative Methods – Language Variation Suite LVS slides

Part 2: Text Mining Methods – Interactive Text Mining Suite ITMS slides

Workshop exercise materials:

  • sample of categorical data csv file – link
  • sample of continuous data csv file – link

Interactive Text Mining Suite – Version 2 Release

The Interactive Text Mining Suite (ITMS) is a web application for text analysis. It combines the computational and statistical power of R with the interactivity of a Shiny web application.

The new release includes the following features:

  • Import of Zotero RDF files, Google Book API, JSON and XML
  • Dynamic preprocessing steps
  • Stemming in multiple languages
  • Tuning parameters for cluster classification
  • Word cloud comparison
  • Word cloud customization
  • Metadata extraction


Contributors: Jefferson Davis, Irina Trapido and Jay Lee

As always, please do not hesitate to contact us if you have any issues or would like to request new features!

Word and Sentence Analysis

We have just added a new feature to the Interactive Text Mining Tool: Word and Sentence Length.

In the Data Visualization panel, select the Word Frequency tab and click on Length. You can select a specific document and explore its content.

[Screenshots: sentence length, word length]
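If you are curious what such a length analysis involves, here is a minimal sketch in plain Python. It is an illustration under simple assumptions, not the code behind ITMS: words are runs of letters (so a word with an apostrophe counts as two), and sentences end at ".", "!" or "?". The function names are hypothetical.

```python
import re

# A "word" here is a run of letters (accented letters included).
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def word_lengths(text):
    """Return the length in characters of every word in the text."""
    return [len(word) for word in WORD_PATTERN.findall(text)]


def sentence_lengths(text):
    """Return the number of words in every sentence of the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(WORD_PATTERN.findall(s)) for s in sentences]


sample = "Nel mezzo del cammin. Mi ritrovai!"
print(word_lengths(sample))
print(sentence_lengths(sample))
```

The two lists can then be plotted as histograms to compare documents.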

If you are interested in Punctuation Visualization (for more information, read Adam Calhoun’s blog), select the Punctuation Analysis tab and click Punctuation.
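The basic idea behind a punctuation view is to discard everything in a text except its punctuation marks. Here is a minimal sketch in plain Python, assuming ASCII punctuation only (typographic quotes and dashes would need to be added to the set); the function names are hypothetical and this is not the ITMS implementation.

```python
import string
from collections import Counter


def punctuation_only(text):
    """Keep only ASCII punctuation characters, in their original order."""
    return "".join(ch for ch in text if ch in string.punctuation)


def punctuation_counts(text):
    """Count how often each punctuation mark occurs."""
    return Counter(punctuation_only(text))


sample = "Hello, world! (Really?)"
print(punctuation_only(sample))
print(punctuation_counts(sample).most_common())
```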

Language Variation Suite and Interactive Text Mining Tool – QR Code

Our Language Variation Suite and Interactive Text-Mining Tool are now accessible from smartphones and iPads.

Use a QR scanner app to scan the following QR codes and open them in the browser on your smartphone or iPad. Make sure your files (csv or text files) are in your Dropbox. Navigate to Descriptive Statistics in LVS or Data Preparation in ITMS, select Choose Files, and upload them.

[QR code images]

We always welcome any suggestions, feedback and bug reports!

Workshop: LVS and ITMS Introduction

    Visual Analytics – “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al. 2005)

Materials for workshop:

  1. Categorical Data (csv): Labov’s New York study 1966 (for more information, visit
  2. Continuous Data (csv): Corpus of Caracas (Bentivoglio & Sedano 1993)
  3. Dante Translation (txt): dante 1, dante 2, dante 3
  4. Slides for workshop (pdf): presentation

“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson, 2008)

Links to LVS and ITMS



Interactive Topic Modeling – ITMS

Topic modeling refers to a family of algorithms that explain “an observed corpus with a small set of distributions over terms” and serve as “models for uncovering underlying semantic structure of a document collection” (Blei et al. 2003, Blei et al. 2009, Blei 2012). Several algorithms have been put forth to build a probabilistic topic model, e.g., mixture of unigrams (Nigam et al. 2000), Latent Semantic Indexing (Deerwester et al. 1990; Hofmann 1999) and Latent Dirichlet Allocation, LDA (Blei et al. 2003). For more information, see Matthew Jockers and David Blei.

Interactive Text Mining Suite applies several LDA implementations (the topicmodels, lda and stm R packages). In addition, it allows users to interactively choose the number of topics and iterations and to select the best model.
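To show what the number of topics and the number of iterations actually control, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is a teaching illustration only and not the code used by ITMS or by the R packages above; the function name `lda_gibbs` and the default values of `alpha` and `beta` are assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, vocab


def top_terms(topic_word, vocab, n=3):
    """List the n most frequent terms of every topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]
```

Calling `lda_gibbs(docs, num_topics=2, iterations=100)` on a list of token lists and passing the result to `top_terms` gives the most frequent terms per topic. Trying several values of `num_topics` and comparing the results is the kind of exploration the interactive interface is meant to make easy.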


We welcome suggestions and feedback.

Interactive Text Mining Suite ITMS

ITMS integrates R's visualization and statistical capabilities with an interactive Shiny application to examine unstructured data (i.e., text documents). At present, ITMS provides several text-mining analyses for scholarly articles and literary texts (e.g. topic, frequency and cluster analyses).

ITMS is an ongoing project by an interdisciplinary team of researchers from Indiana University (Olga Scrivner and Jefferson Davis). We are also developing an NEH proposal to advance this research.


Your feedback and suggestions, as well as bug reports, would be greatly appreciated (obscrivn AT indiana PERIOD edu).

Data Exploration – Conditional Trees

A classification tree, or conditional tree, is a simple non-parametric regression method commonly used in social and psychological studies. While linear regression combines all the information linearly, a conditional tree shows it as recursive splitting (tree branches), which makes the result easier to interpret visually (Strobl et al. 2009). If an independent variable (predictor) is significantly associated with the dependent variable (response), the tree creates a branch that splits the data by the values of that independent variable. For more information, see R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R; S. Tagliamonte & R.H. Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice; A. Lohmann. 2013. Classification trees and Random forests in linguistic study.
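The splitting logic can be sketched in a few lines of plain Python. This is a simplified illustration and not the method LVS runs: real conditional inference trees rely on permutation-based tests with adjustment for multiple comparisons, whereas this sketch assumes categorical predictors, a binary response, "one level versus the rest" splits, and a plain Pearson chi-square test at the 0.05 level. All function names are hypothetical.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL_DF1 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_split(rows, predictor, response, positive):
    """Find the level whose 'level vs rest' split is most associated
    with the response. Returns (statistic, level)."""
    best_stat, best_level = 0.0, None
    for level in sorted({r[predictor] for r in rows}):
        a = sum(1 for r in rows if r[predictor] == level and r[response] == positive)
        b = sum(1 for r in rows if r[predictor] == level and r[response] != positive)
        c = sum(1 for r in rows if r[predictor] != level and r[response] == positive)
        d = sum(1 for r in rows if r[predictor] != level and r[response] != positive)
        stat = chi_square_2x2(a, b, c, d)
        if stat > best_stat:
            best_stat, best_level = stat, level
    return best_stat, best_level


def grow_tree(rows, predictors, response, positive, max_depth=3):
    """Split recursively while some predictor is significantly associated
    with the response; otherwise return a leaf."""
    share = sum(1 for r in rows if r[response] == positive) / len(rows)
    leaf = {"n": len(rows), "share_positive": share}
    if max_depth == 0:
        return leaf

    best_stat, best_pred, best_level = 0.0, None, None
    for predictor in predictors:
        stat, level = best_split(rows, predictor, response, positive)
        if stat > best_stat:
            best_stat, best_pred, best_level = stat, predictor, level

    if best_stat <= CHI2_CRITICAL_DF1:
        return leaf  # no significant association: stop splitting

    yes = [r for r in rows if r[best_pred] == best_level]
    no = [r for r in rows if r[best_pred] != best_level]
    return {
        "split": f"{best_pred} == {best_level}",
        "statistic": best_stat,
        "yes": grow_tree(yes, predictors, response, positive, max_depth - 1),
        "no": grow_tree(no, predictors, response, positive, max_depth - 1),
    }
```

Given a list of dictionaries (one per observation), `grow_tree(rows, ["age", "period"], "variant", "a")` returns a nested dictionary in which each split records the predictor level used and each leaf records the share of the chosen response value.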

Language Variation Suite allows you to build a conditional tree from your data. After the csv data is uploaded, select Logistic Regression Analysis from the top panel “Inferential Statistics”.

Next, you will need to select your variables from the Modeling tab. First, select your dependent variable (response). Second, select your independent factors (predictors).

[Screenshots: selecting the dependent variable, selecting the independent variables]

Finally, go to the Conditional Trees tab. It may take a few seconds to generate your plot, depending on how many factors you have selected. The following screenshot is the conditional tree from my sociolinguistic data.


According to this plot, the distribution of intervocalic /d/ in Venezuelan Spanish is split by age into two groups: 20-54 and 55+. In the first age group, the production of /d/ is further conditioned by chronological period: more aspirated /s/ occurs in the 1987 corpus, especially in the low economic class (box plots are used here because the data are continuous), although the box plots also show many outliers. The second age group (55+) is also conditioned by period: in the 1987 corpus, speakers of low economic status show the highest use of aspirated /s/. In this study, we measure the intensity of /d/ production, that is, higher or lower intensity of aspiration.

This analysis is used for exploratory purposes. For more robust predictions, random forest and mixed regression models should be used (more on them in the next post).