Language Variation Suite – v.2 release

Language Variation Suite has been released with more customizable features for language variation analysis.

New features include:

  1. Plot Customization – titles, labels, colors

  2. Redesigned user-friendly interface

  3. Random Slope 

  4. Tuning parameters for cluster analysis

  5. Customized URL link:
  6. R code snippets


Do not hesitate to contact us if you have any issues or would like to request new features.

LVS team


NWAV45 Workshop – Language Variation Suite

Our workshop “Optimizing Language Variation Analysis: Language Variation Suite” will be held on 11/03/16 at Simon Fraser University, Vancouver, Canada!

“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson 2008:1)

“The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al. 2005)

Workshop files:

  1. Categorical data csv – Use of R in New York (Labov 1966)
  2. Continuous data csv – Intervocalic /d/ (Díaz-Campos et al. 2016)
  3. Language Variation Suite
  4. Handouts
  5. Slides

Please come and learn how to perform advanced statistical methods with a user-friendly interactive toolkit for (socio)linguistic analysis. Do not hesitate to contact us if you have any questions or suggestions (obscrivn AT indiana DOT edu).

Olga, Manuel and Rafael

Stepwise Regressions in Language Variation Suite – LVS

LVS provides three types of model comparison (LRT, AIC, and BIC) using the package MASS. The stepwise regression uses both directions (step up and step down) and selects the best model (best predictors).
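
The both-directions search can be pictured with a small sketch. It is written in Python rather than the R that LVS is built on, and it is not the MASS implementation: the names stepwise_both and toy_score are hypothetical, and the criterion values are invented so that the example runs on its own. At each step the search tries adding every unused predictor and dropping every used one, and it stops when no single step lowers the criterion.

```python
from typing import Callable, FrozenSet, Iterable


def stepwise_both(candidates: Iterable[str],
                  score: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Greedy search in both directions; a lower score is better (as with AIC or BIC)."""
    pool = frozenset(candidates)
    current: FrozenSet[str] = frozenset()
    best = score(current)
    while True:
        # One-step neighbours: add any unused predictor (step up)
        # or drop any predictor already in the model (step down).
        neighbours = [current | {v} for v in sorted(pool - current)]
        neighbours += [current - {v} for v in sorted(current)]
        if not neighbours:
            return current
        challenger = min(neighbours, key=score)
        if score(challenger) >= best:
            return current  # no single step improves the criterion
        current, best = challenger, score(challenger)


def toy_score(model: FrozenSet[str]) -> float:
    """Invented criterion: 'age' and 'period' improve fit, every term costs 2."""
    fit_gain = 10 * ("age" in model) + 5 * ("period" in model)
    return 100 - fit_gain + 2 * len(model)


selected = stepwise_both(["age", "period", "gender"], toy_score)
print(sorted(selected))
```

In a real analysis the score function would refit the regression for each candidate set of predictors and return its AIC, BIC, or likelihood-based criterion.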

All three criteria assess model fit. LRT is based on the log likelihood ratio (k = qchisq(1-p, df=1), where for p=0.05, k = 3.84). For more information on AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) – see
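
The value k = 3.84 can be checked with a few lines of code. The sketch below uses Python instead of R and relies on the fact that a chi-square variable with one degree of freedom is a squared standard normal. The AIC and BIC multipliers shown (k = 2 and k = log(n)) are the standard ones accepted by the k argument of R's step functions; whether LVS passes exactly these values is an assumption here, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def lrt_penalty(p: float = 0.05) -> float:
    """Equivalent of R's qchisq(1 - p, df = 1), using only the standard library."""
    # With df = 1, the (1 - p) chi-square quantile is the square of the
    # (1 - p/2) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - p / 2)
    return z * z


def penalty(criterion: str, n_obs: int = 0, p: float = 0.05) -> float:
    """Penalty per extra parameter for each comparison criterion (assumed mapping)."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n_obs)  # n_obs = number of observations in the data
    if criterion == "LRT":
        return lrt_penalty(p)
    raise ValueError(f"unknown criterion: {criterion}")


print(round(lrt_penalty(0.05), 2))  # 3.84
```

With these multipliers, BIC penalizes an extra predictor more heavily than AIC as soon as the dataset has more than about seven observations, so it tends to select smaller models.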

Steps to perform stepwise regression in LVS:

  1. Upload a CSV or Excel file – Panel DATA
  2. Go to Panel INFERENTIAL STATISTICS – tab MODELING
  3. Select your regression model (dependent and independent factors), type of regression (see tab REGRESSION) and click RUN regression.
  4. Go to STEPWISE REGRESSION tab and click RUN stepwise model.
  5. Return to Modeling and Regression and update your selection with the best fitted model.

As always, your feedback and suggestions are greatly appreciated! (LVS Team)

Varbrul Weights in LVS

Language Variation Suite has added a Varbrul analysis. Varbrul is “an implementation of logistic regression that is used by many sociolinguists” (Keith Johnson, 2008, 174). At present, LVS calculates Varbrul weights for a binary dependent variable and categorical independent variables. The Varbrul output format is based on chapter 5.7 of K. Johnson (2008): the inverse logit (inv.logit) comes from the package gtools, and the contrasts option in the logistic regression is set to contr.sum. For a binary variable, the calculation is inv.logit(coefficient*1) and inv.logit(coefficient*-1), which outputs the weights for its two values (e.g. men and women).
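
The weight calculation for a two-level factor can be sketched as follows. This is a Python illustration, not the LVS source: inv_logit mirrors what gtools' inv.logit computes, the function name varbrul_weights is hypothetical, and the coefficient value is made up. Under sum contrasts the two levels are coded +1 and -1, so one fitted coefficient yields both weights.

```python
import math


def inv_logit(x: float) -> float:
    """Inverse logit (logistic function): maps log-odds to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def varbrul_weights(coefficient: float) -> tuple:
    """Factor weights for the two levels of a sum-coded binary predictor."""
    first_level = inv_logit(coefficient * 1)    # level coded +1
    second_level = inv_logit(coefficient * -1)  # level coded -1
    return first_level, second_level


# Hypothetical coefficient, for illustration only.
w_first, w_second = varbrul_weights(0.5)
print(round(w_first, 3), round(w_second, 3))
```

The two weights always sum to 1, and a coefficient of 0 gives 0.5 for both levels, which is the usual reading of a Varbrul weight that neither favors nor disfavors the variant.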


The Shiny application allows for interactive quantitative analysis and is based on the R programming language. Since this toolkit is still under development, we would greatly appreciate your evaluation, comments, and feedback!

Language Variation Suite – New Version Release

LVS (an interactive toolkit for sociolinguists) has been enhanced with new features: 1) Excel and CSV formats, 2) recoding and factor modification, and 3) token frequency extraction from text files.

The new menu Adjust Data (Data Panel) allows for interactive data modification, such as excluding and recoding certain factors or adding a logarithmic transformation to continuous data.
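
For readers who want to see what these adjustments amount to, here is a rough Python sketch of the three operations on a small table. It does not reproduce the LVS code: the function adjust_data, the column names, and the sample rows are all hypothetical, and the natural logarithm is an assumption about which log transformation is meant.

```python
import math


def adjust_data(rows, factor, exclude, recode, log_column):
    """Drop excluded factor values, recode the rest, and add a log-transformed column."""
    adjusted = []
    for row in rows:
        if row[factor] in exclude:
            continue  # exclude this factor value from the analysis
        new_row = dict(row)
        # Recode: values missing from the mapping are kept unchanged.
        new_row[factor] = recode.get(row[factor], row[factor])
        # Natural log of the continuous column (assumed transformation).
        new_row["log_" + log_column] = math.log(float(row[log_column]))
        adjusted.append(new_row)
    return adjusted


# Made-up rows, for illustration only.
rows = [
    {"genre": "letters", "duration": "120.0"},
    {"genre": "treatise", "duration": "80.0"},
    {"genre": "unknown", "duration": "95.0"},
]
result = adjust_data(rows, factor="genre", exclude={"unknown"},
                     recode={"letters": "prose", "treatise": "prose"},
                     log_column="duration")
print(result)
```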


The new menu Frequency (Data Panel) allows for frequency extraction from text files. This feature makes it possible to add a frequency column to your dataset. The dataset should have a column named token containing words (a single word per cell).
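
The idea behind the frequency column can be sketched in a few lines of Python. How LVS tokenizes the text and normalizes case is not documented here, so the lowercasing and the letters-only word pattern below are assumptions, and the function names and sample text are hypothetical.

```python
import re
from collections import Counter


def token_frequencies(text: str) -> Counter:
    """Count word forms in a text (lowercased; runs of letters only, an assumption)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def add_frequency_column(rows, counts, token_column="token"):
    """Return copies of the rows with a 'frequency' value looked up for each token."""
    return [{**row, "frequency": counts.get(row[token_column].lower(), 0)}
            for row in rows]


# Made-up text and dataset, for illustration only.
counts = token_frequencies("Nada de nada. Todo pasa.")
dataset = [{"token": "nada"}, {"token": "Todo"}, {"token": "lado"}]
with_frequency = add_frequency_column(dataset, counts)
print(with_frequency)
```

Tokens that never occur in the text receive a frequency of 0 rather than being dropped, so the dataset keeps its original number of rows.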


Feel free to explore LVS – it also works on smartphones and iPads. Please send us your feedback and suggestions.

LVS Team

Word and Sentence Analysis

We just added a new feature to the Interactive Text Mining Tool: Word and Sentence Length.

In the Data Visualization panel, select the Word Frequency tab and click on Length. You can select a specific document and explore its content.
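
As a rough idea of what the two length measures capture, the following Python sketch counts characters per word and words per sentence. The tool's own definitions are not documented here: splitting sentences on ., ! and ? and treating runs of letters as words are simplifying assumptions, and the function names are hypothetical.

```python
import re


def word_lengths(text: str) -> list:
    """Length in characters of every word (a word is taken to be a run of letters)."""
    return [len(word) for word in re.findall(r"[^\W\d_]+", text)]


def sentence_lengths(text: str) -> list:
    """Length in words of every sentence (sentences end at '.', '!' or '?')."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


sample = "I came. I saw it!"
print(word_lengths(sample))      # [1, 4, 1, 3, 2]
print(sentence_lengths(sample))  # [2, 3]
```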


If you are interested in Punctuation Visualization (for more information, read Adam Calhoun’s Blog), select the Punctuation Analysis tab and click Punctuation.

Language Variation Suite and Interactive Text Mining Tool – QR Code

Our Language Variation Suite and Interactive Text-Mining Tool are now accessible via smartphone and iPad.

Use a QR scanner app to scan the following QR codes and open them in the browser on your smartphone or iPad. Make sure you have your files (csv or text files) in your Dropbox. Navigate to Descriptive Statistics in LVS or Data Preparation in ITMS, select “choose files” and upload them.


We always welcome any suggestions, feedback and bug reports!

Workshop: LVS and ITMS Introduction

    Visual Analytics – “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al. 2005)

Materials for workshop:

  1. Categorical Data (csv): Labov’s New York study 1966 (for more information, visit
  2. Continuous Data (csv): Corpus of Caracas (Bentivoglio & Sedano 1993)
  3. Dante Translation (txt): dante 1, dante 2, dante 3
  4. Slides for workshop (pdf): presentation

“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson, 2008)

Links to LVS and ITMS



Data Exploration – Conditional Trees

A classification tree, or conditional tree, is a simple non-parametric regression analysis commonly used in social and psychological studies. While linear regression combines all the information linearly, a conditional tree shows it as recursive splitting (tree branches), which makes the result easier to interpret visually (Strobl et al. 2009). If an independent variable (predictor) is significantly associated with the dependent variable (response), a branch is created that splits the values of that independent variable. (For more information, see R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R; S. Tagliamonte & R.H. Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice; A. Lohmann. 2013. Classification trees and Random forests in linguistic study.)
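
The split-if-significant logic can be illustrated with a deliberately simplified Python sketch. It is not the algorithm LVS runs: conditional trees in R rely on permutation-based tests and handle continuous and multi-level variables, whereas this toy version assumes a binary response, binary predictors, and a plain Pearson chi-square test. All names and data below are hypothetical.

```python
import math
from collections import Counter


def chi2_p_value_2x2(rows, predictor, response):
    """Pearson chi-square p-value for a 2x2 table (df = 1, no continuity correction)."""
    table = Counter((r[predictor], r[response]) for r in rows)
    xs = sorted({r[predictor] for r in rows})
    ys = sorted({r[response] for r in rows})
    if len(xs) != 2 or len(ys) != 2:
        return 1.0  # no variation left, so there is nothing to test
    n = len(rows)
    stat = 0.0
    for x in xs:
        for y in ys:
            row_total = sum(table[(x, v)] for v in ys)
            col_total = sum(table[(u, y)] for u in xs)
            expected = row_total * col_total / n
            stat += (table[(x, y)] - expected) ** 2 / expected
    # For df = 1 the chi-square upper tail equals erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def grow_tree(rows, predictors, response, alpha=0.05):
    """Split on the most strongly associated predictor while its p-value is below alpha."""
    p_values = {p: chi2_p_value_2x2(rows, p, response) for p in predictors}
    best = min(p_values, key=p_values.get) if p_values else None
    if best is None or p_values[best] >= alpha:
        return dict(Counter(r[response] for r in rows))  # leaf: response counts
    rest = [p for p in predictors if p != best]
    return {best: {level: grow_tree([r for r in rows if r[best] == level],
                                    rest, response, alpha)
                   for level in sorted({r[best] for r in rows})}}


def make_rows(age, response, count):
    """Made-up tokens; gender alternates so that it carries no signal."""
    return [{"age": age, "gender": "f" if i % 2 == 0 else "m", "response": response}
            for i in range(count)]


data = (make_rows("young", "deleted", 16) + make_rows("young", "retained", 4)
        + make_rows("old", "deleted", 4) + make_rows("old", "retained", 16))
tree = grow_tree(data, ["age", "gender"], "response")
print(tree)
```

In this invented dataset only age is associated with the response, so the tree splits once on age and then stops, because gender does not reach significance in either branch.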

Language Variation Suite allows you to build a conditional tree from your data. After the csv data is uploaded, select Logistic Regression Analysis from the top panel “Inferential Statistics”.

Next, you will need to select your variables from the Modeling tab. First, make a selection for your dependent variable (response). Second, select your independent factors (predictors).


Finally, go to the Conditional Trees tab – it may take a couple of seconds to generate your plot, depending on how many factors you have selected. The following screenshot is the conditional tree from my sociolinguistic data.


According to this plot, the distribution of intervocalic /d/ in Venezuelan Spanish is split by age into two groups: 20-54 and 55+. In the first age group, the production of /d/ is further conditioned by the chronological period: more aspirated /s/ occurs in the 1987 corpus, especially in the low economic class (a box plot is used here because the data are continuous), although the box plot also shows many outliers. The second age group (55+) is also conditioned by period: in the 1987 corpus, the low economic class shows the highest use of aspirated /s/. In this study, we calculate the intensity of /d/ production, i.e. higher or lower intensity of aspiration.

This analysis is used for exploratory purposes. For more robust predictions, random forest and mixed regression models should be used (more on them in the next post).

Data Exploration – Cluster Analysis

Cluster Analysis (Descriptive Statistics) examines how variables or individuals are grouped. The visual representation is often referred to as a dendrogram, as groups are clustered into tree branches.

To perform this analysis, select Cluster from Descriptive Statistics.



The next step is to select your dependent variable (response) from your uploaded CSV file. The dependent variable can be binary (e.g. yes, no), continuous (e.g. 1, 2, 3) or multinomial (e.g. deletion, retention, aspiration). The final step is to choose your independent factor; make sure it has at least three values. For instance, my file has a dependent variable (Object-Verb/Verb-Object) from my data on Old French texts. I would like to examine how different genres are grouped according to their use of word order, so I select “genre” as my independent factor, which has more than three values.


I can see from the graph that there are three clusters in my data: 1) letters and treatise, 2) speech and hagio, and 3) narratives.
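
To show how a dendrogram of this kind comes about, here is a minimal Python sketch of agglomerative clustering. The Object-Verb proportions per genre are invented for illustration and are not my actual data. Complete linkage is used as one common choice; the settings LVS applies internally are not assumed, and the function names are hypothetical.

```python
from itertools import combinations


def linkage_distance(a, b):
    """Complete linkage: the largest distance between any two members."""
    return max(abs(x - y) for x in a for y in b)


def agglomerate(values):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = {(name,): [v] for name, v in sorted(values.items())}
    merges = []
    while len(clusters) > 1:
        left, right = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: linkage_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        height = linkage_distance(clusters[left], clusters[right])
        merged = tuple(sorted(left + right))
        clusters[merged] = clusters.pop(left) + clusters.pop(right)
        merges.append((merged, round(height, 3)))
    return merges


# Invented proportions of Object-Verb order per genre (illustration only).
ov_rate = {"letters": 0.10, "treatise": 0.12, "speech": 0.30,
           "hagio": 0.33, "narratives": 0.60}
for members, height in agglomerate(ov_rate):
    print(height, members)
```

Each merge corresponds to one joint in the dendrogram, and its height is the distance at which the branches meet. Cutting the tree after the first two merges leaves three clusters, which is how a grouping like the one above is read off the plot.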