Data Exploration – Conditional Trees

A classification tree, or conditional tree, is a simple non-parametric regression method commonly used in social and psychological studies. While linear regression combines all the information linearly, a conditional tree presents it as a series of recursive splits (tree branches), which makes the result easier to interpret visually (Strobl et al. 2009). If an independent variable (predictor) is significantly associated with the dependent variable (response), the tree creates a branch that splits the data on the values of that predictor. For more information, see R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R; S. Tagliamonte & R.H. Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice; A. Lohmann. 2013. Classification trees and Random forests in linguistic study.
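
To make the splitting idea concrete, here is a minimal Python sketch of how a single split could be chosen. It is not the code behind Language Variation Suite. It only assumes, in the spirit of conditional inference trees, that each predictor is tested for association with the response (here with a permutation test on a chi-square statistic) and that the predictor with the smallest p-value is used for the split if it stays below 0.05 after a Bonferroni correction. The data, the column names (age, sex, variant) and the function names are invented for illustration.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            stat += (joint.get((x, y), 0) - expected) ** 2 / expected
    return stat


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of shuffled responses that look at least as associated as the real ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # small tolerance guards against floating-point rounding on ties
        if chi_square(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def choose_split(rows, response, predictors, alpha=0.05):
    """Return (best predictor or None, raw p-values) for one split step."""
    ys = [row[response] for row in rows]
    p_values = {
        p: permutation_p_value([row[p] for row in rows], ys) for p in predictors
    }
    best = min(p_values, key=p_values.get)
    # Bonferroni correction for testing several predictors at once
    adjusted = min(1.0, p_values[best] * len(predictors))
    return (best if adjusted < alpha else None), p_values


def make_rows():
    """Invented toy data: the variant depends on age but not on sex."""
    cells = [("20-54", "retained", 18), ("20-54", "deleted", 2),
             ("55+", "retained", 3), ("55+", "deleted", 17)]
    rows = []
    for age, variant, count in cells:
        for i in range(count):
            sex = "F" if i % 2 == 0 else "M"
            rows.append({"age": age, "sex": sex, "variant": variant})
    return rows


rows = make_rows()
best, p_values = choose_split(rows, "variant", ["age", "sex"])
print("split on:", best)
print("p-values:", p_values)
```

With this toy data, age is strongly associated with the variant while sex is not, so the sketch picks age for the first split. A full tree would repeat the same step within each resulting subgroup until no predictor passes the test.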

Language Variation Suite allows you to build a conditional tree from your data. After uploading your csv data, select Logistic Regression Analysis from the top panel “Inferential Statistics”.

Next, select your variables from the Modeling tab. First, choose your dependent variable (response). Second, choose your independent factors (predictors).


Finally, go to the Conditional Trees tab. It may take a couple of seconds to generate your plot, depending on how many factors you have selected. The following screenshot shows the conditional tree from my sociolinguistic data.

[Screenshot: conditional tree plot]

According to this plot, the distribution of intervocalic /d/ in Venezuelan Spanish is split by age into two groups: 20-54 and 55+. In the first age group, the production of /d/ is further conditioned by chronological period: more aspirated /s/ occurs in the 1987 corpus, especially among speakers of low economic class (a box plot is used here because the data are continuous), although the box plot also shows many outliers. The second age group (55+) is also conditioned by period. In the 1987 corpus, speakers of low economic status show the highest use of aspirated /s/. In this study, we calculate the intensity of /d/ production, that is, higher or lower intensity of aspiration.

This analysis is used for exploratory purposes. For more robust predictions, random forest and mixed regression models should be used (more on them in the next post).
