LVS (an interactive toolkit for sociolinguists) has been enhanced with new features: 1) support for Excel and CSV formats, 2) recoding and factor modification, and 3) token frequency extraction from text files.
The new Adjust Data menu (Data Panel) allows interactive data modification, such as excluding or recoding certain factors, or applying a logarithmic transformation to continuous data.
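The kind of adjustment described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the LVS implementation: the function `adjust`, the column names and the sample rows are all hypothetical, and the natural logarithm is an assumption (the post does not say which base is used).

```python
import math

# Hypothetical rows; column names and values are invented for illustration.
rows = [
    {"speaker": "A", "style": "formal", "duration": 100.0},
    {"speaker": "B", "style": "casual", "duration": 10.0},
    {"speaker": "C", "style": "reading", "duration": 1.0},
]

def adjust(rows, factor, exclude, recode, numeric):
    """Drop excluded factor levels, recode the rest, and add a log column."""
    adjusted = []
    for row in rows:
        if row[factor] in exclude:
            continue  # exclude this level from the dataset
        new_row = dict(row)
        # recode the level if a new label is given, otherwise keep it
        new_row[factor] = recode.get(row[factor], row[factor])
        # natural log (an assumption); values must be positive
        new_row["log_" + numeric] = math.log(row[numeric])
        adjusted.append(new_row)
    return adjusted

adjusted = adjust(
    rows,
    factor="style",
    exclude={"reading"},
    recode={"casual": "informal"},
    numeric="duration",
)
```

Here the "reading" rows are removed, "casual" is relabeled as "informal", and each remaining row gains a `log_duration` column.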
The new Frequency menu (Data Panel) extracts word frequencies from text files. This feature makes it possible to add a frequency column to your dataset. The dataset must have a column named token that contains words (a single word per cell).
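As a rough sketch of what such a frequency column involves, the Python below counts words in a text and looks up each value of the token column. The function names, the lowercasing step and the sample text are assumptions made for illustration; LVS itself may tokenize and count differently.

```python
import re
from collections import Counter

def token_frequencies(text):
    """Count lowercase word forms in a text (letters only)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

def add_frequency_column(rows, counts, token_col="token", freq_col="frequency"):
    """Return copies of the rows with a frequency looked up for each token."""
    return [
        {**row, freq_col: counts.get(row[token_col].lower(), 0)}
        for row in rows
    ]

# Hypothetical text and dataset, invented for this example.
text = "The cat saw the dog. The dog ran."
rows = [{"token": "the"}, {"token": "dog"}, {"token": "bird"}]

counts = token_frequencies(text)
with_freq = add_frequency_column(rows, counts)
```

Tokens that never occur in the text receive a frequency of 0 rather than raising an error.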
Feel free to explore LVS. It also works on smartphones and iPads. Please send us your feedback and suggestions.
“How can I make my code faster?” If you write R code, then you’ve probably asked yourself this question. A profiler is an important tool for doing this: it records how the computer spends its time, and once you know that, you can focus on the slow parts to make them faster. The preview releases […]
via Profiling with RStudio and profvis — RStudio Blog
We just added a new feature to Interactive Text Mining Tool: Word and Sentence Length.
In the Data Visualization Panel, select the Word Frequency tab and click on Length. You can select a specific document and explore its content.
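For readers curious about what a word and sentence length summary computes, here is a minimal Python sketch. It assumes that sentences end in ".", "!" or "?" and that words are runs of letters; the function `length_stats` is hypothetical and is not how ITMS is implemented.

```python
import re

WORD = r"[^\W\d_]+"  # runs of letters, including accented ones

def length_stats(text):
    """Return mean word length (letters) and mean sentence length (words)."""
    words = re.findall(WORD, text)
    # split on sentence-final punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"mean_word_length": 0.0, "mean_sentence_length": 0.0}
    words_per_sentence = [len(re.findall(WORD, s)) for s in sentences]
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": sum(words_per_sentence) / len(sentences),
    }

stats = length_stats("I like tea. You like coffee!")
```

Abbreviations such as "e.g." would be split into extra sentences by this simple rule, so real texts need a more careful sentence splitter.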
If you are interested in punctuation visualization (for more information, read Adam Calhoun’s blog), select the Punctuation Analysis tab and click Punctuation.
Interactive Text Mining Suite v.1.2 allows for more interactive data pre-processing.
ITMS uses qdapRegex, tm and RTextTools packages for data pre-processing. Feedback, suggestions and bug reports are greatly appreciated.
Our Language Variation Suite and Interactive Text-Mining Tool are now accessible from smartphones and iPads.
Use a QR scanner app to scan the following QR codes and open them in the browser on your smartphone or iPad. Make sure your files (CSV or text files) are in your Dropbox. Navigate to Descriptive Statistics in LVS or Data Preparation in ITMS, select Choose Files and upload them.
We always welcome any suggestions, feedback and bug reports!
Topic modeling refers to a family of algorithms that explain “an observed corpus with a small set of distributions over terms” and “models for uncovering underlying semantic structure of a document collection” (Blei et al. 2003, Blei et al. 2009, Blei 2012). Several algorithms have been put forth to build a probabilistic topic model, e.g. mixture of unigrams (Nigam et al. 2000), Latent Semantic Indexing (Deerwester et al. 1990; Hofmann 1999) and Latent Dirichlet Allocation, LDA (Blei et al. 2003). For more information, see Matthew Jockers and David Blei.
Interactive Text Mining Suite applies several LDA implementations (the topicmodels, lda and stm R packages). In addition, it allows users to choose the number of topics and iterations interactively and to select the best models.
We welcome suggestions and feedback.
ITMS integrates visual and statistical R with an interactive Shiny application to examine unstructured data (i.e. text documents). At present, ITMS provides several text-mining analyses for scholarly articles and literary texts (e.g. topic, frequency and cluster analyses).
ITMS is an ongoing project by an interdisciplinary team of researchers from Indiana University (Olga Scrivner and Jefferson Davis). We are also developing an NEH proposal to advance this research.
Your feedback, suggestions and bug reports are greatly appreciated (obscrivn AT indiana PERIOD edu).
A classification tree, or conditional tree, is a simple non-parametric regression analysis commonly used in social and psychological studies. While linear regression combines all the information linearly, a conditional tree presents it as recursive splits (tree branches), which makes the result easier to interpret visually (Strobl et al. 2009). If an independent variable (predictor) is significantly associated with the dependent variable (response), a branch is created that splits the values of that independent variable. For more information, see R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R; S. Tagliamonte & R.H. Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice; A. Lohmann. 2013. Classification trees and Random forests in linguistic study.
Language Variation Suite allows you to build a conditional tree from your data. After the CSV data is uploaded, select Logistic Regression Analysis from the “Inferential Statistics” top panel.
Next, select your variables in the Modeling tab. First, choose your dependent variable (response). Second, choose your independent factors (predictors).
Finally, go to the Conditional Trees tab. It may take a couple of seconds to generate your plot, depending on how many factors you have selected. The following screenshot is the conditional tree from my sociolinguistic data.
According to this plot, the distribution of intervocalic /d/ in Venezuelan Spanish is split by age into two groups: 20-54 and 55+. In the first age group, the production of /d/ is further conditioned by chronological period: more aspirated /s/ occurs in the 1987 corpus, especially in the low economic class (box plots are used here because the data are continuous), although the box plot also shows many outliers. The second age group (55+) is also conditioned by period. In the 1987 corpus, the low economic class shows the highest use of aspirated /s/. In this study, we measure the intensity of /d/ production, that is, higher or lower intensity of aspiration.
This analysis is used for exploratory purposes. For more robust predictions, random forest and mixed regression models should be used (more on them in the next post).
Cluster Analysis (Descriptive Statistics) examines how variables or individuals are grouped. The visual representation is often referred to as a dendrogram, as groups are clustered into tree branches.
To perform this analysis, select Cluster from Descriptive Statistics:
The next step is to select your dependent variable (response) from your uploaded CSV file. The dependent variable can be binary (e.g. yes, no), continuous (e.g. 1, 2, 3) or multinomial (e.g. deletion, retention, aspiration). The final step is to choose your independent factor. Make sure it has at least three values. For instance, my file has a dependent variable (Object-Verb/Verb-Object) from my data on Old French texts. I would like to examine how different genres are grouped according to their use of word order, so I select “genre” as my independent factor, which has more than three values.
I can see from the graph that there are three clusters in my data: 1) letters and treatise, 2) speech and hagio, and 3) narratives.
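To make the grouping idea concrete, here is a minimal Python sketch of agglomerative clustering on a single proportion per genre. It uses single linkage, which is an assumption: the post does not say which distance or linkage LVS uses. The function `cluster_levels` is hypothetical, and the proportions are invented purely so the example runs; they are not the Old French data.

```python
def cluster_levels(proportions, k):
    """Merge the closest groups (single linkage) until k clusters remain."""
    # start with every level of the factor in its own cluster
    clusters = [[name] for name in sorted(proportions)]

    def distance(a, b):
        # single linkage: the smallest gap between any two members
        return min(abs(proportions[x] - proportions[y]) for x in a for y in b)

    while len(clusters) > k:
        pairs = [
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)  # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Invented proportions of one word order per genre (illustration only).
ov_rate = {
    "letters": 0.10,
    "treatise": 0.12,
    "speech": 0.30,
    "hagio": 0.33,
    "narratives": 0.60,
}
groups = cluster_levels(ov_rate, 3)
```

A dendrogram records the order and height of these merges instead of stopping at a fixed number of clusters, which is why it can be read at any level of granularity.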