Qualtrics Data Management

PUBLISHED ON JUN 17, 2019 — R


This is for output from Qualtrics that is downloaded in non-legacy format (for scoring on the Author Recognition Task).

Before running

Before running the Qualtrics survey, make sure to name all your stimuli and corresponding questions so that they are distinguishable from other questions (i.e. consent, age, gender, etc.). If I have an item with two follow up questions, I name them as such:

  • S01_Q1: Where S01 is the item number (or T01), and Q1 is the first question
  • S01_Q2: Q2 is the second question

…and if you have three or more questions, S01_Q3 and so on.

This replaces the automatic Q# generated by Qualtrics. This is important because when the results are outputted, the columns will be these question labels.

If you did not format it this way, it’s ok. You can rename these directly in your Qualtrics output. As long as their is some reference to your design file, it should be fine. Another option is to copy your question labels from Qualtrics, and create a column in your design file that matches the label to each item.

Formatting the data

Setting up

These four packages need to be installed. Once they are installed, you do not need to install them again.

# Four packages needed to install, comment out once done (#)
# install.packages("reshape2") #only install once (requires R >= 3.1)
# install.packages("dplyr") #only install once (requires R >= 3.2)
# install.packages("tidyr") #only install once (requires R >= 3.1)
# install.packages("openxlsx") #only install once (requires R >= 3.3)

# You can check your version of R with the following line
version

# Need to run these lines every time you start a new R session
library(reshape2) 
library(dplyr)
library(tidyr)
library(openxlsx)

These packages allow you to use functions that help tidy up your data. Next, set the working directory to the folder where your output file is located. You do not need to change anything in the output file in Excel.

setwd("~/Documents/UNC Chapel Hill/Research/Second Year/TVNSF/Audio Only")
# I'm using a Mac, so this might differ on Windows
# The "~" is a shortcut that replaces the "Users" part of the path name. 
# You can put the full path name too.

Read the csv file into R. There should not be any errors, but if there are, make sure there are no weird symbols in the csv file.

Note: This file contains completely fake, generated data

a <- read.csv("Fake List A.csv", na.strings=c("")) 
# All blank cells will be read as NA

This object a now holds all the data, and can be viewed by either clicking on it in the Environment tab, or typing View(a). a can be named anything, I just chose it because the data in this file are from List A.

The Qualtrics output has three headers, which is unnecessary. We only need to keep the first one which has all our question labels. This line deletes the unneeded rows (which are the second and third).

a <- a[-c(1,2), ] # row 1 is not the header

# Note here that we are rewriting the object a, but you can create a new object too
# Also c() just indicates a vector. It's useful for selecting multiple values.

# Added in this code because R retains factors, even though data is deleted
a <- droplevels(a)

Delete participants who did not complete the study. The following line deletes rows for which participant who did not submit an MTurkCode at the end of the survey.

Or, if you have a specific column that determines who completed your study replace MTurkCode with that column (e.g. Finished).

a <- a[-which(is.na(a$MTurkCode)), ]

# Or if you used the Finished column
a <- a[-which(a$Finished == FALSE), ] 


Keeping relevant columns

If you labeled your questions informatively, you should be able to select the ones you want easily. For this particular data, anything that starts with “S”" is a stim, anything that starts with “F” is a filler.

Information on the select function

  • The select() is from the dplyr package
  • Includes (or excludes) columns of your choosing
  • Filter or add groups of columns that start the same way starts_with("column_name")
  • Or end the same way ends_with("column_name")
  • Or just pick and choose individual columns e.g. select(column_name), no quotes
  • Or use (-) and (+) signs. Here I have columns that I don’t want but start with “S” and “F”, so I have removed them manually using (-)
  • Column names depend on how you labeled them in Qualtrics. Or how you manually label them in Excel
  • I named mine by item names, S01 and F01, which distinguishes them from Q78457687…

a <- a %>% select(StartDate, Age, Gender, Education, MotherEduc, FatherEduc, 
                  Subj.SES, SES, starts_with("S"), starts_with("F"), ART,
                  -Finished, -Status, -Screen)

# You must store it in an object (here "a") for the change to take effect.
# The pipe (%>%) operator is used for efficiency, and lets you pass your data across functions

# This line lists out all the column names, just to check to see you have the columns you want
colnames(a)

Adding more lists

Because this is only List A (or 1), I need to create a column that indicates that.

a <- mutate(a, "List" = "A")

# You can also replace this with a number if your list is ordered that way.
a <- mutate(a, "List" = 1)

The column is now the last column, but this will be reordered and moved to the front later.

To read in more output files, do the above following steps for each list. Make sure to create a new object (e.g. b) and not overwrite the old one.

b <- read.csv("Fake List B.csv", na.strings=c(""))
b <- b[-c(1,2), ]
b <- droplevels(b)
b <- b[-which(is.na(b$MTurkCode)), ]
b <- b %>% select(StartDate, Age, Gender, Education, MotherEduc, FatherEduc, 
                  Subj.SES, SES, starts_with("S"), starts_with("F"), ART,
                  -Finished, -Status, -Screen)
b <- mutate(b, "List" = "B")

Now we have a List A and a List B, which are currently separate. Now we want to put them together in order to add the participant number.

IMPORTANT: You must have same number of columns and same column names.

# Creating a new object "d"
d <- rbind(a, b)

Now we can add a column for participant. I first sort it by the Start Date, then create a participant column. The numbering matches how many rows there are in total

d <- d[order(as.Date(d$StartDate, format="%m/%d/%y")),]
d <- mutate(d, "Participant" = 1:nrow(d))



Calculating ART Scores

Now we will create a new object that extracts only the participant number and their ART responses.

art <- d %>% select(Participant, ART) 

In non-legacy format, authors are all in one cell, separated by a comma. The following code separates out each author into their own cell, then assigns the fake authors -1, and the real authors 1.

art <- art %>% separate(ART, into = paste("A", 1:130, sep = "_"), sep = ",")

#Assign each foil -1
art[art=="Lauren Adamson" | art=="Eric Amsel" | art=="Carter Anvari" | art=="Margarita Azmitia" 
    | art=="Frank Bardin" | art=="Reuben Baron" | art=="Christopher Barr" 
    | art=="Gary Beauchamp" | art=="Lauren Benjamin" | art=="Thomas Bever" 
    | art=="Elliot Blass" | art=="Harrison Boldt" | art=="Hilda Borko" 
    | art=="Jennifer Butterworth" | art=="Katherine Carpenter" | art=="Devon Chang" 
    | art=="Charles Condie" | art=="John Condry" | art=="Julia Connerty" 
    | art=="Diane Cuneo" | art=="Denise Daniels" | art=="Geraldine Dawson" 
    | art=="W. Patrick Dickson" | art=="Aimee Dorr" | art=="Habib Farah" 
    | art=="Frances Fincham" | art=="Martin Ford" | art=="Howard Gardner" 
    | art=="Ryan Gilbertson" | art=="Carla Grinton" | art=="Mimi Hall" 
    | art=="Robert Inness" | art=="Lilly Jack" | art=="Lena Johns" 
    | art=="Kirby Kavanagh" | art=="Frank Kiel" | art=="Susan Kormer" 
    | art=="Reed Larson" | art=="Pricilla Levy" | art=="Lynn Liben" 
    | art=="Alex Lumsden" | art=="Hugh Lytton" | art=="Frank Manis" 
    | art=="Jennifer Marshal" | art=="Sophia Martin" | art=="Morton Mendelson" 
    | art=="James Morgan" | art=="Ryan Morris" | art=="Samuel Paige" 
    | art=="Scott Paris" | art=="Richard Passman" | art=="David Perry" 
    | art=="Peter Rigg" | art=="Miriam Sexton" | art=="K. Warner Shaie" 
    | art=="Robert Siegler" | art=="David Singer" | art=="Janice Taught" 
    | art=="Tracy Tomes" | art=="Nicole Waugh" | art=="Noah Whittington" 
    | art=="Ava Wight"| art=="Allister Younger"| art=="Steve Yussen"] <- -1

#Assign each real author 1
art[art=="Isabel Allende" | art=="Isaac Asimov" | art=="Margaret Atwood" | art=="Jane Austen" 
    | art=="Saul Bellow" | art=="Ray Bradbury" | art=="Dan Brown" 
    | art=="Willa Cather" | art=="Raymond Chandler" | art=="Tom Clancy" 
    | art=="Mary Higgins Clark" | art=="Suzanne Collins" | art=="Clive Cussler" 
    | art=="Nelson Demille" | art=="Sarah Dessen" | art=="Umberto Eco" 
    | art=="T. S. Eliot" | art=="Nora Ephron" | art=="William Faulkner" 
    | art=="F. Scott Fitzgerald" | art=="Sue Grafton" | art=="John Green" 
    | art=="John Grisham" | art=="Ernest Hemingway" | art=="Tony Hillerman" 
    | art=="Khaled Hosseini" | art=="Kazuo Ishiguro" | art=="James Joyce" 
    | art=="Jonathan Kellerman" | art=="Stephen King" | art=="Judith Krantz" 
    | art=="Wally Lamb" | art=="Harper Lee" | art=="C.S. Lewis" 
    | art=="Jack London" | art=="Robert Ludlum" | art=="Bernard Malamud" 
    | art=="Gabriel Garcia Marquez" | art=="Anne McCaffrey" | art=="Stephenie Meyer" 
    | art=="James Michener" | art=="Margaret Mitchell" | art=="Vladimir Nabokov" 
    | art=="Joyce Carol Oates" | art=="Michael Ondaatje" | art=="George Orwell" 
    | art=="James Patterson" | art=="Jodi Picoult" | art=="Thomas Pynchon" 
    | art=="Ayn Rand" | art=="Rick Riordan" | art=="Veronica Roth" 
    | art=="J. K. Rowling" | art=="Salman Rushdie" | art=="J. D. Salinger" 
    | art=="Nicholas Sparks" | art=="Danielle Steel" | art=="J. R. R. Tolkien" 
    | art=="Kurt Vonnegut" | art=="E. B. White" | art=="Virginia Woolf" 
    | art=="Herman Wouk"] <- 1

Now we can replace all the NAs with 0s, make sure the data is numeric, and sum up each row to calculate the ART score. The participant number is not included in this calculation, hence why it starts at column 2 and not 1.

#Gets rid of all NAs and changes them to 0s
art[is.na(art)] <- 0

#Changes data to numeric
art[] <- lapply(art, function(x) as.numeric(x))

#spit out total correct and total incorrect
art$Incorrect <- rowSums(art[2:130]<0)
art$Correct <- rowSums(art[2:130]>0)
art$Proportion <- art$Incorrect/art$Correct
art$ARTExclude <- ifelse(art$Proportion>0.5,1,0);

#Calculate ART Score! (and makes it part of art data)
art$ARTScore <- rowSums(art[2:130])

#Remove columns that had all the author names, leaving only particiant and score, and any exclusions
art <- art %>% select(Participant, ARTScore, ARTExclude)

The ART scores are in a separate object art, so they need to be added back to the rest of the data.

#add ART back into data
d <- merge(d,art,by="Participant")

#drop the column with all the author names
d <- d %>% select(-ART)



Wide to long format

The last thing to do is convert the data from wide to long format. There are a couple of functions that do this, but the one I use is the melt() function from the reshape package.

wideToLong <- melt(d, id.vars = c("StartDate", "Participant", "List", "Age", 
                                  "Gender", "Education", "MotherEduc","FatherEduc",
                                  "Subj.SES","SES", "ARTScore", "ARTExclude"),
             variable.name = "Item", value.name = "Response")
## Warning: attributes are not identical across measure variables; they will
## be dropped

The warning about attributes means that the number of factors differ across the items. For example, for some items, everyone agreed on the same item, but for other items it was varied. This warning is okay and is expected.

The variables in the id.vars are ones that stay consistent across the participants. The last remaining variables must be your item (S01_Q#) and participant responses, because those differ across participants. If this is not the case, something will go wrong.

Now to order by Participant

wideToLong <- wideToLong[order(wideToLong$Participant),]

Last thing is to create a question column, that will tell us whether a row is Question 1 or Question 2 (or 3 and so on). To do this, separate the current Item column into two distinct columns (S01_Q1 to S01 and Q1). If there are none, then skip this step. Again, if they are labeled correctly, this is easy to do.

#This separates the item column into Item and Question#
wideToLong <- wideToLong %>% separate(Item, into = c("Item","Question"), sep = "_")

#Ok but maybe I want to get rid of all the characters in that column, and leave just the number
#You do not have to 
wideToLong$Question <- gsub("[a-zA-Z ]", "", wideToLong$Question)



Merging with the design (in progress)

Using the same code as before, set the working directory to where the design file is. Then import the file.

setwd("~/Dropbox/Val and Jennifer/Transfer Verbs NSF")

design <- read.xlsx("TVNSF design.xlsx", sheet = 1)

Now we will combine any information we want from the design sheet and the results to create a master dataset. You can either include the entire design sheet, or just parts of it.

# Option 1: Just merge the two together.
master <- merge(design, wideToLong, by=c("List", "Item"))



Exporting to Excel

Last step is to export it as an Excel file

#Export the final object as an excel file
write.xlsx(wideToLong, file = "formatted.xlsx", colNames = TRUE, colWidths = "auto")