29.3 Importing Multiple Text Files from Multiple Folders
29.3.1 Creating a folder list
news20 <- "chap29/data/20news/"
29.3.2 Defining a function to read all files from a folder into a data frame
read_lines() of the readr package is used to read up to n_max lines from a file.
map() of the purrr package is used to transform its input by applying a function to each element of a list and returning an object of the same length as the input.
unnest() of the tidyr package is used to flatten a list-column of data frames back out into regular columns.
mutate() of dplyr is used to add new variables while preserving existing ones;
transmute() of dplyr is used to add new variables while dropping existing ones.
write_rds() of readr is used to save the extracted and combined data frame as an rds file for future use (see the sketch below).
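A minimal sketch of how these functions can be combined is shown below, assuming the tidyverse packages (readr, purrr, dplyr, tidyr) are loaded. The function name read_folder, the object name raw_text and the output path of the rds file are illustrative assumptions, not part of the original code.

read_folder <- function(infolder) {
  # one row per line of text, tagged with the name of the file it came from
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# apply read_folder() to every newsgroup sub-folder and combine the results
raw_text <- tibble(folder = dir(news20, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), id, text)

# save the combined data frame as an rds file for future use
# (the path below is an assumption)
write_rds(raw_text, "chap29/data/rds/news20.rds")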
29.5 Initial EDA
The figure below shows the frequency of messages by newsgroup.
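A minimal sketch of how such a chart can be produced with ggplot2 is shown below. It assumes the combined raw_text data frame created above (one row per line of text, identified by newsgroup and id) and uses fct_reorder() of forcats to order the bars.

raw_text %>%
  group_by(newsgroup) %>%
  summarise(messages = n_distinct(id)) %>%
  # horizontal bar chart, newsgroups ordered by number of messages
  ggplot(aes(x = messages,
             y = fct_reorder(newsgroup, messages))) +
  geom_col() +
  labs(y = NULL)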
The tidytext approach applies tidy data principles to processing, analysing and visualising text data.
Much of the infrastructure needed for text mining with tidy data frames already exists in packages like ‘dplyr’, ‘broom’, ‘tidyr’, and ‘ggplot2’.
The figure below shows the workflow of the tidytext approach for processing and visualising text data.
29.6.1 Removing headers and automated email signatures
Notice that each message has some structure and extra text that we don’t want to include in our analysis. For example, every message has a header containing fields such as “from:” or “in_reply_to:” that describe the message. Some also have automated email signatures, which occur after a line like “--”.
str_detect() of the stringr package is used to detect the presence or absence of a pattern in a string.
filter() of the dplyr package is used to subset a data frame, retaining all rows that satisfy the specified conditions, as sketched below.
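A minimal sketch of this cleaning step is shown below, assuming the raw_text data frame from the import step: the message body is taken to start after the first blank line, and everything from a signature line starting with "--" onwards is dropped. The object name cleaned_text is an assumption carried into the later chunks.

cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  # keep only the lines after the first blank line (i.e. after the header)
  filter(cumsum(text == "") > 0,
         # drop the signature line ("--") and everything after it
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()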
29.6.3 Text Data Processing
In the code chunk below, unnest_tokens() of the tidytext package is used to split the dataset into tokens, while the stop_words dataset supplied by tidytext is used to remove stop words.
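A minimal sketch of this step is shown below. The downstream chunks refer to the result as usenet_words; the filter that drops purely numeric tokens is an illustrative assumption.

usenet_words <- cleaned_text %>%
  unnest_tokens(word, text) %>%
  # keep tokens ending in a letter or apostrophe (drops pure numbers)
  # and remove common stop words
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)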
Now that we’ve removed the headers, signatures, and formatting, we can start exploring common words. For starters, we could find the most common words in the entire dataset, or within particular newsgroups.
usenet_words %>% count(word, sort = TRUE)
# A tibble: 5,542 × 2
word n
<chr> <int>
1 people 57
2 time 50
3 jesus 47
4 god 44
5 message 40
6 br 27
7 bible 23
8 drive 23
9 homosexual 23
10 read 22
# ℹ 5,532 more rows
Instead of counting individual words, you can also count words within each newsgroup by using the code chunk below.
words_by_newsgroup <- usenet_words %>%
  count(newsgroup, word, sort = TRUE) %>%
  ungroup()
29.6.4 Visualising Words in newsgroups
In the code chunk below, wordcloud() of the wordcloud package is used to plot a static wordcloud.
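A minimal sketch is shown below, assuming the words_by_newsgroup counts created above; the max.words and random.order settings are illustrative choices.

wordcloud(words = words_by_newsgroup$word,
          freq = words_by_newsgroup$n,
          max.words = 300,
          random.order = FALSE)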
tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
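One common formulation, which to the best of my understanding matches the default (natural-log, unsmoothed) behaviour of bind_tf_idf() in tidytext, is:

$$
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
$$

where $n_{t,d}$ is the number of times term $t$ appears in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.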
29.7.1 Computing tf-idf within newsgroups
The code chunk below uses bind_tf_idf() of tidytext to compute and bind the term frequency, inverse document frequency and tf-idf of a tidy text dataset to the dataset.
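A minimal sketch of this computation is shown below, assuming the words_by_newsgroup counts created earlier; the object name tf_idf is illustrative.

tf_idf <- words_by_newsgroup %>%
  # term = word, document = newsgroup, n = count of the word in that newsgroup
  bind_tf_idf(word, newsgroup, n) %>%
  arrange(desc(tf_idf))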
In the code chunk below, a bigram data frame is created by using unnest_tokens() of tidytext.
bigrams <- cleaned_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams
# A tibble: 28,827 × 3
newsgroup id bigram
<chr> <chr> <chr>
1 alt.atheism 54256 <NA>
2 alt.atheism 54256 <NA>
3 alt.atheism 54256 as i
4 alt.atheism 54256 i don't
5 alt.atheism 54256 don't know
6 alt.atheism 54256 know this
7 alt.atheism 54256 this book
8 alt.atheism 54256 book i
9 alt.atheism 54256 i will
10 alt.atheism 54256 will use
# ℹ 28,817 more rows
29.7.8 Counting bigrams
The code chunk below is used to count the bigrams and sort them in descending order of frequency.
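A minimal sketch is shown below; dropping the NA bigrams produced by blank lines is an illustrative assumption.

bigrams %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)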
# A tibble: 19,888 × 2
bigram n
<chr> <int>
1 of the 169
2 in the 113
3 to the 74
4 to be 59
5 for the 52
6 i have 48
7 that the 47
8 if you 40
9 on the 39
10 it is 38
# ℹ 19,878 more rows
29.7.9 Cleaning bigrams
The code chunk below is used to separate each bigram into two words.
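A minimal sketch, using separate() of the tidyr package; the object name bigrams_separated and the NA filter are illustrative assumptions.

bigrams_separated <- bigrams %>%
  filter(!is.na(bigram)) %>%
  # split each bigram into its two constituent words
  separate(bigram, into = c("word1", "word2"), sep = " ")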