How can I use the Twitter developer platform to scrape Twitter data?

Suggestion from user kang-jie-feng-15:

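Yes, it can be done, although the amount of historical data you can retrieve is limited. Here is a simple example that uses R to fetch tweets about climate change. It covers the trend in posting time, the distribution of tweet locations, and a basic sentiment analysis.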

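Before starting, you need to apply for a Twitter developer account and then create an App. In short: open your Developer Portal, create an App under Projects & Apps, and note down its App name, key, and secret. Then the analysis can begin. The code, explanations, and some of the results follow.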

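    # Load the required packages:
    # rtweet fetches and analyzes the data, tidytext and dplyr tidy it,
    # ggplot2 draws the plots, and syuzhet does the sentiment analysis
    library(rtweet)
    library(tidytext)
    library(dplyr)
    library(ggplot2)
    library(syuzhet)

    # First authorize R to fetch data from Twitter
    # (create_token() is the interface of rtweet versions before 1.0)
    twitter_token <- create_token(
      app = "your App name",
      consumer_key = "your key",
      consumer_secret = "your secret",
      set_renv = TRUE)
    # A new browser page opens; after you click to authorize, it shows:
    # "Authentication complete. Please close this page and return to R."

    # Use search_tweets to fetch the 1000 most recent tweets containing "climate change"
    # If needed, you can exclude retweets or fetch a specific user's tweets instead
    # Fetching history older than seven days seems to require a paid account
    climatetweet <- search_tweets("climate change", n = 1000)

    # Take a look at the fetched data
    head(climatetweet)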

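The data looks something like this. It includes the ID of the user who posted each tweet, the creation time, the screen name, the tweet text, the location, and so on: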

    # A tibble: 6 x 90
      user_id status_id created_at          screen_name text  source display_text_wi~ reply_to_status~
      <chr>   <chr>     <dttm>              <chr>       <chr> <chr>             <dbl> <chr>
    1 109614~ 13321517~ 2020-11-27 02:38:24 mapgirl61   "Why~ Twitt~              140 NA
    2 326555~ 13321517~ 2020-11-27 02:38:18 WAmediaGrl  "Con~ Twitt~              144 NA
    3 940038~ 13321517~ 2020-11-27 02:38:14 heartofwor~ "\"I~ Twitt~              140 NA
    4 434215~ 13321517~ 2020-11-27 02:38:11 Kerbear_xo  "Cli~ Twitt~              140 NA
    5 232376~ 13321516~ 2020-11-27 02:38:07 MrGragg     "Cli~ Twitt~               89 NA
    6 164160~ 13321516~ 2020-11-27 02:38:07 Marisol_Ma~ "A 7~ Twitt~              140 NA
    # ... with 82 more variables: reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
    #   is_retweet <lgl>, favorite_count <int>, retweet_count <int>, quote_count <int>...

Now we can start analyzing the data:

    # See when these 1000 tweets were posted. There are so many tweets about
    # climate change that we plot the counts by minute
    ts_plot(climatetweet, "mins")
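To make the per-minute aggregation concrete, here is a minimal base-R sketch of the kind of counting that a by-minute time series plot relies on. The timestamps are made up for illustration and stand in for `climatetweet$created_at`. This is not a description of how `ts_plot` is implemented internally.

```r
# Made-up timestamps standing in for climatetweet$created_at (assumption)
created_at <- as.POSIXct(
  c("2020-11-27 02:38:24", "2020-11-27 02:38:18", "2020-11-27 02:37:59",
    "2020-11-27 02:36:05", "2020-11-27 02:36:40"),
  tz = "UTC")

# Truncate each timestamp to its minute, then count tweets per minute
minute <- format(created_at, "%Y-%m-%d %H:%M", tz = "UTC")
per_minute <- as.data.frame(table(minute), responseName = "n",
                            stringsAsFactors = FALSE)
print(per_minute)
```

Swapping the format string for `"%Y-%m-%d %H"` would give hourly counts instead, which may suit less busy search terms better.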

    # Distribution of tweet locations
    climatetweet %>%
      count(location, sort = TRUE) %>%   # count tweets per location and sort
      subset(location != "") %>%   # drop empty values; many users do not disclose a location
      mutate(location = reorder(location, n)) %>%
      head(20) %>%   # keep the top 20 locations
      ggplot(aes(x = location, y = n)) +   # plot tweet counts against location
      geom_col() +
      coord_flip()

    # Find the words that appear most often in the tweets
    # Use regular expressions to strip unwanted strings:
    # links, @mentions, "&amp;" entities, line breaks, and punctuation
    climatetweet$text <- gsub("https\\S*", "", climatetweet$text)
    climatetweet$text <- gsub("@\\S*", "", climatetweet$text)
    climatetweet$text <- gsub("&amp;", "", climatetweet$text)
    climatetweet$text <- gsub("[\r\n]", " ", climatetweet$text)
    climatetweet$text <- gsub("[[:punct:]]", "", climatetweet$text)
    climatetext <- climatetweet %>%
      select(text) %>%
      unnest_tokens(word, text)
    # Drop stop words such as "the" and "to"
    climatetext <- climatetext %>%
      anti_join(stop_words, by = "word")
    # Plot
    climatetext %>%
      count(word, sort = TRUE) %>%
      head(15) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(x = word, y = n)) +
      geom_col() +
      coord_flip()
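To see what the cleaning steps do, the following sketch applies the same kind of regular expressions to two made-up tweets using base R only. The helper `clean_tweet` and the tiny `toy_stop_words` list are hypothetical names introduced for illustration. The real `stop_words` table in tidytext is much larger.

```r
# Two made-up tweets (assumption: real tweet text looks roughly like this)
tweets <- c(
  "Climate change is real https://t.co/abc123 @someuser",
  "Energy &amp; climate:\nwhat now?"
)

# Hypothetical helper that chains the cleaning regexes
clean_tweet <- function(x) {
  x <- gsub("https\\S*", "", x)      # drop links
  x <- gsub("@\\S*", "", x)          # drop @mentions
  x <- gsub("&amp;", "", x)          # drop HTML-escaped ampersands
  x <- gsub("[\r\n]", " ", x)        # turn line breaks into spaces
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_tweet(tweets)
print(cleaned)

# Split into lowercase words and drop a toy stop word list
words <- unlist(strsplit(tolower(cleaned), " ", fixed = TRUE))
toy_stop_words <- c("is", "what", "now")
words <- words[!words %in% toy_stop_words]
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

Matching the literal `&amp;` rather than the bare string `amp` avoids damaging ordinary words such as "example" that happen to contain those letters.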

    # Similarly, find the hashtags that appear most often
    # Strip the leading "c(" that appears when a tweet carries several hashtags
    climatetweet$hashtags <- gsub("c\\(", "", climatetweet$hashtags)
    climatetweet %>%
      count(hashtags, sort = TRUE) %>%
      mutate(hashtags = reorder(hashtags, n)) %>%
      na.omit() %>%
      head(10) %>%
      ggplot(aes(x = hashtags, y = n)) +
      geom_col() +
      coord_flip()
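The code above counts each tweet's whole hashtag combination as one value. As an alternative sketch, assuming the hashtags are stored as a list with one character vector per tweet, each hashtag can be counted individually in base R. The values below are made up.

```r
# Made-up stand-in for a hashtags list column (assumption)
hashtags <- list(
  c("ClimateChange", "COP26"),
  NA_character_,
  "climatechange",
  c("COP26", "Energy")
)

# Flatten to one vector, drop missing entries, and ignore case
tags <- unlist(hashtags)
tags <- tolower(tags[!is.na(tags)])

# Count how often each individual hashtag occurs
tag_counts <- sort(table(tags), decreasing = TRUE)
print(head(tag_counts, 10))
```

Lowercasing first means that `#ClimateChange` and `#climatechange` are treated as the same hashtag. Whether that is desirable depends on the analysis.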

    # Run a sentiment analysis on the tweet text
    # Convert the tweet text to ASCII for the analysis
    climatetweet_asc <- iconv(climatetweet$text, from = "UTF-8", to = "ASCII", sub = "")
    # Use the built-in lexicon to score each tweet on each emotion
    score_df <- get_nrc_sentiment(climatetweet_asc)
    # Sum the scores for each emotion and plot them
    score_stat <- data.frame("sentiment" = names(score_df), "score" = colSums(score_df))
    ggplot(score_stat, aes(x = sentiment, y = score)) + geom_col()
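Lexicon-based scoring of this kind boils down to looking up each word in a word-to-emotion table and counting the matches. The sketch below illustrates the idea in base R with a tiny made-up lexicon. `toy_lexicon` and `score_text` are hypothetical names, and the entries are not taken from the NRC lexicon that syuzhet uses.

```r
# Toy lexicon, invented for illustration (NOT the NRC lexicon)
toy_lexicon <- data.frame(
  word = c("crisis", "fear", "hope", "progress", "disaster"),
  sentiment = c("fear", "fear", "joy", "joy", "sadness"),
  stringsAsFactors = FALSE
)
emotions <- unique(toy_lexicon$sentiment)

# Made-up, already cleaned tweet texts
texts <- c("climate crisis brings fear",
           "hope and progress on climate",
           "no matches here")

# Hypothetical helper: count lexicon hits per emotion for one text
score_text <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  hits <- toy_lexicon$sentiment[match(words, toy_lexicon$word)]
  vapply(emotions, function(e) sum(hits == e, na.rm = TRUE), numeric(1))
}

# One row per text, one column per emotion
score_df <- as.data.frame(do.call(rbind, lapply(texts, score_text)))

# Total score per emotion, in the same shape as score_stat above
score_stat <- data.frame(sentiment = names(score_df), score = colSums(score_df))
print(score_stat)
```

A word-counting approach like this ignores negation and context, so the totals are best read as a rough indication of tone rather than a precise measurement.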

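This is only a simple example. For more complex, real-world work, see the academic literature in this area. The references for this post are: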

[1] Lesson 2. Twitter Data in R Using Rtweet: Analyze and Download Twitter Data

[2] A Guide to Mining and Analysing Tweets with R

[3] Twitter Sentiment Analysis and Visualization using R
