Do All Birds Tweet the Same? Characterizing Twitter Around the World

Barbara Poblete 1,2, Ruth Garcia 3,4, Marcelo Mendoza 5,2, Alejandro Jaimes 3
{bpoblete,ruthgavi,mendozam,ajaimes}

1 Department of Computer Science, University of Chile, Chile
2 Yahoo! Research Latin-America, Chile
3 Yahoo! Research Barcelona, Spain
4 Universitat Pompeu Fabra, Spain
5 Universidad Técnica Federico Santa María, Chile

ABSTRACT
Social media services have spread throughout the world in just a few years. They have become not only a new source of information, but also new mechanisms for societies world-wide to organize themselves and communicate. Therefore, social media has a very strong impact on many aspects of life: at a personal level, in business, and in politics, among many others. In spite of its fast adoption, little is known about social media usage in different countries, and whether patterns of behavior remain the same or not. A deep understanding of differences between countries can be useful in many ways, e.g., to improve the design of social media systems (which features work best for which country?) and to influence marketing and political campaigns. Moreover, this type of analysis can provide relevant insight into how societies might differ. In this paper we present a summary of a large-scale analysis of Twitter for an extended period of time. We analyze in detail various aspects of social media for the ten countries we identified as most active. We collected one year's worth of data and report differences and similarities in terms of activity, sentiment, use of languages, and network structure. To the best of our knowledge, this is the first on-line social network study of such characteristics.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Measurement

Keywords
Social Media Analytics, Social Networks, Twitter

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'11, October 24–28, 2011, Glasgow, Scotland, UK.
Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00.

1. INTRODUCTION
The use of social media has grown tremendously all over the world in recent years, and the impact of such growth has expanded in unexpected ways. Twitter, in particular, has become the most widely used microblogging service, and messages posted on it have, in many ways, reflected real-life events, from the revolutions in Tunisia and Egypt to natural disasters such as the Chilean and Japanese earthquakes. Twitter users, however, post and share all kinds of information, ranging from personal opinions on important political issues to mundane statements that may be of little interest to most, except for their closest friends.

The range and scope of the service, and the fact that most user profiles and tweets are public, create a huge opportunity for researchers. Using Twitter data, they can gain insights not just into how that particular service is used, but also into questions that are relevant in a social system at a particular point in time. This includes how news propagates, how people communicate, and maybe how they influence each other. Given this context, two key questions in the study of social media are how its use differs across cultures and countries, and whether any patterns revealed reflect behavioral differences and similarities between different groups.
In spite of a long tradition and a lot of research in cultural anthropology, sociology, and other fields that address cultural differences, very little work has been carried out which takes into account large data sets that specifically examine differences across countries¹.

¹ It is out of the scope of this paper to provide an in-depth review, but here we refer specifically to analyzing differences in social media across different countries, as there has of course been a lot of work on social network analysis of large data sets.

In this paper we present a summary of some of our findings when analyzing a large data set from Twitter. We perform this analysis in order to examine possible differences and similarities in several aspects of the use of the service. In particular, we focus on examining a year's worth of Twitter data for a large number of "active" users in the ten countries which tweet the most. We report on differences in terms of level of activity (number of tweets per user), languages used per country, the happiness levels of tweets, and the content of tweets in terms of re-tweets, mentions, URLs, and the use of hashtags. Additionally, we report differences and similarities in terms of the network structure. Our main contribution is to provide a series of insights on how tweeting behavior varies across countries, and on possible explanations for such differences. To the best of our knowledge, this is the largest study done to date on microblogging data, and the first one that specifically examines differences across different countries.

The rest of the paper is organized as follows. In Section 2 we give a brief overview of related work and in Section 3 we describe our data set. In Section 4 we describe the distribution of languages used in each of the ten most active countries in our data set, and the main findings of our analysis on the level of happiness in each country.
Section 5 focuses on the content of the tweets and network structure, and we conclude by summarizing our main findings in Sections 6 and 7.

2. RELATED WORK
The structure of social networks has been studied extensively because structure is strongly related to the detection of communities and to how information propagates. Mislove et al. [10], for example, studied basic characteristics of the structure of Flickr, Orkut, LiveJournal, and YouTube, and found power-law, small-world, and scale-free properties. The authors argue that the findings are useful in informing the design of social network-based systems. Kwak et al. [9] examined the Twitter network aiming to determine its basic characteristics. One of their main findings is that Twitter does not properly exhibit a "traditional" social network structure since it lacks reciprocity (only 22% of all connections on Twitter were found to be reciprocal), so it behaves more like news media, facilitating quick propagation of news. Java et al. [8], on the other hand, studied the topological and geographical properties of Twitter's social network and observed that there is high reciprocity and a tendency for users to participate in communities of common interest, and to share personal information. Onnela et al. [11] present a study on a large-scale network of mobile calls and text messages. They found no relationship between topological centrality and physical centrality of nodes in the network, and examined differences among big and small communities.

One of the key questions relating to communities and network structure is influence. De Choudhury et al. [4] examine Twitter data and study how different sampling methods can influence the level of diffusion of information. They found that sampling techniques incorporating both context (activity or location) and topology have better diffusion than if only context or topology is considered.
They also observed the presence of homophily, showing that users get together with "similar" users, but that the diffusion of tweets also depends on topics. Cha et al. [3] studied the in-degree and out-degree of the Twitter network and observed that influence is in fact not related to the number of followers, but that having active followers who retweet or mention the user is more important.

3. DATASET DESCRIPTION
Twitter is a platform which allows users to choose between keeping their profiles and activity (tweets) public or private. Users with private profiles make their information available only to a selected group of friends. For obvious reasons, we limit our research only to information provided by users with public profiles².

² All processing was anonymous and aggregated. No personal user information was used.

The focus of our research is mostly on characterizing large on-line social networks, based on user geographical location, for which Twitter provides limited information. Therefore, we perform an initial filter of users based on activity and profile information. First, we choose users which we determine to be active. For this we examine a 10-day continuous time window of user activity, selecting day-1 randomly from the year 2010. Then, we consider as active only those users which generated tweets during this time frame. Secondly, we filtered the resulting users to keep only active users which had also entered a valid location into their profiles during this same time period. We considered as a valid location any text which could be parsed correctly into latitude and longitude (using the Yahoo! public PlaceMaker API). It should be noted that we performed basically a static analysis, so we did not consider user mobility during this period. We did not process location information which was automatically generated for the user with a GPS device on their client application, based on the fact that location changes continuously with the user.
Since in this work we are interested more in characterizing geographical communities of users, we decided to use the location which reflects more accurately the user's home country.

Using these criteria, we obtained a set of 6,263,457 active users with valid location information, which were divided into 246 different countries. For the rest of our analysis, we selected the Top-10 countries with the most activity and gathered all of the tweets generated by these users for the entire duration of 2010. In total our working dataset consisted of 4,736,629 users (76% of the initial 10-day user sample), and 5,270,609,213 tweets. Figure 1 shows the distribution of the users in our dataset across the top-10 countries, and the activity that they generated for 2010. Note that the amount of activity registered for each country is not necessarily proportional to the number of users. This is explicitly shown in Figure 2, which displays the tweet/user ratio for each country. This ratio is independent of the number of users in each local network.

4. LANGUAGES AND SENTIMENT
Languages. To analyze the language in which tweets are written, we classify each tweet using proprietary software. As a result, 99.05% of the tweets were classified into 69 languages. The 10 most popular languages are shown in Figure 3. English is the most popular language, and it corresponds to nearly 53% of the tweets. Additionally, Figure 4 shows the three most common languages for each of the top-10 countries, as well as the percentage of tweets which correspond to these languages. It is worth noting that English is one of the three most frequently used languages for these countries, and for the Netherlands, Indonesia, and Mexico more than 10% of tweets are in English, while for Brazil it is 9%. Additionally, Italian and Catalan, which appear in Figure 3 and Figure 4, deserve special consideration.
Their presence is a strange finding, given that Italy is not among the top-10 countries of our study and the number of people who speak Catalan world-wide is very small. By sampling the tweets for Catalan and Italian we find that many of them correspond to false positives given by our classifier, since they actually correspond to Portuguese and Spanish. The high resemblance of these languages, in addition to the common use of slang, along with misspellings, makes automatic language identification particularly challenging.

Figure 1: Distribution of users (%) in the dataset for each Top-10 country and their activity (%).
Figure 2: Tweet/user ratio for each Top-10 country.

Sentiment Analysis. We also analyzed the sentiment component of tweets. For this we use the measure of happiness as coined by Dodds et al. [5], which is also more commonly referred to as valence. This value represents the psychological reaction which humans have to a specific word, according to a scale which ranges from "happy" to "unhappy". In particular, we analyze the happiness levels for each of the top-10 countries, considering only tweets classified as English and Spanish. To achieve this, we used the 1999 Affective Norms for English Words (ANEW) list by Bradley and Lang [1] for English tweets, and for Spanish, we used its adaptation by Redondo et al. [12]. The ANEW list contains 1,034 words and each word has a score in a 1 to 9 range, which indicates its level of happiness. We computed the "weighted average happiness level", based on the algorithms of Dodds et al. [5], as follows:

$$\mathrm{happiness}(C_l) = \frac{\sum_{i=1}^{N_l} w_i f_{i,C_l}}{\sum_{i=1}^{N_l} f_{i,C_l}} = \sum_{i=1}^{N_l} w_i\, p_{i,C_l} \qquad (1)$$

where $\mathrm{happiness}(C_l)$ represents the weighted average happiness level for a country $C$, based on all of its tweets in language $l$ (English or Spanish), during 2010. Therefore $C_l$ represents all of the tweets registered for the country $C$ which are expressed in the language $l$.
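Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `weighted_happiness`, the simple regular-expression tokenizer, and the toy word scores are all assumptions made for the example (the scores are invented and are not taken from the ANEW list).

```python
import re
from collections import Counter


def weighted_happiness(tweets, word_scores):
    """Weighted average happiness as in Equation (1).

    tweets      -- iterable of tweet texts for one country and language (C_l)
    word_scores -- dict mapping a scored word to its 1-9 happiness score (w_i)
    Returns sum_i(w_i * f_i) / sum_i(f_i), or None if no scored word occurs.
    """
    counts = Counter()  # f_i: frequency of each scored word in the collection
    for text in tweets:
        # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
        for token in re.findall(r"[a-záéíóúüñ']+", text.lower()):
            if token in word_scores:
                counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(word_scores[word] * freq for word, freq in counts.items()) / total


# Toy data: invented scores on the 1-9 scale, NOT real ANEW values.
toy_scores = {"happy": 8.0, "rain": 5.0, "sad": 2.0}
toy_tweets = ["Happy happy day", "Sad about the rain"]
print(weighted_happiness(toy_tweets, toy_scores))
```

Only words present in the score list contribute, so the result is an average over scored-word occurrences rather than over all tokens, which is what the normalized frequencies in Equation (1) express.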
Additionally, $N_l$ represents the number of words in the ANEW list for the language $l$, while $w_i$ is the score for the $i$-th word in the ANEW list for $l$, and $f_{i,C_l}$ corresponds to the frequency of this word in the collection $C_l$. Finally, we denote $p_{i,C_l}$ as the normalized frequency of each sentiment scored word in $C_l$.

The results of this sentiment analysis for (a) English and (b) Spanish are shown in Figure 5. These results agree with those reported by Dodds et al. [5]: the values are between 5 and 7 for both languages and there is also a general increase in happiness towards the end of the year. It is interesting to note that Brazil has the highest values almost every month, even though our analysis does not specifically consider Portuguese. Nevertheless, after August happiness levels in Brazil decrease until November. Also, in December all countries show an increase in their happiness level.

Figure 3: Most commonly used languages in each top-10 country.
Figure 4: Three most popular languages for tweets in each top-10 country.

Some differences can be appreciated in the results for Spanish tweets, Figure 5(b). The number of tweets in Spanish is disproportionate, as 7 countries account for less than 1% of the tweets, while Mexico, USA and Brazil together account for almost 98% of the total. Nevertheless, USA and Mexico have happiness patterns that are similar to most countries. Only Brazil and Indonesia show results which differ from the rest: there is a strong increase in happiness from June to July for Brazil and Indonesia. Interesting drops in levels happen in Indonesia during the months of May and August. Brazil has clearly the highest values for all months, but it also presents higher ups and downs.

5. CONTENT AND NETWORK STRUCTURE
Tweet Contents. In this part of our study, we analyze briefly certain tweet features for each top-10 country.
These features have also been used in prior work, such as [2]:
• #: indicates if a tweet contains a "#" symbol, which denotes a tweet with a particular topic.
• RT: indicates if a tweet has a "RT", which indicates a re-tweet or re-post of a message of another user.
• @: indicates whether a tweet contains an "@" symbol, used preceding a user name and which indicates a user mention.
• URL: denotes whether a tweet contains a URL or not.

We computed the average per user for each country as follows:

$$\mathrm{AVG}(\mathit{symbol}) = \frac{\sum_{i=1}^{N} \frac{T(\mathit{symbol})_{U_i}}{T_{U_i}}}{\sum_{i=1}^{N} U_i} \qquad (2)$$

where $\mathrm{AVG}(\mathit{symbol})$ is the average number of tweets per user of a particular country containing a feature denoted by $\mathit{symbol}$ (e.g. #, RT, URL, @). Also, $N$ is the total number of users for a particular country and $T(\mathit{symbol})_{U_i}$ is the total number of tweets containing that feature for user $U_i$.

Figure 5: Average happiness level for (a) English and (b) Spanish.

Country        Tweets/User   URL (%)   # (%)   @ (%)   RT (%)
Indonesia          1813.53     14.95    7.63   58.24     9.71
Japan              1617.35     16.30    6.81   39.14     5.65
Brazil             1370.27     19.23   13.41   45.57    12.80
Netherlands        1026.44     24.40   18.24   42.33     9.12
UK                  930.58     27.11   13.03   45.61    11.65
US                  900.79     32.64   14.32   40.03    11.78
Australia           897.41     31.37   14.89   43.27    11.73
Mexico              865.7      17.49   12.38   49.79    12.61
S. Korea            853.92     19.67    5.83   58.02     9.02
Canada              806        31.09   14.68   42.50    12.50
Table 1: Average usage of features per user for each country

Table 1 shows the average per country as well as the tweets/user ratio. Countries are ordered according to the tweets/user ratio. Results show that Indonesia ranks first in tweets per user, followed by Japan and Brazil. It is also interesting to see that Indonesia and South Korea have the highest percentage of mentions, in contrast to Japan, which has the lowest and also seems to be the country with the fewest re-tweets in our data set. The high share of mentions in Indonesia and South Korea indicates a higher use of Twitter for conversation than in other countries.
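The per-user averaging behind Table 1 (Equation 2) can be sketched as below. This is a rough illustration under stated assumptions, not the authors' code: it assumes that $T_{U_i}$ is the total number of tweets of user $U_i$ and that the denominator counts the users of the country, and it detects the four features with deliberately naive string checks. The names `feature_flags` and `average_feature_usage` and the toy tweets are hypothetical.

```python
FEATURES = ("#", "RT", "@", "URL")


def feature_flags(text):
    """Return which of the four features a single tweet contains (naive checks)."""
    tokens = text.split()
    return {
        "#": "#" in text,
        "RT": "RT" in tokens,  # "RT" as a standalone token
        "@": "@" in text,
        "URL": "http://" in text or "https://" in text,
    }


def average_feature_usage(tweets_by_user):
    """Average, over users, of the share of each user's tweets with each feature.

    tweets_by_user -- dict mapping a user id to the list of that user's tweets
    Returns a dict feature -> percentage; users without tweets are skipped.
    """
    totals = {name: 0.0 for name in FEATURES}
    n_users = 0
    for tweets in tweets_by_user.values():
        if not tweets:
            continue
        n_users += 1
        flags = [feature_flags(t) for t in tweets]
        for name in FEATURES:
            hits = sum(1 for f in flags if f[name])
            totals[name] += hits / len(tweets)  # per-user fraction
    if n_users == 0:
        return {}
    return {name: 100.0 * totals[name] / n_users for name in FEATURES}


# Toy data, invented for illustration only.
toy_users = {
    "a": ["RT @bob nice #win", "hello"],
    "b": ["see http://example.com", "@amy hi"],
}
print(average_feature_usage(toy_users))
```

Averaging per-user fractions, rather than pooling all tweets of a country, keeps a handful of very prolific accounts from dominating the country-level percentage.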
The Netherlands is the country with the most hashtags per user, while the US seems to be the country with the most mentions of URLs per user. At first glance, this could indicate that the US uses Twitter more for formal news dissemination, constantly citing external sources.

Network. Twitter also provides a social network structure for its users. That is, users connect to each other through directed links; therefore relationships are not necessarily reciprocal, as they are in Facebook. Users can choose to follow other users, by subscribing to their updates. These connections between users can be viewed as a large directed graph.

Country        Users       Cov. (%)   Links        Cov. (%)   Recip. (%)
US             1,616,702      12.47   11,310,538      12.46        18.91
Brazil           688,427       5.31    4,248,259       4.68        13.49
UK               286,520       2.21    1,370,699       1.51        17.22
Japan            133,536       1.03      408,486       0.45        32.01
Canada           132,240       1.02      553,726       0.61        26.11
Indonesia        130,943       1.01      199,704       0.22        26.97
Mexico           112,793       0.87      399,409       0.44        17.27
Netherlands       86,863       0.67      354,021       0.39        22.11
South Korea       80,381       0.62      499,261       0.55        28.14
Australia         67,416       0.52      299,556       0.33        23.51
Table 2: General summary of network statistics per country.

Country          Avg. δ   Density     Avg. Clus. Coef.   Strongly CC
US                 8.95    0.56E-04             0.0645         9,667
Brazil             7.55    1.09E-04             0.0711         4,813
Indonesia          2.12    1.62E-04             0.0618         7,942
United Kingdom     6.05    2.11E-04             0.0933        14,818
Japan              4.36    3.26E-04             0.0603         6,052
Mexico             4.44    3.91E-04             0.0826         6,885
Canada             5.73    4.33E-04             0.1001         6,630
South Korea        8.61   10.67E-04             0.0879         3,864
Netherlands        5.39    6.16E-04             0.1017         4,626
Australia          5.83    8.52E-04             0.0959         3,423
Table 3: Summary of network density statistics per country.

In this section we focus on the analysis of the Twitter social network graph for each top-10 country and its active users (as defined in Section 3). In order to obtain this graph, we extracted user relationships using the public Twitter API (4J), collecting the list of followers/followees for each user.
In this particular graph, connections between users are highly dynamic, so we worked with a snapshot of the graph, which was crawled between November 25 and December 2, 2010. This crawl resulted in 12,964,735 users and 90,774,786 edges. We cleaned this dataset to keep only edges and users which corresponded to our active user set. Prior work [10] has shown that analysis of partial crawls of social networks can underestimate certain measures, such as degree distribution, but continues to preserve accuracy for other metrics, such as density, reciprocity and connectivity. Therefore, by preserving the active component of the graph we are analyzing the most relevant part of the social structure.

Table 2 shows a summary of each country's statistics. For each local network analysis, we consider only connections between users in the same country. The two Cov. (%) columns in Table 2 show the node and edge coverage of each country in relation to the complete graph. We also show the percent of reciprocity, which is the fraction of ties between users which are symmetric. Overall, the top-10 most active countries cover 25.73% of the total of active users in the social graph. Additionally, these countries cover 21.64% of the total number of edges in the global network. Table 2 shows that for some countries reciprocity is very significant, in particular for Japan, South Korea, Indonesia and Canada. The symmetric nature of social ties affects network structure, increasing connectivity and reducing the diameter, as we show in the remainder of this work.

Table 3 shows a summary of graph density statistics, such as average degree (δ), density and average clustering coefficient. The US and South Korea are the countries with the highest average degree per node, meaning that users tend to concentrate more followers and followees than in other countries. Indonesia, on the other hand, presents a very low degree (only 2.12 edges per node on average)
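
The per-country measures discussed above (users, links, reciprocity, average degree, density) can be illustrated on a directed edge list as follows. This is only a sketch using common textbook definitions, which are an assumption here: the exact formulas behind Tables 2 and 3 are not spelled out in the text, so the code is not claimed to reproduce those values. The function name `network_summary` and the toy edges are hypothetical.

```python
def network_summary(edges):
    """Summarize a directed follower graph given as (follower, followee) pairs.

    Assumed definitions (not confirmed by the paper):
      reciprocity -- share of directed links whose reverse link also exists
      avg_degree  -- directed links per node
      density     -- links divided by the n * (n - 1) possible directed links
    """
    edge_set = {(u, v) for u, v in edges if u != v}  # drop duplicates and self-loops
    nodes = {node for edge in edge_set for node in edge}
    n, m = len(nodes), len(edge_set)
    reciprocal = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return {
        "users": n,
        "links": m,
        "reciprocity_pct": 100.0 * reciprocal / m if m else 0.0,
        "avg_degree": m / n if n else 0.0,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }


# Toy graph, invented for illustration: a and b follow each other.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
print(network_summary(toy_edges))
```

To mirror the local network analysis described in the text, the edge list would first be restricted to pairs of active users located in the same country.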