Wednesday, 30 January 2013

Those who sit at the front get better grades


The rich club phenomenon in the classroom

Nature Scientific Reports 3, Article number: 1174
doi:10.1038/srep01174

We analyse the evolution of the online interactions held by college students and report on novel relationships between social structure and performance. Our results indicate that more frequent and intense social interactions generally imply better scores for students engaging in them. We find that these interactions are hosted within a “rich-club”, mediated by persistent interactions among high performing students, which is created during the first weeks of the course. Low performing students try to engage in the club after it has been initially formed, and fail to produce reciprocity in their interactions, displaying more transient interactions and higher social diversity. Furthermore, high performance students exchange information by means of complex information cascades, from which low performing students are selectively excluded. Failure to engage in the rich club eventually decreases these students' communication activity towards the end of the course.


Introduction




More than 1.2 million students drop out of school every year in the U.S., one every 26 seconds [1]. Year 2007 dropouts will cost more than $300 billion in lost wages, taxes and productivity to the U.S. Dropouts contribute about $60,000 less in federal and state income taxes. Each cohort of dropouts costs the U.S. $192 billion in lost income and taxes [2]. A dropout student is more than 8 times as likely to be in jail or prison as a high school graduate and nearly 20 times as likely as a college graduate [3].
Early detection of poor performance will allow more time to take corrective actions and will likely help to reduce the number of dropouts. Therefore, it is of the utmost importance to be able to assess the performance of students in a continuous manner.
Computer science is not unaware of this need for close follow-up of students. Computer Supported Collaborative Learning (CSCL) is a branch of computer science that intersects with pedagogy and social sciences. Indeed, one of the goals of CSCL is to explore appropriate methods/tools for evaluating collaboration so that more insight can be gained into the results of lecturing/teaching procedures [4].
However, systematic gathering and analysis of educational data in natura has only recently started. So far this analysis has mainly tried to determine static structural features of the social learning network formed by the students. For instance, Nurmela et al. looked at the structure of the interactions, trying to determine the central actors in a CSCL environment [5]. In this social structure, “key communicators” were assumed to be the most connected individuals in time-aggregated networks [6]. Similar analyses were carried out by Martínez et al. [7] and Chen and Watanabe, who focused on other structural parameters that are important for the final score: group structure, members' physical location distribution, and members' social position [8].
Beyond this merely static structural analysis, the literature also highlights the key role of student interaction for effective learning. At a societal scale, Granovetter's pioneering work [9] recognised the importance of interaction patterns and proposed his well-known “strength of weak ties” phenomenon, where he hypothesised that isolated social ties offer limited access to external prospects, while heterogeneous social ties diversify one's opportunities.
While the relevance of the social network structure and interactions has been widely recognised in the educational context [10], some other factors have recently been under the spotlight, e.g. social acceptance or willingness to communicate [11]. In general, it is not just about knowing “who” the students interact with, but “how” and “when” they do it and, importantly, what the result of these interactions is with regard to the educational outcome [12].
Preliminary answers to the “how” question come from different works. The effects of analysing the relationships between web forum users on the structure of the network (reconstructed from the messages sent) were studied in [13, 14]. Also, the type of interaction or content being exchanged has been considered [6, 16]. However, these previous analyses were based on a static snapshot of the structure and interactions of the network at some point in time or included a reduced number of samples. For instance, [7] analysed these macroscopic metrics in the four different assignments the course was structured in (approximately once a month).
Acquiring full knowledge on “how” students interact would be facilitated by having access to dynamic interactions and their changes with time. Timing is a determinant element to understand the correspondence between student behaviour and performance. Therefore, this paper tries to determine the individual and group-level behavioural patterns that lead to low scoring and possible dropout. Gaining insight into these data could help in identifying “groups at risk”, enabling educators to act sooner and hopefully reduce dropout rates.
The rest of this paper is organised as follows. The next section presents the main results obtained from our analysis. This is followed by a broader discussion.

Results



We analysed a record of college student interactions, compared the social interaction data with the students' academic scores, and examined how this relationship evolves with time (see the third paragraph of Course Details in Methods in the Supplementary Information (SI) for a concrete definition of what an interaction is in this context). To this end, we analysed records of 80,000 interactions by 290 students, approximately 16 times more interactions with almost 3 times more students than previous studies on educational networks in natura [5, 6, 7, 8, 10, 12, 15]. Even so, the data can still be considered sparse (approximately 4.6 interactions per person per day). This sparseness is partly due to the fact that our work does not include verbal in-classroom interactions or other communication mechanisms, such as the discussion groups that are typical in most universities.
Figure 1A shows a snapshot of the social graph for one of the classes being analysed. Supplementary video S1 offers a complete weekly sequence of interactions between students in one of the courses we analysed.

Figure 1: Diversity and Assortativity Analysis.


(A) Graph of one of the analysed courses, including 82 students, at the end of the last week of the course. Continuous thick blue edges indicate persistent interactions while dotted thin grey edges indicate transient interactions. High performing students are shown in dark blue, mid performing ones in red and low performing ones in green. As can be observed, high performance students form a “core” where the highest density of persistent interactions can be observed. Low performance students remain in the periphery of the graph, mainly holding transient interactions. (B) Scatter plot and linear regression for one of the variables analysed (number of interactions) vs. scoring in one of the classes (R² = 0.72). (C) Scatter plot and linear regression for social diversity vs. scoring in one of the classes (R² = 0.12). (D) Ratio of transient to persistent interactions obtained for different groups of students with different levels of interaction (LOW, MID, HIGH).

Diversity and assortativity analysis

Our first finding is that, in this environment, social diversity is negatively correlated with performance. This is explained by our second finding: high performing students interact in groups of similarly performing peers. This effect is stronger the higher the performance of the student. Indeed, low performance students tend to initiate many transient interactions regardless of the performance of the students they interact with. These interactions held by low performance students start late in the course, allowing high performers to establish a closely knitted group. In the following, we give details of these findings.
We start by comparing the score of each student with diversity metrics associated with the interactions held by each member of the social network (as shown in the SI). We characterise the nature and diversity of interaction ties within an individual's social network. Specifically, social diversity is defined as Shannon's entropy associated with individual communication behaviour, normalised to the total number of interactions (see Methods in SI for more details). Since both Shannon's entropy and the total number of interactions depend on the degree (number of connections), this normalisation reduces the correlation between low degree and high social diversity (see Figure S1 in Supplementary material).
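As an illustration of the diversity metric described above, the following is a minimal sketch of a normalised Shannon entropy computed per student. The exact normalisation is given in the SI, which is not reproduced here, so this sketch assumes division by log(k), the maximum entropy attainable with k distinct peers. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def social_diversity(contacts):
    """Normalised Shannon entropy of one student's interactions.

    contacts: list with one entry (a peer id) per interaction.
    Returns a value in [0, 1]; 0.0 when fewer than two distinct peers exist.
    """
    counts = Counter(contacts)
    total = sum(counts.values())
    k = len(counts)  # degree: number of distinct peers
    if k < 2:
        return 0.0
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # ASSUMPTION: normalise by log(k); the paper's SI may use another factor.
    return entropy / math.log(k)

# Hypothetical students: one concentrates on a single peer, one spreads evenly.
focused = ["ana"] * 8 + ["luis"] + ["eva"]
spread = ["ana", "luis", "eva", "juan", "sara"] * 2
print(social_diversity(focused), social_diversity(spread))
```

Under this assumed normalisation, a student who splits interactions evenly across peers scores close to 1, while a student who concentrates on one peer scores noticeably lower.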
The number of connections (students that a student has interacted with) and the number of interactions (times a student has contacted or been contacted by other students) (see Methods in SI) were positively correlated with the final score of the student (Pearson's correlations of 0.81 and 0.85, respectively; p < 0.01), as shown in Figure 1B. Principal component analysis of these metrics revealed that all of them were closely interrelated, resulting in a non-significant improvement when combined (see Methods in SI). However, social diversity was negatively correlated with final scores (Pearson's correlation of –0.34, p < 0.01) (Figure 1C). The reader is reminded that correlation does not imply causation and that diversity cannot be regarded as the cause of low scores from these results.
To further analyse the effects on score, students were grouped into high (> 6.5), mid (between 6.5 and 3.5) and low (< 3.5) scoring (scores in Spain are typically given on a 0–10 scale, with 10 being the top score). To verify the suggested existence of less effective interactions, we also classified the interactions into two types: 1) persistent, those sustained over time, and 2) transient, those not reciprocated within a week. We found that at the end of the course up to 28 ± 12% of the interactions held by high performing students were persistent, which is statistically different from those held by mid (14 ± 5%) or low (1 ± 0.5%) performance students (n = 290, p < 0.05).
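The grouping and labelling rules in this paragraph can be sketched as follows. The score thresholds are taken from the text. The labelling rule is a simplification: an interaction is treated as transient when the receiver sends nothing back within seven days, and as persistent otherwise. The paper's full definition of "sustained over time" is in the SI, so that part is an assumption, as are the function names and the sample log.

```python
from datetime import datetime, timedelta

def score_group(score):
    """Thresholds from the text: high > 6.5, low < 3.5, mid otherwise."""
    if score > 6.5:
        return "high"
    if score < 3.5:
        return "low"
    return "mid"

def label_interactions(log, window=timedelta(days=7)):
    """log: list of (timestamp, sender, receiver) tuples.

    ASSUMPTION: an interaction is 'persistent' if the receiver contacts the
    sender back within `window`, and 'transient' otherwise.
    """
    log = sorted(log)
    labelled = []
    for t, src, dst in log:
        replied = any(
            s == dst and d == src and t < t2 <= t + window
            for t2, s, d in log
        )
        labelled.append((t, src, dst, "persistent" if replied else "transient"))
    return labelled

# Hypothetical log with three messages.
t0 = datetime(2012, 10, 1)
sample_log = [
    (t0, "a", "b"),
    (t0 + timedelta(days=2), "b", "a"),
    (t0 + timedelta(days=3), "c", "a"),
]
for row in label_interactions(sample_log):
    print(row[1], "->", row[2], row[3])
```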
We analysed the average ratio of transient to persistent interactions per neighbour: a higher number indicated less targeted interactions. This is illustrated in Figure 1D for one of the three classes under analysis (results were similar for the other two classes).
The presence of more focused and sustained interactions did not stop high scoring students from interacting with fellow students with mid or low scores in a transient manner (similar number of transient interactions regardless of the score). An assortativity analysis [17] on these persistent interactions with regard to score indicated the existence of preferential interaction initiation (r = 0.5, p < 0.05 by using the Jackknife method, see Methods in SI). In other words, similarly scoring students tended to keep persistent interactions only between themselves.
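For readers unfamiliar with assortativity by a scalar attribute, here is a minimal sketch: the coefficient is the Pearson correlation between the scores found at the two ends of each edge, with every undirected edge counted in both directions. The jackknife error estimate mentioned above is not included, and the function name, edge lists and scores are hypothetical.

```python
def score_assortativity(edges, score):
    """Pearson correlation of scores across the two ends of each edge.

    edges: list of (u, v) undirected pairs; score: dict node -> score.
    Each edge contributes both (u, v) and (v, u), so both ends share one mean.
    """
    xs, ys = [], []
    for u, v in edges:
        xs += [score[u], score[v]]
        ys += [score[v], score[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else float("nan")

# Hypothetical scores on the 0-10 scale used in the course.
scores = {"a": 9, "b": 8, "c": 2, "d": 3}
like_with_like = [("a", "b"), ("c", "d")]
mixed = [("a", "c"), ("b", "d")]
print(score_assortativity(like_with_like, scores))
print(score_assortativity(mixed, scores))
```

A positive value indicates that ties tend to join similarly scoring students, and a negative value indicates the opposite.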
This assortative behaviour with regard to scoring is highly suggestive of a “rich club” phenomenon (see Methods in SI and [18, 19]). A “rich club” is defined as a set of nodes with degree larger than k that tend to be more densely connected among themselves than the nodes with degree smaller than k. When we performed this analysis taking all the types of interaction into account, we could observe no “rich club” effect (  for the students with more links, indicating they also interacted with students outside the “rich club”). However, when only persistent interactions were taken into account, we obtained  , which is in line with the idea of high scoring students keeping persistent interactions between themselves as indicated by our assortativity analysis. The “rich club” phenomenon could not be observed during the first weeks, φ(r) ≪ 1, and it became apparent only after week 4–5 for the top performing students, remaining stable afterwards.
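The definition above can be made concrete with a short sketch of the raw rich-club coefficient: among the nodes with degree larger than k, it is the fraction of possible links between them that actually exist. This sketch omits the normalisation against randomised networks proposed in the rich-club literature, and it is not necessarily the exact quantity reported in the paper. The function name and the toy graph are hypothetical.

```python
def rich_club_coefficient(edges, k):
    """Density of links among nodes whose degree is larger than k.

    edges: list of (u, v) undirected pairs. Returns None when fewer than
    two nodes exceed degree k, since the density is then undefined.
    """
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    rich = {n for n, nb in neighbours.items() if len(nb) > k}
    if len(rich) < 2:
        return None
    # Every link inside the club is seen from both ends, hence the division.
    links = sum(1 for n in rich for m in neighbours[n] if m in rich) / 2
    return 2 * links / (len(rich) * (len(rich) - 1))

# Toy graph: a fully connected core (a, b, c), each with one peripheral contact.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("a", "d"), ("b", "e"), ("c", "f")]
print(rich_club_coefficient(toy_edges, 1))  # core only
print(rich_club_coefficient(toy_edges, 0))  # all nodes
```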

Temporal analysis

One interesting finding is that the total number of interactions per week (normalised to the maximum value in all weeks) for all groups increased over time and it saturated around week 6 for mid performing students and around week 4 for high performing students (Figure 2A). In both cases, the number of persistent and transient interactions increased until saturation as the weeks went by. However, the number of interactions for low scoring students behaved in a strikingly different manner. The total number of interactions increased until week 4, when it started to drop steadily until the end of the course (Figure 2A). We believe this may be due to a lack of incentives to interact as revealed by our reciprocity measurements (see two paragraphs below).


Figure 2: Persistent Interaction Analysis.
(A) Temporal evolution of the total number of interactions in all groups. The y-axis indicates the number of interactions per group per week normalised to the value of the week when the maximum number of interactions was recorded for that group. This figure pools normalised data from all three courses available. High performing students start to interact earlier and keep interacting throughout the whole course. The same applies to mid performing students, although their interactions start a bit later in the course. Low performing students start interacting later than high performing ones and their interactions drop with time. The maximum values used for normalising these curves were 150, 36, 57 and 63 for all, high, mid and low interactions, respectively. (B, C and D) Evolution of the % of persistent interactions (relative to the average total # of interactions of that group) per week and per student group (low, (B); mid, (C); and high, (D)) relative to the total number of interactions per group per week. Continuous lines represent the fit of a curve to the points as indicated in Methods. As can be observed, the % of persistent interactions increases as the course progresses for all groups of students. High performing students achieved a higher % of persistent interactions than mid and low performing ones.
A closer look at the data revealed that the percentage of persistent interactions increased in all groups, but with different timing, as shown in the persistent interaction analysis (see Figure 2B, C, D). As indicated in Table 2, the midpoint for the sigmoid function was 6.08, 4.81 and 3.2 weeks for low, mid and high performing students (p < 0.05). This suggested that high performing students on average established persistent interactions before mid and low performance students did (1 and 2 weeks earlier, respectively). Also, mid performing students started to establish persistent interactions 1 week before low performance students did. If one takes the slope of the sigmoid as a reference, it can be observed that there was no significant difference in the rate of change from a “low interaction mode” to a “high interaction mode” between mid and high performing students (0.58 vs. 0.4769). These data are in line with those on the number of connections, interactions and attendance (Figure 3A, B and C), which showed that low performance students tried to engage later in the course, while mid and high performing students started their interactions earlier. These data are aligned with the number of students that stopped delivering their assignments and therefore did not pass the course. The average percentage of students dropping the course was 24.5%, 31.5% and 0% for low, mid and high performance students, respectively. Approximately 80% of these dropouts occurred after the 9th week of the course. The higher attendance level of high performing students may also be causing the higher number of persistent interactions, although our analysis does not let us establish any causal relationship.
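To show what the midpoint and slope of such a fit represent, here is a minimal sketch that fits a logistic curve to weekly percentages by a coarse grid search. The functional form, the grid ranges and the synthetic data are all assumptions made for illustration; the fitting procedure actually used is described in Methods and the SI.

```python
import math

def sigmoid(week, top, slope, midpoint):
    """Logistic curve: reaches top / 2 at `midpoint`; `slope` sets steepness."""
    return top / (1.0 + math.exp(-slope * (week - midpoint)))

def fit_sigmoid(weeks, values, top):
    """Least-squares grid search over slope and midpoint, with `top` fixed.

    ASSUMPTION: a simple stand-in for a proper non-linear fit.
    """
    best = None
    for m in range(10, 121):        # midpoints 1.0 .. 12.0 weeks
        for s in range(1, 301):     # slopes 0.01 .. 3.00
            midpoint, slope = m / 10, s / 100
            err = sum(
                (sigmoid(w, top, slope, midpoint) - v) ** 2
                for w, v in zip(weeks, values)
            )
            if best is None or err < best[0]:
                best = (err, slope, midpoint)
    return best[1], best[2]

# Synthetic 12-week series generated from known parameters (not paper data).
weeks = list(range(1, 13))
values = [sigmoid(w, 28.0, 0.5, 3.2) for w in weeks]
slope_hat, midpoint_hat = fit_sigmoid(weeks, values, top=28.0)
print(slope_hat, midpoint_hat)
```

An earlier midpoint means the group switches to mostly persistent interactions earlier in the course, which is the comparison the paragraph above draws between groups.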


Table 1: Summary of the cascade analysis performed across the three groups of students (p < 0.05 between any two groups)


Table 2: Sigmoid Fitting Results. Constants obtained on fitting a sigmoid curve to the data


Figure 3: Course Data Details.
(A) Shows the evolution of the degree of the nodes in the graph per week per scoring group for all three courses. (B) Number of actual communications held per day on a given week grouped per scoring group. (C) An estimation of the attendance of the students to the course, based on the number of log-ons performed on any day in that week in any of the systems available for them to communicate. As can be observed, the degree remained almost constant for mid and high performing students, while it started to increase around week 4 and slowly declined later on for low performance students. This same pattern is observed for the number of interactions held by the students. These data are consistent with our estimation of “attendance”, where low performing students have a significantly lower number of logins into the system. All panels show data from one of the courses under study only. The whiskers in the Figure show the estimated error in the mean.
Taking the data on the increasing percentage of persistent student interactions together with the assortativity analysis (students preferred to interact with those who have similar scores/performance), our results suggested that at some point reciprocity Ri,j (measured as the fraction of times a student i in any given group responds to a student j outside her own group) should start to drop. However, reciprocity remained unchanged with time and was similar between groups (approximately 0.7). By analysing the direction of the initiation of the interaction we could see that persistent interactions held between members of different groups are highly symmetric (having almost even initiations starting from both ends). On the contrary, transient interactions between members of different groups are almost always initiated by the student with lower performance (with 0.87 probability). In addition, the timing of responses was different. While persistent interactions are responded to within 8.1 ± 0.3 hours on average, the response to transient interactions is delayed by 7.21 ± 0.46 days.
This could indicate that low performance was due to either a lack of interest on the part of the students or simply that no valuable content was conveyed in these delayed interactions. Since the content of these interactions was not logged, we restricted ourselves to finding whether there were any differences in the way content flowed between students and groups of students.
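A reciprocity measure of the kind described above might be sketched as follows. The matching rule is an assumption: a message from j to i counts as answered if i sends any later message to j. The precise definition used in the study is in the SI. The function name and the sample log (timestamps in hours) are hypothetical.

```python
def reciprocity(log, i, j):
    """Fraction of messages j -> i that are followed by a later i -> j message.

    log: list of (time, sender, receiver). Returns None if j never wrote to i.
    ASSUMPTION: any later message in the opposite direction counts as a reply.
    """
    received = [t for t, s, d in log if s == j and d == i]
    replies = [t for t, s, d in log if s == i and d == j]
    if not received:
        return None
    answered = sum(1 for t in received if any(r > t for r in replies))
    return answered / len(received)

# Hypothetical exchange between a low and a high performing student.
pair_log = [
    (0, "low", "high"), (8, "high", "low"),
    (30, "low", "high"), (200, "low", "high"),
    (210, "high", "low"), (300, "low", "high"),
]
print(reciprocity(pair_log, "high", "low"))  # how often "high" answers "low"
print(reciprocity(pair_log, "low", "high"))  # how often "low" answers "high"
```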

Information cascades

Information cascades reveal spread mechanisms in which an action or idea becomes adopted due to the influence of others, typically neighbours in some network. A well-known example is the cascades observed in large product recommendation networks [21, 22, 23, 24].
In order to detect the presence of information cascades and determine the actual value of the communication, we needed to gain insight on the content of the messages exchanged by students. Since this would be a clear violation of students' privacy, we decided to analyse another source of information: file exchange of students in their home directories and in their Moodle and collaborative workspace accounts (see “Information Cascades” in Methods in SI).
We defined as trivial cascades those implying a single transfer (a single originating source and a single destination) of information about the course, and as non-trivial cascades those with more complex patterns. We found a total of 845 cascades, of which 53.37% were trivial cascades (T1 in Figure 4), 25% were non-trivial cascades involving transfer from a single source to many destinations in the same time frame, and the remaining 11% of the cascades were topologically more complex.


Figure 4: Information Cascades.
Most Frequent Cascades for Low Performing (A) and High Performing (B) students. Students initiating, relaying or receiving a document were supposed to be part of the cascade. As can be observed high performance students keep more complex information cascades in sharing documents in the systems available. Low performing students use a more straightforward “relay” strategy, forwarding documents to other students.
The total number of cascades was significantly different across all three groups: 51%, 35.97% and 13.03% for high, mid and low performance students, respectively (see Table 1).
Our data revealed that the length of the cascade (number of synchronous transfers) gradually increased as the average score of the students involved in the cascade increased. This is also supported by the fact that among non-trivial cascades, the most common pattern for low performance students was star-like (T2 and T3 in Figure 4, 97.8%), while chained cascades (T4, T5 and T6 in Figure 4) were more common for mid (53.82%) and high (76.29%) performing students.
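The cascade shapes discussed in this section can be told apart with a simple rule of thumb, sketched below. The labels loosely follow the text (trivial, star-like, chained), but the rules themselves are assumptions made for illustration and are not the classification procedure described in the SI.

```python
def classify_cascade(transfers):
    """Classify a cascade given as a list of (source, destination) transfers.

    ASSUMED rules:
      trivial - a single transfer
      star    - one source sending to several destinations
      chained - at least one recipient relays the document further
      other   - anything else (for example, several sources, no relay)
    """
    if len(transfers) == 1:
        return "trivial"
    sources = {s for s, _ in transfers}
    receivers = {d for _, d in transfers}
    if len(sources) == 1:
        return "star"
    if sources & receivers:
        return "chained"
    return "other"

# Hypothetical cascades.
print(classify_cascade([("a", "b")]))
print(classify_cascade([("a", "b"), ("a", "c")]))
print(classify_cascade([("a", "b"), ("b", "c")]))
```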

Discussion




Being limited to non-verbal interactions between students prevented us from capturing a wealth of valuable interactions and led to some sparseness in our data. We combined fine-grained educational data at unprecedented temporal resolution in educational settings (approximately 4.6 events per student per day) and gained insight into the type of interaction patterns that are associated with lower performance.
The major finding is that a higher number of online interactions (independently of the number of distinct students involved) is usually an indicator of higher score.
Our data show that increased social diversity is negatively correlated with high scores; most diversity metrics are correlated with the degree of the vertices (e.g. Shannon's entropy or topological diversity as in [25]) and this may lead one to think that social diversity is high in low performing students because their number of connections (degree) is low. We minimised this effect by normalising Shannon's entropy to degree.
The results also show that the higher the score of the students, the higher the percentage of their interactions that were persistent. These results were independent of gender differences (correlation of gender to score was −0.04). As the score of the student increases, these persistent interactions are initiated with a reduced number of similarly performing colleagues (assortative interaction pattern). Low performance students have a larger number of transient interactions spread over a large number of neighbours.
The dynamics of these interactions reveal that once students start to establish persistent interactions they do it more and more until a maximum saturation point is reached. High performing students tend to initiate persistent interactions before low performance ones, suggesting more willingness to collaborate. A striking fact is that these high performance students still maintain more than approximately 70% of transient interactions, mostly with mid performance students. Our reciprocity analysis shows that students try to contact high performance students and these respond, although the latter do not usually initiate disassortative interactions with low performance students.
These early persistent interactions enable high performance students to build a “rich club”, while low performance students barely interact. Low performance students start to interact later (around week 4–5), when their “attendance” also increased just to decrease again towards the end of the course. This delay may help to explain why low performance students initiated more interactions that decreased after they failed to engage in persistent interactions with high performing students, since the “rich-club” had already been formed.
We could not monitor the content of the students' private messages and decided to perform an information diffusion analysis that could help us gain insight into the value of the content actually being exchanged. Our results revealed that low performance students generally exchange documents in a trivial manner (i.e. in a forwarding manner that spans a single hop). On the contrary, more complex and longer cascades occur in high performing groups. This indicates the existence of a highly organised network where similarly performing students exchange information in a well-structured fashion, following characteristic patterns that are different across groups. While high performing students mainly exchange documents in a chained manner, low performance students spread the information to many other students at the same time, without this document apparently being relayed to other students beyond the recipient. Indeed, low performance students were not typically included in the information chains developed by high performing students. By this we do not mean to imply a deliberate behaviour of students, but it most likely indicates the presence of a benefit maximisation process by which students focus their efforts on potentially more fruitful connections.
Low performance students drastically reduce the number of interactions after week 5, which may indicate a lack of motivation that leads them to drop the course and focus on other tasks. This per se does not let us conclude a lack of skills or motivation on the part of low performance students. For instance, external factors may cause both fewer interactions and dropping the course (e.g. too many extracurricular activities). The lack of data that could enable causality inference in our analysis precludes us from concluding whether inefficient interactions, external factors or both are the cause of the dropout/reduced performance.
Even though we cannot directly build a causality chain, our empirical data suggest that: 1) low performing students engage later in the course; 2) this late engagement is related to their exclusion from the highly-structured and persistent information exchanges held by high performing students; 3) low performing students try to compensate by initiating a larger number of weak interactions; 4) since this attempt to catch up is not successful, low performance students drastically reduce the number of interactions.
Our study did not allow us to distinguish the root cause (initial delay in interacting, low degree or a combination of both) of the increased social diversity found in low performing students.
As part of our future work, we aim to perform a detailed causality analysis to detect the root cause of the low performance. This may help to get low performing students involved in high performing chains and hopefully increase their final score and reduce dropout rates. On the other hand, this may have a negative effect on high scoring students, who will get many more interactions. We also plan to expand this analysis to non-university environments.

Methods




The data consist of the interactions of 290 students at a Spanish university, during two consecutive years of a 12-week long course on Basic Computer Science Skills (Linux tools such as OpenOffice and GIMP, and content licensing schemes such as Creative Commons) for freshman students of journalism.
An interaction is defined as a communication attempt via the aforementioned systems. We logged the time and direction of the interaction in the Chat and the class IRC (see Table 3 for a detailed list of interactions and types). Confidentiality prevented us from performing an examination on the content of these interactions. Moodle and our collaborative workspace let us keep track of documents shared by students.


Table 3: Percentage of Interactions per Communication Channel. Average % of interactions taking place over the different communication channels employed in our study. No significant differences were found between different groups of students. Moodle interaction count was increased only if the post received an answer. The collaborative workspace let us include interactions from blog posts, document shares, reminders or messages in the collaborative space. Each chat and classroom IRC session (sequence of messages exchanged without stopping for more than 3 min) counted as a single interaction
These interactions were used to build a graph with a fine-grained temporal granularity (see Communication Channels in the SI). Diversity, grouping and connectivity metrics were calculated on the graph (see SI) [20]. These metrics were analysed and compared throughout the course. A snapshot of the quality of the data set can be observed in Figure 5.
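The session rule stated in the Table 3 caption (a chat or IRC session is a sequence of messages with no pause longer than 3 minutes, and counts as a single interaction) can be sketched as below. Timestamps are assumed to be in seconds; the function name and the sample values are hypothetical.

```python
def count_sessions(timestamps, max_gap=180):
    """Count chat sessions in a list of message timestamps (seconds).

    A pause longer than `max_gap` seconds starts a new session, so each
    session contributes one interaction to the tally.
    """
    count = 0
    previous = None
    for t in sorted(timestamps):
        if previous is None or t - previous > max_gap:
            count += 1
        previous = t
    return count

# Hypothetical message times: one pause of 300 s splits them into two sessions.
print(count_sessions([0, 60, 200, 500, 560]))
```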


Figure 5: Quality of the Data.
Probability density distribution of the number of interactions (A) and connections (B) per group in one of the courses being analysed.
Finally, we studied how files appeared and spread across the HOME directory students kept in the servers of the Lab (see SI).

References




  1. Diplomas Count 2007: Ready for What? Preparing Students for College Careers and Life after High School. Education Week 26 (2007).
  2. Rouse, C. The Labor Market Consequences of an Inadequate Education. Princeton University and NBER. In: Equity Symposium on The Social Costs of Inadequate Education at Teachers College, Columbia University, edited by Clive Belfield and Henry M. Levin (Washington: Brookings Institution Press, 2007). Available: http://devweb.tc.columbia.edu/manager/symposium/Files/77_Rouse_paper.pdf Last visited: 4-1-2013
  3. Harlow, C. Education and Correctional Populations. In: U.S. Department of Justice, Bureau of Justice (Washington DC, 2003). Available: www.ojp.usdoj.gov/bjs/pub/pdf/ecp.pdf Last visited: 4-1-2013
  4. Nurmela, K., Lehtinen, E. & Palonen, T. Evaluating CSCL log files by social network analysis. In: Proceedings of the 1999 conference on Computer support for collaborative learning, Article 54 (International Society of the Learning Sciences, 1999).
  5. Cho, H., Gay, G., Davidson, B. & Ingraffea, A. Social networks, communication styles, and learning performance in a CSCL community. Computers & Education 49, 309–329 (2007).
  6. Martinez, A., Dimitriadis, Y., Rubia, B., Gomez, E. & De La Fuente, P. Combining qualitative evaluation and social network analysis for the study of classroom social interactions. Computers & Education 41, 353–368 (2003).
  7. Chen, Z. & Watanabe, S. A Case Study of Applying SNA to Analyze CSCL Social Network. In: ICALT 2007. Seventh IEEE International Conference On Advanced Learning Technologies, 18–20 (2007).
  8. Granovetter, M. The strength of weak ties. The American Journal Of Sociology 78, 1360–1380 (1973).
  9. Sundararajan, B. Impact of communication patterns, network positions and social dynamics factors on learning among students in a cscl environment. PhD thesis (Troy, NY, USA, 2007).
  10. Yu, A. Y., Tian, S. W., Vogel, D. & Chi-Wai Kwok, R. Can learning be virtually boosted? An investigation of online social networking impacts. Computers Education 55, 1494–1503 (2010).
  11. Ullrich, C., Borau, K. & Stepanyan, K. Who students interact with? a social network analysis perspective on the use of twitter in language learning. In: Proceedings of the 5th European Conference on Technology Enhanced Learning Conference on Sustaining TEL: from innovation to learning and practice, 432–437 (Berlin, Heidelberg: Springer-Verlag, 2010).
  12. Yeung, Y. Y. Macroscopic study of the social networks formed in web-based discussion forums. In: Proceedings of the Conference on Computer Support for Collaborative Learning: the next 10 years!, 727–731 (International Society of the Learning Sciences, 2005).
  13. Kepp, S. J. & Schorr, H. Analyzing collaborative learning activities in wikis using social network analysis. In: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems, 4201–4206 (New York, 2009).
  14. Cho, H., Gay, G., Davidson, B. & Ingraffea, A. Social networks, communication styles, and learning performance in a CSCL community. Computers And Education 49, 309–329 (2007).
  15. Erlin, B., Yusof, N. & Rahman, A. Integrating Content Analysis and Social Network Analysis for analyzing Asynchronous Discussion Forum. In: ITSim 2008, International Symposium On Information Technology 318 (2008).
  16. Newman, M. Mixing patterns in networks. Physical Review E 67, 026126 (2003).
  17. Zhou, S. & Mondragon, R. J. The rich-club phenomenon in the Internet topology. IEEE Communications Letters 8, 180–182 (2004).
  18. Colizza, V., Flammini, A., Serrano, M. A. & Vespignani, A. Detecting rich-club ordering in complex networks. Nature Physics 2, 110–115 (2006).
  19. Wang, P., Hunter, T., Bayen, A. M., Schechtner, K. & González, M. C. Understanding Road Usage Patterns in Urban Areas. Scientific Reports 2, 1001 (2012).
  20. Leskovec, J., Singh, A. & Kleinberg, J. Patterns of influence in a recommendation network. In: Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, 380–389 (Springer-Verlag, 2006).
  21. Leskovec, J., Adamic, L. A. & Huberman, B. A. The dynamics of viral marketing. In: Proceedings of the 7th ACM conference on Electronic commerce, 228–237 (New York, 2006).
  22. Yang, J. & Leskovec, J. Temporal Variation in Online Media. In: Proceeding of the ACM International Conference on Web Search and Data Mining, 177–186 (New York, 2011).
  23. Leskovec, J., Adamic, L. A. & Huberman, B. A. The dynamics of viral marketing. ACM Transactions On The Web 1, 5 (2007).
  24. Eagle, N., Macy, M. & Claxton, R. Network diversity and economic development. Science 328, 1029 (2010).

Acknowledgements




We would like to thank Charles Elkan, Miranda Mowbray, Nabeel Gillani, Suksant Sae Lor, and Kate Mallichan for their insightful comments on the manuscript and Yannis Dimitriadis and Eduardo Gomez for inspiring this work. Manuel Cebrian acknowledges support from the National Science Foundation under grant 0905645, from DARPA/Lockheed Martin Guard Dog Program under PO 4100149822, and the Army Research Office under Grant W911NF-11-1-0363.

Author information




Affiliations

  1. Hewlett-Packard Laboratories, Bristol BS34 8QZ, UK

    • Luis M. Vaquero
  2. NICTA, Melbourne, Victoria 3010, Australia

    • Manuel Cebrian
  3. Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA

    • Manuel Cebrian

Contributions

Conceived, designed and performed the experiments: L.M.V. Analysed the data: L.M.V., M.C. Wrote the paper: L.M.V., M.C.

Competing financial interests

The authors declare no competing financial interests.


Saturday, January 26, 2013

Twitter in the Australian floods


Social Network Analysis of Tweets During Australia Floods

This study (PDF) analyzes the community of Twitter users who disseminated information during the crisis caused by the Australian floods in 2010-2011. “In times of mass emergencies, a phenomenon known as collective behavior becomes apparent. It consists of socio-behaviors that include intensified information search and information contagion.” The purpose of the Australian floods analysis is to reveal interesting patterns and features of this online community using social network analysis (SNA).
The authors analyzed 7,500 flood-related tweets to understand which users did the tweeting and retweeting. This was done to create nodes and links for SNA, which was able to “identify influential members of the online communities that emerged during the Queensland, NSW and Victorian floods as well as identify important resources being referred to. The most active community was in Queensland, possibly induced by the fact that the floods were orders of magnitude greater than in NSW and Victoria.”
The analysis also confirmed “the active part taken by local authorities, namely Queensland Police, government officials and volunteers. On the other hand, there was not much activity from local authorities in the NSW and Victorian floods prompting for the greater use of social media by the authorities concerned. As far as the online resources suggested by users are concerned, no sensible conclusion can be drawn as important ones identified were more of a general nature rather than critical information. This might be comprehensible as it was past the impact stage in the Queensland floods and participation was at much lower levels in the NSW and Victorian floods.”
Social Network Analysis is an under-utilized methodology for the analysis of communication flows during humanitarian crises. Understanding the topology of a social network is key to information diffusion. Think of this as a virus infecting a network. If we want to “infect” a social network with important crisis information as quickly and fully as possible, understanding the network’s topology is a requirement, as is, therefore, social network analysis.
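
The node-and-link step described in the study can be sketched roughly as follows. This is only an illustration: the input format (pairs of retweeter and original author), the function names and the sample accounts are all assumptions made here, not details taken from the study.

```python
from collections import Counter, defaultdict


def build_retweet_graph(retweets):
    """Map each retweeter to the set of authors they retweeted.

    `retweets` is an iterable of (retweeter, original_author) pairs,
    a hypothetical simplification of raw tweet data.
    """
    graph = defaultdict(set)
    for retweeter, author in retweets:
        if retweeter != author:  # ignore self-retweets
            graph[retweeter].add(author)
    return dict(graph)


def most_retweeted(graph, top=3):
    """Rank users by in-degree: how many distinct users retweeted them."""
    in_degree = Counter(
        author for targets in graph.values() for author in targets
    )
    return in_degree.most_common(top)


# Hypothetical sample data, invented for illustration only.
sample = [
    ("alice", "police_account"),
    ("bob", "police_account"),
    ("carol", "police_account"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("dave", "dave"),
]
graph = build_retweet_graph(sample)
ranking = most_retweeted(graph)
print(ranking)
```

In-degree is only one of several ways to define "influential"; it is used here because it is the simplest measure that fits the description of who was retweeted most.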

Saturday, January 19, 2013

Tangled social appearances


Social networks: is everyone lying?

A study claims that on Facebook, Twitter and other networks men want to seem more intelligent and women more attractive, and that the image everyone shows of themselves has little to do with reality.

Alone

The verdict is beyond appeal: on social networks, almost everyone offers an enhanced image of themselves, projecting in their digital profiles what they would like to be, which bears little resemblance to reality. Women try to appear more attractive and men are concerned with offering a fun and intelligent image. This is revealed by a brand-new survey carried out by Intel in Europe and the Middle East, dubbed UltraYou.
Its conclusions show that users' online behavior has changed a great deal since the explosion of social networks. "In general, everyone wants to project an improved version of who they really are," it states. "Our personality in the digital world is a projection of personal wishes and aspirations, which does not always match reality," the authors explain.
The survey also recorded gender differences. Fifty-four percent of the women surveyed admitted to retouching and manipulating their online images to look more attractive, something only 20 percent of the men do. The men, according to the survey results, focus their efforts on posting phrases that make them seem more interesting and intelligent, as well as funny.
Another interesting finding is that the study detected differences between countries: in the Netherlands and the Czech Republic people show off their family and pets online, while in Egypt and the United Arab Emirates respondents try, in their posts and messages, to give their online persona a more intellectual aura.

Healthy distrust
Another striking finding of the survey is that users are aware that people tend to lie on social networks: more than half of those interviewed said that messages that seem 'too good to be true' give away some exaggeration or pose behind them that does not match the facts. What is more, 4 in 10 say that when a photo is very, very good and clearly retouched with Photoshop, one takes for granted that the person is lying.


Among women

Sunday, December 9, 2012

The bitter Twitter of political campaigns


Beyond "Bitter Twitter": Crowd-Photography for the Cyber-Tahrir Square

BY MICAH L. SIFRY | Tuesday, July 10 2012 - TechPresident



 A close-up view of @davidaxelrod's Twitter footprint from July 6, 2012



"This election may be remembered as the Bitter Twitter campaign," former Bush adviser Mark McKinnon said recently. With both campaigns avoiding offering any big new ideas, he predicted that "we are likely to see the next [few] months as a furious and relentless exchange of messages that aren't much longer or deeper than 140 characters."
Is that all that can be said about the daily Twitter wars between the Romney and Obama camps? To get a bigger picture (or rather, as you will see below, several bigger pictures) I turned to Marc A. Smith, chief social scientist with the Connected Action consulting group, who has long studied computer-mediated collective interaction. For the last few years, he and his colleagues have been doing what he chirpily calls "crowd photography for the cyber-Tahrir Square," using a sophisticated network mapping tool called NodeXL they've developed as a free extension to Excel. And the graphs they make offer a whole new way of seeing the connections among people, events and ideas as they coalesce online. Here, for example, is the conversation online when the "pink slime" controversy was at its height. Here's a chatter map for the controversy over "Fast and Furious" when Attorney General Eric Holder and Rep. Darrell Issa recently locked horns.
But how to read them? Take this snapshot of the public conversation on Twitter after the first day of this year's Personal Democracy Forum conference. Here, Smith got data on 711 Twitter users whose recent tweets contained the hashtag "#pdf12." Users are clustered together if they have lots of connection to each other, based on who follows whom, who replies to whom, and who mentions whom, and then each cluster is given its own rectangle based on those users having used similar hashtags in addition to #pdf12, with those listed in order of prominence.
Blue lines are links NodeXL builds when one user mentions, replies to or retweets another. Green lines represent people who follow each other. The "isolates" — in this case in the bottom right — are users who don't connect to others on the graph. "Reading" the graph (which I did with Smith's help), you can learn several things about the social landscape of the #PDF12 community.
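
To make the three ingredients of such a graph concrete, here is a small sketch of how mention links, mutual-follow links and isolates could be separated from raw relationship data. It is not NodeXL's implementation; the function name, the input format and the sample users are hypothetical and chosen only to mirror the description above.

```python
def classify_edges(users, follows, mentions):
    """Split relationships into the edge kinds the article describes.

    users    : set of user names in the sample
    follows  : set of (follower, followed) pairs
    mentions : set of (source, target) pairs covering mentions,
               replies and retweets
    """
    # "Green" edges: pairs of users who follow each other.
    green = {
        frozenset((a, b))
        for a, b in follows
        if a != b and (b, a) in follows
    }
    # "Blue" edges: directed mention/reply/retweet links between sampled users.
    blue = {
        (a, b)
        for a, b in mentions
        if a != b and a in users and b in users
    }
    # Isolates: users who appear in neither kind of edge.
    connected = {u for pair in green for u in pair}
    connected |= {u for edge in blue for u in edge}
    isolates = set(users) - connected
    return green, blue, isolates


# Hypothetical sample, invented for illustration.
users = {"ann", "ben", "cat", "dan"}
follows = {("ann", "ben"), ("ben", "ann"), ("cat", "ann")}
mentions = {("cat", "ben")}
green, blue, isolates = classify_edges(users, follows, mentions)
print(green, blue, isolates)
```

Here "cat" follows "ann" without being followed back, so that relationship produces no green edge; "cat" still counts as connected because of her mention of "ben", and only "dan" ends up as an isolate.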
There are three major subgroups that were tweeting from and about PDF12. First and largest, in the top left corner, are the open-government folks, whose top three hashtags after #PDF12 were #opengov, #opendata and #OGP (for the Open Government Partnership). At the center of this cluster, not surprising, you can see the Twitter avatars of power-tweeters like O'Reilly Media's Alex Howard (@digiphile), who covers open government and open data on a daily, even hourly, basis.
The second group in the top left are the PDF progressives, netroots folks who had likely just attended the annual Netroots Nation conference in Providence, RI. Their usage of hashtags like #NN12 and #ows (Occupy Wall Street) is a sure giveaway of their closeness to each other.
And third, in the bottom left corner of the graph, are the PDF netheads (you can recognize Zeynep Tufekci @techsoc, for example) who were paying closest attention to issues like online privacy, SOPA and PIPA.
A few additional observations are in order. First, there's not a lot of hierarchy to this community of individuals. There are lots of bilateral connections between people, as evinced by all the crosshatching green and blue lines. Second, the relatively large number of "isolates" in the bottom right corner is a good sign, Smith told me. It suggests that #PDF12 was also reaching beyond its core audience, and shows that we're not a completely "built-out" network. In other words, that there's room for growth. Network maps for other conferences, he says, often show that they're just speaking to themselves. And finally, the small cluster at the bottom of the middle of the graph indicates that there were probably some people using "pdf12" in their tweets during the conference who had absolutely nothing to do with the event, but maybe had some other common point of contact, such as linking to this document: http://www.eluniversal.com.mx/graficos/pdf12/recuento-eleccion.pdf.
Axelrod vs Fehrnstrom
Now take a look at how the Twitter conversation coalesces around two leading online surrogates for the Obama and Romney campaigns, David Axelrod and Eric Fehrnstrom. Axelrod is the Obama campaign's senior strategist; Fehrnstrom is one of Romney's senior campaign advisors. I asked Smith to run NodeXL against a recent list of Twitter users who either mentioned @davidaxelrod or @ericfehrn, and this is what he found.
Obviously, Axelrod's network is different than Fehrnstrom's. He has more than 103,000 followers on Twitter, compared to Fehrnstrom's 13,000. And he tweets more often, roughly 5-6 times a day, while Fehrnstrom tweets at most two or three times. As a result, their NodeXL graphs cover very different time periods. For Axelrod, it just took 13 hours on July 6th to accumulate 790 users who mentioned "davidaxelrod" in a tweet. It took 10 days, from June 27 thru July 6, to accumulate a similar number (876, actually) of users who mentioned "ericfehrn."
Here's the snapshot of Axelrod's Twittersphere around July 6th. You can view a high-definition "zoomable" version of the graph here.
What's immediately apparent from the network of Axelrod mentioners is a) how polarized it is, and b) how much the pro-Obama side looks like a broadcast network, while the anti-Obama side looks like a dense community. The polarization isn't surprising; Smith says this pattern repeats across many NodeXL scans on topics touching on American politics. The most active people in online politics are from the passionate poles of our two-party system; this is not news, but there are some hints in these particular graphs that the online right is different than the online left, at least on Twitter. More on that in a moment.
First, the pro-Obama side, which is distinguished by people using Obama-friendly hashtags like "yeswewill" and "bettingonAmerica" as well as anti-Romney terms like "bainmitt" and "whatsmitthiding." If you zoom in closely on the people near Axelrod who are interrelated (the green clump in the upper right quadrant around him), you'll find other media players like Ben Smith (@buzzfeedben), Michael Scherer of Time (@michaelscherer) and Ben LaBolt, an Obama spokesman (@benlabolt). But the rest of the network of mentioners around Axelrod aren't part of the media insider circle--there are just lots of Twitter users who follow the Obama campaign's senior adviser who talk him up.
There's a lot of blue lines in this graph — remember, those are links NodeXL builds when one user mentions, replies to, or retweets another. Green lines represent people who follow each other. Given how well known Axelrod is on Twitter, it's not surprising to find so many people who mention him but don't follow each other — hence the prominence of the blue over the green. This is another indication that his reach online is more like a broadcast network or a brand than an interlaced community.
"There is an 'audience' cluster and two 'community' clusters," Smith explains. "Most blue lines (replies, mentions, retweets) point in to the 'community' clusters in which there is significant amounts of mutual connection. The community clusters are dense, while the broadcast or audience cluster is sparse with little interconnection. Axelrod has a powerful broadcast ability and his messages reverberate within the community clusters."
On the top right we have people who are good bets to be anti-Obama, given their prominent use of the "top conservatives on Twitter" hashtag #tcot, as well as anti-Obama terms like "fastandfurious" — a reference to the Justice Department scandal — and "obamatax." This group has no obvious hub, but it's also densely inter-connected. This is typical of conservatives who are heavy Twitter users; a lot of people joined the service starting in early 2007 who were looking for new leadership on the right, and along the way they all found each other.
Now let's look at Eric Fehrnstrom's Twittersphere. A zoomable version is here.
As noted above, comparing Fehrnstrom's online footprint to Axelrod's isn't apples-to-apples. He's not as well-known, he's not on TV as often, and he isn't as big a Twitter user. Unfortunately for the purposes of this article, there isn't an exact parallel to Axelrod in the Romneyverse, at least when it comes to their online activities. So, caveat emptor, this isn't meant to suggest that what one man is doing online is somehow "better" than the other. They're each interesting but for somewhat different reasons.
Again, the most obvious finding from looking at people mentioning "ericfehrn" on Twitter is that they're a polarized group. The ones on the left hand side of the graphic are mostly pro-Romney; the ones on the right are mostly pro-Obama. We know this from the hashtags they're using in common: "obamacare," "obamatax," "fullrepeal" and "teaparty" for the Romneyites; and "p2," "aca" and "whatisromneyhiding" for the Obamanauts. (Smith has noticed that few Republicans ever say "Affordable Care Act," by the way.) On the Obama side of the split, you can zoom in and find individuals like Donna Brazile and Paul Begala, well-known Democratic partisans; on the other side you can find GOP stalwarts like Matt Lewis and Justin Hart.
Interestingly, you can see some signs of Fehrnstrom having a small "broadcast" footprint in the arc-like groups of Twitter users on the outer edge of the left-hand cluster. These are people like the much bigger circle around Axelrod, who mention or retweet things that "@ericfehrn" says, but who aren't really connected in a more intensive way as a community of people who know each other.
Don't Think of an Echo Chamber
When I first reached out to Smith, I thought that mapping the Twitter relationships of political power players like Axelrod and Fehrnstrom might demonstrate just how much Twitter was intensifying the insider culture of Washington and elite politics, to the chagrin of people like Mark McKinnon. But after looking at these graphs, I'm not so sure that that is the only thing going on with Twitter and the presidential campaigns.
Yes, there are a lot of A-list types showing up on these two graphs, the sort of people who have given the Washington-centric White House Correspondents Dinner its current reputation as the "nerdprom" for the media-politics elite, plus some hardworking journalists who are just doing their jobs. But at the same time, there's a much bigger network of influential voices in the political Twittersphere around the Romney and Obama camps, as is reflected by centrality in the graphs of people like @ttjemery ("Christian and conservative" with 29,000 followers), @daggy1 (a self-described "enemy of socialism" with 27,000 followers), or @gaypatriot ("America's leading gay conservative voice"). And this active and highly networked set of people clearly leans right; none of the top independent progressives on Twitter happen to show up as significant participants in these two snapshots.
What that suggests is that after you get past the daily "Bitter Twitter" back-and-forth games that these campaign surrogates are playing with each other and the Beltway media, there's a large secondary group of political activists, most of them conservatives, who are also engaged in chewing on the daily give-and-take of the presidential campaign. A lot of these people know and relate to each other via Twitter. If progressive activists were as engaged on Twitter, then presumably we'd see the same kind of dense green cloud of interconnections around Axelrod on the pro-Obama side of his graph as we do for Fehrnstrom on the pro-Romney side of his. Arguably, the fact that Axelrod is much more of a "celebrity" online may explain some of this difference too.
NodeXL is a powerful tool for exploring the social landscapes of public conversations about all kinds of topics, but it takes time and effort to develop the visual literacy needed to make sense of these graphs. Or, as Smith puts it, "it takes time to go from seeing something that looks like bug splatter on my windshield to 'oh, that looks interesting.'" Over the coming months, we're going to keep exploring this terrain with Smith's help, as he has kindly offered to run some queries on techPresident's behalf. (He also does customized research for a modest fee.) What would you like to explore? We can visualize the network around a person or group of people, a term or hashtag, even a url. The zeitgeist awaits.



Saturday, December 8, 2012

Visualizing Facebook


Super Fan Facebook Graph




[Image: annotated Facebook graph with numbered clusters]
It’s pretty common to have a Facebook graph that consists of a few unconnected or only loosely connected clusters. In the visualization above there are distinct “constellations” of people I play cards with, people I worked with at Microsoft, people I am related to, people I went to high school with, and people I went to college with. Most of the people in those groups are fairly interconnected with each other but not at all connected to people in the other groups, or perhaps only loosely connected. For example the two nodes between 3 and 4 are my brothers who went to the same high school as me and therefore bring the high school and family clusters into proximity, but they aren’t connected to anyone I went to college with or have worked with in the past 10 years. Those clusters are the type of graphs I’d expect to see from the organic use patterns of a typical Facebook user, more or less.
So what’s with the really big blob? That’s the social network of someone who behaves like a very, very passionate fan of reality TV, in particular Big Brother and Survivor. I’ve been managing social media around the Big Brother Live Feeds for a while, and like my customers, I’ve connected to lots of stars and tons of fans. And all of those folks are connected to each other through multiple connections themselves. Over in constellation 6 are people I’ve worked with at RealNetworks, and the nodes connecting that group to the super fan cluster are folks who work directly with customers, like our chat room moderators, my social media team, editors and video hosts, as well as some key folks in program management and operations.
Here’s the key if you’re curious:
  1. Bridge players
  2. Microsoft friends (past gig)
  3. High school friends
  4. Family
  5. Knox College friends
  6. RealNetworks (current gig)
  7. Reality TV fans
Aside from the density of connections, you can see by the size of the circle which nodes are highly connected. The average number of connections in the graph is 55 but that average is thrown off by the folks represented by the very large circles, who have between 150 and 200 connections each. Some of those are stars who have been on the shows–and reality stars tend to be very accessible and connected on both Facebook and Twitter. They don’t just accept friend requests from fans, they will actively and frequently engage. However, it’s worth pointing out that some of the most connected members of my graph are just fans. They are the type of fans who will travel to Las Vegas after the season is over to party with the stars. They know each other very well and are as connected and influential in the fan community (and among my customer base) as any of the stars or casting directors.
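
The point about the average being "thrown off" by a few very large nodes can be illustrated with a toy degree list. The numbers below are hypothetical and are not taken from the author's graph; they only show how a handful of hubs pulls the mean away from what a typical node looks like.

```python
from statistics import mean, median

# Hypothetical degree sequence: eight ordinary nodes plus two hubs.
degrees = [20] * 8 + [150, 200]

avg = mean(degrees)    # pulled upward by the two hubs
mid = median(degrees)  # closer to the degree of a typical node

print(f"mean degree: {avg}, median degree: {mid}")
```

For skewed degree distributions like this one, reporting the median alongside the mean gives a fairer picture of the "ordinary" member of the network.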
Here’s a close up of the Reality TV fans:
[Image: close-up of the reality TV fan cluster]
And here are the high school / family clusters so you can see just how much sparser the connections are:
[Image: high school and family clusters]
Note: This post was an extension of the first week’s lesson of the Coursera course Social Network Analysis taught by Lada Adamic. Thanks, Lada!  The visualization tool I used is Gephi.

Performing Statements


Tuesday, November 27, 2012

Networks of teenage love


Love is a Battlefield Spanning-Tree Network with no 4-Cycles

by KIERAN HEALY



Quick, in high school were you ever told not to date your old girlfriend’s current boyfriend’s old girlfriend? Or your old boyfriend’s current girlfriend’s old boyfriend? Probably not. But I bet you never did, either. This month’s American Journal of Sociology has a very nice paper (subscription only, alas) by Peter Bearman, Jim Moody and Katherine Stovel about the structure of the romantic and sexual network in a population of over 800 adolescents at “Jefferson High” in a midsized town in the midwestern United States. They got a pretty well-bounded population (a high school included in the AddHealth study) and mapped out all the connections between the students. Read on for the lurid details.
The authors found that the observed network isn’t well-represented by existing models, which are mainly concerned with predicting how STDs propagate through populations and have often been based on ego-centered network data. These are surveys where you ask the respondents about their sexual networks, but the respondents aren’t necessarily in the same network. Here’s a picture of four kinds of network:

Core models posit a small group of very sexually-active individuals who occasionally come into contact with (and infect) those outside the core. Bridge models think in terms of an infected component and an uninfected component which join at some point. The biggest network component observed at Jefferson High turned out to be the fourth type, however: a “spanning tree” structure. This is “a long chain of interconnections that stretches across a population, like rural phone wires running from a long trunk line to individual houses … characterized by a graph with few cycles, low redundancy, and consequently very sparse overall density.” When they tried to simulate this bit of the graph structure, the authors found they could get most of the way there using a simple model where the probability of a tie depended on individuals having a preference for others with the same amount of sexual experience as themselves.[1] But simulated networks based on this model didn’t quite match the properties of the observed network. In particular, while the simulations had cycles of length 4, the Jefferson High network did not.
What’s a cycle? If you start at Crooked Timber and click over to Dan Drezner and then click Dan’s link to Mark Kleiman and then return to Crooked Timber via Mark’s link to us, you’ve completed a cycle of length 3: a walk through the network that starts and ends with the same actor and where all the other actors are different and not repeated along the way. Cycles of length 3 are the smallest possible cycles. When it comes to tracing paths through heterosexual relationships, though, the smallest possible cycles are of length 4. In order to make a cycle beginning and ending with yourself, you need two members of the opposite sex plus one intervening individual the same sex as you. It turns out that this kind of cycle is just not found in the Jefferson High network. Although there’s no explicit taboo or social norm against that kind of pattern, nevertheless people just don’t date their old partner’s current partner’s old partner.
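
The 4-cycle idea can be checked mechanically. The sketch below is not the paper's method; it relies on a standard graph fact (a cycle of length 4 exists exactly when two distinct nodes share at least two neighbours) and assumes an undirected graph with no self-loops. The function name and the four students are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def has_four_cycle(edges):
    """Return True if the undirected graph contains a cycle of length 4.

    Assumes no self-loops. A 4-cycle u-x-v-y-u exists exactly when two
    distinct nodes u and v have two different common neighbours x and y.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen_pairs = set()
    for adjacent in neighbours.values():
        # Every pair of this node's neighbours is joined through it.
        for pair in combinations(sorted(adjacent), 2):
            if pair in seen_pairs:  # already joined through another node
                return True
            seen_pairs.add(pair)
    return False


# Hypothetical relationships: a chain of partners with no cycle.
chain = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
# Dan dating Ann (his old girlfriend's current boyfriend's old girlfriend)
# would close the chain into a cycle of length 4.
closed = chain + [("dan", "ann")]

print(has_four_cycle(chain), has_four_cycle(closed))
```

In the closed example, Bob and Dan share two common partners (Ann and Cat), which is exactly the pattern the article says is missing from the Jefferson High network.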
From the perspective of males or females (and independent of the pattern of “rejection”), a relationship that completes a cycle of length 4 can be thought of as a “seconds partnership,” and therefore involves a public loss of status. Most adolescents would probably stare blankly at the researcher who asked boys: Is there a prohibition in your school against being in a relationship with your old girlfriend’s current boyfriend’s old girlfriend? It is a mouthful, but it makes intuitive sense. … For adolescents, the consequence of this prohibition is of little interest: what concerns them is avoiding status loss. But from the perspective of those interested in understanding the determinants of disease diffusion, the significance of a norm against relationships that complete short cycles is profound. The structural impact of the norm is that it induces a spanning tree, as versus a structure characterized by many densely connected pockets of activity (i.e., a core structure).
Individuals constitute social structures, yet those structures have properties that the members do not know about and can’t easily grasp, our vast amount of folk knowledge about our social relations notwithstanding. These properties can have all kinds of serious consequences. The “No 4-cycles” rule is interesting because on the one hand it reflects a very simple bit of structure and it’s not something that’s prohibited in any strong normative sense. I’m not sure I buy the authors’ status-based explanation for it, though. They suggest some alternatives: “‘jealousy’ or the avoidance of too much ‘closeness,’ a sentiment perhaps best described unscientifically as the ‘yuck factor.’” I find the yuck-factor idea more intriguing: I wonder whether it’s more likely to show up at the limits of easily-described network structures. Bigger cycles defy easy verbal description altogether and are also subject to lack of information because some of the ties will be in the past or far away, so they’re not subject to avoidance. Dyadic ties are easy to keep track of. Short cycles are still tricky to grasp, but it’s not that hard, so being able to trace them triggers the taboo-like “yuck” response.
As for consequences, the spanning-tree structures created by experience homophily plus the 4-cycle rule are very effective at propagating diseases along their chains. But they are also easy to break in a way that core-type networks are not:
Under core and inverse core structures, it matters enormously which actors are reached, while under a spanning tree structure the key is not so much which actors are reached, just that some are. This is because given the dynamic tendency for unconnected dyads and triads to attach to the main component, the structure is equally sensitive to a break (failure to transmit disease) at any site in the graph. In this way, relatively low levels of behavior change (even by low-risk actors, who are perhaps the easiest to influence) can easily break a spanning tree network into small disconnected components, thereby fragmenting the epidemic and radically limiting its scope.
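
The fragility argument in the quoted passage can be seen in a toy comparison. The sketch below assumes two tiny hypothetical networks, a chain (the simplest tree) and a fully connected "core", and counts connected components after removing each link in turn. It illustrates the general graph property only and does not reproduce the paper's simulations.

```python
from itertools import combinations


def count_components(nodes, edges):
    """Count connected components of an undirected graph."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    unseen = set(nodes)
    components = 0
    while unseen:
        # Start a new component and explore everything reachable from it.
        stack = [unseen.pop()]
        components += 1
        while stack:
            current = stack.pop()
            for nxt in adjacency[current]:
                if nxt in unseen:
                    unseen.remove(nxt)
                    stack.append(nxt)
    return components


def components_after_removal(nodes, edges, removed):
    """Components left when a single link is broken."""
    return count_components(nodes, [e for e in edges if e != removed])


nodes = list(range(5))
chain = [(i, i + 1) for i in range(4)]   # tree-like: no redundancy
core = list(combinations(nodes, 2))      # dense core: every pair linked

chain_results = [components_after_removal(nodes, chain, e) for e in chain]
core_results = [components_after_removal(nodes, core, e) for e in core]
print(chain_results, core_results)
```

Breaking any single link splits the chain in two, while the dense core stays in one piece whichever link is removed, which is the sense in which a tree is "equally sensitive to a break at any site."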
fn1. Homophily, or the tendency to associate with others with similar traits to oneself, is a powerful social force that explains a great deal about the structure of social networks (in this case, homophily on experience).

Crooked Timber