Showing posts with the label cascada. Show all posts

Tuesday, 26 May 2020

A model of aggression among online users

On the Aggression Diffusion Modeling and Minimization in Online Social Networks


Marinos Poiitis, Athena Vakali and Nicolas Kourtellis
arXiv



Aggression in online social networks has so far been studied mostly with machine learning methods that detect such behaviour in a static context. However, the way aggression diffuses through the network has received little attention, because it poses modelling challenges. Modelling how aggression propagates from one user to another is an important research topic, since it can enable effective aggression monitoring, especially on media platforms that currently apply simplistic user-blocking techniques. In this paper, we focus on how to model aggression propagation on Twitter, since it is a popular microblogging platform on which aggression has had several onsets. We propose various methods building on two well-known diffusion models, Independent Cascade (IC) and Linear Threshold (LT), to study the evolution of aggression in the social network. We experimentally investigate how well each method can model aggression propagation using real Twitter data, while varying parameters such as the selection of users for model seeding, the weighting of user edges, the timing of user activation, etc. Under the proposed approach, the best performing strategies are those that select seed users with a degree-based approach, weigh user edges based on the overlap of their social circles, and activate users according to their aggression levels. We further employ the best performing models to predict which ordinary real users could become aggressive (and vice versa) in the future, and achieve up to AUC = 0.89 in this prediction task. Finally, we investigate aggression minimization methods by launching competitive cascades to "inform" and "heal" aggressors. We show that the IC and LT models can be used in aggression minimization, thus providing less intrusive alternatives to the blocking techniques currently employed by popular online social network platforms.
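To make the Independent Cascade idea concrete, here is a minimal sketch of one IC run with degree-based seeding and an overlap-based edge weight. It is not the authors' implementation: the function names, the toy graph and the choice of Jaccard overlap as the "social circle" weight are assumptions made for illustration only.

```python
import random

def jaccard_weight(circle_u, circle_v):
    """Overlap of two users' social circles (assumed edge weighting)."""
    union = circle_u | circle_v
    return len(circle_u & circle_v) / len(union) if union else 0.0

def degree_seeds(graph, k):
    """Pick the k users with the most outgoing edges as seeds."""
    return sorted(graph, key=lambda u: len(graph[u]), reverse=True)[:k]

def independent_cascade(graph, seeds, rng):
    """Run one Independent Cascade simulation.

    graph maps each user to {neighbour: activation probability}.
    Returns the set of users that are active when the cascade stops.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for user in frontier:
            for neighbour, prob in graph.get(user, {}).items():
                # A newly activated user gets a single chance per neighbour.
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active

# Toy graph (hypothetical); probabilities of 1.0 make the run deterministic.
toy = {"a": {"b": 1.0, "c": 1.0}, "b": {"d": 1.0}, "c": {}, "d": {}}
seeds = degree_seeds(toy, 1)
print(sorted(independent_cascade(toy, seeds, random.Random(0))))
```

In a real experiment the edge probabilities would come from data, for instance from `jaccard_weight` applied to the two users' follower sets, and the run would be repeated many times to average out the randomness.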



Saturday, 28 July 2018

Cascade amplification by assortativity

Degree Correlations Amplify the Growth of Cascades in Networks

Xin-Zeng Wu, Peter G. Fennell, Allon G. Percus, Kristina Lerman

Networks facilitate the spread of cascades, allowing a local perturbation to percolate via interactions between nodes and their neighbours. We investigate how network structure affects the dynamics of a spreading cascade. By accounting for the joint degree distribution of a network within a generating function framework, we can quantify how degree correlations affect both the onset of global cascades and the propensity of nodes of a specific degree class to trigger large cascades. However, not all degree correlations are equally important in a spreading process. We introduce a new measure of degree assortativity that accounts for correlations among the nodes relevant to a spreading cascade. We show that the critical point defining the onset of global cascades has a monotone relationship to this new assortativity measure. In addition, we show that the choice of nodes to seed the largest cascades is strongly affected by degree correlations. Contrary to traditional wisdom, when degree assortativity is positive, low-degree nodes are more likely to generate the largest cascades. Our work suggests that it may be possible to tailor spreading processes by manipulating the higher-order structure of networks.
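For readers unfamiliar with degree assortativity, the sketch below computes the standard coefficient: the Pearson correlation between the degrees found at the two ends of each edge. This is the classical measure, not the new cascade-specific measure the abstract introduces, and the function name is an assumption.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each undirected edge.

    edges is a list of (u, v) pairs of a simple undirected graph.
    Returns NaN when every node has the same degree (zero variance).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")

# A star is perfectly disassortative: the hub only touches leaves.
print(degree_assortativity([(0, 1), (0, 2), (0, 3)]))
```

Positive values mean that high-degree nodes tend to link to other high-degree nodes; negative values mean hubs mostly attach to low-degree nodes.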



(or arXiv:1807.05472v1 [physics.soc-ph] for this version)



Thursday, 24 May 2018

Diffusion protocols affect Facebook cascades

Do diffusion protocols govern cascade growth on Facebook?

Justin Cheng, Jon Kleinberg, Jure Leskovec, David Liben-Nowell, Bogdan State, Karthik Subbian, Lada Adamic

Figure 1: The diffusion tree of a cascade with a volunteer diffusion protocol, where individuals posted music by an artist whose name matched the letter assigned to them by a friend. Edges are coloured from red (early) to blue (late).

Large cascades can develop in online social networks as people share information with one another. Although simple reshare cascades have been studied extensively, the full range of cascading behaviours on social media is much more diverse. Here we study how diffusion protocols, or the social exchanges that enable information transmission, affect cascade growth, analogous to the way communication protocols define how information is transmitted from one point to another. Studying 98 of the largest information cascades on Facebook, we find a wide range of diffusion protocols, from cascading reshares of images, which use a simple protocol of tapping a single button for propagation, to the ALS Ice Bucket Challenge, whose diffusion protocol involved individuals creating and posting a video and then nominating others to do the same. We find recurring classes of diffusion protocols and identify two key counterbalancing factors in the construction of these protocols, with implications for a cascade's growth: the effort required to participate in the cascade and the social cost of staying on the sidelines. Protocols that require greater individual effort slow down a cascade's propagation, while those that impose a greater social cost for not participating increase the cascade's likelihood of adoption. The predictability of transmission also varies with the protocol. But regardless of mechanism, all the cascades in our analysis have a similar reproduction number (≈ 1.8), meaning that lower rates of exposure can be offset by higher per-exposure rates of adoption. Finally, we show how a cascade's structure can not only differentiate these protocols but also be modelled through branching processes. Together, these findings provide a framework for understanding how a wide variety of information cascades can achieve substantial adoption across a network.
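The trade-off between exposure and adoption can be illustrated with a simple branching process. In the sketch below, which is an illustration under stated assumptions rather than the authors' model, every adopter exposes a fixed number of people and each exposed person adopts independently with a fixed probability. The function names and the two example protocols are hypothetical; only the reproduction number of about 1.8 comes from the abstract.

```python
import random

def reproduction_number(exposures_per_adopter, adoption_prob):
    """Expected number of new adopters produced by one adopter."""
    return exposures_per_adopter * adoption_prob

def simulate_generations(exposures_per_adopter, adoption_prob, rng, max_generations=8):
    """Simulate generation sizes of a branching process starting from one adopter."""
    sizes = [1]
    while len(sizes) <= max_generations and sizes[-1] > 0:
        trials = sizes[-1] * exposures_per_adopter
        # Each exposed person adopts independently with the same probability.
        adopters = sum(1 for _ in range(trials) if rng.random() < adoption_prob)
        sizes.append(adopters)
    return sizes

# Two hypothetical protocols with the same reproduction number:
# many low-effort exposures versus few exposures with high social pressure.
rng = random.Random(1)
for exposures, prob in [(18, 0.1), (3, 0.6)]:
    r = reproduction_number(exposures, prob)
    print(exposures, prob, round(r, 2), simulate_generations(exposures, prob, rng))
```

Both settings grow at the same expected rate per generation, which is the sense in which a lower exposure rate can be offset by a higher adoption rate per exposure.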






arXiv

Wednesday, 7 February 2018

Detecting cascades on Facebook

Detecting Large Reshare Cascades in Social Networks

International Conference on World Wide Web
By: Karthik Subbian, B. Aditya Prakash, Lada Adamic
Facebook Research


Abstract

Detecting large reshare cascades is an important problem in online social networks. There have been a variety of attempts to model this problem, from time series analysis methods to stochastic processes. Most of these approaches depend heavily on features of the underlying network and use network information to detect the virality of cascades. In most cases, however, obtaining such detailed network information can be hard or even impossible.

In contrast, in this paper we propose SansNet, a network-agnostic approach. Our method can be used to answer two important questions: (1) Will a cascade go viral? and (2) How early can we predict it? We use techniques from survival analysis to build a supervised classifier in the space of survival probabilities, and we show that the optimal decision boundary is a survival function. A notable feature of our approach is that it does not use any network-based features for the prediction tasks, which makes it very cheap to implement. Finally, we evaluate our approach on several real-life data sets, including popular social networks such as Facebook and Twitter, on metrics such as recall, F-measure and breakout coverage. We find that the network-agnostic SansNet classifier outperforms several non-trivial competitors and baselines that use network information.
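As background for the survival-analysis vocabulary, the sketch below builds an empirical survival function, S(t), the fraction of observations that exceed t. It is only the basic building block, computed without censoring, and is not the SansNet classifier; the function name and the toy sizes are assumptions.

```python
def empirical_survival(values):
    """Return S, where S(t) is the fraction of observed values greater than t."""
    observed = list(values)
    n = len(observed)
    if n == 0:
        raise ValueError("need at least one observation")

    def survival(t):
        return sum(1 for v in observed if v > t) / n

    return survival

# Toy cascade sizes (hypothetical, not real data).
S = empirical_survival([1, 2, 2, 5, 10])
for t in (0, 2, 5, 10):
    print(t, S(t))
```

A survival-based detector compares curves like this one, built from reshare counts or reshare times only, which is why no network information is needed.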



Saturday, 29 March 2014

Can cascades on Facebook be predicted?

The curious nature of sharing cascades on Facebook
Most content on Facebook is shared a few times, but some can be shared millions of times. Now computer scientists are beginning to understand the difference.




One of the defining features of social content is the way images, video and text are shared among many users. Inevitably, some content becomes more popular than the rest, and this leads to cascades in which the number of reshares can be enormous. Although most pieces of media get only a few shares, some have been shared many millions of times.

So there is a lot of interest in knowing how to predict what is likely to become very popular compared with what is not. On the face of it, it is easy to think that predicting the popularity of content is nearly impossible. That's because it depends on many factors that are hard to measure, such as the nature of the content and the connectivity of the people who see it.

Nevertheless, several teams claim to have found ways to predict a post's eventual popularity by analysing its popularity shortly after it is published. But given the absence of a reliable way of doing this on the web, you can judge for yourself how well these mechanisms must work.

Today, we get a different take on the question of predictability thanks to the work of Justin Cheng at Stanford University in California, along with a few pals at Facebook and Cornell University. These guys show why popularity is so hard to predict with the conventional approach of studying its early stages.

But at the same time, they show that various features of a cascade can be predicted with remarkable accuracy and that this can be used to make successful judgments about the future behaviour of cascades once they have started. The result is a much deeper insight into the nature of cascades than might initially be thought possible.

Cheng and colleagues reach their conclusions by analysing the way photographs were shared on Facebook during a 28-day period following their initial upload in June 2013. They looked at over 150,000 photos that together were reshared more than 9 million times. The data told them which people (nodes) reshared each photograph and at what time, and this allowed them to reconstruct exactly the networks through which the reshares occurred.

In the past, researchers have looked at how large cascades start and then tried to use that information to spot large cascades in the future, with mixed results.

Cheng and colleagues take a different approach. They start with a photo that has been reshared a certain number of times, say k. They then determine the probability that this photo will be shared twice as many times. In other words, their task is to predict whether the cascade will double in size.

That's a good choice of question because the distribution of cascade sizes follows a certain kind of power law. This law ensures that for cascades of a given size, half will go on to more than double in size while the other half will not. So when deciding whether a given cascade will double, a random guess will get the right answer about half the time.
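The "half will double" property can be checked numerically. The sketch below assumes a continuous Pareto distribution whose tail falls off as 1/x, that is, P(size ≥ x) = 1/x for x ≥ 1. The article only says "a certain kind of power law", so this exponent is an assumption chosen because it makes the doubling probability exactly one half for every k; the sizes are simulated, not Facebook data.

```python
import random

def doubling_fraction(sizes, k):
    """Among cascades that reached size k, the fraction that reached 2k."""
    reached = [s for s in sizes if s >= k]
    if not reached:
        return float("nan")
    return sum(1 for s in reached if s >= 2 * k) / len(reached)

# Simulated cascade sizes with tail P(size >= x) = 1/x (assumed exponent).
rng = random.Random(0)
sizes = [rng.paretovariate(1.0) for _ in range(200_000)]

for k in (5, 20):
    print(k, round(doubling_fraction(sizes, k), 3))
```

With this tail, P(size ≥ 2k | size ≥ k) = (1/2k) / (1/k) = 1/2 whatever the value of k, which is why the doubling question gives a balanced classification task at every cascade size.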

The question is whether it is possible to pick out features of the data set that allow a machine learning algorithm to do better than this. So Cheng and pals use a portion of their data to train a machine learning algorithm to look for features of cascades that make them predictable.

These features include the type of image (whether it is a close-up or an outdoor shot, whether it has a caption, and so on); the number of followers the original poster has; the shape of the cascade that forms, whether a simple star graph or a more complex structure; and finally how quickly the cascade takes place, its speed.

Having trained their algorithm, they used it to see whether it could make predictions about other cascades. They started with images that had been shared only five times, so the question was whether they would eventually be shared more than 10 times.

It turns out that this is surprisingly predictable. "For this task, random guessing would obtain a performance of 0.5, while our method achieves surprisingly strong performance: classification accuracy of 0.795," they say.

And some features of the cascade are much better predictors than others. In fact, the temporal performance of the cascade, how quickly it spreads, is the best predictor of all. So something that spreads quickly to begin with is likely to spread further.

Another important factor is the topics mentioned in the caption associated with an image, for example whether it is newsworthy or associated with a current meme.

Cheng and co-authors also say that prediction becomes easier as the number of reshares increases. "This shows that more information is always better: the greater the number of observed reshares, the better the prediction," they say.

And that is why previous efforts have fallen short: they have largely started with very little information.

There are limitations to the work, of course. The most obvious is that it was done only with photos shared entirely within Facebook. It may be that reshares on Facebook differ somewhat from those that occur elsewhere on the web, and that photos are treated differently from links to stories, for example.

But Cheng and co-authors are confident that much of what they found will be useful elsewhere. "Despite these limitations, we believe that the results give general insights that will be useful in other settings," they say.

And that leaves plenty of interest for other researchers to follow up. Cheng and colleagues have stumbled on a rich vein of information about the nature of cascades in social networks. And there's more gold in them thar hills.

Ref: arxiv.org/abs/1403.4608 : Can Cascades Be Predicted?

MIT Technology Review


Friday, 8 March 2013

Viral cascades on Twitter

Viral Search: Identifying & Visualizing Viral Content
Video of the Week: What does it mean for online content to "go viral"? An analysis of almost a billion information cascades on Twitter news, videos, and photos has produced the first quantitative notion of whether something has indeed gone viral, thereby enabling further research into topic experts, trending topics, and viral-incident metrics.

Wednesday, 30 January 2013

Those who sit up front get better grades


The rich club phenomenon in the classroom

Nature Scientific Reports 3, Article number: 1174
doi:10.1038/srep01174

We analyse the evolution of the online interactions held by college students and report on novel relationships between social structure and performance. Our results indicate that more frequent and intense social interactions generally imply better score for students engaging in them. We find that these interactions are hosted within a “rich-club”, mediated by persistent interactions among high performing students, which is created during the first weeks of the course. Low performing students try to engage in the club after it has been initially formed, and fail to produce reciprocity in their interactions, displaying more transient interactions and higher social diversity. Furthermore, high performance students exchange information by means of complex information cascades, from which low performing students are selectively excluded. Failure to engage in the rich club eventually decreases these students' communication activity towards the end of the course.


Introduction




More than 1.2 million students drop out of school every year in the U.S., one every 26 seconds [1]. Year 2007 dropouts will cost more than $300 billion in lost wages, taxes and productivity to the U.S. Dropouts contribute about $60,000 less in federal and state income taxes. Each cohort of dropouts costs the U.S. $192 billion in lost income and taxes [2]. A dropout student is more than 8 times as likely to be in jail or prison as a high school graduate and nearly 20 times as likely as a college graduate [3].
Early detection of poor performance will allow more time to take corrective actions and will likely help to reduce the number of dropouts. Therefore, it is of the utmost importance to be able to assess the performance of students in a continuous manner.
Computer science is not unaware of this need for close follow-up of students. Computer Supported Collaborative Learning (CSCL) is a branch of computer science that intersects with pedagogy and social sciences. Indeed, one of the goals of CSCL is to explore appropriate methods/tools for evaluating collaboration so that more insight can be gained into the results of lecturing/teaching procedures [4].
However, systematic gathering and analysis of educational data in natura has only recently started. So far this analysis has mainly tried to determine static structural features of the social learning network formed by the students. For instance, Nurmela et al. looked at the structure of the interactions trying to determine the central actors in a CSCL environment [5]. In this social structure, “key communicators” were assumed to be the most connected individuals in time-aggregated networks [6]. Similar analyses were carried out by Martínez et al. [7] and Chen and Watanabe, who focused on other structural parameters that are important for the final score: group structure, member's physical location distribution, and member's social position [8].
Beyond this merely static structural analysis, the literature also highlights the key role of student interaction for effective learning. At a societal scale, Granovetter's pioneering work [9] recognised the importance of interaction patterns and proposed his well-known “strength of weak ties” phenomenon, where he hypothesised that isolated social ties offer limited access to external prospects, while heterogeneous social ties diversify one's opportunities.
While the relevance of the social network structure and interactions has been widely recognised in the educational context [10], some other factors have recently been under the spotlight, e.g. social acceptance or willingness to communicate [11]. In general, it is not just about knowing “who” the students interact with, but “how” and “when” they do it and, importantly, what is the result of these interactions with regards to the educational outcome [12].
Preliminary answers to the “how” question come from different works. The effects of analysing the relationships between web forum users on the structure of the network (reconstructed from the messages sent) were studied in [13, 14]. Also, the type of interaction or content being exchanged has been considered [6, 16]. However, these previous analyses were based on a static snapshot of the structure and interactions of the network at some point in time or included a reduced number of samples. For instance, [7] analysed these macroscopic metrics in the four different assignments the course was structured in (approximately once a month).
Acquiring full knowledge on “how” students interact would be facilitated by having access to dynamic interactions and their changes with time. Timing is a determinant element to understand the correspondence between student behaviour and performance. Therefore, this paper tries to determine the individual and group-level behavioural patterns that lead to low scoring and possible dropout. Gaining insight into these data could help in identifying “groups at risk”, enabling educators to act sooner and hopefully reduce dropout rates.
The rest of this paper is organised as follows. The next section presents the main results obtained from our analysis. This is followed by a broader discussion.

Results



We analysed a record of college student interactions and compared social interaction data with the academic scores of the students (see third paragraph of Course Details in Methods in the Supplementary Information (SI) for a concrete definition of what an interaction is in this context) and how this relationship evolves with time. To this end, we analysed records of 80,000 interactions by 290 students, approximately 16 times more interactions with almost 3 times more students than previous studies on educational networks in natura [5, 6, 7, 8, 10, 12, 15]. Even so, the data can still be considered to be sparse (~4.6 interactions per person per day). This sparseness is partly due to the fact that our work does not include verbal in-classroom interactions or other communication mechanisms, like discussion groups, that are typical in most universities.
Figure 1A shows a snapshot of the social graph for one of the classes being analysed. Supplementary video S1 offers a complete weekly sequence of interactions between students in one of the courses we analysed.

Figure 1: Diversity and Assortativity Analysis.


(A) shows a graph of one of the analysed courses including 82 students at the end of the last week of the course. Continuous thick blue edges indicate persistent interactions while dotted thin grey edges indicate transient interactions. High performing students are shown in dark blue, mid performing ones in red and low performing ones in green. As can be observed, high performance students form a “core” where the highest density of persistent interactions can be observed. Low performance students remain in the periphery of the graph, mainly holding transient interactions. (B) Scatter plot and linear regression for one of the variables analysed (number of interactions) vs. scoring in one of the classes (R² = 0.72). (C) Scatter plot and linear regression for social diversity vs. scoring in one of the classes (R² = 0.12). (D) Ratio of transient to persistent interactions obtained for different groups of students with different levels of interaction (LOW, MID, HIGH).

Diversity and assortativity analysis

Our first finding is that, in this environment, social diversity is negatively correlated with performance. This is explained by our second finding: high performing students interact in groups of similarly performing peers. This effect is stronger the higher the performance of the student. Indeed, low performance students tend to initiate many transient interactions regardless of the performance of the students they interact with. These interactions held by low performance students start late in the course, allowing high performers to establish a closely knit group. In the following, we give details of these findings.
We start by comparing the score of each student with diversity metrics associated with the interactions held by each member of the social network (as shown in the SI). We characterise the nature and diversity of interaction ties within an individual's social network. Specifically, social diversity is defined as Shannon's entropy associated with individual communication behaviour, normalised to the total number of interactions (see Methods in SI for more details). Since both Shannon's entropy and the total number of interactions depend on the degree (number of connections), this normalisation reduces the correlation between low degree and high social diversity (see Figure S1 in Supplementary material).
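The diversity measure described above can be sketched as follows. The exact normalisation is given in the paper's SI, which is not reproduced here, so this sketch makes an assumption: it divides the Shannon entropy of a student's interaction shares by the logarithm of the student's total number of interactions, which keeps the value between 0 and 1. The function name and the example counts are hypothetical.

```python
from math import log

def social_diversity(interaction_counts):
    """Shannon entropy of a student's interaction shares, normalised (assumed form).

    interaction_counts maps each contact to the number of interactions with
    that contact. The entropy is divided by log(total interactions); this
    normaliser is an assumption, since the exact definition is in the SI.
    """
    total = sum(interaction_counts.values())
    if total <= 1:
        return 0.0
    entropy = -sum(
        (count / total) * log(count / total)
        for count in interaction_counts.values()
        if count > 0
    )
    return entropy / log(total)

# Four single contacts versus four interactions with the same contact.
print(social_diversity({"a": 1, "b": 1, "c": 1, "d": 1}))
print(social_diversity({"a": 2, "b": 2}))
```

Under this assumed normalisation, a student who spreads single messages over many contacts scores close to 1, while a student who repeatedly interacts with the same few peers scores lower, which matches the direction of the effect reported in the text.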
The number of connections (students that a student has interacted with) and number of interactions (times a student has contacted or been contacted by other students) (see Methods in SI) were both positively correlated with the final score of the student (Pearson's correlations of 0.81 and 0.85, respectively; p < 0.01), as shown in Figure 1B. Principal component analysis of these metrics revealed that they were closely interrelated, resulting in a non-significant improvement when combined (see Methods in SI). However, social diversity negatively correlated with final scores (Pearson's correlation of –0.34, p < 0.01) (Figure 1C). The reader is reminded that correlation does not imply causation and that diversity cannot be regarded as the cause of low score from these results.
To further analyse the effects on score, students were grouped into high (> 6.5), mid (between 6.5 and 3.5) and low (< 3.5) scoring (scores in Spain are typically given on a 0–10 scale, with 10 being the top score). To verify the suggested existence of less effective interactions, we also classified the interactions into two types: 1) persistent, those sustained over time, and 2) transient, those not reciprocated within a week. We found that at the end of the course up to 28 ± 12% of the interactions held by high performing students were persistent, which is statistically different from those held by mid (14 ± 5%) or low (1 ± 0.5%) performance students (n = 290, p < 0.05).
We analysed the average ratio of transient to persistent interactions per neighbour: a higher number indicated less targeted interactions. This is illustrated in Figure 1D for one of the three classes under analysis (results were similar for the other two classes).
The presence of more focused and sustained interactions did not stop high scoring students from interacting with fellow students with mid or low scores in a transient manner (similar number of transient interactions regardless of the score). An assortativity analysis [17] on these persistent interactions with regards to score indicated the existence of preferential interaction initiation (r = 0.5, p < 0.05 by using the Jackknife method, see Methods in SI). In other words, similarly scoring students tended to keep persistent interactions only between themselves.
This assortative behaviour with regards to scoring is highly suggestive of a “rich club” phenomenon (see Methods in SI and [18, 19]). A “rich club” is defined as a set of nodes with degree larger than k that tend to be more densely connected among themselves than the nodes with degree smaller than k. When we performed this analysis taking all the types of interaction into account, we could observe no “rich club” effect ([…] for the students with more links, indicating they also interacted with students outside the “rich club”). However, when only persistent interactions were taken into account, we obtained […], which is in line with the idea of high scoring students keeping persistent interactions between themselves as indicated by our assortativity analysis. The “rich club” phenomenon could not be observed during the first weeks, φ(r) ≪ 1, and it became apparent only after week 4–5 for the top performing students, remaining stable afterwards.
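The rich-club definition can be made concrete with the usual unnormalised coefficient: the density of links among the nodes whose degree exceeds k. This is a minimal sketch of the standard formula, not the procedure from the paper's Methods; comparisons against 1, as in the text, normally refer to a version normalised by randomised networks, which is not computed here. The function name and toy graph are assumptions.

```python
from collections import Counter

def rich_club_coefficient(edges, k):
    """Density of links among nodes with degree greater than k.

    edges is a list of (u, v) pairs of a simple undirected graph
    (no self-loops, no repeated edges). Returns NaN when fewer than
    two nodes exceed degree k.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    rich = {node for node, d in degree.items() if d > k}
    if len(rich) < 2:
        return float("nan")
    links = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * links / (len(rich) * (len(rich) - 1))

# Toy graph (hypothetical): a fully connected core of three nodes,
# each with one peripheral contact.
toy_edges = [(0, 1), (1, 2), (0, 2), (0, 3), (1, 4), (2, 5)]
print(rich_club_coefficient(toy_edges, 1))
print(rich_club_coefficient(toy_edges, 0))
```

In the toy graph the three core nodes are fully linked to one another, so the coefficient for k = 1 is much higher than the overall density obtained with k = 0.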

Temporal analysis

One interesting finding is that the total number of interactions per week (normalised to the maximum value in all weeks) for all groups increased over time and it saturated around week 6 for mid performing students and around week 4 for high performing students (Figure 2A). In both cases, the number of persistent and transient interactions increased until saturation as the weeks went by. However, the number of interactions for low scoring students behaved in a strikingly different manner. The total number of interactions increased until week 4, when it started to drop steadily until the end of the course (Figure 2A). We believe this may be due to a lack of incentives to interact as revealed by our reciprocity measurements (see two paragraphs below).


Figure 2: Persistent Interaction Analysis.
(A) Temporal evolution of the total number of interactions in all groups. The y-axis indicates the number of interactions per group per week normalised to the value of the week when the maximum number of interactions was recorded for that group. This figure pools normalised data from all three courses available. High performing students start to interact earlier and keep interacting throughout the whole course. The same applies to mid performing students, although their interactions start a bit later in the course. Low performing students start interacting later than high performing ones and their interactions drop with time. The maximum values used for normalising these curves were 150, 36, 57 and 63 for all, high, mid and low interactions, respectively. (B, C and D) Evolution of the % of persistent interactions (relative to the average total # of interactions of that group) per week and per student group (low, (B); mid, (C); and high, (D)) relative to the total number of interactions per group per week. Continuous lines represent the fit of a curve to the points as indicated in Methods. As can be observed, the % of persistent interactions increases as the course progresses for all groups of students. High performing students achieved a higher % of persistent interactions than mid and low performing ones.
A closer look at the data revealed that the percentage of persistent interactions increased in all groups, but with different timing, as shown in the persistent interaction analysis (see Figure 2B, C, D). As indicated in Table 2, the midpoint for the sigmoid function was 6.08, 4.81 and 3.2 weeks for low, mid and high performing students (p < 0.05). This suggested that high performing students on average established persistent interactions before mid and low performance students did (1 and 2 weeks earlier, respectively). Also, mid performing students started to establish persistent interactions 1 week before low performance students did. If one takes the slope of the sigmoid as a reference, it can be observed that there was no significant difference in the rate of change from a “low interaction mode” to a “high interaction mode” between mid and high performing students (0.58 vs. 0.4769). These data are in line with those on the number of connections, interactions and attendance (Figure 3A, B and C), which showed that low performance students tried to engage later in the course, while mid and high performing students started their interactions earlier. These data are aligned with the number of students that stopped delivering their assignments and therefore did not pass the course. The average percentage of students dropping the course was 24.5%, 31.5% and 0% for low, mid and high performance students, respectively. ~80% of these dropouts occurred after the 9th week of the course. The higher attendance level of high performing students may also be causing the higher number of persistent interactions, although our analysis does not let us conclude any causal relationship.
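To show what the sigmoid midpoint means, the sketch below evaluates a logistic curve for each group using the midpoints quoted above (6.08, 4.81 and 3.2 weeks). The functional form, the common slope of 0.5 and the plateau normalised to 1 are assumptions for illustration; the fitted form and constants are in the paper's Methods and Table 2, which are not reproduced here.

```python
from math import exp

def sigmoid(week, midpoint, slope, plateau=1.0):
    """Logistic curve: reaches half of its plateau exactly at the midpoint."""
    return plateau / (1.0 + exp(-slope * (week - midpoint)))

# Midpoints in weeks as quoted in the text; the slope is hypothetical.
MIDPOINTS = {"low": 6.08, "mid": 4.81, "high": 3.2}
ASSUMED_SLOPE = 0.5

for group, midpoint in MIDPOINTS.items():
    curve = [round(sigmoid(week, midpoint, ASSUMED_SLOPE), 2) for week in range(1, 11)]
    print(group, curve)
```

A smaller midpoint shifts the whole curve to the left, which is the sense in which high performing students reach their persistent-interaction regime earlier in the course.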


Table 1: Summary of the cascade analysis performed across the three groups of students (p < 0.05 between any two groups)


Table 2: Sigmoid Fitting Results. Constants obtained on fitting a sigmoid curve to the data


Figure 3: Course Data Details.
(A) Shows the evolution of the degree of the nodes in the graph per week per scoring group for all three courses. (B) Number of actual communications held per day on a given week grouped per scoring group. (C) An estimation of the attendance of the students to the course, based on the number of log-ons performed on any day in that week in any of the systems available for them to communicate. As can be observed, the degree remained almost constant for mid and high performing students, while it started to increase around week 4 and slowly declined later on for low performance students. This same pattern is observed for the number of interactions held by the students. These data are consistent with our estimation of “attendance”, where low performing students have a significantly lower number of logins into the system. All panels show data from one of the courses under study only. The whiskers in the Figure show the estimated error in the mean.
Taking data on increasing percentage of persistent student interactions together with the assortativity analysis (students preferred to interact with those who have similar scores/performance), our results suggested that at some point reciprocity Ri,j (measured as the fraction of times a student i in any given group responds to a student j outside her same group) should start to drop. However, reciprocity remained unchanged with time and was similar between groups (  0.7). By analysing the direction of the initiation of the interaction we could see that persistent interactions held between members of different groups are highly symmetric (having almost even initiations starting from both ends). On the contrary, transient interactions between members of different groups are almost always initiated by the student with lower performance (with 0.87 probability). In addition, the timing of responses was different. While persistent interactions are responded in 8.1 ± 0.3 hours on average, the response time for transient interactions is delayed 7.21 ± 0.46 days.
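As an illustration of the reciprocity measure described above, here is a minimal sketch that computes, per group, the fraction of cross-group contact attempts that received a reply. The log format, the function name `cross_group_reciprocity` and the sample records are hypothetical; the paper's exact operationalisation is not given in this section.

```python
def cross_group_reciprocity(contacts, group_of):
    """Fraction of cross-group contact attempts that received a reply.

    contacts: iterable of (initiator, recipient, replied) tuples (assumed log format).
    group_of: dict mapping student id -> performance group label.
    Returns a dict keyed by the recipient's group.
    """
    answered, total = {}, {}
    for initiator, recipient, replied in contacts:
        if group_of[initiator] == group_of[recipient]:
            continue  # only interactions across groups are counted
        group = group_of[recipient]
        total[group] = total.get(group, 0) + 1
        answered[group] = answered.get(group, 0) + (1 if replied else 0)
    return {group: answered[group] / total[group] for group in total}

# Made-up records for illustration only.
group_of = {"a": "low", "b": "high", "c": "mid", "d": "low"}
contacts = [
    ("a", "b", True), ("a", "b", False), ("c", "b", True),
    ("c", "b", True), ("b", "a", True), ("a", "d", True),
]
reciprocity = cross_group_reciprocity(contacts, group_of)
```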
This could be indicating that low performance was due to either a lack of interest of the students or just that no valuable content was conveyed in these delayed interactions. Since the content of these interactions was not logged, we restricted ourselves to find whether there was any differences in the way content flowed between students and groups of students.

Information cascades

Information cascades reveal spread mechanisms in which an action or idea becomes adopted due to the influence of others, typically neighbours in some network. A well-known example is that of cascades in large product recommendation networks [21, 22, 23, 24].
In order to detect the presence of information cascades and determine the actual value of the communication, we needed to gain insight on the content of the messages exchanged by students. Since this would be a clear violation of students' privacy, we decided to analyse another source of information: file exchange of students in their home directories and in their Moodle and collaborative workspace accounts (see “Information Cascades” in Methods in SI).
We defined as trivial cascades those implying a single transfer (a single originating source and a single destination) of information about the course, and as non-trivial cascades those with more complex patterns. We found a total of 845 cascades, of which 53.37% were trivial cascades (T1 in Figure 4), 25% were non-trivial cascades involving transfer from a single source to many destinations in the same time frame, and the remaining 11% were topologically more complex.


Figure 4: Information Cascades.
Most Frequent Cascades for Low Performing (A) and High Performing (B) students. Students initiating, relaying or receiving a document were supposed to be part of the cascade. As can be observed, high performance students keep more complex information cascades when sharing documents in the systems available. Low performing students use a more straightforward “relay” strategy, forwarding documents to other students.
The total number of cascades was significantly different across all three groups: 51%, 35.97% and 13.03% for high, mid and low performance students, respectively (see Table 1).
Our data revealed that the length of the cascade (number of synchronous transfers) gradually increased as the average score of the students involved in the cascade increased. This is also supported by the fact that among non-trivial cascades, the most common pattern for low performance students was star-like (T2 and T3 in Figure 4, 97.8%), while chained cascades (T4, T5 and T6 in Figure 4) were more common for mid (53.82%) and high (76.29%) performing students.
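The distinction between trivial, star-like and chained cascades can be sketched as a small classifier over the file transfers that make up one cascade. This is an illustrative assumption rather than the authors' procedure: the function name `classify_cascade`, the (sender, receiver) representation and the four labels are hypothetical, and the patterns T1 to T6 of Figure 4 are not reproduced here.

```python
def classify_cascade(transfers):
    """Label a cascade given its (sender, receiver) file transfers.

    'trivial': a single transfer; 'star': one sender, several receivers;
    'chain': each student relays to exactly one other, forming a single path;
    'complex': anything else.
    """
    edges = set(transfers)
    if len(edges) == 1:
        return "trivial"
    senders = {s for s, _ in edges}
    receivers = {r for _, r in edges}
    if len(senders) == 1:
        return "star"
    next_hop = {}
    for sender, receiver in edges:
        if sender in next_hop:
            return "complex"  # a relay with two receivers is not a simple chain
        next_hop[sender] = receiver
    roots = senders - receivers
    if len(roots) != 1:
        return "complex"
    node = next(iter(roots))
    seen = {node}
    steps = 0
    while node in next_hop:  # walk the path from the originating student
        node = next_hop[node]
        if node in seen:
            return "complex"
        seen.add(node)
        steps += 1
    return "chain" if steps == len(edges) else "complex"
```

For example, a document forwarded a → b → c is labelled a chain of length two, while a sending the same document to b and c is labelled a star.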

Discussion




Being limited to non-verbal interactions between students prevented us from capturing a wealth of valuable interactions and led to some sparseness in our data. We combined fine-grained educational data at unprecedented temporal resolution in educational settings (~4.6 events per student per day) and gained insight into the type of interaction patterns that are associated with lower performance.
The major finding is that a higher number of online interactions (independently of the number of distinct students involved) is usually an indicator of higher score.
Our data show that increased social diversity is negatively correlated with high scores; most diversity metrics are correlated with the degree of the vertices (e.g. Shannon's entropy or topological diversity as in [25]) and this may lead one to think that social diversity is high in low performing students because their number of connections (degree) is low. We minimised this effect by normalising Shannon's entropy by the degree.
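A minimal sketch of a degree-normalised Shannon entropy follows. It assumes one common normalisation, dividing the entropy of a student's interaction shares by the logarithm of the degree; the exact formula used in the study is described in the SI and is not confirmed here, and the name `normalised_diversity` is hypothetical.

```python
import math

def normalised_diversity(interaction_counts):
    """Shannon entropy of a student's interactions, divided by log(degree).

    interaction_counts: dict neighbour -> number of interactions with that neighbour.
    Returns a value in [0, 1]; 0.0 is returned for degree < 2, where log(degree) is 0.
    """
    counts = [c for c in interaction_counts.values() if c > 0]
    degree = len(counts)
    if degree < 2:
        return 0.0
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return entropy / math.log(degree)

even = normalised_diversity({"a": 5, "b": 5})    # interactions spread evenly
skewed = normalised_diversity({"a": 9, "b": 1})  # concentrated on one neighbour
```

With this normalisation, a student who splits interactions evenly across neighbours scores close to 1 regardless of how many neighbours there are, which is what allows comparison between students of different degree.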
The results also show that the higher the score of the students, the higher the percentage of their interactions that were persistent. These results were independent of gender differences (correlation of gender to score was −0.04). As the score of the student increases, these persistent interactions are initiated with a reduced number of similarly performing colleagues (assortative interaction pattern). Low performance students have a larger number of transient interactions spread over a large number of neighbours.
The dynamics of these interactions reveal that once students start to establish persistent interactions they do it more and more until a maximum saturation point is reached. High performing students tend to initiate persistent interactions before low performance ones, suggesting more willingness to collaborate. A striking fact is that these high performance students still maintain more than ~70% of transient interactions, mostly with mid performance students. Our reciprocity analysis shows that students try to contact high performance students and these respond although the latter do not usually initiate disassortative interactions with low performance students.
These early persistent interactions enable high performance students to build a “rich club”, while low performance students barely interact. Low performance students start to interact later (around week 4–5), when their “attendance” also increased just to decrease again towards the end of the course. This delay may help to explain why low performance students initiated more interactions that decreased after they failed to engage in persistent interactions with high performing students, since the “rich-club” had already been formed.
We could not monitor the content of the students' private messages and decided to perform an information diffusion analysis that could help us gain insight into the value of the content actually being exchanged. Our results revealed that low performance students generally exchange documents in a trivial manner (i.e. in a forwarding manner that spans a single hop). On the contrary, more complex and longer cascades occur in high performing groups. This indicates the existence of a highly organised network where similarly performing students exchange information in a well-structured fashion, following characteristic patterns that are different across groups. While high performing students mainly exchange documents in a chained manner, low performance students spread the information to many other students at the same time, without this document apparently being relayed to other students beyond the recipient. Indeed, low performance students were not typically included in the information chains developed by high performing students. By this we do not mean to imply a deliberate behaviour of students, but it most likely indicates the presence of a benefit maximisation process by which students focus their efforts on potentially more fruitful connections.
Low performance students drastically reduce the number of interactions after week 5, which may indicate a lack of motivation that leads them to drop the course and focus on other tasks. This per se does not let us conclude a lack of skills or motivation on the part of low performance students. For instance, external factors may cause both fewer interactions and dropping the course (e.g. too many extracurricular activities). The lack of data that could enable causality inference in our analysis precludes us from concluding whether inefficient interactions, external factors or both are the cause of the dropout/reduced performance.
Even though we cannot directly build a causality chain, our empirical data suggest that: 1) low performing students engage later in the course; 2) this late engagement is related to their exclusion from the highly-structured and persistent information exchanges held by high performing students; 3) low performing students try to compensate by initiating a larger number of weak interactions; 4) since this attempt to catch up is not successful, low performance students drastically reduce the number of interactions.
Our study did not allow us to distinguish the root cause (initial delay in interacting, low degree or a combination of both) of the increased social diversity found in low performing students.
As part of our future work, we aim to perform a detailed causality analysis to detect the root cause of the low performance. This may help to get low performing students involved in high performing chains and hopefully increase their final score and reduce dropout rates. On the other hand, this may have a negative effect on high scoring students, who will receive many more interactions. We also plan to expand this analysis to non-university environments.

Methods




The data consist of the interactions of 290 students at a Spanish university, during two consecutive years of a 12-week long course on Basic Computer Science Skills (Linux tools such as OpenOffice and GIMP, and content licensing schemes such as Creative Commons) for freshmen students of journalism.
An interaction is defined as a communication attempt via the aforementioned systems. We logged the time and direction of the interaction in the Chat and the class IRC (see Table 3 for a detailed list of interactions and types). Confidentiality prevented us from performing an examination on the content of these interactions. Moodle and our collaborative workspace let us keep track of documents shared by students.


Table 3: Percentage of Interactions per Communication Channel. Average % of interactions taking place over the different communication channels employed in our study. No significant differences were found between different groups of students. Moodle interaction count was increased only if the post received an answer. The collaborative workspace let us include interactions from blog posts, document shares, reminders or messages in the collaborative space. Each chat and classroom IRC session (sequence of messages exchanged without stopping for more than 3 min) counted as a single interaction
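The session rule in the caption above (messages separated by pauses of no more than 3 min count as a single interaction) can be sketched as follows. The function name `count_sessions` and the representation of timestamps as seconds are assumptions made for illustration; the sample timestamps are made up.

```python
def count_sessions(timestamps, max_gap_seconds=180):
    """Count sessions: runs of messages with no pause longer than `max_gap_seconds`.

    timestamps: message times in seconds (any order); 180 s corresponds to 3 min.
    """
    sessions = 0
    previous = None
    for t in sorted(timestamps):
        # A new session starts at the first message or after a pause above the threshold.
        if previous is None or t - previous > max_gap_seconds:
            sessions += 1
        previous = t
    return sessions

# Made-up timestamps: the 800 s pause before the fourth message splits the log in two.
n_sessions = count_sessions([0, 60, 200, 1000, 1100])
```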
These interactions were used to build a graph with a fine-grained temporal granularity (see Communication Channels in the SI). Diversity, grouping and connectivity metrics were calculated on the graph (see SI) [20]. These metrics were analysed and compared throughout the course. A snapshot of the quality of the data set can be observed in Figure 5.


Figure 5: Quality of the Data.
Probability density distribution of the number of interactions (A) and connections (B) per group in one of the courses being analysed.
Finally, we studied how files appeared and spread across the HOME directory students kept in the servers of the Lab (see SI).

References




  1. Diplomas Count 2007: Ready for What? Preparing Students for College Careers and Life after High School. Education Week 26 (2007).
  2. Rouse, C. The Labor Market Consequences of an Inadequate Education. Princeton University and NBER. In: Equity Symposium on The Social Costs of Inadequate Education at Teachers College, Columbia University, edited by Clive Belfield and Henry M. Levin (Washington: Brookings Institution Press, 2007). Available: http://devweb.tc.columbia.edu/manager/symposium/Files/77_Rouse_paper.pdf Last visited: 4-1-2013
  3. Harlow, C. Education and Correctional Populations. In: U.S. Department of Justice, Bureau of Justice (Washington DC, 2003). Available: www.ojp.usdoj.gov/bjs/pub/pdf/ecp.pdf Last visited: 4-1-2013
  4. Nurmela, K., Lehtinen, E. & Palonen, T. Evaluating CSCL log files by social network analysis. In: Proceedings of the 1999 conference on Computer support for collaborative learning, Article 54 (International Society of the Learning Sciences, 1999).
  5. Cho, H., Gay, G., Davidson, B. & Ingraffea, A. Social networks, communication styles, and learning performance in a CSCL community. Computers & Education 49, 309–329 (2007).
  6. Martinez, A., Dimitriadis, Y., Rubia, B., Gomez, E. & De La Fuente, P. Combining qualitative evaluation and social network analysis for the study of classroom social interactions. Computers & Education 41, 353–368 (2003).
  7. Chen, Z. & Watanabe, S. A Case Study of Applying SNA to Analyze CSCL Social Network. In: ICALT 2007. Seventh IEEE International Conference On Advanced Learning Technologies, 18–20 (2007).
  8. Granovetter, M. The strength of weak ties. The American Journal Of Sociology 78, 1360–1380 (1973).
  9. Sundararajan, B. Impact of communication patterns, network positions and social dynamics factors on learning among students in a cscl environment. PhD thesis (Troy, NY, USA, 2007).
  10. Yu, A. Y., Tian, S. W., Vogel, D. & Chi-Wai Kwok, R. Can learning be virtually boosted? An investigation of online social networking impacts. Computers & Education 55, 1494–1503 (2010).
  11. Ullrich, C., Borau, K. & Stepanyan, K. Who students interact with? a social network analysis perspective on the use of twitter in language learning. In: Proceedings of the 5th European Conference on Technology Enhanced Learning Conference on Sustaining TEL: from innovation to learning and practice 432–437 (Berlin, Heidelberg: Springer-Verlag, 2010).
  12. Yeung, Y. Y. Macroscopic study of the social networks formed in web-based discussion forums. In: Proceedings of the Conference on Computer Support for Collaborative Learning: the next 10 years! 727–731 (International Society of the Learning Sciences, 2005).
  13. Kepp, S. J. & Schorr, H. Analyzing collaborative learning activities in wikis using social network analysis. In: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems 4201–4206 (New York, 2009).
  14. Cho, H., Gay, G., Davidson, B. & Ingraffea, A. Social networks, communication styles, and learning performance in a CSCL community. Computers & Education 49, 309–329 (2007).
  15. Erlin, B., Yusof, N. & Rahman, A. Integrating Content Analysis and Social Network Analysis for analyzing Asynchronous Discussion Forum. In: ITSim 2008. International Symposium On Information Technology 318 (2008).
  16. Newman, M. Mixing patterns in networks. Physical Review E 67, 026126 (2003).
  17. Zhou, S. & Mondragon, R. J. The rich-club phenomenon in the Internet topology. IEEE Communications Letters 8, 180–182 (2004).
  18. Colizza, V., Flammini, A., Serrano, M. A. & Vespignani, A. Detecting rich-club ordering in complex networks. Nature Physics 2, 110–115 (2006).
  19. Wang, P., Hunter, T., Bayen, A. M., Schechtner, K. & González, M. C. Understanding Road Usage Patterns in Urban Areas. Scientific Reports 2, 1001 (2012).
  20. Leskovec, J., Singh, A. & Kleinberg, J. Patterns of influence in a recommendation network. In: Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining 380–389 (Springer-Verlag, 2006).
  21. Leskovec, J., Adamic, L. A. & Huberman, B. A. The dynamics of viral marketing. In: Proceedings of the 7th ACM conference on Electronic commerce 228–237 (New York, 2006).
  22. Yang, J. & Leskovec, J. Temporal Variation in Online Media. In: Proceeding of the ACM International Conference on Web Search and Data Mining 177–186 (New York, 2011).
  23. Leskovec, J., Adamic, L. A. & Huberman, B. A. The dynamics of viral marketing. ACM Transactions On The Web 1, 5 (2007).
  24. Eagle, N., Macy, M. & Claxton, R. Network diversity and economic development. Science 328, 1029 (2010).

Acknowledgements




We would like to thank Charles Elkan, Miranda Mowbray, Nabeel Gillani, Suksant Sae Lor, and Kate Mallichan for their insightful comments on the manuscript and Yannis Dimitriadis and Eduardo Gomez for inspiring this work. Manuel Cebrian acknowledges support from the National Science Foundation under grant 0905645, from DARPA/Lockheed Martin Guard Dog Program under PO 4100149822, and the Army Research Office under Grant W911NF-11-1-0363.

Author information




Affiliations

  1. Hewlett-Packard Laboratories, Bristol BS34 8QZ, UK

    • Luis M. Vaquero
  2. NICTA, Melbourne, Victoria 3010, Australia

    • Manuel Cebrian
  3. Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA

    • Manuel Cebrian

Contributions

Conceived, designed and performed the experiments: L.M.V. Analysed the data: L.M.V., M.C. Wrote the paper: L.M.V., M.C.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to: