jueves, 10 de octubre de 2013

Coeficiente de agrupamiento

Coeficiente de agrupamiento 

En la teoría de grafos, un coeficiente de agrupación es una medida del grado en el que los nodos en un gráfico que tienden a agruparse juntos. La evidencia sugiere que en la mayoría de redes del mundo real , y en particular las redes sociales, los nodos tienden a crear grupos muy unidos que se caracterizan por una densidad relativamente alta de enlaces; esta probabilidad tiende a ser mayor que la probabilidad media de un lazo establecido al azar entre dos nodos (Holland y Leinhardt de 1971, [1] Watts y Strogatz, 1998 [2]).

Existen dos versiones de esta medida: la global y la local. La versión global fue diseñado para dar una indicación general de la agrupación en la red, mientras que el local, da una indicación de la incrustación de los nodos individuales .

Coeficiente de agrupamiento global 

El coeficiente de agrupación mundial se basa en tripletes de nodos. Un triplete consiste en tres nodos que están conectados por cualquiera de los otros dos (triplete abierto) o tres (triplete cerrado) lazos no dirigidos. Un triángulo se compone de tres tripletes cerrados, uno centrado en cada uno de los nodos. El coeficiente de agrupación global es el número de tripletes cerrados (o 3 x triángulos) sobre el número total de tríos (tanto abierto como cerrado). El primer intento de medir que fue hecha por Luce y Perry (1949). [3] Esta medida da una indicación de la agrupación en el conjunto de la red (global), y se puede aplicar tanto a las redes no dirigidas como dirigidas (a menudo llamadas transitividad, ver Wasserman y Faust, 1994, página 243 [4]).


Watts y Strogatz definen el coeficiente de agrupamiento como sigue, "Supongamos que un vértice tiene  vecinos; entonces un máximo de hasta  enlaces pueden existir entre ellos (esto ocurre cuando cada vecino de  está conectado a cada otro vecino de  ). Sea la fracción de estos enlaces permitidos que existen actualmente. Definamos  como el promedio de   sobre todos los ."

Cociente de transitividad

El cociente de transitividad es un concepto relacionado. Se define como:

El coeficiente de agrupamiento asigna mayor peso a los nodos con menor grado, mientras que el cociente de transitividad asigna mayor peso a los nodos con mayor centralidad de grado.

Una generalización de redes ponderados fue propuesto por Opsahl y Panzarasa (2009), [5] y una redefinición de las redes de dos modos (tanto binaria y ponderadas) por Opsahl (2009). [6]

Coeficiente de agrupación local



Ejemplo: El coeficiente de la agrupación local de un grafo no dirigido. El coeficiente de agrupación local del nodo azul se calcula como la proporción de las conexiones entre sus vecinos que en realidad se realizan en comparación con el número de todas las conexiones posibles. En la figura, el nodo azul tiene tres vecinos, que pueden tener un máximo de 3 conexiones entre ellos. En la parte superior de la figura se observan las tres conexiones posibles (segmentos negros gruesos), dando un coeficiente de la agrupación local de 1. En la parte media de la figura se realiza sólo una conexión (línea de color negro de espesor) y 2 conexiones se echa en falta (líneas de puntos rojos), dando un coeficiente de clúster local de 1/3. Por último, ninguna de las posibles conexiones entre los vecinos del nodo azul se realizan, la producción de un valor de coeficiente de la agrupación local de 0.

El coeficiente de la agrupación local de un vértice (nodo) en un gráfico cuantifica qué tan cerca sus vecinos están de ser una camarilla (gráfico completo). Duncan J. Watts y Steven Strogatz introdujeron la medida en 1998 para determinar si un grafo es una red de mundo pequeño.

Un grafo  formalmente consiste de un conjunto de vértices  y un conjunto de enlaces entre ellos. Un enlace   conecta al vértice  con el vértice .

El vecindario  para un vértice  es definido como sus vecinos inmediatamente conectados como sigue:

Definimos  como el número de vértices, , en el vecindario, , de un vértice.

El coeficiente de agrupamiento local   para un vértice   es entonces dado por la proporción de enlaces entre los vértices dentro de su vecindario dividido por el número de enlaces que podría potencialmente existir entre ellos. Para un grafo dirigido, es distinto de , y por lo tanto para cada vecindario  hay  enlaces que podrían existir entre los vértices dentro del vecindario  ( es el número de vecinos de un vértice). Entonces, el coeficiente de agrupamiento local para grafos dirigidos está dado por  [2]



Dado que un grafo no dirigido tiene la propiedad que   y son idénticos. Por ello, si un vértice  tiene  vecinos, enlaces podrían existir entre los vértices dentro de su vecindario. Por ello, el coeficiente de agrupación local para grafos no dirigidos puede ser definido como

Sea  el número de triángulos sobre  para un grafo no dirigido . Esto es, es el número de subgrafos de  con 3 enlaces y 3 vértices, uno de los cuales es . Sea  el número de of tripletes sobre . Esto es, es el número de subgrafos (no necesariamente inducidos) con 2 enlaces y 3 vértices, uno de los cuales es y tal que es incidente a ambos enlaces. Entonces podemos también definir al coeficiente de agrupamiento como


Es simple demostrar que las dos definiciones precedentes son lo mismo, dado que

Estas medidas son iguales a 1 si cada vecino conectado a   se encuentra también conectado a cada otro vértice dentro del vecindario, y 0 si ningún vértice conectado a   se conecta con ningún otro vértice que está conectado a .

Coeficiente de agrupamiento medio de red 


El coeficiente de agrupamiento para toda la red está dada por Watts y Strogatz [2] como la media de los coeficientes de la agrupación locales de todos los vértices  :[7]

Un grafo es considerado un mundo pequeño, si el coeficiente de agrupamiento medio es significativamente más alto que un grafo aleatorio construido con el mismo conjunto de vértices y si el grafo tiene aproximadamente la misma longitud media de camíno más corto que si correspondiente grafo aleatorio.

Una generalización para grafos ponderados fue propuesta por Barrat et al. (2004),[8] y una redefinición para grafos bipartitos (también llamadas redes de dos modos) por Latapy et al. (2008)[9] y Opsahl (2009).[6]

Esta fórmula no es, por definición, definida para grafos con vértices aislados; ver Kaiser, (2008)[10] y Barmpoutis et al.[11] Las redes con los coeficientes de agrupamiento medio más grandes posible se encuentran que tienen estructuras modulares y al mismo tiempo, tiene la menor distancia media posible entre los diferentes nodos.[11]

Referencias

  1. P. W. Holland and S. Leinhardt (1971). "Transitivity in structural models of small groups".Comparative Group Studies 2: 107–124.
  2. Jump up to:D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks"Nature 393 (6684): 440–442. Bibcode:1998Natur.393..440W.doi:10.1038/30918PMID 9623998.
  3. Jump upR. D. Luce and A. D. Perry (1949). "A method of matrix analysis of group structure". Psychometrika 14 (1): 95–116.doi:10.1007/BF02289146PMID 18152948.
  4. Jump upStanley Wasserman, Kathrine Faust, 1994. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
  5. Jump upTore Opsahl and Pietro Panzarasa (2009). "Clustering in Weighted Networks"Social Networks 31 (2): 155–163.doi:10.1016/j.socnet.2009.02.002.
  6. Tore Opsahl (2009). "Clustering in Two-mode Networks"Conference and Workshop on Two-Mode Social Analysis (Sept 30-Oct 2, 2009).
  7. Jump upKemper, Andreas (2009). Valuation of Network Effects in Software Markets: A Complex Networks Approach. Springer. p. 142.ISBN 9783790823660.
  8. Jump upA. Barrat and M. Barthelemy and R. Pastor-Satorras and A. Vespignani (2004). "The architecture of complex weighted networks".Proceedings of the National Academy of Sciences 101 (11): 3747–3752. arXiv:cond-mat/0311416Bibcode:2004PNAS..101.3747B.doi:10.1073/pnas.0400087101PMC 374315PMID 15007165.
  9. Jump upM. Latapy and C. Magnien and N. Del Vecchio (2008). "Basic Notions for the Analysis of Large Two-mode Networks". Social Networks 30(1): 31–48. doi:10.1016/j.socnet.2007.04.006.
  10. Jump upMarcus kaiser (2008). "Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks".New Journal of Physics 10 (8): 083042. arXiv:0802.2512Bibcode:2008NJPh...10h3042Kdoi:10.1088/1367-2630/10/8/083042.
  11. D. Barmpoutis and R. M. Murray (2010). "Networks with the Smallest Average Distance and the Largest Average Clustering".arXiv:1007.4031 [q-bio.MN].

miércoles, 9 de octubre de 2013

Centralidad en redes sociales: Depende de la métrica

Who is central to a social network? It depends on your centrality measure.




One important feature of networks is the relative centrality of individuals in them.  Centrality is a structural characteristic of individuals in the network, meaning a centrality score tells you something about how that individual fits within the network overall.  Individuals with high centrality scores are often more likely to be leaders, key conduits of information, and be more likely to be early adopters of anything that spreads in a network. 
Low centrality individuals can be termed peripheral.  Being peripheral can have advantages as well by protecting individuals from negative contagion and influence (think of the flu spreading or the ricocheting effects of that really deflating coworker).  Sometimes lower centrality is associated with less work overload in an organization.
The confusing thing about centrality is there is not just one kind.  Social network scientists have invented literally dozens of centrality measures to characterize slightly different aspects of structural positions in networks.  Here we will explain three of the most commonly used centrality measures.

Closeness Centrality

This measure expresses the average social distance from each individual to every other individual in the network.  The concept of social distance is easily understood by considering the common parlor and roadtrip game, six degrees of Kevin Bacon, in which players try to find the shortest set of connections from any actor to Kevin Bacon based on movies that actors have starred in together.  An actor who has costarred with Kevin Bacon is considered to be 1 degree away, while an actor who costarred with a Kevin Bacon costar is 2 degrees away, and so forth.  The same concept can be applied to any social network.
 If we divide 1 by the average shortest path from an individual to all other individuals in the network, then we have calculated their closeness centrality.  In this way an individual with a direct tie to everyone else ends up with a closeness score of 1.  Individuals who connect to most others through many intermediaries get closeness scores that are increasingly nearer to zero.  One property of closeness centrality is that it tends to give high scores to individuals who are near the center of local clusters (aka network communities) in an overall larger network (Figure 1).

Figure 1: Individuals who are highly connected to others within their own cluster will have a high closeness centrality.
Applications: High closeness centrality individuals tend to be important influencers within their local network community.  They may often not be public figures to the entire network of a corporation or profession, but they are often respected locally and they occupy short paths for information spread within their network community.

Betweenness Centrality

Betweenness is another measure that is derived from the concept of counting the shortest paths between individuals in a network.  It has very different properties, however, from closeness centrality.  To calculate betweenness centrality, one starts by finding all the shortest paths between any two individuals in the network.  You then count the number of these shortest paths that go through each individual.  This number is betweenness centrality.
The result of this calculation is finding the individuals who are necessary conduits for information that must traverse disparate parts of the network.  These are usually very different individuals from those with high closeness.  High betweenness individuals often do not have the shortest average path to everyone else, but they have the greatest number of shortest paths that necessarily have to go through them. 
One example of this would be the highway map of the United States.  Cities in the Midwest like St. Louis have higher betweenness centrality than do New York or LA because so many shortest paths that combine any cities on the east and west coast have to pass through them. 
In a social network, high betweenness individuals are often found at the intersections of more densely connected network communities (Figure 2).  They are well positioned to perform brokering roles across these clusters in the sense that brokers connect otherwise disconnected people who yet may benefit from an exchange of information.  The term broker applies quite naturally in this case, as actual brokers, whether real estate, mortgage, or pawn brokers, make a living by connecting otherwise disconnected sets of individuals.  That’s what high betweenness people do within a social network.

Figure 2: Those who act as bridges between clusters in the network have high betweenness centrality.
Applications:  High betweenness individuals are often critical to collaboration across departments and to maintaining the spread of a new product through an entire network.  Because of their locations between network communities, they are natural brokers of information and collaboration.  One difference between high betweenness individuals in a network and actual brokers is the latter usually have a public profile as part of their business, whereas high betweenness individuals often are overlooked.  This occurs because they are not central to any single social clique, and instead reside on the periphery of several such cliques each of which all engender more trust and admiration within rather than outside of the clique.   

Eigenvector Centrality
This measure has a complicated name, but it basically denotes the extent to which an individual is a big fish connected with other big fish in a big pond.  Eigenvector centrality is calculated by assessing how well connected an individual is to the parts of the network with the greatest connectivity (Figure 3).  Individuals with high eigenvector scores have many connections, and their connections have many connections, and their connections have many connections … out to the end of the network. 

Figure 3: Highly connected individuals within highly interconnected clusters, or ‘big fish in big ponds’, have high eigenvector centrality.
There is another way to think about eigenvector centrality that relates more to its name.  The name ‘eigen’ come from German, and can be translated to mean ‘characteristic’.  In a social network, we can observe directly whether individuals are connected or not.  The mathematical notion behind eigenvectors is that we can re-express the matrix of person-to-person connections with a set of ‘eigen (characteristic)’ values that are assigned to each individual.  So, eigenvector centrality scores correspond to the score you get for individuals if you start by constructing the pairwise connections between all individuals in a network (1 for connected, 0 for not), and then assign a single number to each individual while attempting to keep the distances between these new values equal to the distances observed in the social network matrix of connections.  Of course this can’t be done with a single numeric value per individual, but in fact you can always represent a set of social connections or distances by assigning as many vectors (strings of numbers for each individual) as there are individuals in the network.
Applications:  High eigenvector centrality individuals are leaders of the network.  They are often public figures with many connections to other high-profile individuals.  Thus, they often play roles of key opinion leaders and shape public perception.  A related example of this is Google’s page rank algorithm, which is closely related to eigenvector centrality calculated on websites based on links to them. 
High eigenvector centrality individuals, however, cannot necessarily perform the roles of high closeness and betweenness.  They do not always have the greatest local influence and may have limited brokering potential.  Like an aloof king in his court or CEO in her boardroom, they may at times be isolated from peripheral individuals and smaller network communities that have limited connectivity with the most densely connected parts of the network. 

jueves, 3 de octubre de 2013

Software: Commetrix CMX para redes sociales dinámicas

Commetrix CMX Analyzer: Dynamic Social Network Visualization
Commetrix CMX Analyzer is a social network analysis platform from a German company Trilexis (www.trilexis.com) which originated in a research group at the Technical University of Berlin. (Note: the website, user interface and documentation are all in English.) What is interesting about this particular tool is its emphasis on the dynamics of social interactions over time. It achieves this through a data format that captures information about each individual link event including not only originator, destination and time but also user specified attributes which could include communication mode (email, IM, twitter), type of exchange (social, work, ecommerce), topic (e.g. keywords extracted from the subject).

Commetrix CMX Analyzer User Interface

A small subset of the Enron Email dataset –from the size and the individuals referenced we are guessing a single custodian - is provided for demonstration purposes. Part of our interest in this particular software is that we are familiar with the Enron dataset and had researched it using the social network analysis functionality of an eDiscovery system called MetaLINCS. We were curious to see what additional insights CMX Analyzer might provide. 

CMX Analyzer is a desktop tool built in Java incorporating the 3D graphical capabilities of Java 3D and the Java Media Framework. Once we had obtained the license key, the application was straightforward to install and comes with a user guide. To date we have only been able to try it out on the sample data set provided as the process of creating new data sets requires end-user coding (of link attributes) followed by a data transformation process that requires as separate tool (Commetrix Producer) or the data being sent to Trilexis for processing by their systems.

Commetrix Data Preparation Process:

Commetrix is not as functionally or visually rich as some of the other tools we have investigated and reported on in previous blogs (e.g. Gephi, nodeXL). However, where it comes into its own is in the dynamic visualization of email communications over time. The MetaLINCs software we had used in the past had provided a “time-slider” but was essentially a “snapshot” approach. Commetrix has time-sliders too but also animates the traffic creating a unique perspective on what is, after all, a time-based series of events. (We should also warn readers that the resulting animations make for highly addictive viewing. We were totally captivated!) The start-end of the time period can be set, as can the intervals and speed of animation. It is also possible to run the time line backwards as well as forwards. This makes it possible to identify “hot spots” of communication activity between group subsets at particular points in time. In other types of communication e.g. twitter or facebook – we can see how this would provide valuable insight into the evolution of a topic of discussion or a social group.

Snapshot of Communications: Jan 2000

Snapshot of Communications: Dec 2000
Visually, Commetrix is more limited than some of the other packages we have used e.g. it is not possible to pan or zoom. There are options to change node size and color to represent parameters such as communications sent, communications received, number of direct contacts. Color schemes cannot be chosen directly but can be set to show selected attributes e.g. the following screenshot shows nodes color coded by the ‘function’ attribute where dark blue represents employees, pale blue represents directors, green represents traders, wholly purple circles represent managers and purple circles with yellow centers represent in-house lawyers. (Note: we found the use of full and semi colored circles to be somewhat confusing).

Colorcoding by Function

Included in Commetrix is an “egoview” option which allows you to select a particular node and investigate communication to and from that individual node. Links can be filtered to include only direct communications (a 1-step link) or communications involving two or more steps. The image below, for example, shows communications to and from Sara Shackleton. While this capability is helpful focusing down on traffic to and from a node, in the case of email communications if the data set is from only one custodian, the egoview has limited value when used outside that custodian as it will show only those communications that happen to have been referenced in emails sent to and from the primary custodian i.e. it is an imperfect sample. 

Screenshot Showing Ego View - Tana Jones

Commetrix also comes with a Keyword filter. The intent is to allow the user to focus on interactions “about” the selected keywords. The interface is less obvious than some of the other areas and we confess to wondering if there was a bug until – rereading the manual – we realized that “In” didn’t mean “inbound” but include and “Out” meant exclude. Selecting the terms was also rather tedious as it meant scrolling through a long list of options. To validate the filtering, we took ‘california’ related terms and looked to see if Jeff Dasovitch was included, which he is – see screenshot below. It would be interesting to see this concept better developed with better keyword lists, more complex keyword filtering options and possibly the employment of automated topic determination techniques such as keyword clustering.

Screenshot Showing Use of Keyword Filter
Although the enron data set was provided only for demo purposes – having worked with this data, we were curious about two things: firstly how were the keywords derived (we guessed email subject but some of the keywords were email domains – indicating other metadata might have been used as well – and some phrases had been concatenated (e.g. ‘californiaattached’) or include a leading article (e.g.thenumber), or word fragments (e.g.’t’, ‘e’). Secondly, and more importantly, how were the “identities” of the individuals represented by the nodes resolved? This is always a major issue in email communications if the only information about senders and recipients is an email address. Most individuals have multiple email addresses – even within companies – and the names on email addresses may be difficult to resolve to a single individual. We raise this question because MetaLINCS included functionality that attempted to link individuals with their email accounts based not only on email address but also on communications patterns. Even then, many individuals/email accounts that a human would identify as probably being connected, could not be automatically linked. We are guessing that the identity of individuals was manually coded since the node table has a clean one-to-one mapping between individuals and a single email address.

In summary, while we think some of the other software we have used and researched offer better social network visualization options, we really liked the time-line animation Commetrix provides and believe it could be very helpful when studying the evolution of a network or communication patterns over time. While the keyword filtering option was disappointing in both the implementation and the demo dataset provided, we think it has obvious potential – particularly when analyzing large data sets of email, IM and twitter – in enabling users to focus in on only those communications “about” a particular topic. Of course, with that come all the provisos of using keywords as a substitute for “aboutness” but if it was combined with stemming, a better stop word list, and some form of thesaurus (to apply synonyms automatically) it would be very powerful.

Chroma Scope

miércoles, 2 de octubre de 2013

La demografía de redes sociales digitales

Social Media Demographics: The Surprising Identity Of Each Major Social Network



Each social media platform has cultivated a unique identity thanks to the demographics of the people who participate in the network. Some platforms are preferred by young adults, who are most active in the evening, others by high-income professionals, who are posting throughout the workday.
We explained in a recent report why many brands and businesses need platform-focused social media strategies, rather than a diluted strategy that aims to be everywhere at once.
In a new report from BI Intelligence, we break down the demographics of each major social media platform to help brands and businesses decide which networks they should prioritize. Being able to identify the demographics of social media audiences at a granular level is the basis for all targeted marketing and messaging. The report also spotlights the opportunities that lie ahead for each social network, how demographics affect usage patterns, and why some platforms are better for brands than others. 
Here are some of our surprising findings: 
  • Facebook still skews young, but the 45- to 54-year-old age bracket has seen 45% growth since year-end 2012. Among U.S. Internet users, 73% with incomes above $75,000 are on Facebook (compared to 17% who are on Twitter). Eight-six percent of Facebook's users are outside the U.S.
  • Instagram: Sixty-eight percent of Instagram's users are women.
  • Twitter has a surprisingly young user population for a large social network — 27% of 18 to 29-year-olds in the U.S. use Twitter, compared to only 16% of people in their thirties and forties. 
  • LinkedIn is international and skews toward male users. 
  • Google+ is the most male-oriented of the major social networks. It's 70% male.
  • Pinterest is dominated by tablet users. And, according to Nielsen data, 84% of U.S. Pinterest users are women.
  • Tumblr is strong with teens and young adults interested in self-expression, but only 8% of U.S. Internet users with incomes above $75,000 use Tumblr.

Business Insider