Sunday, 29 September 2013

Visualizing Email Communications using NodeXL

Email has become an integral part of communication in both the business and personal spheres. Given its centrality, it is surprising how few tools are generally available for analyzing it outside specialist areas such as the Early Case Assessment tools used in litigation, Xobni being a notable exception at the individual level. However, the rise of social network analysis, and of the tools that support it, may change that. Graph theory is remarkably neutral as to whether it is applied to Facebook friend networks or to email communications within a Sales and Marketing division.

In a previous post, we reported on using Gephi, an open source tool for graphing social networks, to visualize email communications. In this post, we look at using NodeXL for the same purpose. We used the same email data set as before, the ‘Godfather Sample’, in which an original email data set was processed to extract the metadata (e.g. sender, recipient, date sent, subject) and subsequently anonymized using fictional names.

NodeXL is a free and open source template for Microsoft Excel 2007 that provides a range of basic network analysis and visualization features intended for use on modest-sized networks of several thousand nodes/vertices. It is targeted at non-programmers and builds upon the familiar concepts and features within Excel. Information about the network, e.g. node data and edge lists, is all contained within worksheets. 


Data can be loaded simply by cutting and pasting an edge list from another Excel worksheet, but there is also a wide range of other options, including the ability to import network data from Twitter (Search and User networks), YouTube and Flickr, and from files in GraphML, Pajek and UCINET Full Matrix DL formats. There is also an option to import directly from an email PST file, which we will discuss in a following post. In addition to the basics of an edge list, attribute information can be associated with each edge and node. In our “Godfather” email sample, we added a weighting for communication strength (i.e. the number of emails between the two individuals) to each edge and the affiliation with the Corleone family to each node.

Once an edge list has been added, the vertex/node list is created automatically. A variety of graphical representations can then be produced, depending on the layout option selected (Fruchterman-Reingold is the default, but Harel-Koren Fast Multiscale as well as Grid, Polar, Sugiyama and Sine Wave options are also available) and on how data attributes are mapped to the visual properties of nodes and edges. For example, in the graph shown below, nodes were color coded and sized with respect to the individual’s connections with the Corleone family: blue for Corleone family members, green for Corleone allies, orange for Corleone enemies and pink for individuals with no known associations with the family.
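For readers preparing their own data, here is a minimal sketch of how a weighted edge list of this kind might be built before pasting it into the worksheet. The records are invented for illustration (only the names come from the Godfather sample), and `weighted_edge_list` is a hypothetical helper, not part of NodeXL.

```python
from collections import Counter

# Hypothetical (sender, recipient) metadata records; the names are taken
# from the article, the messages themselves are made up.
emails = [
    ("Vito Corleone", "Connie"),
    ("Vito Corleone", "Connie"),
    ("Connie", "Vito Corleone"),
    ("Vito Corleone", "Salvatore Tessio"),
    ("Hyman Roth", "Vito Corleone"),
]


def weighted_edge_list(records):
    """Count the emails exchanged by each unordered pair of people."""
    counts = Counter(tuple(sorted(pair)) for pair in records)
    # One row per pair: (vertex 1, vertex 2, number of emails).
    return sorted((a, b, n) for (a, b), n in counts.items())


edges = weighted_edge_list(emails)
for a, b, n in edges:
    # Tab-separated rows paste cleanly into spreadsheet columns.
    print(f"{a}\t{b}\t{n}")
```

Sorting each pair before counting treats the weight as the number of emails in either direction, which matches the "number of emails between the two individuals" used above; a directed analysis would keep the original order.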



The width of the edges/links was then set to vary in relation to the degree of communication between the two nodes i.e. the number of emails sent between the two individuals concerned.


Labels can be added to both nodes and links showing either information about the node/link or its attributes, as required.






Different graph layout options are available which may be used to generate alternative perspectives and/or easier to view graphs.

Harel-Koren Layout


Circle Layout


Because even a small network can generate a complex, dense graph, NodeXL has a wide range of options for filtering and hiding parts of the graph, the better to elucidate others. The visibility of an edge/vertex, for example, can be linked to a particular attribute, e.g. degree of closeness. We found the dynamic filters particularly useful for rapidly focusing on areas of interest without altering the properties of the graph itself. For example, in the following screenshot we are showing only those links where the number of emails between the parties is greater than 40. This allows us to focus on individuals who have been emailing each other more frequently than the average.
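Outside NodeXL, the same idea can be sketched in a few lines. The edge weights below are invented, and `visible_edges` is a hypothetical helper name; the point is only that the filter selects what is shown and leaves the underlying edge list unchanged.

```python
# Hypothetical weighted edge list: (vertex 1, vertex 2, emails exchanged).
edges = [
    ("Vito Corleone", "Connie", 57),
    ("Vito Corleone", "Salvatore Tessio", 41),
    ("Hyman Roth", "Vito Corleone", 40),
    ("Ritchie Martin", "Connie", 12),
]


def visible_edges(edge_list, min_emails):
    """Return the edges whose weight is strictly greater than min_emails."""
    return [edge for edge in edge_list if edge[2] > min_emails]


# Show only links with more than 40 emails; the original list is not modified.
shown = visible_edges(edges, 40)
for a, b, n in shown:
    print(f"{a} <-> {b}: {n} emails")
```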


In addition to graphical display, NodeXL can be used to calculate key network metrics including: Degree (the number of links on a node and a reflection of the number of relationships an individual has with other members of the network) with In-Degree and Out-Degree options for directed graphs, Betweenness Centrality (the extent to which a node lies between other nodes in the network and a reflection of the number of people an individual is connecting to indirectly), Closeness Centrality (a measure of the degree to which a node is near all other nodes in a network and reflects the ability of an individual to access information through the "grapevine" of network members) and Eigenvector Centrality (a measure of the importance of an individual in the network). In an analysis of email communications, these can be used to identify the degree of connectedness between individuals and their relative importance in the communication flow. 
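As a rough illustration of the simplest of these metrics, the sketch below computes Degree, In-Degree and Out-Degree from a directed edge list. The edges are invented, `degree_table` is a hypothetical helper, and counting distinct neighbours (rather than individual emails) is an assumption about how one might define degree here, not a statement of how NodeXL computes it.

```python
from collections import defaultdict

# Hypothetical directed edges (sender -> recipient).
directed_edges = [
    ("Vito Corleone", "Connie"),
    ("Vito Corleone", "Salvatore Tessio"),
    ("Connie", "Vito Corleone"),
    ("Hyman Roth", "Vito Corleone"),
    ("Ritchie Martin", "Connie"),
]


def degree_table(edge_list):
    """Return in-degree, out-degree and overall degree for every node."""
    out_nbrs = defaultdict(set)  # people each node has written to
    in_nbrs = defaultdict(set)   # people each node has heard from
    for sender, recipient in edge_list:
        out_nbrs[sender].add(recipient)
        in_nbrs[recipient].add(sender)
    nodes = set(out_nbrs) | set(in_nbrs)
    return {
        node: {
            "in": len(in_nbrs[node]),
            "out": len(out_nbrs[node]),
            # Degree counts each distinct neighbour once, in either direction.
            "degree": len(in_nbrs[node] | out_nbrs[node]),
        }
        for node in nodes
    }


table = degree_table(directed_edges)
for node in sorted(table, key=lambda n: -table[n]["degree"]):
    print(node, table[node])
```

Betweenness, closeness and eigenvector centrality need shortest-path or iterative computations and are better left to a graph library or to NodeXL itself.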

For example, in our Godfather sample, we have sized the nodes in the graph below by their Degree Centrality. While Vito Corleone is, as expected, shown to be highly connected, Ritchie Martin, an individual not thought to have business associations with the Corleone family, is shown to be more connected than supposed.

Nodes Sized by Degree Centrality


When we look at the same data from the perspective of betweenness, we see that Vito, Connie and Ritchie all have a high degree of indirect connections.

Nodes Sized by Betweenness Centrality


And the Eigenvector Centrality measure confirms Vito Corleone's significance in the network, as well as that of Connie, of two "allies" (Hyman Roth and Salvatore Tessio) and of two other individuals, one of them Ritchie Martin.

Nodes Sized by Eigenvector Centrality


Last but not least, it is also possible to use NodeXL to visualize clusters of nodes to show or identify subgroups within a network. Clusters can be added manually or generated automatically. Manually creating clusters requires first assigning nodes to an attribute or group membership and then determining the color and shape of the nodes for each subgroup/cluster. In our Godfather example, we used “Family” affiliation to create clusters within the network, but one could equally use organization/company, country, language, date, etc.
"Family Affiliation" Clusters Coded by Node Color

Selected Cluster (Corleone Affiliates)

NodeXL will also generate clusters automatically using a clustering algorithm developed specifically for large scale social network analysis which works by aggregating closely interconnected groups of nodes. The results for the Godfather sample are shown below. We did not find the automated clustering helpful but this is probably a reflection of the relatively small size of the sample. 

In the next post, we will look at importing email data directly into NodeXL and compare approaches based on analyzing processed vs unprocessed email data. 

Larger Email Network Visualization

To download NodeXL, go to http://nodexl.codeplex.com//. We would also recommend working through the NodeXL tutorial, which can be downloaded from: http://casci.umd.edu/images/4/46/NodeXL_tutorial_draft.pdf


A top-level overview of social network analysis and the basic concepts behind graph metrics can be found on Wikipedia, e.g. http://en.wikipedia.org/wiki/Social_network and http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality

Chroma Scope

Friday, 27 September 2013

Facebook becomes the largest social network in history

Facebook maps 1,110,000,000 friendships to show off the social network's global reach
Founder and CEO Mark Zuckerberg shares the image on his personal Facebook page
James Vincent - The Independent



Facebook has released an updated visualization showing the global connections between its users.

With more than 1.11 billion people signed up to Facebook worldwide, the result is a map traced in bright blue arcing lines, connecting each user's geographic location with those of their friends.

Facebook founder and CEO Mark Zuckerberg posted the new visualization as the cover photo of his profile page, commenting: "This is a map of all of the friendships formed on Facebook across the world."

The first such image was created by an intern at the social network in 2010, when the site had 500 million users: "I was interested in seeing how geography and political borders affected where people lived relative to their friends. I wanted a visualization that would show which cities had a lot of friendships between them."

"When I shared the image with others within Facebook, it resonated with many people," the map's creator said at the time. "It's not just a pretty picture, it's a reaffirmation of the impact we have in connecting people, even across oceans and borders."

Even these users are not enough for Facebook, however: last month the company announced a new initiative called internet.org, a global consortium of companies that aims to increase access to the web around the world.

"There are huge barriers in developing countries to connecting and joining the knowledge economy," said Zuckerberg at the launch of the project. "Internet.org brings together a global partnership that will work to overcome these challenges, including making internet access available to those who cannot currently afford it."

Projects of this kind, led by technology companies and often marketed as charitable initiatives, have not met with universal praise.



In response to Google's somewhat similar Project Loon (a plan to extend web access using high-altitude weather balloons), Bill Gates commented in an interview with Businessweek that "when a kid gets diarrhea, no, there's no website that relieves that."

Wednesday, 25 September 2013

Homophily: Introduction

Homophily

Homophily (i.e., "love of the same") is the tendency of individuals to associate and bond with similar others. The presence of homophily has been found in a wide range of network studies. More than 100 such studies have observed homophily in one form or another and established that people who share similarities tend to connect. [1] The similarities involved include age, gender, class and organizational role. [2]

This is often expressed in the adage "birds of a feather flock together".

People in homophilic relationships share common characteristics (beliefs, values, education, etc.) that make communication and relationship formation easier. Homophily often leads to homogamy: marriage between people with similar characteristics. [1]


Types of homophily

In their original formulation of homophily, Lazarsfeld and Merton (1954) distinguished between status homophily and value homophily. Status homophily describes individuals with similar social status characteristics being more likely to associate with each other than they would by chance. Value homophily refers to the tendency to associate with others who think in similar ways, regardless of differences in status. [3]

To test the relevance of homophily, researchers have distinguished between baseline homophily and inbreeding homophily. The former is simply the amount of homophily that would be expected by chance, and the latter is the amount of homophily over and above this expected value.
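To make the baseline/inbreeding distinction concrete, here is a small sketch under assumptions that are ours rather than the source's: "chance" is taken to mean that every pair of people is equally likely to be tied, and inbreeding homophily is taken as the simple difference between the observed and the expected share of same-group ties. The people, groups and ties are invented, and the literature uses several other formalizations.

```python
from itertools import combinations

# Hypothetical population with one attribute (group "x" or "y").
group = {"ana": "x", "ben": "x", "cat": "x", "dan": "x", "eve": "y", "fay": "y"}

# Hypothetical observed ties between members of that population.
ties = [("ana", "ben"), ("ana", "cat"), ("ben", "dan"),
        ("eve", "fay"), ("dan", "eve")]


def same_group_share(pairs, membership):
    """Fraction of the given pairs whose two members share a group."""
    pairs = list(pairs)
    same = sum(1 for a, b in pairs if membership[a] == membership[b])
    return same / len(pairs)


# Baseline: share of same-group pairs if ties formed purely at random.
baseline = same_group_share(combinations(group, 2), group)
# Observed: share of same-group pairs among the ties that actually exist.
observed = same_group_share(ties, group)
# Inbreeding (as assumed here): homophily over and above the baseline.
inbreeding = observed - baseline

print(f"baseline={baseline:.3f} observed={observed:.3f} inbreeding={inbreeding:.3f}")
```

In this toy population 7 of the 15 possible pairs share a group, while 4 of the 5 observed ties do, so the observed homophily exceeds what chance alone would produce.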

The opposite of homophily is heterophily.

References


  1. McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). "Birds of a Feather: Homophily in Social Networks". Annual Review of Sociology. 27: 415-444.
  2. Retica, Aaron (10 December 2006). "Homophily". New York Times.
  3. Lazarsfeld, P. F. and Merton, R. K. (1954). "Friendship as a Social Process: A Substantive and Methodological Analysis". In Freedom and Control in Modern Society, Morroe Berger, Theodore Abel, and Charles H. Page, eds. New York: Van Nostrand, 18–66.

Sunday, 22 September 2013

Networks: Are they useful? Is it real science? Do they answer political science questions?

Three Hard Questions about Network Science

Mark Lubell - Center for Environmental Policy and Behavior


I just returned from a nice junket to beautiful (at least in the fall…) Maine where I gave talks at Bowdoin College and University of Maine. Both institutions were impressive for different reasons, and I met a lot of fun people. The Q&A periods of the talks highlighted three important and hard questions about network science that I think all of us should be thinking about how to answer.
A first caveat is that these three questions are interrelated and at times there will be overlap in this write-up. A second caveat is that I will be answering these mostly from the perspective of environmental policy, policy science, political science, and social science more broadly. But the questions apply equally to the physical and biological sciences, and I will be forwarding this blog around in an effort to start a dialog. Who knows, there might be a collaborative paper in this somewhere. A third caveat is that this blog is going to be long—sorry. Bottom-line, for those who trust me and don’t want to read anymore: Network science is a real science that is useful to society and can answer core questions about social systems and behavior.
Why is network science useful?
To answer this question, first we have to decide what “useful” means. Network science can be useful for answering basic research questions in social science, and is also useful for basic research questions about networks as a unique field of inquiry (see next section). I think it is easy to make the case that network science is useful for pursuing basic understanding of intellectual research questions.
But when I was asked this question at Bowdoin, I really think the person was asking about the applied or public usefulness of network science. A colleague at Maine suggested that the media and public rely on two “frames” for translating science, the “oh wow that’s cool” frame and the “this will make society better off” frame. Networks have done pretty well with the “oh wow” frame, with work like Fowler and Christakis suggesting that your friends make you fat, and things like “small world” networks that seem to explain why you end up knowing people in random places (at least that is how the public thinks about it).
The “better off” frame is maybe a little harder, although other network scientists might disagree. But here are a few potential “better off” factoids, which I could dig up citations for if I had time, but I won’t have time so I won’t do it! So call the following my “best professional judgment” about some uses of network science for social behavior:
•Identifying central individuals and brokers who can help spread ideas and behaviors.
•Identifying disconnected individuals who need to be brought into social communities.
•Identifying key social relationships that should be established in order to integrate diverse communities.
•Restructuring organizations to enhance economic performance.
•Developing best practices for leveraging social influence to improve human health and welfare, including avoiding disease.
•Early warning systems for disease outbreak and other good or bad contagion processes.
•Identifying bad guys—whether you think the security agencies are the evil empire or American heroes, they have used network tools to achieve national security goals. I don’t have to make any moral judgments about the tool users to make a claim that the tool has some usefulness.
•Intervening in governance systems and policies in order to make them more effective at solving social problems, or achieving other normative goals like democracy, fairness etc. This includes providing policy-makers with network-smart best practices (self-promotion: this is where my work is focused, and I still think we have a long way to go here).
•Providing indicators for empirical measurement of the effectiveness of policies and social programs that aim to build communities of various types.
I’m sure my social science colleagues will greatly expand this list, and the bio-physical scientists will only add their own big list.
Is network science a real “science”?
To answer this question, first we have to decide what “science” is. I refer you to Thomas Kuhn, Karl Popper, and the field of epistemology for real answers; for now I will give my own hack definition. A “science” or scientific discipline is a collection of core research questions, competing theoretical frameworks/hypotheses, and methodological approaches. Of course these things are always in flux, but among any community of scientists there is some core set of questions, theories, and methods that most people will recognize. Once the communities and associated knowledge have coalesced enough, a scientific discipline will acquire its own set of institutional arrangements: professional societies, journals, conferences, listservers, Facebook pages, Twitter handles, academic departments, and funding sources.
I definitely think this type of convergence is happening for network science and therefore it qualifies as a very young scientific discipline with the potential to emerge into an enduring scientific discipline. Just a decade ago, the study of networks remained fragmented across many different disciplines including sociology, political science, economics, physics, mathematics, and computer science among others. Now the people in those disciplines who work on networks are in frequent communication. Core questions about the structure and function of networks are beginning to emerge. Methodological and theoretical approaches are cross-fertilizing. It is happening very fast too—so fast I wonder if any other science has witnessed such a rapid evolution. There is a new journal called Network Science, the term is being used by most people, and it has many of the other institutional accoutrements of a scientific discipline.
Where is the political science (insert your home discipline here) in network science?
When I gave my presentations at Maine, one of the political scientists in the audience had a hard time recognizing the political questions embedded in the work. Of course I thought the political science was obvious, because the work focuses on public policy, governance of collective action problems, and involves lots of different Federal, state, and local bureaucracies, interest groups etc. In her presidential address to the American Political Science Association, Lin Ostrom wrote that collective-action and institutions are the core question of the discipline and of course I agree (Jane Mansbridge’s 2013 APSA presidential address reiterated this point).
But some of the more traditional disciplinary political scientists only recognize their discipline when they hear words like “presidents”, “courts”, “elections”, “voting” etc. All disciplines have their purists (probably necessary from an epistemology standpoint but it sure is annoying…) and therefore it is the responsibility of people working in an interdisciplinary area like network science to demonstrate its usefulness to political science. We have to make it explicit and obvious, and it would be naïve for us to expect the disciplinary purists to make the realization on their own. They are the status quo and have no incentive to change their views. So as a sub-community within the discipline, the political scientists who work on networks have to continue to press the case, or network science may be viewed as another fad. I think this would be really sad (rhyme..nice), because in my opinion networks are enduring features of social systems and we are missing a huge part of the story if we don’t study them.
A related issue was recently brought up by John Padgett (U of Chicago) at the 2013 business meeting of the political networks section in APSA. Padgett, who is famous for his analysis of social networks in Renaissance Italy, hypothesized that much of the discipline viewed political networks as a “method” or “technique”, rather than a fundamental scientific approach that could crack open core questions in political science. I believe this image needs to change, and again it is the responsibility of network scientists to do it.
Note there is an uneasy tension here between network science as a separate discipline, versus an approach that is applied to political science questions. As a stand-alone discipline, network science needs to have its own core questions, theories and methods; these can easily exist (and already do) without ever mentioning the word politics. Call this basic research if you want. But when these ideas are “applied” to political science, then it becomes harder to claim that network science is a stand-alone discipline rather than just a branch of political science.
Well I’ve written something that epistemologists and logicians would probably nail me for. But I think these three questions are important. If you don’t like my answers, write your own, or let’s try to figure out as a community how to answer them.

Friday, 20 September 2013

Finding the particular critical mass needed for a viral process in networks

US military scientists solve the fundamental problem of viral marketing

Network theorists working for the US military have worked out how to identify the small "seed" group of people who can spread a message across an entire network...





Viral messages begin life by infecting a few individuals and then start to spread across a network. The most infectious end up contaminating more or less everybody.

How and why this happens is the subject of much study and debate. Network scientists know that the key factors are the rate at which people become infected, the "connectedness" of the network and how the seed group of individuals, those who become infected first, is linked to the rest.

It is this seed group that fascinates everybody, from marketers who want to sell Viagra to epidemiologists who want to study the spread of HIV.

So a way of finding seed groups in a given social network would certainly be a useful trick, not to mention a valuable one. Step forward Paulo Shakarian, Sean Eyre and Damon Paulo from the Network Science Center at the US Military Academy at West Point.

These guys have found a way to identify a seed group that, when infected, can spread a message across an entire network. And they say it can be done quickly and easily, even on relatively large networks.

Their method is relatively straightforward. It is based on the idea that an individual will eventually receive a message if a certain proportion of his or her friends already have that message. This proportion is a critical threshold and is central to their approach.

Having determined the threshold, these guys examine the network and look for all those individuals who have more friends than this critical number. They then remove those who exceed the threshold by the largest amount.

In the next step, they repeat this process, looking for everyone who has more friends than the critical threshold and pruning those with the greatest excess. And so on.

This process ends when there is nobody left in the network who has more friends than the threshold. When this happens, whoever is left is the seed group. A message sent to each member of this group can and should spread to the entire network.
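The pruning steps described above can be sketched as follows. This is a minimal illustration of the procedure as this article describes it, not the authors' published code, and the exact selection rule in the paper may differ. The assumptions are ours: the network is undirected, every person shares one absolute threshold, a person can be pruned while they still have at least `threshold` friends remaining, and the toy network and function names are hypothetical.

```python
def find_seed_set(adjacency, threshold):
    """Prune people until nobody left has `threshold` or more remaining friends."""
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    while True:
        removable = [n for n, nbrs in remaining.items() if len(nbrs) >= threshold]
        if not removable:
            break
        # Prune the person with the greatest excess (ties broken by name).
        victim = max(removable, key=lambda n: (len(remaining[n]), n))
        for nbr in remaining.pop(victim):
            remaining[nbr].discard(victim)
    return set(remaining)


def spread(adjacency, seeds, threshold):
    """Simulate the tipping process: a person activates once at least
    `threshold` of their friends are active."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, nbrs in adjacency.items():
            if node not in active and len(nbrs & active) >= threshold:
                active.add(node)
                changed = True
    return active


# Hypothetical undirected friendship network.
network = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}

seeds = find_seed_set(network, threshold=2)
reached = spread(network, seeds, threshold=2)
print(sorted(seeds), sorted(reached))
```

The guarantee follows from replaying the pruning in reverse: each pruned person had at least `threshold` friends still in the network when removed, and all of those friends are active by the time that person is reconsidered. Nothing in this procedure makes the seed group the smallest possible one.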

That is an approach that tackles a well-known problem. What has bogged network scientists down in the past is that they have always framed this puzzle in terms of finding the smallest seed group. Proving that the group you have found really is the smallest is a tricky problem.

But Shakarian and co make no such claim. "We present a method guaranteed to find a set of nodes that causes the entire population to activate, but which is not necessarily of minimal size," they say.

That suddenly makes the problem much easier. Indeed, these guys have tested it on a large number of networks to see how well it works. Their test networks include Flickr, FourSquare, Friendster, Last.FM, Digg (from December 2010), Yelp, YouTube, etc.

And the algorithm works well. "On a Friendster social network consisting of 5.6 million nodes and 28 million edges we found a seed set in under 3.6 hours," say Shakarian and co. For this they used an Intel Xeon X5677 processor running at 3.46 GHz with a 12 MB cache, running Red Hat Enterprise Linux version 6.1 and equipped with 70 GB of physical memory.

That is a promising result and one that many people will find valuable. Shakarian and co say that by using their method to find a seed group in the online social network FourSquare, a viral marketer could expect a return of 297 times the investment. Not bad!

For this reason, Shakarian and co could, and probably will, find themselves and their algorithm in demand from the legion of marketers who want to make their products go viral. Not least among them could be big internet companies such as Amazon and Apple, both of which have large networks of customers and plenty of products to sell.

Expect to hear more about this!

Ref: arxiv.org/abs/1309.2963 : A Scalable Heuristic for Viral Marketing Under the Tipping Model

Wednesday, 18 September 2013

Production networks: Australia's input-output matrix

The clusters of Australia's Input-Output Matrix


Made with MapEquation using data processed in NodeXL

Graph type: Directed
Vertices: 106
Unique edges: 10042
Total edges: 10042
Self-loops: 102
Reciprocated vertex pair ratio: 0.810564663
Reciprocated edge ratio: 0.895372233
Connected components: 1
Single-vertex connected components: 0
Maximum vertices in a connected component: 106
Maximum edges in a connected component: 10042
Maximum geodesic distance (diameter): 2
Average geodesic distance: 1.003916
Graph density: 0.893081761

Source: author's own elaboration.
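As a quick sanity check on the table, the reported graph density can be reproduced from the vertex and edge counts. The formula below is an assumption on our part (directed density with self-loops excluded from the edge count), chosen because it agrees with the reported figure; `directed_density` is a hypothetical helper name.

```python
# Figures taken from the table above.
vertices = 106
total_edges = 10042
self_loops = 102


def directed_density(n, edges, loops):
    """Share of the n * (n - 1) possible directed edges that are present,
    ignoring self-loops."""
    possible = n * (n - 1)
    return (edges - loops) / possible


density = directed_density(vertices, total_edges, self_loops)
print(round(density, 9))  # 0.893081761, matching the table
```

With almost nine in ten possible links present, the input-output network is close to complete, which is consistent with the diameter of 2 and the average geodesic distance of roughly 1.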