Thursday, October 3, 2013

Software: Commetrix CMX for dynamic social networks

Commetrix CMX Analyzer: Dynamic Social Network Visualization
Commetrix CMX Analyzer is a social network analysis platform from the German company Trilexis (www.trilexis.com), which originated in a research group at the Technical University of Berlin. (Note: the website, user interface and documentation are all in English.) What is interesting about this particular tool is its emphasis on the dynamics of social interactions over time. It achieves this through a data format that captures information about each individual link event, including not only originator, destination and time but also user-specified attributes such as communication mode (email, IM, Twitter), type of exchange (social, work, ecommerce) and topic (e.g. keywords extracted from the subject).
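To make the idea of a per-event data format concrete, here is a minimal Python sketch. It is not the actual Commetrix file format, which is not described in this post; the `LinkEvent` class, its field names and the helper functions are hypothetical names chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LinkEvent:
    """One communication event (hypothetical structure, not the CMX format)."""
    originator: str
    destination: str
    timestamp: datetime
    # Free-form, user-specified attributes, e.g. mode, exchange type, topic.
    attributes: dict = field(default_factory=dict)


def events_per_month(events):
    """Count link events per (year, month) so activity can be followed over time."""
    counts = Counter()
    for event in events:
        counts[(event.timestamp.year, event.timestamp.month)] += 1
    return dict(counts)


def with_attribute(events, key, value):
    """Keep only the events whose attribute `key` equals `value`."""
    return [e for e in events if e.attributes.get(key) == value]
```

Because every event keeps its own timestamp and attributes, the same list can be sliced by time, by communication mode or by topic without rebuilding the network.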

Commetrix CMX Analyzer User Interface

A small subset of the Enron Email dataset is provided for demonstration purposes (from the size and the individuals referenced, we are guessing it comes from a single custodian). Part of our interest in this particular software is that we are familiar with the Enron dataset and had researched it using the social network analysis functionality of an eDiscovery system called MetaLINCS. We were curious to see what additional insights CMX Analyzer might provide.

CMX Analyzer is a desktop tool built in Java, incorporating the 3D graphical capabilities of Java 3D and the Java Media Framework. Once we had obtained the license key, the application was straightforward to install, and it comes with a user guide. To date we have only been able to try it out on the sample data set provided, as the process of creating new data sets requires end-user coding (of link attributes) followed by a data transformation process that requires either a separate tool (Commetrix Producer) or sending the data to Trilexis for processing by their systems.

Commetrix Data Preparation Process:

Commetrix is not as functionally or visually rich as some of the other tools we have investigated and reported on in previous blogs (e.g. Gephi, NodeXL). However, where it comes into its own is in the dynamic visualization of email communications over time. The MetaLINCS software we had used in the past provided a “time-slider” but was essentially a “snapshot” approach. Commetrix has time-sliders too, but it also animates the traffic, creating a unique perspective on what is, after all, a time-based series of events. (We should also warn readers that the resulting animations make for highly addictive viewing. We were totally captivated!) The start and end of the time period can be set, as can the intervals and speed of animation. It is also possible to run the time line backwards as well as forwards. This makes it possible to identify “hot spots” of communication activity between group subsets at particular points in time. For other types of communication, e.g. Twitter or Facebook, we can see how this would provide valuable insight into the evolution of a topic of discussion or a social group.
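The animation idea can be approximated as a sequence of time windows, each with its own traffic counts. The sketch below is only an illustration of that principle and not how Commetrix is implemented; the function names and the `(sender, recipient, timestamp)` tuple layout are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def animation_frames(events, start, end, step):
    """Split [start, end) into windows of length `step` and count traffic per pair.

    `events` is an iterable of (sender, recipient, timestamp) tuples.
    Returns a list of (window_start, Counter) frames in chronological order.
    """
    events = list(events)
    frames = []
    t = start
    while t < end:
        window_end = min(t + step, end)
        counts = Counter(
            frozenset((s, r)) for s, r, ts in events if t <= ts < window_end
        )
        frames.append((t, counts))
        t = window_end
    return frames


def hot_spot(frames):
    """Return (window_start, pair, count) for the busiest pair in any frame."""
    best = None
    for window_start, counts in frames:
        for pair, n in counts.items():
            if best is None or n > best[2]:
                best = (window_start, pair, n)
    return best
```

Playing the frames in order corresponds to running the time line forwards; iterating over `reversed(frames)` corresponds to running it backwards.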

Snapshot of Communications: Jan 2000

Snapshot of Communications: Dec 2000
Visually, Commetrix is more limited than some of the other packages we have used; for example, it is not possible to pan or zoom. There are options to change node size and color to represent parameters such as communications sent, communications received and number of direct contacts. Color schemes cannot be chosen directly but can be set to show selected attributes. For example, the following screenshot shows nodes color coded by the ‘function’ attribute, where dark blue represents employees, pale blue represents directors, green represents traders, wholly purple circles represent managers and purple circles with yellow centers represent in-house lawyers. (Note: we found the use of full and semi-colored circles to be somewhat confusing.)

Color Coding by Function

Included in Commetrix is an “egoview” option which allows you to select a particular node and investigate communication to and from that individual node. Links can be filtered to include only direct communications (a 1-step link) or communications involving two or more steps. The image below, for example, shows communications to and from Sara Shackleton. While this capability is helpful for focusing on traffic to and from a node, it has limited value for email data drawn from only one custodian. Used on any node other than that custodian, the egoview will show only those communications that happen to have been referenced in emails sent to and from the primary custodian, i.e. it is an imperfect sample.

Screenshot Showing Ego View - Tana Jones
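The difference between a 1-step and a 2-step ego view can be sketched as a breadth-first search that stops after a fixed number of steps. This is an illustrative Python sketch, not Commetrix code; `ego_network` and its parameters are hypothetical names.

```python
from collections import deque


def ego_network(edges, ego, max_steps=1):
    """Return {node: steps_from_ego} for every node within `max_steps` of `ego`.

    `edges` is an iterable of (a, b) pairs treated as undirected links.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the requested number of steps
        for other in neighbours.get(node, ()):
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return distance
```

With `max_steps=1` only direct correspondents are returned; with `max_steps=2` the correspondents of those correspondents are included as well.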

Commetrix also comes with a Keyword filter. The intent is to allow the user to focus on interactions “about” the selected keywords. The interface is less obvious than some of the other areas, and we confess to wondering if there was a bug until, on rereading the manual, we realized that “In” didn’t mean “inbound” but “include”, and “Out” meant “exclude”. Selecting the terms was also rather tedious as it meant scrolling through a long list of options. To validate the filtering, we took ‘california’ related terms and looked to see if Jeff Dasovich was included, which he is (see screenshot below). It would be interesting to see this concept better developed with better keyword lists, more complex keyword filtering options and possibly the employment of automated topic determination techniques such as keyword clustering.

Screenshot Showing Use of Keyword Filter
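The “In”/“Out” semantics described above (include versus exclude) can be sketched as follows. This is a guess at the behaviour based on our reading of the manual, not the tool's actual implementation; the message layout with a `"keywords"` list is an assumption.

```python
def keyword_filter(messages, include=(), exclude=()):
    """Keep messages matching any `include` keyword and no `exclude` keyword.

    Each message is a dict with a "keywords" list (hypothetical layout).
    An empty `include` means "no include restriction".
    """
    include = {k.lower() for k in include}
    exclude = {k.lower() for k in exclude}
    kept = []
    for msg in messages:
        words = {w.lower() for w in msg["keywords"]}
        if include and not (words & include):
            continue  # "In" list given, but none of its terms are present
        if words & exclude:
            continue  # at least one "Out" term is present
        kept.append(msg)
    return kept
```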
Although the Enron data set was provided only for demo purposes, having worked with this data we were curious about two things. Firstly, how were the keywords derived? We guessed email subject, but some of the keywords were email domains (indicating other metadata might have been used as well), some phrases had been concatenated (e.g. ‘californiaattached’) or included a leading article (e.g. ‘thenumber’), and some were word fragments (e.g. ‘t’, ‘e’). Secondly, and more importantly, how were the “identities” of the individuals represented by the nodes resolved? This is always a major issue in email communications if the only information about senders and recipients is an email address. Most individuals have multiple email addresses, even within companies, and the names on email addresses may be difficult to resolve to a single individual. We raise this question because MetaLINCS included functionality that attempted to link individuals with their email accounts based not only on email address but also on communications patterns. Even then, many individuals/email accounts that a human would identify as probably being connected could not be automatically linked. We are guessing that the identity of individuals was manually coded, since the node table has a clean one-to-one mapping between individuals and a single email address.

In summary, while we think some of the other software we have used and researched offers better social network visualization options, we really liked the time-line animation Commetrix provides and believe it could be very helpful when studying the evolution of a network or communication patterns over time. While the keyword filtering option was disappointing in both the implementation and the demo dataset provided, we think it has obvious potential, particularly when analyzing large data sets of email, IM and Twitter, in enabling users to focus in on only those communications “about” a particular topic. Of course, with that come all the provisos of using keywords as a substitute for “aboutness”, but if it were combined with stemming, a better stop word list, and some form of thesaurus (to apply synonyms automatically) it would be very powerful.

Chroma Scope

Wednesday, October 2, 2013

The demographics of digital social networks

Social Media Demographics: The Surprising Identity Of Each Major Social Network



Each social media platform has cultivated a unique identity thanks to the demographics of the people who participate in the network. Some platforms are preferred by young adults, who are most active in the evening, others by high-income professionals, who are posting throughout the workday.
We explained in a recent report why many brands and businesses need platform-focused social media strategies, rather than a diluted strategy that aims to be everywhere at once.
In a new report from BI Intelligence, we break down the demographics of each major social media platform to help brands and businesses decide which networks they should prioritize. Being able to identify the demographics of social media audiences at a granular level is the basis for all targeted marketing and messaging. The report also spotlights the opportunities that lie ahead for each social network, how demographics affect usage patterns, and why some platforms are better for brands than others. 
Here are some of our surprising findings: 
  • Facebook still skews young, but the 45- to 54-year-old age bracket has seen 45% growth since year-end 2012. Among U.S. Internet users, 73% with incomes above $75,000 are on Facebook (compared to 17% who are on Twitter). Eighty-six percent of Facebook's users are outside the U.S.
  • Instagram: Sixty-eight percent of Instagram's users are women.
  • Twitter has a surprisingly young user population for a large social network — 27% of 18 to 29-year-olds in the U.S. use Twitter, compared to only 16% of people in their thirties and forties. 
  • LinkedIn is international and skews toward male users. 
  • Google+ is the most male-oriented of the major social networks. It's 70% male.
  • Pinterest is dominated by tablet users. And, according to Nielsen data, 84% of U.S. Pinterest users are women.
  • Tumblr is strong with teens and young adults interested in self-expression, but only 8% of U.S. Internet users with incomes above $75,000 use Tumblr.

Business Insider

Sunday, September 29, 2013

Visualizing email communications using NodeXL

Visualizing Email Communications using NodeXL

Email has become an integral part of communication in both the business and personal spheres. Given its centrality, it is surprising how few tools are generally available for analyzing it outside specialist areas such as Early Case Assessment tools within the litigation area, Xobni being a notable exception at the individual level. However, the rise of social network analysis, and the tools that support it, may change that. Graph theory is remarkably neutral as to whether it is applied to Facebook Friend networks or email communications within a Sales and Marketing division.

In a previous post, we reported on using Gephi, an open source tool for graphing social networks, to visualize email communications. In this post, we look at using NodeXL for the same purpose. We used the same email data set as before, the ‘Godfather Sample’, in which an original email data set was processed to extract the metadata (e.g. sender, recipient, date sent, subject) and subsequently anonymized using fictional names.

NodeXL is a free and open source template for Microsoft Excel 2007 that provides a range of basic network analysis and visualization features intended for use on modest-sized networks of several thousand nodes/vertices. It is targeted at non-programmers and builds upon the familiar concepts and features within Excel. Information about the network, e.g. node data and edge lists, is all contained within worksheets. 


Data can be loaded simply by cutting and pasting an edge list from another Excel worksheet, but there is also a wide range of other options, including the ability to import network data from Twitter (Search and User networks), YouTube and Flickr, and from files in GraphML, Pajek and UCINET Full Matrix DL file formats. There is also an option to import directly from an Email PST file, which we will discuss in a following post. In addition to the basics of an edge list, attribute information can be associated with each edge and node. In our “Godfather” email sample, we added a weighting for communication strength (i.e. the number of emails between the two individuals) to each edge and the affiliation with the Corleone family to each node.
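For readers preparing their own data, the weighted edge list described above can be derived from sender/recipient metadata with a few lines of Python. This is a sketch of one possible preparation step, not part of NodeXL; the function name and the undirected treatment of pairs are our assumptions.

```python
from collections import Counter


def weighted_edge_list(emails):
    """Turn (sender, recipient) pairs into (person_a, person_b, email_count) rows.

    Direction is ignored, so A->B and B->A both add to the same edge weight.
    """
    counts = Counter()
    for sender, recipient in emails:
        if sender == recipient:
            continue  # skip self-addressed mail
        pair = tuple(sorted((sender, recipient)))
        counts[pair] += 1
    return [(a, b, n) for (a, b), n in sorted(counts.items())]
```

The resulting rows can be pasted into a worksheet as two vertex columns plus a weight column.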

Once an edge list has been added, the vertices/node list is automatically created and a variety of graphical representations can be produced, depending on the layout option selected (Fruchterman-Reingold is the default, but Harel-Koren Fast Multiscale as well as Grid, Polar, Sugiyama and Sine Wave options are also available) and on how data attributes are mapped to the visual properties of nodes and edges. For example, in the graph shown below, nodes were color coded and sized with respect to the individual’s connections with the Corleone family: blue for Corleone family members, green for Corleone allies, orange for Corleone enemies and pink for individuals with no known associations with the family.



The width of the edges/links was then set to vary in relation to the degree of communication between the two nodes i.e. the number of emails sent between the two individuals concerned.


Labels can be added to both nodes and links showing either information about the node/link or its attributes, as required.






Different graph layout options are available which may be used to generate alternative perspectives and/or easier to view graphs.

Harel-Koren Layout


Circle Layout


Because even a small network can generate a complex, dense graph, NodeXL has a wide range of options for filtering and hiding parts of the graph, the better to elucidate others. The visibility of an edge/vertex, for example, can be linked to a particular attribute, e.g. degree of closeness. We found the dynamic filters particularly useful for rapidly focusing on areas of interest without altering the properties of the graph itself. For example, in the following screenshot we are showing only those links where the number of emails between the parties is greater than 40. This allows us to focus on individuals who have been emailing each other more frequently than the average.
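The effect of such a threshold filter is easy to reproduce outside NodeXL. The sketch below is only an illustration of the idea: it returns the edges and nodes that remain visible and leaves the underlying edge list untouched. The function name and the sample weights in the usage line are hypothetical.

```python
def apply_dynamic_filter(edges, min_emails):
    """Return (visible_edges, visible_nodes) for edges with more than `min_emails`.

    `edges` is a list of (person_a, person_b, email_count) rows and is not modified.
    """
    visible_edges = [(a, b, n) for a, b, n in edges if n > min_emails]
    visible_nodes = sorted({name for a, b, _ in visible_edges for name in (a, b)})
    return visible_edges, visible_nodes
```

For example, `apply_dynamic_filter(rows, 40)` mirrors the "greater than 40" setting used for the screenshot.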


In addition to graphical display, NodeXL can be used to calculate key network metrics including: Degree (the number of links on a node and a reflection of the number of relationships an individual has with other members of the network) with In-Degree and Out-Degree options for directed graphs, Betweenness Centrality (the extent to which a node lies between other nodes in the network and a reflection of the number of people an individual is connecting to indirectly), Closeness Centrality (a measure of the degree to which a node is near all other nodes in a network and reflects the ability of an individual to access information through the "grapevine" of network members) and Eigenvector Centrality (a measure of the importance of an individual in the network). In an analysis of email communications, these can be used to identify the degree of connectedness between individuals and their relative importance in the communication flow. 
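For readers who want to see what lies behind these numbers, here is a compact pure-Python sketch of three of the metrics for an undirected, unweighted graph. It is not NodeXL's code and its normalization may differ from NodeXL's output; betweenness uses the well-known Brandes accumulation scheme, and all function names are our own.

```python
from collections import deque


def build_adjacency(edges):
    """Build {node: set(neighbours)} from undirected (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj):
    """Number of direct links per node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def closeness(adj):
    """(reachable nodes) / (sum of shortest-path lengths to them), per node."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for other in adj[node]:
                if other not in dist:
                    dist[other] = dist[node] + 1
                    queue.append(other)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness(adj):
    """Unnormalized betweenness: shortest paths between other pairs through a node."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        order, dist = [], {source: 0}
        sigma = {node: 0 for node in adj}      # number of shortest paths from source
        preds = {node: [] for node in adj}     # predecessors on shortest paths
        sigma[source] = 1
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: s / 2 for node, s in score.items()}
```

In a star-shaped network the hub gets the highest value on all three measures, which matches the intuition that a highly connected individual also brokers the communication between everyone else.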

For example, in our Godfather sample, we have sized the nodes in the graph below by their Degree Centrality. While Vito Corleone is, as expected, shown to be highly connected, Ritchie Martin, an individual not thought to have business associations with the Corleone family, is shown to be more connected than supposed.

Nodes Sized by Degree Centrality


When we look at the same data from the perspective of betweenness, we see that Vito, Connie and Ritchie all have a high degree of indirect connections.

Nodes Sized by Betweenness Centrality


And the Eigenvector Centrality measure confirms Vito Corleone's significance in the network, as well as that of Connie, two "allies" (Hyman Roth and Salvatore Tessio) and two other individuals, including Ritchie Martin.

Nodes Sized by Eigenvector Centrality


Last but not least, it is also possible to use NodeXL to visualize clusters of nodes to show or identify subgroups within a network. Clusters can be added manually or generated automatically. Manually creating clusters requires first assigning nodes to an attribute or group membership and then determining the color and shape of the nodes for each subgroup/cluster. In our Godfather example, we used “Family” affiliation to create clusters within the network, but equally one could use organization/company, country, language, date etc.
"Family Affiliation" Clusters Coded by Node Color

Selected Cluster (Corleone Affiliates)

NodeXL will also generate clusters automatically using a clustering algorithm developed specifically for large scale social network analysis which works by aggregating closely interconnected groups of nodes. The results for the Godfather sample are shown below. We did not find the automated clustering helpful but this is probably a reflection of the relatively small size of the sample. 
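The post does not name the clustering algorithm, so the following is only a related illustration: many methods that aggregate closely interconnected nodes judge a grouping by its modularity, i.e. how many links fall inside groups compared with what random wiring would give. Assuming that framing, a partition of an undirected graph can be scored in Python like this (the function name and inputs are hypothetical):

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (a, b) pairs; `community_of` maps node -> community label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / (2 * m)) ** 2.
    """
    m = len(edges)
    internal = {}     # edges with both ends in the same community
    degree_sum = {}   # total degree of the nodes in each community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

A score near zero means the grouping is no better than chance, which is one plausible reason automated clusters can look unconvincing on a small sample.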

In the next post, we will look at importing email data directly into NodeXL and compare approaches based on analyzing processed vs unprocessed email data. 

Larger Email Network Visualization

To download NodeXL, go to http://nodexl.codeplex.com//. We would also recommend working through the NodeXL tutorial, which can be downloaded from: http://casci.umd.edu/images/4/46/NodeXL_tutorial_draft.pdf


A top-level overview of social network analysis and the basic concepts behind graph metrics can be found on Wikipedia, e.g. http://en.wikipedia.org/wiki/Social_network and http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality

Chroma Scope

Friday, September 27, 2013

Facebook becomes the largest social network in history

Facebook maps 1,110,000,000 friendships to show off the social network's global reach
Founder and CEO Mark Zuckerberg shares the image on his personal Facebook page
James Vincent -  The Independent



Facebook has released an updated visualization showing the global connections between its users.

With more than 1.11 billion people signed up to Facebook worldwide, the result is a map of bright blue arcing lines connecting each user's geographic location with those of their friends.

Facebook founder and CEO Mark Zuckerberg posted the new visualization as the cover photo of his profile page, commenting: "This is a map of all of the friendships formed on Facebook across the world."

The first such image was created by an intern at the social network in 2010, when the site had 500 million users: "I was interested in seeing how geography and political borders affected where people lived relative to their friends. I wanted a visualization that would show which cities had a lot of friendships between them."

"When I shared the image with other people within Facebook, it resonated with a lot of people," the map's creator said at the time. "It's not just a pretty picture, it's a reaffirmation of the impact we have in connecting people, even across oceans and borders."

Even these users are not enough for Facebook, however: last month the company announced a new initiative called internet.org, a global consortium of companies that aims to increase access to the web around the world.

"There are huge barriers in developing countries to connecting and joining the knowledge economy," Zuckerberg said at the launch of the project. "Internet.org brings together a global partnership that will work to overcome these challenges, including making internet access available to those who cannot currently afford it."

Projects of this kind, led by technology companies and often marketed as charitable initiatives, have not received universal praise.

In response to Google's somewhat similar Project Loon (a plan to extend web access using high-altitude weather balloons), Bill Gates commented in an interview with Businessweek that "when a kid gets diarrhea, no, there's no website that relieves that."

Wednesday, September 25, 2013

Homophily: Introduction

Homophily

Homophily (i.e., "love of the same") is the tendency of individuals to associate and bond with similar others. The presence of homophily has been found in a wide range of network studies. More than 100 studies have observed homophily in one form or another and established that similarity breeds connection. [1] The dimensions of similarity include age, gender, class and institutional role. [2]

This is often expressed in the adage "birds of a feather flock together".

People in homophilic relationships share common characteristics (beliefs, values, education, etc.) that make communication and relationship formation easier. Homophily often leads to homogamy: marriage between people with similar characteristics. [1]


Types of homophily

In their original formulation of homophily, Lazarsfeld and Merton (1954) distinguished between status homophily and value homophily. Status homophily means that individuals with similar social status characteristics are more likely to associate with each other than chance would predict. Value homophily refers to the tendency to associate with others who think in similar ways, regardless of differences in status. [3]

To test the relevance of homophily, researchers have distinguished between baseline homophily and inbreeding homophily. The former is simply the amount of homophily that would be expected by chance, and the latter is the amount of homophily over and above this expected value.

The opposite of homophily is heterophily.
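The distinction can be made concrete with a small calculation. The sketch below uses one simple way to quantify it, which is our assumption rather than a formula from the cited papers: if ties were formed at random in proportion to group sizes, the expected share of same-group ties is the sum of the squared group proportions; the excess of the observed share over that value is then the inbreeding component. The function name and inputs are hypothetical.

```python
from collections import Counter


def homophily(edges, group_of):
    """Return (baseline, observed, inbreeding) same-group tie shares.

    `edges` is a non-empty list of (a, b) ties; `group_of` maps person -> group.
    baseline   = sum of squared group proportions (random-mixing assumption)
    observed   = share of ties that join two members of the same group
    inbreeding = observed - baseline
    """
    sizes = Counter(group_of.values())
    n = sum(sizes.values())
    baseline = sum((size / n) ** 2 for size in sizes.values())
    same = sum(1 for a, b in edges if group_of[a] == group_of[b])
    observed = same / len(edges)
    return baseline, observed, observed - baseline
```

With two equally sized groups the baseline is 0.5, so an observed same-group share of two thirds would leave an inbreeding component of about 0.17.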

References


  1. McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). "Birds of a Feather: Homophily in Social Networks". Annual Review of Sociology. 27: 415-444.
  2. Retica, Aaron (10 December 2006). "Homophily". New York Times.
  3. Lazarsfeld, P. F. and Merton, R. K. (1954). "Friendship as a Social Process: A Substantive and Methodological Analysis". In Freedom and Control in Modern Society, Morroe Berger, Theodore Abel, and Charles H. Page, eds. New York: Van Nostrand, 18-66.

Sunday, September 22, 2013

Networks: Are they useful? Is it real science? Do they provide answers for political science?

Three Hard Questions about Network Science

Mark Lubell - Center for Environmental Policy and Behavior


I just returned from a nice junket to beautiful (at least in the fall…) Maine where I gave talks at Bowdoin College and University of Maine. Both institutions were impressive for different reasons, and I met a lot of fun people. The Q&A periods of the talks highlighted three important and hard questions about network science that I think all of us should be thinking about how to answer.
A first caveat is that these three questions are interrelated and at times there will be overlap in this write-up. A second caveat is that I will be answering these mostly from the perspective of environmental policy, policy science, political science, and social science more broadly. But the questions apply equally to the physical and biological sciences, and I will be forwarding this blog around in an effort to start a dialog. Who knows, there might be a collaborative paper in this somewhere. A third caveat is that this blog is going to be long—sorry. Bottom-line, for those who trust me and don’t want to read anymore: Network science is a real science that is useful to society and can answer core questions about social systems and behavior.
Why is network science useful?
To answer this question, first we have to decide what “useful” means. Network science can be useful for answering basic research questions in social science, and is also useful for basic research questions about networks as a unique field of inquiry (see next section). I think it is easy to make the case that network science is useful for pursuing basic understanding of intellectual research questions.
But when I was asked this question at Bowdoin, I really think the person was asking about the applied or public usefulness of network science. A colleague at Maine suggested that the media and public rely on two “frames” for translating science, the “oh wow that’s cool” frame and the “this will make society better off” frame. Networks have done pretty well with the “oh wow” frame with work like Fowler and Christakis suggesting that your friends make you fat, and things like “small world” networks that seem to explain why you end up knowing people in random places (at least that is how the public thinks about it).
The “better off” frame is maybe a little harder although other network scientists might disagree. But here are a few potential “better off” factoids, which I could dig up citations for if I have time but I won’t have time so I won’t do it! So call the following my “best professional judgment” about some uses of network science for social behavior:
•Identifying central individuals and brokers who can help spread ideas and behaviors.
•Identifying disconnected individuals who need to be brought into social communities.
•Identifying key social relationships that should be established in order to integrate diverse communities.
•Restructuring organizations to enhance economic performance.
•Develop best practices for leveraging social influence to improve human health and welfare, including avoiding disease.
•Early warning systems for disease outbreak and other good or bad contagion processes.
•Identifying bad guys—whether you think the security agencies are the evil empire or American heroes, they have used network tools to achieve national security goals. I don’t have to make any moral judgments about the tool users to make a claim that the tool has some usefulness.
•Intervening in governance systems and policies in order to make them more effective at solving social problems, or achieving other normative goals like democracy, fairness etc. This includes providing policy-makers with network-smart best practices (self-promotion: this is where my work is focused, and I still think we have a long way to go here).
•Provide indicators for empirical measurement of the effectiveness of policies and social programs that aim to build communities of various types.
I’m sure my social science colleagues will greatly expand this list, and the bio-physical scientists will only add their own big list.
Is network science a real “science”?
To answer this question, first we have to decide what “science” is. I refer you to Thomas Kuhn, Karl Popper, and the field of epistemology for real answers; for now I will give my own hack definition. A “science” or scientific discipline is a collection of core research questions, competing theoretical frameworks/hypotheses, and methodological approaches. Of course these things are always in flux but among any community of scientists there is some core set of questions, theories, and methods that most people will recognize. Once the communities and associated knowledge have coalesced enough, a scientific discipline will acquire its own set of institutional arrangements: professional societies, journals, conferences, listservers, Facebook pages, Twitter handles, academic departments, and funding sources.
I definitely think this type of convergence is happening for network science and therefore it qualifies as a very young scientific discipline with the potential to emerge into an enduring scientific discipline. Just a decade ago, the study of networks remained fragmented across many different disciplines including sociology, political science, economics, physics, mathematics, and computer science among others. Now the people in those disciplines who work on networks are in frequent communication. Core questions about the structure and function of networks are beginning to emerge. Methodological and theoretical approaches are cross-fertilizing. It is happening very fast too—so fast I wonder if any other science has witnessed such a rapid evolution. There is a new journal called Network Science, the term is being used by most people, and it has many of the other institutional accoutrements of a scientific discipline.
Where is the political science (insert your home discipline here) in network science?
When I gave my presentations at Maine, one of the political scientists in the audience had a hard time recognizing the political questions embedded in the work. Of course I thought the political science was obvious, because the work focuses on public policy, governance of collective action problems, and involves lots of different Federal, state, and local bureaucracies, interest groups etc. In her presidential address to the American Political Science Association, Lin Ostrom wrote that collective-action and institutions are the core question of the discipline and of course I agree (Jane Mansbridge’s 2013 APSA presidential address reiterated this point).
But some of the more traditional disciplinary political scientists only recognize their discipline when they hear words like “presidents”, “courts”, “elections”, “voting” etc. All disciplines have their purists (probably necessary from an epistemology standpoint but it sure is annoying…) and therefore it is the responsibility of people working in an interdisciplinary area like network science to demonstrate its usefulness to political science. We have to make it explicit and obvious, and it would be naïve for us to expect the disciplinary purists to make the realization on their own. They are the status quo and have no incentive to change their views. So as a sub-community within the discipline, the political scientists who work on networks have to continue to press the case, or network science may be viewed as another fad. I think this would be really sad (rhyme..nice), because in my opinion networks are enduring features of social systems and we are missing a huge part of the story if we don’t study them.
A related issue was recently brought up by John Padgett (U of Chicago) at the 2013 business meeting of the political networks section in APSA. Padgett, who is famous for his analysis of social networks in Renaissance Italy, hypothesized that much of the discipline viewed political networks as a “method” or “technique”, rather than a fundamental scientific approach that could crack open core questions in political science. I believe this image needs to change, and again it is the responsibility of network scientists to do it.
Note there is an uneasy tension here between network science as a separate discipline, versus an approach that is applied to political science questions. As a stand-alone discipline, network science needs to have its own core questions, theories and methods; these can easily exist (and already do) without ever mentioning the word politics. Call this basic research if you want. But when these ideas are “applied” to political science, then it becomes harder to claim that network science is a stand-alone discipline rather than just a branch of political science.
Well I’ve written something that epistemologists and logicians would probably nail me for. But I think these three questions are important. If you don’t like my answers, write your own, or let’s try to figure out as a community how to answer them.