Sunday, August 11, 2013

Freeman: How centrality moved from sociology to the hard sciences

Going the Wrong Way on a One-Way Street:
Centrality in Physics and Biology*

Linton C. Freeman, lin@aris.ss.uci.edu
University of California, Irvine
Journal of Social Structure

Abstract

When ideas and tools move from one field to another, the movement is generally from the natural to the social sciences. In recent years, however, there has been a major movement in the opposite direction. The idea of centrality and the tools for its measurement were originally developed in the social science field of social network analysis. But currently the concept and tools of centrality are being used widely in physics and biology. This paper examines how and why that "wrong way" movement developed, its extent and its consequences for the fields involved.

Introduction

More than 170 years ago Auguste Comte (1830-1842/1982) defined a hierarchy of the sciences. He claimed that as the oldest science, astronomy belonged at the foundation. Astronomy was followed by physics, chemistry and biology in that order. And finally, he placed sociology (by which he meant what we now call social science) at the top of his hierarchy.
Comte argued that each of the sciences tends to borrow concepts and tools from those that fall below it. Thus, physics borrows from astronomy; chemistry borrows from physics and astronomy and so on. From this perspective, then, borrowing is a one-way street; it goes from the older, more established sciences to the newer, less established ones.
For the most part, Comte seems to have been right. Most, but not all, of the borrowing apparently has involved social scientists adopting tools and concepts developed by natural scientists. As Table 1 shows, it is quite easy to come up with examples in which social scientists have borrowed from biologists, chemists and physicists, and even from electrical engineers. On the other hand, it is hard to find examples where ideas and tools have moved from social science to biology, chemistry or physics. One example is the graph theoretic concept of "clique." It was defined by Luce and Perry (1949), who were working on a project in social network analysis. And it is used in physics (e. g. Marco, 2007) and biology (Wang, 2008).




Table 1. Some Examples of Applications in Social Science that were Borrowed from Other Sciences. 
(The social science applications are intended only as illustrations; they are not meant to represent the first or the most important applications of the concepts.)


In the present paper, I will describe a recent phenomenon, one in which a good many ideas and tools have moved from social science to the natural sciences. Specifically, I will show that, although they were developed in social science, concepts and tools related to centrality have been adopted and are widely used in physics and particularly in biology.

The Origins of the Idea of Centrality

It has been argued (Holme, Kim, Yoon and Han, 2002) that the notion of centrality was introduced by the eminent French mathematician, Camille Jordan. It is true that Jordan (1869) did propose two procedures for determining the centers of graphs. But his procedures do not correspond to contemporary ideas about centrality.
Jordan's centers depart from current usage in two ways. First, they are restricted to graphs that take the form of trees; they were not defined for more general forms of graphs. And second, Jordan's procedures are essentially categorical. They do not address the issue of measuring the degree of centrality of any node. Instead, each procedure simply picks out one, or at most two, nodes and specifies them as centers.
One of Jordan's centers, for example, involves calculating the geodesic distance between each node and all of its reachable neighbors. From these results, the distance to its farthest neighbor can be determined. This information, which has come to be called the "eccentricity" of a node, has traditionally been used solely to determine which node or nodes display the least eccentricity and are, therefore, the center or centers of the graph.[1]
In contrast, the procedures developed in social science apply to all graphs and they provide measures of centrality for every node in a graph. They first emerged in the late 1940s in the Group Networks Laboratory at MIT. There, Alex Bavelas and his students - particularly Harold Leavitt - conducted a series of experiments on the impact of organizational form on productivity and morale (Bavelas, 1948, 1950; Leavitt, 1951). The experimental variable that they manipulated was the centrality of each experimental subject in the pattern of communication linking them.
The MIT group proposed several procedures for measuring the centrality of nodes in graphs, but the simplest - the one that is still used - is based on the sum of the geodesic (shortest path) distances from each node in a connected graph to all the other nodes in that graph. This measure may be applied, not just to trees, but to any connected graphs. Moreover, it does not simply specify a collection of "centers." Instead, it yields an index of centrality for every node in any connected graph.
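As an illustration, that sum-of-geodesic-distances measure can be sketched in a few lines of Python. The five-node graph below is a hypothetical communication structure, not one of the original MIT experimental patterns, and the networkx library is used only to compute shortest paths.

```python
# A minimal sketch of the MIT-style measure, assuming a small hypothetical graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])

for node in G.nodes:
    # geodesic (shortest path) distances from this node to every other node
    dist = nx.shortest_path_length(G, source=node)
    farness = sum(d for other, d in dist.items() if other != node)
    # a smaller total distance means a more central node; taking the
    # reciprocal turns it into a closeness-style index (larger = more central)
    print(node, farness, round(1.0 / farness, 3))
```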

Centrality in Social Science

Following this first work, a large and confusing array of other proposals for measuring centrality was introduced. I sifted through those proposals and came up with three measures of centrality that together seemed to capture the essential elements in all the earlier work (Freeman, 1979). One was based on the closeness of a node to all the other nodes in a connected graph. The second was based on the degree of a node - the number of others to which it was directly connected.[2] And the third was based on the betweenness of a node. A given node's betweenness is determined by examining the shortest paths linking all other pairs of nodes in the graph and tabulating the number of those paths on which the node in question falls. Both degree centrality and betweenness centrality, then, can be calculated for all the nodes in any graph at all, connected or not.
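All three graph theoretic measures are now available directly in standard network software. The following Python sketch computes them with the networkx library on Zachary's (1977) karate club network, which ships with the library; the normalizations are networkx's defaults rather than the exact formulas of Freeman (1979).

```python
# Degree, closeness and betweenness centrality on a small, well-known network.
import networkx as nx

G = nx.karate_club_graph()  # Zachary's (1977) karate club data

degree = nx.degree_centrality(G)            # share of other nodes each node touches
closeness = nx.closeness_centrality(G)      # inverse of the mean geodesic distance
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through each node

# list the five nodes with the highest betweenness
for v in sorted(betweenness, key=betweenness.get, reverse=True)[:5]:
    print(v, round(degree[v], 3), round(closeness[v], 3), round(betweenness[v], 3))
```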
All this work was grounded in graph theory. But during this same period, another set of procedures, based on matrix algebra, was being developed. Leo Katz (1953) introduced a "status" index based on the successive powers of an affiliation matrix. In the Katz measure, the centrality or status of an individual depends on the whole pattern of ties displayed in the affiliation matrix. Each individual's status depends on the number of ties that individual has to others, the number of ties each of those others has, and so on. Each successive ring of ties makes a diminishing contribution to the status of the original individual, but they all do contribute.
In a paper that was primarily focused on uncovering social groups, Charles Hubbell (1965) extended Katz's status index. And Phil Bonacich (1972, 1987) characterized both the Katz index and the Hubbell index as measures of centrality. He showed, moreover, that both were determined by the first eigenvector of the data matrix.
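The matrix-algebra family can be sketched in the same way. The fragment below sums attenuated powers of a small adjacency matrix to produce a Katz-style status score and compares it with the first-eigenvector measure; the four-person matrix and the attenuation factor are arbitrary choices made purely for illustration.

```python
# Katz-style status and first-eigenvector centrality on a toy matrix.
import numpy as np

# hypothetical 4-person adjacency matrix (symmetric, no self-ties)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

alpha = 0.2  # attenuation: each successive ring of ties counts for less
n = A.shape[0]
# Katz status: sum over k >= 1 of alpha^k * A^k applied to a vector of ones,
# which in closed form is (I - alpha*A)^(-1) * 1 minus the constant term
katz = np.linalg.inv(np.eye(n) - alpha * A) @ np.ones(n) - 1

# Bonacich-style centrality: the eigenvector belonging to the largest eigenvalue
vals, vecs = np.linalg.eig(A)
eigvec = np.abs(vecs[:, np.argmax(vals.real)].real)

print(np.round(katz, 3))
print(np.round(eigvec / eigvec.sum(), 3))
```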
Beginning in about 1980 then, the measures based on closeness, degree, betweenness and the first eigenvector became standard in social network analysis. All four were widely used in the field. But, in 1998, there was a revolution in social network analysis; since then nothing has been the same.

The Revolution

The world of research in social network analysis was changed dramatically when a young physicist-engineer, Duncan Watts, working with a mathematician, Steven Strogatz, published a paper in Nature. Their paper was titled, "Collective dynamics of 'small world' networks" and it took up a topic that traditionally had been a core part of social network research. It introduced a new model that was designed to account for the small world experience in social life.
On the face of it, the Watts and Strogatz paper might be taken as an example of movement from social science to natural science. But the fact is that it drew nothing beyond the pop phrase "small world" from our literature. Although the social network literature about the small world included formal models (e. g. Pool and Kochen, 1978; White, 1970) they were not cited in the Watts and Strogatz article. Indeed, the concepts and tools that were proposed in their paper were all brand new.
Other physicists had already been involved in social network analysis. Notable among these were Derek de Solla Price, Harrison White and Peter Killworth (e. g. Price, 1965, 1976; White, 1970; White, Boorman and Breiger, 1976; Killworth, McCarty, Bernard, Johnsen, Domini and Shelley, 2003; Killworth, McCarty, Bernard and House, 2006). These physicists read the social network literature, joined the collective effort and contributed to an ongoing research process. But Watts and Strogatz did none of these things. Their paper simply took a research topic that had been a part of social network analysis and implicitly redefined it as a topic in physics.
The strange thing about all this is that, apparently, other physicists agreed. Very soon there were more publications about small worlds in physics journals than there were in social science journals. Figure 1 shows the situation in 2003 with respect to publications on the small world theme. Each node is an article and each edge represents a citation linking a pair of articles. Black nodes are articles by physicists, white nodes are articles by members of the social network community and gray nodes are articles by outsiders. In the five years between 1998 and 2003, physicists turned out more publications on the subject than members of the social network community had produced over a period of 45 years. And, as the figure shows, members of the two camps seemed for the most part to avoid citing one another; few citations crossed the boundary between physics and social science.
The physicists were quick to extend the range of their interests to include other topics traditionally associated with research in social network analysis. And, for the most part, they continued to ignore earlier work by social network analysts. Three physicists, Barabasi, Albert and Jeong (1999), for example, published a paper in Physica A in which they examined the distribution of what network analysts had been calling "degree centrality." But they talked only about "degree distributions" and made no mention of centrality. Apparently, they were unaware of related work by network analysts (e. g. Price, 1976).
Two years later Jeong, Mason, Barabasi and Oltvai (2001) published a letter in Nature. Again they were concerned with degree distributions, and although they still did not cite research in social network analysis, this time they did use the word "centrality" in the title of their paper.


Figure 1. Small World Publications circa 2003 (Freeman, 2004, p. 166).

Bridging Social Science and Physics

In 1999 Watts made an attempt to move ideas from physics to social science. By publishing an article in the American Journal of Sociology (Watts, 1999), he introduced his physicist's conception of the small world problem to the social science community. In that article he mentioned centrality and several other social network concepts and tools, but for the most part he disparaged them. He suggested, for example, that the "computational costs" of calculating betweenness centrality were likely to be "prohibitive." Watts' effort was designed to move concepts in the traditional direction. He introduced an idea from physics into the social science literature. But he made no attempt to introduce centrality or any other social science ideas into physics.
The first explicit movement in the other direction, the "wrong way," was made by another physicist, Mark Newman. Newman read Wasserman and Faust's (1994) text on social network analysis and was struck by the potential utility of the notion of betweenness centrality. He used betweenness centrality in a study of collaboration among scientists that was published in a physics journal, Physical Review E (Newman, 2001). In that paper Newman cited my derivation of betweenness (Freeman, 1977).
The physics community was quick to pick up on Newman's article. That same year, Goh, Kahng and Kim (2001) published a paper in Physical Review Letters. They examined the distribution of betweenness centrality and they thanked Newman for calling their attention to the research in social network analysis. By the following year, physicists Holme, Kim, Yoon and Han (2002) had published an article in Physical Review E that reviewed all three graph theoretic measures of centrality - degree, closeness and betweenness - and thereby made them a part of the physics literature.

Centrality in Biology

The first application of centrality ideas to a topic in biology was made by physicists. The letter in Nature by the physicists Jeong, Mason, Barabasi and Oltvai (2001) that was mentioned above did not cite social network literature. But it did apply a degree based measure of centrality to a problem in biology.
Structural biologists themselves also began to use centrality that same year. In 2001 two biologists, Wagner and Fell, published a paper on metabolic networks in the Proceedings of the Royal Society of London, Series B, Biological Sciences. In it, they used degree and closeness as indicators of centrality, but, like Jeong et al., they made no mention of the social network literature on the subject.
The very next year, however, four molecular biophysicists, Vendruscolo, Dokholyan, Paci and Karplus (2002), followed up by publishing an article on proteins in Physical Review E. In it they used betweenness centrality and cited my 1977 paper. Then two systems biologists, Ma and Zeng (2003), followed up a year later when they discussed all three graph theoretic measures.
It was two more years before both the physicists and the biologists adopted Bonacich's first eigenvector as a measure of centrality. A computational biologist, Estrada, and a mathematician, Rodriguez-Velázquez, published a joint paper on centrality in Physical Review E. Their publication was in a physics journal, but their applications were drawn from biology. So, by 2005, all four of the centrality measures from social network analysis had moved - the wrong way - into both physics and biology.
Figure 2 shows the number of articles involving centrality that were published each year in social science and in physics and biology. In social science there was a small surge during the late 1950s and the early 1960s and another beginning in the early 1990s. The former resulted from widespread interest in the MIT experiments by Bavelas and Leavitt, and the latter from the growing use of centrality in studies of management and organizational behavior. The striking feature of Figure 2, however, is the steep growth of centrality publications in physics and biology since the early 2000s. Once it started, research based on centrality in these two fields quickly outpaced that conducted in social network analysis.
This sudden surge in centrality publications in physics and biology raises questions about how centrality is used in those fields. In the next section I will review some of their applications.


Figure 2. The Production of Centrality Literature by Field.

Applications of Centrality in Physics and Biology

Many of the data sets that physicists and biologists have used to look at centrality are immediately recognizable to social network analysts. The data simply are social networks. Newman (2003), for example, studied friendships linking students. And both Newman (2003) and Albert and Barabasi (2002) reported data on human sexual contacts. Holme, Liljeros, Edling and Kim (2003) described contacts among prison inmates, and Holme, Huss and Jeong (2003) as well as Newman (2003) referred to email messages. Newman (2003) and Albert and Barabasi (2002) also talked about telephone calls.
In addition, Girvan and Newman (2002), Holme, Huss and Jeong (2003) and Estrada and Rodriguez-Velázquez (2006) studied data on collaboration among scientists. Newman (2003), Albert and Barabasi (2002) and Estrada and Rodriguez-Velázquez (2006) examined citation patterns. Other studies have dealt with corporate interlock (Newman, 2003; Estrada and Rodriguez-Velázquez, 2006). And Kitsak, Havlin, Paul, Riccaboni, Pammolli and Stanley (2007), Song, Havlin and Makse (2005) along with Newman (2003) and Albert and Barabasi (2002) have examined the World Wide Web. In addition, linguistic data sets have been examined (Albert and Barabasi, 2002; Newman, 2003; Estrada & Rodriguez-Velázquez, 2005).
Finally, physicists and biologists have even reanalyzed some of the classic social network data sets. Girvan and Newman (2002), Holme, Huss and Jeong (2003) and Kolaczyk, Chua and Barthelemy (2007) have used Zachary's (1977) karate club data and Newman (2006) analyzed Padgett's (1993) data on Florentine families.
Physicists have also given some attention to calculating centralities on data that outsiders might think of as belonging to physics proper. Prominent among these are studies of packet switching in the internet (Holme, Kim, Yoon and Han, 2002; Holme, Huss and Jeong, 2003; Albert and Barabasi, 2002; Estrada & Rodriguez-Velázquez, 2005; Kitsak, Havlin, Paul, Riccaboni, Pammolli and Stanley, 2007; Kolaczyk, Chua and Barthelemy, 2007), the electrical power grid (Barabasi, Albert and Jeong, 1999; Albert and Barabasi, 2002; Govindaraj, 2008) and electronic circuitry (Newman, 2003).
But by far the most effort has gone into applying centrality to various problems from biology. In biological research, centrality has been applied to three main kinds of networks: connections among amino acid residues, protein-protein interaction networks and links in metabolic networks. These three research efforts will be examined below.
Amino acids are molecules that are the building blocks for constructing proteins. They link together to form long chains, called polypeptides. When they link they lose the elements of water and are thereafter called amino acid residues. Polypeptides are linear chains, but in order to turn into proteins they must fold into three dimensional forms. An example of a polypeptide folding into a protein is shown in Figure 3.
Proteins can be represented as graphs. The nodes in the graph of a protein are the amino acid residues it contains. When two residues are in close physical proximity, their proximity is viewed as evidence that they are linked by some sort of chemical interaction. Thus, any pair of residues is defined as linked whenever they are closer together than some specified criterion distance.
Figure 3. An Example of Protein Folding (from Wikipedia).
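The construction just described is easy to sketch in code. The fragment below builds a contact graph from hypothetical residue coordinates, linking any pair of residues closer than a criterion distance, and then computes betweenness in the spirit of Vendruscolo et al. (2002); the coordinates and the 8.5 angstrom cutoff are illustrative assumptions, not values taken from any particular study.

```python
# A hypothetical residue contact graph: nodes are residues, edges are pairs
# of residues closer than a criterion distance.
import itertools
import networkx as nx

coords = {  # made-up 3-D coordinates (in angstroms) for five residues
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (7.6, 0.5, 0.0),
    4: (11.4, 1.0, 0.5),
    5: (15.2, 1.5, 0.2),
}
CUTOFF = 8.5  # criterion distance, chosen only for this sketch

G = nx.Graph()
G.add_nodes_from(coords)
for (i, pi), (j, pj) in itertools.combinations(coords.items(), 2):
    distance = sum((a - b) ** 2 for a, b in zip(pi, pj)) ** 0.5
    if distance < CUTOFF:
        G.add_edge(i, j)  # residues within the cutoff are treated as interacting

betweenness = nx.betweenness_centrality(G)
print(sorted(betweenness.items(), key=lambda kv: -kv[1]))
```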
In any particular case, a polypeptide may fold properly and turn itself into the appropriate protein or it may fail to do so. Since errors in folding can lead to disease or death, proper folding is critical. Experimental research has established that only a small number of the amino acid residues that make up a polypeptide are critical for its proper folding.
Earlier experimental results had shown, for example, that only two residues were critical in the folding of the protein 1AYE. Vendruscolo, Dokholyan, Paci and Karplus (2002) hypothesized that betweenness centrality might pick out those critical residues. Figure 4 shows their results for the residues that are folded to make 1AYE. The betweenness of each residue is shown by the height of the curve and the two critical residues are marked with squares.
The authors concluded, therefore, that betweenness generally provided a good way to find the residues that were critical to proper protein folding. But, following up, del Sol, Fujihashi, Amoros and Nussinov (2006) argued that closeness centrality was more effective than betweenness in picking out the critical residues. And Chea and Livesay (2007) went on to show that, in a large sample of proteins, closeness centralities were statistically significant in their ability to determine which residues were critical.
A second research area in biology in which centralities have been applied is in the study of protein-protein interaction (PPI) networks. Interactions linking proteins are common. They play an important part in every process involving living cells. Knowledge about how proteins interact can lead to better understanding of a great many diseases and it can help in the design of appropriate therapies.
Figure 4. The Residues That Are Folded to Make Protein 1AYE
Studies of PPI networks often generate huge data sets. In the letter in Nature that was mentioned above, Jeong, Mason, Barabasi and Oltvai (2001) examined a data matrix that contained 2440 interactions linking 1870 yeast proteins. Earlier experimental work had demonstrated that some of the protein molecules in yeast were lethal; if they were removed the yeast would die. Removing others, however, had no such dramatic effect. So Jeong et al. examined the question of whether the structural properties of those proteins - in particular, their degree centralities - could predict which proteins were lethal and which ones were not. Their results showed that proteins of high degree were far more likely to be lethal than those of lower degree. Follow-up research (e. g. Coulomb, Bauer, Bernard and Marsolier-Kergoat, 2005; Han, Dupuy, Bertin, Cusick and Vidal, 2005), however, showed that gaps in the PPI data were enough to cast doubt on that result.
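A toy version of the Jeong et al. comparison can be written in a few lines: given an interaction edge list and a list of proteins known to be essential (lethal when removed), compare the mean degree of the two groups. The edge list and the essential set below are small hypothetical placeholders, not the yeast data.

```python
# Comparing mean degree of "essential" and "non-essential" proteins in a toy PPI graph.
import networkx as nx

edges = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p2", "p3"),
         ("p4", "p5"), ("p5", "p6"), ("p6", "p7")]
essential = {"p1", "p5"}  # hypothetical: proteins whose removal is assumed lethal

G = nx.Graph(edges)
degree = dict(G.degree())

def mean(values):
    return sum(values) / len(values)

print("mean degree, essential:    ", mean([degree[p] for p in G if p in essential]))
print("mean degree, non-essential:", mean([degree[p] for p in G if p not in essential]))
```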
Nonetheless, centralities in PPI networks continued to be studied. Joy, Brock, Ingber and Huang (2005) also studied centralities in the PPI networks of yeast. Their results showed that proteins that combined low degree with high betweenness were those that were likely to be essential to the survival of the yeast. They found, moreover, that the evolutionary ages of individual yeast proteins were positively correlated with their betweenness centrality.
Hahn and Kern (2005) extended the PPI network analysis to other forms of life. They studied PPI networks in yeast, in worms and in flies. They reported that the patterns of networks linking proteins in all three species had "remarkably" similar structural forms. Regardless of their degree centralities, proteins with high betweenness turned out to evolve more slowly and were more likely to be essential for survival.
The third major application of centrality in biology is in the study of metabolic networks. These metabolic networks are formed within cells. They include all the chemical reactions that allow cells to process metabolites. Metabolites include the nutrients that start the process, intermediate compounds and end products. Thus, metabolites are connected in chains of reactions. And each cell contains a great many different chains. These different chains are essential for the life of the organism.
The use of centrality in the study of metabolic networks began in 2000 with a study by Fell and Wagner. They examined metabolic reactions in E. coli bacteria. In their data metabolites were defined as nodes, and two metabolites were viewed as connected if they occurred in the same chain of reactions. They wanted to determine which metabolites were involved in the widest range of reactions, so they calculated the closeness centrality of each. Those closeness centralities were used to identify the extent of each metabolite's influence.
Three years later Ma and Zeng (2003) did a comparative study in which they examined the metabolic networks of 65 organisms. They used the social network computer program, PAJEK (Batagelj and Mrvar, 1998), to determine that metabolic networks typically embodied several strong components. They calculated degree and closeness centralities in the largest of those components and concluded that organisms from different domains of life display different patterns of closeness centralities.
Schuster, Pfeiffer, Moldenhauer, Koch and Dandekar (2002) sought to untangle the complexity of metabolic networks by removing the nodes of highest degree, and examining the "internal" structure of the resulting components. And a year later Holme, Huss and Jeong (2003) modified that approach. They reasoned that degree centrality was essentially a local index and they substituted a global one, betweenness. They constructed bipartite graphs for 43 organisms in which they defined two classes of nodes. One class consisted of molecules that are acted upon by reaction agents, the second represented chemical reactions. Directed lines link molecules to reactions.
This bipartite structure permitted them to eliminate only those nodes involved in reactions. That way they were able to create separate components and to simplify the overall structure. Because they used betweenness, they argued, they retained information on large scale structural patterns.
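A sketch of that bipartite construction, under assumed metabolite and reaction names: substrates point to the reactions that consume them and reactions point to their products, so betweenness can be computed on the directed graph and only the reaction nodes need to be removed when decomposing the network.

```python
# A hypothetical bipartite, directed metabolite-reaction graph.
import networkx as nx

B = nx.DiGraph()
B.add_edges_from([
    ("m_glucose", "r1"), ("r1", "m_g6p"),   # reaction r1 consumes glucose, produces G6P
    ("m_g6p", "r2"), ("r2", "m_f6p"),
    ("m_f6p", "r3"), ("r3", "m_pyruvate"),
])

betweenness = nx.betweenness_centrality(B)
reactions = [n for n in B if n.startswith("r")]

# remove reaction nodes in decreasing order of betweenness and watch the
# network break into separate (weakly connected) components
for r in sorted(reactions, key=betweenness.get, reverse=True):
    B.remove_node(r)
    print(r, nx.number_weakly_connected_components(B))
```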
We saw above that the social network program, PAJEK, has been applied in research in biology. But the biologists' focus on centrality has also spawned the development of new computer programs. In 2006 Baitaluk, Sedova, Ray and Gupta of the San Diego Supercomputer Center released a new program, PathSys, designed to analyze biological networks. They considered betweenness centrality important enough in biological research that they featured it in their tutorial for the program. And that same year three molecular geneticists, Junker, Koschützki and Schreiber, released a Java program, CentiBiN. That program is designed for applications to biological networks, and it is devoted entirely to computing 17 different kinds of centers and centralities.
These, then, are some examples that illustrate the importance of centrality to physicists and biologists. From these illustrative examples it should be clear that the use of centrality concepts and tools is widespread in those fields, particularly in biology. In the next and final section of this paper I will provide a summary and present some conclusions.

Summary and Conclusions

I have shown here that applications of centrality have moved "the wrong way" from social science to physics and biology. Centrality was developed by social network analysts. But both physicists and biologists have borrowed the concepts and tools of centrality from social network analysis and applied them in their fields.
The physicists and biologists, moreover, have documented their borrowing by their citations to the social network literature. Figure 5, for example, shows citations to my 1977 article on betweenness centrality, by year and by field. It shows that, in recent years, citations to that article from physics and biology have completely overwhelmed those from social science.
Given that ideas and tools involving centrality have moved the "wrong way" from social science to physics and biology, we are left with the problem of explaining why such a movement took place. Probably the single most important factor was an explosive growth, in both physics and biology, in research that focused, not on objects as such, but on the connections that link objects together. In both of these fields, scientists quite suddenly developed a major interest in networks - all kinds of networks.
This expanding interest in networks apparently grew out of two relatively recent developments. By the year 2000, both physicists and biologists were faced with huge amounts of readily accessible relational data and, at the same time, they had easy access to large-scale computing power. As Bornholdt and Schuster (2002) put it:
Triggered by recently available data on large real world networks (e.g. on the structure of the internet or on molecular networks in the living cell) combined with fast computer power on the scientist's desktop, an avalanche of quantitative research on network structure currently stimulates diverse scientific fields.

Figure 5. Citations to Freeman (1977) by Year and by Field

Particularly in biology, research scientists were desperate for tools that could be used to analyze network data. They had recently been faced with a huge and ever growing collection of network data stemming from all the work on genome sequencing. Two biologists, Wagner and Fell (2001), described the situation:

The information necessary to characterize the genetic and metabolic networks driving all functions of a living cell is being put within our reach by various genome projects. With the availability of this information, however, a problem will arise which has, as yet, been little explored by molecular biologists: how to adequately represent and analyse the structure of such large networks.
Thus, biologists, and probably physicists too, came upon centrality at the point of their greatest need for analytical tools that could be used to uncover important structural properties of networks. The centrality ideas from social network analysis are simple and easy to grasp. They are intuitively appealing and their structural implications are clear. They have, moreover, been formalized using mathematical tools no more difficult than graph theory and elementary matrix algebra. Thus, centralities are the natural choice for anyone seeking to uncover positions in a new area of application. And they turn out to be the tools that were actually chosen for that purpose by biologists and physicists.
As a final note, I would like to stress that almost all of the applications of centrality in biology have involved interplay between physicists and biologists. A collection of physicists, Jeong, Mason, Barabasi and Oltvai (2001) were the first to use centrality in the study of interactions among proteins. But the idea was picked up and refined by four biologists, Joy, Brock, Ingber and Huang (2005) and again by two more biologists, Hahn and Kern (2005). On the other hand, the use of centrality in cell metabolic research began with two biologists, Fell and Wagner (2000) and was later extended by three physicists, Holme, Huss and Jeong (2003). In all of this research biologists and physicists have managed to work on common problems. And they have done it in a way that allows representatives of both fields to contribute freely to the overall collective effort.
With respect to social network analysts, the physicists have given every indication that they want to build the same kind of cooperative relationship that they have with biologists. They have explored social network research problems, they have analyzed social network data sets, they have cited social network publications and they have even refined and extended tools for the analysis of centrality (e. g. Girvan and Newman, 2002).
There is every reason to believe that a cooperative relationship between these two fields would yield benefits for both. But, so far, a great many network analysts have tended to view the physicists as interlopers, invading our territory. I suggest, instead, that we welcome the contributions of the physicists and build on them. That seems to have worked for the biologists and it should work for us.

References

Albert, R. and A. L. Barabasi (2002). "Statistical mechanics of complex networks." Reviews of Modern Physics 74 (1): 47-97.
Alter, D. (1854). "On certain physical properties of light produced by the combustion of different metals in an electric spark refracted by a prism." American Journal of Science and Arts 18: 55-57.
Angell, R. C. (1961). "The moral integration of American cities." American Journal of Sociology 53: 1-140.
Artigliani, R. U. (1991). "Social evolution: a non-equilibrium systems model." Ed. E. Laszlo. The New Evolutionary Paradigm. New York, Gordon and Breach.
Baitaluk, M., M. Sedova, et al. (2006). "Biological Networks: visualization and analysis tool for systems biology." Nucleic Acids Research 34: W466-W471.
Barabasi, A. L., R. Albert, et al. (1999). "Mean-field theory for scale-free random networks." Physica A 272 (1-2): 173-187.
Batagelj, V. and A. Mrvar (1998). "Pajek-Program for large network analysis." Connections 21 (2): 47-57.
Bavelas, A. (1948). "A mathematical model for small group structures." Human Organization 7: 16-30.
Bavelas, A. (1950). "Communication patterns in task oriented groups." Journal of the Acoustical Society of America 22 (6): 725-730.
Bonacich, P. (1972). "Factoring and weighting approaches to status scores and clique identification." Journal of Mathematical Sociology 2: 113-120.
Bonacich, P. (1987). "Power and centrality: A family of measures." American Journal of Sociology 92: 1170-1182.
Carnot, S. (1824/1960). Réflexions sur la Puissance Motrice du Feu. Mineola, N.Y., Dover.
Catton, W. R. J. and L. Berggren (1964). "Intervening opportunities and national park visitation rates." Pacific Sociological Review 7: 66-73.
Chea, E. and D. R. Livesay (2007). "How accurate and statistically robust are catalytic site predictions based on closeness centrality?" BMC Bioinformatics 8, 153.
Cohen, J. (1968). "Multiple regression as a general data-analytic system." Psychological Bulletin 70: 426-443.
Comte, A. (1830-1842/1982). Cours de philosophie positive. Paris, Hatier.
Coulomb, S., M. Bauer, D. Bernard and M.-C. Marsolier-Kergoat (2005). "Gene essentiality and the topology of protein interaction networks." Proceedings of the Royal Society of London, B, Biological Sciences 272 (1573): 1721-1725.
del Sol, A., H. Fujihashi, et al. (2006). "Residue centrality, functionally important residues, and active site shape: Analysis of enzyme and non-enzyme families." Protein Science 15 (9): 2120-2128.
Ecob, E. (2005). The Dating Game: Looking for Mr./Mrs. Right. MS thesis, Department of Physics, University of Oxford.
Estrada, E. and J. A. Rodriguez-Velázquez (2005). "Spectral measures of bipartivity in complex networks." Physical Review E 72, 046105.
Estrada, E. and J. A. Rodriguez-Velázquez (2006). "Subgraph centrality and clustering in complex hyper-networks." Physica A-Statistical Mechanics and Its Applications 364: 581-594.
Federighi, E. (1950). "The use of chi-square in small samples." American Sociological Review 15: 777-779.
Fell, D. A. and A. Wagner (2000). "The small world of metabolism." Nature Biotechnology 18 (11): 1121-1122.
Fisher, R. A. (1918). "The correlation between relatives on the supposition of Mendelian inheritance." Transactions of the Royal Society of Edinburgh 52: 399-433.
Fisher, R. A. (1936). "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7: 179-188.
Ford, L. R. and D. R. Fulkerson (1957). "A simple algorithm for finding maximal network flows and an application to the Hitchcock problem." Canada Journal of Mathematics 9: 210-218.
Freeman, L. C. (1977). "A set of measures of centrality based on betweenness." Sociometry 40: 35-41.
Freeman, L. C. (1979). "Centrality in social networks: Conceptual clarification." Social Networks 1: 215-239.
Freeman, L. C. (2004). The Development of Social Network Analysis: A Study in the Sociology of Science. Vancouver, BC, Empirical Press.
Freeman, L. C. (2008). Social Network Analysis. London: SAGE.
Freeman, L. C. and A. P. Merriam (1956). "Statistical classification in anthropology: An application to ethnomusicology." American Anthropologist 58: 464-472.
Fick, A. (1855). "On liquid diffusion." Philosophical Magazine 10: 30-39.
Galton, F. (1886). "Regression towards mediocrity in hereditary stature." Journal of the Anthropological Institute 15: 246-263.
Gauss, C. F. (1816/1880). "Bestimmung der Genauigkeit der Beobachtungen." Koenigliche Gesellschaft der Wissenschaften 4: 109-117.
Govindaraj, T. (2008). "Characterizing performance in socio-technical systems: A modeling framework in the domain of nuclear power." Omega 36: 10-21.
Girvan, M. and M. E. J. Newman (2002). "Community structure in social and biological networks." Proceedings of the National Academy of Sciences of the United States of America 99 (12): 7821-7826.
Goh, K. I., B. Kahng, et al. (2001). "Universal behavior of load distribution in scale-free networks." Physical Review Letters 87, 278701.
Granger, C. W. J. and M. Hatanaka (1964). Spectral Analysis of Economic Time Series. Princeton, NJ, Princeton University Press.
Hage, P. and F. Harary (1995). "Eccentricity and centrality in networks." Social Networks 17(1): 57-63.
Hagerstrand, T. (1967). Innovation diffusion as a spatial process. Chicago, University of Chicago Press.
Hahn, M. W. and A. D. Kern. (2005). "Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks." Molecular Biology and Evolution 22 (4): 803-806.
Han, J.-D., D. Dupuy, N. Bertin, M. E. Cusick and M. Vidal (2005). "Effects of sampling on topology predictions of protein-protein interaction networks." Nature Biotechnology 23 (7): 839-844.
Hilbert, D. (1904). "Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen." Nachrichten von d. Königl. Ges. d. Wissensch. zu Göttingen.: 49-91.
Hollingworth, H. L. (1921). "Judgements of persuasiveness." Psychological Review 28: 4.
Holme, P., M. Huss, et al. (2003). "Subnetwork hierarchies of biochemical pathways." Bioinformatics 19 (4): 532-538.
Holme, P., B. J. Kim, et al. (2002). "Attack vulnerability of complex networks." Physical Review E 65, 056109.
Holme, P., F. Liljeros, et al. (2003). "Network bipartivity." Physical Review E 68, 056107.
Hubbell, C. H. (1965). "An input-output approach to clique identification." Sociometry 28: 377-399.
Jeong, H., S. P. Mason, et al. (2001). "Lethality and centrality in protein networks." Nature 411 (6833): 41-42.
Jordan, C. (1869). "Sur les assemblages de lignes." Journal für die reine und angewandte Mathematik 70: 185-190.
Joy, M. P., A. Brock, et al. (2005). "High-betweenness proteins in the yeast protein interaction network." Journal of Biomedicine and Biotechnology 2005 (2): 96-103.
Junker, B. H., D. Koschützki, et al. (2006). "Exploration of biological network centralities with CentiBiN." BMC Bioinformatics 7: 219.
Katz, L. (1953). "A new status index derived from sociometric analysis." Psychometrika 18: 39-43.
Killworth, P. D., C. McCarty, H. R. Bernard, E. C. Johnsen, J. Domini and G. A. Shelley. (2003). "Two interpretations of reports of knowledge of subpopulation sizes." Social Networks 25: 141-160.
Killworth, P. D., C. McCarty, H. R. Bernard, and M. House. (2006). "The accuracy of small world chains in social networks." Social Networks 28: 85-96.
Kish, L. (1965). Survey Sampling. New York, Wiley.
Kitsak, M., S. Havlin, et al. (2007). "Betweenness centrality of fractal and nonfractal scale-free model networks and tests on real networks." Physical Review E 75, 056115.
Klovdahl, A. (1998). "A picture is worth . . . :Interacting visually with complex network data." Computer Modeling and the Structure of Dynamic Social Processes. W. Liebrand. Amsterdam, ProGamma.
Kolaczyk, E. D., D. B. Chua, et al. (2007). "Co-Betweenness: A Pairwise Notion of Centrality." Available: http://arxiv.org/abs/0709.3420 [September 25, 2008].
Leavitt, H. J. (1951). "Some effects of communication patterns on group performance." Journal of Abnormal and Social Psychology 46: 38-50.
Leibnitz, G. (1684/1969). "Nova methodus pro maximis et minimis." A Source Book in Mathematics, 1200 - 1800. Ed. D. J. Struik. Cambridge, MA, Harvard University Press: 271-281.
Lorrain, F., H. C. White. (1971). "Structural equivalence of individuals in social networks." Journal of Mathematical Sociology 1: 49 - 80
Luccio, F. and M. Sami (1969). "On the decomposition of networks into minimally interconnected networks." IEEE Transactions on Circuit Theory CT-16: 184-188.
Luce, R. D. and A. Perry (1949). "A method of matrix analysis of group structure." Psychometrika 14: 95-116.
Ma, H. W. and A. P. Zeng (2003). "The connectivity structure, giant strong component and centrality of metabolic networks." Bioinformatics 19 (11): 1423-1430.
Marco, B. and B. Paolo (2007). The Maximum Clique Problem in Spinorial Form, AIP.
Nagoshi, C. T. and R. C. Johnson (1986). "The ubiquity of g." Personality and Individual Differences 7: 201-207.
Newman, M. E. J. (2001). "Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality." Physical Review E 64, 016132.
Newman, M. E. J. (2003). "The structure and function of complex networks." SIAM Review 45 (2): 167-256.
Newton, I. (1687/1999). Philosophiæ Naturalis Principia Mathematica. Berkeley, University of California Press.
Pearson, K. (1895-1896). "Contributions to the mathematical theory of evolution. III. Regression, heredity, and panmixia." Proceedings of the Royal Society of London 59: 69-71.
Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philosophical Magazine 50: 157-172.
Pool, I. D. and M. Kochen (1978). "Contacts and influence." Social Networks 1 (1): 5-51.
Price, D. d. S. (1965). "Networks of scientific papers." Science 149 (3683): 510-515.
Price, D. d. S. (1976). "A general theory of bibliometric and other cumulative advantage processes." Journal of the American Society for Information Science 27: 292-306.
Richards, W. and A. Seary (2000). "Eigen Analysis of Networks" Journal of Social Structure 1 (2). Available: http://www.cmu.edu/joss/content/articles/volume1/RichardsSeary.html [September 25, 2008].
Rutherford, E. and F. Soddy (1902). "The cause and nature of radioactivity." Philosophical Magazine 4: 370-396.
Samuelson, P. A. (1948). Economics. New York, McGraw-Hill.
Schuster, S., T. Pfeiffer, et al. (2002). "Exploring the pathway structure of metabolism: decomposition into subnetworks and application to Mycoplasma pneumoniae." Bioinformatics 18 (2): 351-361.
Seidman, S. B. (1983). "Internal cohesion of LS sets in graphs." Social Networks 5: 97-107.
Shull, G. H. (1911). "The genotypes of maize." American Naturalist 45: 234-252.
Song, C. M., S. Havlin, et al. (2005). "Self-similarity of complex networks." Nature 433(7024): 392-395.
Vendruscolo, M., N. V. Dokholyan, et al. (2002). "Small-world view of the amino acids that play a key role in protein folding." Physical Review E 65, 061910.
von Hofmann, A. W. (1860). Cited in Wikipedia. Available: http://en.wikipedia.org/wiki/August_Wilhelm_von_Hofmann [September 25, 2008].
Wagner, A. and D. A. Fell (2001). "The small world inside large metabolic networks." Proceedings of the Royal Society, B, Biological Sciences 268: 1803-1810.
Wang, J., Z. Cai, et al. (2008). "An improved method based on maximal clique for predicting interactions in protein interaction networks." International Conference on BioMedical Engineering and Informatics: 62-66.
Wasserman, S. and K. Faust (1994). Social Network Analysis: Methods and Applications. Cambridge, Cambridge University Press.
Watts, D. J. (1999). "Networks, dynamics, and the small-world phenomenon." American Journal of Sociology 105 (2): 493-527.
Watts, D. J. and S. H. Strogatz (1998). "Collective dynamics of 'small-world' networks." Nature 393 (6684): 440-442.
White, H. C. (1970). Chains of Opportunity: System Models of Mobility in Organizations. Cambridge, MA. Harvard University Press.
White, H. C. (1970). "Search parameters for the small world problem." Social Forces 49 (2): 259-264.
White, H. C., S. A. Boorman and R. L. Breiger. (1976). "Social structure from multiple networks I: Blockmodels of roles and positions." American Journal of Sociology 81: 730-781.
Zachary, W. (1977). "An information flow model for conflict and fission in small groups." Journal of Anthropological Research 33: 452-473.


*Note
An earlier draft of this paper was presented at the 28th International Sunbelt Network Conference at St. Pete Beach, FL in January 2008. The author wishes to thank his colleagues, Morry Sunshine, Ron Breiger, Elisa Bienenstock, Natasa Przulj, Kim Romney and Russ Bernard for their helpful comments on this paper.


[1] Only recently (Hage and Harary, 1995) has the eccentricity of each node been defined as an index of its centrality, and that definition was introduced in the literature of social network analysis.

[2] The notion of the degree of a node has recently been used in physics, but most of that use has involved degree distributions; it has not focused on degree as an index of the centrality of a node.

Saturday, August 10, 2013

Visualization: The evolution of lobbying coalitions in telecommunications

The Evolution of FCC Lobbying Coalitions

Pierre de Vries
Research Fellow at the Economic Policy Research Center, University of Washington, Seattle

The Evolution of FCC Lobbying Coalitions


Self-Commentary
                The graph is derived from meta-data associated with documents that are filed electronically whenever an organization interacts with the FCC, in accordance with the Administrative Procedure Act. Whenever a letter, comment or other document is filed, the filer provides information on the parties involved, number of pages, relevant proceedings, date, etc.
                When this project started, the meta-data was not readily available. While it was public in the form of search results generated by the FCC’s Electronic Comment Filing System (ECFS), screen scraping a decade of data ten records at a time proved to be impractical. Fortunately Bill Cline, manager of the FCC Reference Information Center, agreed to run ad hoc batch jobs against the database to extract the information, which was provided as year-by-year spreadsheets. Thanks to a recent web upgrade, such information can now be user-downloaded from http://fjallfoss.fcc.gov/ecfs2/ .
                Once the raw data had been obtained, it had to be cleaned. The metadata is typed into ECFS by paralegals when documents are filed – often in a rush just before a filing deadline – and input errors are common. They include misspellings and multiple variants of organization names, information entered into the wrong fields, and mismatches between information in the filing and the metadata. There are also ambiguities that are not strictly speaking errors, such as inconsistencies in specifying a subsidiary vs. a holding company, “doing business as” designations, fields left blank in some filings but not others, and abbreviating the list of filing entities. Before generating graphs, a clean-up macro (programmed in Visual Basic for Applications) was run against a synonym list that currently contains more than 4,000 entries.
                This project focused on the connection implied by organizations filing together. While multiple filers’ names are sometimes given in the metadata, at other times merely one company name plus “et al.” is given. For such records one has to refer to the underlying filing to “unpack” the list of all the other filers. This work cannot be easily automated, and was done by hand in this case; since it is so time-consuming, it was one of the reasons for limiting attention to a single proceeding that spanned only seven years.
                Once the data is cleaned up, an edge list is created in Excel by running another VBA macro. A graph is created from this list with NodeXL, a social network analysis and visualization add-in for Excel 2007. NodeXL’s Fruchterman-Reingold algorithm is used to prepare a preliminary layout; nodes are then moved by hand into visually intelligible positions, respecting the clusters suggested by NodeXL’s implementation of the Wakita-Tsurumi algorithm. Nodes are colored on the basis of eigenvector centrality. The degree of investment an organization makes in lobbying is measured by the total number of filings it made in this proceeding over the period of study, and is reflected in the size of its node. This information is obtained by running another VBA macro against the underlying ECFS metadata, and then matching that to the vertices in the graph.
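As a rough analogue of that pipeline (layout, eigenvector-centrality coloring, node size by filing count), the sketch below uses the Python networkx library instead of NodeXL; the organization names, co-filing edges and filing counts are placeholders, not values from the ECFS data.

```python
# A minimal stand-in for the NodeXL workflow, on hypothetical co-filing data.
import networkx as nx

edges = [("OrgA", "OrgB"), ("OrgA", "OrgC"), ("OrgB", "OrgC"),
         ("OrgC", "OrgD"), ("OrgD", "OrgE"), ("OrgE", "OrgF")]
filings = {"OrgA": 40, "OrgB": 35, "OrgC": 12, "OrgD": 9, "OrgE": 7, "OrgF": 5}

G = nx.Graph(edges)
pos = nx.spring_layout(G, seed=1)         # Fruchterman-Reingold-style layout
color = nx.eigenvector_centrality(G)      # node color variable
size = {v: filings.get(v, 1) for v in G}  # node size: number of filings

for v in G:
    print(v, [round(c, 2) for c in pos[v]], round(color[v], 3), size[v])
```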
                Some ambiguities in the data remain to be resolved. Assume that one chooses to map holding companies rather than subsidiaries (rather than vice versa), and has succeeded in substituting the subsidiaries named in filings with their parents; on-going mergers and acquisitions change this mapping over the period being analyzed. Acquisitions lead to companies absorbing not only competitors (that may have been in disjoint clusters early in the period of study), but also changing their business interests. In this data set, for example, the “old” AT&T was a long-distance carrier at odds with regional telephone companies like SBC; after the merger in 2005, the “new” AT&T came to have the interests of SBC.

PEER REVIEW COMMENT No. 1
This visualization captures the formal connections between lobbying organizations in the fight over telephone transfer fees. This representation suggests that the companies lobbying the most, or the most well connected, are not necessarily the most structurally important, or the most influential. Smaller companies can play important lobbying roles if they connect particular lobbying subgroups to each other.  This visualization offers a clean picture of the lobbying network but provides little information about the companies: perhaps a different color scheme, combinations of shapes, or more exaggerated node sizes could have told a clearer story about the kinds of companies playing different roles.

PEER REVIEW COMMENT No. 2
This visualization clearly depicts the connections made between phone companies’ lobby groups, and the extent of pairwise connectivity is captured by the edge color. Since node size represents the number of filings that a company is on, it appears clear that the number of filings alone does not determine the centrality (which is indicated by “pinkness”).  The color scheme may not be optimal, however, as it is difficult to ascertain the importance of edge weight or co-filings without very careful study, since the different shades of orange are difficult to discern.  The placement of the isolated components in the margins appears to be arbitrary, but it might make sense to imbue the macro-space with meaning here as well.

PEER REVIEW COMMENT No. 3
This layout does a good job of making coalitions easily apparent to the viewer.  I'm curious whether there is a size effect.  If node size was proportional to either volume of telecom traffic or total corporate worth, would we see peer-preferences in coalition choices?

Friday, August 9, 2013

Cheliotis: Social network analysis


Visualization: A bipartite network of legislation and organizations

Visualizing Positive and Negative Endorsements of S.1782 (2007)

Skye Bender-de Moll
Nehalem, Oregon

A bipartite network of legislation and organizations

Caption



This image depicts a network of bills (squares) and endorsing organizations (circles) around S.1782 during the 110th U.S. Congress. The green and red ties indicate support or opposition of S.1782 by an organization. Gray ties link to additional bills positively endorsed by organizations. The information was collected by MapLight.org from various public documents. Color indicates similar group categorization, and the size of each node is proportional to the total number of endorsements it gave or received in the database. Mousing over small nodes will reveal the title or additional bill info. Clicking on bills will load an associated web page.
S.1782 was chosen as the focus for this extended ego-network visualization because the title and description of the bill give very little indication about its intent or possible effects. MapLight's table of supporters and opponents is quite helpful, but ideally it would be possible to simultaneously see where each of a bill's endorsing organizations stands on other issues in order to place their endorsement in context. This layout results in intersecting circles of legislative preference around groups with similar patterns of endorsements, revealing separation and overlap between the camps surrounding S.1782. Opposition to the bill seems to have come from large industry lobby groups, corporations, and business associations. Support came from consumer and activist groups. The "nays" seem to have won, as the bill died in committee.
The layout was produced using SoNIA and the MDSJ library. The MDS algorithm was run for 150 iterations on the matrix of all-pairs-shortest-path distances, with a scaling exponent of -7 to weight distant ties less. Some node positions were manually tweaked for legibility.
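A rough sketch of that layout step in Python, using networkx for the distance matrix and scikit-learn's metric MDS in place of the MDSJ library (the example graph is a stand-in, and MDSJ's per-pair weighting by distance to the -7 power has no direct equivalent here, so it is simply omitted):

```python
# MDS layout from all-pairs shortest-path distances, on a stand-in graph.
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

G = nx.les_miserables_graph()  # a connected example network bundled with networkx
nodes = list(G)

# matrix of all-pairs shortest-path (geodesic) distances
sp = dict(nx.all_pairs_shortest_path_length(G))
D = np.array([[sp[u][v] for v in nodes] for u in nodes], dtype=float)

# metric MDS on the precomputed distance matrix, 150 iterations
mds = MDS(n_components=2, dissimilarity="precomputed", max_iter=150, random_state=0)
coords = mds.fit_transform(D)

for name, xy in list(zip(nodes, np.round(coords, 2)))[:3]:
    print(name, xy)  # first few node positions
```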

Self-Commentary

Note: for this example to function correctly, in addition to the png image it must load a script file and an xml data file, and include javascript in the header of the page. Not sure how this will actually work with the JOSS journal formatted web page.
Data on the entire set of bill endorsements were kindly provided by MapLight.org in csv form. I loaded it into a MySQL database so that I could experiment with various types of networks. Although the co-endorsing and co-endorsee networks of the legislative space were in some ways more interesting, they would require much more work to make a presentable image. I also wanted to test the feasibility of using visualization to learn something about an arbitrarily chosen bill. I wrote a utility program in java to facilitate the process of testing various queries to select the node attribute data and construct the tie relations. The queries are processed to produce a .son formatted network file, which was loaded into SoNIA for visualization.
I initially expected that using different tie weights for the S.1782 ties and the rest of the bills' ties would help structure the network. The approach was not successful, so I ended up simply giving the direct ties greater width to increase the visual impact, and graying out the ties to the other bills to focus attention on the bill of interest. Earlier versions of the image were produced with a KK layout, but I found the MDSJ MDS layout was somewhat more stable, and allowed me to adjust the distance parameter to control the "clumpiness" of the (loosely) structurally equivalent node groups. I adjusted some node positions and shortened some labels to reduce clutter in the resulting layout. A challenge was making the labels legible without crowding the layout or distracting from the ties. In the end, I settled for making many of the labels too small to read, but including a mouseover option to show the labels as a "tooltip" on the image in a web browser.
To prepare for the web version, I exported the .PNG image, and an .XML file containing node coordinates and labels, from SoNIA. I adapted some JavaScript code previously written by a co-developer to read the xml file and produce the image mouseovers. I added a feature to parse the titles of the bills into an appropriate url on the GovTrack site, making it possible to click through to more bill information. I also inserted bill titles for selected nodes directly into the tooltip, hopefully making it possible to quickly get a feel for the type of bills in each area without cluttering the layout with the long labels. This is important, because the bill numbers themselves are not meaningful and I do not have (or know of) any bill classification data that could be used to help the viewer determine if the bill groupings produced by the layout make sense, or what is implied about an organization's political position by an endorsement.
Although I think this is an interesting image, I see several issues. One is that the nodes that only endorse a single bill do not have well defined positions in this layout; they tend to land in arbitrary regions and their positions are likely to be falsely assigned significance by the viewer. As with most network visualizations, the groupings might be more rigorously created with a clustering algorithm. There may also be data coverage and sampling issues in the underlying data.

PEER REVIEW COMMENT No. 1
I really appreciate the author’s use of a careful color scheme of support and opposition so the viewer may easily discern coalition blocks.  The visualization also includes other bills supported by each organization, which hints at a more contextualized story – would love to see this fully searchable/interactive. It appears that subgroups of organizations tend to support similar bills, but organizational coalitions are far from uniform; I’m guessing that sets of organizations may come together or fall apart depending on the particular bill of interest. Thus, while the visualization nicely captures the coalition structure for the particular bill of interest, there is, unfortunately, very little sense of how the other bills are interrelated.

PEER REVIEW COMMENT No. 2
The positioning of organizations that supported and opposed the bill around Bill S.1782 made the two sides visually easy to locate. In addition to the corresponding edge color scheme, the image conveys a clear picture of the bill’s supporters as mainly activist groups and its opposition as mainly business organizations. While the color an organization is assigned shows its endorsement pattern, the reader may be overwhelmed with the volume of bills connected by grey lines and surrounding the organizations; it would be nice to have a way to summarize the “similar” bills.

PEER REVIEW COMMENT No. 3
This visualization makes great use of user input, providing unabbreviated node identities on mouseover and detailed information when clicked. This is a great circumvention of the trade-off between node information and graph clutter. It also employs an effective yet simple color/layout scheme to display a bipartite graph without the artificial separations that often stilt bipartite graphs. I wonder how this graph would look with the other edges colored (perhaps in faint red/green) according to whether the organization opposed S.1782 – would it give a sense of how firm these organizational battle lines hold across bills?

JoSS

Thursday, August 8, 2013

Group-in-a-box layout for multifaceted community analysis


Visualization: Radial Tree Diagram of Insomnia

Radial Tree Diagram [RTD] of Insomnia

Philip Topham
Lnx Research, General Manager

Radial Tree Diagram [RTD] of Insomnia



Self-Commentary
Radial tree diagram [RTD] of the Insomnia coauthor invisible college and their cliques (2005-2009). The RTD helps an analyst understand the Insomnia research community by bundling edges that normally obscure meaning in a traditional network layout. The RTD also maintains the overall network topology while specific relationships are browsed, and thus avoids the need for the viewer to reorient themselves, as is the case in traditional spring-embedded or force-directed approaches that redraw themselves depending on view and focus.
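A bare-bones Python sketch of the radial placement idea may make the point concrete: nodes get fixed positions on a circle, grouped by clique or subject area, so the overall topology does not move as individual relationships are browsed. The grouping input and spacer parameter are illustrative assumptions, and the edge bundling itself is not shown.

import math

def radial_positions(groups, radius=1.0, gap=2):
    """groups: lists of node names, one list per clique; returns {node: (x, y)} on a circle."""
    order = [n for g in groups for n in list(g) + [None] * gap]   # None = spacer between groups
    slots = max(len(order), 1)
    pos = {}
    for i, name in enumerate(order):
        if name is None:
            continue
        theta = 2 * math.pi * i / slots
        pos[name] = (radius * math.cos(theta), radius * math.sin(theta))
    return pos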

PEER REVIEW COMMENT No. 1
This visualization uses a radial tree diagram to contain a vast amount of information about the structure of co-authorships among insomnia researchers. The image makes good use of color and smooth lines and is visually appealing as a result. However, the volume of information makes it difficult to put together a general sense of the network; could this be due in part to the substantive areas, rather than connectedness, determining position on the circle?
PEER REVIEW COMMENT No. 2
This is a fascinating interactive bit, and I find myself continually “poking around” the visualization. Very fun! The colors, rendering and flow are beautiful. As with many circular layouts, the trade-off between clarity of connection (highlighted here) and general topology (highlighted in space-based layouts) makes it difficult to discern “distance” between research groups. Here a *little* more information on how nodes were placed around the circle might help.

PEER REVIEW COMMENT No. 3
This flash-based visualization supercharges the circle graph into a luxuriously information-rich interactive medium. It makes beautiful and effective use of color – using it when helpful, dropping it to deemphasize and avoid clutter. I would be interested to see whether additional functionality can be built into the user interface – perhaps some way to pull up a group/individual bio with a double click?

Wednesday, August 7, 2013

The network displaces teachers from the classroom

How Big Data Is Taking Teachers Out of the Lecturing Business

Schools and universities are embracing technology that tailors content to students' abilities and takes teachers out of the lecturing business. But is it an improvement?


By Seth Fletcher

When Arnecia Hawkins enrolled at Arizona State University last fall, she did not realize she was volunteering as a test subject in an experimental reinvention of American higher education. Yet here she was, near the end of her spring semester, learning math from a machine. In a well-appointed computer lab in Tempe, on Arizona State's desert resort of a campus, she and a sophomore named Jessica were practicing calculating annuities. Through a software dashboard, they could click and scroll among videos, text, quizzes and practice problems at their own pace. As they worked, their answers, along with reams of data on the ways in which they arrived at those answers, were beamed to distant servers. Predictive algorithms developed by a team of data scientists compared their stats with data gathered from tens of thousands of other students, looking for clues as to what Hawkins was learning, what she was struggling with, what she should learn next and how, exactly, she should learn it.
Having a computer for an instructor was a change for Hawkins. “I'm not gonna lie—at first I was really annoyed with it,” she says. The arrangement was a switch for her professor, too. David Heckman, a mathematician, was accustomed to lecturing to the class, but he had to take on the role of a roving mentor, responding to raised hands and coaching students when they got stumped. Soon, though, both began to see some benefits. Hawkins liked the self-pacing, which allowed her to work ahead on her own time, either from her laptop or from the computer lab. For Heckman, the program allowed him to more easily track his students' performance. He could open a dashboard that told him, in granular detail, how each student was doing—not only who was on track and who was not but who was working on any given concept. Heckman says he likes lecturing better, but he seems to be adjusting. One definite perk for instructors: the software does most of the grading for them.
At the end of the term, Hawkins will have completed the last college math class she will probably ever have to take. She will think back on this data-driven course model—so new and controversial right now—as the “normal” college experience. “Do we even have regular math classes here?” she asks.
Big Data Takes Education
Arizona State's decision to move to computerized learning was born, at least in part, of necessity. With more than 70,000 students, Arizona State is the largest public university in the U.S. Like institutions at every level of American education, it is going through some wrenching changes. The university has lost 50 percent of its state funding over the past five years. Meanwhile enrollment is rising, with alarmingly high numbers of students showing up on campus unprepared to do college-level work. “There is a sea of people we're trying to educate that we've never tried to educate before,” says Al Boggess, director of the Arizona State math department. “The politicians are saying, ‘Educate them. Remediation? Figure it out. And we want them to graduate in four years. And your funding is going down, too.’”
Two years ago Arizona State administrators went looking for a more efficient way to shepherd students through basic general-education requirements—particularly those courses, such as college math, that disproportionately cause students to drop out. A few months after hearing a pitch by Jose Ferreira, the founder and CEO of the New York City adaptive-learning start-up Knewton, Arizona State made a big move. That fall, with little debate or warning, it placed 4,700 students into computerized math courses. Last year some 50 instructors coached 7,600 Arizona State students through three entry-level math courses running on Knewton software. By the fall of 2014 ASU aims to adapt six more courses, adding another 19,000 students a year to the adaptive-learning ranks. (In May, Knewton announced a partnership with Macmillan Education, a sister company to Scientific American.)
Arizona State is one of the earliest, most aggressive adopters of data-driven, personalized learning. Yet educational institutions at all levels are pursuing similar options as a way to cope with rising enrollments, falling budgets and more stringent requirements for student achievement. Public primary and secondary schools in 45 states and the District of Columbia are rushing to implement new, higher standards in English-language arts and mathematics known as the Common Core state standards, and those schools need new instructional materials and tests to make that happen. Around half of those tests will be online and adaptive, meaning that a computer will tailor questions to each student's ability and calculate each student's score [see “Why We Need High-Speed Schools,” on page 69]. School systems are experimenting with a range of other adaptive programs, from math and reading lessons for elementary school students to “quizzing engines” that help high school students prepare for Advanced Placement exams. The technology is also catching on overseas. The 2015 edition of the Organization for Economic Co-operation and Development's Program for International Student Assessment (PISA) test, which is given to 15-year-olds (in more than 70 nations and economies so far) every three years, will include adaptive components to evaluate hard-to-measure skills such as collaborative problem solving.
Proponents of adaptive learning say that technology has finally made it possible to deliver individualized instruction to every student at an affordable cost—to discard the factory model that has dominated Western education for the past two centuries. Critics say it is data-driven learning, not traditional learning, that threatens to turn schools into factories. They see this increasing digitization as yet another unnecessary sellout to for-profit companies that push their products on teachers and students in the name of “reform.” The supposedly advanced tasks that computers can now barely pull off—diagnosing a student's strengths and weaknesses and adjusting materials and approaches to suit individual learners—are things human teachers have been doing well for hundreds of years. Instead of delegating these tasks to computers, opponents say, we should be spending more on training, hiring and retaining good teachers.
And while adaptive-learning companies claim to have nothing but the future of America's children in mind, there is no denying the potential for profit. Dozens of them are rushing to get in on the burgeoning market for instructional technology, which is now a multibillion-dollar industry. As much as 20 percent of instructional content in K–12 schools is already delivered digitally, says Adam Newman, a founding partner of the market-analysis firm Education Growth Advisors. Although adaptive-learning software makes up only a small slice of the digital-instruction pie—around $50 million for the K–12 market—it could grow quickly. Newman says the concept of adaptivity is already very much in the water in K–12 schools. “In K–12, the focus has been on differentiating instruction for years,” he says. “Differentiating instruction, even without technology, is really a form of adaptation.”
Higher-education administrators are warming up to adaptivity, too. In a recent Inside Higher Ed/Gallup poll, 66 percent of college presidents said they found adaptive-learning and testing technologies promising. The Bill & Melinda Gates Foundation has launched the Adaptive Learning Market Acceleration Program, which will issue 10 $100,000 grants to U.S. colleges and universities to develop adaptive courses that enroll at least 500 students over three semesters. “In the long term—20 years out—I would expect virtually every course to have an adaptive component of some kind,” says Peter Stokes, an expert on digital education at Northeastern University. That will be a good thing, he says—an opportunity to apply empirical study and cognitive science to education in a way that has never been done. In higher education in particular, “very, very, very few instructors have a formal education in how to teach,” he says. “We do things, and we think they work. But when you start doing scientific measurement, you realize that some of our ways of doing things have no empirical basis.”
The Science of Adaptivity
In general, “adaptive” refers to a computerized-learning interface that constantly assesses a student's thinking habits and automatically customizes material for him or her. Not surprisingly, though, competitors argue ferociously about who can claim the title of true adaptivity. Some say that a test that does nothing more than choose your next question based on whether you get the item in front of you correct—a test that steers itself according to the logic of binary branching—does not, in 2013, count as fully adaptive. In this view, adaptivity requires the creation of a psychometric profile of each user, plus the continuous adjustment of the experience based on that person's progress.
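A toy sketch can make the contrast concrete. The first function below implements the simple binary-branching logic described above; the second nudges a per-student ability estimate in the style of an Elo or one-parameter logistic update. Both are illustrative only and are not any vendor's algorithm.

def next_item_branching(items, level, last_correct):
    """Binary branching: move up one difficulty level after a correct answer, down after a miss."""
    level = min(max(level + (1 if last_correct else -1), 0), len(items) - 1)
    return items[level], level

def update_ability(ability, item_difficulty, correct, rate=0.1):
    """Profile-style update: nudge the ability estimate toward the observed evidence."""
    expected = 1.0 / (1.0 + 10 ** (item_difficulty - ability))   # predicted chance of a correct answer
    return ability + rate * ((1 if correct else 0) - expected)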
To make this happen, adaptive-software makers must first map the connections among every concept in a piece of learning material. Once that is done, every time a student watches a video, reads an explanation, solves a practice problem or takes a quiz, data on the student's performance, the effectiveness of the content, and more flow to a server. Then the algorithms take over, comparing that student with thousands or even millions of others. Patterns should emerge. It could turn out that a particular student is struggling with the same concept as students who share a specific psychometric profile. The software will know what works well for that type of student and will adjust the material accordingly. With billions of data points from millions of students and given enough processing power and experience, these algorithms should be able to do all kinds of prognostication, down to telling you that you will learn exponents best between 9:42 and 10:03 a.m.
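One hedged way to picture the "compare this student with similar students" step is a nearest-neighbor pass over per-concept scores, recommending whatever resource worked for the closest peers on the concept this student struggles with most. The sketch below assumes toy data structures and is illustrative, not any company's pipeline.

import numpy as np

def recommend(scores, best_resource, student, k=5):
    """scores: (students x concepts) array of mastery in [0, 1];
    best_resource: per-student dict mapping concept index -> resource id that worked."""
    d = np.linalg.norm(scores - scores[student], axis=1)
    neighbors = np.argsort(d)[1:k + 1]            # closest peers, skipping the student
    weak = int(np.argmin(scores[student]))        # concept this student struggles with most
    picks = [best_resource[n].get(weak) for n in neighbors if best_resource[n].get(weak)]
    return weak, (max(set(picks), key=picks.count) if picks else None)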
They should also be able to predict the best way to get you to remember the material you are learning. Ulrik Juul Christensen, CEO of Area9, the developer of the data-analysis software underpinning McGraw-Hill's adaptive LearnSmart products, emphasizes his company's use of the concept of memory decay. More than two million students currently use LearnSmart's adaptive software to study dozens of topics, either on their own or as part of a course. Research has shown that those students (all of us, really) remember a new word or fact best when they learn it and then relearn it when they are just on the cusp of forgetting it. Area9's instructional software uses algorithms to predict each user's unique memory-decay curve so that it can remind a student of something learned last week at the moment it is about to slip out of his or her brain forever.
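The memory-decay idea can be sketched with a simple exponential forgetting curve: estimate a half-life for each student and item, then schedule the reminder just before predicted recall drops below a threshold. The functional form and numbers below are assumptions for illustration, not Area9's model.

import math

def recall_probability(hours_since_review, half_life_hours):
    """Exponential forgetting curve: predicted recall halves every half-life."""
    return 0.5 ** (hours_since_review / half_life_hours)

def hours_until_review(half_life_hours, threshold=0.7):
    """Remind the student just before predicted recall falls below the threshold."""
    return half_life_hours * math.log(threshold, 0.5)

# with an estimated 48-hour half-life, the reminder comes after roughly 25 hours:
# hours_until_review(48) -> about 24.7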
Few human instructors can claim that sort of prescience. Nevertheless, Christensen dismisses the idea that computers could ever replace teachers. “I don't think we are so stupid that we would let computers take over teaching our kids,” he says.
Backlash
In March, Gerald J. Conti, a social studies teacher at Westhill High School in Syracuse, N.Y., posted a scathing retirement letter to his Facebook page that quickly became a viral sensation. “In their pursuit of Federal tax dollars,” he wrote, “our legislators have failed us by selling children out to private industries such as Pearson Education,” the educational-publishing giant, which has partnered with Knewton to develop products. “My profession is being demeaned by a pervasive atmosphere of distrust, dictating that teachers cannot be permitted to develop and administer their own quizzes and tests (now titled as generic ‘assessments’) or grade their own students' examinations.” Conti sees big data leading not to personalized learning for all but to an educational monoculture: “STEM [science, technology, engineering and mathematics] rules the day, and ‘data driven’ education seeks only conformity, standardization, testing and a zombie-like adherence to the shallow and generic Common Core.”
Conti's letter is only one example of the backlash building against tech-oriented, testing-focused education reform. In January teachers at Garfield High School in Seattle voted to boycott the Measures of Academic Progress (MAP) test, administered in school districts across the country to assess student performance. After tangling with their district's superintendent and school board, the teachers continued the boycott, which soon spread to other Seattle schools. Educators in Chicago and elsewhere held protests to show solidarity. In mid-May it was announced that Seattle high schools would be allowed to opt out of MAP, as long as they replaced it with some other evaluation.
It would be easy for proponents of data-driven learning to counter these protests if they could definitively prove that their methods work better than the status quo. But they cannot do that, at least not yet. Empirical evidence about effectiveness is, as Darrell M. West, an adaptive-learning proponent and founder of the Brookings Institution's Center for Technology Innovation, has written, “preliminary and impressionistic.” Any accurate evaluation of adaptive-learning technology would have to isolate and account for all variables: increases or decreases in a class's size; whether the classroom was “flipped” (meaning homework was done in class and lectures were delivered via video on the students' own time); whether the material was delivered via video, text or game; and so on. Arizona State says 78 percent of students taking the Knewton-ized developmental math course passed, up from 56 percent before. Yet it is always possible that more students are passing not because of technology but because of a change in policy: the university now lets students retake developmental math or stretch it over two semesters without paying tuition twice.
Even if proponents of adaptive technology prove that it works wonderfully, they will still have to contend with privacy concerns. It turns out that plenty of people find pervasive psychometric-data gathering unnerving. Witness the fury that greeted inBloom earlier this year. InBloom essentially offers off-site digital storage for student data—names, addresses, phone numbers, attendance, test scores, health records—formatted in a way that enables third-party education applications to use it. When inBloom was launched in February, the company announced partnerships with school districts in nine states, and parents were outraged. Fears of a “national database” of student information spread. Critics said that school districts, through inBloom, were giving their children's confidential data away to companies who sought to profit by proposing a solution to a problem that does not exist. Since then, all but three of those nine states have backed out.
This might all seem like overreaction, but to be fair, adaptive-education proponents already talk about a student's data-generated profile following them throughout their educational career and even beyond. Last fall the education-reform campaign Digital Learning Now released a paper arguing for the creation of “data backpacks” for pre-K–12 students—electronic transcripts that kids would carry with them from grade to grade so that they will show up on the first day of school with “data about their learning preferences, motivations, personal accomplishments, and an expanded record of their achievement over time.” Once it comes time to apply for college or look for a job, why not use the scores stored in their data backpacks as credentials? Something similar is already happening in Japan, where it is common for managers who have studied English with the adaptive-learning software iKnow to list their iKnow scores on their resumes.
This Is Not a Test
It is far from clear whether concerned parents and scorned instructors are enough to stop the march of big data on education. “The reality is that it's going to be done,” says Eva Baker, director of the Center for the Study of Evaluation at the University of California, Los Angeles. “It's not going to be a little part. It's going to be a big part. And it's going to be put in place partly because it's going to be less expensive than doing professional development.”
That does not mean teachers are going away. Nor does it mean that schools will become increasingly test-obsessed. It could mean the opposite. Sufficiently advanced testing is indistinguishable from instruction. In a fully adaptive classroom, students will be continually assessed, with every keystroke and mouse click feeding a learner profile. High-stakes exams could eventually disappear, replaced by the calculus of perpetual monitoring.
Long before that happens, generational turnover could make these computerized methods of instruction and testing, so foreign now, unremarkable, as they are for Arizona State's Hawkins and her classmates. Teachers could come around, too. Arizona State's executive vice provost Phil Regier believes they will, at least: “I think a good majority of the instructors would say this was a good move. And by the way, in three years 80 percent of them aren't going to know anything else.”