Multi-Attribute Dataset on Statisticians (MADStat)

#Journals   Time span   #Papers   #Authors
36          1975-2015   83,331    47,311

The MADStat dataset consists of BibTeX and citation information for about 83K papers published in 36 representative journals in statistics, biostatistics, probability, machine learning, and related fields, spanning 41 years (1975-2015). The dataset was collected and analyzed in:
  • Pengsheng Ji, Jiashun Jin, Zheng Tracy Ke & Wanshan Li (2022). Co-citation and Co-authorship Networks of Statisticians, Journal of Business & Economic Statistics, 40(2), 469-485.
  • Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin & Wanshan Li (2024). Recent Advances in Text Analysis, Annual Review of Statistics and Its Application, 11, 347-372.
  • Part 1: Coauthorship and Citation Data. The data and code in Ji et al. (2022) can be downloaded as a single Zip file, or from GitHub, Harvard Dataverse, or the journal website. The ReadMe file has a section titled "Ready-to-use data matrices." Users can use these matrices to construct many different co-authorship and citation networks by restricting to a subset of years, journals, and authors.

  • Part 2: Text Abstracts. The data and code in Ke et al. (2024) can be downloaded as a single Zip file, or from GitHub, Harvard Dataverse, or the journal website. Both the raw abstracts and the processed word count matrices are included (check the ReadMe file).
Attributes included
  • Paper: Title, year, journal, authors (with name cleaning), text abstract, reference list, citations (within the data range).
  • Author: List of papers, list of co-authors, citers (authors who cite this author), citees (authors whom this author cites).
Instructions for users
  • Please consult the ReadMe files when using the data.
    • ReadMe for using data in Part 1.
    • ReadMe for using data in Part 2.

  • Some of the attributes are not listed as variables, but they can be constructed from the data matrices. Here is an example.
    • Author-paper matrix A (directly available): The (i,j)th entry is 1 if author i is an author of paper j, and 0 otherwise.
    • Paper-paper matrix B (directly available): The (i,j)th entry is 1 if and only if paper i cites paper j.
    • Coauthor list of each author: Rows of the matrix AA' (notation: A' is the transpose of A). The nonzero off-diagonal entries in row i identify the coauthors of author i.
    • List of papers that cite a paper: Columns of the matrix B.
    • List of citees of each author: Rows of the matrix ABA'
    • List of citers of each author: Rows of the matrix AB'A'
    • ......
    • If you would like to restrict to papers in particular years or journals, simply take the corresponding sub-matrix of B (the rows and columns of the selected papers), together with the matching columns of A, and repeat the above steps. This is easily done, as the paper attributes are already summarized in AuthorPaperInfo.RData.
    • If you would like to restrict to authors in a particular range, simply take the corresponding rows of A and repeat the above steps. The author names are in author_name.txt.
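
The matrix constructions above can be sketched on a toy example. The following is a minimal illustration in Python with NumPy, not the code distributed with the data: the matrices, the years vector, and all variable names are made up, and loading the actual files (which are in RData format) is not shown. It takes A to be authors by papers and B to be papers by papers, as defined above, so the author-by-author coauthorship matrix is the product of A with its transpose.

```python
import numpy as np

# Toy data (made up for illustration): 3 authors, 4 papers.
# A[i, j] = 1 if author i is an author of paper j.
A = np.array([
    [1, 1, 0, 0],   # author 0 wrote papers 0 and 1
    [0, 1, 1, 0],   # author 1 wrote papers 1 and 2
    [0, 0, 0, 1],   # author 2 wrote paper 3
])

# B[i, j] = 1 if and only if paper i cites paper j.
B = np.zeros((4, 4), dtype=int)
B[2, 0] = 1   # paper 2 cites paper 0
B[3, 1] = 1   # paper 3 cites paper 1
B[3, 2] = 1   # paper 3 cites paper 2

# Author-by-author matrices.
coauthor = A @ A.T      # (i, k): number of papers written jointly by i and k
citees = A @ B @ A.T    # (i, k): number of times papers of i cite papers of k
citers = A @ B.T @ A.T  # (i, k): number of times papers of k cite papers of i


def nonzero_off_diagonal(M, i):
    """Return the indices k != i for which M[i, k] is nonzero."""
    return [int(k) for k in np.flatnonzero(M[i]) if k != i]


coauthors_of_0 = nonzero_off_diagonal(coauthor, 0)          # [1]
citing_paper_1 = [int(i) for i in np.flatnonzero(B[:, 1])]  # [3]

# Restricting to a subset of papers, here by a hypothetical vector of
# publication years: keep the same papers in the columns of A and in
# both the rows and the columns of B, then repeat the products.
years = np.array([1990, 1995, 2000, 2005])  # made-up paper attribute
keep = np.flatnonzero(years >= 1995)
A_sub = A[:, keep]
B_sub = B[np.ix_(keep, keep)]
citees_sub = A_sub @ B_sub @ A_sub.T

print(coauthor)
print(citees)
print(citees_sub)
```

The entries of these products are counts rather than 0/1 indicators, so a list (of coauthors, citees, or citers) is read off from the nonzero positions of a row. With roughly 47K authors and 83K papers, the full matrices would likely call for a sparse representation rather than the dense arrays used in this toy example.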
List of journals and data range:

Ann. Inst. Henri Poincare Probab. Stat. (1984-2015)
Annals of Applied Statistics (2007-2015)
Annals of Probability (1975-2015)
Annals of Statistics (1975-2015)
Annals of the Institute of Statistical Mathematics (1975-2015)
Australian & New Zealand Journal of Statistics (1998-2015)
Bayesian Analysis (2006-2015)
Bernoulli (1997-2015)
Biometrics (1975-2015)
Biometrika (1975-2015)
Biostatistics (2002-2015)
Canadian Journal of Statistics (1985-2015)
Communications in Statistics-Theory and Methods (1976-2015)
Computational Statistics & Data Analysis (1983-2015)
Electronic Journal of Statistics (2007-2015)
Extremes (2008-2015)
International Statistical Review (1975-2015)
Journal of Computational and Graphical Statistics (1997-2015)
Journal of Machine Learning Research (2001-2015)
Journal of the American Statistical Association (1975-2015)
Journal of the Royal Statistical Society Series B-Statistical Methodology (1975-2015)
Journal of Applied Statistics (1993-2015)
Journal of Classification (1984-2015)
Journal of Multivariate Analysis (1976-2015)
Journal of the Royal Statistical Society Series A-Statistics in Society (1975-2015)
Journal of the Royal Statistical Society Series C-Applied Statistics (1975-2015)
Journal of Statistical Planning and Inference (1977-2015)
Journal of Time Series Analysis (2000-2015)
Journal of Nonparametric Statistics (1998-2015)
Probability Theory and Related Fields (1986-2015)
Statistical Science (1993-2015)
Scandinavian Journal of Statistics (1977-2015)
Statistica Sinica (1991-2015)
Statistics and Computing (1993-2015)
Statistics & Probability Letters (1984-2015)
Statistics in Medicine (1984-2015)