Which fields drive the h-index?
Published | September 16, 2024
Title | Which fields drive the h-index?
Authors | Paolo Giudici, Luca Boscolo
DOI | 10.62684/FSOZ4761
Keywords | H-index, Poisson models, Scaling
Paolo Giudici(a), Luca Boscolo(b).
(a) Department of Economics and Management Sciences, University of Pavia, Italy.
(b) Top Italian Scientists founder.
Abstract
The quality of academic research is often measured by means of the h-index. Although widely accepted, the h-index has some issues; in particular, it may depend on the scientific field in which a researcher operates. To date there is no definitive answer as to whether this difference holds, and to what extent it varies. To fill this gap, we propose to operationally measure the difference in h-index across the sectors of a relatively homogeneous population: all the scientists of a nation. To address the heterogeneity issue we apply three different explainable machine learning models: linear regression, Poisson regression and tree models. Our results show that the latter two models explain the data better. They show that the only sectors for which the difference in h-index is significant are Physics, Biology and Clinical Sciences.
Introduction
The measurement of the quality of academic research is a rather controversial issue. In the 2000s, [11] proposed a measure that has the advantage of summarizing in a single statistic the information contained in the citation counts of each author. Since that seminal paper, a large amount of research has been produced, focusing in particular on the development of correction factors for the h index ([13], [3], [9], [2], [12]) that may take into account differences between sectors.
In this stream of research, [9] analysed the mathematical properties of the h index, and [3] proposed to employ a stochastic model for an author's production/citation patterns. Following this mathematical formalisation, it becomes possible to analyse the h-index of individual researchers, whether or not they work in different fields, and to compare them with each other.
Along a more empirical research line, [13] proposed to use a simple multiplicative correction to the h index to take into account the differences among researchers coming from different sectors, thus allowing a fair and sustainable comparison. In particular, they propose a table of such normalizing factors, derived under specific distributional assumptions on the citations. Their approach provides a simple way to explain and measure differences between scientific fields. In a similar vein, [2] propose a rescaling procedure based on the Gini entropy, and [12] propose a different rescaling that takes into account the number of coauthors: the fractional h-index.
We employ both streams of research as a starting point. More precisely, we follow [6], who, expanding the contribution of [9], propose a statistical approach indicating that a Poisson distribution is a well suited approximation for the distribution of the h-index. In this paper we show that a Poisson distribution is well suited to explain the drivers of the h-index. We then employ this theoretical result to understand whether the h-index of a scientist depends on his/her field of research, following the research line of [13], also followed by [15].
The paper is organized as follows: in the Methodology section we review the proposal of [6] and formalise the model; in the Application section we apply the new approach to a database of scientists that is homogeneous by nationality and, therefore, by scientific culture. Finally, the Discussion section contains some concluding remarks.
Methodology
The paper of [11] proposed a "transparent, unbiased and very hard to rig measure" ([1]): the h index.
According to the definition, a scientist has index [math]\displaystyle{ h }[/math] if [math]\displaystyle{ h }[/math] of his or her [math]\displaystyle{ n }[/math] papers have at least [math]\displaystyle{ h }[/math] citations each and the other [math]\displaystyle{ (n-h) }[/math] papers have [math]\displaystyle{ \leq h }[/math] citations each.
Following the work of Hirsch, many papers have discussed its application, especially in the bibliometric community. Some papers have focused on the statistical learning aspects behind the h index; among them, [9] has stressed the relevance of a "statistical background" for the h index. Recently [?] has provided a complete statistical framework for the h index that holds for all sample sizes and respects the discrete nature of the citation data behind the h-index. We now recall their proposal, as it forms the basis of our analysis.
Let [math]\displaystyle{ X_1, \ldots, X_n }[/math] be random variables which describe the number of citations of the [math]\displaystyle{ n }[/math] articles of a scientist. We assume that [math]\displaystyle{ X_1, \ldots, X_n }[/math] are independent with a common citation distribution function [math]\displaystyle{ F }[/math]. Let us then assume that [math]\displaystyle{ F }[/math] is continuous, at least asymptotically, although the citation counts are integers. Under this assumption, the h index can be formally defined as the value [math]\displaystyle{ h }[/math] that satisfies:
[math]\displaystyle{ 1 - F(h) = \frac{h}{n} }[/math]
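As an illustrative sketch, the continuous definition can be solved numerically once a citation distribution is fixed. The example below reads the definition as the value h for which n(1 − F(h)) = h and assumes, purely for illustration, an exponential citation distribution F(x) = 1 − exp(−x/μ). This distribution, the function names and the parameter values are assumptions introduced here and are not taken from the cited works.

```python
import math

def exponential_cdf(x, mean):
    """Hypothetical citation distribution: F(x) = 1 - exp(-x / mean)."""
    return 1.0 - math.exp(-x / mean)

def continuous_h(n, cdf, tol=1e-10):
    """Solve n * (1 - F(h)) = h for h in [0, n] by bisection."""
    # g is strictly decreasing in h whenever F is non-decreasing,
    # with g(0) = n * (1 - F(0)) >= 0 and g(n) = -n * F(n) <= 0.
    def g(h):
        return n * (1.0 - cdf(h)) - h

    lo, hi = 0.0, float(n)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)

# Illustrative values only: 100 papers with a mean of 10 citations each.
n_papers = 100
mean_citations = 10.0
h = continuous_h(n_papers, lambda x: exponential_cdf(x, mean_citations))
print(round(h, 2))
```

Bisection is used here only because it needs no external library; any root finder would serve the same purpose.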
Then, following [6], assume that [math]\displaystyle{ F }[/math] is discrete. Given a set of [math]\displaystyle{ n }[/math] papers of a scientist to which a citation count vector [math]\displaystyle{ \underline{X} }[/math] is associated, consider the ordered sample of citations [math]\displaystyle{ \{X_{(i)}\} }[/math], that is [math]\displaystyle{ X_{(1)} \geq X_{(2)} \geq \ldots \geq X_{(n)} }[/math], in which [math]\displaystyle{ X_{(1)} }[/math] ([math]\displaystyle{ X_{(n)} }[/math]) denotes the most (the least) cited paper. The h index can be defined as follows:
[math]\displaystyle{ h = \max\{t : X_{(t)} \geq t\} }[/math]
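The discrete definition translates directly into a short computation: sort the citation counts in decreasing order and find the largest rank t whose paper has at least t citations. The following is a minimal sketch; the function name `h_index` and the sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return max{t : X_(t) >= t}, or 0 if no paper satisfies the condition."""
    ranked = sorted(citations, reverse=True)  # X_(1) >= X_(2) >= ... >= X_(n)
    h = 0
    for t, x in enumerate(ranked, start=1):
        if x >= t:
            h = t      # the t-th most cited paper still has at least t citations
        else:
            break      # ranked is non-increasing, so no later rank can qualify
    return h

# Illustrative citation counts for one scientist with five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Because the ranked counts are non-increasing while the rank t increases, the condition holds for an initial run of ranks and then fails for good, which is why the loop can stop at the first failure.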
The distribution of the h index can then be shown to be:
[math]\displaystyle{ p(h) = [F(X_{j(h)}) - F(X_{j(h)+1})]^{(n+1-h)} }[/math]
The previous expression, albeit elegant, is nonparametric, and its estimation process is not very transparent.
To formulate a more explainable parametric specification, [6] suggested following the Loss Distribution Approach (LDA) employed in operational risk modelling (see [7] and [8]), where the losses are categorized in terms of 'frequency' and 'severity' (or impact). The frequency is the random number of loss events that occurred during a specific time frame, while the severity is the mean impact of all such events in terms of monetary losses.
In the context of the h-index, the frequency is the (random) number of published papers along the career of a scientist and the impact is the (random) mean number of citations received in the same time frame by all such papers.
Let [math]\displaystyle{ X_i = (X_{i1}, X_{i2}, \ldots, X_{i n_i}) }[/math] be a random vector containing the citations of the [math]\displaystyle{ n_i }[/math] papers published by the [math]\displaystyle{ i }[/math]-th scientist.
It follows that the total impact of a scientist [math]\displaystyle{ i }[/math] can be defined as the sum of a random number [math]\displaystyle{ n_i }[/math] of random citations:
[math]\displaystyle{ C_i = X_{i1} + X_{i2} + \ldots + X_{i n_i} }[/math]
It can be shown that the above formula can be equivalently expressed as follows:
[math]\displaystyle{ C_i = n_i \times m_i }[/math] (1)
where [math]\displaystyle{ m_i = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i} }[/math] is the mean impact of a scientist.
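The decomposition of the total impact into frequency times mean impact can be checked numerically. The sketch below uses made-up citation counts; the function name `impact_summary` is an illustrative assumption.

```python
import math

def impact_summary(citations):
    """Return (n_i, m_i, C_i): number of papers, mean citations, total citations."""
    n = len(citations)
    total = sum(citations)                 # C_i = X_i1 + ... + X_in_i
    mean = total / n if n > 0 else 0.0     # m_i = C_i / n_i
    return n, mean, total

# Illustrative citation counts for one scientist.
citations = [12, 7, 3, 0, 8]
n_i, m_i, C_i = impact_summary(citations)

# Equation (1): the total impact equals frequency times mean impact.
assert math.isclose(n_i * m_i, C_i)
print(n_i, m_i, C_i)
```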
Assume that the scientists [math]\displaystyle{ i = 1, \ldots, I }[/math] belong to a homogeneous community. Then, conditionally on the production of each scientist (with number of papers equal to [math]\displaystyle{ n_i }[/math]), the citations of the papers [math]\displaystyle{ X_{ij} }[/math], for [math]\displaystyle{ j = 1, \ldots, n_i }[/math], are independent and identically distributed random variables, with common distribution [math]\displaystyle{ k(m_i) }[/math]:
[math]\displaystyle{ k(x_{i1}) = k(x_{i2}) = \ldots = k(x_{i n_i}) = k(m_i) }[/math]
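To make the independence assumption concrete, the following sketch simulates scientists whose papers receive independent and identically distributed citations and computes the resulting h index of each. The text above leaves the common distribution k unspecified; the Poisson choice with mean m_i, the sampler, the function names and all parameter values below are assumptions made purely for illustration.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suited to small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def h_index(citations):
    """Return max{t : X_(t) >= t} by counting the ranks that satisfy the condition."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for t, x in enumerate(ranked, start=1) if x >= t)

def simulate_h_values(n_scientists, n_papers, mean_citations, seed=0):
    """Simulate the h index of scientists with i.i.d. Poisson citation counts (an assumption)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scientists):
        citations = [sample_poisson(rng, mean_citations) for _ in range(n_papers)]
        values.append(h_index(citations))
    return values

# Hypothetical community: 200 scientists, 50 papers each, 8 citations per paper on average.
h_values = simulate_h_values(n_scientists=200, n_papers=50, mean_citations=8.0)
print(min(h_values), max(h_values))
```

Fixing the seed makes the simulation reproducible; changing `mean_citations` between groups gives a simple way to explore how a field-specific mean impact shifts the simulated h values.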
Declarations
Conflict of Interest
The Authors declare that there is no conflict of interest.