Which fields drive the h-index?
Published: September 16, 2024
Title: Which fields drive the h-index?
Authors: Paolo Giudici, Luca Boscolo
DOI: 10.62684/FSOZ4761
Keywords: H-index, Poisson models, Scaling
Paolo Giudici(a), Luca Boscolo(b).
(a) Department of Economics and Management Sciences, University of Pavia, Italy.
(b) Top Italian Scientists founder.
Abstract
The quality of academic research is often measured by means of the h-index. Although widely accepted, the h-index has some issues and, in particular, it may depend on the scientific field in which a researcher operates. To date there is no definitive answer as to whether this difference holds, and to what extent it varies. To fill the gap, we propose to operationally measure the difference in h-index across the sectors of a relatively homogeneous population: all the scientists of a nation. To address the heterogeneity issue we apply three different explainable machine learning models: linear regression, Poisson regression and tree models. Our results show that the latter two models better explain the data. They show that the only sectors for which a difference in h-index is significant are Physics, Biology and Clinical Sciences.
Introduction
The measurement of the quality of academic research is a rather controversial issue. In the 2000s, [11] proposed a measure, the h-index, that has the advantage of summarizing in a single statistic the information contained in the citation counts of each author. Since that seminal paper, a large amount of research has been produced, focusing in particular on the development of correction factors for the h-index ([13], [3], [9], [2], [12]) that may take into account differences between sectors.
In this stream of research, [9] analysed the mathematical properties of the h-index, and [3] proposed to employ a stochastic model for an author's production/citation patterns. Following this mathematical formalisation, it becomes possible to analyse the h-index of individual researchers, whether or not they work in the same field, and compare them with each other.
Along a more empirical research line, [13] proposed to use a simple multiplicative correction to the h-index to take into account the differences among researchers coming from different sectors, thus allowing a fair and sustainable comparison. In particular, they propose a table of such normalizing factors, derived under specific distributional assumptions on the citations. Their approach provides a simple way to explain and measure differences between scientific fields. In a similar vein, [2] propose a rescaling procedure based on the Gini entropy, and [12] propose a different rescaling that takes into account the number of coauthors: the fractional h-index.
We employ both streams of research as a starting point. More precisely, we follow [6], who, expanding the contribution of [9], propose a statistical approach indicating that a Poisson distribution is a well-suited approximation for the distribution of the h-index. In this paper we show that a Poisson distribution is also well suited to explain the drivers of the h-index, and we employ this theoretical result to understand whether the h-index of a scientist depends on his/her field of research, following the research line of [13], also followed by [15].
The paper is organized as follows: in the Methodology section we review the proposal of [6] and formalise the model; in the Application section we apply the new approach to a database of scientists who are homogeneous by nationality and, therefore, by scientific culture. Finally, the Discussion section contains some concluding remarks.
Methodology
The paper of [11] proposed a "transparent, unbiased and very hard to rig measure" ([1]): the h-index.
According to the definition, a scientist has index [math]\displaystyle{ h }[/math] if [math]\displaystyle{ h }[/math] of his or her [math]\displaystyle{ n }[/math] papers have at least [math]\displaystyle{ h }[/math] citations each and the other [math]\displaystyle{ (n-h) }[/math] papers have [math]\displaystyle{ \leq h }[/math] citations each.
Following the work of Hirsch, many papers have discussed its application, especially in the bibliometric community. Some papers have focused on the statistical learning aspects behind the h-index; among them, [9] stressed the relevance of a "statistical background" for the h-index. Recently, [?] provided a complete statistical framework for the h-index that holds for all sample sizes and respects the discrete nature of the citation data behind the h-index. We now recall their proposal, as it forms the basis of our analysis.
Let [math]\displaystyle{ X_1, \ldots, X_n }[/math] be random variables which describe the number of citations of the [math]\displaystyle{ n }[/math] articles of a scientist. We assume that [math]\displaystyle{ X_1, \ldots, X_n }[/math] are independent with a common citation distribution function [math]\displaystyle{ F }[/math]. Let us then assume that [math]\displaystyle{ F }[/math] is continuous, at least asymptotically, although the citation counts are integers. Under this assumption, the h-index can be formally defined as the value [math]\displaystyle{ h }[/math] that solves the following equation:
[math]\displaystyle{ 1 - F(h) = \frac{h}{n} }[/math]
Then, following [6], assume that [math]\displaystyle{ F }[/math] is discrete. Given a set of [math]\displaystyle{ n }[/math] papers of a scientist to which a citation count vector [math]\displaystyle{ \underline{X} }[/math] is associated, consider the ordered sample of citations [math]\displaystyle{ \{X_{(i)}\} }[/math], that is [math]\displaystyle{ X_{(1)} \geq X_{(2)} \geq \ldots \geq X_{(n)} }[/math], from which obviously [math]\displaystyle{ X_{(1)} }[/math] ([math]\displaystyle{ X_{(n)} }[/math]) denotes the most (the least) cited paper. The h-index can be defined as follows:
[math]\displaystyle{ h = \max\{t : X_{(t)} \geq t\} }[/math]
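The discrete definition above translates directly into a few lines of code. The following is a minimal Python sketch, not taken from the paper: the function name `h_index` and the example citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return max{t : X_(t) >= t} for a list of citation counts."""
    # Order the citations from the most cited to the least cited paper.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the rank-th paper still has at least `rank` citations
        else:
            break         # counts only decrease from here, so no larger t qualifies
    return h


# Hypothetical citation counts for one scientist (made up for illustration).
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations))  # 4
```

In this example four papers have at least four citations each, while the fifth has only three, so the index is 4.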
The distribution of the h index can then be shown to be:
[math]\displaystyle{ p(h) = \left[F(X_{j(h)}) - F(X_{j(h)+1})\right]^{(n+1-h)} }[/math]
The previous expression, albeit elegant, is nonparametric, and its estimation process is not very transparent.
To formulate a more explainable parametric specification, [6] suggested following the Loss Distribution Approach (LDA) employed in operational risk modelling (see [7] and [8]), in which losses are categorized in terms of "frequency" and "severity" (or impact). The frequency is the random number of loss events that occur during a specific time frame, while the severity is the mean impact of all such events in terms of monetary losses.
In the context of the h-index, the frequency is the (random) number of published papers along the career of a scientist and the impact is the (random) mean number of citations received in the same time frame by all such papers.
Let [math]\displaystyle{ X_i = (X_{i1}, X_{i2}, \ldots, X_{in_i}) }[/math] be a random vector containing the citations of the [math]\displaystyle{ n_i }[/math] papers published by the [math]\displaystyle{ i }[/math]-th scientist.
It follows that the total impact of a scientist [math]\displaystyle{ i }[/math] can be defined as the sum of a random number [math]\displaystyle{ n_i }[/math] of random citations:
[math]\displaystyle{ C_i = X_{i1} + X_{i2} + \ldots + X_{in_i} }[/math]
It can be shown that the above formula can be equivalently expressed as follows:
[math]\displaystyle{ C_i = n_i \times m_i }[/math] (1)
where [math]\displaystyle{ m_i = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i} }[/math] is the mean impact of a scientist.
Assuming that the scientists [math]\displaystyle{ i = 1, . . . , I }[/math] belong to a homogeneous community, conditionally on the production of each scientist (with number of papers equal to [math]\displaystyle{ n_i }[/math]), the citations of the papers [math]\displaystyle{ X_{ij} }[/math] , for [math]\displaystyle{ j = 1, . . . , n_i }[/math] are independent and identically distributed random variables, with common distribution [math]\displaystyle{ k(m_i) }[/math]:
[math]\displaystyle{ k(x_{i1}) = k(x_{i2}) = \ldots = k(x_{in_i}) = k(m_i) }[/math]
[6] showed that, for each scientist [math]\displaystyle{ i }[/math], the distribution function of [math]\displaystyle{ C_i }[/math], that is [math]\displaystyle{ F(c_i) = P(C_i ≤ c_i) }[/math], can thus be found by means of a convolution between the distributions of [math]\displaystyle{ n_i }[/math] and [math]\displaystyle{ m_i }[/math] as follows:
[math]\displaystyle{ F(c_i) = \sum_{n_i=1}^{\infty} p(n_i) \, k^{n_i *}(m_i) }[/math]
where [math]\displaystyle{ c_i = n_i \times m_i }[/math] and [math]\displaystyle{ k^{n_i *} }[/math] indicates the [math]\displaystyle{ n_i }[/math]-fold convolution operator of the distribution [math]\displaystyle{ k(.) }[/math] with itself (see e.g. [4] and [?]):
[math]\displaystyle{ k^{1*}(m_i) = k(m_i) }[/math]
[math]\displaystyle{ k^{n*}(m_i) = k^{(n-1)*}(m_i) * k(m_i) }[/math]
and, for each scientist, [math]\displaystyle{ p(n_i) }[/math] is the distribution of the number of produced papers and [math]\displaystyle{ k(m_i) }[/math] is the distribution of the mean impact.
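The recursive definition of the n-fold convolution can be sketched in code. The Python fragment below is an illustration under simplifying assumptions rather than the authors' implementation: it treats [math]\displaystyle{ k }[/math] as a probability mass function on the non-negative integers, stored as a list indexed by value, and the names `convolve` and `n_fold_convolution` are hypothetical.

```python
def convolve(p, q):
    """Convolve two pmfs stored as lists indexed by the values 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j   # P(sum = i + j) accumulates P(i) * P(j)
    return out


def n_fold_convolution(k, n):
    """Return k^{n*}, the pmf of the sum of n i.i.d. variables with pmf k."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = list(k)                  # base case: k^{1*} = k
    for _ in range(n - 1):            # recursion: k^{n*} = k^{(n-1)*} * k
        result = convolve(result, k)
    return result


# Hypothetical pmf on {0, 1}, chosen only to make the arithmetic easy to check.
k = [0.5, 0.5]
print(n_fold_convolution(k, 2))  # [0.25, 0.5, 0.25]
```

The quadratic-time convolution is adequate for short supports; for long citation distributions an FFT-based convolution would be the usual choice.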
In practice, the distribution functions [math]\displaystyle{ p(n_i) }[/math] and [math]\displaystyle{ k(m_i) }[/math] depend on unknown parameters, say [math]\displaystyle{ \lambda_i }[/math] and [math]\displaystyle{ \theta_i }[/math]. A reasonable modeling assumption is that [math]\displaystyle{ n_i }[/math], the number of published papers of a scientist in a specific community, follows a distribution [math]\displaystyle{ p(n_i \mid \lambda_i) }[/math], with [math]\displaystyle{ \lambda_i }[/math] a parameter that summarizes the productivity of each scientist, and that, conditionally on [math]\displaystyle{ n_i }[/math], the paper citations [math]\displaystyle{ x_i }[/math] follow a distribution [math]\displaystyle{ k(m_i \mid \theta_i, n_i) }[/math], with [math]\displaystyle{ \theta_i }[/math] a parameter that is a function of the mean impact and may vary across scientists.
To complete the proposed model, [6] showed that a reasonable starting assumption may be to take:
[math]\displaystyle{ p(n_i \mid \lambda_i) \sim \mathrm{Poisson}(\lambda_i) }[/math]
[math]\displaystyle{ k(m_i \mid \theta_i, n_i) \sim \mathrm{Poisson}(\theta_i) }[/math]
where [math]\displaystyle{ λ_i }[/math] and [math]\displaystyle{ θ_i }[/math] are unknown and strictly positive parameters to be estimated, representing, respectively, the mean number of published papers and the mean number of citations of each scientist (the mean impact).
The previous results imply that the statistical distribution of the h-index can be reasonably approximated by a Poisson distribution, assuming that the underlying population of scientists for which the h-index is calculated is homogeneous.
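To make the frequency/severity construction concrete, the following Python sketch simulates a single scientist under the two Poisson assumptions. It is an illustration only: the rates 40 and 15 are arbitrary values chosen for the example, not estimates from the paper, and the helper names `sample_poisson` and `simulate_scientist` are hypothetical. The Poisson sampler uses Knuth's multiplication method, which is adequate for moderate rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_scientist(lam, theta, rng):
    """Simulate papers and citations, then return (n_i, m_i, C_i, h)."""
    n_papers = sample_poisson(lam, rng)                      # frequency
    citations = [sample_poisson(theta, rng) for _ in range(n_papers)]
    total = sum(citations)                                   # C_i
    mean_impact = total / n_papers if n_papers else 0.0      # m_i (severity)
    ranked = sorted(citations, reverse=True)
    # Counts are non-increasing and ranks increasing, so counting the ranks
    # that satisfy X_(t) >= t gives max{t : X_(t) >= t}.
    h = sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)
    return n_papers, mean_impact, total, h


rng = random.Random(0)                                       # fixed seed for repeatability
n_i, m_i, c_i, h = simulate_scientist(lam=40, theta=15, rng=rng)  # assumed rates
print(n_i, round(m_i, 2), c_i, h)
```

Repeating the simulation over many scientists with the same rates gives an empirical distribution of the h-index for a homogeneous population, which can then be compared with a Poisson fit.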
In the next section we will extend the literature aimed at comparing the h-index across different scientific fields, employing a regression model based on the Poisson distribution, comparing it with alternative machine learning formulations, and presenting the results obtained by employing the previous model.
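As a rough sketch of what a Poisson regression of the h-index on the scientific field amounts to, consider a model with a log link and the field as its only (categorical) covariate. In that special case the maximum likelihood estimates have a closed form: the fitted mean of each field is the sample mean of its h-indices, so each field coefficient is the log ratio between that field's mean and the baseline field's mean. The Python code below relies on this fact; the function name `fit_poisson_by_field`, the field labels and the h-index values are all hypothetical and made up for illustration. It does not reproduce the authors' model or data, and it does not compute standard errors or significance tests.

```python
import math
from collections import defaultdict


def fit_poisson_by_field(records, baseline):
    """Fit log E[h] = b0 + b_field from (field, h_index) pairs.

    With a single categorical covariate, the Poisson MLE sets each field's
    fitted mean equal to its sample mean (all means must be positive).
    """
    totals, counts = defaultdict(int), defaultdict(int)
    for field, h in records:
        totals[field] += h
        counts[field] += 1
    means = {field: totals[field] / counts[field] for field in totals}
    intercept = math.log(means[baseline])                    # b0: log mean of baseline
    effects = {field: math.log(mean) - intercept             # b_field: log rate ratio
               for field, mean in means.items() if field != baseline}
    return intercept, effects


# Made-up records for two anonymous fields; NOT data from the TIS database.
toy_records = [("field_A", 30), ("field_A", 34), ("field_B", 60), ("field_B", 68)]
intercept, effects = fit_poisson_by_field(toy_records, baseline="field_A")
print(math.exp(intercept))           # baseline mean h-index
print(math.exp(effects["field_B"]))  # multiplicative effect of field_B
```

An exponentiated coefficient close to 1 indicates a field whose mean h-index is similar to the baseline; with additional covariates the closed form no longer applies and an iterative fit would be needed.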
Application
The Top Italian Scientists database started in 2010, when Luca Boscolo was inspired by an article listing the 300 Italian academics, in Italy and abroad, with the highest scientific impact in any area. The article used the h-index to measure scientific impact. Luca had the idea of downloading the entire list of academics working for Italian universities (about 54k people) and calculating the h-index of each of them, using Google Scholar as the database. He then extracted the list of those (about 1k people) whose h-index was greater than or equal to 30. The result was called the "Top Italian Scientists" (TIS) list, and a paper was published ranking the Italian universities by their number of TIS. The paper was cited by some of the main Italian newspapers, such as La Stampa, and went viral, sparking huge interest in the academic world. After that, Luca was flooded with emails congratulating him on the work or pointing out someone with an h-index ≥ 30. After more than 12 years, the list has grown from about 1k to more than 5.5k entries. Nowadays the list is known to all Italian academics working in Italy or abroad.
Declarations
Conflict of Interest
The Authors declare that there is no conflict of interest.