Which fields drive the h-index?

From Top Italian Scientists Journal

Revision as of 11:10, 16 September 2024

Published
September 16, 2024
Title
Which fields drive the h-index?
Authors
Paolo Giudici, Luca Boscolo.
DOI
10.62684/FSOZ4761
Keywords
H-index, Poisson models, Scaling

Paolo Giudici(a), Luca Boscolo(b).

(a) Department of Economics and Management Sciences, University of Pavia, Italy.

(b) Top Italian Scientists founder.

Abstract

The quality of academic research is often measured by means of the h-index. Although widely accepted, the h-index has some issues and, in particular, it may depend on the scientific field in which a researcher operates. To date there is no definitive answer as to whether this difference holds, and to what extent it varies. To fill the gap, we propose to operationally measure the difference in h-index across the sectors of a relatively homogeneous population: all the scientists of a nation. To address the heterogeneity issue we apply three different explainable machine learning models: linear regression, Poisson regression and tree models. Our results show that the latter two models better explain the data. They also show that the only sectors for which the difference in h-index is significant are Physics, Biology and Clinical Sciences.

Introduction

The measurement of the quality of academic research is a rather controversial issue. In the 2000s, [11] proposed a measure that has the advantage of summarizing in a single statistic the information contained in the citation counts of each author. Since that seminal paper, a large amount of research has been produced, focusing in particular on the development of correction factors to the h index ([13], [3], [9], [2], [12]) that may take into account differences between sectors.

In this stream of research, [9] analysed the mathematical properties of the h index, and [3] proposed to employ a stochastic model for an author’s production/citation patterns. Following this mathematical formalisation, it becomes possible to analyse the h-index of individual researchers, whether in the same field or in different ones, and to compare them with each other.

Along a more empirical research line, [13] proposed a simple multiplicative correction to the h index that takes into account the differences among researchers coming from different sectors, thus allowing a fair and sustainable comparison. In particular, they propose a table of such normalizing factors, derived under specific distributional assumptions on the citations. Their approach provides a simple way to explain and measure differences between scientific fields. In a similar vein, [2] propose a rescaling procedure based on the Gini entropy, and [12] propose a different rescaling that takes into account the number of coauthors: the fractional h-index.

We employ both streams of research as a starting point. More precisely, we follow [6], who, expanding the contribution of [9], propose a statistical approach indicating that a Poisson distribution is a well-suited approximation for the distribution of the h-index. In this paper we show that a Poisson distribution is also well suited to explain the drivers of the h-index. We then employ this theoretical result to understand whether the h-index of a scientist depends on his or her field of research, following the research line of [13], also followed by [15].
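To make the idea of a Poisson model for field effects concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the research field is the only covariate. Under that assumption, a Poisson regression with log link has a closed-form fit: the fitted mean of each field equals its sample mean, so the coefficient of a field relative to a baseline is the log ratio of the two means. The function name `poisson_field_effects`, the baseline field "Law" and the toy values are hypothetical and are not taken from the paper's data.

```python
import math
from collections import defaultdict


def poisson_field_effects(records, baseline):
    """Closed-form fit of a one-factor Poisson log-linear model.

    records  : iterable of (field, h_index) pairs
    baseline : field used as the reference category

    With a single categorical covariate and a log link, the maximum
    likelihood fitted mean of each field is its sample mean, so the
    coefficient of a field is log(mean_field / mean_baseline).
    Every field mean must be strictly positive.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for field, h in records:
        totals[field] += h
        counts[field] += 1
    means = {field: totals[field] / counts[field] for field in totals}
    base_mean = means[baseline]
    return {field: math.log(mean / base_mean) for field, mean in means.items()}


# Made-up (field, h-index) pairs, for illustration only; not the paper's data.
toy = [("Physics", 40), ("Physics", 60), ("Law", 20), ("Law", 30)]
effects = poisson_field_effects(toy, baseline="Law")
print(effects)
```

A positive coefficient means that the expected h-index in that field is higher than in the baseline field by a factor equal to the exponential of the coefficient. Assessing whether such a difference is significant requires standard errors, which this sketch does not compute.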

The paper is organized as follows: in the Methodology section we review the proposal of [6] and formalise the model; in the Application section we apply the new approach to a database of scientists that is homogeneous by nationality and, therefore, by scientific culture. Finally, the Discussion section contains some concluding remarks.

Methodology

The paper of [11] proposed a “transparent, unbiased and very hard to rig measure” ([1]): the h index.

According to the definition, a scientist has index [math]\displaystyle{ h }[/math] if [math]\displaystyle{ h }[/math] of his or her [math]\displaystyle{ n }[/math] papers have at least [math]\displaystyle{ h }[/math] citations each and the other [math]\displaystyle{ (n-h) }[/math] papers have [math]\displaystyle{ \leq h }[/math] citations each.
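The definition can be turned into a short computation. The sketch below is one common way to evaluate it: sort the citation counts in decreasing order and find the largest rank whose paper has at least that many citations. The function name `h_index` and the example counts are illustrative assumptions, not part of the paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break  # counts only decrease from here, so no larger h is possible
    return h


# Hypothetical citation counts for the five papers of one scientist.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))  # prints 4
```

In this example four papers have at least 4 citations each and the remaining paper has 3, so the index is 4.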

Following the work of Hirsch, many papers have discussed its application, especially in the bibliometric community. Some papers have focused on the statistical learning aspects behind the h index; among them, [9] stressed the relevance of a “statistical background” for the h index. Recently, [?] provided a complete statistical framework for the h index that holds for all sample sizes and respects the discrete nature of the citation data underlying the h-index. We now recall their proposal, as it forms the basis of our analysis.

Let [math]\displaystyle{ X_1, \ldots, X_n }[/math] be random variables which describe the number of citations of the [math]\displaystyle{ n }[/math] articles of a scientist. We assume that [math]\displaystyle{ X_1, \ldots, X_n }[/math] are independent with a common citation distribution function [math]\displaystyle{ F }[/math]. Let us then assume that [math]\displaystyle{ F }[/math] is continuous, at least asymptotically, although the citation counts are integers. According to this assumption, the h index can be formally defined by the following:

Declarations

Conflict of Interest

The Authors declare that there is no conflict of interest.

References