Generating Scientific Paper Abstracts with Deep Learning
- Paul Carter, PhD

- Jan 12, 2021
- 9 min read
Over the past few years it has become commonplace to see articles online about machine learning models that generate text based on the writing style and content of popular literature. The implementation of such an algorithm is usually based on Recurrent Neural Networks (RNNs), which provide a way to build a predictive model that retains a form of memory across a sequence.
In the world of scientific research (especially the physical and computer sciences), arXiv is the central repository where academics across a multitude of disciplines upload their manuscripts to share with the world. Recently a dataset providing metadata about the contents of arXiv was released on Kaggle. Among the numerous pieces of data available for each article is the raw text of the paper abstract. This text provides an interesting challenge for the RNN text-generation approach above, because scientific papers usually contain complex domain-specific language as well as scientific notation in raw LaTeX format.
Below is one example of a paper abstract from one of my own published articles taken from the dataset:
Low redshift measurements of Baryon Acoustic Oscillations (BAO) test the late
time evolution of the Universe and are a vital probe of Dark Energy. Over the
past decade both the 6-degree Field Galaxy Survey (6dFGS) and Sloan Digital Sky
Survey (SDSS) have provided important distance constraints at $z < 0.3$. In
this paper we re-evaluate the cosmological information from the BAO detection
in 6dFGS making use of HOD populated COLA mocks for a robust covariance matrix
and taking advantage of the now commonly implemented technique of density field
reconstruction. For the 6dFGS data, we find consistency with the previous
analysis, and obtain an isotropic volume averaged distance measurement of
$D_{V}(z_{\mathrm{eff}}=0.097) =
372\pm17(r_{s}/r_{s}^{\mathrm{fid}})\,\mathrm{Mpc}$, which has a non-Gaussian
likelihood outside the $1\sigma$ region. We combine our measurement from both
the post-reconstruction clustering of 6dFGS and SDSS MGS offering the most
robust constraint to date in this redshift regime,
$D_{V}(z_{\mathrm{eff}}=0.122)=539\pm17(r_{s}/r^{\mathrm{fid}}_{s})\,\mathrm{Mpc}$.
These measurements are consistent with standard $\Lambda\mathrm{CDM}$ and after
fixing the standard ruler using a Planck prior on $\Omega_{m}h^{2}$, the joint
analysis gives $H_{0}=64.0\pm3.5\,\mathrm{kms}^{-1}\mathrm{Mpc}^{-1}$. In the
near future both the Taipan Galaxy Survey and the Dark Energy Spectroscopic
Instrument (DESI) will improve this measurement to $1\%$ at low redshift.
Recurrent Neural Networks (RNNs)
Before diving into the implementation of this method on scientific paper abstracts, I will explain a bit more about RNNs and in particular two variants known as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). An RNN is a class of neural networks that applies the same function to every input in a sequence, while the output at the current step also depends on the computation from previous steps. This allows the network to build up an internal state, or memory, as it processes a sequence of inputs. Because of this, RNNs are a good choice for sequence prediction and for capturing the dependence between subsequent inputs.

Looking into the procedure of RNNs a bit further, we can write out the steps repeated for each element of the sequence:
h_t = ActivationFunction(w_x*x_t + w_h*h_{t-1})
y_t = OutputFunction(w_y*h_t)
These two steps then repeat for every input, with the hidden state h_t carried forward. The activation function is commonly chosen from sigmoid, tanh or ReLU, while the output function and the cost function used for training depend on the application.
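As a minimal sketch of this recurrence (using tanh as the activation and toy dimensions purely for illustration):
import numpy as np

def rnn_step(x_t, h_prev, w_x, w_h):
    # One step of the recurrence: the new hidden state mixes the current input
    # with the previous hidden state through the same shared weight matrices
    return np.tanh(x_t @ w_x + h_prev @ w_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
w_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
w_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, w_x, w_h)            # the state h is carried forward each step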
The standard RNN is good at interpreting short sequences, but on long sequences the algorithm suffers from the vanishing gradient problem. This happens because the same recurrent weight w_h is applied at every step in the chain; during backpropagation through time the repeated multiplication causes the gradient signal to either shrink towards zero or explode. The LSTM and GRU are extensions of the RNN designed to deal with this problem.
Long Short-Term Memory (LSTM)
In order to combat the vanishing and exploding gradient problems when dealing with long sequences, a method is required that allows for selective read, write and forget operations. Ideally, to compress the information stored in the state, the algorithm needs to be able to forget the information carried by stop words, read information from selected earlier sentiment-bearing words, and write new information from the current word.
A great overview of the architecture and mathematics that underpin LSTMs is given in this review. In summary, the repeating module is extended from a single activation-function neural network layer to a series of four layers with gates, each with specific behaviour intended to achieve the compression described above.
The first step is a sigmoid layer which acts as the forget gate, outputting values between 0 and 1 that decide how much of the existing cell state to keep given the current word.
Next, a further sigmoid (the input gate) is combined with a tanh layer that creates a vector of new candidate values to be written into the cell state.
Finally, a sigmoid output gate filters a tanh of the updated cell state to govern what is output from the module.
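Written out as a single cell update, a sketch of the textbook LSTM equations (biases omitted for brevity) looks like:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    # W holds one weight matrix per gate, each acting on the concatenated [h_prev, x_t]
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W['f'] @ z)           # forget gate: how much of the old cell state to keep
    i = sigmoid(W['i'] @ z)           # input gate: which candidate values to write
    c_tilde = np.tanh(W['c'] @ z)     # tanh layer: vector of new candidate values
    c_new = f * c_prev + i * c_tilde  # updated cell state
    o = sigmoid(W['o'] @ z)           # output gate: what the module exposes
    h_new = o * np.tanh(c_new)        # filtered output becomes the new hidden state
    return h_new, c_new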

Gated Recurrent Units (GRU)
A popular variant of the LSTM is the GRU, which has a very similar cell design except that the input and forget gates have been combined into a single update gate. More detail on the differences is given in this review.
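Reusing the sigmoid helper from the LSTM sketch above, the corresponding GRU update (again a textbook sketch, biases omitted) shows how the single update gate plays the role of the LSTM's input and forget gates:
def gru_step(x_t, h_prev, W):
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W['z'] @ z_in)   # update gate: merged role of input and forget gates
    r = sigmoid(W['r'] @ z_in)   # reset gate: how much past state feeds the candidate
    h_tilde = np.tanh(W['h'] @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde   # blend of the old state and the candidate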

Generating scientific paper abstracts using GRU
Now that we have the concept behind how GRUs work and why they provide a route around the long-sequence problem of RNNs, we can focus on using the algorithm to generate scientific paper abstracts.
As mentioned at the beginning of this post, the arXiv metadata is used and the model is built on Kaggle. The dataset itself has information for over 1.7 million scientific articles from a range of different subjects. To demonstrate the ability of the model to capture the vocabulary of different specialised subjects, 3 categories were chosen out of a possible 153. The fields to look at are (with the number of corresponding articles):
astro-ph.CO - Cosmology and Nongalactic Astrophysics (51,733)
cs.LG - Machine Learning (62,754)
physics.bio-ph - Biological Physics (11,084)
The dataset is reasonably large (2.62 GB) to handle in a simple loading framework, so the file is read line by line via a get_metadata generator function,
def get_metadata():
    with open('/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json', 'r') as f:
        for line in f:
            yield line
and the abstract for each paper is appended to a list, provided the article category matches the one we are modelling.
import json
from tqdm import tqdm

abstracts = []
metadata = get_metadata()
for ind, paper in tqdm(enumerate(metadata)):
    paper = json.loads(paper)
    # fields_to_run[i] is the category selected in the outer loop over the three fields
    if fields_to_run[i] in paper['categories']:
        abstracts.append(paper['abstract'])
Once the relevant articles have been collected into a list, all the strings are concatenated into one (for speed, 3,000 abstracts are sampled out of the total available), as sketched below. From here a conversion is needed to vectorise the text against the global vocabulary list.
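The sampling and merging step itself is not shown above; a minimal sketch, assuming random.sample with a fixed seed, which produces the abstract_merge string used below:
import random

random.seed(42)                            # assumed seed, purely for reproducibility
sampled = random.sample(abstracts, 3000)   # 3,000 abstracts out of those collected
abstract_merge = ''.join(sampled)          # one long string for character-level modelling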
import numpy as np

vocab = sorted(set(abstract_merge))
char2index = {u: i for i, u in enumerate(vocab)}
index2char = np.array(vocab)
text_as_int = np.array([char2index[c] for c in abstract_merge])
The overall merged abstract string is then split into sequences to iterate over per epoch. Once split, the dataset is shuffled and batched using a designated batch size of 64. The training data is now ready to be processed by our GRU neural network.
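The batching step is not reproduced here; a sketch following the standard TensorFlow character-level text-generation recipe (the sequence length of 100 and the shuffle buffer size are assumptions) produces the dataset and BATCH_SIZE used below:
import tensorflow as tf

seq_length = 100        # length of each training sequence (assumed value)
BATCH_SIZE = 64
BUFFER_SIZE = 10000     # assumed shuffle buffer size

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    # The input is the sequence minus its last character; the target is shifted one ahead
    return chunk[:-1], chunk[1:]

dataset = sequences.map(split_input_target)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)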
vocab_size = len(vocab)
embedding_dim = 300
# Number of RNN units in each GRU layer
rnn_units1 = 512
rnn_units2 = 256
rnn_units = [rnn_units1, rnn_units2]

model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

# Custom sparse categorical cross-entropy loss on the raw logits from the final Dense layer
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])

early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor='loss', patience=3)

EPOCHS = 50
history = model.fit(dataset, epochs=EPOCHS, callbacks=[early_stopping_callback])
where our build_model function is defined as:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units[0], return_sequences=True,
                            stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.GRU(rnn_units[1], return_sequences=True,
                            stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
The GRU architecture we build up consists of an embedding layer followed by two GRU layers with a decreasing number of units (512 then 256), and finally a dense layer with a number of outputs equal to the size of the vocabulary. The model uses an Adam optimiser, tracks the accuracy metric, and optimises the custom sparse categorical cross-entropy loss function. When fitting over 50 epochs, an EarlyStopping callback is invoked to ensure that the model only trains until the loss has converged; beyond this point the model risks overfitting.
Correspondingly the accuracy achieved for the processed scientific article categories was:
astro-ph.CO - 73.3%
cs.LG - 74.5%
physics.bio-ph - 72.4%
Once training has completed, the model is stored using the built-in TensorFlow save function, which writes a model protobuf file (.pb extension). This file format contains the graph definition of the model and the corresponding weights.
Once saved, this model can be reloaded with the tf.keras.models.load_model function, passing in the custom loss function that we used during training. These model weights can then be used in a restructured version of the model where the batch size is reduced to 1, allowing us to input some seed text for abstract prediction.
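As a sketch of this save/reload step (the model path is a placeholder; passing the custom loss via custom_objects is the standard Keras mechanism):
# Save in the TensorFlow SavedModel format (writes a saved_model.pb plus weights)
model.save('gru_abstract_model')

# Reload later, telling Keras how to deserialise the custom loss used at training time
model_old = tf.keras.models.load_model('gru_abstract_model',
                                       custom_objects={'loss': loss})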
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
model.set_weights(model_old.get_weights())
model.build(tf.TensorShape([1, None]))
model.summary()
A function was written which takes an input string and then makes a prediction of a pre-defined length of characters (in this case 1,500 was chosen).
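The generate_text function is not reproduced here; a sketch along the lines of the standard TensorFlow text-generation example (the temperature of 1.0 and exact sampling details are assumptions) would be:
def generate_text(model, start_string, num_generate=1500, temperature=1.0):
    # Vectorise the seed string using the character lookup built earlier
    input_eval = tf.expand_dims([char2index[s] for s in start_string], 0)

    text_generated = []
    model.reset_states()
    for _ in range(num_generate):
        predictions = model(input_eval)
        # Drop the batch dimension and sample the next character from the logits
        predictions = tf.squeeze(predictions, 0) / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # Feed the prediction back in as the next input, carrying the GRU state forward
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(index2char[predicted_id])

    return start_string + ''.join(text_generated)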
generate_text(model, start_string=u"For many years ")
Below are some examples of outputs for different scientific disciplines:
Cosmology and Nongalactic Astrophysics
Input string: "The nature of dark energy is "
Output abstract:
The nature of dark energy is highly designed to understand the dark matter particles in the LMC. The emission line in the disc. The comparison of the theory with the production of high-redshift SNe Ia as well. We study the use of a starburst. The semi-classical and catalogue of the intergalactic medium (IGM), with a low frequency polarization of the AGNs with the presence of several studies based on the observable signal originates for the same source of the Cosmic Microwave Background (CMB) temperature and polarization spectra of the black hole mass distributions and the dynamical mass and an astrophysical observables are consistent with the observed spectral energy distributions (SEDs) and a model of the IRAC colour component of galaxy clusters are likely produced by a constant $\Delta t_{\rm Lambda}$ is a factor 12 at z = 0.069 (-189) which is due to information about the mass of the cosmological simulation to the trends of the observed perihelia in the coordinate static universe. The number of astronomical observations of the NGC 2403 group appears also the relationship between the color selection and are calculated by the distribution of groups and clusters of galaxies. We present a new study of the measurements of the contribution of high-redshift spheroid is responsible for the inflationary epoch with an upper limit on the amplitude of the spherical collapse of the cosmic microwave background (CMB) radiation bubbles.
Machine Learning
Input string: "This new algorithm provides "
Output abstract:
This new algorithm provides high-level state representations, and the whole landscape of sequential bandit algorithms such as matrix multipliers are not only effectively analyzed using algorithms. In this paper, we propose a new approach to the problem of learning a set of possible artificial data are sub-linear equations which are consistent with the optimal solution of a set of variables that can be used to compute a supervised learning curves using a single subject of the approximate solution is often computationally explored in the online computational cost of a large number of instances and their performance on the data samples. We also demonstrate the effectiveness of this approach is better than those of such models. We also show that under the assumption that the experimental results show that the algorithm is based on the number of classes (such as "regression (RR) where the subset of the model and the partition function $f$ over the distance between two distributions over a single parameter to a local minimum is to maximize the traditional approach for the first regularization using standard techniques for data analysis of the dictionary. To determine the problem of learning a single training set by an extension of the input space. We also investigate the convergence to the logarithmic regret can be viewed as an application of the proposed method is also incorporated in the solution path of the problem.
Biological Physics
Input string: "Pathogens exist within "
Output abstract:
Pathogens exist within the system that incorporates the capacity of the interface of a light-harvesting complexes are reproduced by the strength of the case of noise-driven hybrid DNA sequence of activity. In the case of force queers to the dynamics of the model and the associations with self-organized criticality in the presence of continuum limits of the molecules and the continuous in the same time the network to interact with the results for a linear theory of coherence dependent properties of the concentrations of finite filaments and computer simulations and a connected network of structures and the same mechanics of colloidal particles. The former levels in terms of the non-specific simulation to account for the polymer and the absence of the pore constant for the oscillations in the presence of contact networks could be tuned to study long range correlation functions are relevant to the inherent process is determined by the strength of the activation energy transfer in the contact potential and contractility of such a single sample of the system is obtained with the simulation results of the system science with a probability of left2 has been studied in terms of the concentration of the problems is explored in the particular class of bacteria could be approach the stationary state of the constitutive law.
Conclusion
For these different scientific disciplines, the GRU model has been able to build up abstracts with complex, domain-specific vocabulary. In many cases, however, the structure of the sentences could be improved. One of the biggest successes of this trained model is its ability to generate some LaTeX formatting within the abstracts, as appears in the raw descriptions on arXiv. Possible points for improvement in future include varying the neural network architecture to optimise the model's ability to learn; the model may also have suffered from a lack of available data in some cases; and finally, rather than training a deep model from scratch, transfer learning could be utilised with large pre-trained language models (an example of this is OpenAI's GPT-3). An end-to-end example Kaggle notebook for the Astrophysics sub-category is available at: https://www.kaggle.com/pcarter/generate-scientific-abstract-on-topics-rnns-tf

