Papers


Enhancing the application of large language models with retrieval-augmented generation for a research community

Juan José García Mesa, Gil Speyer

In the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML), the demand for efficient and innovative tools in research environments is ever-increasing. This paper explores the implementation of retrieval-augmented generation (RAG) to enhance the contextual accuracy and applicability of large language models (LLMs) to meet the diverse needs of researchers. By integrating RAG, we address various tasks such as synthesizing extensive questionnaire data, efficiently searching through document collections, and extracting detailed information from multiple sources. Our implementation leverages open-source libraries, a centralized repository of pre-trained models, and high-performance computing resources to provide researchers with robust, private, and scalable solutions.


Lab Dragon: An electronic Laboratory Notebook to Support Human Practices in Experimental Science

Marcos Frenkel, Wolfgang Pfaff, Santiago Núñez-Corrales, Rob Kooper

Lab notebooks are an integral part of science, documenting and tracking research progress in laboratories. However, existing electronic solutions have not fully leveraged the capabilities of a digital environment, and as a result most physics laboratory notebooks merely mimic their physical counterparts on a computer. To address this situation, we report here preliminary work toward a novel electronic laboratory notebook, Lab Dragon, designed to empower researchers to create customized notebooks that optimize the benefits of digital technology.


An Empirical Survey of GitHub Repositories at Research Universities

Samuel D. Schwartz, Boyana Norris, Stephen F. Fickas

In this work we aim to partially answer the question, “Just how many research software projects are out there?” by searching for open-source GitHub projects affiliated with research universities in the United States. We explore this through keyword searches on GitHub itself and by scraping university websites for links to GitHub repositories. We then filter these results by using a large language model to classify GitHub repositories as research software engineering (RSE) projects or not, finding over 35,000 RSE repositories. We report our results by university. We then analyze these repositories against metrics of popularity, such as stars and repository forks, and find that just under 14,000 RSE repositories meet our minimum criteria for projects that have a community. Based on the time since a developer last pushed a change to an RSE repository with a community, we further posit that 3,300 RSE repositories with communities and a link to a research university are at risk of dying, and thus may benefit from sustainability support. Finally, across all RSE projects linked to a research university, we empirically find the top repository languages are Python, C++, and Jupyter Notebook.


Preferred Practices Through a Project Template

Peter F. Peterson, Chen Zhang, Jose M. Borreguero-Calvo, Kevin A. Tactac

In the realm of scientific software development, adherence to best practices is often advocated. However, implementing these can be challenging due to differing opinions. Certain aspects, such as software licenses and naming conventions, are typically left to the discretion of the development team. Our team has established a set of preferred practices, informed by, but not limited to, widely accepted best practices. These preferred practices are derived from our understanding of the specific contexts and user needs we cater to. To facilitate the dissemination of these practices among our team and foster standardization with collaborating domain scientists, we have created a project template for Python projects. This template serves as a platform for discussing the implementation of various decisions. This paper will succinctly delineate the components that constitute an effective project template and elucidate the advantages of consolidating preferred practices in such a manner.