mAb Bioprocess Engineering In-Context Table Forecasts using Conversational AI Literature Insight Generations
Kevin Kawchak
Chief Executive Officer
ChemicalQDevice
San Diego, CA
November 27, 2024
kevink@chemicalqdevice.com
Bioprocess engineering has incorporated effective AI applications in recent years, consisting of traditional approaches that train models on relevant data and then analyze and predict new, unseen data. The missing component has been the ability to process mixed data from an assortment of dissimilar information sources with high contextual awareness to inform the Human-AI team on how LLMs and other authors' methods will further improve performance. Here, real-time web search or document retrieval methods achieved a maximum speed multiplier of over 600x (by 3.5 Sonnet) vs. the manuscript author for monoclonal antibody (mAb) bioprocess engineering kinetics across several models. ChatGPT-4o, with an average score of 9/10, was the leader in quality for this task, producing several detailed reports via document search that addressed how a paper's specific weaknesses could be improved with LLMs or two other authors' methods. This protocol was applied systematically for each of the other two papers, supported by the other two relevant bioprocess papers. o1-preview's advanced reasoning set a new standard over five other models in processing either 136 extracellular or 101 intracellular metabolite tables, incorporating the analysis of 12 additional paper summaries across two prompts with two table revisions. For extracellular metabolites, o1-preview generated an 18 metabolite table including all metabolite forecasts that were expected to be breakthroughs due to future integration of an LLM or other authors' recent methods. The model supported its forecasts with interpretable author citations and quotations for breakthrough metabolites, along with lists of author-specific and metabolite-specific insights that influenced its conclusions. For intracellular metabolites, o1-preview provided a full 101 metabolite table, matching the number of entries from the original Sukwattananipaat, P., et al. table, including confirmations for each metabolite regarding whether each forecasted value was expected to be a breakthrough. Overall, this work was represented by numerous speed advantages, literature insights to address paper weaknesses, and competent o1-preview in-context table analysis with supporting evidence from leading articles, represented by two 9.5/10 scores, to lead the first conversational AI mAb bioprocess engineering revolution. Manuscript, Seminar
Monoclonal Antibody Bioprocess Engineering Advancements Using Conversational Artificial Intelligence
October 27, 2024
Kevin Kawchak
CEO ChemicalQDevice
kevink@chemicalqdevice.com
Processing high dimensional and complex monoclonal antibody (mAb) bioprocess data in industry is now more efficient due to conversational AI. The human-in-the-loop approach to Large Language Model (LLM) inferencing with document retrieval and chained outputs is a probable benefit to existing biotechnology workflows. Potential risks of using natural language processing are minimized due to the utility of solving problems with vast amounts of structured and unstructured mixed data that can be verified by the Human-AI team. This novel work demonstrates the fast and state-of-the-art solutions of the o1-preview, ChatGPT-4o, L3.1-405B, and 3.5 Sonnet models. Specifically, o1-preview provided a response to 16 papers 110x faster than the manuscript author's time after the number of words was set equal. In addition, ChatGPT-4o was 371x faster than an optimal human researcher at examining and providing an estimate regarding dimension reduction or combinatorial optimization for a recent paper by Kao, M., et al. The third LLM speed advantage of 336x by ChatGPT-4o vs. the manuscript author was achieved using performance forecasts based on Monte Carlo simulations and Markov chain models and a current paper by Konoike, F., et al.
Part A featured the individual analysis of 5 recent mAb production papers, which emphasized the proficiency of o1-preview (9.9/10.0), ChatGPT-4o (9.2), and L3.1-405B (9.2) in providing a forecast report. Example generations for o1-preview and L3.1-405B typically established connections between using dimension reduction or combinatorial optimization and improving bioprocesses. Part B models generated tables regarding how LLMs can improve numerical data from 5 different papers using Monte Carlo simulations or Markov chain models. An example from ChatGPT-4o (9.0) was substantially more complete, accurate, and convincing than the table provided by 3.5 Sonnet (8.0). Part C utilized the report format from Part A combined with the numerical approach from Part B across 6 additional papers, led by o1-preview (9.0) and ChatGPT-4o (8.5). The o1-preview example followed the prompt format well, citing cases of how LLMs will utilize reinforcement learning and Bayesian optimization to improve mAb production. The work represents a standard for utilizing a considerable amount of bioprocess data to forecast new results, with the transition into LLMs providing near-real-time production data analysis aided by document retrieval to provide a synergistic effect with existing machine learning techniques. Manuscript, Seminar
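The word-normalized speed comparison mentioned in this abstract ("after the number of words were set equal") can be read as a ratio of writing throughputs. The sketch below is one plausible interpretation, not the manuscript's documented procedure; the function names and the example inputs are hypothetical placeholders, since the abstract does not report the underlying word counts or times.

```python
def words_per_second(words: int, seconds: float) -> float:
    """Throughput of a writer (human or model) in words per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return words / seconds


def normalized_speed_multiplier(ai_words: int, ai_seconds: float,
                                human_words: int, human_seconds: float) -> float:
    """Ratio of AI throughput to human throughput.

    Dividing throughputs is equivalent to rescaling both responses to the
    same word count before comparing elapsed times (an assumption about
    what "words set equal" means in the abstract).
    """
    return (words_per_second(ai_words, ai_seconds)
            / words_per_second(human_words, human_seconds))


# Hypothetical placeholder inputs, NOT values from the manuscript.
example = normalized_speed_multiplier(
    ai_words=500, ai_seconds=10.0,
    human_words=250, human_seconds=1000.0,
)
print(f"normalized speed multiplier: {example:.0f}x")
```

Comparing throughputs rather than raw times avoids rewarding a model simply for producing a shorter answer.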
Paclitaxel Biosynthesis AI Breakthrough
October 3, 2024
Kevin Kawchak
CEO ChemicalQDevice
kevink@chemicalqdevice.com
Paclitaxel, C47H51NO14, biosynthesis is an active area of research due to ongoing progress towards more sustainable and environmentally friendly production of the drug compound. Recent literature details the characterization of enzymes that play a role in synthesis, optimization of growth media, and RNA-related regulatory mechanisms. The method of PhD students spending excessive time performing literature reviews to discover new findings is obsolete due to faster and high quality state-of-the-art conversational AI. In this study, approximate AI times were obtained regarding how long it would take the fastest human researcher to read, analyze, extract information, and type a high quality 250 word answer, with the fastest time of 1,380 seconds being used as a standard reference. The slowest AI generation in the study was 79.19 s by ChatGPT-4o, which was still over 17x faster than the optimal human performance time. Here, a paclitaxel biosynthesis breakthrough was illustrated twice using LLMs and LMMs. In the first instance, full length papers were summarized by AI models, with the finding that AI provided more detailed answers across entire papers, generating over 10x longer descriptions and 12x faster times compared to the manuscript author's methods to summarize abstracts.
The outputs of individual AI generated answers yielded a 10 Paper Summary with 6,322 words, and served as the input for eight separate prompts, which provided valuable insight regarding both emerging and historical views of paclitaxel retrobiosynthesis, engineering microorganisms, as well as top 10 new research recommendations, and top 10 challenges for this area. The second paclitaxel biosynthesis advancement was demonstrated with a speed of 752 seconds for 36 generations compared to the single optimal human response of 1,380 seconds. Top models received an average AI judge score of 9.5 by ChatGPT-4o for Part A; a score of 9.3 by o1-preview, L3.1-405B, and ChatGPT-4o for Part B; and a score of 9.3 by ChatGPT-4o and 3.5 Sonnet, followed by a score of 9.2 for Wiz8x22B for Part C. These superior results have primarily been afforded by OpenAI, Claude.ai, and Meta AI new model releases in late 2024 that have helped to advance the paclitaxel biosynthesis field. The presence of speedups with more detailed answers over optimal human responses is supported by advanced cloud hardware that processes high dimensional and complex data continually to solve combinatorial problems such as those in this study using 15 different prompts across 163 generations. Manuscript, Seminar
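The speed figures quoted in this abstract can be checked with simple ratios. The sketch below uses only numbers stated above (1,380 s human reference, 79.19 s slowest AI generation, 752 s for 36 generations); the function name and the per-generation breakdown are illustrative assumptions, not calculations reported by the author.

```python
def speed_multiplier(human_seconds: float, ai_seconds: float) -> float:
    """How many times faster the AI time is than the human reference time."""
    if ai_seconds <= 0:
        raise ValueError("ai_seconds must be positive")
    return human_seconds / ai_seconds


HUMAN_REFERENCE_S = 1380.0   # fastest estimated human time, from the abstract
SLOWEST_AI_S = 79.19         # slowest single AI generation, from the abstract
BATCH_S = 752.0              # total time for the batch, from the abstract
BATCH_GENERATIONS = 36       # number of generations in the batch

# Slowest single generation vs. the human reference.
slowest = speed_multiplier(HUMAN_REFERENCE_S, SLOWEST_AI_S)

# Whole batch of 36 generations vs. one human response.
batch_total = speed_multiplier(HUMAN_REFERENCE_S, BATCH_S)

# Assumed breakdown: mean time per generation vs. one human response.
per_generation_s = BATCH_S / BATCH_GENERATIONS
per_generation = speed_multiplier(HUMAN_REFERENCE_S, per_generation_s)

print(f"slowest generation: {slowest:.1f}x")
print(f"36 generations vs. 1 human response: {batch_total:.2f}x")
print(f"mean generation ({per_generation_s:.1f} s): {per_generation:.0f}x")
```

The first ratio is consistent with the "over 17x" claim; the last two show that all 36 generations together still finished in less time than a single optimal human response.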
High Dimensional and Complex Spectrometric Data Analysis of an Organic Compound using Large Multimodal Models and Chained Outputs
September 12, 2024
Kevin Kawchak
CEO ChemicalQDevice
kevink@chemicalqdevice.com
Large Multimodal Models (LMMs) possess the ability to analyze chemical spectra of an organic compound using state-of-the-art conversational AI. These outputs can then be chained together and introduced as a text input for other LLMs or LMMs to predict the compound name. Here, a challenging 15 carbon molecule problem with 13 complex and high dimensional chemical spectra was analyzed, with the spectra supplied as images to unmodified versions of the Claude 3.5 Sonnet and OpenAI ChatGPT-4o models. ScholarGPT judged the responses across the 13 spectra with an average score of 9.01/10, and the highest-scoring response per individual spectrum from 3.5 Sonnet or GPT-4o was used to build the text-based chain. For Part B, the chain was then combined with two different prompt formats and the molecular formula and submitted to 8 different LMMs or LLMs, which produced new compound predictions. 3.5 Sonnet had the highest proficiency in utilizing the formula simultaneously with complex data, yielding three identical compound generations across two prompts, but was likely limited by the quality of the chain of 13, primarily with data from 6 2D NMR spectra. 3.5 Sonnet's compound prediction was then further improved in Part C by utilizing manual chained explanations of the spectra by the author to yield what is believed to be the correct structure, with stereochemistry, for the unknown problem. To the author's best knowledge, this is the first LMM to generate the C15H22O2 drug compound derivative (S)-ibuprofen ethyl ester using high dimensional data from 13 detailed spectra. The purpose of this study was to utilize cutting edge natural language processing techniques to evaluate an advanced chemical structure using IR, 1H-NMR, 13C-NMR, DEPT-NMR, GCOSY60, GTOCSY, GHMQC, GHMBC, GNOESY, and expanded views of spectra. Manuscript, Seminar
LMM Spectrometric Determination of an Organic Compound
August 26, 2024
Kevin Kawchak
CEO ChemicalQDevice
kevink@chemicalqdevice.com
Many machine learning models used in academia and industry that identify organic compounds typically lack the ability to converse over prompts and results, and also require expertise across a number of steps to obtain answers. The purpose of this study was primarily to gain insight into the advantages of current unmodified state-of-the-art Large Multimodal Models (LMMs) across several prompts containing multiple spectra of varying difficulty to evaluate the impact of training data, reasoning, and speed. This readily available and easy-to-use software for the identification of an organic compound based on a molecular formula and spectra was found to be reproducible across three similar LMMs. To the author's best knowledge, this marks the first time that three GPT variants were each able to correctly identify the organic compound quinoline using a variety of different spectroscopic images. The results were obtained using a 2-step process consisting of a) Uploading high resolution spectral images, and b) Submitting a text prompt with the images that requested a compound determination. The main findings were that 1) Four LMMs provided rational step-by-step interpretations of 1H-NMR, 13C-NMR, and 3 DEPT-NMR spectra from Prompt A, 2) Three of these LMMs, led by a GPT-5 preview model, combined these interpretations into the correct chemical structure with Prompt A, and 3) Two of these LMMs achieved a top score of 5/5 for also generating sequential explanations reflecting the order of the provided spectra along with most of the correct spectral and molecular formula explanations. Manuscript, Seminar
LMM Chemical Research with Document Retrieval
Kevin Kawchak
Chief Executive Officer
ChemicalQDevice
San Diego, CA
August 12, 2024
kevink@chemicalqdevice.com
Chemical research is more effectively progressed using Large Multimodal Models (LMMs) combined with Document Retrieval and recently published literature. The methods described here illustrate significant strides over previously tested Large Language Model (LLM) multi-document workflows for characterization assistance and generating new reactions. Here, 3.5 Sonnet, ScholarGPT, and ChatGPT 4o LMMs processed either 5 images or 5 supplementary documents from leading 2024 journals. Each of the three models performed inference on a detailed prompt to produce a response that included context from attachments. In addition, the LMMs were not provided with which of the 5 files contained the answer. The main findings were that 3.5 Sonnet had an average score of 9.8 for images, while two judges awarded high scores to ChatGPT 4o (9.7, 9.4) and ScholarGPT (9.5, 9.4) for document analysis. Judging was performed by a human evaluator for the image uploads, with document processing evaluated by Llama 3.1 405B and Nemotron 4 340B LLMs which correlated well and improved explainability. Highlights include 3.5 Sonnet's ability to interpret a Two-dimensional Nuclear Magnetic Resonance (2D NMR) spectrum accurately, along with Judge Llama 3.1's ability to provide consistent formatted scores with explanations. The results shown here help illustrate AI's continued revitalization of the established chemical research field. Manuscript, Seminar