
Roche & Novartis: Effective Visualizations for Data Driven Decisions || Posit (2020)
Effective visual communication is a core task for all data scientists, including statisticians, epidemiologists, machine learning experts, bioinformaticians, and others. By using the right graphical principles, we can better understand data, highlight core insights and influence decisions toward appropriate actions. Without it, we can fool ourselves and others and pave the way to wrong conclusions and actions. While numerous solutions exist to analyze data, these often require many manual steps to convert them into visually convincing and meaningful reports. How do we put this into practice in an accurate, transparent and reproducible way? In this webinar we introduce an open collaborative effort, currently undertaken by Roche and Novartis, to develop solutions for effective visual communication with a focus on reporting medical and clinical data. The aim of the collaboration is to develop a user-friendly, fit-for-purpose, open source package that simplifies the use of good graphical principles for effective visual communication of typical analyses of interventional and observational data encountered in clinical drug development. We will introduce the initial visR package design, which integrates easily into a typical tidyverse workflow. The package provides guidance and meaningful default parameters covering all aspects from the design and implementation to the review of statistical graphics.

Webinar materials: https://posit.co/resources/videos/effective-visualizations-for-data-driven-decisions/

About Charlotta: Charlotta is a computational biologist by training and works as a data scientist in the Personalized Healthcare department at Roche, where she uses R to tap the wealth of information coming from healthcare data collected in real-world settings to support the development of new medicines.

About Diego: Diego is a data scientist specializing in applied machine learning at Roche Personalized Healthcare since March 2019. He has developed models to perform various tasks and analyze diverse data sources. Currently, his main applications of interest are in oncology and clinico-genomics.

About Mark: Mark is a methodologist supporting the clinical development and analytics department at Novartis. He focuses on data visualization, working on a number of internal and external initiatives to improve the reporting of clinical trials and observational studies.

About Marc: Marc is a biostatistics group head at Novartis. He is interested in advancing the methods and practice of clinical development, for instance through effective use of graphics.

https://graphicsprinciples.github.io/
Transcript
This transcript was generated automatically and may contain errors.
My name is Marc Vandemeulebroecke. I work at Novartis, and I just want to say a few motivating words on why we're doing what we're doing and what you can expect to hear today.
We start from the insight that graphics and visuals are such an important component of the work we do as quantitative scientists. They're so important for ourselves to gain insights into data, as well as for communicating results and conclusions to our stakeholders.
And if you've been following all those coronavirus charts out there for the past few weeks, like I did, you will agree that graphics are key to understanding what's going on.
Unfortunately, however, we are not always good at creating effective graphs. How often do we see or even create ineffective visualizations? And do we even know which choices or features make a graph effective for a particular purpose or audience?
This is our first focus, to convey this understanding. And this point is actually part of a wider initiative beyond today's webinar, and you will find a corresponding link at the end of these slides.
Secondly, once we do know what makes an effective graph, how easily can we actually implement it? That is, create it quickly and with minimal effort. This is our second focus, the development of an R package that has good graphical principles sort of built in, building upon many of the great packages out there and minimizing the code required for the user.
We quickly realized that these themes are of common interest across Roche, Novartis and probably many other companies and institutions. So we've started to team up and tackle them together. And one of the goals today is also to call for additional interested contributors.
Which leads me to today's agenda. After this introduction, my colleague Mark Baillie from Novartis will speak about the principles of effective visual communication generally. He will then hand over to Charlotta Fruchtenicht and Diego Saldana from Roche, who will focus more specifically, and a bit more technically, on the R package we've started to develop. The goal of this package, as I said earlier, is to make it really easy to implement those good principles in practice, specifically in a health care or drug development context, which is where we work.
All this is work in progress. And at the end of the webinar, we will call for further contributors and we will hopefully also have a few minutes for Q&A. So with that, Mark, do you want to start?
Principles of effective visual communication
Thank you, Marc. So effective data visualisation is a key skill for all quantitative scientists, including statisticians, epidemiologists or data scientists, whatever your current title is. Traditionally, this skill has not been a primary focus in the training and development of quantitative scientists. In graduate degrees, the focus is often on analytical and technical competencies. The skill of visual communication is often self-taught or developed later on in the job.
So effective visualisation is extremely important. It is a key skill that allows us to convey complex concepts and information to others. If we get this right, it helps support appropriate decisions or actions. If we get this wrong, it can lead to misinformation, confusion or even harm, especially in clinical and medical research.
The visualisation on the right is a good recent example of this covering the current coronavirus pandemic. The visualisation was recently published in The New York Times. It displays the concept of social distancing and why it is important. It is important because it helps break the transmission chain. The graphic also illustrates the concept of exponential growth. This is something we've heard a lot of recently. Here we see an infected individual who goes on to infect two others. They in turn infect two others until we begin to see many individuals infected with the virus, all stemming from a single person. By practising social distancing, we may break this chain, therefore reducing the number of subsequent infections, which could be substantial, as displayed in the bottom graphic.
We are not always good at it. On the right is an example from a clinical study report. Waterfall plots are commonly used to report treatment response in oncology clinical trials. Aesthetically, this plot doesn't look good, from the choice of patterns to the general look and feel. For me, it feels like it was produced by a typewriter rather than modern software. There's a lot not to like about this plot.
But as I said earlier, if we get visual communication wrong, it can lead to the wrong conclusions, and in medical and clinical research this can often result in harm. So apart from the aesthetic limitations, the real failing of this graph is not the look and feel, but that it expects the reader to make calculations mentally to compare treatment effects across multiple doses. Waterfall plots are often used to support decision making, especially in critical indications such as those found in oncology. So asking the reader to make those calculations mentally isn't an effective way to communicate.
So what do we mean by effective? We don't necessarily mean beautiful.
So to illustrate, here is an example of a visually appealing graph from David McCandless. He is a data artist and journalist from the UK. The visualisation is called Mountains out of Molehills, a timeline of global media scare stories. The visualisation attempts to investigate the number of news story reports and whether there is a sense of overreporting that could cause unnecessary panic. It covers earlier pandemics such as SARS, bird flu and swine flu, but also video games and killer wasps. Visually, it is appealing and draws you in, but I would argue it's not effective, as key information is hidden away in the legend.
So if you look at the legend, there are numbers in brackets that provide the deaths associated with each story. It's very difficult to see, I can imagine. This information is important to determine if there really is overreporting, but it is not factored into the main body of the graph. Also, the subtitle uses language such as "scare stories". There are comparisons to annual seasonal flu deaths and also to topics such as video gaming. All of these design choices indirectly influence the reader, which can then influence decision making.
So beautiful and effective, that's what we really want. Here's an example of a recent effective visualization published by The Economist. The visualization captures a phrase that we hear a lot of now: flatten the curve. The graph summarizes this concept really well. We have two distributions over time which represent the number of infections in the population. Each distribution represents a strategy: strategy one for not taking preventative measures such as social distancing, and strategy two that does. You can see that strategy two reduces the peak of the distribution through social distancing. This strategy extends the epidemic over a longer period of time, but reduces the peak number of infections. This visualization is especially effective in introducing the concept of flattening the curve.
But can we be even more effective? So here's another version of the same graph. Aesthetically, it's not as appealing as the previous one. But by adding a dotted line and a simple annotation, "health care system capacity", this visualization now highlights the why. Why do we want to flatten the infection curve and spread an epidemic over a longer period of time? The purpose of flattening the curve is not to overwhelm the health care system. So the dotted line illustrates that the first strategy would result in more infections than the health care system could cope with. This, sadly, can have terrible consequences, as we are currently witnessing.
So both graphs are very effective ways to convey complex concepts to the population and get messages out there. And I just want to highlight that both of these slides were influenced by a tweet from Carl Bergstrom. The link is there if you want to follow up on it. So effective visualization is effective visual communication. Effective graphs are visually appealing, intuitive and legible. They use the correct graph type and axis scales. They use proximity and alignment to facilitate comparisons. They use labels and annotations to add clarity to the message, as I illustrated before. And more importantly, effective visualizations enable clear and impactful communication. They elevate our influence with our stakeholders and facilitate informed decision making.
The three laws of effective visual communication
This is one of the main motivations behind the collaboration that we are presenting today and the initial design of the visR package. But how do we become more effective? Myself and colleagues at Novartis have been thinking about this topic for a number of years. We've tried a number of approaches to address this question, from reading all the literature available, defining best practices, developing galleries of code and newsletters to highlight the importance of the topic, to handing out cheat sheets. And we've had various levels of success. But to make this concept stick and easier to remember, we recently defined what we call the three laws of effective visual communication: have a clear purpose, show the data clearly and make the message obvious.
Let me expand on each of these three laws. Have a clear purpose: this is really about why we need this graph. We should really know the purpose of creating the graph, and this also relates to the data analysis itself. We should identify the quantitative evidence to support the purpose, and also identify the audience and focus the design to support their needs. Show the data clearly: this is about avoiding misrepresentation, choosing the appropriate graph type to display the data, and maximizing the data-to-ink ratio, which is really just a complex way of saying keep it simple. Finally, make the message obvious: use proximity and alignment to aid comparisons and minimize mental arithmetic. This goes back to the waterfall plot, where you really don't want the reader doing summations in their head. And use colors and annotations to highlight important details.
So I'll now go through each of these three laws in more detail. Law one: have a clear purpose. This first law is really advanced common sense. It follows typical planning questions such as why, what, who and where. Why do I need a graph and what is its purpose? Is it to explore data or to deliver a message? What evidence is available to support this purpose? This could be data sets, new experiments or simulations, for example. Who is my intended audience? Is it a mix of specialists and non-specialists, or both? You really need to think about who this audience is and support their needs. And the where: where will I publish, and what are the constraints there?
So for the where, we may identify certain formats such as a Word document, a PowerPoint presentation, a Markdown notebook or a Shiny app. The design, the purpose and the principles we have discussed are important for all of these different formats. But depending on the format, some of these principles may apply differently and the design choices may vary.
Another aspect of this law is to identify the underlying questions you're trying to address. Here I've highlighted an article by Roger Peng that discusses his thoughts on John Tukey, exploratory data analysis and design thinking. He argues that better analyses are related to the quality of the question, so more time spent on defining better questions is time well spent. This relates to the famous Tukey quote: an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question. So the message here is to spend more time thinking about your question prior to an analysis or designing a visualisation.
Another aspect of this law is why you are creating the visualisation. Thinking about this can help identify the audience and their needs, which in turn helps with the design. For example, if you want to dig into data, get familiar with the data or find the story in your data, then the audience is normally you. If you want to communicate the results or tell the story behind the data, then the audience is someone else. So being clear on the purpose is important here. For example, if you're designing a visualisation to be presented to a group of stakeholders where you have three minutes to get your recommendations across, do you really want the audience to play Where's Wally? Or, in America, Where's Waldo?
Law two: show the data clearly. The key part of this law is not to lie or misrepresent the data. Try to be open and transparent with your analysis and the reporting of it. Here we have an example from YouGov on an important topic: pizza toppings. The visualisation displays a pizza pie chart. It is confusing, as the numbers add up to way beyond 100%. Another confusing thing is that mushroom is the UK's most liked pizza topping; as a Scotsman, I really find this hard to believe. Another aspect of this law is to be transparent. Here I've displayed a response from YouGov to clarify the purpose of their chart. The intent was not to display a pie chart, but to present percentages from a survey. Here we see that being clear and transparent builds trust, and it helps with communication.
Another aspect of this apology is that it highlights the need to think carefully about your design and the graph type. Choosing the correct graph type aids interpretation, and thinking about the purpose of the graph can help identify appropriate graph types. Often we don't need to reinvent the wheel here. For example, if we want to show a deviation, a correlation or a ranking, then there are common graph types that we can select from.
Another aspect to think about in terms of graph type is what information you want to convey. So here is an example of the same data displayed six different ways. The pie and donut charts in panels A and B make it difficult to see the order of magnitude of some of the segments: you need to compare areas or angles, and these attributes are not easy to compare. The donut chart even omits the angles and bends the areas, making these comparisons more challenging. The mosaic plot in panel C relies only on areas; again, it's hard to compare across categories. It is better to use lengths with a common baseline or positions on a common scale, such as a bar chart or dot plot. The bar chart in panel D, though, introduces a fake third dimension, which is unnecessary and makes it hard to read numerical values from the height of the bars. Panels E and F are simple and show the data clearly. They also order the data by magnitude to aid comparison even further, so you're really not making the reader work or do mental arithmetic in their head. The dot plot in panel F uses minimal ink and draws the eye to the position of the dots, so it's probably the most effective way of displaying the data.
Choose the right scale for your data. So try to avoid plotting log-normally distributed variables on a linear scale. This can lead to misleading interpretation, especially when you introduce lines to encode uncertainty. So in this example, the uncertainty appears to increase above 1, which is unity, but this is an artifact of the choice of the scale. Using a log scale resolves this issue.
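To make this concrete, here is a minimal sketch in ggplot2 (the ratio data below are invented for illustration) showing how intervals that look lopsided on a linear scale become symmetric on a log scale:

```r
# Invented example: ratio estimates with confidence intervals.
library(ggplot2)

dat <- data.frame(
  study = paste("Study", 1:4),
  est   = c(0.5, 1.0, 2.0, 4.0),
  lower = c(0.25, 0.5, 1.0, 2.0),
  upper = c(1.0, 2.0, 4.0, 8.0)
)

p <- ggplot(dat, aes(x = est, y = study)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dotted")

p                    # linear scale: uncertainty seems to grow above unity
p + scale_x_log10()  # log scale: intervals become symmetric and comparable
```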
Another important aspect related to this is time and how it is displayed. Space measurements proportionally to the time at which they were taken. What do we mean by that? Measurements that are displayed close together are perceived to be closer in time. Not reflecting this in your visualization could introduce a visual bias; for example, slopes or trajectories may appear steeper or shallower than they actually are.
Law three: make the message obvious. A key aspect of this law is not to assume that the reader will understand what messages or information you're trying to convey; you have to really work at getting them to understand. A simple illustration of this is to try not to set text at an angle, and to think of alternatives such as transposing the graph. In the left-hand plot, we are essentially asking the reader to tilt their head to read the text. Do we really want to give readers neck strain? Simply transposing the graph resolves this issue and also produces a more effective visualization.
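As a small illustration of the transposing idea (invented data; in ggplot2 this is a one-line change):

```r
# Invented example: long category labels that would otherwise be angled.
library(ggplot2)

dat <- data.frame(
  category = c("Very long category label A",
               "Even longer category label B",
               "Another lengthy category label C"),
  value = c(10, 25, 17)
)

ggplot(dat, aes(x = category, y = value)) +
  geom_col() +
  coord_flip()  # horizontal bars: labels stay horizontal, no neck strain
```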
Also, try to avoid unnecessary color. Avoid using color to differentiate between categories of the same variable where it isn't needed; this can introduce confusion in a similar way to the previous pizza example from YouGov. Introducing unnecessary color also limits our options when we really do need color later on. Instead, use color when it adds value: use a bold, saturated or contrasting color to emphasize important details. In the left-hand plot, our eyes are naturally drawn to the top of the bars. In the right-hand plot, we introduce a contrasting color, and now our eyes are drawn to the bottom of the bars. This is an important design choice that allows us to highlight key information and important messages.
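Here is a minimal sketch of that design choice in ggplot2 (invented data; the highlighted category is arbitrary):

```r
# Invented example: grey for context, one contrasting colour for emphasis.
library(ggplot2)

dat <- data.frame(group = LETTERS[1:5], value = c(12, 8, 15, 6, 10))
dat$highlight <- dat$group == "D"  # the detail we want the eye drawn to

ggplot(dat, aes(x = group, y = value, fill = highlight)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "grey80", "TRUE" = "#D55E00"))
```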
And here's an example of using annotations and labels to help with messages. Use labels and annotations to tell stories about the data and models. On the left-hand side, we have a typical dose response plot. By adding annotations such as "active control" and "target dose" on the right-hand side, we start to provide more contextual information for the reader, and we can start to get to the why of the plot. Think back to the flatten-the-curve plot, where we added the dotted line to indicate the health care system capacity.
So finally, here's an example we created of a visualization that tries to follow all three laws. The analysis is a subgroup comparison of a genetic marker, using a repeated measures model over time. We use the title to state a conclusion or recommendation from the analysis: genetic marker positive is not predictive of treatment response. We use annotations to indicate the direction of the treatment benefit, and we highlight the main time point, the primary analysis at week 12, with a dotted circle. Instead of legends, we use annotations to indicate each subgroup, and we also use color to differentiate between subgroups.
So if you're interested in learning more about the three laws and some of the principles I've introduced today, I'd recommend that you visit the website below, graphicsprinciples.github.io. We have a cheat sheet there that summarizes a lot of these principles, and it was developed using R. There are also other great resources out there that you can use. For example, the book by Jean-luc Doumont is a real exemplar of scientific communication that everyone should read. If you're really interested in data visualization, I would recommend the other two books: Data Visualization by Kieran Healy, and the book by Claus Wilke, which focus not only on the principles of data visualization, but also on how to do this in R.
So effective data visualization is effective visual communication. Effective visualizations enable clear and impactful communication. They elevate our influence with stakeholders and they facilitate informed decision making. To help design effective visualizations, remember the three laws: purpose, clarity and message.
Introducing the visR package
Thank you, Mark. So what you've seen in this first section is how important it is to implement graphics principles for effective visual communication. But the reason we are giving this webinar is that you often don't see that in typical data science analyses, because implementing those visual principles can be quite tedious and time consuming, adding quite a bit of extra overhead on the coding side.
So especially early on in projects, data scientists tend to skip a lot of those principles to just get on with their project. For example, on the right you can see that creating this relatively simple dose response curve, which doesn't even adhere to all the principles Mark just described, requires as much extra code to style the diagram and add annotations as it does to plot the actual data. Because of this extra work, we often skip it.
But then downstream, for example when communicating to external stakeholders, we have to catch up on that work and actually implement new versions of the plots, figures and tables that we produced early on in exploratory projects. So we thought this is something we probably have to change if we want to be more effective in the way we communicate our results.
And as Mark mentioned, there are different stakeholders or audiences for your reporting outcomes. Early on in a project, this might be just you yourself: you might remember all the work you've done and all the data manipulations, and you might not need a lot of additional annotation. But usually you're working in a team, at least in clinical development, and so you have additional stakeholders who might look at the outputs that you create, maybe figures or tables. Those stakeholders might take your outputs and use them further to communicate their insights in their own teams.
So it's actually quite important that whatever you create in your analysis cannot be taken out of context, that at any time it is possible to reproduce the outputs you created, tables and figures, and to trace where the data came from. On the right, you can see a typical table shell for baseline demographics in a cohort. We're deliberately showing a table here, because Mark showed a lot of figures and we want to make sure you're aware that outputs can also be tables. If you look at the table, you can see that there is a lot of metadata annotated in addition to the actual demographics.
So, for example, every table and figure should have a title, and it should name the data source and the version the data comes from, so that if that plot or table finds its way into a PowerPoint presentation, it's easy to trace back where it came from and at what time point the analysis was done. As part of the rendered object, we should also always have information about the abbreviations and table columns being used, the statistical tests employed, and basic information about the cohort, such as sample size. Generally, we would also like the reports we create to be harmonized across the different visualizations, plots and figures, so that, for example, the same treatment arm is always coded with the same color in different plots.
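As a hedged sketch of carrying such metadata on the rendered object itself (the data set, source name and snapshot date below are invented), in ggplot2 this can be as simple as:

```r
# Invented example: title and data provenance travel with the figure.
library(ggplot2)

dat <- data.frame(arm = c("Placebo", "Treatment"), n = c(120, 118))

ggplot(dat, aes(x = arm, y = n)) +
  geom_col() +
  labs(
    title   = "Cohort size by treatment arm",
    caption = "Source: demo_adsl v2.1, snapshot 2020-03-01 (hypothetical)"
  )
```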
So there is this second layer, in addition to the visual communication and graphics principles that Mark just introduced, that we often have to take into consideration in our data science projects. And again, looking at the metadata, you can imagine that annotating all of this information can create quite a bit of overhead on the coding side and is often skipped early on in projects.
So we realized this is something we have to change. And as Marc said in the beginning, it's probably an issue that not only data science teams at Roche and Novartis have, but also a lot of other people doing analytics work across the industry and beyond. So we thought we wanted to implement something like a toolbox that helps us integrate graphical principles into our data science and analytics projects.
We didn't want to do that in a vacuum, though, so we reached out not only to other data scientists in our own teams, but also to important stakeholders like therapeutic area scientists or clinical scientists, to better understand what they would like to see in the typical reports we create and what the crucial aspects are that should never be missing. With that, we came up with a set of development considerations that the toolbox we wanted to develop should always adhere to.
And the first one is really not creating extra work. The toolbox should integrate seamlessly into our analytics and reporting workflows, and it should work well with the tools we are already using. Often the analyses are done in R, so it should play well with the tidyverse; plotting should probably be done using ggplot; and it should render into different output formats, like R Markdown with HTML output as well as Word and LaTeX, or interactively using Shiny. And the package or toolbox should be suitable for our special use cases in clinical development, but might also reach beyond them.
It should also be flexible enough that expert users find the tools useful and like to use them, to ease some of the coding burden and create, for example, those additional metadata objects around the visual outputs they are already creating. As Mark said, the audience for your outputs might change as a project evolves, so the toolbox should allow us to easily adapt to these different target audiences without having to repeat the actual analysis.
That could be, for example, by exploring different visualizations of the same data set to see which one helps us adhere to the three laws that Mark introduced, especially which one helps us convey the message clearly and helps in decision making. And as I said, the outputs might make their way through different stakeholders as well, so it should be possible to export them. This was something very critical for our teammates on the therapeutic area side: that they can easily take an output and present it in their own presentations.
This quickly led us to the idea of developing an R package, because R is increasingly popular and widely used in our teams, and it already comes with a bunch of really excellent packages that solve parts of the problems we just stated. There are already packages that provide very flexible plotting capabilities, for example ggplot, and we thought that building on those packages, standing on the shoulders of giants, would make the work less hard for us. The other nice part about using R is that it is flexible towards multiple analysis questions and stages of the workflow, and if we developed a package with different functions, we could build in the flexibility that our users requested. We obviously get documentation and testing, and given that it's open source, it's really easy to collaborate across companies and even industries in the future and contribute content.
Architecture and design of visR
So that's why we are now presenting the visR package in its current form, which is actually still in a very early, experimental phase; you could call it a prototype. We developed it and would like to hear your feedback later on as well. There are a few architecture considerations we put in place in order to implement the considerations and requirements we got in our early stakeholder interviews.
So the first one was really about seamless integration into the tidyverse and our analysis workflow. We thought that whatever we do, it should interact with dplyr and the modeling packages, and plotting should be built on ggplot. We also want to keep full transparency on data modifications, first of all for the reproducible reporting aspect, but also to be able to scale projects across different stages. That led us to the decoupled design you can see on the right, where we take our input data, say a one-row-per-patient data set, and then have dedicated functions that take care of the data wrangling and the modeling before another set of functions takes care of the visualization. That visualization ideally adheres to the graphical principles Mark described before. And last but not least, we have a separate set of functions again that takes care of the styling of the outputs. So there shouldn't be any hard-coded theming in the plotting function, for example; that is taken care of separately, which allows us to adapt the visualization to corporate designs, and also to keep the visualizations consistent across a report, where, for example, the treatments are always colored the same way.
We mentioned before that, in addition to being flexible, adhering to those graphical principles shouldn't add a lot of overhead to the coding work you're doing as a data scientist anyway. So we thought it would be great to have, in addition to those individual decoupled functions, wrapper functions that allow you to explore different visualizations for typical analysis questions. One typical question in clinical development, when working with a patient cohort at the beginning, is how many of the patients in the data set adhere to a set of inclusion and exclusion criteria, and you might want to show the cohort attrition. There are different ways of doing that, and Diego will talk about it a bit later. Having a wrapper function that allows you to explore these different visualizations can help you quickly find out which one might best convey your message clearly to the target audience you're currently designing your report for.
This was all very theoretical so far, and it's going to stay a bit theoretical for the next two slides before I hand over to Diego. But what I want to show here is how this would look if you implemented it for another typical question, around time to event analysis. With decoupled functions, if we had our initial input data model for patient data, with, for example, the intervention (like treatment), the time of follow-up or time to event, and the status at the end, then we wouldn't go directly from this input data model to the plot; we would go through three individual steps. The first is taking this input data and having an estimator function that computes the estimate, in this case the survival model. After cleaning that with broom, we get a tidy table as an interim data model, which we can then build our visualizations on. In this case it would contain, for example, the time, the stratum, the estimates at different time points and the confidence intervals. These can then easily be passed on to ggplot for visualization.
As I said before, we didn't want to reinvent the wheel where it wasn't necessary, so we are building a lot of those functions on existing packages from the tidyverse and beyond, such as broom or, in this case, also survival. The idea of this multi-tier approach is that you can define a lot of custom features that might be required, such as custom time windows or adding hypothesis tests. So after you have the interim data model, you have the visualization function, and then the styling at the end, which helps you adhere to the graphical principles we spoke about before.
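A minimal sketch of this decoupled pipeline, using the real survival, broom and ggplot2 packages (the visR function names were still evolving at the time, so this shows the underlying idea rather than the package's API; survival::veteran stands in for the patient-level input):

```r
library(survival)
library(broom)
library(ggplot2)

# 1. Estimation: one-row-per-patient input -> survival model
fit <- survfit(Surv(time, status) ~ trt, data = survival::veteran)

# 2. Interim data model: tidy table with time, stratum, estimate, CIs
km_tbl <- tidy(fit)

# 3. Visualization: build the plot from the tidy table
ggplot(km_tbl, aes(x = time, y = estimate, colour = strata)) +
  geom_step() +
  geom_step(aes(y = conf.low), linetype = "dashed") +
  geom_step(aes(y = conf.high), linetype = "dashed") +
  labs(x = "Time (days)", y = "Survival probability")

# 4. Styling: applied last as a separate layer, e.g. a corporate theme
# last_plot() + theme_minimal()
```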
So these were a lot of design considerations, and they are definitely not set in stone, so if people have ideas and thoughts about them, we would love to hear them. But Diego will show you now, in the next few slides, how we implemented the prototype and what the outputs look like for now. And with that, I would like to hand over to you, Diego. Thanks, Charlotta.
Practical walkthrough: time-to-event analysis
Yeah, so now that Charlotta has introduced the main architecture considerations and the main concepts we used to design the package, let's look at this in the context of a practical application: a typical time to event analysis workflow. When we're carrying out these types of analyses, there are a number of steps that will be there virtually every time, all of which require some level of communicating results or interim results, which can be communicated in different ways depending on the audience and the circumstances, as Mark also mentioned before.
And so, for example, you might start with something like building your analysis cohort. You might want to understand how many patients are kept after applying certain inclusion and exclusion criteria, and you might want to show that with either a table or a flow chart. Then, once you have your analysis cohort, you might want to understand the baseline characteristics of the population you're analysing, and you might do that using some kind of table one, a very popular table that's used in virtually every paper on these types of studies. And then you might be interested in doing some kind of survival analysis, for which you might plot something like a Kaplan-Meier plot, or show tables with the median survival. So there are different things that you might want to show as interim results, all of which require a certain amount of code, programming effort and time. And we as data scientists spend a lot of time redoing these very often.
So now let's look at these one by one. The first one was building an analysis cohort. The philosophy that we have tried to follow with visR was, as Charlotta mentioned, not to create more work, but also to implement out-of-the-box best practices and to reuse the knowledge we have acquired through our own experience and through the interviews we carried out. The way this works, as you can see, is that you can create a list of filters as well as descriptions. So it takes very little code to create your cohort attrition using visR, and then really creating a table that shows the attrition just takes basically five lines of code using this function called vr_render_table. You can then output this nicely formatted table that shows you the relevant information, including the criteria being applied, the condition, how many patients are being kept and excluded, as well as percentages for these. And of course, it will show nicely formatted titles as well as metadata that might be relevant, such as data sources. So you don't have to spend a lot of time thinking about what you should show in this output in order to follow best practices.
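The exact signature of vr_render_table isn't shown in the talk, so here is a hypothetical sketch of the underlying attrition bookkeeping in plain dplyr (the patients data frame and its columns are assumed):

```r
library(dplyr)
library(tibble)

# Assumed one-row-per-patient data frame with columns age and prior_trt.
criteria <- list(
  "Age >= 18"          = function(d) filter(d, age >= 18),
  "No prior treatment" = function(d) filter(d, prior_trt == "No")
)

cohort    <- patients  # hypothetical input cohort
attrition <- tibble(criterion = "All patients", remaining = nrow(cohort))

for (nm in names(criteria)) {
  cohort    <- criteria[[nm]](cohort)
  attrition <- add_row(attrition, criterion = nm, remaining = nrow(cohort))
}

attrition  # patients kept after each inclusion/exclusion step
```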
You might also want to show this in the form of a flowchart, which is a very typical visualization of the attrition diagram. Again, this takes very little code: you can create a vector of complementary descriptions and reuse the outputs from the previous chunk of code, and using this vr_attrition function, you can plot a nicely formatted attrition diagram which shows, at each step, how many patients have been kept and excluded. This is, again, a very popular figure that shows up in many real world data analysis papers. And I should highlight that this type of package can help us better comply with reporting guidelines like CONSORT and STROBE.
Once we have selected our baseline population and our analysis cohort, we might be interested in showing and understanding the baseline characteristics. This is typically done using the very popular table one. We looked at a number of implementations of it and couldn't find one that really satisfied all our requirements, so we created our own. The nice thing about it is that it's very flexible. It comes with very good built-in summary functions that help you summarize your categorical as well as numerical variables out of the box, but it also lets you input your own custom summary functions if those don't satisfy your specific requirements. It also outputs the tables in different formats, including kable, RStudio's gt, as well as DT HTML tables. And importantly, it lets you download the table one as Excel or CSV, which was one of the requests that came up most often when we interviewed our internal scientists. Again, it takes very little code, about four lines, and it follows best practices and reuses the knowledge we have acquired.
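Since the prototype's table one interface isn't shown here, a rough sketch of what such a summary computes, in plain tidyverse code with the survival::veteran example data:

```r
library(dplyr)
library(survival)
library(knitr)

veteran %>%
  group_by(trt) %>%
  summarise(
    n        = n(),
    age_mean = mean(age),
    age_sd   = sd(age),
    .groups  = "drop"
  ) %>%
  kable(digits = 1, caption = "Baseline characteristics by treatment arm")
```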
Then we might be interested in doing a survival analysis. This is typically done using a Kaplan-Meier curve together with a risk table, and usually there are a number of metadata elements that are important here. I should highlight that we have based the design of our Kaplan-Meier visualization on the findings of a paper by Morris and collaborators, a survey conducted among a large number of researchers about what a good Kaplan-Meier plot should look like. Our Kaplan-Meier plot shows relevant information out of the box, including things like the number of patients, nicely formatted axis labels including units where needed, as well as the data source from which the plot was made. It also shows a risk table including the number of patients at risk, the number of events and the number of censoring events, grouped by stratum at regular time points. This format follows the recommendations by Morris and collaborators.
So how do we do this using visR? It's very simple; we have two options. One is to plot the Kaplan-Meier curve alone, using the first chunk here, and then the risk table alone, using the second chunk in the middle. Both of these show our design philosophy of separating the estimation from the plotting and styling. But we also have the option of plotting all of this in one go, with one function call: the chunk at the bottom. This one is just one function call and gives you that nice plot. Again, very little code, and it reuses best practices and knowledge.
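As a hedged stand-in for that one-call wrapper (the prototype's own function names aren't given here), the existing survminer package illustrates the same idea: one call producing the curve plus the risk table:

```r
library(survival)
library(survminer)

fit <- survfit(Surv(time, status) ~ trt, data = survival::veteran)

ggsurvplot(
  fit,
  data       = survival::veteran,
  risk.table = TRUE,   # numbers at risk per stratum at regular time points
  conf.int   = TRUE,
  xlab       = "Time (days)",
  ylab       = "Survival probability"
)
```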
So finally, we have also implemented a set of convenient estimation functions, including median survival times by stratum, as well as multiple methods for testing equality between strata. And again, you can do this very easily with only a few lines of code. And it shows you very nicely formatted tables that follow best practices.
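The equivalent estimates are available from the real survival package, which the prototype builds on; a minimal sketch:

```r
library(survival)

fit <- survfit(Surv(time, status) ~ trt, data = survival::veteran)
summary(fit)$table  # includes median survival and CIs per stratum

# Log-rank test for equality between strata
survdiff(Surv(time, status) ~ trt, data = survival::veteran)
```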
Roadmap and call for contributors
So now, going into our package roadmap, I would like to highlight that this is still a prototype. We are basically reaching the stage where our prototype is ready, and now we want to launch this initiative out there and identify collaborators to kickstart the next step of our project.
And so with that, I'd like to announce that we're looking for contributors to join the visR team. Again, visR is still in its experimental stages; we welcome feedback and ideas for features, using GitHub issues, for example. There are two ways in which you can contribute. The first is what we call the open source way: an informal collaboration where you might pick up an issue and work on it proactively. The second, more formal way is where you reach out to us, and then we can see together whether it makes sense for you to join our core team. What kind of contributions are we looking for? We're open to hearing your ideas and asks around design choices, project governance and hands-on engineering, as well as how to maintain an actively used package. If you want to get in touch, please reach out to Mark Baillie and James Black, whose emails are listed here.
Finally, I'd like to acknowledge that many other people have been involved in the development of this package, and we would like to thank them. We have also made heavy use of the open source packages listed here, without which this prototype would not have been possible. And with that, I'd like to hand over to Marc.
Q&A
Yeah, so thanks, Diego, and thanks, Mark and Charlotta. That's the end of the presentation, and I guess we can start the Q&A session. Let me start with one of the questions that came in during the presentations. Someone has asked: are you going to open collaborations with federal agencies? We are working with various other groups, but more specifically, maybe, Mark, do you want to take this question?
Yeah, thanks, Marc. We would be open to collaborating with anyone. This is part of a wider set of collaborations. For example, PSI, Statisticians in the Pharmaceutical Industry, has also started a special interest group on data visualisation, of which a number of us are currently members. And there's also the STRATOS initiative, which is related to this and is looking at methodology as a whole for observational data. But yes, we would be open to collaborations from any avenue.
Thank you, Mark. Another question that came in earlier was: when will this package be on CRAN? So as I said, currently it's in the prototype stages. I think we would still have to go further into development before putting it onto CRAN, but it's definitely on our radar.
Thank you. Another question was: what do you mean by join the core team? Charlotta? Thank you, Marc. Joining the core team, I guess, means that at the moment we're a relatively small group of people who have regular meetings around aspects like the package architecture, for example. If you join the core team, you would be part of driving the general direction of the package, making design choices and things like that, whereas if you just picked an issue on GitHub and contributed to it, there would be a little less involvement in the overall direction, and you might not want to commit to those alignment meetings and other aspects.
Thank you, Charlotta. There are two questions which are relatively similar. One was: is this focusing on clinical data only? And another: do you have a sub-team focused on reporting for real world data? So I guess, overall, the question is: what's the focus of this effort? Mark, do you want to take that one? I would say the general focus at the moment is exploratory analysis of clinical and medical data. Diego and Charlotta could probably back me up here, but they work primarily on real world data. So it is a mixture of clinical and real world data. Thank you, Mark.
A quick one was, can you share the link again to those graphical principles? I will take this one. You can see it on this last slide.
Do you have any recommendations for developing visualisations for different kinds of users, for example users with different fluency in data? Diego, what are your thoughts on this one? Yeah, I think in general, as Mark mentioned before, it's really about knowing your audience, so I really couldn't give you one rule of thumb for everything. But as for how we are handling that with the package, that's really part of why we have decided to separate the estimation from the visualisations and to have multiple ways of presenting a message. Because, as Mark said, you should really know your audience, and that translates into being able to communicate your results in different ways. That flexibility is something that we want to embed in our project as well.
Thank you, Diego. I'll move on to a question regarding compatibility. Two people have asked about intentions to combine this with LaTeX, or somehow create a link to LaTeX, and on the other hand, whether we are planning to create templates for reporting with R Markdown. Both of these questions can maybe go to Mark. Yeah, at the moment, I guess the table one is an example where the rendering engine can change, but the underlying data, or the reporting of the data, should not change. So I do believe that LaTeX is an option, but it hasn't been implemented. I possibly need to hand over to Charlotta to confirm, but we would be looking at different engines for reporting.
Yeah, just to add to that: I saw there was another question in the chat about gtsummary. The reason we built our own table rendering was to support different engines, namely kable, gt and DT, and that is specifically because some of these already come with LaTeX support, which is implemented already, at least in a rudimentary form. Regarding templates, I think that's a great idea. We haven't really reached that stage yet, but we try to create the functions in a flexible way that allows easy integration into different Markdown output formats as well as Shiny.
Thank you, Mark and Charlotta. Here's another question: is it planned to have only survival analysis, or also other epidemiological models or analyses? I think the answer is clearly yes, we want to expand to many different types of analysis, common ones first and gradually perhaps some less common ones as well.
One question I think is important here, just coming in: what level of proficiency are you looking for in contributors? Maybe Charlotta, do you want to answer this one? Sure. I guess we are open to all levels of proficiency. As we said, this package is at a really early stage, so we'd love to hear from and see contributions from people who have a lot of experience developing R packages, from whom we could probably learn quite a bit, being more on the data science and domain expert side at the moment. But also people who would be target users, or who want to write documentation or vignettes, are very welcome. So all levels of proficiency are welcome to join the team.
Thank you very much. And maybe a last question, because we only have one minute left. Someone has asked about the possibility to incorporate corporate branding, colour schemes and the like on top of the good graphical principles. The answer is clearly yes; that's the way the package is designed. It should be possible to plug your personal or your company's corporate branding in at the very last step, on top of the visualisations.
