
Wes McKinney & Hadley Wickham (on cross-language collaboration, Positron, career beginnings, & more)
We hosted a special event with Posit PBC featuring Wes McKinney (pandas & Apache Arrow) and Hadley Wickham (R & the tidyverse) to ask questions, share your thoughts, and exchange insights about cross-language collaboration with fellow data community members. Here's a preview of what came up in conversation:

1. Cross-language collaboration between R and Python
2. Positron, a new polyglot data science IDE
3. Open source development: how Wes and Hadley got involved in open source, and their experiences building and maintaining open-source projects such as pandas and the tidyverse
4. Documentation for R and Python, especially in the context of teams that use both languages (shoutout to Quarto!)
5. The use of LLMs in data science
6. The emergence of libraries like Polars and DuckDB
7. Challenges of switching between the two languages
8. Package development and maintenance for polyglot teams that have internal packages in both languages
9. The future of data science

The chat was on fire for this conversation, and we've gathered most of the links shared among the community below.

Documentation mentioned:
- Positron, next-generation data science IDE built by Posit: https://positron.posit.co/
- Quarto tabset documentation: https://quarto.org/docs/output-formats/html-basics.html#tabset-groups

Packages / extensions mentioned:
- Pins: https://pins.rstudio.com/
- Vetiver: https://vetiver.posit.co
- Orbital: https://orbital.tidymodels.org
- Elmer: https://elmer.tidyverse.org
- Tabby Extension: https://quarto.thecoatlessprofessor.com/tabby/

Blog posts:
- AI chat apps with Shiny for Python: https://shiny.posit.co/blog/posts/shiny-python-chatstream/
- Using an LLM to enhance a data dashboard written in Shiny: R Sidebot & Python Sidebot
- Marco Gorelli Data Science Hangout (Polars): https://youtu.be/lhAc51QtTHk?feature=shared
- Emily Riederer's blog post on Polars: https://www.emilyriederer.com/post/py-rgo-polars/
- Jeffrey Sumner's tabset example: https://rpy.ai/posts/visualizations%20with%20r%20and%20python/r_python_visualizations
- Emily Riederer's blog post on Python and R ergonomics: https://www.emilyriederer.com/post/py-rgo/
- Sam Tyner's blog post on lessons from "Tidy Data": https://medium.com/@sctyner90/10-lessons-from-tidy-data-on-its-10th-anniversary-dbe2195a82b7

Other:
- Hadley Wickham's cocktails website: https://cocktails.hadley.nz
- Posit subscription management to find out about new tools, events, etc.: https://posit.co/about/subscription-management/

New to Posit? Posit builds enterprise solutions and open source tools for people who do data science with R and Python. (We are also the company formerly called RStudio.)

We'd love to have you join us for future community events! Every Thursday from 12-1pm ET we host a Data Science Hangout with the community and invite you to join us! You can add that event to your calendar with this link: https://www.addevent.com/event/Qv9211919
Transcript
This transcript was generated automatically and may contain errors.
Hi all. Thanks so much for taking the time to join us today. If we haven't had the chance to meet before, I'm Rachel. I lead customer marketing at Posit and I also co-host our weekly Data Science Hangout. I'm joined by my other co-host here behind the scenes, Libby.
I know many of you are already Posit customers, but just in case Posit is new to you: Posit builds enterprise solutions and open source tools for people who do data science with R and Python. We are also the company formerly called RStudio, which is why my mug is the RStudio one, not the Posit one today.
But I'm so excited to have you here for this special event today. So I'm joined by Wes McKinney, entrepreneur and open source software developer focusing on data science tools and analytical computing. Wes is co-creator of the pandas, Apache Arrow, and Ibis projects and currently principal architect at Posit. And I'm also joined by Hadley Wickham, chief scientist at Posit. Hadley builds tools to make data science easier, faster and more fun. His work includes packages for data science like the tidyverse, which includes ggplot2 and dplyr.
So for today's session, this is going to be a casual chat for us all to ask questions and exchange insights with each other about cross language collaboration. But we at Posit are also really excited to learn from you and to better understand how your teams work with both our free and open source projects, but also our professional products like Posit Workbench and Posit Connect.
So I do encourage you all to connect with each other here too in the chat. If you've been on our Data Science Hangouts before, this is one of my favorite things, people getting to know each other through the chat. So if you want to briefly introduce yourself and say hi, maybe include your role or where you're based, or something you do for fun. Feel free to share your LinkedIn as well. The chat is yours to share resources with each other.
There were so many great questions submitted ahead of time, which I can definitely start with. But today you can also jump in and ask questions that come up live, or share your own experience. You can put questions in the chat; if there's anything you want me to read out loud instead, put a little asterisk next to it. We also have a Slido link where you can ask questions anonymously. And if I end up missing something, feel free to raise your hand on Zoom and we can call on you to jump in there too.
Real quickly, we did keep this to a smaller group today, but I will be sharing the recording more broadly; I just want to make sure I let everybody know that. I can also send the recording to you all later this week as well. But thank you again so much for being here today. And let's jump in. Wes and Hadley, I just briefly introduced you both, but if you want to say hello here first too, and maybe also let us know something you do for fun outside of work.
Introductions
OK, Hadley, go first. I'll let Hadley go first. OK. Hi, I'm Hadley. As Rachel just said, I'm chief scientist. I make lots of R packages. And outside of work lately, I've been doing crochet.
I was just thinking of buying one of those kits the other day. Yeah, I got onto it from Woobles, which makes it super easy to get up and going. And I highly recommend it for a Christmas present.
And the fun, cool Woobles connection is that the people who started Woobles were actually ex-data scientists from Google who used the tidyverse in their previous roles. So I thought that was pretty awesome. That is, I had no idea.
All right, Wes. Yeah, so I'm Wes. Most of the time I live in Nashville, Tennessee. Over the last year at Posit, I've been mostly, I would say, mostly working on Positron, which is a new polyglot data science IDE that's in public beta. Pretty excited about where that's going.
I joined Posit about a year ago, but we've been collaborating actively on Arrow and making polyglot data science projects work better, making Python and R play nicely together, since, gosh, it must go back to like 2015 or 2016. So coming right up on a decade. It's been a pretty long collaboration on the open source tools, and it's great to be able to work together on a day-to-day basis in the company. I guess we released Feather in like 2016.
And the fun thing outside of work. I'm a big, you know, big on cooking and cocktails. I know Hadley's also really into making cocktails, maybe more than me. But during the wintertime, I enjoy doing like kind of large braises, you know, putting something in the oven to cook for five or six hours, and that makes for a really fun dinner party.
Getting into open source
So, let me jump in with some of the questions people submitted ahead of time, and we could start early days here. So the first one is, how did you both get into open source? Wes, do you want to go first?
Yeah. For me, it was a bit stumbling into it because I started doing Python programming back in like 2007, and I was working inside a hedge fund where all the code was very secret and there was no open source. And even using open source was a little bit, you know, was a little bit dicey. You had to be really careful about, you know, what code you pull in and everything was really scrutinized.
But I learned about, you know, the scientific Python community and started looking into these different projects and how they became open source projects and what was an open source project. And at a certain point, this was like the middle of 2009, I decided that I really wanted to open source, you know, what was then very early version of Pandas as an open source project. And so that led me to, I finally got permission to do that. And then I went to my first PyCon in 2010. And that was like my first foray in the Python community.
A lot of the folks in the scientific Python community, like NumPy, SciPy, people who go to the SciPy conference, like they, you know, I met them and then they mentored me in like how to do open source, how to build open source communities, and just became a bit of a, you know, a bit of an addiction, I guess, after that. And, you know, really enjoy working in public and building tools that are freely available on the internet and have a lot of impact.
And Hadley, what about you? What, why data science for you? Yeah, so I, in like high school, I really enjoyed both like programming and statistics. And so I ended up doing a double major in computer science and statistics, which at the time, seemed like a weird combination to a lot of people. But it's now obviously what we call data science.
So I did that at the University of Auckland, which was the home of R. So as far as I can tell, I actually started using R in 2003. And I had to look up and that was R version 1.6, which is a little horrifying to me that I've been using R for like 21 years now.
So that, so R was open source and that, I mean, that just kind of felt natural to me to like try and, you know, develop R packages and release them in open source. And just like such a great way to have like an impact on the world. And, you know, my original career track was like more, more academia thinking I would be a professor somewhere one day. And it just seemed to me like open source was just such a great way to get your ideas in the world and not just provide like a text description of what you're working on, but to provide code that people could actually use to implement your work.
Pandas, Polars, and the data frame ecosystem
Okay, let me jump into some of the package type questions, but I see a few coming into the chat as well. I noticed a few of the pre-submitted questions focused on pandas and also mentioning polars. So I'm curious, Wes, how do you envision pandas evolving over the next year in response to performance-first single machine data frame libraries?
This is a pretty common question. You know, because pandas has millions of users, it's very hard to make large sweeping changes to the project. So some years ago, back when we were starting the Arrow project, which provides kind of a more efficient data management layer, so a place to store data efficiently in memory and then compute all the data frame operations really efficiently against that data, there was a discussion about whether we could make large sweeping changes to pandas to make it a lot faster and more efficient, with better scalability and memory use on a single node. But ultimately, we found that making significant changes would cause too much disruption to people that depend on pandas for all of their business applications.
And so over the last 10 years (I haven't been actively involved in day-to-day pandas development), the team has been really careful to introduce new components and to refine the internals to make them more efficient. But ultimately the project has to preserve its API and not make changes too quickly, because that risks breaking people's code and harming people that don't have tests for their production code.
One of the big changes has been the introduction of a more robust extension arrays system. So that enables people to use Arrow arrays in their pandas data frame, so they can get a lot more efficient string data types and get better analytics performance that way. There have been some other performance and memory use improvements around the project. But in terms of really high performance computing for data frames, I think that's mostly been happening in the DuckDB and Polars projects. You can feed a pandas data frame into DuckDB and use DuckDB to execute on it, and I think maybe similarly with Polars. But, you know, without the burden of supporting over a decade's worth of legacy code and an existing API, it's been nice for projects like Polars to be able to rethink the data frame API and build something new from the ground up that's focused on performance and scalability.
But also with a much smaller API. And so the way I describe Polars to people, it's like a much smaller, simpler API. There's no row labels. Like, there's a bunch of stuff that makes things more complicated in pandas that doesn't exist in Polars. So I think Polars is, you know, maybe more similar to an R data frame or, you know, if you're using dplyr, the dplyr API and the Polars API are very similar.
Quarto for polyglot documentation
Yeah, sure. It's buried somewhere in the chat. So, I work for the Washington State Department of Health. We have a pretty large base of R users, and we also have quite a few Python users as our technical capability continues to expand. And we've been struggling to create documentation that serves both of those communities. Posit Workbench is coming to us soon, and with things like Positron and these polyglot IDEs (some of us use VS Code too), Positron shouldn't be that big of a leap there.
But we've been trying to create documentation that can serve both communities better with Quarto, especially since Quarto works well in Python and R, but we've had challenges using them together. So, we want to use Quarto to share the same documentation in different contexts, quickly switching between R and Python for our users who need to see one language or the other. They usually don't know both; it's usually only R or only Python. And I was just curious, are there examples out there or recommendations that you might have for creating this type of documentation that we can switch between quickly?
There's one thing that I'm curious about in Quarto, and that's with, you can use tab sets. And I think if you give the tab sets all the same name, you can kind of switch them globally on a page. So, then you can have an R code and Python code, each in their own tab. And then when you pick one for Python, that changes everywhere. I can't find that on the Quarto docs, so maybe I imagined it, or maybe it's not documented. But I know that was something that came up in some of our internal discussions as we start to create more packages that work with both R and Python as well. So, that's definitely something to try with the tab sets, if you're not already.
Great. Yeah, and I'll take a look. I'm curious about the global switching, because that would be perfect. I'm just not sure how to implement that yet. Yeah, and if it doesn't work, just ask, because it should work.
I just found this. I was just Googling, and I found a Quarto extension called Tabby, T-A-B-B-Y, that just makes it easier to create code blocks with tab sets. So you could switch between the R version of a code block and the Python version. I'm not endorsing it; I haven't used the extension, but it came up in case that's interesting. And I just found in the Quarto docs that the global switching is supported: you have to add a group attribute to your tab sets. I dropped a link in the chat.
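For anyone who wants to try the synchronized tab sets discussed here, below is a minimal sketch of what a Quarto page might look like using the tabset-groups feature from the Quarto docs. The `group` attribute is what keeps tabs with matching titles in sync across the page; the code inside each tab is just illustrative:

````markdown
::: {.panel-tabset group="language"}

## R

```{r}
# R version of the example
summary(mtcars$mpg)
```

## Python

```{python}
# Python version of the example (assumes a comparable df is defined earlier)
df["mpg"].describe()
```

:::
````

With the same `group="language"` on every tabset in the document, a reader who clicks the Python tab once sees Python everywhere on the page.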
Positron and the polyglot IDE
Okay, I see, Manuel, you had asked a question when you registered about Positron, and I'm seeing some excitement about Positron in the chat, too. And feel free to jump in if you want to add any more context to it. The question was, in what ways does Posit see using the best of both for R and Python development occurring in Positron?
Yeah, and I guess, you know, I was going to add a different question slightly on this theme. I'm with Exxon Mobil, so I'm with a large enterprise, and we have many products, many solutions. One of them is Databricks. And I get the sense that the individuals very passionate in either community are accustomed to their IDE of choice, right? So it was RStudio or Spyder or whatever, and now maybe Positron. But I'm finding myself having to build solutions that are so large that I'm not certain if that's the right home anymore.
So maybe the question can be, you know, what's the continuing value proposition for something like Positron when the landscape continues to change and the scale of things that people are being asked to build could grow, right? So I get it if I'm modeling, if I'm deploying something to predict that IDE experience is very powerful, but for a lot of what I have to do for business intelligence or analytics, they just want to report.
Well, so I don't know how much folks in this call know about Positron, but the idea of Positron was to develop the kind of – you can think of RStudio pioneered what we can think of as the classic four-pane data science layout with the console, the code editor, the variables pane, and the plots pane. And so you can move between. And there's some other components like the connections pane for interacting with databases, the data viewer for being able to look at your data frames and your tabular data. But having each of those concepts as first-class citizens within a development environment is really powerful, and I think RStudio has shown that over the last 15 years with its enduring popularity.
And so the idea with Positron is that we wanted to create that same kind of experience and make it available to Python users, but in a polyglot-first IDE. But it's kind of a difficult problem to build from the ground up something that works equally well in R and Python and supports all of those different components. So we chose to build Positron on top of the open-source VS Code codebase. So we have basically an extensive set of customizations and components that we've built on top of open-source VS Code.
And we've created that kind of hybrid IDE where you can, within the IDE, you can have both an R session and a Python session running. And so you might have one tab that's an R program where you're doing something in R and then another tab that's Python. And whenever you switch contexts, it will change the session, the active session, and that will change the variables that are displayed, the plots, and the different components in that four-pane data science layout.
You can imagine this has been an awful lot of engineering, but I think it reflects kind of how we as a business are thinking about building tools going forward, building things polyglot from the ground up. Similar ideas with Quarto and Connect is also kind of being built as a polyglot application and document publishing platform. So it's going to take a long time before Positron—well, I should qualify, it will take some time before Positron is able to reach the level of polish that RStudio has after 15 years.
But it's an early-stage project. We're really interested in feedback and users. I think it will be available as a preview, kind of in alpha/beta form, in Posit Workbench in the near future; I'm not sure, maybe in the first quarter of next year. And then we're hoping for it to become generally available, production-ready, toward the end of next year or early 2026 in terms of timeline. But we're making a big investment there, supporting not only Python-first teams and people who are just using Positron for Python, who we want to have a great experience, but also teams that have a lot of R users and Python users. We want them to be able to work effectively within the development environment and switch between the R context and the Python context.
Integrating R and Python workflows
I think the goal of both Workbench and Connect is to put R and Python on equal footing. So regardless of whether you're using R or Python, you can develop your scripts in the same editing environment, and you can publish them to the same environment using the same tools. So I think that's really about helping both 100% R and 100% Python teams or even individuals work effectively. And then how do we make sure those teams, the R users and the Python users can collaborate as effectively as possible?
And to me, that's mostly happening on the open source side. That's making sure, on the R side, that we have the nanoparquet package so that you can read and write Parquet files in R, which are a great way of sharing data with your Python-using colleagues. There's the pins package, which we have for both R and Python, that allows you to save data sets and share them with your colleagues really easily. Tools like Vetiver for model monitoring work with R and Python. Great Tables works with both R and Python. So having these tools where you end up with a shared vocabulary, a shared tool set, and shared conventions around storing and using data hopefully eases some of those boundaries a little.
LLMs and AI in data science
I don't know if we're at the point where we have a roadmap, I'd say, but we have a lot of internal skunkworks projects on the go to try and figure out what LLMs mean for data scientists, both for doing data science, whether that's extracting structured data from unstructured text, or for writing the code that you use to do data science. So we're exploring a bunch. Personally, I've been working on the elmer package for R, which makes it easy to use any of a range of LLM providers from R programmatically. That's hopefully going to go to CRAN tomorrow, or maybe the day after, or maybe Friday, failing that.
And then I think we're really thinking about how LLMs and Positron should interact. Is there something special for data scientists? Or is this just kind of a generic programmer software engineering problem, where you can integrate with all the existing tools like Continue or Codeium or any of the 17 other LLM VS Code extensions whose names I forget? So no roadmap, but a lot of interest, a lot of excitement, and a lot of experiments.
Yeah, I mean, I think one of the things that we're looking closely at is how, because I assume that a lot of people on the call have used Copilot or one of the other LLM VS Code extensions like Continue. I personally just recently switched from using Copilot to Continue. You install the Continue extension, and it's basically like an open source Copilot: auto completion, code editing, and chat. And you plug in your own API key from your preferred model. I'm using the Anthropic API, which is very inexpensive to use, but you could also plug in an OpenAI API key. For code completion it uses Mistral: Mistral has a code auto completion model called Codestral, which I think is one of the best low-latency auto completer models out there.
So, I've been learning a lot about that, but I think one of the things that we're looking, taking a close look at is how we can take a general purpose LLM AI assistant that's for software engineering and augment the context with information from the data science environment. So, that's things like the column names in your data frames and kind of other information that we can cheaply glean about the active data science environment to make the prompts and the suggestions from the AI more contextually helpful based on what you're doing.
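To make the idea concrete, here is a hypothetical sketch of gleaning cheap context from a data frame and prepending it to a prompt. None of this is a real Positron API; `describe_frame` and the prompt template are invented for illustration:

```python
import pandas as pd

def describe_frame(name: str, df: pd.DataFrame) -> str:
    """Summarize a data frame's shape and schema as one line of prompt context."""
    cols = ", ".join(f"{c} ({dt})" for c, dt in zip(df.columns, df.dtypes.astype(str)))
    return f"Data frame `{name}` with {len(df)} rows; columns: {cols}"

# A toy frame standing in for whatever is live in the user's session
df = pd.DataFrame({"mpg": [21.0, 22.8], "cyl": [6, 4]})

# Prepend the environment context so the model sees the schema, not just the code
context = describe_frame("df", df)
prompt = f"{context}\n\nUser question: which column holds fuel efficiency?"
```

The point is that the schema summary costs almost nothing to compute but lets the assistant answer questions about columns it would otherwise never see.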
And so, if the LLM is just looking at the code in your code editor, it may lack the context that can be picked up from looking, you know, kind of looking deeply at the data. And so, it's, you know, you can imagine this is a, you know, kind of a pretty significant engineering problem. And, you know, we also, you know, I think we want to make sure that we don't bite off more than we can chew and that we're taking advantage of, you know, what's out there and not, you know, not building too much from scratch ourselves.
I wish I had something thoughtful to add. My head's along similar lines to what I added in the chat. At this point, most of the LLMs we're using are behind an API endpoint. Where I work, we've developed an internal FastAPI service to hook us into every LLM we need. So I'm able to develop with LLMs right in R; I'm not stuck using Python for everything. I'd say the one place I'm still kind of stuck being behind in R is vector databases, RAG, similarity searches, that kind of stuff. So that's still where I'm struggling a little bit to argue to our leadership that it's worth sticking with R when, well, if you want to do RAG, you have to develop it in Python, use Chroma or FAISS or whatever else.
I will say, this is very much on my to-do list for next year: a package for RAG in R. What I currently have to show for it is a name and a logo, but hopefully next year there will be more. I'll put the logo in the chat.
One neat thing, I don't know, I haven't built anything with it personally. I've just seen demos, but there is like a chat widget in Shiny. So if you have a Shiny dashboard or something that you're building and you, you know, there are components that could enable you to add a kind of an LLM chat sidebar to a dashboard or Shiny application. And so I think there's lots of, you know, with a little bit of, you know, legwork, you know, if you're willing to spend the energy to, you know, twiddle with generating prompts from the other metadata in your Shiny dashboard, that could be used to build some pretty interesting things. But I think it's definitely early days in terms of realizing the full implications of all of the different, you know, things we can do with these tools.
How Wes and Hadley use AI
So another question was: are you using AI? And if yes, which part of your way of working could benefit most from AI? Yeah, a lot of my hex logos now are created with AI. I use DALL·E to do a bunch of experimentation to get a sense of what I want, and then I work with one of our designers at Posit to finish it off. But I've found them really useful for brainstorming hex logos.
More usefully for work, I guess I find them super useful when I'm programming in areas that I'm not that familiar with, like doing like web development, some stuff there, just really useful. I can kind of read JavaScript if it's been written for me, but writing it is painfully slow. So having someone like at least get me 90% of the way there makes a huge improvement to my productivity.
Yeah, I think my use of AI is similar to Hadley's. I mentioned that I currently use the Continue VS Code extension, and that's relatively recent. I actually held out on using Copilot for a long time because I felt like, hey, I don't need an AI to help me write code. But then I found myself working on code bases, especially JavaScript and TypeScript code bases, where there were just a lot of concepts and things that I was not familiar with: a lot of library code that I didn't fully understand, as well as a new programming language.
And so I find that it really helps me with the blank slate problem of starting in on something new. In the old days, I would spend an hour Googling things and looking up how to do stuff on Stack Overflow, whereas now I can ask it to insert some template code to get me started. Maybe it's not right the first time, but then I can edit it to my satisfaction. I also find that it's really good at refactoring stuff. I used to manually refactor code, but the other day I started to refactor some code, and as soon as I started, the auto completer realized what I was doing and made all the right suggestions. So rather than having to type out the thing I was doing, I just pressed tab five or six times and it did exactly what I wanted, kind of like magic.
So, you know, I think there's a certain set of places where it really helps: boilerplate, stuff you could do but would have to do some Googling for, and things you're not familiar with. But I've found it's definitely made me a lot more productive. Yeah, I think the place where it's best is the sort of brainstorming stuff. You never really trust it a hundred percent, but it often gets you like 90% of the way there. The other framing I have is that it covers all of those things you used to Google and just kind of blindly trust whatever you get, and if it's a recipe, you've also got to scroll through 15 pages of the person's life story before you actually get to the recipe. It's almost uniformly better for those kinds of tasks, where actually getting a 100% correct answer is not that important and you're really just looking for a starting point.
R and Python ergonomics and ecosystem cohesion
Yeah, let's get back to a conversation we were having earlier around the R and Python relationship and interoperability. The question is mainly around what Emily Riederer calls the ergonomics of the language. I find myself working in R, really used to the tidyverse; it has a very natural kind of workflow for thinking about how you wrangle data. When I go into Python, my brain breaks. I can no longer think about how to manipulate data, and I feel so dumb, you know, when in R I know how smart I am. And I'm wondering, well, I'll back up: there are these community-led initiatives to make Python in some ways feel a bit more like that tidy workflow. I'm wondering if Posit is investing in ways to unify the language ergonomics, so that the switch between the languages is more natural, not just in the IDEs and the tools but in the language APIs themselves.
I mean, Wes may know another aspect, but I think we are doing this a little bit in terms of packages like gt and Great Tables, and plotnine, and I think the dplyr influence on all of those is strong. At the same time, I have to say I find Python packages really strange. And the strangest thing, the thing that I find the weirdest about Python packages, is they don't have logos. They don't have hex logos or hex stickers, and I think that's really weird and disturbing.
Yeah, that's funny. I never really thought about the logos thing until working more with the R community. And, yeah, I think it would be good to have more logos. On the API front, I started a project called Ibis, which is a portable data frame API, very much inspired by dplyr, getting on almost a decade ago. And there's kind of a growing community of people that have built out different back ends for it. It works really well with DuckDB out of the box, and Polars and pandas, and you can use it with BigQuery and Snowflake and Presto and all these things.
And so I think that's been one area that's flown a little bit under the radar, because everybody uses pandas and so they're not thinking as much about the value of a portable data frame API. There's also a project called narwhals, which is similar to Ibis but uses the Polars API as the interface. The way that dbplyr relates to dplyr, making dplyr expressions run on databases, narwhals is the same concept, but for the Polars API.
But I agree that stuff like the Python port of great_tables helps, at least for the data presentation problem, with reducing some of the cognitive dissonance of moving between languages. And there's plotnine, which is basically a Pythonic port of ggplot2: if you're used to ggplot2 and want something that maps semantically, it lets you express things the same way in Python.
But because the Python ecosystem is so big and so federated, it's a little bit more of the Wild West, whereas the R ecosystem is more curated. At least within the greater tidyverse cinematic universe there's more ideological consistency: we have to make all these libraries work well together. That's helped create a cultural cohesion that isn't as present in the Python world, because it's so federated, if that makes sense. Yeah, it's something we talk about at least, and something we're trying to make better, but it's difficult. It's a very big ship that we're trying to steer with a small rudder.
Do you think that R's cohesion is a function of R just being a smaller language in general, with fewer players? As you scale, you kind of inevitably get the volume and noise you're talking about.

I mean, Hadley would probably have a better view, but my outside impression is that it's not just about the language size or the community size; it's more that the developer community is more cohesive, with more dialogue and collaboration.
And so I think there's been more of an active effort, where the tidyverse is kind of this gravity well: it's created a desire within the community to create things that play nicely together and are cohesive, for the benefit of end-to-end productivity. It's like, oh, you want to build a new thing? Well, it needs to work well within this ecosystem of tools, to play nice with people that are using these other 15 packages in their daily workflow.
Yeah, I mean, I think the R community is smaller and feels more centralized. Another interesting difference is between CRAN and PyPI, where CRAN is such a pain if you're a package developer, because you are responsible for making sure your package isn't breaking other packages on CRAN, whereas with PyPI it's really easy to deploy a package, but then it's the user's responsibility to debug package incompatibilities. So the R community is kind of more centralized by nature, and there are lots of good things about that, and obviously lots of good things about being more decentralized and spread out like the Python community.
Building blended R and Python applications
So I see there are a lot of questions about using R and Python together, because I think most teams have both R and Python users now. One was: what is your recommended architecture or approach for creating a blended web-based application?
Blended in the sense of one that uses both R and Python?

Yeah. I'm not sure exactly who asked the question, but I'm imagining there's a team that has both R and Python users, and they're working on a project together where the output is some web app that someone's using to make decisions.
Okay. I guess you have to make a decision about where the front-end portion of the app is built: whether you build the application in R, say using Shiny, or in Python using Shiny, Streamlit, or Dash. They're all supported in Connect, for example, so you can publish with whichever app framework you like. Choose your fighter, basically. And then if you need to cross between the languages: from Python you can call R code within the application using rpy2, and from R you can call Python code with reticulate.
So it's a bit of a choose-your-own-adventure, based on where you want to maintain the application portion of things. In many ways, building the application layer in R is a lot nicer, especially in the publishing experience if you're working within RStudio. I think we're working to achieve the same level of one-click publishing, and to make application development work as smoothly for the Python app frameworks in Positron as it does if you're building an application in R and RStudio.
Maintaining parallel R and Python packages
Even within Posit, you support both R packages and equivalents on the Python side. I was curious, from an engineering perspective, how you're maintaining those and keeping them synchronized, in a way that makes sense to developers and is sustainable going forward. What are the challenges, and what advice do you have for teams that have internal packages in R that also need to exist in Python?
Not really, but I just commented on another thread asking the same question, saying that it might be worthwhile to have Edgar and Scritch come and talk to this meeting about that, since they both have experience developing R and Python packages simultaneously. I think they'd have some good ideas about how you can minimize the amount of copying and pasting you're doing. That's also another place where LLMs seem really useful, or are getting increasingly useful: translating code from one language to another. Certainly not going to be 100% correct, but will it radically speed up translating from one system to another? I suspect so.
Learning resources for R users moving to Python
And, Sam Tyner-Munro?

Yeah, I'm an R user, and I manage some people who use Python. And of course, with LLMs, everybody is using Python more. For those of you with experience going from R to Python, what are your favorite resources for package documentation, tutorials, and other learning materials generally? I saw somebody else in the chat say that Python documentation tends to be quite difficult to learn from and understand compared to R package documentation. So yeah, just wondering what some good resources might be.
I know that Quarto has been helping a lot for Python packages. There's a tool called quartodoc, and a number of Python projects have moved over to building their websites, this hybrid of documentation, blogs, and tutorials where they publish content for their packages, with Quarto. So if you have projects that are doing documentation for both R stuff and Python stuff, that definitely helps with having fewer tools to master for authoring and publishing content.
There are many different documentation frameworks; in Python there's not only quartodoc, but also Sphinx and MkDocs and a number of others. But I've been advising most people to move their code blogs and anything code-related over to Quarto, just because of the benefits of working within a polyglot framework.
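For concreteness, quartodoc is configured from the site's `_quarto.yml`; a minimal sketch might look roughly like this, where the package and function names are placeholders:

```yaml
project:
  type: website

quartodoc:
  package: mypkg          # the installed Python package to document
  sections:
    - title: Data wrangling
      desc: Functions for reshaping data.
      contents:
        - clean_names     # docstrings rendered to Quarto pages
        - summarize_by
```

Running `quartodoc build` then generates the API reference pages, which sit alongside any other Quarto content (blog posts, tutorials) in the same site.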
Yeah, I think the surprising thing to me is the other thing Python lacks that R has pretty deeply ingrained: the idea of vignettes. In Python there's generally good function and method documentation, but the idea of having these longer-form documents which explain things is just less baked into the packaging system than it has been in R. I think that's partly because R always had this very strong academic connection, where, at least historically, if you're writing an R package, you're also going to write a paper about it, so you want some way to include that long-form, detailed technical discussion.
Big data trends: DuckDB, Polars, and single-node computing
I mean, I think these projects will continue to get better and attract more add-on third-party library integrations, and just become more integrated and useful within the mainstream programming environment, both in Python and R. Posit has a relationship with DuckDB Labs and has worked with them on integrating DuckDB with dplyr, and we've made similar investments in making DuckDB work better with Python.
So, yeah, my hope is that Polars and DuckDB just become more and more popular and widely used, and that we don't end up with 10 different projects solving similar problems. But it is true that these tools have really opened things up, especially on big servers with lots of cores and lots of RAM. What you can do on a single machine now is pretty impressive, and there are fewer and fewer cases where you need to spin up a distributed cluster and use Spark.
So having the ability to work on a single node, or, if you're working in the cloud, to say "I'm working on a big problem, I'll spin up a large instance and do it with DuckDB," that's amazing. You don't have to go through the complexity of porting a workflow to Spark and running it on EMR or whatnot.
Yeah, I think the trend that's really interesting to me is the unbundling of data storage and compute. When you look at MySQL and Postgres, they each have a completely different way of storing data on disk. But now you can have a directory of Parquet files and use it with DuckDB, with Arrow, with Polars, with Athena. That idea that you can compute on your data with lots of different engines is really powerful. And tools like DuckDB and Polars are bringing that to your laptop, so you can work with gigabytes of data pretty easily. It gives you a much higher speed of iteration than working with a tool like Spark, where you have to create a job, it gets spread over a bunch of computers, and the results all get aggregated back together and sent to you. So I think it's a really, really interesting trend.
Data heroes and inspirations
I don't know about data heroes, but I'd say JJ is definitely one of my programming heroes. There are many things that are incredible about JJ, but the thing I find most incredible is his willingness to move from an area where he's an expert and knows it down pat to a new area where he knows nothing. That's incredibly brave, and an incredible skill: to be able to say, I've mastered this, and now I'm going to go over here and become a novice again. I have a huge amount of admiration for that.
Yeah, I've become a big fanboy of the DuckDB creators, Mark and Hannes. Rachel mentioned there's a Data Science Hangout on Thursday with Hannes, who's one of the creators of DuckDB. In addition to being just super nice people, occasionally a bit sarcastic, with very dry humor, they're really collaborative. I've been impressed with their willingness to take on any problem and say: we're unfazed, we're going to make DuckDB work on this, we're going to make DuckDB do this or do that.
It shows that small teams of people who are really, really focused and motivated can accomplish great things.
And in general, I've been really excited to see more collaboration with the traditionally pretty crusty, stodgy, uncollaborative database community. It's a pretty insular community with mostly commercial software projects. So to have not only an open-source database project, but one that collaborates with folks like Hadley and me with open arms, has been really exciting; it's opened up a lot and, I think, made the future a lot more interesting.
What's exciting for the year ahead
I will say, echoing what Wes said, another project I think is incredible is Typst, which is this recreation of LaTeX, of everything LaTeX could do in terms of creating high-quality PDF publications. That's something I just assumed I would never see, a replacement for LaTeX in my lifetime, because it was too big and too complex. And the Typst development team, which I think is only a couple of people, just dived into it and produced a system that's incredibly good and incredibly fast. I think that's just so cool.
Yeah, for me, I'm excited about all the stuff we're building in Positron and getting it into the hands of more users. On the AI front, I'm excited for some of the hype to settle down, so there aren't 10 new AI tools to look at every week. With a little more consolidation, folks like us who are just trying to make people more productive can focus on fewer things, and just have the goal be making people more productive. However many different LLMs are out there, more consolidation and fewer things for us to look at integrating with will, I think, be a good thing.
Thank you both so much; I really appreciate you taking the time to join us, Wes and Hadley. I know the Hangouts normally go by really quickly, but this hour has been the fastest of all time. Thank you all for the great questions and for taking the time to join us as well. As a reminder, if you had fun today and want to find out about other Posit events, I'm going to share a link where you can subscribe for updates. I'd also love to hear your thoughts on today's session, so I'm going to share a three-question survey in the chat, just in case it doesn't pop up at the end of the meeting for you. And if there were questions we didn't get to answer, I would love to hear from you.


