Why We Need an AI Coding Assistant Designed for Data Scientists | by Daniiar Abdiev | Oct, 2022

Copilot is great but not good enough for data science, doesn’t have proper notebook support, SQL sucks, no data awareness, and doesn’t take cell outputs into account. That’s why we decided to build CodeSquire.ai – an AI code-writing assistant specially designed for data professionals.


Data science is one of the most sought-after jobs today. Companies are looking for people who can help them make sense of the vast amounts of data they have. And as more and more data is generated, the demand for data scientists only increases.

Due to the ever-increasing need to process data, thousands of companies are looking for ways to streamline the work of data scientists, from data extraction companies like Stitch and Fivetran to MLOps companies like Weights & Biases.

One of the biggest parts of Data Science is simply writing code.

We are constantly writing code to clean, transform, visualize and model data. Naturally, it would be great if we had a code-writing assistant to help us write code.

There are certainly great AI code-writing assistants like GitHub's Copilot. But they don't fit the needs of data scientists. We need an AI code-writing assistant that understands the field of data science.

So what features should an AI code-writing assistant for data scientists have?

  • Jupyter Notebook and JupyterLab — As of this writing, GitHub Copilot only supports VS Code, JetBrains, and Atom IDEs. While plenty of data scientists use VS Code and PyCharm, even more use Jupyter Notebook and JupyterLab in their everyday work. The notebook interface is perfect for data exploration and iterating on data problems.
  • Databricks Notebooks — The Databricks ecosystem is growing very fast. More and more companies are turning to Databricks for data lakehouse solutions, and more and more ML engineers use Databricks' MLflow for their MLOps workflows. As more businesses adopt Databricks, the number of users of Databricks Notebooks (Jupyter-like notebooks running on a Databricks cluster) grows. I've personally interviewed several people who have expressed interest in CodeSquire.ai support for Databricks Notebooks.
  • Google Colab — I love Google Colab! Every time a cool new model drops on Hugging Face or Papers with Code, I try it out quickly in Google Colab. Often this quick start turns into a full project. Support for Google Colab is required; lots of data scientists love those free GPUs.
  • RStudio — R is a very popular language in data science. In fact, it is second only to Python. So supporting RStudio, the most popular IDE for R, is a must.

CodeSquire.ai is already available in Google Colab and will launch in VS Code, Jupyter Notebook, and JupyterLab in three weeks.

As can be seen from the above screenshot, GitHub Copilot does not take cell output into account when generating code. And while the fix would be fairly straightforward in the example above, one can easily see how taking notebook cell output into account matters in a real project.

Data science work is an iterative process with lots of small experiments. Notebook cell outputs often guide our decisions about what to explore next. It is therefore extremely important to provide notebook cell output as context to guide the AI toward better, more accurate suggestions.
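To make the point concrete, here is a minimal sketch of two hypothetical notebook cells (the CSV data is made up for illustration). The output of the first cell reveals that the numeric column was parsed as strings, and that output is exactly what tells us what the second cell should do:

```python
import csv
import io

# Cell 1: peek at the raw data.
raw = "date,revenue\n2022-01-01,1200\n2022-01-02,1350\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # the output shows revenue came in as the string '1200', not a number

# Cell 2: the output above dictates the next step -- cast before aggregating.
total = sum(int(r["revenue"]) for r in rows)
print(total)
```

An assistant that only sees the source of Cell 1, but not its printed output, has no way of knowing that the cast in Cell 2 is needed.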

Also, as can be seen from the above screenshot, current AI code-writing assistants do not take Markdown into account.

Data scientists use Markdown a lot: to separate data processing into logical steps and to tell people reviewing the notebook what they are looking at. It is a great way to help your notebook's audience understand the code.

So an AI coding assistant for data science should take Markdown into account when making code suggestions.

So far, SQL code generation with Copilot is mostly useless. You need to provide a complete table schema to get relevant output, and most queries involve more than one table. That means specifying the schemas of multiple tables in comments just to get the right generation, which can take as long as writing the SQL query from scratch.

The ideal AI coding assistant for data scientists will prioritize proper SQL querying support.

Data scientists often use Pandas. Pandas is great but not best suited for big data ETL operations, so SQL, Dask, Modin, or Spark are better choices for large production data loads. Data scientists often find themselves converting Pandas to SQL, or searching for Dask or Spark methods similar to the Pandas ones they know.

A feature that would allow data scientists to translate Pandas into more performant alternatives with the click of a button would be of great help.
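As a toy sketch of what such a translation feature might look like (the dictionary, function name, and idioms below are all hypothetical; a real translator would parse Pandas expressions rather than match strings):

```python
# Toy lookup table mapping a few common Pandas idioms to SQL equivalents.
PANDAS_TO_SQL = {
    'df.groupby("city")["price"].mean()':
        "SELECT city, AVG(price) FROM df GROUP BY city",
    'df[df["price"] > 100]':
        "SELECT * FROM df WHERE price > 100",
    'df.sort_values("price", ascending=False).head(10)':
        "SELECT * FROM df ORDER BY price DESC LIMIT 10",
}

def translate(pandas_expr: str) -> str:
    """Return the SQL equivalent of a known Pandas idiom."""
    return PANDAS_TO_SQL.get(pandas_expr, "-- no translation known")

print(translate('df[df["price"] > 100]'))
```

The real value of a one-click translator is handling arbitrary expressions, not a fixed lookup, but the mapping above captures the shape of the problem.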

Every data science project requires a lot of data exploration.

So having an AI assistant that can not only write code but also recommend next steps would be amazing!

For example, after computing the mean of some columns, the AI may recommend checking the median and looking for outliers with a box plot. If you're building a sentiment-detection pipeline, the AI assistant might recommend a state-of-the-art algorithm you haven't heard of yet, or, depending on your use case, one that's less accurate but more efficient.
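The mean/median/outliers suggestion above can be sketched in a few lines of standard-library Python (the sample values are made up; the 1.5 × IQR cutoff is the same rule a box plot uses to flag outliers):

```python
import statistics

values = [12, 14, 13, 15, 14, 13, 98]  # one obvious outlier

mean = statistics.fmean(values)
median = statistics.median(values)

# Quartiles and the 1.5 * IQR fences, as drawn by a box plot.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(f"mean={mean:.1f} median={median} outliers={outliers}")
```

The mean (~25.6) is pulled far from the median (14) by a single point, which is exactly the situation where an assistant suggesting "check the median and the box plot" saves the user from a misleading summary statistic.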

Creating an AI assistant that satisfies the above features requires a data science domain focus. That’s why we at CodeSquire.ai decided to build an AI coding assistant specifically designed for data professionals from scratch.

All of these ideas and more are being developed at CodeSquire.ai.

So if you want to try an AI coding assistant developed by data scientists for data scientists, you can sign up to use it today.
