Set up Databricks personal compute cluster with RStudio

The following instructions set up an ODBC connection between your laptop and your DataBricks cluster, which can then be used in R/RStudio to query data using an ODBC based package or sparklyr. Personal clusters are able to run SQL, R, python and scala. They can be used within the DataBricks environment, or through R studio and can be set up yourself if you don’t have access to a SQL warehouse or shared cluster.


Pre-requisites


You must have:

  • Access to Databricks
  • Access to a personal cluster on DataBricks
  • R and RStudio downloaded

Downloading an ODBC driver


  1. Install the ‘Simba Spark ODBC’ driver from the software centre.

    1. Open the Software Centre via the start menu.

    2. In the ‘Applications’ tab, click ‘Simba Spark ODBC Driver 64-bit’.

    3. Click install.

  2. Get connection details for the cluster from Databricks. To set up the connection you will need a few details from your cluster within DataBricks.

    1. Login to Databricks

    2. Click on the ‘Compute’ tab in the sidebar.

    3. Click on the name of the cluster you want to connect to, and click the ‘Advanced options’ at the bottom of the cluster page.

    4. Click the ‘JDBC/ODBC’ tab under ‘Advanced options’

    5. Make a note of the ‘Server hostname’, ‘Port’, and ‘HTTP Path’.

  3. Get a personal access token from Databricks for authentication.

    1. In Databricks, click on your email address in the top right corner, then click ‘User settings’.

    2. Go to the ‘Developer’ tab in the side bar. Next to ‘Access tokens’, click the ‘Manage’ button.

    3. Click the ‘Generate new token’ button.

    4. Name the token, then click ‘Generate’. Note that access tokens will only last as long as the value for the ‘Lifetime (days)’ field. After this period the token will expire, and you will need to create a new one to re-authenticate.

    5. Make a note of the ‘Databricks access token’ it has given you. It is important to copy this somewhere as you will not be able to see it through Databricks again.

  4. Setup ODBC connection from your laptop. We now have all the information we need to setup a connection between our laptop and DataBricks.

    1. In the start menu, search ‘ODBC’ and open ‘ODBC Data Sources (64-bit)’.

    2. On the ‘User DSN’ tab click the ‘Add…’ button.

    3. In the ‘Create New Data Source’ window, select ‘Simba Spark ODBC Driver’ and click ‘Finish’.

    4. In the ‘Simba Spark ODBC Driver DSN Setup’ window,

      1. Enter a ‘Data Source Name’ and ‘Description’. Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13.
      2. Set the remaning options to the settings below.
        • Enter the ‘Server Hostname’ for your cluster in the ‘Host(s):’ field (you noted this down in step 2).
        • In the Port section, remove the default number and use the Port number you noted in step 2.
        • Set the Authentication Mechanism to ‘User Name and Password’.
        • Enter the word ‘token’ into the ‘User Name:’ field, then enter your ‘Databricks access token’ in the ‘Password:’ field.
        • Change the Thrift Transport option to HTTP.
        • Click the ‘HTTP Options…’ button and enter the ‘HTTP Path’ of your Databricks cluster, then click ‘Okay’.
        • Click the ‘SSL Options…’ button and tick the ‘Enable SSL’ box, then click the ‘OK’ button.
      3. Click the ‘Test’ button to verify the connection has worked. You should see the following message. If you get an error here, repeat steps 5.e.i – 5.e.ix again and ensure all the values are correct.

      1. Click the ‘OK’ button to exit the ‘Test Results’ window, then the ‘OK’ button in the ‘Simba Spark ODBC Driver DSN Setup’ window.
  5. Connect through RStudio. Watch the below video and view the ADA_RStudio_connect GitHub repo for methods on connecting to Databricks and querying data from RStudio.


Pulling data into R studio from Databricks


Once you have set up an ODBC connection as detailed above, you can then use that connection to pull data directly from Databricks into R Studio. Charlotte recorded a video demonstrating two possible methods of how to do this. The recording is embedded below:

A template of all of the code used in the above video can be found in the ADA_RStudio_connect GitHub repo.

Key takeaways from the video and example code:

  • The main change here compared to connecting to SQL databases is the connection method. The installation and setup of the ODBC driver are all done pre-code, and the only part of the code that will need updating is your connection (usually your con variable).
  • If your existing code was pulling in tables from SQL via the RODBC package or the dbplyr package, then this code should in theory run with minimal edits needed.
  • If you were writing tables back into SQL from R, this is where your code may need the most edits.
  • If your code is stored in a repo where multiple analysts contribute to and run the code, in order for the code to run for everyone you will all need to individually install the ODBC driver and give it the same name so that when the con variable is called, the name used in the code matches everyone’s individual driver and runs for everyone. If this is the case, please add a note about this to your repo’s readme file to help your future colleagues.
Back to top