How to execute R Scripts using Azure Batch Services and Azure Data Factory?

Aditya Kaushal
5 min read · May 3, 2021

Microsoft’s Azure is one of the biggest cloud platforms, offering solutions across a wide range of services. While building a solution using Azure Data Factory and Azure Batch, our team faced problems executing R/Python scripts with Microsoft’s Azure Batch service.

Objective:

The objective of our use case was to execute Python/R scripts through the Custom Activity in Azure Data Factory, with delimited CSV and Excel files from Azure Data Lake Storage acting as the input files.

So prior to executing the Python/R scripts, we needed to create an Azure Active Directory app responsible for authenticating our scripts against Azure Data Lake Storage, so that they could use the files stored there during execution.
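As a minimal sketch of that authentication step, an R script could use the AzureAuth and AzureStor CRAN packages roughly as follows. The tenant, app, storage account, and file names are all placeholders, not values from our setup:

```r
library(AzureAuth)
library(AzureStor)

# Authenticate as the AAD app (service principal) using the
# client-credentials flow; all IDs below are placeholders.
token <- get_azure_token(
  resource  = "https://storage.azure.com/",
  tenant    = "<tenant-id>",
  app       = "<app-client-id>",
  password  = "<app-client-secret>",
  auth_type = "client_credentials"
)

# Connect to the Data Lake Storage Gen2 endpoint and pull down an input file
endp <- adls_endpoint("https://<storage-account>.dfs.core.windows.net", token = token)
fs   <- adls_filesystem(endp, "<filesystem-name>")
download_adls_file(fs, "inputs/data.csv", "data.csv")
```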

We needed the computing power of Azure Batch to execute the R/Python scripts, so we decided to use the DSVM (Data Science Virtual Machine) image (Windows Server 2019) available to developers in the Azure Marketplace.

By default, according to Microsoft’s documentation, the DSVM image supports Python and CRAN R.

Purpose of the Blog:

The purpose of this blog is to help developers, data scientists, and especially data engineers walk through the problems they might face while using Azure Batch, Azure Data Factory, and Azure Data Lake together. It describes how we solved the issues we hit when executing R scripts with the DSVM Windows Server 2019 image.

Problem:

Executing R scripts using the DSVM image threw an error. The error message was “Command Program was not found.”

Initially, we thought there might be a problem with the path of the R script, and that Azure Batch simply could not find the file. So we checked and made sure that the R script stored inside Blob Storage had a path matching the one given in the Azure Data Factory Batch settings.

After making sure the path of the R script was correct, we executed the pipeline again. Surprisingly, we hit the same issue and the same error message. This was a unique challenge, as we had followed everything mentioned in Microsoft’s official documentation.

We set the command to Rscript <r-script>.R, executed the pipeline, and got the same error message every time we checked.

So, to understand the problem, we turned to a simple web search for the meaning of the error message we were dealing with.

Microsoft did not have much information on the error message “Command Program was not found”, and nobody in the developer community seemed to have faced this issue before; neither developer blogs nor Microsoft’s documentation mentioned it.

So, we then turned to Microsoft’s customer support, and together we fixed the issue and were able to execute the R scripts.

Diagnosis:

The nodes of the Azure Batch pool created from the DSVM image did not have the proper environment variables set in the system settings.

We used RDP (Remote Desktop Protocol) to connect to each node we were using and saw that the command prompt could not recognize the Rscript command.
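You can reproduce this check yourself from the node’s command prompt; the output below is illustrative of what a missing PATH entry looks like:

```
C:\> where Rscript
INFO: Could not find files for the given pattern(s).

C:\> Rscript --version
'Rscript' is not recognized as an internal or external command,
operable program or batch file.
```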

This explained the error message “Command Program could not be found”. When Azure Data Factory sent the command Rscript <r-script>.R to the Azure Batch service, the command was passed on to the command prompt of the DSVM Windows Server 2019 node. Because the R executable was not on the PATH, the command prompt could not recognize the command and therefore could not run the R script.

Solution:

Set the environment variable and prepend cmd /c to the command Rscript <r-file>.R.

The solution to the problem was to add the R installation to the environment variables in the node’s OS, letting the command prompt know the location of the R executable.

The R executable is typically located under C:\Program Files\R (or C:\Program Files (x86) for a 32-bit installation). Locate the R version folder, copy the path of its bin folder, and append that path to the PATH list in the system environment variable settings.
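For example, this can be done from an elevated command prompt on the node. The R version folder below is a placeholder, so check the actual install path first, and note that setx only affects newly opened sessions:

```
REM Append R's bin directory to the machine-wide PATH (run as Administrator).
REM R-4.0.5 is a placeholder - check the actual folder under C:\Program Files\R.
setx /M PATH "%PATH%;C:\Program Files\R\R-4.0.5\bin\x64"
```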

These steps let the command prompt know where the R executable is located.

After finishing the above steps, we have to add cmd /c before Rscript <r-file>.R. The cmd /c prefix tells Azure Data Factory that the command must be executed by the command prompt on the node.
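The Command field of the Custom Activity then looks like this, where main.R stands in for your actual script name:

```
cmd /c Rscript main.R
```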

By following all the steps above, we were able to execute the R scripts.

Further, while executing the R scripts, we also faced some problems installing the libraries the scripts required.

The solution to this problem was again to use RDP (Remote Desktop Protocol) to connect to the nodes, launch the command prompt with Run as Administrator, start R, and install all the required libraries using install.packages(“library_name”).
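From an elevated command prompt, this can also be done non-interactively with Rscript; data.table below is just a stand-in for whichever packages your script needs:

```
REM Run as Administrator so packages install into R's site library.
REM 'data.table' is a placeholder for the libraries your script requires.
Rscript -e "install.packages('data.table', repos = 'https://cran.r-project.org')"
```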
