Visual Analytics as a Service (Visualized Apache Spark Design Tool)

A drag-and-drop tool for creating scripts for Apache Spark.
Date: Dec. 2017
Role: Frontend (100%) Backend (80%)
Features: ML Automation, ML Visualization, Drag-and-Drop, File Upload, Dataset Modification, ML Container over GCP
Abstract
The engineering industry, and the computer engineering industry in particular, has always valued data. In earlier days, data was collected through surveys, forms, bulk experiments, and other cumbersome methods. As the computer industry evolved, the sources of data grew rapidly, and data came to be used to solve business intelligence problems, becoming a primary business strategy for almost all companies, organizations, and individuals. Today, nearly every company applies machine learning algorithms, data wrangling, and the insights derived from both to solve problems that humans cannot solve in a realistic time frame or with realistic effort.
With the growth of data, analytics is growing rapidly as well. With current systems, the results of analytics and machine learning are not easy to visualize without writing code. Different types of data and the different inferences drawn from them make it challenging to visualize and make sense of the plethora of information we now possess. One needs a solid understanding of which machine learning algorithms to apply and of the advantages of one over another. Code must be written to run the algorithms, view and compare their results, fine-tune them, and in general ensure the effectiveness of the processes used in a business production environment. This takes significant time and requires prior software and coding skills, so highly skilled developers are needed to leverage analytic solutions or machine learning models.
In the proposed system, the user works with a web-based visual data analytics and machine learning procedure designer through an interactive interface. Users simply drag and drop the objects or components needed for a procedure, such as inputs, outputs, or popular machine learning algorithms. Regardless of the underlying library, the proposed back-end system runs optimized services and provides independent URLs. With this visual interaction, results are displayed dynamically without requiring any code from the user.
Introduction
The architectural design is conceived as shown below. Communication between the backend and the frontend is done through WebSockets: via Node.js and socket.io on the frontend, and via the WebSocket (WS) support in Akka HTTP on the backend.
In the GUI, when the user's drag-and-drop design is completed and the machine learning logic takes shape, it is converted into a script upon execution. The transformed logic is packetized and sent to the backend server via socket.io in JSON format, which returns the result.

The server is mainly composed of Akka HTTP and Akka Actor. All Akka components run on the Play Framework, which can be removed later if it is not needed. Akka HTTP connects all the layers, and our program uses it to provide WebSockets. All WebSocket requests are passed to the Akka Actor, which acts as a message queue within our program. The Akka Actor calls the corresponding handler sequentially based on the state and behavior of the request. For example, when the user triggers Execute at the GUI level, the JSON file of the script is transmitted, and Akka HTTP forwards a request with ID = EXECUTE_ML carrying the JSON to the Akka Actor.
The Akka Actor invokes the corresponding handler through the ID. In our program, the ML Manager module is called to analyze the actual user request through the body and header of the request and to select the appropriate machine learning library.
The corresponding machine learning libraries are all Dockerized, Kubernetes exposes them as a single IP through a service, and scaling is managed automatically for each library worker.
The status of each worker is checked periodically by the Akka Actor through the batch system, and confirmed work is delivered to the GUI (socket.io) periodically via WS. Through the corresponding response handler, socket.io lets the user check the status and results in real time through a URL.
Architecture Subsystems
There are two deliverables in our program, and the architecture for each is as follows.
1) Visual Analytics GUI web program
The program consists of Node.js and AngularJS and follows the MVVM architecture of AngularJS. The program itself consists of services, controllers, factories, and views, and manages the routing of each service in the main script. The service layer manages sessions and web storage and sends and receives Ajax requests to the main app. The controllers manage the view models that are two-way bound directly and bind the logic that handles user actions (e.g., drag and drop). We manage all the endpoints in a single file through our own RESTful factory, and routing is split between the front end (AngularJS) and the back end (Express in Node.js). For example, example.com/dashboard is managed directly by AngularJS, while a route like example.com/api/:user/:input (where the colon denotes a URL parameter) is handled through Express in Node.js. At the back end, the database is managed through MongoDB, and user login sessions are cached and managed through Redis. In addition, if push notifications are needed beyond Ajax, a channel can be created through socket.io to enable real-time processing.
2) Docker-based cloud backend service
To implement the logic we created in (1), we build a back-end service using an Actor model based on the Play Framework, Scala, and Akka. We use a batch-processing big data model and design three Akka services:
A. A service that hooks the user's stored analytics and ML logic.
B. A service that runs the logic retrieved in A.
C. A service that periodically sends the result of B's processing and the hashed URL back to the GUI program of (1).
As the basic program scenario, once a visual analytics design is completed, it is replicated to the master MongoDB database and polled every 5 seconds by the Akka-based batch processing service; the cloud service then runs it sequentially using the customized Spark and TensorFlow libraries. These run in containers, with CPU share and auto-scaling calculated automatically through Kubernetes. The machine learning logic is made accessible to the user via example.com/{hash value}. When the corresponding URL is generated, the GUI program of (1) is automatically notified through a WebSocket by service C. The user can access the job through the URL of the corresponding hash value and check the processing state (preparation, running, error, completion). When processing is completed, the user additionally receives the result URL.
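The batch service itself is specified above in terms of Akka and Scala; purely as an illustration of the hashed result URL and the job states it exposes, a minimal Python sketch might look like the following (the in-memory jobs table and the base URL are assumptions, and the real system keeps this state in MongoDB):
import hashlib
import time

BASE_URL = "https://example.com"   # illustrative; the report exposes example.com/{hash value}
STATES = ("preparation", "running", "error", "completion")

# Hypothetical in-memory job table; the real system keeps this state in MongoDB.
jobs = {}

def submit(user_id, script_json):
    """Register a submitted script and return the hashed URL the user will poll."""
    digest = hashlib.sha1(f"{user_id}:{script_json}:{time.time()}".encode()).hexdigest()[:16]
    url = f"{BASE_URL}/{digest}"
    jobs[url] = {"state": "preparation", "result": None}
    return url

def update(url, state, result=None):
    """Move a job through preparation -> running -> completion (or error)."""
    assert state in STATES
    jobs[url].update(state=state, result=result)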
Chapter 1. Technology Descriptions
Drag and Drop
We have implemented the drag-and-drop feature in the UI for file uploading using DropzoneJS. DropzoneJS is an open-source library that provides drag-and-drop file uploads with image previews. When the user wants to upload a dataset (in .csv or a similar format), he or she can either drag and drop the files into the predefined area or select and upload them. We used the Node.js-with-Express version of DropzoneJS for the implementation.
Authentication
There are two types of authentication systems: 1) Facebook Login: after registering our application as a third-party application on Facebook, users are allowed to log in with their Facebook account. 2) Database Login: users can sign up with a simple form and log in with their e-mail and password.
File Upload
The user needs to upload the dataset as a file to his or her account before using it as an input to any process chain. The uploaded file from the web client goes to the file storage of the Python server, where it is stored on the HDFS store. The mapping of the filename and root path to the file ID is stored in the database. Whenever the user processes a dataset with a node chain (process chain), a new dataset/result is generated and saved into a separate file with a "processed" suffix added to the original filename. This creates a separate database entry identifying the result as its own usable or downloadable file.
Our system shows a preview of the file size when the user finishes the drag and drop. After uploading, the user can check the file name in the sidebar and use it as an input file by dragging and dropping it.
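A minimal sketch of the upload bookkeeping described above, assuming a hypothetical datasets table and illustrative paths (the real service stores files on the HDFS store and keeps the ID-to-path mapping in the database):
import os
import sqlite3   # stand-in for the project's actual DBMS
import uuid

UPLOAD_ROOT = "/data/uploads"    # illustrative root path

def save_upload(db, user_id, filename, payload):
    """Store an uploaded dataset and record its name/path against a new dataset ID."""
    dataset_id = uuid.uuid4().hex
    path = os.path.join(UPLOAD_ROOT, dataset_id + "_" + filename)
    with open(path, "wb") as f:
        f.write(payload)
    db.execute("INSERT INTO datasets (dataset_id, user_id, name, path) VALUES (?, ?, ?, ?)",
               (dataset_id, user_id, filename, path))
    db.commit()
    return dataset_id

def processed_name(original):
    """Derive the file name of the result produced by running a node chain."""
    base, ext = os.path.splitext(original)
    return base + "_processed" + ext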
File Download
File download is handled by a simple REST request to the Python backend services. A list of all available datasets – whether an original dataset or a dataset processed by a node chain in the past – is shown to the user in his or her account, each with a link to download the file. This link downloads the dataset as a file on the user's machine. It issues a REST call that sends the file as a file object, forcing the file to be downloaded rather than shown or previewed in the browser.
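The download endpoint could be sketched as follows. Flask and the DATASET_PATHS lookup are assumptions used for illustration (the report only specifies a lightweight Python web server); send_file(..., as_attachment=True) is the standard Flask way to force a download rather than an inline preview:
from flask import Flask, abort, send_file

app = Flask(__name__)

# Hypothetical mapping; the real service resolves dataset IDs through the database.
DATASET_PATHS = {"demo": "/data/uploads/demo_processed.csv"}

@app.route("/download/<dataset_id>")
def download(dataset_id):
    path = DATASET_PATHS.get(dataset_id)
    if path is None:
        abort(404)
    # as_attachment=True forces the browser to download the file instead of previewing it
    return send_file(path, as_attachment=True)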
Backend Routing
Our program supports sending the user's data wrangling and machine learning logic to the backend system along with the input file. The corresponding backend is therefore split into several routes, so that the HTTP communication can actually send and receive files and run the machine learning logic. The two backend routing parts take care of two different sets of operations.
One is the Node.js routing module which, being closer to the client side, takes care of routing the HTTP REST requests from the web client. This part receives the node chain and other inputs from the client and ensures that the right request reaches the right backend service. It is also responsible for communicating with the database and providing business logic in cases where data processing and interaction with the analytics frameworks are not needed; these are mostly tasks involving CRUD operations against the database.
The other part is the Python services, which primarily take care of data processing and node logic execution. This code handles the data processing logic that the node chain specifies in the input. The module interacts with the data analytics and machine learning frameworks installed in the backend and provides modular execution of the nodes needed for the user's logic. It also provides a way to save the user's logic as a node of its own and reuse it in the future. To use the map, reduce, and filter nodes with the other built-in nodes, the user is prompted for the code to be used with map, reduce, and filter as an input parameter. For the other built-in nodes, the parameters are just a way to control and slightly modify the behavior of the nodes; they are expected as a JSON object of all parameters. They dictate the processing logic of the node and produce the desired output for any built-in or custom node the user uses, as sketched below.
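A minimal sketch of this parameter handling, with hypothetical helper names (the actual Python services are not shown in the report, and the real system loads the user's code as a module rather than evaluating it directly):
import re

CODE_NODES = {"map", "reduce", "filter"}   # nodes whose parameter is the user's code itself

def user_code_to_callable(code):
    """Hypothetical helper: turn a user-supplied expression string into a callable.
    Shown with eval() for brevity only; the real service writes the code to a temporary module."""
    return eval(code)

def apply_node(node, rdd):
    """Apply one node description from the chain to an RDD-like object."""
    name = node["node"]
    if name in CODE_NODES:
        fn = user_code_to_callable(node["logic"])
        return getattr(rdd, name)(fn)          # e.g. rdd.map(fn) or rdd.filter(fn)
    # Other built-in nodes take a JSON dict of parameters that only tweaks their behavior.
    params = node.get("params", {})
    if name == "extractUsingRegex":            # illustrative built-in node from the sample JSON
        pattern, column = re.compile(params["regex"]), int(params["column"])
        return rdd.map(lambda row: pattern.findall(row[column]))
    raise ValueError("unknown node type: " + name)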
Database Connections
Our program uses a DBMS for user session management and for storing references to processed input files. At this point, MySQL is used to store user information and the user's custom-designed logic. Both backend modules connect to the database, but for different purposes. The routing module's work is, for the most part, to perform CRUD operations on the application's different entities, so it makes heavy use of the database connection. The Python services, on the other hand, process the data and generate the output dataset by executing the logic on the input data. They need the database connection only to look up the file path and name mapping by file ID and to add new entries to the dataset entity when a new processed dataset is generated. This module therefore has very limited database connectivity, and most of its work is heavy processing independent of the database itself, as the small sketch below illustrates.
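A small sketch of that limited access pattern, using sqlite3 as a stand-in for MySQL and a hypothetical datasets table:
import sqlite3

def resolve_dataset(db, dataset_id):
    """Look up the stored file name and path for a dataset ID; essentially the only query
    the processing services need before running a node chain."""
    row = db.execute("SELECT name, path FROM datasets WHERE dataset_id = ?",
                     (dataset_id,)).fetchone()
    if row is None:
        raise KeyError("unknown dataset: " + str(dataset_id))
    return {"name": row[0], "path": row[1]}

# Illustrative connection; the project itself connects to MySQL.
db = sqlite3.connect(":memory:")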
Distributed Processing Frameworks
The frameworks we decided to use as the data analytics and machine learning distributed processing core are Apache Spark and TensorFlow, the most popular options for their respective tasks. Apache Spark, for distributed processing of data for analysis and for Extract, Transform, and Load operations, is an industry standard that is widely used, deployed, and researched. For deploying flexible and robust machine learning modules, TensorFlow from Google is a well-known product that provides a great deal of freedom in designing machine learning models. These frameworks are installed and configured into an environment image that is used in the cloud as the base machine image for new elastic instances.
Project Design

The main screen starts with the login. Our program preserves each user's independent data and therefore requires a login. It also offers a Facebook login so users can log in conveniently.

Users can also sign up by providing simple information. All fields must be filled in, and IDs are automatically checked for duplicates.

The screen above is the main screen of our program. There is a menu on the left and a canvas where the user can drag and drop. After login, simple user information is displayed in the upper left corner, with buttons for uploading, downloading, and checking results below it.

Every data flow the user previously drew is automatically loaded. While the user drags and drops, every node, flow, and its data is automatically saved in both the database and the web browser.
The Upload button lets the user navigate to the upload screen. Our program uses an upload library called Dropzone.js. Files can also be dragged and dropped, and the file size and upload status are shown while the user uploads a file. At present, users can only upload CSV and PSV files; XML and XLS support will be added later.
There are two buttons for upload function:
1) Upload: upload multiple files to a temporary folder.
2) Next: once the upload is done, the user can see a preview of the file and add, modify, or remove each row as needed.

The user can add a custom row via the [Add New Row] button and fill it in. The user can also modify or remove an existing row using the edit and remove buttons located on the far right.


Drag and drop is the most important feature of our program, and the user can drag the necessary resources from the menu on the left. The user-uploaded file appears below the Input menu. In addition, the user can specify the format of the output file (JSON, CSV, XML, etc.), along with machine learning functions such as Filter, Reduce, and Map.

The user can define the connections through the circles below the dragged node. The circles below a node define its outputs, and an output can only be connected to the input of another node. The figure above shows an example of this format. After defining the nodes this way, the user can define detailed attributes by double-clicking each node and execute the corresponding logic through the Run button.

The dataset (input file) can be modified using a dialog. Once the user double-clicks the dataset node, the same screen as the upload preview appears. The user can also add a new row or delete and modify any row as needed.

The user can write down his or her logic. Appendix A contains every node's screenshot and field definitions.

The user can also set the output format as well as its properties. We currently have two fields: isSorted and Limit.
Controller Architecture

Controllers:
The controllers used in our application:
Login
The login controller is used when the user initially logs into the application using his or her email ID and password; the user can also log in using Facebook. If the user is not authorized, a "not_authorized" alert is thrown, and if any other error occurs during login, an "error" alert is shown to the user.
Register
The Register controller is used when the user wants to register for a new account; details such as first name, last name, email, and password must be entered. If any of the fields is left empty, an alert such as "First Name is empty!" is shown.
App
This controller contains the footer and navigator pages. The navigator page contains buttons such as Upload, Run, and Result; here the user can upload the desired files, run them, and obtain the result.
Main
It is the main controller of the VisualAnalyticsApp. The main controller draws the actual flowchart and acts on it. There are five major actions in our program:
- Initialize and define the actual flowchart object.
- Add a node and handle double-clicking. Each time a node is added, the script in the global variable is updated and the corresponding JSON is sent to the server so that the latest node structure is maintained for each user.
- Delete the selected node and flow line.
- When a node is double-clicked, its attributes are displayed in a modal dialog. The attributes are largely divided into input, tool, and output.
- Clear all nodes and lines and initialize the data.
Upload
This controller is used to upload files to the server. The user can upload multiple files, and the maximum file size and number of files can be configured; if a file exceeds the set size or does not match the allowed file types, it is not uploaded.
UploadPreview
After uploading the files, the user can preview and edit them using this controller. If the user has uploaded multiple files, he or she can preview and edit all of them sequentially. A pagination feature loads only 100 rows at a time for faster loading. Once editing is done, the changes are saved to the files, which are then used by the app for processing.
Result
This controller provides the user's job history; it shows the job_id and status along with the corresponding result_url.
Authentication
Users can log in through their email address or Facebook account. Logging in is required in order to load the machine learning logic the user has drawn. After login, the user data stored in MySQL and linked through Node.js is loaded.


We use the basic authentication procedure of Facebook. Once the user successfully logs in via Facebook, the user receives a JWT (JSON Web Token) that is shared with the server. Every HTTP request must carry the JWT to maintain the login session.
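As a minimal, illustrative sketch of this token cycle (shown in Python with PyJWT for consistency with the other backend sketches; the actual token handling lives in the Node.js layer, and the secret and payload fields are assumptions):
import datetime
import jwt  # PyJWT

SECRET = "change-me"  # illustrative signing secret

def issue_token(user_id):
    """Create a signed token after a successful Facebook or database login."""
    payload = {"sub": str(user_id),
               "exp": datetime.datetime.utcnow() + datetime.timedelta(hours=1)}
    return jwt.encode(payload, SECRET, algorithm="HS256")

def verify_token(token):
    """Validate the token attached to an HTTP request; return the user ID or None."""
    try:
        return jwt.decode(token, SECRET, algorithms=["HS256"])["sub"]
    except jwt.InvalidTokenError:
        return None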
Frontend API Definitions
The frontend has nine routing functions:
Router | Descriptions |
POST /users/addUser | Registers a new user with the given parameters. It checks whether the user already exists. |
GET /users/login | Performs login. Returns the user's entire information as JSON. |
GET /users/FBlogin | Performs login with Facebook. Returns the user's entire information as JSON. |
GET /users/updateScript | Updates the user's script. |
GET /users/userHistory/:seq | Gets the user's job history by user_seq. |
POST /files/file_upload | Uploads a file. |
GET /files/get_file_list | Gets the file list. |
GET /file/get_file/:file_name | Gets one file. |
Backend API Definitions
The backend has three routing functions:
Router | Descriptions |
/process/<dataset_id> | Depending on dataset_id, our Apache Spark instance runs that dataset with the user's specific requests such as map, reduce, filter, etc. |
/upload | Supports uploading an input file |
/download/<dataset_id> | Downloads the Apache Spark result for the given dataset_id |
The backend router connects the user actions (drag and drop of the input dataset node, Spark operations such as Map, Reduce, Split, and Filter, and the output node (CSV, JSON, etc.)) to the backend Python services. The backend router collects the data from the front-end draggable interface and assembles the data required for the request to the Python API. The Python API then goes through the received HTTP request and runs the file through the Spark nodes to generate a processed file.
HTTP API Request:
The code below shows the HTTP request in AngularJS:
$http({
method: "POST",
url: 'http://localhost:8080/process',
data: {
"dataset_id": dataset_id,
"node_chain": jsonFile,
"output": output_options
}
}).then((results) => {
alert(results)
}, (error) => {
console.log("Error", error);
});
The Python API HTTP request
The request has 3 important parameters.
- Dataset_id: A unique dataset ID identifying the input node the user wants to process.
- Node_chain: An array of JSON objects specifying the list of algorithms the user has dragged to build a model. These algorithms are executed sequentially, and the final processed file is collected.
Sample JSON

node_chain = [
{
"node": "map",
"logic": <User Logic>
},
{
"node": "reduce",
"logic": <User Logic>
},
{
"node": "filter",
"logic": <User Logic>
},
{
"node": "extractUsingRegex",
"params": {
"regex": <User Written regex>,
"column": 0
}
}
]
- Output: The output has three sub-parameters, listed below. (A sketch of a Python handler that consumes these parameters follows the list.)
- Output file format (CSV, JSON, etc.)
- isSorted (whether the file needs to be sorted after processing)
- Limit (row limit after processing)
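A minimal sketch of a Python handler that consumes these three parameters, assuming Flask and stand-in helpers (load_rows, apply_node) purely for illustration:
from flask import Flask, request, jsonify

app = Flask(__name__)

def load_rows(dataset_id):
    """Hypothetical stand-in: the real service maps dataset_id to a file and loads it into Spark."""
    return [["N PIPER ST", "110"], ["N RICHARDS ST", "151"]]

def apply_node(node, rows):
    """Placeholder for node execution (see the dispatch sketch in the Backend Routing section)."""
    return rows

@app.route("/process", methods=["POST"])
def process():
    body = request.get_json()
    rows = load_rows(body["dataset_id"])          # 1) dataset_id selects the input file
    for node in body["node_chain"]:               # 2) node_chain is applied sequentially
        rows = apply_node(node, rows)
    out = body.get("output", {})                  # 3) output options shape the final result
    if out.get("isSorted"):
        rows = sorted(rows)
    if out.get("limit"):
        rows = rows[: int(out["limit"])]
    return jsonify({"status": "completion", "rows": rows})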
Database Schema

The following tables are used in the database design. The table schema diagram shows the tables that store user information, user transactions, and file URLs, and how they are linked. A rough sketch of these entities follows.
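As an assumption-laden illustration of the three kinds of tables described above (all table and column names here are guesses; the authoritative schema is the diagram itself):
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users    (user_id INTEGER PRIMARY KEY, email TEXT UNIQUE, password_hash TEXT,
                       first_name TEXT, last_name TEXT, script TEXT);
CREATE TABLE datasets (dataset_id TEXT PRIMARY KEY, user_id INTEGER REFERENCES users(user_id),
                       name TEXT, path TEXT);
CREATE TABLE jobs     (job_id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(user_id),
                       status TEXT, result_url TEXT);
""")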
Python Services Interface to Frameworks
The Python services that act as an interface to the machine learning and data analytics frameworks are designed to be modular and flexible, so they can be extended and modularized on the fly. These modules keep each node's logic in separate files in the service's local storage. As soon as a node is invoked, the file corresponding to that node is imported and its logic is appended to the input dataset object. In this way, a chain of operations for all the nodes is appended to the node chain, and the whole chain is then executed, rendering the final output of the process.
The map, reduce, and filter nodes differ slightly in behavior from the other regular nodes. They take input code that must be imported when applying the map, reduce, or filter logic. This makes it necessary to save the input code into a Python module on the fly, so the code can be written to a file and loaded instantly as a module for use by the map, reduce, and filter nodes. After the map, reduce, or filter node has been applied, the loaded module is considered obsolete and the file containing the code is deleted from disk, as sketched below.
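A minimal sketch of that save/import/delete lifecycle, assuming the user's code defines a function named logic (the function name and temporary paths are illustrative):
import importlib.util
import os
import tempfile

def load_user_logic(code):
    """Write user code to a throwaway module, import it, and return its `logic` callable."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(code)
    spec = importlib.util.spec_from_file_location("user_logic", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    os.remove(path)                      # the module file is obsolete once loaded
    return module.logic

# Example: a map step whose body the user typed in the web UI
double = load_user_logic("def logic(row):\n    return [v * 2 for v in row]\n")
print(double([1, 2, 3]))                 # [2, 4, 6]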
The custom node is another exception in node behavior. The user can input custom code in the web UI to be executed for the node. He or she can choose to name the node explicitly and save the node logic under his or her account history; this way the user can access the node at any time in the future. If the user decides not to name the logic and leaves it anonymous, like anonymous functions in most programming languages, the logic is saved to a temporary file, deleted after serving its purpose, and cannot be reused in later executions.
The nodes come in three types: input nodes, processing nodes, and output nodes. The input nodes read the input from the file using the database mapping of file name, path, and ID; they are used as the starting points of the node chain. The processing nodes append their node logic to the input dataset object, to be executed when output is requested from it. The output nodes execute the appended node logic on the dataset to obtain the output, which is saved to a file, shown on the web portal, or downloaded directly according to the options and parameters selected by the user.
Project Implementation
Authentication
There are two types of authentication systems: 1) Facebook Login: after registering as a third-party application on Facebook, we allow users to log in with their Facebook account. Since every Facebook user has a unique key value, when they log in we store their key and user information in the browser's cookie and in our DB. 2) Database Login: users can sign up with a simple form and log in with their e-mail and password. In this case, the password is hashed with the Bcrypt library using a unique salt value.
FB.login(function (response) {
// handle the response
if (response.status === 'connected') {
// Logged into your app and Facebook.
FB.api('/me?fields=email,first_name,last_name,link', function (fb_userinfo) {
const {
email: email_id = "",
password = "",
first_name = "",
last_name = "",
id: user_name
} = fb_userinfo
$scope.userinfo = {
user_name,
email_id,
password,
first_name,
last_name
};
$http.post("user/FBlogin", $scope.userinfo)
.then(response => {
localStorage.setItem("__USER_INFO__", JSON.stringify(response.data)); // Save User Info into LocalStorage
localStorage.setItem("__USER_SCRIPT__", response.data.script);
$state.go('index.main');
}).catch(error => {
alert(error.data);
})
});
} else if (response.status === 'not_authorized') {
alert("not_authorized");
} else {
alert("error");
}
}, { scope: 'public_profile,email' });
Users can also sign up by providing simple information. All fields must be filled in, and IDs are automatically checked for duplicates.
$scope.register = function () {
const { first_name, last_name, user_name, password, email_id } = $scope.userinfo;
if (password !== $scope.password2) {
alert("Password does not match!");
return;
}
if (first_name === "") {
alert("First Name is empty!");
return;
}
if (last_name === "") {
alert("Last Name is empty!");
return;
}
if (password === "") {
alert("password is empty!");
return;
}
if (email_id === "") {
alert("Email is empty!");
return;
}
$http({
url: "user/addUser",
method: "POST",
dataType: 'json',
data: $scope.userinfo
}).then(function successCallback(response) {
// this callback will be called asynchronously
// when the response is available
//console.log(response.data)
var __USER_INFO__ = response.data;
localStorage.setItem("__USER_INFO__", JSON.stringify(__USER_INFO__));
$state.go('index.main');
}).catch((error) => {
alert(error);
});
}
After the user logs in, the user information is stored in LocalStorage in JSON format as described above, and the information needed in the GUI can be fetched through the cookie. The logged-in user can then invoke his or her existing machine learning logic.

After logging in, the web token is stored in the browser's cookie so that the session can be maintained. The GUI periodically sends a ping to the Node.js server, and if the token is invalid, the cookie is cleared and the user is automatically logged out.
Visualization Tools
After analyzing Node-RED, which we benchmarked, we found that it was built using HTML5's Canvas. We therefore decided to use jquery.flowchart, a library that uses Canvas. jquery.flowchart supports basic drawing of nodes and their connections as a graph. It also supports drag and drop, so we can use it to move and connect each drawn node.
Drag-and-drop linkage between the HTML DOM and the canvas is implemented using the Draggable function of jQuery UI. The overall UI uses Bootstrap, with MVVM management via AngularJS and front-end server management via Node.js and Express.js.
Drag and Drop
Drag and drop uses the functionality of jQuery UI. The left menu is defined in the navigation.html file, and each menu item is bound to data-* attributes, which define the title, index, and so on.
<!-- START Tree Sidebar Common -->
<ul class="side-menu">
<li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="map" data-idx="1"
data-mode="tool">
<a href>
<div>
<div class="nav-label" style="z-index:10000;">Map</div>
</div>
</a>
</li>
<li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="reduce"
data-idx="2" data-mode="tool">
<a href>
<span class="nav-label">Reduce</span>
</a>
</li>
<li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="filter"
data-idx="3" data-mode="tool">
<a href>
<span class="nav-label">Filter</span>
</a>
</li>
</ul>
Every draggable node is associated with the draggable_operator class, which is made draggable through the code in nav.js. The code below does the following: 1) reads the file list from the server; 2) dynamically attaches the read files to the DOM (so the file list is draggable and droppable as well); 3) enables drag and drop together with all the other menus (handled by $draggableOperators.draggable()); 4) maps the draggable objects bound in (3) to getOperatorData(). In getOperatorData(), we create an object for the actual future data processing from data such as the data-* attributes of the DOM tag.
$http({
url: "file/get_file_list",
method: "GET"
}).then(function successCallback(response) {
// this callback will be called asynchronously
// when the response is available
//console.log(response.data)
$scope.fileList = response.data;
console.log(!lodash.isEmpty(response.data))
if (!lodash.isEmpty(response.data)) {
response.data.forEach((file, idx) => {
$("#inputFileList")
.append(`<li class="primary-submenu draggable_operator" data-nb-inputs="0" data-nb-outputs="1" data-title="${file.name}" data-dataset_id="${file.dataset_id}" data-idx="${idx + 7}" data-mode="input" ><a href="#">
<div>
<div class="nav-label" style="z-index:10000;">${file.name}</div>
</div>
</a>
</li>`)
});
}
var $draggableOperators = $('.draggable_operator');
console.log($draggableOperators);
function getOperatorData($element) {
var nbInputs = parseInt($element.data('nb-inputs'));
var nbOutputs = parseInt($element.data('nb-outputs'));
var dataset_id = ($element.data('mode') === 'input' ? $element.data('dataset_id') : 0);
var data = {
properties: {
title: $element.data('title'),
inputs: {},
outputs: {},
dataset_id,
mode: $element.data('mode')
}
};
var i = 0;
for (i = 0; i < nbInputs; i++) {
data.properties.inputs['input_' + i] = {
label: 'Input ' + (i + 1)
};
}
for (i = 0; i < nbOutputs; i++) {
data.properties.outputs['output_' + i] = {
label: 'Output ' + (i + 1)
};
}
console.log(data);
return data;
}
var operatorId = 0;
$draggableOperators.draggable({
cursor: "move",
opacity: 0.7,
appendTo: 'body',
zIndex: 1000,
helper: function (e) {
var $this = $(this);
var data = getOperatorData($this);
return $rootScope.flowchart.flowchart('getOperatorElement', data);
},
stop: function (e, ui) {
var $this = $(this);
var elOffset = ui.offset;
var $container = $rootScope.flowchart.parent();
var containerOffset = $container.offset();
if (elOffset.left > containerOffset.left &&
elOffset.top > containerOffset.top &&
elOffset.left < containerOffset.left + $container.width() &&
elOffset.top < containerOffset.top + $container.height()) {
var flowchartOffset = $rootScope.flowchart.offset();
var relativeLeft = elOffset.left - flowchartOffset.left;
var relativeTop = elOffset.top - flowchartOffset.top;
var positionRatio = $rootScope.flowchart.flowchart('getPositionRatio');
relativeLeft /= positionRatio;
relativeTop /= positionRatio;
var data = getOperatorData($this);
data.left = relativeLeft;
data.top = relativeTop;
$rootScope.flowchart.flowchart('addOperator', data);
var data2 = $rootScope.flowchart.flowchart('getData');
$http.post('user/updateScript', {
script: JSON.stringify(data2),
user_id: $rootScope.userinfo.user_id
});
localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data2));
}
}
});
}, function errorCallback(response) {
// called asynchronously if an error occurs
// or server returns response with an error status.
console.log(response.statusText);
});
As mentioned above, MainCtrl handles all actions on the canvas and is responsible for dragging, dropping, selecting nodes, connecting nodes, etc. The most important responsibility of MainCtrl is to call /user/updateScript on every action so that the latest edge and node information is kept up to date.
$flowchart.on('linkCreate', function (linkId, linkData) {
var data = $flowchart.flowchart('getData');
$http.post('user/updateScript', {
script: JSON.stringify(data),
user_id: $rootScope.userinfo.user_id
});
$scope.script = data;
localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data));
});
When the user logs in, the user information is kept in localStorage under __USER_INFO__. All flowchart data is also kept in localStorage under the __USER_SCRIPT__ key; it is assigned at login for each user.
$scope.script = $.parseJSON(lodash.isEmpty(localStorage.getItem("__USER_SCRIPT__")) ? "{}" : localStorage.getItem("__USER_SCRIPT__"));
One of the most important things in the main canvas is defining a node's properties when the user double-clicks it. To handle this, we added an event to the existing jquery.flowchart. In this handler, we process the various data from the user's double click as follows. All objects are managed in the global scope, and data.operators[operatorId].properties holds the attributes of each node that will later be sent to the server, which is where the user-written data lives.
$flowchart.on('operatorSelect2', function (el, operatorId, returnHash) {
var data = $flowchart.flowchart('getData');
var title = data.operators[operatorId].properties.title.trim();
var mode = data.operators[operatorId].properties.mode;
$scope.tempNode.title = title;
//$("#myModal").modal('show')
if (mode === "input") { // For Dataset (Input) Node
var modalInstance = $uibModal.open({
templateUrl: 'views/modal/view_input.html',
controller: function ($scope, $uibModalInstance, $http) {
$scope.load = function () {
// Omitted. File handling is same as upload_preview
};
},
scope: $scope,
windowClass: "hmodal-success",
size: 'lg'
});
} else if (mode === "tool") { // For Logic Node
var modalInstance = $uibModal.open({
templateUrl: 'views/modal/view_node.html',
controller: function ($scope, $uibModalInstance, $http) {
$scope.map = {
logic: ''
}
$scope.filter = {
logic: ''
}
$scope.reduce = {
logic: ''
}
$scope.extractUsingRegex = {
regex: '',
column: '',
}
$scope.splitUsingRegex = {
regex: '',
column: '',
}
// Omitted.. same the other nodes
$scope.cancel = function () {
$uibModalInstance.dismiss('cancel');
};
$scope.submit = function () {
var data = $flowchart.flowchart('getData');
$uibModalInstance.dismiss('cancel');
//$rootScope.script = $scope.script;
switch (data.operators[operatorId].properties.title) {
case "map":
$rootScope.map = $scope.map;
data.operators[operatorId].properties.params = $scope.map;
break;
case "filter":
$rootScope.filter = $scope.filter;
data.operators[operatorId].properties.params = $scope.filter;
break;
case "reduce":
$rootScope.reduce = $scope.reduce;
data.operators[operatorId].properties.params = $scope.reduce;
break;
case "extractUsingRegex":
$rootScope.extractUsingRegex = $scope.extractUsingRegex;
data.operators[operatorId].properties.params = $scope.extractUsingRegex;
break;
// Omitted. Same logics of the other nodes
}
console.log($rootScope, $scope, data);
$http.post('user/updateScript', {
script: JSON.stringify(data),
user_id: $rootScope.userinfo.user_id
});
$scope.script = data;
localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data));
$flowchart.flowchart('setData', data);
console.log($flowchart.flowchart('getData'));
}
$scope.load = function () {
switch (data.operators[operatorId].properties.title) {
case "map":
$scope.map = {
...data.operators[operatorId].properties.params
}
break;
case "filter":
$scope.filter = {
...data.operators[operatorId].properties.params
}
break;
case "reduce":
$scope.reduce = {
...data.operators[operatorId].properties.params
}
break;
case "extractUsingRegex":
$scope.extractUsingRegex = {
...data.operators[operatorId].properties.params
}
break;
}
// Omitted. Same logics of the other nodes
}
$scope.load();
},
scope: $scope,
windowClass: "hmodal-success",
size: 'lg'
});
} else if (mode === "output") { // For output node
var modalInstance = $uibModal.open({
templateUrl: 'views/modal/view_output.html',
controller: function ($scope, $uibModalInstance, $http) {
$scope.cancel = function () {
$uibModalInstance.dismiss('cancel');
};
$scope.submit = function () {
$uibModalInstance.dismiss('cancel');
data.operators[operatorId].properties.limit = $scope.output.limit;
data.operators[operatorId].properties.isSorted = $scope.output.isSorted;
$flowchart.flowchart('setData', data);
$rootScope.output = $scope.output;
}
},
scope: $scope,
windowClass: "hmodal-success",
size: 'lg'
});
}
})
Backend
The backend is implemented using Python. Through a Python object connected to Apache Spark, a lightweight web server receives and delivers data according to the router design. When the dataset is loaded, it is converted into a Resilient Distributed Dataset (RDD) object in Apache Spark, which enables parallel processing of the dataset on the Spark cluster. A series of operations for node logic execution is appended to this input RDD by the different nodes in the chain. This makes the RDD a collectible object, and the process chain is executed only when the RDD needs to be collected and the result obtained. Finally, at the output node, the RDD is collected and the result is saved according to the user's choice in the web portal input. The output file mapping, from dataset ID to root path and file path, is saved to the database.
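A minimal PySpark sketch of this lazy node-chain execution; the file paths and the particular map/filter steps are illustrative, while the real service builds the chain from the user's node JSON:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("node-chain-demo").getOrCreate()
sc = spark.sparkContext

# Input node: load the mapped CSV file into an RDD (path is illustrative).
rdd = sc.textFile("/data/uploads/demo.csv").map(lambda line: line.split(","))

# Processing nodes: transformations are only appended here; nothing runs yet.
rdd = rdd.filter(lambda row: len(row) > 2)          # e.g. a user "filter" node
rdd = rdd.map(lambda row: (row[0], row[1]))         # e.g. a user "map" node

# Output node: collect() finally executes the whole appended chain.
result = rdd.collect()

# Save the processed result; its path is then recorded against a new dataset ID in the DB.
with open("/data/uploads/demo_processed.csv", "w") as f:
    f.writelines(",".join(map(str, row)) + "\n" for row in result)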
Testing and Verification
Unit Test
UI Testing
S.NO | Action Performed | Expected Result |
1. | Navigate to the Website using the URL | It should open our application screen |
2. | Drag and Drop the nodes representing the algorithm into our application | The corresponding nodes will be selected and dropped into our application |
3. | Drag and drop the map and reduce nodes onto our application nodes | The corresponding map and reduce nodes should be selected and dropped into our application |
4. | Click the Execute button to start processing | The process starts executing and a URL will be generated |
5. | Use the URL generated to view the status of the result to be generated | The generated URL will display the status of the result till the ML algorithm and other analytics logic is applied and results are generated |
6. | Save the result | Sends the saved output to the User |
Load Testing
Load testing is performed on an application to check its response when the number of users accessing the application is increased. It is used to check the performance of the system under both normal and peak load conditions.
We use JMeter to perform load testing on our application. It simulates a group of users sending requests to the target server and records the performance of the application through graphs, etc., as the number of users is gradually increased.
Steps to perform load testing on our application:
- Create a thread group to include a group of users.
- Add other JMeter components, such as the HTTP request sampler.
- Choose how we want the results to be displayed (graph, etc.).
- Execute the test and compare the results through parameters like throughput, deviation, etc.
Performance Testing
Mocha is used to perform asynchronous performance testing. It is used mostly to test Node.js applications. We use Mocha to check whether the method logic was executed correctly by inspecting the response code returned by the REST API (200, 404, etc.).

The Mocha test checks that a 200 response is accurately returned when the correct ML algorithm is passed after being dragged and dropped.
If the correct response is obtained, we know that our method logic is being executed correctly.
Authentication Test
First, the authentication test verifies that the user can reach the main page after signing in with a Facebook account or email. For a first-time user, the session information must be stored in the DB when logging in with Facebook, and in the case of email login, the user must be able to sign up via the form. On sign-up, the email must be checked for duplicates, and the password must be properly hashed through the Bcrypt library.
Also, at the time of email login, the request passes HTTP basic authentication through the HTTP header, and the credentials must be Base64 encoded.
For existing users, the user information should be stored in cookies as JSON after login, and the machine learning logic the user has drawn must be loaded. The left sidebar should contain information such as the user's name, and on logout the user information should disappear from the cookie.
Data Transaction via Router Test
Test that every specified routing function works as intended. Expected inputs and outputs will be added later.
Integrated Test
Test Scenario
All test scenarios go through the following steps: 1) user login or sign-up; 2) file upload; 3) drag and drop; 4) connecting the points between nodes; 5) testing whether data is transferred normally to the router during execution; 6) Apache Spark: testing whether the user can check the output of the backend automatically; 7) testing whether the user can check the output automatically through the WebSocket when the result is displayed.

LON | LAT | NUMBER | STREET | UNIT | CITY | DISTRICT | REGION | POSTCODE | ID | HASH |
-118.2814555 | 37.165987 | 110 | N PIPER ST | | BIG PINE | | | | | 848e5d7a57c080f8 |
-118.2837718 | 37.1656787 | 151 | N RICHARDS ST | | BIG PINE | | | | | 9e5b461661958ede |
-118.282186 | 37.1658763 | 155 | N PIPER ST | | BIG PINE | | | | | 85bc70e3800d233b |
-118.2807843 | 37.1649342 | 214 | N PIPER ST | | BIG PINE | | | | | d44de7316ab8293a |
-118.2807845 | 37.1646678 | 242 | N PIPER ST | | BIG PINE | | | | | c0478e2afefb3b57 |
-118.2814629 | 37.165196 | 204 | N PIPER ST | | BIG PINE | | | | | fa38d2e8030b1c27 |
-118.2814706 | 37.1643965 | 264 | N PIPER ST | | BIG PINE | | | | | c4924f357414fde1 |
-118.2814681 | 37.164663 | 244 | N PIPER ST | | BIG PINE | | | | | fc8a7b2c7a41baa1 |
-118.2807859 | 37.1630689 | 392 | N PIPER ST | | BIG PINE | | | | | 5cb87804d2374eb9 |
-118.2814758 | 37.1638635 | 296 | N PIPER ST | | BIG PINE | | | | | fc97a9f07cc4ce94 |
Test Result
Test results will be added at the end of the project.
Performance and Benchmarks
Performance is measured by how responsive the application is and how fast file upload, file download, and machine learning algorithm execution are. Application navigation functionality such as login and drag and drop should be quick and complete within a second. The execution time for a machine learning algorithm depends on the size of the uploaded dataset and the type of algorithm used; this time can be anywhere between 10 seconds and 5 minutes.
The project (application server, database, Apache Spark, etc.) has been deployed on Google Compute Cloud, which decreases the downtime of the application. This cloud deployment provides an application availability of 99.96%.
Hundreds of tests were performed based on the test cases given above (unit test cases, etc.), and the performance benchmarks below were collected. The values shown are the average times taken across all the relevant test cases.
The table below describes the performance and benchmarks for the common functionalities.
Type of event | Operation Performed | Average Time Observed (in sec) |
Application Navigation | Login | < 1 sec |
Application Navigation | Sign up | < 1 sec |
Application Navigation | Navigate to the Website using the URL | < 1 sec |
Design ML model | Drag and Drop the nodes representing the algorithm into our application | Instant (~ < 100 ms) |
Design ML model | Drag and drop the map and reduce nodes onto our application nodes | Instant (~ < 100 ms) |
Dataset | Uploading file | 3 Sec/ 20 MB file |
Dataset | Download output file | 3 sec/ 20 MB file |
Machine Learning algorithm execution | Map algorithm from Apache Spark | 4 min/ 10 MB file |
Machine Learning algorithm execution | filter algorithm from Apache Spark | 4 min/ 10 MB file |
Deployment

The deployment plan for our project is shown in the diagram above. First, the developer works by creating a new Git branch locally. Since we use a Git branching strategy, there is a master branch by default and additional version-specific [release-major-minor-patchlevel] branches created by developers on demand. For example, development of the first version goes through a branch called release-0.1.0, and each developer creates this branch locally on his or her machine. Conflicts between developers and between modules are avoided for each version and prevented through periodic meetings.
Development is done through an IDE (e.g., IntelliJ, Atom) and run on the local machine through the task manager so that it can be tested locally.
The developer submits pull requests to our public GitHub repository for tests and local execution, and if there are no conflicts between the local tests and the global tests, the team leader accepts the pull request and merges it into the master branch.
We use the continuous integration tool Travis CI, which continually hooks the master branch and automatically performs scheduled tasks when a hash value change is detected. We proceed with Docker packaging for each version that has passed the tests, producing three Docker images from our source code: 1) GUI (frontend), 2) Cloud Service (backend), 3) Machine Learning Library.
The packaged Docker containers are then automatically pushed to the Google Container Registry, from which the developers deploy them using Kubernetes; the rollout is performed sequentially for each container. If an error is observed via the browser, we can revert to the previous version via Kubernetes.
When an error occurs, we follow the Git branching strategy to create a [hotfix-major-minor-patchlevel] branch and go through the same deployment process described above.
Summary, Conclusions, and Recommendations
Summary
Any user who wants to use analytics or machine learning typically has to follow a strict set of rules and guidelines, and must also set up the processing environment before using any machine learning algorithm. Highly skilled developers are needed to leverage analytic solutions or machine learning models efficiently and to produce the best results. We wanted to make this process easy and simple for users, without the tedious setup.
Our final project is an interactive web-based UI that allows users to drag and drop dataset files or components for processing. The user can choose among many available machine learning algorithms or include his or her own code required for the procedure, and can upload data in one of a given set of supported file formats. Once the user drags and drops all the components and clicks the 'Process' button in the UI, the back-end system, implemented in Python and Node.js, automatically runs the corresponding optimized services and provides a different URL for each process. A wide set of TensorFlow libraries is used in our application, and the user can also preprocess the data after uploading. The end result is provided to the user and can be downloaded to his or her system. Since the application is Software as a Service, users do not need to install any software; they can use the application and access the results from anywhere, and it is extremely easy to use. Because all the major processes are handled by the application, it greatly saves users' time and makes them highly efficient.
Conclusions
In earlier days, data was collected through surveys, FAQs, and similar means, but as technology evolved, so did the ways to collect and record data, followed by the use of complex machine learning algorithms and techniques to process this huge amount of collected data. Today, simply using a machine learning algorithm efficiently and effectively can be a very difficult task, requiring in-depth and thorough knowledge before applying it. Using our web-based visual machine learning application, the user can just drag and drop the dataset files, objects, or components, such as inputs or any popular machine learning algorithm, that are needed for the procedure. Users can also link different machine learning algorithms to obtain a final output that is the combined result of all the selected algorithm components. The proposed back-end system then runs optimized services and provides different URLs, after which the results are dynamically displayed to the user, who can download them at his or her convenience. This application allows users to apply complex technologies to their data without having to spend time mastering the technology or even having in-depth knowledge of it.
Recommendations for Further Research
Some recommendations that could be added to our project in the future and that may further increase its usefulness are as follows:
• To include and support more uploadable file type extensions when the user wants to upload datasets in different formats.
• To be able to modify the dataset manually after uploading it.
• To update the user with live progress about the output when many components are included.
• To include a wider range of TensorFlow libraries in the application for the user to use.
License © San Jose State University
Involved with A. Chellagurki, K. Golani, V. N. K. Kuchi, and Ikwhan Chang