DataScience4NP: Data Science for Non-Programmers
Description
Background
The leading consulting company McKinsey estimates that a shortage of data scientists will keep organizations from exploring the full potential of big data. By 2018, the United States alone will face a shortage of 140,000 to 190,000 professionals with strong analytical skills and the know-how to analyze big data to make effective decisions. This shortage will be more dramatic in Portugal since, in contrast to US universities that have offered Data Science degrees for several years (e.g., at Berkeley and Carnegie Mellon University), Portuguese universities are just taking the first steps. In short, without Data Science professionals, the competitive advantage that big data can bring to Portuguese companies will remain untapped.
The Problem
This shortage of professionals cannot be mitigated easily, since training students to become data scientists requires time and resources to teach skills from diverse knowledge areas such as Computer Science, Statistics, Business, and Data Visualization. Data scientists are involved across the full data lifecycle, from acquiring new data sets to making business decisions based on the knowledge discovered. They need to be skilled with programming languages (e.g., Python and Perl) to clean, integrate, and transform data, and to use complex programming packages (e.g., scikit-learn for analytics and Matplotlib for visualization). Mastering this type of working environment is not easy.
Objective
The objective of the DataScience4NP project is to explore the use of visual programming paradigms to enable non-programmers to be part of the Data Science workforce.
Existing Approaches
In contrast to existing approaches, which require programming, Scientific Workflow Management Systems (SWMS) can become an alternative to support the visual programming of data science projects. Such systems (e.g., Taverna and Kepler) use a simple, graph-based visual structure to develop applications. This simplicity has proven suitable in several scientific areas such as bioinformatics, geophysics, and climate analysis.
Limitation
Despite the success of SWMS in data-intensive research, they have not reached a state where non-programmer data scientists can use them. They still require some programming and scripting skills to code individual processing tasks. That is why research teams using those systems are usually composed of scientists and software developers. Thus, further research is required to remove the programming these systems still require and make them suitable for non-programmers.
Proposed Approach
We propose to extend current SWMS to support the parameterization of generic prebuilt workflow templates. Workflow templates capture the processing tasks of data science projects. A template can be seen as a formalized best practice that data scientists can use to solve common data analysis challenges. Templates are developed by multidisciplinary teams of experts and reused by non-programmer data scientists, since using them does not require programming. Parameterized workflows have been used successfully in the field of enterprise computing since 1970 to increase software reuse. For example, SAP became the largest software company in Europe by using parameterized workflows to automate business process models. We claim that the same type of benefits can be obtained by parameterizing scientific workflow templates.
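As an illustration only, the idea of a parameterized workflow template can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's implementation and not the Taverna API: the names Step, WorkflowTemplate, drop_missing, keep_above, and average, as well as the sample data, are hypothetical. The expert team writes the steps once; the non-programmer only supplies parameter values.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# All names here are hypothetical illustrations; none of them belong to
# Taverna or to any other SWMS.


@dataclass
class Step:
    """One processing task in the template and the parameters it exposes."""
    name: str
    func: Callable[..., Any]
    param_names: List[str] = field(default_factory=list)


@dataclass
class WorkflowTemplate:
    """A fixed chain of steps that is configured only through parameters."""
    name: str
    steps: List[Step]
    defaults: Dict[str, Any] = field(default_factory=dict)

    def run(self, data: Any, **params: Any) -> Any:
        allowed = {p for step in self.steps for p in step.param_names}
        unknown = set(params) - allowed
        if unknown:
            raise ValueError(f"Unknown parameters: {sorted(unknown)}")
        settings = {**self.defaults, **params}
        missing = allowed - set(settings)
        if missing:
            raise ValueError(f"Missing parameters: {sorted(missing)}")
        # Feed the output of each step into the next one.
        for step in self.steps:
            kwargs = {p: settings[p] for p in step.param_names}
            data = step.func(data, **kwargs)
        return data


# Steps written once by the expert team.
def drop_missing(rows):
    return [r for r in rows if None not in r.values()]


def keep_above(rows, column, threshold):
    return [r for r in rows if r[column] > threshold]


def average(rows, column):
    values = [r[column] for r in rows]
    return sum(values) / len(values) if values else None


template = WorkflowTemplate(
    name="filter-and-average",
    steps=[
        Step("clean", drop_missing),
        Step("filter", keep_above, ["column", "threshold"]),
        Step("summarize", average, ["column"]),
    ],
    defaults={"threshold": 0},
)

# What the non-programmer does: pick a template and fill in parameters.
rows = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": None},
    {"region": "east", "sales": 80},
    {"region": "west", "sales": 200},
]
result = template.run(rows, column="sales", threshold=100)
```

In this sketch the user never edits the step functions; reusing the template on another data set means changing only the values passed to run. In a visual SWMS, those values would be entered through a form rather than a function call.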
Implementation
The proposed approach will be implemented in Taverna, an open source software tool for designing and executing workflows, which is used by 350 major research institutions worldwide. The platform “myexperiment.org” will be used for sharing workflow templates among data scientists.
Evaluation
Portugal Telecom (PT) and the national agency for the modernization of public administration (AMA), which manages governmental Open Data, will evaluate the final system. The evaluation will also be conducted by inviting non-programmers at the Codebits hackathon, an event organized by Portugal Telecom for several years that attracts hundreds of young participants interested in developing science, business, and technology projects.
Benefits
On the one hand, a software system will be implemented for students and professionals to conduct data science experiments. On the other hand, the project will generate a wealth of material (workflow templates, datasets, articles, code, and software), which will be used to prepare a new hands-on course on Data Science and SWMS to be offered both at the University of Coimbra (UC) and at the University Institute of Lisbon (ISCTE-IUL).
Team
DataScience4NP brings together a telecommunications operator (PT) that needs solutions for big data analysis; the agency for the modernization of the public administration (AMA), which requires techniques for governmental open data analysis; and two research units to undertake the research: CISUC (UC) and ISTAR (ISCTE-IUL).
Researchers
Funded by
FCT
Partners
AMA - Agência para a Modernização Administrativa I.P. (AMA), Fundação Portugal Telecom (Fundação PT), ISCTE - Instituto Universitário de Lisboa (ISCTE-IUL)
Total budget
186 450,00 €
Local budget
144 420,00 €
Keywords
Data Science, Distributed Systems, Microservices
Start Date
2016-06-01
End Date
2019-05-31
Conference Articles
2019
(2 publications)
- Correia, J. and Araujo, F. and Cardoso, J. and Filipe, R., "Towards Occupation Inference in Non-instrumented Services", in The 18th IEEE International Symposium on Network Computing and Applications (NCA 2019), 2019
- Filipe, R. and Araujo, F., "Client-Side Monitoring of HTTP Clusters Using Machine Learning Techniques", in 18th IEEE International Conference on Machine Learning and Applications - ICMLA 2019, 2019
2018
(5 publications)
- Pedroso, A. and Lopes, B.L. and Correia, J. and Araujo, F. and Cardoso, J. and Paiva, R.P., "A Data Mining Service for Non-Programmers", in 10th International Conference on Knowledge Discovery and Information Retrieval – KDIR 2018, 2018
- Lopes, B.L. and Pedroso, A. and Correia, J. and Araujo, F. and Cardoso, J. and Paiva, R.P., "DataScience4NP - A Data Science Service for Non-Programmers", in 10º Simpósio de Informática – INForum 2018, 2018
- Filipe, R. and Correia, J. and Araujo, F. and Cardoso, J., "On Black-Box Monitoring Techniques for Multi-Component Services", in The 17th IEEE International Symposium on Network Computing and Applications (NCA 2018), 2018
- Correia, J. and Ribeiro, F. and Filipe, R. and Araujo, F. and Cardoso, J., "Response Time Characterization of Microservice-Based Systems", in The 17th IEEE International Symposium on Network Computing and Applications (NCA 2018), 2018
- Pina, F. and Correia, J. and Filipe, R. and Araujo, F. and Cardoso, J., "Nonintrusive Monitoring of Microservice-based Systems", in The 17th IEEE International Symposium on Network Computing and Applications (NCA 2018), 2018
2017
(1 publication)
2016
(1 publication)