DataScience4NP: Data Science for Non-Programmers

Description

Background: The leading consulting company McKinsey estimates that there will be a shortage of data scientists to enable organizations to explore the full potential of big data. By 2018, the United States alone will face a shortage of 140,000 to 190,000 professionals with strong analytical skills and the know-how to analyze big data and make effective decisions. This shortage will be more dramatic in Portugal: whereas US universities (e.g., Berkeley and Carnegie Mellon University) have offered Data Science degrees for several years, Portuguese universities are only taking their first steps. In short, without Data Science professionals, the competitive advantage that big data can bring to Portuguese companies will remain untapped.

The Problem: This shortage of professionals cannot be mitigated easily, since training students to become data scientists requires time and resources to teach skills from diverse knowledge areas such as Computer Science, Statistics, Business, and Data Visualization. Data scientists are involved across the full data lifecycle, from acquiring new data sets to making business decisions based on the knowledge discovered. They need to be skilled with programming languages (e.g., Python and Perl) to clean, integrate, and transform data, and to use complex programming packages (e.g., scikit-learn for analytics and Matplotlib for visualization). Mastering this type of working environment is not easy.

Objective: The objective of the DataScience4NP project is to explore the use of visual programming paradigms to enable non-programmers to be part of the Data Science workforce.

Existing Approaches: In contrast to existing approaches, which require programming, Scientific Workflow Management Systems (SWMS) can become an alternative that supports the visual programming of data science projects. Such systems (e.g., Taverna and Kepler) use a simple graphical, graph-based structure to develop applications. This simplicity has proven suitable in several scientific areas such as bioinformatics, geophysics, and climate analysis.

Limitation: Despite the success of SWMS in data-intensive research, they have not reached a state where non-programmer data scientists can use them. They still require some programming and scripting skills to code individual processing tasks. That is why research teams using those systems are usually composed of both scientists and software developers. Further research is therefore required to remove the programming these systems still demand and make them suitable for non-programmers.

Proposed Approach: We propose to extend current SWMS to support the parameterization of generic prebuilt workflow templates. Workflow templates capture the processing tasks of data science projects. A template can be seen as a formalized best practice that data scientists can use to solve common data analysis challenges. Templates are developed by multidisciplinary teams of experts and reused by non-programmer data scientists, since using them does not require programming. Parameterized workflows have been used successfully in the field of enterprise computing since 1970 to increase software reuse. For example, SAP became the largest software company in Europe by using parameterized workflows to automate business process models. We claim that the same type of benefits can be obtained by parameterizing scientific workflow templates.

Implementation: The proposed approach will be implemented in Taverna, an open source software tool for designing and executing workflows, which is used by 350 major research institutions worldwide. The platform “myexperiment.org” will be used for sharing workflow templates among data scientists.

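
The idea of a parameterized workflow template described under Proposed Approach can be illustrated with a minimal sketch. It is not Taverna's API and not the project's implementation; every name in it (`Step`, `WorkflowTemplate`, `CLEAN_AND_AVERAGE`, the column names, and the sample data) is a hypothetical chosen for illustration. It assumes a template is a fixed sequence of processing steps written once by an expert team, and that the non-programmer supplies only data and parameter values.

```python
# Illustrative sketch only: this is NOT Taverna's API.
# All class, function, and column names are hypothetical.
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Step:
    """One processing task, written once by the expert team."""
    name: str
    run: Callable[[List[Row], Dict[str, Any]], List[Row]]


@dataclass(frozen=True)
class WorkflowTemplate:
    """A fixed sequence of steps plus the parameters a user must supply."""
    steps: List[Step]
    required_params: List[str]

    def execute(self, rows: List[Row], params: Dict[str, Any]) -> List[Row]:
        # Reject incomplete parameter sets before running any step.
        missing = [p for p in self.required_params if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        for step in self.steps:
            rows = step.run(rows, params)
        return rows


def drop_missing(rows: List[Row], params: Dict[str, Any]) -> List[Row]:
    """Cleaning step: discard rows whose value column is empty."""
    col = params["value_column"]
    return [r for r in rows if r.get(col) is not None]


def average_by_group(rows: List[Row], params: Dict[str, Any]) -> List[Row]:
    """Analysis step: mean of the value column per group."""
    groups: Dict[Any, List[float]] = {}
    for r in rows:
        groups.setdefault(r[params["group_column"]], []).append(r[params["value_column"]])
    return [{"group": g, "mean": mean(v)} for g, v in sorted(groups.items())]


# The expert team's part: assemble the reusable template once.
CLEAN_AND_AVERAGE = WorkflowTemplate(
    steps=[Step("drop_missing", drop_missing), Step("average_by_group", average_by_group)],
    required_params=["group_column", "value_column"],
)

# The non-programmer's part: data plus parameter values, no new code.
data = [
    {"city": "Coimbra", "usage": 10.0},
    {"city": "Coimbra", "usage": 14.0},
    {"city": "Lisboa", "usage": None},
    {"city": "Lisboa", "usage": 8.0},
]
result = CLEAN_AND_AVERAGE.execute(data, {"group_column": "city", "value_column": "usage"})
print(result)
```

In this sketch the processing logic lives entirely inside the template, so reusing it on another dataset only means changing the two parameter values. In a visual tool, those values would presumably be entered through a form rather than a dictionary.
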
Evaluation: Portugal Telecom (PT) and the national agency for the modernization of public administration (AMA), which manages governmental Open Data, will evaluate the final system. The evaluation will also involve non-programmers invited at the Codebits hackathon, an event that Portugal Telecom has organized for several years and that attracts hundreds of young participants interested in developing science, business, and technology projects.

Benefits: On the one hand, a software system will be implemented for students and professionals to conduct data science experiments. On the other hand, the project will generate a wealth of material (workflow templates, datasets, articles, code, and software), which will be used to prepare a new hands-on course on Data Science and SWMS to be offered at both the University of Coimbra (UC) and the University Institute of Lisbon (ISCTE-IUL).

Team: DataScience4NP brings together a telecommunications operator (PT) that needs solutions for big data analysis; the agency for the modernization of public administration (AMA), which requires techniques for analyzing governmental open data; and two research units that will undertake the research, CISUC (UC) and ISTAR (ISCTE-IUL).

Researchers

Funded by

FCT

Partners

AMA - Agência para a Modernização Administrativa I.P. (AMA), Fundação Portugal Telecom (Fundação PT), ISCTE - Instituto Universitário de Lisboa (ISCTE-IUL)

Total budget

186 450,00 €

Local budget

144 420,00 €

Keywords

Data Science, Distributed Systems, Microservices

Start Date

2016-06-01

End Date

2019-05-31

Conference Articles

2019

(2 publications)

2018

(5 publications)

2017

(1 publication)

2016

(1 publication)