Welcome to MLDAS 2023!

October 23-24, 2023

MACHINE LEARNING AND DATA ANALYTICS SYMPOSIUM

MLDAS 2023 is dedicated to fostering connections between researchers, practitioners, students, and industry experts in the fields of machine learning and data science. Our aim is to bridge the gap between cutting-edge academic insights and the practical needs of industry.

This year’s program focuses on fundamental research in risk-informed decision making and time series modeling. We will also cover applications and challenges of AI in aviation, education, sports analytics, and recommendation systems, as well as the technical hurdles associated with edge computing.

In addition to the technical talks, MLDAS 2023 will feature a panel discussion on responsible AI, where experts will tackle critical ethical and technical considerations in the field of machine learning and data science.

Location: Multipurpose Room, HBKU Research Complex

Organization: Qatar Computing Research Institute (QCRI), HBKU and Boeing Research & Technology.
The symposium is co-chaired by Dragos Margineantu (Boeing), Sanjay Chawla (QCRI, HBKU) and Safa Messaoud (QCRI, HBKU).

Local and Registration Chair: Keivin Isufaj (QCRI, HBKU)

Participation in MLDAS is free. Please fill out this form for in-person attendance.


AGENDA

Day 1
08:00 – 08:30
Registration and Coffee

08:30 – 09:00
Welcome and Opening Remarks

Session 1
Sanjay Chawla | Session Chair

09:00 – 09:45
The good, the bad and the ugly truth about AI in education
Abstract : I will take you on a journey introducing one possible vision for the future of education, giving practical examples of how it could be achieved and what role AI could play in achieving it.

09:45 – 10:30
Learning and Reasoning: A Soccer Analytics Story
Abstract : This talk will discuss our journey to develop novel ways to quantify the performance of professional soccer players and our struggles with how to evaluate the models that power our novel metrics. I will start from the motivation of data-driven scouting, where I will highlight why traditional statistics such as goals, assists, and pass completion percentage are insufficient for evaluating player performance. I will then present our approaches for assessing a player’s contributions to a match’s goal difference, where our conceptual framework has been adopted by most major data providers, and for measuring the creativity of their passing. A key challenge with these models is to ensure that practitioners will trust them. Concretely, we must ensure that the learned models do not display any unwanted or non-intuitive behavior. This talk will argue that the solution to this problem is to develop techniques that are able to reason about a learned model’s behavior. Moreover, I will advocate that using such approaches is a key part of evaluating learning pipelines, regardless of the problem domain, because it can help debug learned models and the data used to train them. I will present two generic approaches for gaining insight into how any tree ensemble will behave. First, I will discuss an approach for verifying whether a learned tree ensemble exhibits a wide range of behaviors. Second, I will describe an approach that identifies whether the tree ensemble is at a heightened risk of making a misprediction in a post-deployment setting.
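
For readers who want a feel for what reasoning about a learned model’s behavior can look like in practice, here is a minimal sketch (an editorial illustration, not the verification approach presented in the talk) that trains a scikit-learn tree ensemble on synthetic data, sweeps one input feature, and inspects how the predicted probability responds; the model choice and data are assumptions.

```python
# Minimal sketch: empirically probing a tree ensemble's behaviour by sweeping one
# feature and watching the predicted probability (synthetic data; not the formal
# verification approach discussed in the talk).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # toy target

model = GradientBoostingClassifier().fit(X, y)

base = X[0].copy()
for v in np.linspace(-3, 3, 7):
    probe = base.copy()
    probe[0] = v                                   # vary feature 0, hold the rest fixed
    p = model.predict_proba(probe.reshape(1, -1))[0, 1]
    print(f"feature_0 = {v:+.1f} -> P(y=1) = {p:.3f}")
```

A verifier of the kind described in the talk would certify such properties over entire input regions rather than a handful of probes.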

Session 2
Dragos Margineantu | Session Chair

10:30 – 11:00
Coffee Break

11:00 – 11:45
How to develop safe AI/ML for Aviation?
Abstract : Machine learning holds great promise to improve the performance and safety of aviation. However, its data-driven nature and large number of parameters make it difficult to guarantee the safety of the system. In this talk we will present some recent successes of applying machine learning to several aviation tasks, and discuss the challenges and how to develop assurance for learning-enabled algorithms in aviation. We demonstrate results from an “AI” pilot, detect-and-avoid, and intent prediction.

11:45 – 12:30
Recommendation systems: Challenges and solutions
Abstract : In this talk, I will present Machine Learning solutions for three specific challenges in recommendation systems –
Node recommendations in directed graphs: Given a directed graph, the problem is to recommend the top-k nodes with the highest likelihood of a link from a query node. We enhance GNNs with dual embeddings and propose adaptive neighborhood sampling techniques to handle asymmetric recommendations.
Delayed feedback: The problem is to train an ML model in the presence of target labels that may change over time due to delayed feedback of user actions. We employ an importance sampling strategy to deal with delayed feedback – the strategy corrects the bias in both target labels and feature computation, and leverages pre-conversion signals such as clicks.
Uncertainty in model predictions: For binary classification problems, we show that we can leverage uncertainty estimates for model predictions to improve accuracy. Specifically, we propose algorithms to select decision boundaries with multiple threshold values on model scores, one per uncertainty level, to increase recall without hurting precision.
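
As a rough illustration of the third challenge, the sketch below (a generic illustration of the idea, not the speaker’s exact algorithm) picks one decision threshold per uncertainty bucket so that each bucket still meets a target precision, which can raise recall compared to a single global threshold; the function name and bucketing scheme are assumptions.

```python
# Minimal sketch: per-uncertainty-level decision thresholds for binary classification.
import numpy as np

def per_bucket_thresholds(scores, uncertainty, labels, n_buckets=3, target_precision=0.9):
    """scores, uncertainty, labels: 1-D NumPy arrays from a validation set.
    Returns (lo, hi, threshold) tuples, one per uncertainty bucket."""
    edges = np.quantile(uncertainty, np.linspace(0, 1, n_buckets + 1))
    buckets = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uncertainty >= lo) & (uncertainty <= hi)
        s, y = scores[mask], labels[mask]
        # lowest threshold in this bucket that still meets the precision target
        valid = [t for t in np.unique(s) if y[s >= t].mean() >= target_precision]
        buckets.append((lo, hi, min(valid) if valid else np.inf))
    return buckets
```

Buckets where the model is well calibrated can afford lower thresholds, while high-uncertainty buckets keep stricter ones, so overall precision is preserved.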

12:30 – 14:00
Lunch Break

Session 3
Amin Sadeghi | Session Chair

14:00 – 14:45
CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society
Abstract : The rapid advancement of conversational and chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents and provide insight into their “cognitive” processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of chat agents, providing a valuable resource for investigating conversational language models. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond. More details of this CAMEL project can be seen here.

14:45 – 15:30
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Abstract : I will discuss Jais and Jais-chat, two state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than previous open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. I will discuss the training, the tuning, the safety alignment, and the evaluation, as well as the lessons we learned.

15:30 – 16:00
Coffee Break

16:00 – 16:45
Uncertainty in Compositional Models
Abstract : Deep learning studies functions represented as compositions of other functions, f = f_L ∘ … ∘ f_1. While there is ample evidence that these types of structures are beneficial for algorithmic design, there are significant questions as to whether the same is true when they are used to build statistical models. In this talk I will try to highlight some of the issues that are inherent to compositional functions. I will talk about the identifiability issues that, while beneficial for predictive algorithms, become challenging when building models. Rather than providing solutions, my aim is to highlight some issues related to compositional function modelling and to stimulate a discussion around these topics. I will, however, provide some initial results on compositional uncertainty to highlight some of the paths that we are currently exploring.
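
To make the identifiability point concrete, the toy example below (an editorial illustration using linear layers as a simplifying assumption, not material from the talk) shows two different parameterisations of a composition that compute exactly the same function.

```python
# Toy illustration of non-identifiability in compositional models: inserting an
# invertible map A between layers changes the parameters of f1 and f2 but leaves
# the end-to-end function f = f2 ∘ f1 unchanged.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))       # parameters of f1
W2 = rng.normal(size=(1, 3))       # parameters of f2
A = rng.normal(size=(3, 3))        # any invertible reparameterisation
A_inv = np.linalg.inv(A)

def f(x):                          # original composition f2(f1(x))
    return W2 @ (W1 @ x)

def f_reparam(x):                  # same function, different parameters
    return (W2 @ A_inv) @ ((A @ W1) @ x)

x = rng.normal(size=2)
print(np.allclose(f(x), f_reparam(x)))   # True: observationally equivalent
```

Many distinct parameter settings are observationally equivalent, which is harmless for prediction but complicates interpreting the parameters of a statistical model.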

16:45 – 17:30
Data-driven learning and control for operational resilience in large-scale networked cyberphysical systems
Abstract : Networked cyberphysical systems such as infrastructure networks, supply chains, and social networks are central to our lives. Yet, they often fail catastrophically when faced with unexpected disturbances or extreme events that push these networks far from equilibrium. Further, traditional control techniques do not scale well to such large-scale networks due to complexities arising from the network size, and multi-layered dynamical interactions between the physical networks, computing, communication, and human participants. On the other hand, purely data-driven and learning-based approaches to operating these networks do not provide guarantees on stability, safety, and robustness that are crucial in such safety-critical systems. In this talk, I will present frameworks that bridge data-driven models and learning-based control algorithms with domain-specific properties drawn from network physics, to guarantee operational resilience of large-scale networked dynamical systems under large disturbances. Specifically, I will discuss (i) physics-informed approaches to rapidly learn models of these systems that capture control-relevant properties like dissipativity, and (ii) scalable, compositional, and risk-tunable learning-based control designs that leverage these properties to provably guarantee operational resilience.

Day 2
08:00 – 08:30
Coffee

Session 4
Safa Messaoud | Session Chair

08:30 – 09:15
Risk Informed Decisions
Abstract : Scientific discovery is an interplay between observation and experimentation, and this talk looks at how machine learning can guide scientists towards better experiments. We discuss our experience in CSIRO, where we are researching, developing, and applying machine learning for scientific discovery. We consider the goal of designing an experiment such that the measured output is maximised, and illustrate it with an example from genome biology. Many approaches to adaptive experimental design trade off exploration and exploitation by considering the risk or uncertainty of predictive models, hence it is important to expand the class of efficient predictive distributions. We briefly cover some recent work on a flexible class of probability densities, called squared neural families, which have closed form normalization. We conclude by discussing opportunities and challenges in machine learning for scientific discovery.
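
The exploration–exploitation trade-off mentioned above can be made concrete with a generic adaptive-design loop; the sketch below is an editorial illustration using scikit-learn’s Gaussian process and a synthetic objective, not CSIRO’s pipeline or the squared-neural-families work.

```python
# Minimal sketch (generic adaptive experimental design): pick the next experiment
# by trading off the predicted mean (exploitation) against predictive uncertainty
# (exploration) with an upper-confidence-bound rule.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def measure(x):                                    # stand-in for a real experiment
    return np.sin(3 * x) + 0.1 * np.random.randn()

candidates = np.linspace(0, 2, 200).reshape(-1, 1)
X = candidates[[10, 150]]                          # two initial experiments
y = [measure(v[0]) for v in X]

for _ in range(5):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    nxt = candidates[np.argmax(mu + 2.0 * sigma)]  # UCB acquisition
    X = np.vstack([X, nxt])
    y = np.append(y, measure(nxt[0]))

print("best measured output so far:", float(max(y)))
```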

09:15 – 10:00
Robust Learning Ideas for AI Engineering
Abstract : TBD

10:00 – 10:30
Coffee Break

10:30 – 11:15
Towards Formal Verification and Robustification of Neural Systems in Aviation
Abstract : A major challenge in moving ML-based systems, such as ML-based computer vision, from R&D to production is the difficulty in understanding and ensuring their performance on the operational design domain. The standard ML approach is to extensively test models for various inputs. However, testing is inherently limited in coverage, and it is expensive in aviation. In this talk I will present novel verification technologies developed at Imperial College London as part of the recently concluded DARPA Assured Autonomy program and other UK and EU funded efforts.

Verification methods provide guarantees that a model meets its specifications in dense neighbourhoods of selected inputs. For example, by using verification methods we can establish whether a model is robust with respect to infinitely many noise patterns, or infinitely many lighting perturbations, applied to an input. Verification methods can also be tailored to specifications in the latent space and establish the robustness of models against semantic perturbations not definable in the input space (3D pose changes, background changes, etc.). Additionally, verification methods can be paired with learning to obtain robust learning methods capable of generating models inherently more robust than those that may be derived with standard methods.
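
For intuition on how such guarantees differ from testing, here is a minimal sketch of interval bound propagation, one simple verification idea (not necessarily the technology presented in the talk), applied to a tiny randomly initialised ReLU network.

```python
# Minimal sketch: interval bound propagation for a tiny ReLU network. Given an
# input box [x - eps, x + eps], it propagates sound lower/upper bounds; if the
# upper bound of the "wrong" logit stays below the lower bound of the "right"
# logit, robustness holds for the entire neighbourhood, not just tested points.
import numpy as np

def ibp_layer(lo, hi, W, b):
    c, r = (lo + hi) / 2, (hi - lo) / 2
    c2, r2 = W @ c + b, np.abs(W) @ r
    return c2 - r2, c2 + r2

def ibp_bounds(x, eps, layers):
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lo, hi = ibp_layer(lo, hi, W, b)
        if i < len(layers) - 1:                    # ReLU on hidden layers
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(2, 8)), np.zeros(2))]
lo, hi = ibp_bounds(rng.normal(size=4), eps=0.05, layers=layers)
print("class 0 certified if", hi[1], "<", lo[0])   # class 0 provably wins if this holds
```

Production verifiers use much tighter relaxations and scale to real vision models, but the certificate-over-a-region idea is the same.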

In the presentation I will succinctly cover the key theoretical results leading to some of the existing ML verification technology, illustrate the resulting toolsets and capabilities, and describe some of the use cases developed with our colleagues at Boeing, including centerline distance estimation, object detection, and runway detection.

I will argue that verification and robust learning can be used to obtain models that are inherently more robust, more performant, and better understood than those obtained with present learning and testing approaches.

11:15 – 12:00
Interpretable AI for scientific discovery using symbolic regression
Abstract : We overview the emerging area of symbolic regression (SR) for discovering concise mathematical expressions directly from data. Mathematical expressions are directly interpretable and are not only good predictors but can also be used for inferring causal behavior. SR reduces to discovering a unary-binary tree of mathematical symbols that is compatible with the data. We overview the current state-of-the-art techniques, including the use of transformers for SR. We will conclude by highlighting the current shortcomings of SR and suggesting directions for future research.
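
As a toy illustration of what searching over expression trees means (a brute-force editorial sketch, far simpler than the transformer-based methods the talk covers), the snippet below enumerates tiny unary–binary expressions and keeps the one that best explains the data.

```python
# Minimal sketch: brute-force symbolic regression over tiny expression trees of
# the form op(c * u1(x), u2(x)), recovering an interpretable formula from data.
import itertools
import numpy as np

x = np.linspace(-2, 2, 100)
y = 2 * x + np.sin(x)                       # "unknown" ground-truth law

unary = {"sin": np.sin, "id": lambda v: v}
binary = {"+": np.add, "*": np.multiply}
consts = [1.0, 2.0, 3.0]

best, best_err = None, np.inf
for u1, u2, op, c in itertools.product(unary, unary, binary, consts):
    pred = binary[op](c * unary[u1](x), unary[u2](x))
    err = np.mean((pred - y) ** 2)
    if err < best_err:
        best, best_err = f"{op}({c}*{u1}(x), {u2}(x))", err

print(best, best_err)                       # expect something like +(2.0*id(x), sin(x))
```

Real SR systems replace this exhaustive enumeration with genetic programming, neural-guided search, or transformers over symbol sequences.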

12:00 – 13:30
Lunch

Session 5
Ferda Ofli | Session Chair

13:30 – 14:15
Frontiers of Foundation Models for Time Series Modeling and Analysis
Abstract : Recent developments in deep learning have spurred research advances in time series modeling and analysis. Practical applications of time series raise a series of new challenges, such as multi-resolution data, multimodality, missing values, distributedness, and interpretability. In this talk, I will discuss possible paths to foundation models for time series data and future directions for time series research.

14:15 – 15:00
Securing edge workloads for mission critical applications
Abstract : Today, with its near-infinite resources, cloud computing can be used for a large number of use cases and scenarios. The value of the public cloud has been realized by many organizations and businesses; however, there are use cases that require cloud services to have near-real-time response with limited or intermittent dependency on the public internet. This introduces the notion of edge computing, where parts of cloud services (aka workloads) are moved to on-prem infrastructure so that these workloads can run with limited or intermittent connectivity to the cloud. One prominent use case for edge computing is the Industrial IoT (IIoT), which consists of internet-connected machinery and advanced analytics platforms executing AI and ML workloads and processing data close to where it is produced. Additionally, edge technologies hold a lot of promise for a diverse range of industries, including agriculture, healthcare, financial services, retail, and advertising. In this talk I will highlight some of the work Microsoft is doing across multiple verticals to achieve this goal. I will also present some of the hard technical and research challenges which remain to be solved to make edge computing for mission-critical applications a reality.

15:00 – 15:45
Beyond Traditional Threat Hunting: Leveraging Deep Learning for Log Analysis
Abstract : This presentation delves into the transformative potential of deep learning-based AI techniques in the realm of cybersecurity, particularly highlighting the complexities of threat hunting. Identifying threat behaviors within computer systems remains a crucial yet complex task, largely because the process is expert-driven, labor-intensive, and prone to errors. To address this, we will introduce a system designed to search for and pinpoint known threat behaviors within extensive system security logs. Our methodology involves converting security logs into a graph representation that captures the temporal and causal relations between different types of system entities, such as processes, files, and network sockets. To search for threat behaviors, our system harnesses graph neural networks, enabling efficient search over expansive graphs. Complementing this, our system draws upon the capabilities of advanced language models to convert textual descriptions of threat behaviors into query behavior graphs.
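
To illustrate the log-to-graph step, here is a minimal sketch (a hypothetical event schema of the editor’s, not the presented system, and assuming the networkx library) that turns a few security-log events into a typed, timestamped graph of processes, files, and network sockets, the kind of structure a graph neural network could then search.

```python
# Minimal sketch: building a provenance-style graph from a handful of security-log
# events; node names and relations are illustrative only.
import networkx as nx

events = [
    {"t": 1, "src": "powershell.exe", "rel": "wrote",     "dst": "C:/tmp/payload.dll"},
    {"t": 2, "src": "powershell.exe", "rel": "connected", "dst": "10.0.0.5:443"},
    {"t": 3, "src": "svchost.exe",    "rel": "read",      "dst": "C:/tmp/payload.dll"},
]

G = nx.MultiDiGraph()
for e in events:
    G.add_edge(e["src"], e["dst"], relation=e["rel"], time=e["t"])

# Edge attributes preserve the temporal/causal ordering used when matching a
# query behavior graph against the full provenance graph.
print(sorted(G.edges(data=True), key=lambda edge: edge[2]["time"]))
```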

15:45 – 16:15
Coffee Break

16:15 – 17:30
Panel Discussion: Safe and Responsible AI
Participants: All invited speakers. Moderator: Safa Messaoud (QCRI, HBKU)


SPEAKERS

Märt Aro



Märt Aro’s involvement in the field of education development dates back to his secondary school years, when he organised educational events for peers at a national level. Since 2004, Märt has established numerous organisations and companies in the area of education development. Märt is the co-founder of the DreamApply.com Student Application Management Platform, launched in 2011. DreamApply is used by universities from more than 40 countries, serving millions of users annually. Märt serves as Chairman of the Board at the NGO EdTech Estonia, which brings together 50+ founders of education innovation initiatives from Estonia with the aim of sharing experience and fostering cooperation. To drive education forward, Märt regularly collaborates with the United Nations, the European Union, the Estonian Ministry of Education and Research, and many others.

Jesse Davis



Jesse Davis is a Professor in the Department of Computer Science at KU Leuven, Belgium. His research focuses on developing novel artificial intelligence, data science, and machine learning techniques, with a particular emphasis on analyzing structured data. Jesse’s passions lie in using these techniques to make sense of lifestyle data, address problems in (elite) athlete monitoring and detect anomalies. He is particularly well known for his research on sports analytics, where he has focused on a broad range of sports including soccer, running, volleyball, and basketball while often working in close collaboration with practitioners. The media frequently covers his research on this topic with his work being featured in venues such as The Athletic, The New York Times, fivethirtyeight.com, and ESPN.com among others. Prior to joining KU Leuven, he obtained his bachelor’s degree from Williams College, his PhD from the University of Wisconsin, and completed a post-doc at the University of Washington. Jesse has co-founded and serves on the board of directors for runeasi, a KU Leuven spinoff. runeasi offers an app and wearable that provides real-time biomechanical feedback about running and walking. In this context, runeasi works with physiotherapists, several elite athletes and professional sports clubs to support training and rehab.

Sebastian Scherer



Sebastian Scherer is an Associate Research Professor at the Robotics Institute (RI) at Carnegie Mellon University (CMU). His research focuses on enabling autonomy in challenging environments, and he previously led CMU’s entry in the SubT Challenge. He and his team have demonstrated several firsts in autonomy for flying robots and off-road driving. Dr. Scherer received his B.S. in Computer Science, and his M.S. and Ph.D. in Robotics, from CMU in 2004, 2007, and 2010.

Rajeev Rastogi



Rajeev Rastogi is the Vice President of Machine Learning (ML) for Amazon’s International Stores business. He leads the development of ML solutions in the areas of Search, Advertising, Deals, Catalog Quality, Payments, Forecasting, Question Answering, Grocery Grading, etc. Previously, he was Vice President of Yahoo! Labs Bangalore and the founding Director of the Bell Labs Research Center in Bangalore, India. Rajeev is an ACM Fellow and a Bell Labs Fellow. He has published over 125 papers, and holds over 100 patents. He currently serves on the editorial board of the CACM, and has been an Associate editor for IEEE Transactions on Knowledge and Data Engineering in the past. Rajeev received his B. Tech degree from IIT Bombay, and a PhD degree in Computer Science from the University of Texas, Austin.

Bernard Ghanem



Bernard Ghanem is currently a Professor in the CEMSE division, a theme leader at the Visual Computing Center (VCC), and the Deputy Director of the AI Initiative at KAUST. His research interests lie in computer vision and machine learning with emphasis on topics in video understanding, 3D recognition, and foundations of deep learning. He received his bachelor’s degree from the American University of Beirut (AUB) in 2005 and his MS/PhD from the University of Illinois at Urbana-Champaign (UIUC) in 2010. His work has received several awards and honors, including six Best Paper Awards at workshops in CVPR, ECCV, and ICCV, a Google Faculty Research Award in 2015 (1st in MENA for Machine Perception), and the Abdul Hameed Shoman Arab Researchers Award for Big Data and Machine Learning in 2020. He has co-authored more than 150 papers in his field as well as three issued patents. He serves as an Associate Editor for IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and has served as Area Chair (AC) for the main computer vision and machine learning conferences, including CVPR, ICCV, ECCV, NeurIPS, ICLR, and AAAI.

Preslav Nakov



Preslav Nakov is Professor and Department Chair for NLP at the Mohamed bin Zayed University of Artificial Intelligence. Previously, he was Principal Scientist at the Qatar Computing Research Institute, HBKU, where he led the Tanbih mega-project, developed in collaboration with MIT, which aims to limit the impact of “fake news”, propaganda and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. He received his PhD degree in Computer Science from the University of California at Berkeley, supported by a Fulbright grant. He is Chair-Elect of the Association for Computational Linguistics (ACL), Secretary of ACL SIGSLAV, and Secretary of the Truth and Trust Online board of trustees. Formerly, he was PC chair of ACL 2022, and President of ACL SIGLEX. He is also a member of the editorial board of several journals including Computational Linguistics, TACL, ACM TOIS, IEEE TASL, IEEE TAC, CS&L, NLE, AI Communications, and Frontiers in AI. He authored a Morgan & Claypool book on Semantic Relations between Nominals, two books on computer algorithms, and 250+ research papers. He received a Best Paper Award at ACM WebSci’2022, a Best Long Paper Award at CIKM’2020, a Best Demo Paper Award (Honorable Mention) at ACL’2020, a Best Task Paper Award (Honorable Mention) at SemEval’2020, a Best Poster Award at SocInfo’2019, and the Young Researcher Award at RANLP’2011. He was also the first to receive the Bulgarian President’s John Atanasoff award, named after the inventor of the first automatic electronic digital computer. His research was featured by over 100 news outlets, including Reuters, Forbes, Financial Times, CNN, Boston Globe, Aljazeera, DefenseOne, Business Insider, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, WIRED, and Engadget, among others.

Carl Henrik Ek



Dr. Carl Henrik Ek is an Associate Professor at the University of Cambridge. Together with Prof. Neil Lawrence, Jessica Montgomery, and Dr. Ferenc Huszár, he leads the machine learning research group in the Cambridge computer lab. He is interested in building models that allow for a principled treatment of uncertainty and provide interpretable handles for introducing strong prior knowledge. He has worked extensively on Bayesian non-parametrics, in particular Gaussian processes.

Sivaranjani Seetharaman



Sivaranjani Seetharaman is an Assistant Professor in the School of Industrial Engineering at Purdue University. Previously, she was a postdoctoral researcher in the Department of Electrical Engineering at Texas A&M University, and the Texas A&M Research Institute for Foundations of Interdisciplinary Data Science (FIDS). She received her PhD in Electrical Engineering from the University of Notre Dame, and her Master’s and undergraduate degrees, also in Electrical Engineering, from the Indian Institute of Science, and PES Institute of Technology, respectively. Sivaranjani has been a recipient of the Schlumberger Foundation Faculty for the Future fellowship, the Zonta International Amelia Earhart fellowship, and the Notre Dame Ethical Leaders in STEM fellowship. She was named among MIT Technology Review’s Innovators Under 35 (TR35) in 2023, and the MIT Rising Stars in EECS in 2018. Her research interests lie at the intersection of control and machine learning in large-scale networked systems, with applications to energy systems, transportation networks, and human-autonomous systems.

Cheng Soon Ong



Cheng Soon Ong is a senior principal research scientist in the Statistical Machine Learning Group at Data61, CSIRO, and is the director of the machine learning and artificial intelligence future science platform at CSIRO. He is also an adjunct associate professor at the Australian National University. His research interest is in enabling scientific discovery with machine learning, and he has collaborated widely with biologists and astronomers. He is co-author of the textbook Mathematics for Machine Learning, and his career has spanned multiple roles in Malaysia, Germany, Switzerland, and Australia.

Dragos Margineantu



Dragos Margineantu is a Boeing Senior Technical Fellow and Artificial Intelligence (AI) Chief Technologist, leading AI research and engineering at Boeing. At Boeing, he developed machine learning (ML)-based solutions for autonomous flight, manufacturing, airplane maintenance, airplane performance, surveillance, and security. Margineantu serves as the Boeing principal investigator (PI) of multiple Defense Advanced Research Projects Agency (DARPA) projects and has chaired major AI and data science conferences.
He mentored graduate students at Massachusetts Institute of Technology (MIT) and KU Leuven in Belgium, served on Canada Research Chair committees, and on NSF review panels. He was one of the initiators of the Machine Learning Data Analytics Symposia (MLDAS) with Qatar Computing Research Institute in 2014.
Margineantu has a Master’s degree from the Politehnica University of Timisoara and a Doctorate degree in Computer Science/Machine Learning from Oregon State University.

Alessio Lomuscio



Alessio Lomuscio is Professor of Safe Artificial Intelligence in the Department of Computing at Imperial College London (UK), where he leads the Verification of Autonomous Systems Lab. He is a Distinguished
ACM member, a Fellow of the European Association of Artificial Intelligence and currently holds a Royal Academy of Engineering Chair in Emerging Technologies. He is founding co-director of the UKRI Doctoral Training Centre in Safe and Trusted Artificial Intelligence.
Alessio’s research interests concern the development of verification methods for artificial intelligence. Since 2000 he has worked on the development of formal methods for the verification of autonomous systems and multi-agent systems, both symbolic and ML-based. He has published approximately 200 papers in AI conferences (including IJCAI, KR, AAAI, CVPR, AAMAS, ECAI), verification and formal methods conferences (CAV, SEFM, ATVA), and international journals (AIJ, JAIR, ACM ToCL, JAAMAS, Information and Computation). He is an editorial board member for AIJ, JAIR, and JAAMAS, and recently served as general chair for AAMAS 2021.
He is founder and CEO of Safe Intelligence, a VC-backed Imperial College London startup developing tools to help users verify and build robust ML.

Nour Makke



Dr. Nour Makke is currently focused on the research and development of interpretable AI solutions for the physical sciences. Her prior experience includes active involvement in high-energy physics experiments conducted at CERN. She completed her PhD at the University of Paris-Sud (now Paris-Saclay) in France.

Yan Liu



Yan Liu is a Professor in the Computer Science Department and the Director of the Machine Learning Center at the University of Southern California. She received her Ph.D. degree from Carnegie Mellon University. Her research interest is machine learning and its applications to climate science, health care, and sustainability. She has received several awards, including the NSF CAREER Award, the Okawa Foundation Research Award, New Voices of the Academies of Science, Engineering, and Medicine, and a Best Paper Award at the SIAM Data Mining Conference. She served as general chair for KDD 2020 and ICLR 2023, and as program chair for WSDM 2018, SDM 2020, KDD 2022, and ICLR 2022.

Arjmand Samuel



Dr. Arjmand Samuel is the head of product for Azure Edge Workload Security at Microsoft, where he leads the design and development of security technologies for edge workloads. In his past roles, Arjmand has led the design and development of Azure IoT cloud services and edge platforms, including Azure IoT Edge and the Azure IoT Device SDKs. He has also led the Azure IoT security product and engineering teams. Prior to that, Arjmand led external academic collaborations around devices and services research for Microsoft Research, where he developed collaborative programs and research initiatives to harness the power of the Internet of Things. His other research projects included software architectures and programming paradigms for devices of all shapes and forms. He has published widely on topics of security, privacy, location-aware access control, and innovative use of mobile technology. Arjmand has a Ph.D. in Information Security from Purdue University, USA.

Husrev Taha Sencar



Husrev Taha Sencar serves as a principal scientist with the Cybersecurity group at QCRI. Before joining QCRI in 2019, he held the position of associate professor in the Computer Engineering Department at TOBB University in Ankara, Turkey. Between 2012 and 2015, Taha led the CCS-AD, a research extension of NYU New York’s cybersecurity center based in NYUAD. His research has been centered on developing sophisticated forensic solutions concerning digital evidence retrieval, search, attribution, and verification. Currently, Taha is leveraging AI-driven solutions to design security systems and tools tailored for scenarios with a pronounced automation gap.