General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. In particular, the unlabeled data describes participants in health examinations whose health conditions can vary greatly, from healthy to very ill. There is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk prediction, which classifies a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and a proof of convergence is given. Extensive experiments on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data. It falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.
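To make the labeled/unlabeled setting concrete, here is a minimal sketch of self-training, one simple semi-supervised strategy. It is not the SHG-Health algorithm; the function name `self_train`, the `-1` marker for unlabeled rows, the nearest-centroid classifier, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def self_train(X, y, batch=2, max_rounds=10):
    """Self-training with a nearest-centroid classifier (illustrative only).

    X: (n, d) feature matrix.
    y: (n,) integer labels, with -1 marking unlabeled rows.
    Assumes at least two classes appear among the labeled rows.
    """
    y = np.asarray(y).copy()
    for _ in range(max_rounds):
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:
            break
        classes = np.unique(y[y != -1])
        # One centroid per class, computed from the currently labeled rows.
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        # Distance from every unlabeled row to every class centroid.
        dist = np.linalg.norm(X[unlabeled][:, None, :] - centroids[None, :, :], axis=2)
        # Confidence = gap between the nearest and second-nearest centroid.
        ordered = np.sort(dist, axis=1)
        margin = ordered[:, 1] - ordered[:, 0]
        # Pseudo-label only the most confident rows in this round.
        pick = np.argsort(-margin)[:batch]
        y[unlabeled[pick]] = classes[dist[pick].argmin(axis=1)]
    return y

# Toy data: two well-separated groups, one labeled example in each.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.4, 0.2],
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1], [2.6, 2.7]])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])
completed = self_train(X, y)
print(completed)
```

With only two labeled rows, the six unlabeled rows are labeled in successive rounds, most confident first, and each new pseudo-label refines the class centroids used in the next round.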
Although Electronic Health Records (EHRs) have attracted increasing research attention in the data mining and machine learning communities, the existing approach is limited to a binary classification problem (using alive/deceased labels) and is consequently not informative about the specific disease area in which a person is at risk. Classification with unlabeled data is commonly handled via Semi-Supervised Learning (SSL), which learns from both labeled and unlabeled data, and Positive and Unlabeled (PU) learning, a special case of SSL that learns from positive and unlabeled data alone.
Mining health examination data is challenging, especially because of its heterogeneity, intrinsic noise, and, in particular, the large volume of unlabeled data. In this paper, we introduce SHG-Health, an effective and efficient graph-based semi-supervised algorithm, to meet these challenges. Our proposed graph-based classification approach to mining health examination records has a few significant advantages.
First, health examination records are represented as a graph that associates all relevant cases together. This is especially useful for modeling abnormal results, which are often sparse. Second, multi-typed relationships among data items can be captured and naturally mapped into a heterogeneous graph. In particular, the health examination items are represented as different types of nodes on a graph, which enables our method to exploit the underlying heterogeneous subgraph structures of individual classes to achieve higher performance. Third, features can be weighted within their own type through a label propagation process on a heterogeneous graph. These in-class weighted features then contribute to effective classification in an iterative convergence process. Our work shows a new way of predicting risks for participants based on their annual health examinations.
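The idea of propagating labels over a graph of records and examination items can be sketched as follows. This is a hedged illustration, not the SHG-Health update rule, which is not given here: it uses the standard label spreading iteration F = alpha * S * F + (1 - alpha) * Y on a bipartite record-item graph with a single edge type. The function name `propagate`, the incidence matrix `B`, the damping value `alpha`, and the toy data are all assumptions.

```python
import numpy as np

def propagate(B, y_records, n_classes, alpha=0.8, n_iter=200, tol=1e-9):
    """Label spreading on a bipartite record-item graph (illustrative only).

    B: (n_records, n_items) incidence matrix; B[i, j] = 1 if record i
       has an abnormal result for examination item j.
    y_records: class index per record, with -1 for unlabeled records.
    Returns class scores for record nodes and for item nodes.
    """
    n_rec, n_item = B.shape
    n = n_rec + n_item
    # Adjacency of the whole graph: records link only to items.
    A = np.zeros((n, n))
    A[:n_rec, n_rec:] = B
    A[n_rec:, :n_rec] = B.T
    # Symmetric normalization S = D^(-1/2) A D^(-1/2), guarding isolated nodes.
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros(n)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    S = A * inv_sqrt[:, None] * inv_sqrt[None, :]
    # Seed matrix: one-hot rows for labeled records, zeros everywhere else.
    Y = np.zeros((n, n_classes))
    for i, c in enumerate(y_records):
        if c >= 0:
            Y[i, c] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F_new = alpha * (S @ F) + (1 - alpha) * Y
        converged = np.abs(F_new - F).max() < tol
        F = F_new
        if converged:
            break
    return F[:n_rec], F[n_rec:]

# Toy data: 6 records, 5 items; item 4 is shared across the two groups.
B = np.array([[1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 1]], dtype=float)
y_records = [0, -1, -1, 1, -1, -1]  # only records 0 and 3 are labeled
rec_scores, item_scores = propagate(B, y_records, n_classes=2)
record_pred = rec_scores.argmax(axis=1)
print(record_pred)
```

In this toy graph, unlabeled records inherit the class of the labeled record with which they share examination items, and the item nodes receive class scores of their own, which loosely mirrors the idea of weighting features per class. Because `alpha` is below 1, the iteration is a contraction and settles to a fixed point.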