| Title: | Data Leakage Detection Tools for Machine Learning |
|---|---|
| Description: | Provides utilities to detect common data leakage patterns including train/test contamination, temporal leakage, and data duplication, enhancing model reliability and reproducibility in machine learning workflows. Generates diagnostic reports and visual summaries to support data validation. Methods based on best practices from Hastie, Tibshirani, and Friedman (2009, ISBN:978-0387848570). |
| Authors: | Cheryl Isabella Lim [aut, cre] |
| Maintainer: | Cheryl Isabella Lim <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-05 07:44:56 UTC |
| Source: | https://github.com/cherylisabella/leakr |
Compiles detection results into a report with sorted, severity-scored findings and detailed metadata, optionally including the configuration used.
compile_report(results, audit_data, config, show_config = FALSE, top_n = 10, report = "default")
results |
A list containing detection results. |
audit_data |
The audit data used for the report. |
config |
Configuration settings, including whether to use numeric severity scores. |
show_config |
Logical, whether to display the configuration used for report generation. Defaults to FALSE. |
top_n |
Numeric, the number of top results to display in the report. Defaults to 10. |
report |
A string indicating the type of report to generate. Defaults to "default". |
A leakr_report object containing the summary, evidence, and metadata for the report.
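The sorting and top_n behaviour described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the findings table, its column names, the four severity labels, and the helper rank_findings_sketch are all assumptions made for illustration.

```r
# Hypothetical findings table; the column names are assumptions.
findings <- data.frame(
  detector = c("duplication_detection", "target_leakage", "train_test_contamination"),
  severity = c("low", "critical", "medium"),
  stringsAsFactors = FALSE
)

rank_findings_sketch <- function(findings, top_n = 10) {
  # Assumed severity scale, from least to most severe.
  severity_levels <- c("low", "medium", "high", "critical")
  # Map each label to a numeric score so findings can be sorted.
  findings$score <- match(findings$severity, severity_levels)
  # Most severe first, then keep at most top_n rows.
  ranked <- findings[order(-findings$score), , drop = FALSE]
  head(ranked, top_n)
}

top <- rank_findings_sketch(findings, top_n = 2)
print(top)
```

A numeric score keeps the ordering explicit and makes it easy to truncate the report to the most severe findings.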
Format detector names by converting them to title case and separating words by spaces.
format_detector_name(detector_name)
detector_name |
A string to format, typically a detector name with underscores. |
A title-cased, space-separated string.
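As a rough illustration of this kind of formatting, the base R sketch below splits a name on underscores and capitalises each word. The helper format_name_sketch is hypothetical and may differ from what format_detector_name actually does internally.

```r
format_name_sketch <- function(detector_name) {
  # Split on underscores and drop empty pieces (e.g. from "a__b").
  words <- strsplit(detector_name, "_", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # Upper-case the first letter of each word, lower-case the rest.
  words <- paste0(toupper(substring(words, 1, 1)), tolower(substring(words, 2)))
  paste(words, collapse = " ")
}

formatted <- format_name_sketch("train_test_contamination")
print(formatted)
```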
Null-coalescing operator for clean default value handling
x %||% y
x |
First value to check |
y |
Fallback value if x is NULL |
x if not NULL, otherwise y
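A null-coalescing operator of this kind is commonly written as a one-line function. The definition below is a sketch of that common pattern, not necessarily the package's own source; the config list and its top_n field are hypothetical.

```r
# Return x unless it is NULL, in which case fall back to y.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical configuration list with one option missing.
config <- list(top_n = 5)

top_n <- config$top_n %||% 10            # present, so the configured value is kept
threshold <- config$threshold %||% 0.1   # missing, so the fallback is used
```

This keeps default handling in one expression instead of a separate if (is.null(...)) block for every option.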
leakr: Data Leakage Detection for Machine Learning in R
The leakr package provides tools to automatically detect common data leakage patterns in machine learning workflows for tabular data. It identifies train/test contamination, target leakage, and duplicate rows with clear diagnostic reports and visualisations.
Train/Test Contamination: Detects ID overlaps and distributional shifts between training and test sets
Target Leakage: Identifies features with suspicious correlations to the target variable
Duplication Detection: Finds exact and near-duplicate rows
Clear Reports: Generates severity-ranked diagnostics with actionable recommendations
Visualisations: Creates diagnostic plots to highlight issues
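To make the three checks concrete, here is a minimal base R sketch of each idea on a toy data set. It does not use leakr and is not the package's algorithm: the data, the column names, the 0.95 correlation threshold, and the helper name are assumptions.

```r
# Toy data: id 5 appears in both splits, and 'leak' is a rescaled copy of 'y'.
data <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L),
  split = c("train", "train", "train", "train", "train", "test", "test", "test"),
  x     = c(3, 1, 4, 1, 5, 5, 9, 2),
  leak  = c(0, 0, 2, 2, 2, 2, 0, 2),
  y     = c(0, 0, 1, 1, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# 1. Train/test contamination: IDs that occur in both splits.
overlapping_ids <- intersect(data$id[data$split == "train"],
                             data$id[data$split == "test"])

# 2. Target leakage: features whose absolute correlation with the target
#    exceeds an (assumed) threshold.
suspicious_features_sketch <- function(data, features, target, threshold = 0.95) {
  cors <- vapply(features,
                 function(f) abs(cor(data[[f]], data[[target]])),
                 numeric(1))
  names(cors)[cors > threshold]
}
suspicious <- suspicious_features_sketch(data, c("x", "leak"), "y")

# 3. Duplication: rows that repeat an earlier row on the feature columns.
duplicate_rows <- which(duplicated(data[c("x", "leak", "y")]))
```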
leakr_audit: Main function for comprehensive leakage detection
leakr_summarise: Generate human-readable summaries
leakr_plot: Create diagnostic visualisations
train_test_contamination: Checks for overlap between train/test sets
target_leakage: Identifies suspicious feature-target relationships
duplication_detection: Finds duplicate rows in datasets
Accepts data.frame, tibble, and data.table objects.
# Audit a dataset for leakage
library(leakr)
report <- leakr_audit(my_data, target = "outcome")
# View summary of issues found
leakr_summarise(report)
# Create diagnostic plots
leakr_plot(report)
Maintainer: Cheryl Isabella Lim [email protected]
Report bugs at https://github.com/cherylisabella/leakr/issues
This function audits a dataset for potential data leakage, running a series of predefined detectors and generating a comprehensive report with detailed findings.
leakr_audit(data, target = NULL, split = NULL, id = NULL, detectors = NULL, config = list())
data |
The dataset to be audited (data frame or tibble). |
target |
The target variable (optional). If NULL, no target variable is assumed. |
split |
The split variable used for training/test split (optional). If NULL, no split is assumed. |
id |
The unique identifier for each row (optional). If NULL, no id is used. |
detectors |
A vector of detector names to run (optional). If NULL, all available detectors will be used. |
config |
A list of configuration parameters for the audit. Defaults to an empty list. |
A leakr_report object containing the audit results, including summary, evidence, and metadata.
# Basic audit on iris dataset
report <- leakr_audit(iris, target = "Species")
print(report)
Save data and metadata for reproducible leakage analysis with optimised performance.
leakr_create_snapshot(data, output_dir = file.path(tempdir(), "leakr_snapshots"), snapshot_name = NULL, metadata = list(), sample_for_hash = TRUE)
data |
Data.frame to snapshot |
output_dir |
Directory for snapshot files |
snapshot_name |
Name for this snapshot |
metadata |
Additional metadata to store |
sample_for_hash |
Whether to sample large datasets for faster hashing |
Path to snapshot directory
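The sample_for_hash option suggests a trade-off between hashing speed and coverage. The sketch below shows one way such a sampled fingerprint could work in base R. It is an assumption-laden illustration: the helper hash_data_sketch, the max_rows parameter, the evenly spaced sampling, and the use of an MD5 checksum over a serialised file are not taken from the package.

```r
hash_data_sketch <- function(data, sample_for_hash = TRUE, max_rows = 1000) {
  if (sample_for_hash && nrow(data) > max_rows) {
    # Deterministic, evenly spaced row sample so repeated calls agree.
    idx <- unique(round(seq(1, nrow(data), length.out = max_rows)))
    data <- data[idx, , drop = FALSE]
  }
  # Serialise to a temporary file and checksum that file.
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  saveRDS(data, tmp, compress = FALSE)
  unname(tools::md5sum(tmp))
}

full_hash    <- hash_data_sketch(iris, sample_for_hash = FALSE)
sampled_hash <- hash_data_sketch(iris, sample_for_hash = TRUE, max_rows = 50)
```

A sampled hash is cheaper on large data but can miss changes confined to rows outside the sample, so a full hash is the safer choice when strict reproducibility matters.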
Save processed data to different file formats with consistent behaviour.
leakr_export_data(data, file_path, format = "csv", verbose = TRUE, ...)
data |
Data.frame to export |
file_path |
Output file path |
format |
Output format: "csv", "excel", "rds", "json", "parquet" |
verbose |
Whether to show export messages |
... |
TODO: Add description |
Path to exported file (invisibly)
Extract data from caret train objects for leakage analysis.
leakr_from_caret(train_obj, original_data = NULL, target_name = "target")
train_obj |
caret train object |
original_data |
Original training data (if available) |
target_name |
Custom name for target variable (default: "target") |
List with data and metadata
Extract data from mlr3 Task objects for leakage analysis.
leakr_from_mlr3(task, include_target = TRUE)
task |
mlr3 Task object (TaskClassif, TaskRegr, etc.) |
include_target |
Whether to include target variable in output |
List with data, target, and metadata
Extract data from tidymodels workflows for leakage analysis.
leakr_from_tidymodels(workflow, data)
workflow |
tidymodels workflow object |
data |
Original training data |
List with data and metadata
Flexible data import function supporting multiple formats with automatic format detection and preprocessing for leakage analysis.
leakr_import(source, format = "auto", preprocessing = list(), encoding = "UTF-8", sheet = NULL, verbose = TRUE, ...)
source |
Path to data file, data.frame, or other supported object. |
format |
Data format: "auto", "csv", "excel", "rds", "json", "parquet", "tsv". If "auto", the format will be detected from the file extension. |
preprocessing |
List of preprocessing options to apply after import. |
encoding |
Character encoding for reading files. Default is "UTF-8". |
sheet |
Sheet name or index to read (for Excel files). Default is NULL. |
verbose |
Logical indicating whether to print progress messages. Default TRUE. |
... |
Additional arguments passed to specific import functions. |
Standardised data.frame suitable for leakage analysis
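Automatic format detection from a file extension can be sketched in a few lines of base R. The helper detect_format_sketch and its extension table are assumptions for illustration; the extensions that leakr_import actually recognises may differ.

```r
detect_format_sketch <- function(path) {
  # Extension lookup is case-insensitive.
  ext <- tolower(tools::file_ext(path))
  # Assumed mapping from file extension to format name.
  formats <- c(csv = "csv", tsv = "tsv", xlsx = "excel", xls = "excel",
               rds = "rds", json = "json", parquet = "parquet")
  if (!(ext %in% names(formats))) {
    stop("Cannot detect format from extension: '", ext, "'")
  }
  unname(formats[ext])
}

detected <- detect_format_sketch("data/customers.CSV")
print(detected)
```

Failing loudly on an unknown extension avoids silently reading a file with the wrong parser; passing the format explicitly remains the fallback.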
Display comprehensive information about available data snapshots.
leakr_list_snapshots(snapshots_dir = file.path(tempdir(), "leakr_snapshots"), include_metadata = TRUE)
snapshots_dir |
Directory containing snapshots |
include_metadata |
Whether to load detailed metadata for each snapshot |
Data.frame with snapshot information
Restore data from a previously created snapshot with integrity checking.
leakr_load_snapshot(snapshot_path, format = "rds", verify_integrity = TRUE)
snapshot_path |
Path to snapshot directory |
format |
Format to load: "rds" (recommended), "csv" |
verify_integrity |
Whether to verify data integrity using hash |
Data.frame from snapshot
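Hash-based integrity checking can be pictured as a save and load round trip: store a checksum next to the data, then compare it before reading the data back. The base R sketch below assumes a file layout (data.rds plus data.md5) and helper names that are purely illustrative and not the package's snapshot format.

```r
save_snapshot_sketch <- function(data, snapshot_dir) {
  dir.create(snapshot_dir, recursive = TRUE, showWarnings = FALSE)
  data_file <- file.path(snapshot_dir, "data.rds")
  saveRDS(data, data_file)
  # Record the checksum of the file that was just written.
  writeLines(unname(tools::md5sum(data_file)), file.path(snapshot_dir, "data.md5"))
  invisible(snapshot_dir)
}

load_snapshot_sketch <- function(snapshot_dir, verify_integrity = TRUE) {
  data_file <- file.path(snapshot_dir, "data.rds")
  if (verify_integrity) {
    expected <- readLines(file.path(snapshot_dir, "data.md5"), n = 1)
    actual <- unname(tools::md5sum(data_file))
    if (!identical(actual, expected)) {
      stop("Snapshot integrity check failed: checksum mismatch")
    }
  }
  readRDS(data_file)
}

snapshot_dir <- file.path(tempdir(), "leakr_snapshot_sketch")
save_snapshot_sketch(iris, snapshot_dir)
restored <- load_snapshot_sketch(snapshot_dir)
```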
Plot leakage detection results
leakr_plot(x, ...)
x |
Results from leakr_audit |
... |
TODO: Add description |
A ggplot object
Minimal quick import for typical user workflows. Uses leakr_import internally.
leakr_quick_import(source, ...)
source |
File path or data.frame |
... |
TODO: Add description |
Standardised data.frame
Displays a formatted summary of a leakage audit report, including the severity and the top issues detected. Optionally, it also displays the configuration used for the audit.
leakr_summarise(report, top_n = 10, show_config = FALSE, config = NULL, audit_data = NULL, detectors = NULL, libname = NULL, pkgname = NULL)
report |
A report object, as returned by leakr_audit. |
top_n |
Maximum number of issues to display in the summary. Defaults to 10. |
show_config |
Whether to display the configuration details used for the audit. Defaults to FALSE. |
config |
(Optional) A configuration list. This argument is not used directly in the function, but is referenced in the report metadata. Defaults to NULL. |
audit_data |
(Optional) The data used for auditing. This argument is not used directly in the function, but is part of the report metadata. Defaults to NULL. |
detectors |
(Optional) A vector of detectors used for the audit. This argument is not used directly in the function, but is part of the report metadata. Defaults to NULL. |
libname |
(Optional) The name of the library. This is included for internal package functionality. |
pkgname |
(Optional) The name of the package. This is included for internal package functionality. |
An invisible data.frame summarizing the top n issues detected.
# Create and summarise a report
report <- leakr_audit(iris, target = "Species")
leakr_summarise(report, top_n = 5)
Returns the names of all detectors currently registered in the system. This is useful for checking which detectors are available.
list_registered_detectors()
A character vector containing the names of all registered detectors.
list_registered_detectors()
Create a new temporal detector
new_temporal_detector(time_col, lookahead_window = 1)
time_col |
Character. Name of the time column |
lookahead_window |
Numeric. Lookahead window size (default 1). |
A temporal_detector object
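A constructor like this typically validates its inputs and returns a classed list. The sketch below shows that S3 pattern together with one simple temporal check: flagging training rows dated on or after the earliest test row. Everything here is an assumption for illustration, including the "_sketch" helper names, the extra "detector" class, and the split column. The lookahead_window value is stored but not interpreted, since its exact semantics are not documented above.

```r
new_temporal_detector_sketch <- function(time_col, lookahead_window = 1) {
  stopifnot(is.character(time_col), length(time_col) == 1,
            is.numeric(lookahead_window), length(lookahead_window) == 1,
            lookahead_window >= 0)
  structure(
    list(time_col = time_col, lookahead_window = lookahead_window),
    class = c("temporal_detector", "detector")
  )
}

# Assumed check: training rows that are not strictly earlier than every test row.
find_future_train_rows_sketch <- function(data, detector, split_col) {
  times <- data[[detector$time_col]]
  cutoff <- min(times[data[[split_col]] == "test"])
  which(data[[split_col]] == "train" & times >= cutoff)
}

detector <- new_temporal_detector_sketch("date")
data <- data.frame(
  date  = as.Date(c("2024-01-01", "2024-01-05", "2024-02-10", "2024-02-01", "2024-02-15")),
  split = c("train", "train", "train", "test", "test"),
  stringsAsFactors = FALSE
)
future_rows <- find_future_train_rows_sketch(data, detector, split_col = "split")
```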
Create a new train-test detector
new_train_test_detector(threshold = 0.1)
threshold |
TODO: Document |
A train_test_detector object
Plot a detector_result object
## S3 method for class 'detector_result'
plot(x, palette = NULL, ...)
x |
TODO: Document |
palette |
TODO: Document |
... |
TODO: Document |
A ggplot object, invisibly. Printed if interactive
This function generates a bar plot of leakage issues detected by different detectors.
The plot displays the count of issues by severity level for each detector in a udld_report object.
## S3 method for class 'udld_report'
plot(x, palette = NULL, ...)
x |
A udld_report object. |
palette |
Optional. A |
... |
Additional arguments passed to |
A ggplot object, invisibly. The plot is printed if the session is interactive.
Register a new data leakage detector function
register_detector(name, fun, description = "")
name |
Name of the detector |
fun |
TODO: Add description |
description |
TODO: Add description |
Invisibly returns registration status
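A detector registry of this kind is often backed by an environment that maps names to functions. The sketch below illustrates that general pattern in base R. The registry object and the "_sketch" helpers are hypothetical, and the package's internal storage may look different.

```r
# Private environment acting as the registry (assumed design).
detector_registry_sketch <- new.env(parent = emptyenv())

register_detector_sketch <- function(name, fun, description = "") {
  stopifnot(is.character(name), length(name) == 1, is.function(fun))
  assign(name, list(fun = fun, description = description),
         envir = detector_registry_sketch)
  invisible(TRUE)
}

list_detectors_sketch <- function() {
  sort(ls(detector_registry_sketch))
}

# Register a toy detector that counts fully duplicated rows.
register_detector_sketch(
  "row_duplicates",
  function(data) sum(duplicated(data)),
  description = "Counts exact duplicate rows"
)
register_detector_sketch("always_clean", function(data) 0L)

registered <- list_detectors_sketch()
n_dups <- get("row_duplicates", envir = detector_registry_sketch)$fun(
  data.frame(a = c(1, 1, 2))
)
```

Using an environment rather than a global list lets registration mutate shared state without reassigning a variable in the caller's scope.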
Run a detector on data
run_detector(detector, data, split = NULL, id = NULL, config = list())
detector |
A detector object |
data |
Data frame to analyze |
split |
Split vector indicating train/test assignment (optional) |
id |
Optional ID column name |
config |
Optional configuration list |
A detector result object
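Running a detector object on data is a natural fit for S3 dispatch, where the detector's class selects the method. The following sketch shows that pattern with a toy duplicate-row detector. The generic, the classes, and the fields of the result are hypothetical and are not the package's definitions.

```r
# Generic that dispatches on the class of the detector object.
run_detector_sketch <- function(detector, data, ...) {
  UseMethod("run_detector_sketch")
}

# Method for a toy detector class that flags exact duplicate rows.
run_detector_sketch.duplicate_detector_sketch <- function(detector, data, ...) {
  dup_rows <- which(duplicated(data))
  structure(
    list(detector = class(detector)[1],
         n_issues = length(dup_rows),
         rows = dup_rows),
    class = "detector_result_sketch"
  )
}

detector <- structure(list(), class = "duplicate_detector_sketch")
data <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"), stringsAsFactors = FALSE)
result <- run_detector_sketch(detector, data)
```

Returning a classed result object lets print and plot methods be attached to results separately from the detection logic.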