Releases: lanl/T-ELF
v0.0.34
Fast-tracking to v0.0.34 from v0.0.20
Enhancements
Pruning Support:
- Enabled pruning in `bnmf`, `wnmf`, and `nmf_recommender`.
- Added pruning of additional matrices, e.g., `MASK`, based on `X`.
- Included `pruned_cols` and `pruned_rows` in saved outputs.
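As a rough illustration of pruning a companion matrix based on `X`, the sketch below drops all-zero rows and columns of `X` and applies the same cut to `MASK`. The helper name `prune`, the all-zero criterion, and the choice to report removed indices in `pruned_rows` and `pruned_cols` are assumptions for illustration, not T-ELF's implementation.

```python
def prune(X, others=()):
    """Drop all-zero rows and columns of X and apply the same cut to `others`.

    Hypothetical helper: the all-zero criterion and the returned index lists
    are assumptions, not the library's actual behaviour.
    """
    n_rows, n_cols = len(X), len(X[0])
    keep_rows = [i for i in range(n_rows) if any(X[i][j] != 0 for j in range(n_cols))]
    keep_cols = [j for j in range(n_cols) if any(X[i][j] != 0 for i in range(n_rows))]

    def cut(M):
        # Keep only the surviving rows and columns of a matrix shaped like X.
        return [[M[i][j] for j in keep_cols] for i in keep_rows]

    pruned_rows = [i for i in range(n_rows) if i not in keep_rows]
    pruned_cols = [j for j in range(n_cols) if j not in keep_cols]
    return cut(X), [cut(M) for M in others], pruned_rows, pruned_cols


X = [[1, 0, 2],
     [0, 0, 0],
     [3, 0, 4]]
MASK = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]

X_p, (MASK_p,), pruned_rows, pruned_cols = prune(X, [MASK])
```

Keeping the removed indices alongside the factors makes it possible to map results back onto the original matrix shape.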
Matrix Factorization:
- Introduced new submodule `BNMFk` under `NMFk` with `nmf_method='bnmf'`.
- Added `WEIGHT` and `MASK` keys for `WNMFk` and `BNMFk`.
- Implemented matrix deletion in subroutines to reduce memory consumption.
- Added `factor_thresholding` parameter to perform thresholding over `NMFk` factors, making them boolean. Options include:
  - `coord_desc_thresh`
  - `WH_thresh`
- Introduced `factor_thresholding_obj_params` for configuring thresholding subroutines.
- Added `clustering_method` parameter with options:
  - `kmeans`
  - `bool` or `boolean` (both are equivalent).
- Introduced `clustering_obj_params` to configure clustering subroutines.
- Added new perturbation type for boolean matrices: `perturb_type='boolean'` or `perturb_type='bool'`.
- Updated examples to reflect new boolean-specific features.
- Improved path compatibility by using `os.path.join`.
Thresholding and Clustering:
- Added `factor_thresholding_H_regression` with options:
  - `otsu_thresh`
  - `coord_desc_thresh`
  - `kmeans_thresh`
- Default `factor_thresholding_H_regression` set to `kmeans_thresh`.
- Default `factor_thresholding` set to `otsu_thresh`.
- Introduced `factor_thresholding_H_regression_obj_params` to configure parameters.
- Added K-means-based boolean thresholding for `W` and `H` matrices:
  - Clusters values in each row of `W` and `H` into two groups; then the boolean threshold is the midpoint of cluster centroids.
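The K-means-based thresholding described above can be sketched in plain Python as a two-cluster, one-dimensional k-means per row, with the threshold taken as the midpoint of the two centroids. The function names, the min/max initialization, and the strict `>` comparison are assumptions for illustration; T-ELF's own routine may differ.

```python
def kmeans_threshold(values, n_iter=50):
    """Two-cluster 1-D k-means; returns the midpoint of the two centroids."""
    lo, hi = min(values), max(values)       # initial centroids (an assumption)
    if lo == hi:
        return lo                           # constant row: nothing to separate
    for _ in range(n_iter):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break                           # assignments stopped changing
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def booleanize_rows(M):
    """Threshold each row of M independently, producing a boolean matrix."""
    out = []
    for row in M:
        t = kmeans_threshold(row)
        out.append([v > t for v in row])
    return out


H = [[0.1, 0.2, 0.9, 1.0]]
threshold = kmeans_threshold(H[0])
H_bool = booleanize_rows(H)
```

For the example row, the two centroids settle near 0.15 and 0.95, so the threshold falls near 0.55.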
Hardware and Device Management:
- Added `device` parameter to `NMFk` for GPU management:
  - `device=-1`: Use all GPUs.
  - `device=0`: Use the GPU with ID 0.
  - `device=[0,1,...]`: Use a specific list of GPUs.
  - Negative values other than `-1`: Use `(number of GPUs + device + 1)`.
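One possible reading of these rules is sketched below. It assumes that `(number of GPUs + device + 1)` is a count of GPUs to use, which is consistent with `device=-1` selecting all of them, and that the lowest-numbered GPU IDs are taken first. Both points, and the helper name `resolve_devices`, are assumptions rather than documented behaviour.

```python
def resolve_devices(device, n_gpus):
    """Map a `device` setting to a list of GPU IDs (illustrative reading only)."""
    if isinstance(device, (list, tuple)):
        return list(device)                 # explicit list of GPU IDs
    if device >= 0:
        return [device]                     # a single GPU by ID
    # Assumed meaning of negative values: a GPU count.
    # device=-1 -> n_gpus, device=-2 -> n_gpus - 1, and so on.
    count = n_gpus + device + 1
    if count < 1:
        raise ValueError("device setting selects no GPUs")
    return list(range(count))


all_gpus = resolve_devices(-1, n_gpus=4)
all_but_one = resolve_devices(-2, n_gpus=4)
```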
Hierarchical NMFk (HNMFk) Improvements:
- Added new variables for nodes:
  - `parent_node_factors_path`
  - `parent_node_k`
  - `factors_path`
- Enabled dynamic renaming of paths when loading HNMFk models from different directories.
- Improved decomposition behavior:
- Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
- Added signature, centroid, and probabilities from parent nodes to child nodes.
- Introduced graph iterator methods for navigating to specific nodes by name.
- Updated node naming conventions to use ancestor-based indexing.
Result Storage:
- Added `W_all` to saved outputs of `NMFk`.
Installation and Documentation
- Migrated to a new installation system using pip and Poetry.
- Added a post-installation script for simplifying setup on different systems.
- Updated documentation for new installation methods on Chicoma and Darwin.
Bug Fixes
- Corrected HNMFk behavior to return total data indices instead of indices of indices.
- Corrected naming inconsistencies in pruning variables in `NMFk`.
- Fixed error calculation to consider only known locations when masking is applied.
- Resolved GPU transfer conflicts when using `MASK`.
- Fixed default `device` parameter in `NMFk` to be `-1` (use all devices).
- Addressed issues in `WNMFk` and `BNMFk` examples.
- Fixed checkpointing bugs:
  - Made saving checkpoints true by default.
  - Resolved issues when loading an HNMFk model during an ongoing process.
- Fixed scalar addition error with sparse matrices in `kl_mu`.
- Resolved dependency conflicts with `numpy` and `numba`.
- Updated HPC documentation for T-ELF installation.
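The HNMFk index fix in this list (returning total data indices rather than indices of indices) comes down to composing index maps down the hierarchy. The sketch below shows the idea with a hypothetical helper; the names are not taken from T-ELF.

```python
def to_original_indices(parent_indices, local_indices):
    """Translate positions within a node's sub-matrix back to rows of the root data."""
    return [parent_indices[i] for i in local_indices]


root = list(range(10))                               # the root node sees every sample
child = to_original_indices(root, [1, 4, 6, 9])      # samples assigned to one cluster
grandchild = to_original_indices(child, [0, 2])      # positions within the child's data
```

Without the translation step, the grandchild would report `[0, 2]`, which are positions within the child's subset and not rows of the full data set.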
v0.0.20
Fixes a bug on HNMFk where the original indices were wrong.
v0.0.19
- Fixes a bug in HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes would be free on the job queue.
- Fixes a bug with BST post-order search where the order was incorrect.
- Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
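To make the visiting orders concrete, the sketch below builds a balanced binary search tree over the sorted `Ks` by repeatedly taking the middle element, then lists the ranks in pre-, in-, and post-order. How T-ELF actually constructs its tree is not stated in these notes, so the middle-element construction and the helper name `bst_order` are assumptions. Pruning is left out here.

```python
def bst_order(Ks, method):
    """Visit order of a balanced BST built over sorted Ks (illustrative only)."""
    Ks = sorted(Ks)

    def walk(lo, hi):
        if lo > hi:
            return []
        mid = (lo + hi) // 2                # middle element becomes the subtree root
        left, root, right = walk(lo, mid - 1), [Ks[mid]], walk(mid + 1, hi)
        if method == "bst_pre":
            return root + left + right
        if method == "bst_in":
            return left + root + right
        if method == "bst_post":
            return left + right + root
        raise ValueError(f"unknown k_search_method: {method}")

    if method == "linear":
        return Ks
    return walk(0, len(Ks) - 1)


Ks = range(2, 9)
pre_order = bst_order(Ks, "bst_pre")
in_order = bst_order(Ks, "bst_in")
post_order = bst_order(Ks, "bst_post")
```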
v0.0.18
- Fixes a bug where Ks were not organized correctly for BST post and pre order.
- Fixes a bug in `H_sill_thresh`; the threshold can now also be set to negative values.
- Adds option to use either W sill for k prediction, H sill for k prediction, or both. Selection of the `predict_k_method` also changes how the BST search is done with `k_search_method`. The following NMFk hyper-parameters are modified accordingly:
predict_k_method : str, optional
Method to use when performing automatic k prediction. Default is "WH_sill".
predict_k_method='pvalue' # will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' # will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' # will use Silhouette scores from W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' # will use Silhouette scores from H latent factor for estimating the number of latent factors.
predict_k_method='sill' # will default to ``predict_k_method='WH_sill'``.
v0.0.17
New Features
- Introduces a new Vulture subclass `VocabularyConsolidator`, under `TELF.pre_processing.Vulture.tokens_analysis`, designed to consolidate vocabularies and textual terms.
- Refactors NMFk, RESCALk, HNMFk, and SymNMFk to enhance modularity. Helper functions are created under `TELF.factorization.utilities` to modularize the code.
- Adds a new search criterion for identifying the optimal rank, or K, to NMFk, HNMFk, WNMFk, and RNMFk. This enhancement introduces a significant speedup to each algorithm. The new criterion utilizes a Binary Search Tree to streamline the process of determining the optimal rank, drastically reducing the search space and the time needed for factorization. Additionally, this K search feature is compatible with High Performance Computing (HPC) systems, ensuring that changes in the K search space by any node are synchronized across all nodes. NMFk has been updated to include new hyper-parameters tailored to these search settings.
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
H_sill_thresh : float, optional
Setting for removing higher ranks from the search space. The default is -1.
When searching for the optimal rank with binary search using ``k_search_method='bst_post'`` or ``k_search_method='bst_pre'``, this hyper-parameter can be used to cut off higher ranks from the search space.
The cut-off of higher ranks from the search space is based on a threshold for the H silhouette. When an H silhouette below ``H_sill_thresh`` is found for a given rank or K, all higher ranks are removed from the search space.
If ``H_sill_thresh=-1``, it is not used.
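The two pruning rules can be combined in a small pre-order search sketch. It assumes a balanced tree built from the middle element of the sorted ranks and a hypothetical `evaluate(k)` callback that returns the W and H silhouettes for rank `k`; neither is taken from T-ELF, and the real implementation also has to synchronize the search space across HPC nodes, which is omitted here.

```python
def bst_search(Ks, evaluate, sill_thresh, H_sill_thresh=-1):
    """Pre-order search over Ks that skips ranks ruled out by earlier results.

    `evaluate(k)` is a hypothetical callback returning (W_silhouette, H_silhouette).
    """
    Ks = sorted(Ks)
    low, high = float("-inf"), float("inf")  # ranks <= low or >= high are pruned
    visited, best = [], None

    def walk(lo, hi):
        nonlocal low, high, best
        if lo > hi:
            return
        mid = (lo + hi) // 2
        k = Ks[mid]
        if low < k < high:                   # rank is still in the search space
            w_sil, h_sil = evaluate(k)
            visited.append(k)
            if min(w_sil, h_sil) >= sill_thresh:
                best, low = k, k             # ideal rank: drop all lower ranks
            if H_sill_thresh != -1 and h_sil < H_sill_thresh:
                high = k                     # poor H silhouette: drop all higher ranks
        walk(lo, mid - 1)
        walk(mid + 1, hi)

    walk(0, len(Ks) - 1)
    return best, visited


# Toy scores: ranks up to 6 are stable, higher ranks are not.
def toy_evaluate(k):
    return (0.9, 0.9) if k <= 6 else (0.2, 0.2)


best_k, visited = bst_search(range(2, 9), toy_evaluate, sill_thresh=0.7, H_sill_thresh=0.3)
```

In this toy run only three of the seven ranks are factorized, which is where the speedup over a linear scan comes from.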
Bugs
- Fixes a bug in RESCALk plotting where plotting function was expecting W and H silhouettes.
- Fixes a bug where k predict would not work if none of the `W` or `H` silhouettes are above the `sill_thresh` hyper-parameter. In that case, the fix selects a new `sill_thresh` based on the rule `self.sill_thresh = min([max(sils_min_W), max(sils_min_H)])`.
- Fixes a bug in document substitutions of Vulture where an error is raised if no corpus substitutions are passed.
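The fallback rule for `sill_thresh` can be written as a standalone function. This is a sketch under assumptions: `sils_min_W` and `sils_min_H` are taken to be per-rank silhouette values, and "above" is read as `>=`, matching the comparison used in the search description. The function name is hypothetical.

```python
def effective_sill_thresh(sils_min_W, sils_min_H, sill_thresh):
    """Return sill_thresh, or the fallback value when no silhouette reaches it."""
    all_sils = list(sils_min_W) + list(sils_min_H)
    if any(s >= sill_thresh for s in all_sils):
        return sill_thresh                  # at least one rank qualifies: keep as is
    # Fallback rule quoted in the release note.
    return min([max(sils_min_W), max(sils_min_H)])


lowered = effective_sill_thresh([0.5, 0.4], [0.6, 0.3], sill_thresh=0.8)
unchanged = effective_sill_thresh([0.5, 0.4], [0.6, 0.3], sill_thresh=0.55)
```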
v0.0.16
- Fixes a bug for HPC HNMFk capability when checkpointing would not save if using custom callback functionality.
- Fixes a bug in the stopwords option in Vulture Clean that excluded hyphens from stop word checks (a boolean was used in place of an iterable).
- Fixes a dictionary iteration bug in the Vulture Acronyms module by flattening the output dictionary.
- Fixes a bug where the `itertools` import was missing in Vulture material permutations.
- Fixes a bug in Vulture materials permutations for the `save_path` definition.
definition. - Adds Ks range and X shape checks for HNMFk to make sure the decomposition can still be done if using a callback functionality.
- Adds a feature to include lowercased materials in permutations.
- Adds future for material permutations.
- Adds multithreaded Levenshtein string consolidation.
- Changes the Levenshtein consolidation criterion from shortest string to most common string.
- Moves HNMFk leaf node termination, based on sample threshold, to after factorization to obtain the latent factors W and H even for nodes where the number of samples is less than the threshold.
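The "most common string" criterion for Levenshtein consolidation can be illustrated with a small pure-Python sketch. The distance cut-off `max_dist`, the alphabetical tie-break, and the function names are assumptions chosen for the example; they are not Vulture's actual settings, and the multithreading is omitted.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def consolidate(tokens, max_dist=1):
    """Map each token to the most common token within `max_dist` edits."""
    counts = Counter(tokens)
    mapping = {}
    for word in counts:
        near = [w for w in counts if levenshtein(word, w) <= max_dist]
        # Most common string wins; ties are broken alphabetically (an assumption).
        mapping[word] = max(near, key=lambda w: (counts[w], w))
    return mapping


tokens = ["tensor", "tensor", "tensor", "tensors", "tenser", "matrix"]
mapping = consolidate(tokens)
```

Choosing the most common string rather than the shortest one keeps the dominant spelling in the corpus as the canonical form.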
v0.0.15
- Fixes a bug where a Vulture Acronym Operator edge case produced wrong results when using substitutions.
- Fixes a bug where Vulture cleaning operations for stop words would not remove hyphenated words if they contain a stop word.
- Fixes minor bugs where conda environment activation was done incorrectly in HPC example scripts.
- Reorganizes the Vulture Acronym Operator example notebook to show when the cleaning is done and when the acronym operation is done with substitutions.
- Fixes the acronym warning message printing a class attribute instead of data.
- Adds HPC capability to HNMFk.
- Adds checkpointing capability for HNMFk.
- Adds online node operations for HNMFk, reducing the space taken by graph nodes.
- Adds per document based substitutions operator feature to Vulture.
- Adds Levenshtein distance based acronym consolidation for post-processing of acronyms.
v0.0.14
- Adds callback functionality to HNMFk for generating new data matrix X at each NMFk application. This allows Semantic HNMFk by re-generating TF-IDF matrix at each node.
- Adds capability to HNMFk for saving custom user data in each node when using `generate_X_callback`.
- Records the shape of X and the Ks range after pruning, and notes in the prune status when decomposition is no longer possible after pruning.
- HNMFk now uses Path library to generate sub-directories automatically.
- Fixed a bug where max(Ks) is more than min(X.shape) after pruning in NMFk.
- Fixed a bug where HNMFk is loading wrong factors when k=2 is True.
- Fixed a bug where NMFk would try to decompose data after pruning even if not possible (for example, if the number of samples left is 1, or the K range is empty based on the rule `k < min(X.shape)`).
- Fixed a bug where `Beaver.get_vocabulary()` was not consistent with the vocabulary that is generated in the other matrix creation routines.
v0.0.13
- Adds HNMFk: hierarchical non-negative matrix factorization with automatic model determination and custom settings, including missing value prediction. HNMFk has multi-processing capabilities for both CPU and GPU systems. HPC capabilities for HNMFk are planned for a later release.
- Fixes a bug on HPC example for WNMFk where number of nodes was not correct in the hyper-parameters.
v0.0.12
- Added ability to plot both silhouettes of latent patterns (W matrix) and the latent clusters (H matrix) to assist selecting the number of hidden patterns and the corresponding number of hidden clusters.
- `predict_k_method` default is changed to `"sill"`.
- NMFk plot will no longer include the blue relative error line when `calculate_error=False`.
- New `predict_k_method="sill"` will predict k based on:
  - The maximum k where W silhouette is above the threshold `sill_thresh`: Wk
  - The maximum k where H silhouette is above the threshold `sill_thresh`: Hk
  - Final k, or number of hidden signals, will be `k=min(Wk, Hk)`.
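The `"sill"` rule above can be expressed as a few lines of Python. This is a sketch with hypothetical names: `sils_W` and `sils_H` are assumed to hold one silhouette value per rank in `Ks`, "above" is read as a strict `>`, and returning `None` when either factor never clears the threshold is an assumption about the edge case.

```python
def predict_k_sill(Ks, sils_W, sils_H, sill_thresh):
    """Return min(Wk, Hk), where Wk and Hk are the largest ranks above the threshold."""
    Wk = max((k for k, s in zip(Ks, sils_W) if s > sill_thresh), default=None)
    Hk = max((k for k, s in zip(Ks, sils_H) if s > sill_thresh), default=None)
    if Wk is None or Hk is None:
        return None                         # no rank qualifies for W or for H
    return min(Wk, Hk)


Ks = [2, 3, 4]
sils_W = [0.9, 0.5, 0.9]    # W is stable at k=2 and k=4, so Wk = 4
sils_H = [0.9, 0.9, 0.5]    # H is stable at k=2 and k=3, so Hk = 3
k = predict_k_sill(Ks, sils_W, sils_H, sill_thresh=0.7)
```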