Releases: lanl/T-ELF

v0.0.34

07 Jan 01:20
ff683c8

Fast-tracking to v0.0.34 from v0.0.20

Enhancements

Pruning Support:

  • Enabled pruning in bnmf, wnmf, and nmf_recommender.
  • Added pruning of additional matrices, e.g., MASK, based on X.
  • Included pruned_cols and pruned_rows in saved outputs.
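
A minimal sketch of what pruning an auxiliary matrix based on X could look like. It assumes that pruning means dropping the all-zero rows and columns of X, which these notes do not state; `prune` is a hypothetical helper, not the T-ELF API, and the saved `pruned_rows` / `pruned_cols` outputs may use a different convention (for example, boolean masks) than the removed-index lists shown here.

```python
def prune(X, extras=()):
    """Drop all-zero rows/columns of X and apply the same selection to extras.

    Hypothetical helper for illustration only. X and every matrix in
    `extras` are lists of rows with identical shapes.
    """
    n_rows, n_cols = len(X), len(X[0])
    keep_rows = [i for i in range(n_rows) if any(X[i])]
    keep_cols = [j for j in range(n_cols) if any(X[i][j] for i in range(n_rows))]

    def select(M):
        # Keep only the rows and columns that survived in X.
        return [[M[i][j] for j in keep_cols] for i in keep_rows]

    pruned_rows = [i for i in range(n_rows) if i not in keep_rows]
    pruned_cols = [j for j in range(n_cols) if j not in keep_cols]
    return select(X), [select(M) for M in extras], pruned_rows, pruned_cols


X = [[1, 0, 2],
     [0, 0, 0],
     [3, 0, 4]]
MASK = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]

# MASK is pruned with the selection derived from X, so shapes stay aligned.
X_p, (MASK_p,), pruned_rows, pruned_cols = prune(X, extras=[MASK])
print(X_p, MASK_p, pruned_rows, pruned_cols)
```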

Matrix Factorization:

  • Introduced new submodule BNMFk under NMFk with nmf_method='bnmf'.
  • Added WEIGHT and MASK keys for WNMFk and BNMFk.
  • Implemented matrix deletion in subroutines to reduce memory consumption.
  • Added factor_thresholding parameter to perform thresholding over NMFk factors, making them boolean. Options include:
    • coord_desc_thresh
    • WH_thresh
  • Introduced factor_thresholding_obj_params for configuring thresholding subroutines.
  • Added clustering_method parameter with options:
    • kmeans
    • bool or boolean (both are equivalent).
  • Introduced clustering_obj_params to configure clustering subroutines.
  • Added new perturbation type for boolean matrices: perturb_type='boolean' or perturb_type='bool'.
  • Updated examples to reflect new boolean-specific features.
  • Improved path compatibility by using os.path.join.

Thresholding and Clustering:

  • Added factor_thresholding_H_regression with options:
    • otsu_thresh
    • coord_desc_thresh
    • kmeans_thresh
  • Default factor_thresholding_H_regression set to kmeans_thresh.
  • Default factor_thresholding set to otsu_thresh.
  • Introduced factor_thresholding_H_regression_obj_params to configure parameters.
  • Added K-means-based boolean thresholding for W and H matrices:
    • Clusters values in each row of W and H into two groups; then the boolean threshold is the midpoint of cluster centroids.
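
The K-means-based thresholding described above can be sketched as follows. This is an illustrative pure-Python version, not the T-ELF implementation: `two_means_threshold` and `booleanize_rows` are hypothetical names, the centroids are initialised at the row minimum and maximum (an assumption), and values strictly above the threshold are mapped to True (also an assumption).

```python
def two_means_threshold(values, n_iter=100):
    """Midpoint of the two centroids of a 1-D, two-cluster k-means."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # constant row: no meaningful split
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        # In 1-D, nearest-centroid assignment is a split at the midpoint.
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def booleanize_rows(M):
    """Threshold each row of M at its own two-means midpoint."""
    out = []
    for row in M:
        t = two_means_threshold(row)
        out.append([v > t for v in row])
    return out


W = [[0.1, 0.2, 0.9, 1.0],
     [5.0, 0.0, 4.0, 1.0]]
W_bool = booleanize_rows(W)
print(W_bool)
```

In the first row the centroids settle at roughly 0.15 and 0.95, so the threshold is about 0.55 and only the last two entries become True.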

Hardware and Device Management:

  • Added device parameter to NMFk for GPU management:
    • device=-1: Use all GPUs.
    • device=0: Use the GPU with ID 0.
    • device=[0,1,...]: Use a specific list of GPUs.
    • Negative values other than -1: Use (number of GPUs + device + 1).
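
One possible reading of the device rules above, as a hedged sketch. `resolve_devices` is a hypothetical helper, not part of NMFk. The expression (number of GPUs + device + 1) is interpreted here as a count of GPUs, which is consistent with device=-1 meaning all GPUs; selecting the lowest GPU IDs for that count is an assumption.

```python
def resolve_devices(device, n_gpus):
    """Map a `device` setting to a list of GPU IDs (illustrative only)."""
    if isinstance(device, (list, tuple)):
        return list(device)           # explicit list of GPU IDs
    if device >= 0:
        return [device]               # a single GPU ID
    count = n_gpus + device + 1       # rule quoted in the release notes
    if count < 1:
        raise ValueError(f"device={device} is out of range for {n_gpus} GPUs")
    return list(range(count))         # assumption: take the lowest IDs


print(resolve_devices(-1, 4))      # all four GPUs
print(resolve_devices(0, 4))       # GPU 0 only
print(resolve_devices([0, 2], 4))  # the listed GPUs
print(resolve_devices(-2, 4))      # 4 + (-2) + 1 = 3 GPUs
```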

Hierarchical NMFk (HNMFk) Improvements:

  • Added new variables for nodes:
    • parent_node_factors_path
    • parent_node_k
    • factors_path
  • Enabled dynamic renaming of paths when loading HNMFk models from different directories.
  • Improved decomposition behavior:
    • Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
  • Added signature, centroid, and probabilities from parent nodes to child nodes.
  • Introduced graph iterator methods for navigating to specific nodes by name.
  • Updated node naming conventions to use ancestor-based indexing.

Result Storage:

  • Added W_all to saved outputs of NMFk.

Installation and Documentation

  • Migrated to a new installation system using pip and Poetry.
  • Added a post-installation script for simplifying setup on different systems.
  • Updated documentation for:
    • New installation methods on Chicoma and Darwin.

Bug Fixes

  • Corrected HNMFk behavior to return total data indices instead of indices of indices.
  • Corrected naming inconsistencies in pruning variables in NMFk.
  • Fixed error calculation to consider only known locations when masking is applied.
  • Resolved GPU transfer conflicts when using MASK.
  • Fixed default device parameter in NMFk to be -1 (use all devices).
  • Addressed issues in WNMFk and BNMFk examples.
  • Fixed checkpointing bugs:
    • Made saving checkpoints true by default.
    • Resolved issues when loading an HNMFk model during an ongoing process.
  • Fixed scalar addition error with sparse matrices in kl_mu.
  • Resolved dependency conflicts with numpy and numba.
  • Updated HPC documentation for T-ELF installation.

v0.0.20

24 Jul 19:59
581cceb

Fixes a bug in HNMFk where the original indices were wrong.

v0.0.19

04 May 22:10
309eb02
  • Fixes a bug in HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes would be free on the job queue.
  • Fixes a bug with BST post-order search where the order was incorrect.
  • Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:

k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".

  • k_search_method='linear' will linearly visit each K given in the Ks hyper-parameter of the fit() function.
  • k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
  • k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
  • k_search_method='bst_in' will perform in-order binary search. When an ideal rank is found, determined by the selected predict_k_method, all lower ranks are pruned from the search space.
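
To illustrate how the three BST options differ, the sketch below lists the order in which ranks would be visited, ignoring pruning. It assumes the tree is built by recursively taking the middle element of the sorted Ks as the root; how T-ELF actually builds its tree is not stated here, and `bst_visit_order` is a hypothetical helper.

```python
def bst_visit_order(Ks, order="pre"):
    """Visit order over a balanced BST built from sorted Ks (illustrative)."""
    Ks = sorted(Ks)
    if not Ks:
        return []
    mid = len(Ks) // 2
    root = [Ks[mid]]
    left = bst_visit_order(Ks[:mid], order)
    right = bst_visit_order(Ks[mid + 1:], order)
    if order == "pre":
        return root + left + right
    if order == "in":
        return left + root + right
    if order == "post":
        return left + right + root
    raise ValueError(f"unknown order: {order!r}")


Ks = range(1, 8)
print(bst_visit_order(Ks, "pre"))   # [4, 2, 1, 3, 6, 5, 7]
print(bst_visit_order(Ks, "in"))    # [1, 2, 3, 4, 5, 6, 7]
print(bst_visit_order(Ks, "post"))  # [1, 3, 2, 5, 7, 6, 4]
```

Under this construction, pre-order evaluates the middle rank first, so a successful rank can prune the lower half of the search space before any of it is visited.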

v0.0.18

29 Apr 23:42
a0b2be7
  • Fixes a bug where Ks were not organized correctly for BST post and pre order.
  • Fixes a bug in H_sill_thresh; the threshold can now also be set to negative values.
  • Adds option to use either W sill for k prediction, H sill for k prediction, or both. Selection of the predict_k_method also changes how the BST search is done with k_search_method. The following NMFk hyper-parameters are modified accordingly:

predict_k_method : str, optional
Method to use when performing automatic k prediction. Default is "WH_sill".

predict_k_method='pvalue' # will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' # will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' # will use Silhouette scores from W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' # will use Silhouette scores from H latent factor for estimating the number of latent factors.
predict_k_method='sill' # will default to ``predict_k_method='WH_sill'``.
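
A rough sketch of how the silhouette-based options above could select the score used for k prediction. `predict_k` is a hypothetical function, not the NMFk internals; the 'pvalue' option is not covered, and choosing the largest k whose score reaches `sill_thresh` is an assumption made for the example.

```python
def predict_k(Ks, sils_W, sils_H, method="WH_sill", sill_thresh=0.8):
    """Pick k from per-rank silhouette scores (illustrative only)."""
    if method == "sill":
        method = "WH_sill"  # 'sill' is an alias for 'WH_sill'
    if method == "W_sill":
        scores = sils_W
    elif method == "H_sill":
        scores = sils_H
    elif method == "WH_sill":
        # Use the minimum of the W and H silhouettes at each rank.
        scores = [min(w, h) for w, h in zip(sils_W, sils_H)]
    else:
        raise ValueError(f"unsupported method: {method!r}")
    passing = [k for k, s in zip(Ks, scores) if s >= sill_thresh]
    return max(passing) if passing else None


Ks = [2, 3, 4, 5]
sils_W = [0.95, 0.90, 0.85, 0.40]
sils_H = [0.90, 0.85, 0.50, 0.30]
print(predict_k(Ks, sils_W, sils_H, "W_sill"))   # 4
print(predict_k(Ks, sils_W, sils_H, "H_sill"))   # 3
print(predict_k(Ks, sils_W, sils_H, "WH_sill"))  # 3
```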

v0.0.17

27 Apr 00:34
014900f

New Features

  • Introduces a new Vulture subclass VocabularyConsolidator, under TELF.pre_processing.Vulture.tokens_analysis, designed to consolidate vocabularies and textual terms.

  • Refactors NMFk, RESCALk, HNMFk, and SymNMFk to enhance modularity. Helper functions are created under TELF.factorization.utilities to modularize the code.

  • Adds a new search criterion for identifying the optimal rank, or K, to NMFk, HNMFk, WNMFk, and RNMFk. This enhancement introduces a significant speedup to each algorithm. The new criterion utilizes a Binary Search Tree to streamline the process of determining the optimal rank, drastically reducing the search space and the time needed for factorization. Additionally, this K search feature is compatible with High Performance Computing (HPC) systems, ensuring that changes in the K search space by any node are synchronized across all nodes. NMFk has been updated to include new hyper-parameters tailored to these search settings.

    k_search_method : str, optional
    Which approach to use when searching for the rank or k. The default is "linear".

    • k_search_method='linear' will linearly visit each K given in Ks hyper-parameter of the fit() function.
    • k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found with min(W silhouette, H silhouette) >= sill_thresh, all lower ranks are pruned from the search space.
    • k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found with min(W silhouette, H silhouette) >= sill_thresh, all lower ranks are pruned from the search space.

    H_sill_thresh : float, optional
    Setting for removing higher ranks from the search space. The default is -1.

    When searching for the optimal rank with binary search using k_search_method='bst_post' or k_search_method='bst_pre', this hyper-parameter can be used to cut off higher ranks from the search space.
    The cut-off of higher ranks from the search space is based on a threshold for the H silhouette. When an H silhouette below H_sill_thresh is found for a given rank or K, all higher ranks are removed from the search space.
    If H_sill_thresh=-1, it is not used.
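
The two pruning rules described above can be sketched together for the pre-order case. This is a simplified, single-process illustration and not the NMFk code: `pre_order`, `bst_search`, and `evaluate` are hypothetical names, the tree construction (middle element as root) is an assumption, and `evaluate(k)` stands in for a full factorization that returns the W and H silhouettes at rank k.

```python
def pre_order(Ks):
    """Pre-order visit of a balanced BST over sorted Ks (assumed layout)."""
    Ks = sorted(Ks)
    if not Ks:
        return []
    mid = len(Ks) // 2
    return [Ks[mid]] + pre_order(Ks[:mid]) + pre_order(Ks[mid + 1:])


def bst_search(Ks, evaluate, sill_thresh=0.8, H_sill_thresh=-1):
    """Evaluate ranks in pre-order, shrinking the search space as we go."""
    lower, upper = float("-inf"), float("inf")
    visited = {}
    for k in pre_order(Ks):
        if k < lower or k > upper:
            continue  # rank was pruned from the search space
        w_sil, h_sil = evaluate(k)
        visited[k] = (w_sil, h_sil)
        if min(w_sil, h_sil) >= sill_thresh:
            lower = max(lower, k)  # ideal rank found: drop lower ranks
        if H_sill_thresh != -1 and h_sil < H_sill_thresh:
            upper = min(upper, k)  # poor H silhouette: drop higher ranks
    return visited


def toy_evaluate(k):
    # Toy data: silhouettes are high up to rank 5 and collapse afterwards.
    return (0.9, 0.9) if k <= 5 else (0.3, 0.2)


visited = bst_search(range(1, 10), toy_evaluate, sill_thresh=0.8, H_sill_thresh=0.5)
print(sorted(visited))  # [5, 6, 7, 8]
```

In this toy run only four of the nine candidate ranks are evaluated: rank 5 prunes ranks 1 to 4, and the low H silhouette at rank 8 prunes rank 9.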

Bugs

  • Fixes a bug in RESCALk plotting where plotting function was expecting W and H silhouettes.
  • Fixes a bug where k predict would not work if none of the W or H silhouettes are above the sill_thresh hyper-parameter. New fix selects new sill_thresh based on the rule: self.sill_thresh = min([max(sils_min_W), max(sils_min_H)]) when none of the W or H silhouettes are above the sill_thresh hyper-parameter.
  • Fixes a bug in document substitutions of Vulture where an error is raised if no corpus substitutions are passed.

v0.0.16

22 Apr 18:28
3207768
  • Fixes a bug for HPC HNMFk capability when checkpointing would not save if using custom callback functionality.
  • Fixes a bug in the stopwords option in Vulture Clean that excluded hyphens from stop word checks (a boolean was used in place of an iterable).
  • Fixes a dictionary iteration bug in the Vulture Acronyms module by flattening the output dictionary.
  • Fixes a missing itertools import in Vulture material permutations.
  • Fixes a bug in Vulture materials permutations for the save_path definition.
  • Adds Ks range and X shape checks for HNMFk to make sure the decomposition can still be done if using a callback functionality.
  • Adds a feature to include lowercased materials in permutations.
  • Adds future for material permutations.
  • Adds multithreaded string consolidation in Levenshtein.
  • Changes the Levenshtein consolidation criterion from the shortest string to the most common string.
  • Moves HNMFk leaf node termination, based on sample threshold, to after factorization to obtain the latent factors W and H even for nodes where the number of samples is less than the threshold.

v0.0.15

19 Apr 00:21
2ed58e6
  • Fixes a bug where a Vulture Acronym Operator edge case produced wrong results when using substitutions.
  • Fixes a bug where Vulture cleaning operations for stop words would not remove hyphenated words if they contain a stop word.
  • Fixes minor bugs where conda environment activation was done incorrectly in HPC example scripts.
  • Reorganizes the Vulture Acronym Operator example notebook to show when the cleaning is done and when the acronym operation is done with substitutions.
  • Fixes the acronym warning message printing a class attribute instead of the data.
  • Adds HPC capability to HNMFk.
  • Adds checkpointing capability for HNMFk.
  • Adds online node operations for HNMFk, reducing the space taken by graph nodes.
  • Adds a per-document substitutions operator feature to Vulture.
  • Adds Levenshtein distance-based acronym consolidation for post-processing of acronyms.

v0.0.14

15 Apr 18:19
b936163
  • Adds callback functionality to HNMFk for generating new data matrix X at each NMFk application. This allows Semantic HNMFk by re-generating TF-IDF matrix at each node.
  • Adds capability to HNMFk for saving custom user data in each node when using generate_X_callback.
  • Records the shape of X and the Ks range after pruning, and notes in the prune status when decomposition is no longer possible after pruning.
  • HNMFk now uses Path library to generate sub-directories automatically.
  • Fixed bug where max(Ks) is more than min(X.shape) after pruning in NMFk.
  • Fixed a bug where HNMFk is loading wrong factors when k=2 is True.
  • Fixed a bug where NMFk would try to decompose data after pruning even if not possible (for example, if the number of samples left is 1, or the K range is empty based on the rule k < min(X.shape)).
  • Fixed a bug where Beaver.get_vocabulary() was not consistent with the vocabulary that is generated in the other matrix creation routines.

v0.0.13

09 Apr 19:10
0179b57
  • Adds HNMFk: hierarchical non-negative matrix factorization with automatic model determination and custom settings, including missing value prediction. HNMFk has multi-processing capabilities for both CPU and GPU systems. HPC capabilities for HNMFk are planned to be added later.
  • Fixes a bug in the HPC example for WNMFk where the number of nodes was not correct in the hyper-parameters.

v0.0.12

01 Apr 17:08
943a50c
  • Added ability to plot both silhouettes of latent patterns (W matrix) and the latent clusters (H matrix) to assist in selecting the number of hidden patterns and the corresponding number of hidden clusters.
  • predict_k_method default is changed to "sill".
  • NMFk plot will no longer include the blue relative error line when calculate_error=False.
  • New predict_k_method="sill" will predict k based on:
    • The maximum k where W silhouette is above the threshold sill_thresh: Wk
    • The maximum k where H silhouette is above the threshold sill_thresh: Hk
    • Final k, or number of hidden signals, will be k=min(Wk, Hk).
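
The rule above can be written out as a short sketch. `predict_k_sill` is a hypothetical function, not the NMFk implementation; treating "above the threshold" as greater than or equal to `sill_thresh`, and returning None when either silhouette never reaches it, are assumptions.

```python
def predict_k_sill(Ks, sils_W, sils_H, sill_thresh):
    """k = min(Wk, Hk), where Wk and Hk are the largest passing ranks."""
    Wk = max((k for k, s in zip(Ks, sils_W) if s >= sill_thresh), default=None)
    Hk = max((k for k, s in zip(Ks, sils_H) if s >= sill_thresh), default=None)
    if Wk is None or Hk is None:
        return None  # no rank passes for W or for H
    return min(Wk, Hk)


Ks = [2, 3, 4, 5, 6]
sils_W = [0.95, 0.90, 0.85, 0.82, 0.40]  # passes up to k=5
sils_H = [0.90, 0.88, 0.81, 0.50, 0.30]  # passes up to k=4
k_pred = predict_k_sill(Ks, sils_W, sils_H, sill_thresh=0.8)
print(k_pred)  # 4
```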