Regex improvement and split into separate package (#132)

- Rewrite of the regex model, with a new interface.
- Regex model is now its own separate package.
- Also add a new legacy system to deprecate old distributions.
sodascience · Sep 18, 2023 · bf7a79a
1 parent 8284dbd
Showing 20 changed files with 1,224 additions and 1,660 deletions.
diff --git a/docs/source/api/metasynth.distribution.regex.rst b/docs/source/api/metasynth.distribution.regex.rst
diff --git a/docs/source/api/metasynth.distribution.rst b/docs/source/api/metasynth.distribution.rst
@@ -1,14 +1,6 @@
 metasynth.distribution package
 ==============================
 
-Subpackages
------------
-
-.. toctree::
-   :maxdepth: 4
-
-   metasynth.distribution.regex
-
 Submodules
 ----------
 
@@ -60,6 +52,15 @@ metasynth.distribution.faker module
    :undoc-members:
    :show-inheritance:
 
+metasynth.distribution.regex module
+-----------------------------------
+
+.. automodule:: metasynth.distribution.regex
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+
 Module contents
 ---------------
 

diff --git a/examples/advanced_tutorial.ipynb b/examples/advanced_tutorial.ipynb
@@ -12,13 +12,13 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## Step 0: Install the metasynth package and import required packages"
-   ],
+   "id": "f5c6597b",
    "metadata": {
     "collapsed": false
    },
-   "id": "f5c6597b"
+   "source": [
+    "## Step 0: Install the metasynth package and import required packages"
+   ]
   },
   {
    "cell_type": "code",
@@ -105,22 +105,22 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "outputs": [],
-   "source": [
-    "df.describe()"
-   ],
+   "id": "c72c2acb55fca193",
    "metadata": {
     "collapsed": false
    },
-   "id": "c72c2acb55fca193"
+   "outputs": [],
+   "source": [
+    "df.describe()"
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [],
+   "id": "5df3f1e974e84da4",
    "metadata": {
     "collapsed": false
    },
-   "id": "5df3f1e974e84da4"
+   "source": []
   },
   {
    "cell_type": "markdown",
@@ -162,26 +162,26 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Alternatively, we can preview the MetaFrame as it would be output to a file"
-   ],
+   "id": "aeb3edae1eedf4b8",
    "metadata": {
     "collapsed": false
    },
-   "id": "aeb3edae1eedf4b8"
+   "source": [
+    "Alternatively, we can preview the MetaFrame as it would be output to a file"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cbb6f59f1d439189",
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [],
    "source": [
     "json_preview = repr(mf)\n",
     "print(json_preview)"
-   ],
-   "metadata": {
-    "collapsed": false
-   },
-   "id": "cbb6f59f1d439189"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -226,25 +226,25 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c5eac7eeb3326f03",
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [],
    "source": [
     "#load previously exported MetaFrame (.json) file\n",
     "mf = MetaFrame.from_json(file_path)"
-   ],
-   "metadata": {
-    "collapsed": false
-   },
-   "id": "c5eac7eeb3326f03"
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
-   ],
+   "id": "85201666a67a73fd",
    "metadata": {
     "collapsed": false
    },
-   "id": "85201666a67a73fd"
+   "source": [
+    "Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
+   ]
   },
   {
    "cell_type": "code",
@@ -423,7 +423,7 @@
     "# To create a regex distribution, you need a list of tuples, where each tuple is an element.\n",
     "# The first part of the tuple is a string representation of the regex, while the second is the proportion of the\n",
     "# time the regex element is used.\n",
-    "cabin_distribution = RegexDistribution(r\"[ABCDEF]\\d{2,3}\")  # Add the r so that it becomes a literal string.\n",
+    "cabin_distribution = RegexDistribution(r\"[ABCDEF][0-9]{2,3}\")  # Add the r so that it becomes a literal string.\n",
     "# just for completeness: data generated from this distribution will always match the regex [ABCDEF]?(\\d{2,3})?\n",
     "\n",
     "var_spec = {\n",
@@ -519,7 +519,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.11.1"
   },
   "vscode": {
    "interpreter": {

diff --git a/examples/getting_started.ipynb b/examples/getting_started.ipynb
@@ -12,14 +12,14 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## Step 0: Install the metasynth package and import required packages\n",
-    "First, install the metasynth package in your session:"
-   ],
+   "id": "f5c6597b",
    "metadata": {
     "collapsed": false
    },
-   "id": "f5c6597b"
+   "source": [
+    "## Step 0: Install the metasynth package and import required packages\n",
+    "First, install the metasynth package in your session:"
+   ]
   },
   {
    "cell_type": "code",
@@ -152,26 +152,26 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Alternatively, we can preview the MetaFrame as it would be output to a file"
-   ],
+   "id": "572128238aa26d67",
    "metadata": {
     "collapsed": false
    },
-   "id": "572128238aa26d67"
+   "source": [
+    "Alternatively, we can preview the MetaFrame as it would be output to a file"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "42470906a0d25575",
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [],
    "source": [
     "json_preview = repr(mf)\n",
     "print(json_preview)"
-   ],
-   "metadata": {
-    "collapsed": false
-   },
-   "id": "42470906a0d25575"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -226,26 +226,26 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
-   ],
+   "id": "48bc803ba4e76a1c",
    "metadata": {
     "collapsed": false
    },
-   "id": "48bc803ba4e76a1c"
+   "source": [
+    "Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "4ccf451c",
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [],
    "source": [
     "# generate synthetic data\n",
     "mf.synthesize(5)"
-   ],
-   "metadata": {
-    "collapsed": false
-   },
-   "id": "4ccf451c"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -275,7 +275,7 @@
     "    # Manually set a distribution for age \n",
     "    \"Age\": {\"distribution\": DiscreteUniformDistribution(20, 40)},\n",
     "    # Manually set a regex distribution for cabin\n",
-    "    \"Cabin\": {\"distribution\": RegexDistribution(r\"[ABCDEF]\\d{2,3}\")}\n",
+    "    \"Cabin\": {\"distribution\": RegexDistribution(r\"[ABCDEF][0-9]{2,3}\")}\n",
     "}\n",
     "\n",
     "# create the high-quality metadata\n",
@@ -313,6 +313,14 @@
    "source": [
     "syn_df.describe()"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f1e2244c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
@@ -331,7 +339,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.11.1"
   },
   "vscode": {
    "interpreter": {