What's the best-practice way to remove data from a partitioned dataset?

We have an ETL set up to write to a partitioned output dataset. Now that we are far enough into 2024, I'm being told that we can reduce the size of this dataset by removing all 2022 data from it. What would be the best way to go about doing this?

Best Answer

  • david_cunningham
    edited April 26 Answer ✓

    @Julianna_Potter there is a configuration option in the output dataset that lets you specify which partitions you want to keep.

    Exercise caution: If a partition filter expression is specified, all partitions are evaluated against it and any that do not pass are deleted.
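
    To make that caution concrete, here is a minimal sketch in plain Python of the "evaluate every partition, delete whatever fails" behavior described above. It is not the platform's own filter syntax (this thread doesn't show it); the monthly date keys, the `split_partitions` helper, and the `keep_from_2023` rule are all hypothetical and exist only to illustrate a dry run you could reason through before applying a real filter.

```python
from datetime import date


def split_partitions(partition_keys, keep):
    """Evaluate every key against `keep`; return (kept, deleted) lists."""
    kept, deleted = [], []
    for key in partition_keys:
        # A partition that does not pass the filter ends up in `deleted`.
        (kept if keep(key) else deleted).append(key)
    return kept, deleted


# Hypothetical partition keys: a few monthly partitions across three years.
partitions = [date(y, m, 1) for y in (2022, 2023, 2024) for m in (1, 6, 12)]


def keep_from_2023(key):
    """Assumed keep rule: retain anything dated 2023 or later."""
    return key.year >= 2023


kept, deleted = split_partitions(partitions, keep_from_2023)

print("would keep:  ", [d.isoformat() for d in kept])
print("would delete:", [d.isoformat() for d in deleted])
```

    The point of listing both sides is that the filter describes what to keep, so anything it fails to match is removed, including partitions you may not have been thinking about.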

    If you mean the input side, you can set that up in the configuration as well.

    If this answered your question, please 'like' and 'accept' my answer.

    David Cunningham

    ** Was this post helpful? Click Agree 😀, Like 👍️, or Awesome ❤️ below **
    ** Did this solve your problem? Accept it as a solution! ✔️ **

Answers

  • DataMaven

    @Julianna_Potter - Do an export of the dataset before you take action if it's not too big to do so!

    DataMaven
    Breaking Down Silos - Building Bridges
    **Say "Thanks" by clicking a reaction in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Julianna_Potter

    @DataMaven thanks, but it's definitely too large for that. It's in the billions, which is why I want to remove the 2022 data from it.

  • Julianna_Potter

    @david_cunningham thanks for your response. I actually knew about that feature and totally spaced last week when I was trying to remember the best way to go about this. 🤦‍♀️

  • DataMaven

    @Julianna_Potter - I figured that may be the case! Seeing who it was, I was pretty sure it had to be something like that, but I didn't want to assume.

  • Julianna_Potter

    @DataMaven and @david_cunningham thank you both! I used the configuration setting to filter 2022 data out (after testing in another test partition) and it worked perfectly.