
Let's say I have defined a dataset in this way:

filename_dataset = tf.data.Dataset.list_files("{}/*.png".format(dataset))

how can I get the number of elements that are inside the dataset (hence, the number of single elements that compose an epoch)?

I know that tf.data.Dataset already knows the size of the dataset, because the repeat() method allows repeating the input pipeline for a specified number of epochs. So there must be a way to get this information.

Do you need to have this information before the first epoch completed, or is it okay to compute it after? – P-Gn Jun 7, 2018 at 9:21
Working as an iterator, I don't think a Dataset knows the total number of elements before reaching the last one - then it starts repeating over if requested (c.f. source repeat_dataset_op.cc) – benjaminplanche Jun 7, 2018 at 9:29
Can't you just list the files in "{}/*.png".format(dataset) before (say via glob or os.listdir), get the length of that and then pass the list to a Dataset? Datasets don't have (natively) access to the number of items they contain (knowing that number would require a full pass on the dataset, and you still have the case of unlimited datasets coming from streaming data or generators) – GPhilo Jun 7, 2018 at 9:39
@GPhilo understood, thank you for the explanation! However the answer of user1735003 perfectly fits my needs – nessuno Jun 7, 2018 at 10:03
It defeats the purpose of it being an iterator. Calling list() runs the entire thing in a single shot. It works for smaller amounts of data, but can likely take too many resources for larger datasets. – yrekkehs Jan 6, 2020 at 10:44
dataset.cardinality().numpy()

Note that the .cardinality() method has been integrated into the main package (previously it was only available in the experimental package).

Also note that after applying the filter() operation, cardinality() can return -2 (tf.data.UNKNOWN_CARDINALITY), because the number of remaining elements cannot be determined without iterating over the dataset.
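As a quick illustration of the filter() caveat (a minimal sketch; the range/filter dataset is just an example):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(ds.cardinality().numpy())        # 10

filtered = ds.filter(lambda x: x % 2 == 0)
print(filtered.cardinality().numpy())  # -2, i.e. tf.data.UNKNOWN_CARDINALITY

# when the cardinality is unknown, you have to count by iterating
print(sum(1 for _ in filtered))        # 5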

len(dataset) also works for TensorFlow >= 2.3 (without having to convert to a list first). The main difference with calling dataset.cardinality() directly seems to be that len(dataset) converts the result into a native Python int. – apdnu Jun 4 at 19:52

Take a look here: https://github.com/tensorflow/tensorflow/issues/26966

It doesn't work for TFRecord datasets, but it works fine for other types.

TL;DR:

num_elements = tf.data.experimental.cardinality(dataset).numpy()

Use tf.data.experimental.cardinality(dataset) - see here.

In the case of TensorFlow Datasets (tfds) you can use _, info = tfds.load(with_info=True) and then call info.splits['train'].num_examples. But even in this case it doesn't work properly if you define your own split.

So you may either count your files (a short glob sketch follows the loop below) or iterate over the dataset (as described in other answers):

num_training_examples = 0
num_validation_examples = 0
for example in training_set:
    num_training_examples += 1
for example in validation_set:
    num_validation_examples += 1
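For the first option, counting the files directly (a minimal sketch; dataset_dir stands for the directory used in the question's glob pattern):

import glob

num_files = len(glob.glob("{}/*.png".format(dataset_dir)))
print(num_files)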
Given how often cardinality has failed for me without obvious filters I would suggest using this response! – n8yoder Oct 13, 2022 at 16:04

tf.data.Dataset.list_files creates a tensor called MatchingFiles:0 (with the appropriate prefix if applicable).

You could evaluate

tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

to get the number of files.

Of course, this would work in simple cases only, and in particular if you have only one sample (or a known number of samples) per image.
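Putting that together, a minimal sketch under the assumption that you are running TF 1.x in graph mode (dataset_dir is a placeholder for the directory used in the question):

import tensorflow as tf  # TF 1.x assumed

filename_dataset = tf.data.Dataset.list_files("{}/*.png".format(dataset_dir))
num_files_t = tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

with tf.Session() as sess:
    print(sess.run(num_files_t))  # number of files matched by the pattern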

In more complex situations, e.g. when you do not know the number of samples in each file, you can only observe the number of samples as an epoch ends.

To do this, you can watch the number of epochs that is counted by your Dataset. repeat() creates a member called _count that counts the number of epochs. By observing it during your iterations, you can spot when it changes and compute your dataset size from there.

This counter may be buried in the hierarchy of Datasets that is created when calling member functions successively, so we have to dig it out like this.

import warnings

d = my_dataset
# RepeatDataset seems not to be exposed -- this is a possible workaround
RepeatDataset = type(tf.data.Dataset().repeat())
try:
  # walk down the chain of wrapped datasets until the RepeatDataset is found
  while not isinstance(d, RepeatDataset):
    d = d._input_dataset
except AttributeError:
  warnings.warn('no epoch counter found')
  epoch_counter = None
else:
  epoch_counter = d._count

Note that with this technique, the computation of your dataset size is not exact, because the batch during which epoch_counter is incremented typically mixes samples from two successive epochs. So this computation is precise up to your batch length.

Unfortunately, I don't believe there is a feature like that yet in TF. With TF 2.0 and eager execution however, you could just iterate over the dataset:

num_elements = 0
for element in dataset:
    num_elements += 1

This is the most storage-efficient way I could come up with.

This really feels like a feature that should have been added a long time ago. Fingers crossed they add a length feature in a later version.

Alternatively, a more concise way to add things up in TF 2.0: count = dataset.reduce(0, lambda x, _: x + 1) – Happy Gene Oct 28, 2019 at 21:56
I found you have to call numpy() on count to get the actual value, otherwise count is a tensor, i.e.: count = dataset.reduce(0, lambda x, _: x + 1).numpy() – CSharp Nov 25, 2019 at 9:31
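Combining those two comments into one runnable snippet (a minimal sketch assuming TF 2.x eager execution; the range dataset is just a stand-in):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)                   # placeholder dataset
count = dataset.reduce(0, lambda x, _: x + 1).numpy()  # count elements without materializing them
print(count)                                           # 100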

I saw many methods of getting the number of samples, but actually you can easily do it in Keras:

len(dataset) * BATCH_SIZE
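To be explicit about the assumption here (a minimal sketch; the range dataset and BATCH_SIZE are placeholders): len() on a batched dataset returns the number of batches, so multiplying by the batch size only gives the exact sample count when every batch is full (e.g. with drop_remainder=True).

import tensorflow as tf

BATCH_SIZE = 32
dataset = tf.data.Dataset.range(100).batch(BATCH_SIZE)

print(len(dataset))               # 4 batches (the last one holds only 4 elements)
print(len(dataset) * BATCH_SIZE)  # 128 -- an overestimate of the 100 samples unless drop_remainder=True is used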

For earlier TensorFlow versions (2.1 or newer):

sum(dataset.map(lambda x: 1).as_numpy_iterator())

That way you don't have to load each object in your dataset into memory; instead you map every element to 1 and sum the 1s.

For some datasets, like COCO, the cardinality function does not return a size. One way to compute the size of a dataset quickly is to use map and reduce, like so:

ds.map(lambda x: 1, num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(tf.constant(0), lambda x,_: x+1)

In TensorFlow 2.6.0 (I am not sure whether this was possible in earlier versions):

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#__len__

Dataset.__len__()
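A quick sketch of how this behaves (my understanding: len() only works when the cardinality is finite and known, and raises TypeError otherwise):

import tensorflow as tf  # TF >= 2.3, eager mode

ds = tf.data.Dataset.range(42)
print(len(ds))  # 42

# len() raises TypeError for infinite or unknown cardinality,
# e.g. after .repeat() or .filter(); fall back to cardinality() there
print(ds.repeat().cardinality().numpy())  # -1 == tf.data.INFINITE_CARDINALITY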

A bit late to the party, but for a large dataset stored in TFRecord files I used this (TF 1.15):

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
dataset = tf.data.TFRecordDataset('some_path')

# Count records by iterating in large batches; use the actual batch size
# so the (possibly smaller) last batch is not overcounted.
n = 0
take_n = 200000
for samples in dataset.batch(take_n):
  n += len(samples)
  print(n)

Let's say you want to find out the number of examples in the training split of the oxford-iiit-pet dataset:

ds, info = tfds.load('oxford_iiit_pet', split='train', shuffle_files=True, as_supervised=True, with_info=True)
print(info.splits['train'].num_examples)
I think your solution is incorrect. The returned object, ds, is not the same as what split='train' represents. You can see what I mean by this: (train, val), info = tfds.load('oxford_iiit_pet', split=['train[:70%]','train[70%:]'], shuffle_files=True, as_supervised=True). The sizes of the sub-datasets train and val change as we modify the percentage specified in the split= argument. However, info.splits['train'].num_examples is fixed at 3680. – Li-Pin Juan Feb 19, 2021 at 16:23
Hi, I think you are wrong. len() is not applicable to a tf.data.Dataset object. Based on the discussion in this thread, it's unlikely to have this feature in the near future. – Li-Pin Juan Feb 16, 2021 at 4:19
Hey, I would not describe it as not applicable. I had a dataset of 391 images and it returned exactly that. – alzoubi36 Feb 20, 2021 at 6:05
I knew it works in some cases, but generally it doesn't. len() cannot be applied to a Dataset object like, for example, tfds.load('tf_flowers')['train'].repeat(), because its size is infinite. – Li-Pin Juan Feb 21, 2021 at 0:36

I am very surprised that this problem does not have an explicit solution, because this was such a simple feature. When I iterate over the dataset through TQDM, I find that TQDM finds the data size. How does this work?

for x in tqdm(ds['train']):
    pass  # do something with x
# -> 1%|          | 15643/1281167 [00:16<07:06, 2964.90it/s]

t = tqdm(ds['train'])
t.total
# -> 1281167
@Yatin I found a very fast solution (the second code snippet), but I also want to understand how this works behind the scenes, and how to clean it up. – krenerd Mar 5, 2021 at 6:57
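A plausible explanation (this is my reading, not something confirmed by the TF docs): tqdm calls len() on its iterable when it can, and datasets returned by tfds.load carry a known cardinality, so len(ds['train']) returns the number of examples and tqdm stores it as the total. Roughly:

# hypothetical sketch of what tqdm does with its iterable
total = len(ds['train'])                   # works because the tfds dataset has a known cardinality
# equivalently
total = ds['train'].cardinality().numpy()  # 1281167 in the example output above
print(total)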
        
